Breast cancer is the most common form of cancer in women, and invasive ductal carcinoma (IDC) is the most common form of breast cancer. Accurately identifying and categorizing breast cancer subtypes is an important clinical task, and automated methods can be used to save time and reduce errors. The goal of this article is to identify IDC when it is present in otherwise unlabeled histopathology images.

The dataset consists of 277,524 50x50 pixel RGB digital image patches that were derived from 162 H&E-stained breast histopathology samples. These images are small patches that were extracted from digital images of breast tissue samples. The breast tissue contains many cells but only some of them are cancerous. Patches that are labeled "1" contain cells that are characteristic of invasive ductal carcinoma. For more information about the data, see https://www.ncbi.nlm.nih.gov/pubmed/27563488 and http://spie.org/Publications/Proceedings/Paper/10.1117/12.2043872

Dataset Download Link: https://www.kaggle.com/paultimothymooney/breast-histopathology-images

Let's start working on the dataset.

Step 1: Import Libraries

The code in this article relies on the following imports:

    from glob import glob
    import fnmatch
    import random
    import cv2
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
    import sklearn.metrics
    from sklearn.metrics import confusion_matrix
    from sklearn.model_selection import train_test_split
    from keras.utils import to_categorical

Step 2: Explore Data

The glob module finds all the pathnames matching a specified pattern according to the rules used by the Unix shell, although results are returned in arbitrary order.

    imagePatches = glob('/kaggle/input/IDC_regular_ps50_idx5/**/*.png', recursive=True)
    for filename in imagePatches[0:10]:
        print(filename)

Now make a function that can plot an image.

    # Image to be used as query
    image_name = "/kaggle/input/IDC_regular_ps50_idx5/9135/1/9135_idx5_x1701_y1851_class1.png"

    def plotImage(image_location):
        # Read the file passed in, not the global image_name
        image = cv2.imread(image_location)
        image = cv2.resize(image, (50, 50))
        # OpenCV loads images as BGR; convert to RGB for matplotlib
        plt.imshow(cv2.cvtColor(image, cv2.COLOR_BGR2RGB)); plt.axis('off')

    plotImage(image_name)

The plotImage function takes an image path and plots the image.

Let's plot sample images of the dataset.

    # Plot Multiple Images
    bunchOfImages = imagePatches
    i_ = 0
    plt.rcParams['figure.figsize'] = (10.0, 10.0)
    plt.subplots_adjust(wspace=0, hspace=0)
    for l in bunchOfImages[:25]:
        im = cv2.imread(l)
        im = cv2.resize(im, (50, 50))
        plt.subplot(5, 5, i_+1) #.set_title(l)
        plt.imshow(cv2.cvtColor(im, cv2.COLOR_BGR2RGB)); plt.axis('off')
        i_ += 1

    def randomImages(a):
        # Sample random patches and show three of them side by side
        # (cv2.imread returns BGR, so colors appear swapped here)
        r = random.sample(a, 4)
        plt.figure(figsize=(16, 16))
        plt.subplot(131)
        plt.imshow(cv2.imread(r[0]))
        plt.subplot(132)
        plt.imshow(cv2.imread(r[1]))
        plt.subplot(133)
        plt.imshow(cv2.imread(r[2]));

    randomImages(imagePatches)

Step 3: Preprocess Data

For training and testing, we need to pre-process our data, because the dataset is not in the desired format.

Right now we have a bunch of images and no idea which of them are IDC(+) and which are IDC(-), so we cannot set the value of Y for training. To find out, we process the name of each image: if the image name ends with class0.png the image belongs to IDC(-), and if it ends with class1.png it belongs to IDC(+).

    patternZero = '*class0.png'
    patternOne = '*class1.png'
    classZero = fnmatch.filter(imagePatches, patternZero)
    classOne = fnmatch.filter(imagePatches, patternOne)
    print("IDC(-)\n\n", classZero[0:5], '\n')
    print("IDC(+)\n\n", classOne[0:5])

Now let's make a function called proc_images that processes every image and returns x and y, where x holds the images and y the labels: if an image is in class0 its label is 0, otherwise 1.
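The body of proc_images is not reproduced in this article. A minimal sketch of what such a function might look like is given here, assuming OpenCV (cv2) is installed, that every patch is resized to 50x50, and that the label is read from the filename suffix as described. The helper label_from_filename and the explicit image_paths parameter are assumptions made for illustration; the article's own call passes only the two index bounds.

```python
def label_from_filename(path):
    """Return 1 for IDC(+) patches (*class1.png) and 0 for IDC(-) (*class0.png)."""
    if path.endswith("class1.png"):
        return 1
    if path.endswith("class0.png"):
        return 0
    raise ValueError(f"cannot infer label from {path!r}")


def proc_images(lower_index, upper_index, image_paths, size=(50, 50)):
    """Load image_paths[lower_index:upper_index] and return (images, labels).

    Hypothetical reconstruction: images are read with OpenCV, resized to
    `size`, and labelled from their filename suffix.
    """
    import cv2  # imported lazily so label_from_filename works without OpenCV

    x, y = [], []
    for path in image_paths[lower_index:upper_index]:
        image = cv2.imread(path)  # BGR uint8 array, or None if unreadable
        if image is None:
            continue  # skip files that could not be read
        x.append(cv2.resize(image, size))
        y.append(label_from_filename(path))
    return x, y
```

With the path list built in Step 2, a call such as proc_images(0, 90000, imagePatches) would play the role of the two-argument call used in the article.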

    X, Y = proc_images(0, 90000)
    df = pd.DataFrame()
    df["images"] = X
    df["labels"] = Y
    X2 = df["images"]
    Y2 = df["labels"]
    X2 = np.array(X2)
    imgs0 = []
    imgs1 = []
    imgs0 = X2[Y2==0]  # (0 = no IDC, 1 = IDC)
    imgs1 = X2[Y2==1]

Description of the dataset:

    def describeData(a, b):
        print('Total number of images: {}'.format(len(a)))
        print('Number of IDC(-) Images: {}'.format(np.sum(b==0)))
        print('Number of IDC(+) Images: {}'.format(np.sum(b==1)))
        print('Percentage of positive images: {:.2f}%'.format(100*np.mean(b)))
        print('Image shape (Width, Height, Channels): {}'.format(a[0].shape))

    describeData(X2, Y2)

Results of the function:

    Total number of images: 90000
    Number of IDC(-) Images: 66025
    Number of IDC(+) Images: 23975
    Percentage of positive images: 26.64%
    Image shape (Width, Height, Channels): (50, 50, 3)

    dict_characters = {0: 'IDC(-)', 1: 'IDC(+)'}
    print(df.head(10))
    print("")
    print(dict_characters)

Plotting sample images as IDC(-) and IDC(+)

    def plotOne(a, b):
        """
        Plot one numpy array
        """
        plt.subplot(1, 2, 1)
        plt.title('IDC (-)')
        plt.imshow(a[0])
        plt.subplot(1, 2, 2)
        plt.title('IDC (+)')
        plt.imshow(b[0])

    plotOne(imgs0, imgs1)

    def plotTwo(a, b):
        """
        Plot a bunch of numpy arrays sorted by label
        """
        for row in range(3):
            plt.figure(figsize=(20, 10))
            for col in range(3):
                plt.subplot(1, 8, col+1)
                plt.title('IDC (-)')
                plt.imshow(a[0+row+col])
                plt.axis('off')
                plt.subplot(1, 8, col+4)
                plt.title('IDC (+)')
                plt.imshow(b[0+row+col])
                plt.axis('off')

    plotTwo(imgs0, imgs1)

Plotting histogram of the dataset:

    def plotHistogram(a, label):
        """
        Plot histogram of RGB Pixel Intensities
        """
        plt.figure(figsize=(10, 5))
        plt.subplot(1, 2, 1)
        plt.imshow(a)
        plt.axis('off')
        # Title reflects the label of the image being plotted
        plt.title('IDC(+)' if label else 'IDC(-)')
        histo = plt.subplot(1, 2, 2)
        histo.set_ylabel('Count')
        histo.set_xlabel('Pixel Intensity')
        n_bins = 30
        plt.hist(a[:,:,0].flatten(), bins=n_bins, lw=0, color='r', alpha=0.5);
        plt.hist(a[:,:,1].flatten(), bins=n_bins, lw=0, color='g', alpha=0.5);
        plt.hist(a[:,:,2].flatten(), bins=n_bins, lw=0, color='b', alpha=0.5);

    plotHistogram(X2[100], Y2[100])

Dataset split in Training and Testing:

The pixel values range from 0 to 255, but we want them scaled from 0 to 1. This will make the data compatible with a wide variety of different classification algorithms. We also want to set aside 20% of the data for testing, so that the model is evaluated on images it has not seen and overfitting can be detected. And finally, we will use an oversampling strategy to deal with the imbalanced class sizes.

    X = np.array(X)
    X = X/255.0
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.20)
    print("Training Data Shape:", X_train.shape)
    print("Testing Data Shape:", X_test.shape)

    Training Data Shape: (72000, 50, 50, 3)
    Testing Data Shape: (18000, 50, 50, 3)

Converting Y to categorical

    Y_trainHot = to_categorical(Y_train, num_classes=2)
    Y_testHot = to_categorical(Y_test, num_classes=2)

Plotting Labels

    lab = df['labels']
    dist = lab.value_counts()
    sns.countplot(x=lab)
    print(dict_characters)

    {0: 'IDC(-)', 1: 'IDC(+)'}

We have a class-imbalanced dataset, so let's deal with it.

Dealing with class imbalance

    {0: 'IDC(-)', 1: 'IDC(+)'}

Step 4: Define Helper Functions for the Classification Task

    from sklearn.utils.class_weight import compute_class_weight

    # Weights before oversampling
    class_weight = compute_class_weight(class_weight='balanced', classes=np.unique(Y_train), y=Y_train)
    print("Old Class Weights: ", class_weight)

    # Weights after oversampling (Y_trainRos holds the oversampled labels)
    class_weight2 = compute_class_weight(class_weight='balanced', classes=np.unique(Y_trainRos), y=Y_trainRos)
    print("New Class Weights: ", class_weight2)

    Old Class Weights:  [0.68126337 1.87920864]
    New Class Weights:  [1. 1.]

Step 5: Training of Model

The model was trained with the following settings:

Batch size = 128
Epochs = 15

After 8 epochs val_acc stopped improving, so model training was stopped.
Step 6: Evaluate Classification Model

    score = model.evaluate(X_testRosReshaped, Y_testRosHot, verbose=1)
    print('\nKeras CNN #1C - accuracy:', score[1], '\n')
    y_pred = model.predict(X_testRosReshaped)
    map_characters = {0: 'IDC(-)', 1: 'IDC(+)'}
    # Per-class precision, recall and F1 on the test set
    print('\n', sklearn.metrics.classification_report(np.where(Y_testRosHot > 0)[1], np.argmax(y_pred, axis=1), target_names=list(map_characters.values())), sep='')
    Y_pred_classes = np.argmax(y_pred, axis=1)
    Y_true = np.argmax(Y_testRosHot, axis=1)
    plot_learning_curve(history)
    plt.show()
    confusion_mtx = confusion_matrix(Y_true, Y_pred_classes)
    plot_confusion_matrix(confusion_mtx, classes=list(dict_characters.values()))
    plt.show()

Based on the learning curve and confusion matrix, the model does not appear to be badly overfit or strongly biased. In the future, I will improve the score by optimizing the data augmentation step as well as the network architecture.

That's all for this article. I hope you liked it.

Also published behind a paywall on Medium.