Breast Cancer Detection Using Histopathology Images

Written by nobody1234 | Published 2021/03/26

Breast cancer is the most common form of cancer in women, and invasive ductal carcinoma (IDC) is the most common form of breast cancer. Accurately identifying and categorizing breast cancer subtypes is an important clinical task, and automated methods can be used to save time and reduce errors.
The goal of this article is to identify IDC when it is present in otherwise unlabeled histopathology images.
The dataset consists of 277,524 RGB image patches of 50x50 pixels, extracted from 162 whole-slide images of H&E-stained breast tissue samples.
The breast tissue contains many cells but only some of them are cancerous. Patches that are labeled “1” contain cells that are characteristic of invasive ductal carcinoma. For more information about the data, see https://www.ncbi.nlm.nih.gov/pubmed/27563488 and http://spie.org/Publications/Proceedings/Paper/10.1117/12.2043872.
Let’s start working on the dataset.
Step 1: Import Libraries
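The original import cell isn't shown, so here is a plausible set of imports (assuming a Kaggle-style environment with OpenCV, scikit-learn, and Keras installed) covering everything used in the following steps.

# Plausible imports for the code below (the original import cell is not shown)
import random
import fnmatch
from glob import glob

import cv2
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import sklearn.metrics
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from keras.utils import to_categorical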
Step 2: Explore Data
The glob module finds all the pathnames matching a specified pattern according to the rules used by the Unix shell, although results are returned in arbitrary order.
imagePatches = glob('/kaggle/input/IDC_regular_ps50_idx5/**/*.png', recursive=True)
for filename in imagePatches[0:10]:
    print(filename)
Now make a function that can plot an image.
image_name = "/kaggle/input/IDC_regular_ps50_idx5/9135/1/9135_idx5_x1701_y1851_class1.png" # Image to be used as query
def plotImage(image_location):
    image = cv2.imread(image_location)  # read the image at the given path (BGR)
    image = cv2.resize(image, (50,50))
    plt.imshow(cv2.cvtColor(image, cv2.COLOR_BGR2RGB)); plt.axis('off')
plotImage(image_name)
I created a function plotImage that takes an image path and plots the image; here is the result of the function.
Let’s plot sample images of the dataset.
# Plot Multiple Images
bunchOfImages = imagePatches
i_ = 0
plt.rcParams['figure.figsize'] = (10.0, 10.0)
plt.subplots_adjust(wspace=0, hspace=0)
for l in bunchOfImages[:25]:
    im = cv2.imread(l)
    im = cv2.resize(im, (50, 50)) 
    plt.subplot(5, 5, i_+1) #.set_title(l)
    plt.imshow(cv2.cvtColor(im, cv2.COLOR_BGR2RGB)); plt.axis('off')
    i_ += 1
def randomImages(a):
    # Sample three random patches and show them side by side
    r = random.sample(a, 3)
    plt.figure(figsize=(16,16))
    for i, path in enumerate(r):
        plt.subplot(1, 3, i+1)
        plt.imshow(cv2.cvtColor(cv2.imread(path), cv2.COLOR_BGR2RGB))
        plt.axis('off')
randomImages(imagePatches)
Step 3: Preprocess Data
For training and testing we need to preprocess our data, because the dataset is not yet in the desired format.
Right now we have a bunch of images with no indication of which are IDC(+) and which are IDC(-), so we cannot set the value of Y for training.
To find out which images are IDC(+) and which are IDC(-), we need to parse the image names: if a name ends with class0.png the image belongs to IDC(-), and if it ends with class1.png it belongs to IDC(+).
patternZero = '*class0.png'
patternOne = '*class1.png'
classZero = fnmatch.filter(imagePatches, patternZero)
classOne = fnmatch.filter(imagePatches, patternOne)
print("IDC(-)\n\n",classZero[0:5],'\n')
print("IDC(+)\n\n",classOne[0:5])
Now let's make a function called proc_images that processes every image and returns x and y, where x holds the images and y the labels: if an image is in class0 its label is 0, otherwise it is 1.
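The article never shows proc_images itself, so here is a minimal sketch consistent with how X and Y are used below; the resizing and the fnmatch-based labeling are my assumptions, not the author's confirmed code.

def proc_images(lowerIndex, upperIndex):
    """
    Read the patches in imagePatches[lowerIndex:upperIndex], resize each
    to 50x50, and derive the label from the filename
    (*class0.png -> 0, *class1.png -> 1). Returns lists x and y.
    """
    x = []
    y = []
    for img_path in imagePatches[lowerIndex:upperIndex]:
        img = cv2.imread(img_path)
        x.append(cv2.resize(img, (50, 50)))
        y.append(1 if fnmatch.fnmatch(img_path, patternOne) else 0)
    return x, y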
X,Y = proc_images(0,90000)
df = pd.DataFrame()
df["images"]=X
df["labels"]=Y
X2=df["images"]
Y2=df["labels"]
X2=np.array(X2)
imgs0 = X2[Y2==0] # (0 = no IDC, 1 = IDC)
imgs1 = X2[Y2==1]
Description of the dataset:
def describeData(a,b):
    print('Total number of images: {}'.format(len(a)))
    print('Number of IDC(-) Images: {}'.format(np.sum(b==0)))
    print('Number of IDC(+) Images: {}'.format(np.sum(b==1)))
    print('Percentage of positive images: {:.2f}%'.format(100*np.mean(b)))
    print('Image shape (Width, Height, Channels): {}'.format(a[0].shape))
describeData(X2,Y2)
Results of the function:
Total number of images: 90000
Number of IDC(-) Images: 66025
Number of IDC(+) Images: 23975
Percentage of positive images: 26.64%
Image shape (Width, Height, Channels): (50, 50, 3)
dict_characters = {0: 'IDC(-)', 1: 'IDC(+)'}
print(df.head(10))
print("")
print(dict_characters)
Plotting sample images as IDC(-) and IDC(+)
def plotOne(a,b):
    """
    Plot one numpy array
    """
    plt.subplot(1,2,1)
    plt.title('IDC (-)')
    plt.imshow(a[0])
    plt.subplot(1,2,2)
    plt.title('IDC (+)')
    plt.imshow(b[0])
plotOne(imgs0, imgs1)
def plotTwo(a,b): 
    """
    Plot a bunch of numpy arrays sorted by label
    """
    for row in range(3):
        plt.figure(figsize=(20, 10))
        for col in range(3):
            plt.subplot(1,8,col+1)
            plt.title('IDC (-)')
            plt.imshow(a[row*3+col])  # row*3+col picks a distinct patch for every cell
            plt.axis('off')
            plt.subplot(1,8,col+4)
            plt.title('IDC (+)')
            plt.imshow(b[row*3+col])
            plt.axis('off')
plotTwo(imgs0, imgs1)
Plotting a histogram of RGB pixel intensities for one patch:
def plotHistogram(a, label):
    """
    Plot histogram of RGB Pixel Intensities
    """
    plt.figure(figsize=(10,5))
    plt.subplot(1,2,1)
    plt.imshow(a)
    plt.axis('off')
    plt.title('IDC(+)' if label else 'IDC(-)')  # title taken from the patch's own label
    histo = plt.subplot(1,2,2)
    histo.set_ylabel('Count')
    histo.set_xlabel('Pixel Intensity')
    n_bins = 30
    plt.hist(a[:,:,0].flatten(), bins=n_bins, lw=0, color='r', alpha=0.5)
    plt.hist(a[:,:,1].flatten(), bins=n_bins, lw=0, color='g', alpha=0.5)
    plt.hist(a[:,:,2].flatten(), bins=n_bins, lw=0, color='b', alpha=0.5)
plotHistogram(X2[100], Y2[100])

Dataset split into Training and Testing:

Pixel values range from 0 to 255, but we want them scaled to the range 0 to 1; this makes the data compatible with a wide variety of classification algorithms. We also set aside 20% of the data for testing, so we can check how the trained model performs on unseen data and whether it overfits. Finally, we will use an oversampling strategy to deal with the imbalanced class sizes.
X = np.array(X)
X = X/255.0
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.20)
print("Training Data Shape:", X_train.shape)
print("Testing Data Shape:", X_test.shape)
Training Data Shape: (72000, 50, 50, 3)
Testing Data Shape: (18000, 50, 50, 3)
Converting Y to categorical (one-hot) labels:
Y_trainHot = to_categorical(Y_train, num_classes = 2)
Y_testHot = to_categorical(Y_test, num_classes = 2)
Plotting Labels
lab = df['labels']
dist = lab.value_counts()
sns.countplot(lab)
print(dict_characters)
{0: 'IDC(-)', 1: 'IDC(+)'}
We have a class-imbalanced dataset; let's deal with it.
Dealing with class imbalance
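The variables X_trainRos, Y_trainRos, X_testRosReshaped and friends used below are never defined in the article. They most plausibly come from imblearn's RandomOverSampler, which expects 2-D input, so the patches are flattened first and reshaped back afterwards; the sketch below is my assumption, not the author's confirmed code. (Oversampling the test set is unusual, but it is mirrored here for consistency with the evaluation code in Step 6.)

# Assumed oversampling step (not shown in the original article)
from imblearn.over_sampling import RandomOverSampler

ros = RandomOverSampler(random_state=42)

# RandomOverSampler needs 2-D input: flatten each 50x50x3 patch
X_trainFlat = X_train.reshape(X_train.shape[0], 50*50*3)
X_testFlat = X_test.reshape(X_test.shape[0], 50*50*3)

X_trainRos, Y_trainRos = ros.fit_resample(X_trainFlat, Y_train)
X_testRos, Y_testRos = ros.fit_resample(X_testFlat, Y_test)

# Reshape back to image tensors and one-hot encode the balanced labels
X_trainRosReshaped = X_trainRos.reshape(-1, 50, 50, 3)
X_testRosReshaped = X_testRos.reshape(-1, 50, 50, 3)
Y_trainRosHot = to_categorical(Y_trainRos, num_classes=2)
Y_testRosHot = to_categorical(Y_testRos, num_classes=2)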

Step 4: Define Helper Functions for the Classification Task

from sklearn.utils import class_weight
class_weights = class_weight.compute_class_weight('balanced', classes=np.unique(Y_train), y=Y_train)
print("Old Class Weights: ", class_weights)
class_weights2 = class_weight.compute_class_weight('balanced', classes=np.unique(Y_trainRos), y=Y_trainRos)
print("New Class Weights: ", class_weights2)
Old Class Weights:  [ 0.68126337  1.87920864]
New Class Weights:  [ 1.  1.]
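Step 6 below also calls plot_learning_curve and plot_confusion_matrix, neither of which appears in the article. Minimal sketches are given here; the exact originals are unknown, and depending on your Keras version the history keys may be 'acc'/'val_acc' instead of 'accuracy'/'val_accuracy'.

import itertools

def plot_learning_curve(history):
    """Plot training/validation accuracy and loss from a Keras History object."""
    plt.figure(figsize=(12, 4))
    plt.subplot(1, 2, 1)
    plt.plot(history.history['accuracy'], label='train')
    plt.plot(history.history['val_accuracy'], label='validation')
    plt.title('Accuracy'); plt.xlabel('Epoch'); plt.legend()
    plt.subplot(1, 2, 2)
    plt.plot(history.history['loss'], label='train')
    plt.plot(history.history['val_loss'], label='validation')
    plt.title('Loss'); plt.xlabel('Epoch'); plt.legend()

def plot_confusion_matrix(cm, classes, cmap=plt.cm.Blues):
    """Display a confusion matrix with per-cell counts."""
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title('Confusion matrix'); plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)
    thresh = cm.max() / 2.0
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j], horizontalalignment='center',
                 color='white' if cm[i, j] > thresh else 'black')
    plt.ylabel('True label'); plt.xlabel('Predicted label')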

Step 5: Training the Model

For training our model:
Batch size = 128
Epochs = 15
After 8 epochs val_acc did not improve, so training stopped early.
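The network itself is not shown in the article; judging by the "Keras CNN #1C" label in Step 6, the 50x50x3 inputs, and the settings above, a plausible sketch (my architecture, not the author's confirmed one) looks like this:

from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout
from keras.callbacks import EarlyStopping

# Hypothetical small CNN for 50x50 RGB patches, two output classes
model = Sequential([
    Conv2D(32, (3, 3), activation='relu', input_shape=(50, 50, 3)),
    MaxPooling2D((2, 2)),
    Conv2D(64, (3, 3), activation='relu'),
    MaxPooling2D((2, 2)),
    Flatten(),
    Dense(128, activation='relu'),
    Dropout(0.5),
    Dense(2, activation='softmax'),
])
model.compile(optimizer='adam', loss='categorical_crossentropy',
              metrics=['accuracy'])

# Stop when validation accuracy plateaus, matching the behavior described above
early_stop = EarlyStopping(monitor='val_accuracy', patience=3, verbose=1)

history = model.fit(X_trainRosReshaped, Y_trainRosHot,
                    batch_size=128, epochs=15,
                    validation_data=(X_testRosReshaped, Y_testRosHot),
                    callbacks=[early_stop], verbose=1)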

Step 6: Evaluate Classification Model

score = model.evaluate(X_testRosReshaped,Y_testRosHot, verbose=1)
print('\nKeras CNN #1C - accuracy:', score[1],'\n')
y_pred = model.predict(X_testRosReshaped)
map_characters = {0: 'IDC(-)', 1: 'IDC(+)'}
print('\n', sklearn.metrics.classification_report(np.where(Y_testRosHot > 0)[1], np.argmax(y_pred, axis=1), 
target_names=list(map_characters.values())), sep='')    
Y_pred_classes = np.argmax(y_pred,axis=1) 
Y_true = np.argmax(Y_testRosHot,axis=1)
plot_learning_curve(history)
plt.show()
confusion_mtx = confusion_matrix(Y_true, Y_pred_classes) 
plot_confusion_matrix(confusion_mtx, classes = list(dict_characters.values())) 
plt.show()
Based on the learning curve and confusion matrix, the model does not look too overfit or too biased. In the future, I will improve the score by optimizing the data augmentation step as well as the network architecture.
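For the augmentation idea mentioned above, one common approach in Keras is ImageDataGenerator; the settings below are illustrative guesses, not the article's actual configuration (with older standalone Keras, use model.fit_generator instead of model.fit).

from keras.preprocessing.image import ImageDataGenerator

# Illustrative augmentation settings; tune for histopathology patches
datagen = ImageDataGenerator(
    rotation_range=20,
    width_shift_range=0.1,
    height_shift_range=0.1,
    horizontal_flip=True,
    vertical_flip=True,  # tissue patches have no canonical orientation
)

# Train on augmented batches instead of the raw arrays
history = model.fit(
    datagen.flow(X_trainRosReshaped, Y_trainRosHot, batch_size=128),
    epochs=15,
    validation_data=(X_testRosReshaped, Y_testRosHot),
)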
That's all for this article; I hope you liked it. If you found it useful, please clap and follow me; it encourages me to write more articles on tech.
Also published behind a paywall on Medium's subdomain.
