Efficient Implementation of MobileNet and YOLO Object Detection Algorithms for Image Annotation

Written by dataturks | Published 2020/01/22
Tech Story Tags: machine-learning | yolo | object-detection | tensorflow | yolo-object-detection | image-recognition-in-photos | image-classification | machine-learning-uses

TLDR The objective of the problem is to implement classification and localization algorithms to achieve high labelling accuracies. The efficiency of a model depends on various parameters, including the architecture of the model, the number of weight parameters, the number of images the network has been trained on, and the computational power available to test the models in real time. We use the MobileNet model for training on our dataset, which is taken from the HackerEarth deep learning challenge to classify animals. The model’s parameters are tuned to extract the maximum information from as little data as possible.

The objective of the problem is to implement classification and localization algorithms that achieve high object classification and labelling accuracies, and to train models readily with as little data and time as possible. The solution to the problem is presented in this blog.
The efficiency of a model depends on various parameters: the architecture of the model, the number of weight parameters and the number of images the network has been trained on, and the computational power available to test the models in real time. The last of these cannot be controlled, which leaves us dependent on the first two. Transfer learning therefore works best in this scenario: pre-trained weights are fine-tuned on our dataset, giving low errors and reliable accuracies even with limited data.
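As a minimal sketch of this idea (the 10-class head and layer sizes here are illustrative assumptions, not the exact model used later), transfer learning amounts to loading ImageNet weights, freezing the pre-trained convolutional base, and training only a small classifier on top:

from keras import applications
from keras.models import Model
from keras.layers import GlobalAveragePooling2D, Dense

# load the ImageNet-pretrained base and drop its original classifier head
base = applications.MobileNet(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
for layer in base.layers:
    layer.trainable = False  # keep the pre-trained weights fixed

# small trainable head for our own classes (10 here, purely illustrative)
x = GlobalAveragePooling2D()(base.output)
x = Dense(256, activation="relu")(x)
predictions = Dense(10, activation="softmax")(x)

model = Model(inputs=base.input, outputs=predictions)
model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])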

Model

For image classification, we use a Keras model whose summary can be obtained by running the code below. The model’s parameters are tuned to extract the maximum information from as little data as possible. The batch normalization layers inside the pre-trained base stabilize training, while the dropout layers we add randomly drop units during training so that the model generalizes better.

Experiment

We use the MobileNet model for training on our dataset. The dataset has been taken from HackerEarth deep learning challenge to classify animals.
If you need any other domain-specific dataset:
You can find thousands of such open datasets here.
We choose 10 random classes from the dataset and change the number of images per class and the size of the fully connected layers, and report the results.
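Building such a subset on disk (so that flow_from_directory can read it) can be done with a short script like the following sketch; the source directory name and the per-class limit are assumptions for illustration:

import os
import random
import shutil

src_root, dst_root = "./HE_Chal_full", "./HE_Chal"  # hypothetical raw dataset and subset paths
classes = random.sample(os.listdir(src_root), 10)   # pick 10 random classes
images_per_class = 50                               # vary this per experiment

for cls in classes:
    os.makedirs(os.path.join(dst_root, cls), exist_ok=True)
    chosen = random.sample(os.listdir(os.path.join(src_root, cls)), images_per_class)
    for f in chosen:
        shutil.copy(os.path.join(src_root, cls, f), os.path.join(dst_root, cls, f))

The training script below then points flow_from_directory at this ./HE_Chal subset.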
from keras import applications
from keras.preprocessing.image import ImageDataGenerator
from keras import optimizers
from keras.models import Sequential, Model
from keras.layers import Dropout, Flatten, Dense, GlobalAveragePooling2D
from keras import backend as k
from keras.callbacks import ModelCheckpoint, LearningRateScheduler, TensorBoard, EarlyStopping
from keras.models import load_model
import os
import pickle
from keras.models import model_from_json
import matplotlib.pyplot as plt

image_width, image_height = 256, 256

nb_train_samples = 11000
nb_validation_samples = 2000
batch_size = 8

model = applications.MobileNetV2(weights="imagenet", include_top=False, input_shape=(image_height, image_width, 3))

x=model.layers[7].output
# take the output of an early layer of the base network and build a new classifier head on top
x=Flatten()(x)
x=Dense(1024, activation="relu")(x)
x=Dropout(0.5)(x)
x=Dense(384, activation="relu")(x)
x=Dropout(0.5)(x)
x=Dense(96, activation="relu")(x)
x=Dropout(0.5)(x)
predictions = Dense(30, activation="softmax")(x)


model_final = Model(inputs=model.input, outputs=predictions)

# optionally resume from a previously saved model instead of the freshly built one:
# model_final = load_model("weights_Mobile_Net.h5")
model_final.compile(loss="categorical_crossentropy", optimizer=optimizers.Nadam(lr=0.00001), metrics=["accuracy"])

train_datagen = ImageDataGenerator(rescale = 1./255,
                                   shear_range = 0.2,
                                   zoom_range = 0.2,
                                   horizontal_flip = True,
                                   fill_mode="nearest",
                                   width_shift_range=0.3,
                                   height_shift_range=0.3,
                                   rotation_range=30)

test_datagen = ImageDataGenerator(rescale = 1./255,
                                  horizontal_flip = True,
                                  fill_mode = "nearest",
                                  zoom_range = 0.3,
                                  width_shift_range = 0.3,
                                  height_shift_range = 0.3,
                                  rotation_range = 30)

training_set = train_datagen.flow_from_directory('./HE_Chal', target_size = (256, 256), batch_size = 8,class_mode = 'categorical')
test_set = test_datagen.flow_from_directory('./Validation', target_size = (256, 256), batch_size = 8, class_mode = 'categorical') 
model_final.fit_generator(training_set, steps_per_epoch = 1000,epochs = 80, validation_data = test_set,validation_steps=1000)
print(model.summary())
#uncomment the following to save your weights and model.
'''model_json=model_final.to_json()

with open("model.json", "w") as json_file:
    json_file.write(model_json)
model_final.save_weights("weights_VGG.h5")
model_final.save("model_27.h5")
#model_final.predict(test_set, batch_size=batch_size)
'''
'''
json_file = open('model.json', 'r')
loaded_model_json = json_file.read()
json_file.close()
loaded_model = model_from_json(loaded_model_json)
# load weights into new model
loaded_model.load_weights("weights_VGG.h5",by_name=True)
print("Loaded model from disk")
 
# evaluate loaded model on test data
loaded_model.compile(loss='categorical_crossentropy', optimizer='nadam', metrics=['accuracy'])

#print(loaded_model.summary())
loaded_model.fit_generator(training_set, steps_per_epoch = 1000, epochs = 100, validation_data = test_set, validation_steps = 1000)
#score = loaded_model.evaluate(training_set,test_set , verbose=0)
'''
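Note that steps_per_epoch is hard-coded to 1000 above; to make each epoch cover the training set exactly once, it can instead be derived from the sample counts already defined at the top of the script (a small sketch):

# one pass over the data per epoch
steps_per_epoch = nb_train_samples // batch_size        # 11000 // 8 = 1375
validation_steps = nb_validation_samples // batch_size  # 2000 // 8 = 250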

Results

The models were run for 15 epochs on an Intel i7 processor.
  • Model 1: MobileNet, 1000 steps/epoch, larger FC layers, training time: 18 min/epoch, dataset: 50 images per class on average, accuracy: 82.2%
  • Model 2: MobileNet, 500 steps/epoch, smaller FC layers, training time: 12 min/epoch, dataset: 50 images per class on average, accuracy: 82.47%
  • Model 3: MobileNet, 500 steps/epoch, smaller FC layers, training time: 11 min/epoch, dataset: 30 images per class, accuracy: 76%

Image Detection

There are a few methods that pose detection as a regression problem. Two of the most popular ones are YOLO and SSD. These detectors are also called single shot detectors. Let’s have a look at them:
You Only Look Once.
YOLO divides each image into an S x S grid, and each grid cell predicts N bounding boxes along with a confidence score. The confidence reflects the accuracy of the bounding box and whether the box actually contains an object (regardless of class). YOLO also predicts a classification score for each box over every class seen in training; combining the two gives the probability of each class being present in a predicted box.
So, a total of S x S x N boxes are predicted. However, most of these boxes have low confidence scores, and if we set a threshold of, say, 30% confidence, we can remove most of them, as in the sketch below.
(YOLO predicts only one class per grid cell, hence small objects are often not identified…)
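A toy NumPy sketch of that filtering step (the grid size, number of boxes per cell, and class count below are illustrative assumptions):

import numpy as np

S, N, C = 7, 2, 20                     # grid size, boxes per cell, number of classes (illustrative)
confidence = np.random.rand(S, S, N)   # objectness score for each predicted box
class_probs = np.random.rand(S, S, C)  # per-cell class probabilities

# class-specific score = P(object) * P(class | object)
scores = confidence[..., :, None] * class_probs[..., None, :]  # shape (S, S, N, C)

# keep only boxes whose best class score clears a 30% threshold
keep = scores.max(axis=-1) > 0.30
print("boxes kept:", int(keep.sum()), "of", S * S * N)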

Single Shot Detectors

SSD runs a convolutional network on the input image only once and computes a feature map. A small 3×3 convolutional kernel is then run on this feature map to predict the bounding boxes and classification probabilities. SSD also uses anchor boxes at various aspect ratios, similar to Faster-RCNN, and learns the offsets relative to those anchors rather than the boxes themselves. To handle scale, SSD predicts bounding boxes after multiple convolutional layers; since each layer operates at a different scale, it can detect objects of various sizes.
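To make the offset idea concrete, here is a rough sketch of how a predicted offset is decoded back into a box relative to its anchor (the variance constants follow the common SSD convention and are an assumption here):

import numpy as np

def decode_box(anchor, offsets, variances=(0.1, 0.1, 0.2, 0.2)):
    """Anchor and return value are (cx, cy, w, h); offsets are the raw network outputs."""
    acx, acy, aw, ah = anchor
    dx, dy, dw, dh = offsets
    cx = acx + dx * variances[0] * aw   # shift the anchor centre
    cy = acy + dy * variances[1] * ah
    w = aw * np.exp(dw * variances[2])  # rescale the anchor size
    h = ah * np.exp(dh * variances[3])
    return cx, cy, w, h

print(decode_box((0.5, 0.5, 0.2, 0.3), (0.4, -0.2, 0.1, 0.05)))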
We evaluated two models, first YOLO (darknet) and later SSD, and compared their accuracies and speeds. Since our inputs are images, the FPS parameter is not used to differentiate the models. Moreover, SSD sits between the Faster-RCNN and YOLO models in the speed-accuracy trade-off. Let’s see what the experiment tells us.
The SSD model is implemented with OpenCV’s dnn module, following Adrian Rosebrock’s approach.
# import the necessary packages
import numpy as np
import argparse
import cv2

# construct the argument parse and parse the arguments
ap = argparse.ArgumentParser()
ap.add_argument("-i", "--image", required=True,
	help="path to input image")
ap.add_argument("-p", "--prototxt", required=True,
	help="path to Caffe 'deploy' prototxt file")
ap.add_argument("-m", "--model", required=True,
	help="path to Caffe pre-trained model")
ap.add_argument("-c", "--confidence", type=float, default=0.2,
	help="minimum probability to filter weak detections")
args = vars(ap.parse_args())

# class labels the MobileNet-SSD Caffe model was trained on, plus a random colour per class
CLASSES = ["background", "aeroplane", "bicycle", "bird", "boat",
	"bottle", "bus", "car", "cat", "chair", "cow", "diningtable",
	"dog", "horse", "motorbike", "person", "pottedplant", "sheep",
	"sofa", "train", "tvmonitor"]
COLORS = np.random.uniform(0, 255, size=(len(CLASSES), 3))

# load our serialized model from disk
print("[INFO] loading model...")
net = cv2.dnn.readNetFromCaffe(args["prototxt"], args["model"])

# load the input image and construct an input blob for the image
# by resizing to a fixed 300x300 pixels and then normalizing it
# (note: normalization is done via the authors of the MobileNet SSD
# implementation)
image = cv2.imread(args["image"])
(h, w) = image.shape[:2]
blob = cv2.dnn.blobFromImage(cv2.resize(image, (300, 300)), 0.007843,
	(300, 300), 127.5)
  
# pass the blob through the network and obtain the detections and
# predictions
print("[INFO] computing object detections...")
net.setInput(blob)
detections = net.forward()

# loop over the detections
for i in np.arange(0, detections.shape[2]):
	# extract the confidence (i.e., probability) associated with the
	# prediction
	confidence = detections[0, 0, i, 2]

	# filter out weak detections by ensuring the `confidence` is
	# greater than the minimum confidence
	if confidence > args["confidence"]:
		# extract the index of the class label from the `detections`,
		# then compute the (x, y)-coordinates of the bounding box for
		# the object
		idx = int(detections[0, 0, i, 1])
		box = detections[0, 0, i, 3:7] * np.array([w, h, w, h])
		(startX, startY, endX, endY) = box.astype("int")

		# display the prediction
		label = "{}: {:.2f}%".format(CLASSES[idx], confidence * 100)
		print("[INFO] {}".format(label))
		cv2.rectangle(image, (startX, startY), (endX, endY),
			COLORS[idx], 2)
		y = startY - 15 if startY - 15 > 15 else startY + 15
		cv2.putText(image, label, (startX, y),
			cv2.FONT_HERSHEY_SIMPLEX, 0.5, COLORS[idx], 2)
      
# show the output image
cv2.imshow("Output", image)
cv2.waitKey(0)
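A typical invocation of the script above might look like this (the script name and the prototxt/caffemodel filenames are placeholders for whichever MobileNet-SSD files you downloaded):

python detect.py --image example.jpg --prototxt MobileNetSSD_deploy.prototxt --model MobileNetSSD_deploy.caffemodel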
The YOLO pre-trained weights were downloaded from the author’s website, where we chose the YOLOv3 model. Since it is a darknet model, the bounding-box format it expects is different from the VOC format of our dataset. Hence we first convert the bounding boxes from VOC format to darknet format using code from here. Then we train the network after editing the config file.
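The conversion itself is straightforward: absolute corner coordinates become centre/width/height values normalised by the image size. A minimal sketch of that transformation (independent of the conversion script linked above):

def voc_to_darknet(xmin, ymin, xmax, ymax, img_w, img_h):
    """Convert VOC corner coordinates (pixels) to darknet/YOLO (x_center, y_center, w, h) in [0, 1]."""
    x_center = (xmin + xmax) / 2.0 / img_w
    y_center = (ymin + ymax) / 2.0 / img_h
    w = (xmax - xmin) / float(img_w)
    h = (ymax - ymin) / float(img_h)
    return x_center, y_center, w, h

# e.g. a 100x200-pixel box at (50, 80) in a 640x480 image
print(voc_to_darknet(50, 80, 150, 280, 640, 480))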
Results:
SSD: IoU = 0.74, mAP = 0.83, time/epoch: 12 minutes
YOLO: IoU = 0.69, mAP = 0.85, time/epoch: 11 minutes
Output Images
(SSDs used for Vehicle Detection)
(Output with YOLOv3 Pretrained Weights)

Conclusion

The overall problem is one of trading off speed against accuracy, and the proposed solution is two different models for different types of images.
How that trade-off plays out also depends on the computational power available. The YOLO model is suitable for high-speed outputs where accuracy is not critical, whereas SSD provides higher accuracy at a higher computational cost.
Hence, choose SSD when capable hardware is available; otherwise YOLO is the go-to for resource-constrained, embedded setups.
Shameless plug: We are an online tool that makes it super easy for you to build ML datasets, whether you need image bounding boxes, NER tagging, etc. Check us out: the best image labeling tool!

Published by HackerNoon on 2020/01/22