Student at Duke University studying computer science and statistics
These days, machine learning and computer vision are all the rage. We’ve all seen the news about self-driving cars and facial recognition and probably imagined how cool it’d be to build our own computer vision models. However, it’s not always easy to break into the field, especially without a strong math background. Libraries like PyTorch and TensorFlow can be tedious to learn if all you want to do is experiment with something small.
In this tutorial, I present a simple way for anyone to build fully-functional object detection models with just a few lines of code. More specifically, we’ll be using Detecto, a Python package built on top of PyTorch that makes the process easy and open to programmers at all levels.
To demonstrate how simple it is to use Detecto, let’s load in a pre-trained model and run inference on the following image:
First, download the Detecto package using pip:
pip3 install detecto
Then, save the image above as “fruit.jpg” and create a Python file in the same folder as the image. Inside the Python file, write these 5 lines of code:
from detecto import core, utils, visualize

image = utils.read_image('fruit.jpg')
model = core.Model()

labels, boxes, scores = model.predict_top(image)
visualize.show_labeled_image(image, boxes, labels)
After running this file (it may take a few seconds if you don’t have a CUDA-enabled GPU on your computer; more on that later), you should see something similar to the plot below:
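Awesome! We did all that with just 5 lines of code. Here’s what we did in each:
1. Imported Detecto’s modules
2. Read in the image
3. Initialized a pre-trained model
4. Generated the top predictions on our image
5. Plotted our predictions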
Detecto uses a Faster R-CNN ResNet-50 FPN from PyTorch’s model zoo, which is able to detect about 80 different objects such as animals, vehicles, kitchen appliances, etc. However, what if you wanted to detect custom objects, like Coke vs. Pepsi cans, or zebras vs. giraffes?
You’ll be glad to know that training a Detecto model on a custom dataset is just as easy; again, all you need is 5 lines of code, as well as either an existing dataset or some time spent labeling images.
In this tutorial, we’ll start from scratch by building our own dataset. I recommend that you do the same, but if you want to skip this step, you can download a sample dataset here (modified from Stanford’s Dog Dataset).
For our dataset, we’ll be training our model to detect an underwater alien, bat, and witch from the RoboSub competition, as shown below:
Ideally, you’ll want at least 100 images of each class. The good thing is that you can have multiple objects in each image, so you could theoretically get away with 100 total images if each image contains every class of object you want to detect. Also, if you have video footage, Detecto makes it easy to split that footage into images that you can then use for your dataset:
from detecto.utils import split_video

split_video('video.mp4', 'frames/', step_size=4)
The code above takes every 4th frame in “video.mp4” and saves it as a JPEG file in the “frames” folder.
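To make the effect of step_size concrete, here’s a minimal sketch of the sampling rule described above. It assumes frames are numbered from 0 and that the first frame is kept; the helper name kept_frame_indices is hypothetical and not part of Detecto, and the library’s exact choice of starting frame may differ.

```python
def kept_frame_indices(total_frames, step_size):
    """Return the indices of the frames an every-Nth-frame sampler keeps.

    Assumption: frames are numbered from 0 and frame 0 is always kept.
    """
    if step_size < 1:
        raise ValueError("step_size must be at least 1")
    return list(range(0, total_frames, step_size))


# A 10-second clip at 30 frames per second has 300 frames.
indices = kept_frame_indices(300, step_size=4)
print(len(indices))   # how many images the clip would yield
print(indices[:5])    # the first few frame numbers that are kept
```

Under these assumptions, a 300-frame clip yields 75 images, which is a quick way to estimate how much footage you need for your dataset.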
Once you’ve produced your training dataset, you should have a folder that looks something like the following:
images/
|   image0.jpg
|   image1.jpg
|   image2.jpg
|   ...
If you want, you can also have a second folder containing a set of validation images.
Now comes the time-consuming part: labeling. Detecto supports the PASCAL VOC format, in which you have XML files containing label and position data for each object in your images. To create these XML files, you can use the open-source LabelImg tool as follows:
pip3 install labelImg    # Download LabelImg using pip
labelImg                 # Launch the application
You should now see a window pop up. On the left, click the “Open Dir” button and select the folder of images that you want to label. If things worked correctly, you should see something like this:
To draw a bounding box, click the icon in the left menu bar (or use the keyboard shortcut “w”). You can then drag a box around your objects and write/select a label:
When you’ve finished labeling an image, use CTRL+S or CMD+S to save your XML file (for simplicity and speed, you can just use the default file location and name that they auto-fill). To label the next image, click “Next Image” (or use the keyboard shortcut “d”).
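For reference, here’s roughly what one of those XML files contains and how it can be read back with Python’s standard library. The sample annotation is made up for illustration (the file name, image size, and box coordinates are assumptions, not output from this tutorial’s dataset), and read_voc_objects is a hypothetical helper, not a Detecto function.

```python
import xml.etree.ElementTree as ET

# Illustrative PASCAL VOC annotation; all values here are invented.
SAMPLE_XML = """
<annotation>
  <filename>image0.jpg</filename>
  <size><width>1280</width><height>720</height><depth>3</depth></size>
  <object>
    <name>alien</name>
    <bndbox><xmin>569</xmin><ymin>204</ymin><xmax>1003</xmax><ymax>658</ymax></bndbox>
  </object>
  <object>
    <name>bat</name>
    <bndbox><xmin>276</xmin><ymin>144</ymin><xmax>580</xmax><ymax>509</ymax></bndbox>
  </object>
</annotation>
"""


def read_voc_objects(xml_text):
    """Return a list of (label, (xmin, ymin, xmax, ymax)) pairs."""
    root = ET.fromstring(xml_text)
    objects = []
    for obj in root.findall("object"):
        box = obj.find("bndbox")
        coords = tuple(
            int(float(box.findtext(tag)))
            for tag in ("xmin", "ymin", "xmax", "ymax")
        )
        objects.append((obj.findtext("name"), coords))
    return objects


for label, box in read_voc_objects(SAMPLE_XML):
    print(label, box)
```

Each object element holds one label and one bounding box, so an image with several objects simply has several object entries in the same file.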
Once you’re done with the entire dataset, your folder should look something like this:
images/
|   image0.jpg
|   image0.xml
|   image1.jpg
|   image1.xml
|   ...
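Before moving on, it can help to confirm that every image really has a matching XML file and to see how many labeled objects you have per class (recall the rough target of 100 per class). The following is a minimal sketch using only the standard library; summarize_dataset is a hypothetical helper, and the set of image extensions is an assumption you may need to adjust.

```python
from collections import Counter
from pathlib import Path
import xml.etree.ElementTree as ET

# Assumption: images use one of these extensions.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def summarize_dataset(folder):
    """Return (names of images without an XML file, object counts per label)."""
    folder = Path(folder)
    unlabeled = []
    counts = Counter()
    for image_path in sorted(folder.iterdir()):
        if image_path.suffix.lower() not in IMAGE_SUFFIXES:
            continue
        xml_path = image_path.with_suffix(".xml")
        if not xml_path.exists():
            unlabeled.append(image_path.name)
            continue
        root = ET.parse(xml_path).getroot()
        counts.update(obj.findtext("name") for obj in root.findall("object"))
    return unlabeled, counts


# Only runs if an "images" folder exists next to this script.
if Path("images").is_dir():
    unlabeled, counts = summarize_dataset("images")
    print("Images without labels:", unlabeled)
    print("Objects per class:", dict(counts))
```

If the first list isn’t empty, go back to LabelImg and label (or remove) those images before training.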
We’re almost ready to start training our object detection model!
First, check whether your computer has a CUDA-enabled GPU. Since deep learning uses a lot of processing power, training on a typical CPU can be very slow. Thankfully, most modern deep learning frameworks like PyTorch and TensorFlow can run on GPUs, making things much faster. Make sure you have PyTorch installed (you should already have it if you installed Detecto), and then run the following 2 lines of code:
import torch

print(torch.cuda.is_available())
If it prints True, great! You can skip to the next section. If it prints False, don’t fret. Follow the steps below to create a Google Colaboratory notebook, an online coding environment that comes with a free, usable GPU. For this tutorial, you’ll just be working from within a Google Drive folder rather than on your computer.
1. Log in to Google Drive
2. Create a folder called “Detecto Tutorial” and navigate into this folder
3. Upload your training images (and/or validation images) to this folder
4. Right-click, go to “More”, and click “Google Colaboratory”:
You should now see an interface like this:
5. Give your notebook a name if you want, and then go to Edit -> Notebook settings -> Hardware accelerator and select GPU
6. Type the following code to “mount” your Drive, change directory to the current folder, and install Detecto:
import os
from google.colab import drive

drive.mount('/content/drive')

os.chdir('/content/drive/My Drive/Detecto Tutorial')

!pip install detecto
To make sure everything worked, you can create a new code cell and type a directory-listing command such as !ls to check that you’re in the right directory.
Finally, we can now train a model on our custom dataset! As promised, this is the easy part. All it takes is 4 lines of code:
from detecto import core, utils, visualize

dataset = core.Dataset('images/')
model = core.Model(['alien', 'bat', 'witch'])

model.fit(dataset)
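Let’s again break down what we’ve done with each line of code:
1. Imported Detecto’s modules
2. Created a Dataset from the “images” folder (containing our JPEG and XML files)
3. Initialized a model to detect our custom objects (alien, bat, and witch)
4. Trained our model on the dataset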
This can take anywhere from 10 minutes to over an hour to run depending on the size of your dataset, so make sure your program doesn’t exit immediately after finishing the above statements (for example, by using a Jupyter/Colab notebook, which preserves state while it is active).
Now that you have a trained model, let’s test it on some images. To read images from a file path, you can use the read_image function from the detecto.utils module (you could also use an image from the Dataset you created above):
# Specify the path to your image
image = utils.read_image('images/image0.jpg')
predictions = model.predict(image)

# predictions format: (labels, boxes, scores)
labels, boxes, scores = predictions

# ['alien', 'bat', 'bat']
print(labels)

#           xmin       ymin       xmax       ymax
# tensor([[ 569.2125,  203.6702, 1003.4383,  658.1044],
#         [ 276.2478,  144.0074,  579.6044,  508.7444],
#         [ 277.2929,  162.6719,  627.9399,  511.9841]])
print(boxes)

# tensor([0.9952, 0.9837, 0.5153])
print(scores)
As you can see, the model’s predict method returns a tuple of 3 elements: labels, boxes, and scores. In the above example, the model predicted an alien (labels[0]) at the coordinates [569, 204, 1003, 658] (boxes[0]) with a confidence level of 0.995 (scores[0]).
From these predictions, we can plot the results using the detecto.visualize module. For example:
visualize.show_labeled_image(image, boxes, labels)
Running the above code with the image and predictions you received should produce something that looks like this:
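In the predictions printed earlier, the third entry is a second, much less confident “bat” box that overlaps the first one. A common way to clean this up is to drop predictions below a score threshold before plotting. Here’s a minimal sketch: plain Python lists stand in for the tensors so it runs without PyTorch, the 0.6 cutoff is an arbitrary assumption, and filter_predictions is a hypothetical helper rather than part of Detecto.

```python
def filter_predictions(labels, boxes, scores, threshold=0.6):
    """Keep only the predictions whose score is at least `threshold`."""
    keep = [i for i, score in enumerate(scores) if float(score) >= threshold]
    return (
        [labels[i] for i in keep],
        [boxes[i] for i in keep],
        [float(scores[i]) for i in keep],
    )


# Values copied from the example output shown earlier.
labels = ['alien', 'bat', 'bat']
boxes = [
    [569.2125, 203.6702, 1003.4383, 658.1044],
    [276.2478, 144.0074, 579.6044, 508.7444],
    [277.2929, 162.6719, 627.9399, 511.9841],
]
scores = [0.9952, 0.9837, 0.5153]

kept_labels, kept_boxes, kept_scores = filter_predictions(
    labels, boxes, scores, threshold=0.6
)
print(kept_labels)  # ['alien', 'bat']
print(kept_scores)  # [0.9952, 0.9837]
```

The right threshold depends on your model and data, so it’s worth trying a few values and looking at the resulting plots.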
If you have a video, you can run object detection on it:
visualize.detect_video(model, 'input.mp4', 'output.avi')
This takes in a video file called “input.mp4” and produces an “output.avi” file with the given model’s predictions. If you open this file with VLC or some other video player, you should see some promising results!
Lastly, you can save and load models from files, allowing you to save your progress and come back to it later:
model.save('model_weights.pth')

# ... Later ...

model = core.Model.load('model_weights.pth', ['alien', 'bat', 'witch'])
You’ll be happy to know that Detecto isn’t just limited to 5 lines of code. Let’s say for example that the model didn’t do as well as you hoped. We can try to increase its performance by augmenting our dataset with torchvision transforms and defining a custom DataLoader:
from torchvision import transforms

augmentations = transforms.Compose([
    transforms.ToPILImage(),
    transforms.RandomHorizontalFlip(0.5),
    transforms.ColorJitter(saturation=0.5),
    transforms.ToTensor(),
    utils.normalize_transform(),
])

dataset = core.Dataset('images/', transform=augmentations)

loader = core.DataLoader(dataset, batch_size=2, shuffle=True)
This code applies random horizontal flips and saturation effects on images in our dataset, increasing the diversity of our data. We then define a DataLoader object with batch_size=2; we’ll pass this to model.fit instead of the Dataset to tell our model to train on batches of 2 images rather than the default of 1.
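One subtlety of flipping images for object detection is that the bounding boxes have to be mirrored along with the pixels, whereas a color change like saturation jitter leaves them untouched. The sketch below shows only the geometry of that mirroring; it is not Detecto’s code, and flip_box_horizontally is a hypothetical helper that assumes boxes are given as (xmin, ymin, xmax, ymax) in pixels.

```python
def flip_box_horizontally(box, image_width):
    """Mirror an (xmin, ymin, xmax, ymax) box across the image's vertical center line."""
    xmin, ymin, xmax, ymax = box
    # The old right edge becomes the new left edge, and vice versa.
    return (image_width - xmax, ymin, image_width - xmin, ymax)


box = (100, 50, 300, 200)
flipped = flip_box_horizontally(box, image_width=640)
print(flipped)  # (340, 50, 540, 200)

# Flipping twice brings the box back to where it started.
print(flip_box_horizontally(flipped, image_width=640) == box)  # True
```

Note that the vertical coordinates don’t change, and the box keeps the same width and height after the flip.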
If you created a separate validation dataset earlier, now is the time to load it in during training. By providing a validation dataset, the fit method returns a list of the losses at each epoch, and if verbose=True, then it will also print these out during the training process itself. The following code block demonstrates this as well as customizes several other training parameters:
import matplotlib.pyplot as plt

val_dataset = core.Dataset('validation_images/')

losses = model.fit(loader, val_dataset, epochs=10, learning_rate=0.001,
                   lr_step_size=5, verbose=True)

plt.plot(losses)
plt.show()
The resulting plot of the losses should be more or less decreasing:
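The lr_step_size=5 argument in the training call is easier to picture with numbers. The sketch below assumes it works like a standard step-decay schedule, where the learning rate is multiplied by a constant factor every lr_step_size epochs. The decay factor of 0.1 and the helper name step_decay_schedule are assumptions made for illustration; check Detecto’s documentation for the exact behavior.

```python
def step_decay_schedule(base_lr, step_size, epochs, gamma=0.1):
    """Return the learning rate used at each epoch under step decay.

    Assumption: the rate is multiplied by `gamma` every `step_size` epochs.
    """
    return [base_lr * gamma ** (epoch // step_size) for epoch in range(epochs)]


# Same values as the training call: 10 epochs, starting at 0.001, step of 5.
schedule = step_decay_schedule(0.001, step_size=5, epochs=10)
for epoch, lr in enumerate(schedule):
    print(f"epoch {epoch}: lr = {lr:.6f}")
```

Under these assumptions, epochs 0 through 4 train at 0.001 and epochs 5 through 9 at roughly 0.0001, which often shows up as a second, gentler drop in the loss curve.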
For even more flexibility and control over your model, you can bypass Detecto altogether; the model.get_internal_model method returns the underlying torchvision model used, which you can mess around with as much as you see fit.
In this tutorial, we showed that computer vision and object detection don’t need to be challenging. All you need is a bit of time and patience to come up with a labeled dataset.
Previously published at https://medium.com/@alankbi/build-a-custom-trained-object-detection-model-with-5-lines-of-code-713ba7f6c0fb