One place to find tools, platforms, and tutorials to explore, process, and model data.
Take a look at the
I’m open to suggestions, questions, and criticism — let’s start a conversation.
I have broken up the report into the following blogs:
These days, training and deploying a model can be done in a matter of minutes. You can click a few buttons to train a model and deploy it as an API fairly quickly. The challenging part of training object detection models on custom datasets is the data.
The next sub-section of “Process” covers the phases of data preparation (or data processing) involved in vision, object detection, and specifically COCO-related tasks. It mainly introduces tools and platforms that are used to perform each data processing phase.
Below are some of the most important phases we think about during the data preparation or data processing phase (in the context of vision or object detection):
The first step in performing object detection on a custom dataset is to collect and label the data. Object detection tasks require bigger datasets than common image classification tasks, so you often end up collecting your own data. Below you will see some tools and tutorials you can use to achieve this. Some of these are computer vision specific, others are object detection specific, and others are COCO dataset specific. Note that any object detection specific tools and tutorials can readily be used for COCO.
Below are Python scripts that allow you to search for and download images from Google Images.
While it’s true that the most important prerequisite for training a model is data, raw data alone is not useful for solving object detection problems. You need to process the data, annotate the images, and transform the data to make it useful. You need to make sure that the items you want your models to detect are annotated properly, i.e., marked with labels and bounding boxes. This is an extremely time-consuming process, so make sure the people you work with on this step are well trained and rigorous, and that the tools and platforms you choose give them the insights they need to make good decisions.
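As a rough illustration of what "marked with labels and bounding boxes" looks like in practice, the sketch below builds one annotation record in the COCO style, where a box is stored as [x, y, width, height] in pixels measured from the top-left corner of the image. The helper name make_annotation and the example ids and sizes are made up for illustration; only the field names follow the COCO detection format.

```python
def make_annotation(ann_id, image_id, category_id, bbox, image_size):
    """Build one COCO-style annotation dict after basic sanity checks.

    bbox is [x, y, width, height] in pixels; image_size is (width, height).
    """
    x, y, w, h = bbox
    img_w, img_h = image_size
    if w <= 0 or h <= 0:
        raise ValueError("box must have positive width and height")
    if x < 0 or y < 0 or x + w > img_w or y + h > img_h:
        raise ValueError("box must lie inside the image")
    return {
        "id": ann_id,
        "image_id": image_id,
        "category_id": category_id,
        "bbox": [x, y, w, h],
        "area": w * h,  # box area; COCO uses the mask area when a mask exists
        "iscrowd": 0,
    }


# Hypothetical ids and sizes, chosen only for the example.
annotation = make_annotation(1, 42, 3, [10, 20, 100, 50], image_size=(640, 480))
print(annotation["bbox"], annotation["area"])
```

Rejecting boxes that are empty or fall outside the image at creation time is one cheap way to catch labeling mistakes before they reach training.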
You will get the most benefit from focusing on data processing, and in particular on data collection and annotation. Even incremental improvements to your data, such as making your dataset more balanced, usually pay off more than focusing on models. Being data-centric will also save you quite a bit of training time.
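A quick way to see whether a dataset is balanced is to count annotated instances per category. The sketch below assumes annotations are dicts with a category_id field, as in the COCO format; the function names and the min_samples threshold are illustrative choices, not recommended values.

```python
from collections import Counter


def class_counts(annotations):
    """Count how many annotated instances each category has."""
    return Counter(ann["category_id"] for ann in annotations)


def underrepresented(counts, min_samples):
    """Return the category ids with fewer than min_samples instances."""
    return sorted(cat for cat, n in counts.items() if n < min_samples)


# Toy annotations: category 1 dominates and category 3 is rare.
annotations = (
    [{"category_id": 1}] * 8 + [{"category_id": 2}] * 5 + [{"category_id": 3}] * 1
)
counts = class_counts(annotations)
print(counts)
print(underrepresented(counts, min_samples=3))
```

Categories flagged this way are candidates for more collection, more labeling, or targeted augmentation.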
Below are a few different types of annotations that are common in computer vision:
CVAT is a free, online, interactive video and image annotation tool for computer vision. See tutorial. [See the above section for more tutorials.]
LabelImg is a graphical image annotation tool for labeling object bounding boxes in images. See tutorial.
Label Studio is an open source data labeling tool for labeling and exploring multiple types of data. You can perform different types of labeling with many data formats as well as export annotation data in COCO format. See tutorials. Also see tutorial for semantic segmentation (not object detection).
Makesense.ai is a free-to-use online tool for labeling photos. Here’s the GitHub link. See tutorial.
OpenLabeling is an open source tool to label images and videos for Computer Vision applications. See tutorial.
The all-in-one end-to-end cloud-based annotation platform covering the entire data management cycle. See tutorial.
A fully managed data annotation solution to source and label training data for AI / ML Models. [Note. Hive also has Hive Models as part of the whole solution. Hive Models include cloud-hosted deep learning models.]
Labelbox’s training data platform is designed to help accelerate model development and improve model performance by iterating faster. With Labelbox, you can annotate data, prepare and orchestrate model training jobs, diagnose model performance, and prioritize data to label. See tutorial. There are many other tutorials on their .
Scale offers a data platform that enables annotation of large volumes of 3D sensor, image, and video data. It provides ML-powered pre-labeling, an automated quality assurance system, dataset management, document processing, and AI-assisted data annotation skewed toward data processing for autonomous driving.
The end-to-end platform to annotate, version, and manage ground truth data for your AI. [See the above section for tutorials.]
V7 is an automated annotation platform combining dataset management, image and video annotation, and autoML model training to automatically complete labeling tasks. As an example see V-COCO — V7 Open Datasets: Verbs in COCO (V-COCO) is a dataset that builds off COCO for human-object interaction detection. V-COCO provides 10,346 images (2,533 for training, 2,867 for validating and 4,946 for testing) and 16,199 person instances. Each person has annotations for 29 action categories and there are no interaction labels including objects.
Using the platform, train custom object detection models to identify any object, such as people, cars, particles in the water, imperfections of materials, objects of the same shape, size, or color. See How to Prepare Data for Object Detection? | by Michal Lukac tutorials
[More annotation tools can be found here: Dataset list — Annotation tools]
We’ve heard this many times and it’s almost always true: Better Data > Fancier Algorithms. If you want to build your own object detection solution using a custom dataset, it is not enough to collect and label data; you also need to clean it. Below are some high-level steps you can take to clean data:
Here is a tutorial showing an example of data cleaning for an image classification task. The tutorial uses a technique called Haar cascades. If you go to the GitHub page of Haar cascade, you will see particular XML files containing the feature sets used to detect smiles, eyes, frontal faces, full bodies, lower bodies, and more.
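One simple cleaning step, offered here as an illustration rather than as something taken from the tutorial above, is removing exact duplicate images before labeling. The sketch below hashes raw file bytes with SHA-256 and keeps the first file seen for each hash. The function name and the toy in-memory "files" are assumptions for the example; this is a different technique from Haar cascade filtering and it will not catch resized or re-encoded copies.

```python
import hashlib


def drop_exact_duplicates(images):
    """Keep the first image for each distinct byte content.

    images maps a file name to its raw bytes; returns the kept names in order.
    """
    seen = set()
    kept = []
    for name, data in images.items():
        digest = hashlib.sha256(data).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(name)
    return kept


# Toy "files": b.jpg is a byte-for-byte copy of a.jpg.
images = {"a.jpg": b"\x01\x02\x03", "b.jpg": b"\x01\x02\x03", "c.jpg": b"\x09\x08"}
kept = drop_exact_duplicates(images)
print(kept)
```

In a real project the bytes would come from reading each file on disk, and near-duplicates would need a perceptual hash or an embedding-based comparison instead.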
Data augmentation is a common way to increase the number of training samples. It lets you create different variations of the same image, i.e., you apply a set of techniques to artificially increase the amount of data you have by generating new images from existing data. This could mean making small changes to your images or using deep learning models to generate new images. This can help in two ways: it can improve your model performance and reduce the cost of collecting and labeling data.
However, while performing data augmentation, both in computer vision in general and in object detection specifically, you need to be careful about the decisions you make and make sure they don’t alter the ground truth. You can apply your transformations and test the performance of your models under different augmentation decisions. This will allow you to pick a more robust model before you deploy. Here are two articles (here and here) to better understand data augmentation.
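To make the ground-truth point concrete: when an image is flipped horizontally, every bounding box has to be flipped with it, or the labels no longer match the pixels. The sketch below assumes COCO-style [x, y, width, height] boxes and a toy image stored as a list of pixel rows; the function names are illustrative and it is not taken from any of the libraries listed in this section.

```python
def hflip_bbox(bbox, image_width):
    """Mirror a COCO-style [x, y, width, height] box left to right."""
    x, y, w, h = bbox
    return [image_width - x - w, y, w, h]


def hflip_image(image):
    """Mirror an image stored as a list of pixel rows."""
    return [row[::-1] for row in image]


# 1 marks the object in a 2 x 6 toy image.
image = [
    [0, 1, 1, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
]
bbox = [1, 0, 2, 2]

flipped_image = hflip_image(image)
flipped_bbox = hflip_bbox(bbox, image_width=len(image[0]))
print(flipped_image[0], flipped_bbox)
```

Augmentation libraries with bounding box support do this bookkeeping for you, which is a good reason to use them for detection tasks rather than image-only transforms.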
Albumentations is an image augmentation library. Here’s the official documentation and here is where you can see how the library is used for the object detection task. Also see the “Examples” section for Jupyter Notebooks that demonstrate how to use various features of Albumentations. Each notebook includes a link to Google Colab, where you can run the code by yourself.
Augmentor is a Python package designed to aid the augmentation and artificial generation of image data for machine learning tasks. It is primarily a data augmentation tool, but will also incorporate basic image pre-processing functionality. Here’s the official documentation. A Julia version of the package is also being actively developed. If you prefer to use Julia, you can find it here.
AutoAlbument is an AutoML tool that learns image augmentation policies from data using the Faster AutoAugment algorithm. It relieves the user from manually selecting augmentations and tuning their parameters. AutoAlbument provides a complete ready-to-use configuration for an augmentation pipeline. Here’s the official documentation.
The NVIDIA Data Loading Library (DALI) is a library for data loading and pre-processing to accelerate deep learning applications. It provides a collection of highly optimized building blocks for loading and processing image, video and audio data. Here’s the official documentation.
A data augmentation library for object detection. Here’s the official documentation. See this article to see how they design an input pipeline that serves images and annotations from the COCO dataset, with augmentations applied on the fly.
Image augmentation library for machine learning experiments. Here’s the official documentation.
Inspired by existing packages, this library is composed of a subset of packages containing operators that can be inserted within neural networks to train models to perform image transformations, epipolar geometry, depth estimation, and low-level image processing such as filtering and edge detection that operate directly on tensors. Here’s the official documentation.
MXNet also has a built-in augmentation library called Transforms, which is fairly similar to the PyTorch Transforms library. See documentation here. Also, here is a tutorial to play with data augmentation techniques.
Here’s a tutorial performing data augmentation using TensorFlow.
The Transforms library is the augmentation part of the torchvision package, which consists of datasets, model architectures, and common image transformations for computer vision tasks.
The library contains image transformations that can be chained together using the Compose method. Here’s the official documentation.
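The chaining idea can be shown without installing torchvision. The Compose class below is a simplified stand-in written for this post, not torchvision's implementation: it applies each callable in order to the output of the previous one. The two toy transforms and the list-of-rows image are likewise assumptions for the example.

```python
class Compose:
    """Apply a list of callables one after another (simplified stand-in)."""

    def __init__(self, transforms):
        self.transforms = transforms

    def __call__(self, image):
        for transform in self.transforms:
            image = transform(image)
        return image


def horizontal_flip(image):
    """Mirror each pixel row."""
    return [row[::-1] for row in image]


def invert(image):
    """Invert 8-bit pixel values."""
    return [[255 - px for px in row] for row in image]


pipeline = Compose([horizontal_flip, invert])
result = pipeline([[0, 10, 20], [30, 40, 50]])
print(result)
```

With the real library, the pattern is the same: you pass a list of transform objects to Compose and call the result on an image.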
Many of the labeling and augmentation tools provide ways to do visual analyses on your dataset. Some of the following types of analyses help with vision tasks [based on this paper.]
As mentioned earlier, you can scrape the web to obtain small to medium-sized datasets for image classification. For object detection, however, you need both the images and the annotation files. Earlier we covered how you can label and annotate the data you collect using a set of open source libraries and paid platforms. You also saw how data augmentation can help increase the size of the dataset.
Even after all this, there may be objects for which few (or no) good open source detection datasets are available, and collecting new data might not be feasible (or might not be the best option). In those situations, generating synthetic data for object detection may be the best option. Some easy ways to do this are to a) paste existing objects of interest onto new backgrounds and randomly change each object’s position, scale, or orientation, b) use realistic 3D rendering engines, or c) try a GAN for data generation (though this requires that you already have a network, the discriminator in the GAN, that can detect the object in question). See this blog on using synthetic data for object detection.
See this blog and referenced papers to get started. “The easiest and most straightforward approach was taken by Rao and Zhang in their paper “Cut and Paste: Generate Artificial Labels for Object Detection” (appeared on ICVIP 2017). They simply took an object detection dataset (VOC07 and VOC12), cut out objects according to their ground truth labels and pasted them onto images with different backgrounds.”
“A similar but slightly less naive approach to cutting and pasting was introduced, also in 2017, by researchers from the Carnegie Mellon University. In “Cut, Paste and Learn: Surprisingly Easy Synthesis for Instance Detection” (ICCV 2017), Dwibedi et al. use the same basic idea but instead of just placing whole bounding boxes they go for segmentation masks.”
“In “Synthesizing Training Data for Object Detection in Indoor Scenes”, Georgakis et al. from the George Mason University and University of North Carolina at Chapel Hill applied similar ideas to pasting objects into scenes rather than just text.”
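The basic cut-and-paste idea can be sketched in a few lines. The example below pastes a rectangular object patch onto a copy of a background at a random position and returns the matching bounding box, so the label comes for free. It is a toy version under stated assumptions: images are lists of pixel rows, the function name paste_object is made up, and there is no scaling, rotation, masking, or blending, all of which the papers quoted above treat with more care.

```python
import random


def paste_object(background, patch, rng):
    """Paste patch onto a copy of background at a random position.

    Both are lists of pixel rows. Returns the new image and the pasted
    object's box as [x, y, width, height].
    """
    bg_h, bg_w = len(background), len(background[0])
    p_h, p_w = len(patch), len(patch[0])
    x = rng.randint(0, bg_w - p_w)  # randint is inclusive on both ends
    y = rng.randint(0, bg_h - p_h)
    image = [row[:] for row in background]  # leave the background untouched
    for dy in range(p_h):
        image[y + dy][x:x + p_w] = patch[dy]
    return image, [x, y, p_w, p_h]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
background = [[0] * 8 for _ in range(6)]
patch = [[9, 9, 9], [9, 9, 9]]
image, bbox = paste_object(background, patch, rng)
print(bbox)
```

Calling paste_object repeatedly with different backgrounds and patches yields as many labeled samples as you like, though models trained on such naive composites can pick up on pasting artifacts.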
CVEDIA builds “synthetic AI models”. Here’s an article that talks about synthetic data in vision.
Synthetic Data combines advanced Machine Learning (ML) techniques and Computer-Generated Imagery (CGI) models to create large sets of perfectly labeled data that are optimized for Computer Vision models by design.
A synthetic data platform for ML engineers to enable the development of more capable AI models. Here’s a blog on synthetic data in object detection. The company released the first book on synthetic data, produced the first white paper surrounding facial analysis with synthetic data, published the first industry survey, and launched the first self-serve product (HumanAPI) in the space that has delivered well over 10 million generated images.
Game engine company Unity offers a tool called Unity Perception which provides a toolkit for generating large-scale datasets for computer vision training and validation. It is focused on a handful of camera-based use cases for now and will ultimately expand to other forms of sensors and machine learning tasks.
UnrealROX is an extremely photorealistic virtual reality environment built over Unreal Engine 4 and designed for generating synthetic data for various robotic vision tasks. This virtual reality environment enables robotic vision researchers to generate realistic and visually plausible data with full ground truth for a wide variety of problems such as class and instance semantic segmentation, object detection, depth estimation, visual grasping, navigation, and more.
The quality of data determines the quality of your model. Below are some data quality issues to be aware of and some potential approaches to tackle these issues.
[Thanks to How to Prioritize Data Quality for Computer Vision and papers referenced in it, including From Data Quality to Model Quality, Unbiased Look at Dataset Bias, and A Tour of Visualization Techniques for Computer Vision Datasets.]
Collecting samples among classes so that some classes are not overly represented in the training set.
Suggesting a min/max threshold on the optimal number of samples required to train the model for the specific task.
Collecting data from multiple sources to decrease selection bias.
Collecting a larger representation from the world (or context) for any set of images.
Collecting data from a variety of angles, environments, settings, etc.
Identifying label errors and providing sufficient quality control to fix them.
Designing rigorous labeling guidelines with vetted personnel and building in quality control to negate label bias.
Adding negatives from other datasets or using algorithms to actively mine hard negatives from a huge unlabeled set to remedy negative set bias.
Adding noise to data samples in the training set to help reduce generalization error and improve model accuracy on the test set.
Performing multiple data transformations to reduce capture bias.
Performing visual analyses: