70-Page Report on the COCO Dataset and Object Detection [Part 3] by @samin


One place to find tools, platforms, tutorials to explore, process, model data. Take a look at the report to quickly find common resources and/or assets for a given dataset and a specific task, in this case dataset=COCO, task=object detection. We are building a dataset-first marketplace focusing on the end-to-end machine learning pipeline, one where data and assets can be shared and traded. The marketplace will contain all that the report contains (and much more for a lot more datasets).

Shreya Amin

Math/physics to AI & data (15+ years in data science/products). www.reasonets.com/marketplace medium.com/@aminshreyaa


I’m open to suggestions, questions, and criticism — let’s start a conversation.

I have broken up the report into the following blogs:

  1. Part 1: COCO Summary Card. Each link will take you to the longer report where you can learn more. The next 3 parts represent a specific section in the report.
  2. Part 2: This part is about COCO and examples and tutorials of tools and platforms used to work with COCO (or object detection tasks).
  3. Part 3 (this one): Process: This part is about the tools and platforms that can be used for the different phases of data preparation or data processing involved in vision, object detection, and specifically COCO-related tasks. It will also discuss synthetic data and data quality.
  4. Part 4: Models: This part is about a quick introduction to some pre-trained models and some corresponding readings.

These days, training and deploying a model can be done in a matter of minutes: a few clicks are often enough to train a model and deploy it as an API. The challenging part of training object detection models on custom datasets is the data.

The next sub-section of “Process” covers the phases of data preparation or data processing involved in vision, object detection, and specifically COCO-related tasks. It mainly introduces tools and platforms that are used to perform a given data processing phase.

Phases of data processing

Below are some of the most important phases we think about during the data preparation or data processing phase (in the context of vision or object detection):

Data Collection

The first thing you need to do to perform object detection on your custom datasets is to collect and label your own dataset. Object detection tasks require bigger datasets than common image classification tasks and you end up needing to collect your own data. Below you will see some tools and tutorials you can use to achieve this. Some of these are computer vision specific, others are object detection specific and others are COCO dataset specific. Note that any object detection specific tools and tutorials can readily be used for COCO.

Search and download images to build your custom datasets

Below are Python scripts that allow you to search and download images from Google Images.

  • google-images-download: a Python script for searching and downloading hundreds of Google images to the local hard disk. It searches Google Images and downloads images based on the inputs you provide. You can specify search parameters such as keywords, number of images, image format, image size, and usage rights.
  • simple-image-download · PyPI: a Python script that lets you search for URLs of images from Google Images using your tags and/or download them automatically onto your computer.
  • A Python script (written from scratch) that uses the Selenium web automation and testing library to scrape and download images from Google Images. The code, which is very simple, accompanies a video about downloading Google Images using Python and Selenium.

Data Labeling and Annotation

While it’s true that the most important prerequisite for training a model is data, raw data alone is not useful in solving object detection problems. You need to process the data, annotate images, and transform the data to make it useful. You need to make sure that the items you want your models to detect are annotated properly, i.e., marked with labels and bounding boxes. This is an extremely time-consuming process, so make sure the people you work with for this step are well-trained and rigorous, and that the tools and platforms you choose give them the insights they need to make the best decisions.

You will get the most benefit from focusing on data processing, and in particular on data collection and annotation. Even incremental improvements to your data, such as making your dataset more balanced, tend to pay off more than focusing on models. You will also save quite a bit of training time by being data-centric.

Below are a few different types of annotations that are common in computer vision:

  1. Bounding Boxes: the most commonly used type of annotation in computer vision. Bounding boxes are rectangular boxes used to define the location of the target object. They are generally used in object detection and localization tasks.
  2. Polygonal Segmentation: complex polygons are used instead of rectangles to define the shape and location of the object in a much more precise way.
  3. Semantic Segmentation: a pixel-wise annotation, where every pixel in the image is assigned to a class and each pixel carries semantic meaning. (note: Semantic segmentation is primarily used in cases where environmental context is very important — such as self-driving cars and robotics.)
  4. Key-Point and Landmark: dots placed across the image to detect small objects and shape variations (useful for detecting facial features, facial expressions, emotions, human body parts, and poses).
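To make the first two annotation types concrete, here is a minimal sketch of what a single annotation record can look like in the COCO layout, where `bbox` is stored as `[x_min, y_min, width, height]` and a polygon is a flat list of `x, y` coordinates. The values are made up for illustration, and the helper names `xywh_to_xyxy` and `polygon_area` are hypothetical, not part of any COCO tooling.

```python
# A made-up annotation record following the COCO layout.
annotation = {
    "image_id": 1,
    "category_id": 1,  # illustrative class id
    # Bounding box as [x_min, y_min, width, height] in pixels.
    "bbox": [50.0, 30.0, 200.0, 100.0],
    # Polygon as a flat list [x1, y1, x2, y2, ...]; here it traces the box.
    "segmentation": [[50.0, 30.0, 250.0, 30.0, 250.0, 130.0, 50.0, 130.0]],
}


def xywh_to_xyxy(bbox):
    """Convert [x, y, width, height] to corner format [x_min, y_min, x_max, y_max]."""
    x, y, w, h = bbox
    return [x, y, x + w, y + h]


def polygon_area(flat_points):
    """Area of a simple polygon given as [x1, y1, x2, y2, ...] (shoelace formula)."""
    xs = flat_points[0::2]
    ys = flat_points[1::2]
    total = 0.0
    for i in range(len(xs)):
        j = (i + 1) % len(xs)  # next vertex, wrapping around to the first
        total += xs[i] * ys[j] - xs[j] * ys[i]
    return abs(total) / 2.0


corners = xywh_to_xyxy(annotation["bbox"])
area = polygon_area(annotation["segmentation"][0])
```

Many tools expect corner format rather than width and height, so a conversion like this is a common first step when moving annotations between formats.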

Label or Annotate images: Open Source tools


CVAT is a free, online, interactive video and image annotation tool for computer vision. See tutorial. [See the above section for more tutorials.]


LabelImg is a graphical image annotation tool for labeling object bounding boxes in images. See tutorial.

Label Studio

Label Studio is an open source data labeling tool for labeling and exploring multiple types of data. You can perform different types of labeling with many data formats as well as export annotation data in COCO format. See tutorials. Also see tutorial for semantic segmentation (not object detection).


Makesense.ai is a free-to-use online tool for labeling photos. Here’s the GitHub link. See tutorial.


OpenLabeling is an open source tool to label images and videos for Computer Vision applications. See tutorial.

Label or Annotate images: Paid Platforms


The all-in-one end-to-end cloud-based annotation platform covering the entire data management cycle. See tutorial.

Hive Data

A fully managed data annotation solution to source and label training data for AI / ML Models. [Note. Hive also has Hive Models as part of the whole solution. Hive Models include cloud-hosted deep learning models.]


Labelbox’s training data platform is designed to help accelerate model development and improve its performance by iterating faster. With Labelbox, you can annotate data, prepare and orchestrate model training jobs, diagnose model performance, and prioritize data to label. See tutorial. Labelbox also provides many other tutorials.

Scale AI

Scale offers a data platform that enables annotations of large volumes of 3D sensor, image, and video data. It provides ML-powered pre-labeling, an automated quality assurance system, dataset management, document processing, and AI-assisted data annotation geared towards data processing for autonomous driving.


The end-to-end platform to annotate, version, and manage ground truth data for your AI. [See the above section for tutorials.]


V7 is an automated annotation platform combining dataset management, image and video annotation, and autoML model training to automatically complete labeling tasks. As an example see V-COCO — V7 Open Datasets: Verbs in COCO (V-COCO) is a dataset that builds off COCO for human-object interaction detection. V-COCO provides 10,346 images (2,533 for training, 2,867 for validating and 4,946 for testing) and 16,199 person instances. Each person has annotations for 29 action categories and there are no interaction labels including objects.


Using the platform, you can train custom object detection models to identify any object, such as people, cars, particles in water, imperfections in materials, or objects of the same shape, size, or color. See the tutorial How to Prepare Data for Object Detection? by Michal Lukac.

[More annotation tools can be found here: Dataset list — Annotation tools]

Data Cleaning

We’ve heard this many times and it’s almost always true: Better Data > Fancier Algorithms. If you want to build your own object detection solution using a custom dataset, it is not enough to collect and label data; you also need to clean it. Below are some high-level steps you can take to clean data:

  • Use one of the labeling and annotation tools, or any other cleaning tool, to remove unwanted observations from the dataset. Check for duplicate or irrelevant data or labels and remove or correct them.
  • Check for mislabeled classes, i.e., separate classes that should really be the same class.
  • Check for missing data or labels.
  • Check the number of samples per class to see whether the dataset really is imbalanced.
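The checks above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the records use the COCO key names (`id`, `image_id`, `category_id`, `bbox`), the sample data is made up, and `audit_annotations` is a hypothetical helper, not a function from any of the tools mentioned here.

```python
from collections import Counter


def audit_annotations(images, annotations):
    """Flag duplicate boxes, degenerate boxes, unlabeled images, and class counts."""
    seen = set()
    duplicates, invalid = [], []
    for ann in annotations:
        # Two annotations with the same image, class, and box count as duplicates.
        key = (ann["image_id"], ann["category_id"], tuple(ann["bbox"]))
        if key in seen:
            duplicates.append(ann["id"])
        seen.add(key)
        # A box with non-positive width or height cannot contain an object.
        _, _, width, height = ann["bbox"]
        if width <= 0 or height <= 0:
            invalid.append(ann["id"])
    annotated_ids = {ann["image_id"] for ann in annotations}
    unlabeled = [img["id"] for img in images if img["id"] not in annotated_ids]
    class_counts = Counter(ann["category_id"] for ann in annotations)
    return {
        "duplicates": duplicates,
        "invalid_boxes": invalid,
        "unlabeled_images": unlabeled,
        "class_counts": dict(class_counts),
    }


# Made-up sample data for illustration.
images = [{"id": 1}, {"id": 2}, {"id": 3}]
annotations = [
    {"id": 10, "image_id": 1, "category_id": 1, "bbox": [0, 0, 10, 10]},
    {"id": 11, "image_id": 1, "category_id": 1, "bbox": [0, 0, 10, 10]},  # duplicate
    {"id": 12, "image_id": 2, "category_id": 2, "bbox": [5, 5, 0, 8]},    # zero width
]
report = audit_annotations(images, annotations)
```

An image with no annotations is not necessarily an error (it may be a deliberate negative example), so the report only lists candidates for a human to review.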

Here is a tutorial showing an example of data cleaning for an image classification task. The tutorial uses a technique called Haar cascades. If you go to the GitHub page for Haar cascades, you will see a separate XML file containing the feature set for each detection target, such as smile, eye, frontal face, full body, lower body, and more.

Data Transformations

Data augmentation is a common way to increase the number of training samples. It allows you to create different variations of the same image, i.e., you apply a set of techniques that artificially increase the amount of data you have by generating new images from existing data. This could mean making small changes to your images or using deep learning models to generate new images. This can help in two ways: it can improve your model performance and reduce the cost of collecting and labeling data.

What is Data Augmentation? Techniques & Examples in 2022 (aimultiple.com)


However, while performing data augmentation, both in computer vision in general and in object detection specifically, you need to be careful that the transformations you choose don’t alter the ground truth. In object detection, for example, a geometric transformation such as a flip or crop has to be applied to the bounding boxes as well as to the image. You can apply your transformations and test the performance of your models for different augmentation decisions. This will allow you to pick a more robust model before you deploy. Here are two articles (here and here) to better understand data augmentation.
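As a small illustration of keeping the ground truth intact, the sketch below flips an image horizontally and updates its bounding box to match. It assumes boxes in `[x_min, y_min, width, height]` form and represents the image as a plain list of pixel rows so that it runs without any library; the function names are hypothetical. Augmentation libraries handle this bookkeeping for you, and this is not their implementation.

```python
def hflip_image(image):
    """Mirror an image (a list of pixel rows) left to right."""
    return [row[::-1] for row in image]


def hflip_bbox(bbox, image_width):
    """Mirror an [x_min, y_min, width, height] box to match a flipped image."""
    x, y, w, h = bbox
    # The box's right edge (x + w) becomes its distance from the new left edge.
    return [image_width - x - w, y, w, h]


def crop(image, bbox):
    """Return the pixels inside an [x_min, y_min, width, height] box."""
    x, y, w, h = bbox
    return [row[x:x + w] for row in image[y:y + h]]


# A tiny 3x6 "image" whose object occupies the two non-zero pixels.
image = [
    [0, 0, 0, 0, 0, 0],
    [0, 7, 8, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
bbox = [1, 1, 2, 1]

flipped_image = hflip_image(image)
flipped_bbox = hflip_bbox(bbox, image_width=len(image[0]))
```

Cropping `flipped_image` with `flipped_bbox` still returns the object's pixels (mirrored), while cropping it with the original `bbox` returns only background, which is exactly the kind of silently wrong label to watch for.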



Albumentations is an image augmentation library. Here’s the official documentation and here is where you can see how the library is used for the object detection task. Also see the “Examples” section for Jupyter Notebooks that demonstrate how to use various features of Albumentations. Each notebook includes a link to Google Colab, where you can run the code by yourself.


AugLy is a data augmentations library (by Facebook) that currently supports four modalities (audio, image, text & video) and over 100 augmentations. Here’s the official documentation.


Augmentor is a Python package designed to aid the augmentation and artificial generation of image data for machine learning tasks. It is primarily a data augmentation tool, but will also incorporate basic image pre-processing functionality. Here’s the official documentation. A Julia version of the package is also being actively developed. If you prefer to use Julia, you can find it here.


AutoAlbument is an AutoML tool that learns image augmentation policies from data using the Faster AutoAugment algorithm. It relieves the user from manually selecting augmentations and tuning their parameters. AutoAlbument provides a complete ready-to-use configuration for an augmentation pipeline. Here’s the official documentation.


The NVIDIA Data Loading Library (DALI) is a library for data loading and pre-processing to accelerate deep learning applications. It provides a collection of highly optimized building blocks for loading and processing image, video and audio data. Here’s the official documentation.

Data Augmentation For Object Detection

A data augmentation library for object detection. Here’s the official documentation. See this article to see how they design an input pipeline that serves images and annotations from the COCO dataset, with augmentations applied on the fly.


Image augmentation library for machine learning experiments. Here’s the official documentation.


Inspired by existing packages, this library is composed of a subset of packages containing operators that can be inserted within neural networks to train models to perform image transformations, epipolar geometry, depth estimation, and low-level image processing such as filtering and edge detection that operate directly on tensors. Here’s the official documentation.


MXNet also has a built-in augmentation library called Transforms, which is fairly similar to the PyTorch Transforms library. See documentation here. Also, here is a tutorial to play with data augmentation techniques.


Here’s a tutorial performing data augmentation using TensorFlow.

Transforms (PyTorch)

The Transforms library is the augmentation part of the torchvision package, which consists of popular datasets, model architectures, and common image transformations for computer vision tasks.

The library contains image transformations that can be chained together using Compose. Here’s the official documentation.

Data Visualization

Many of the labeling and augmentation tools provide ways to do visual analyses on your dataset. Some of the following types of analyses help with vision tasks [based on this paper.]

  • Pixel-component analysis: to see which image features are behind significant variations in the dataset
  • Spatial analysis: to assess data augmentation methods to mitigate any skewness in the spatial distribution
  • Average image analysis: to compare subsets of images of the same nature and semantics to reveal visual cues in the dataset that the models can use as “shortcuts” instead of learning robust semantic features
  • Metadata analysis: to assess the diversity of a dataset to guide the curation and labeling of a representative dataset
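As one example, a very rough form of spatial analysis is to count where bounding box centers fall on a coarse grid over the image. The sketch below assumes boxes in `[x_min, y_min, width, height]` form together with each image's width and height; `center_heatmap` is a hypothetical helper and the sample boxes are made up.

```python
def center_heatmap(boxes, grid=3):
    """Count bounding box centers per cell of a grid x grid layout.

    boxes: iterable of (bbox, image_width, image_height) tuples,
    where bbox is [x_min, y_min, width, height].
    """
    counts = [[0] * grid for _ in range(grid)]
    for (x, y, w, h), img_w, img_h in boxes:
        # Normalize the center to [0, 1] so images of any size are comparable.
        cx = (x + w / 2) / img_w
        cy = (y + h / 2) / img_h
        col = min(int(cx * grid), grid - 1)  # clamp centers on the far edge
        row = min(int(cy * grid), grid - 1)
        counts[row][col] += 1
    return counts


# Made-up boxes: two near the image center, one in the top-left corner.
boxes = [
    ([40, 40, 20, 20], 100, 100),
    ([45, 45, 10, 10], 100, 100),
    ([0, 0, 10, 10], 100, 100),
]
heatmap = center_heatmap(boxes)
```

If most counts pile up in the middle cell, objects are mostly centered in the frame, which suggests trying augmentations such as translation or random cropping to even out the spatial distribution.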

Synthetic Data

As mentioned earlier, you can scrape the web to obtain small to medium-sized datasets for image classification. For object detection, however, you need both the images and the annotation files. Earlier we covered how you can label and annotate the data you collect using a set of open source libraries and paid platforms. You also saw how data augmentation can help increase the size of the dataset. Even after all this, there may be objects for which few (or no) good open source detection datasets are available, and collecting new data might not be feasible (or might not be the best option). In such situations, generating synthetic data for object detection may be the best option. Some easy ways to do this are a) pasting existing objects of interest onto new backgrounds and randomly changing each object’s position, scale, or orientation, b) using realistic 3D rendering engines, and c) using a GAN for data generation (but this requires that you already have a network, the discriminator in the GAN, that can detect the object in question). See this blog on using synthetic data for object detection.

Cut and paste

See this blog and referenced papers to get started. “The easiest and most straightforward approach was taken by Rao and Zhang in their paper “Cut and Paste: Generate Artificial Labels for Object Detection” (appeared on ICVIP 2017). They simply took an object detection dataset (VOC07 and VOC12), cut out objects according to their ground truth labels and pasted them onto images with different backgrounds.”

“A similar but slightly less naive approach to cutting and pasting was introduced, also in 2017, by researchers from the Carnegie Mellon University. In “Cut, Paste and Learn: Surprisingly Easy Synthesis for Instance Detection” (ICCV 2017), Dwibedi et al. use the same basic idea but instead of just placing whole bounding boxes they go for segmentation masks.”

“In “Synthesizing Training Data for Object Detection in Indoor Scenes”, Georgakis et al. from the George Mason University and University of North Carolina at Chapel Hill applied similar ideas to pasting objects into scenes rather than just text.”
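The basic cut-and-paste idea can be sketched in plain Python. This is a deliberately naive illustration, not the method of any of the papers quoted above: it pastes a whole rectangular patch (no segmentation mask, scaling, rotation, or blending) at a random position and returns the matching bounding box label. Images are lists of pixel rows, and `paste_object` is a hypothetical helper.

```python
import random


def paste_object(background, patch, rng):
    """Paste a rectangular patch at a random position on a copy of the background.

    Returns the new image and its label as [x_min, y_min, width, height].
    """
    bg_h, bg_w = len(background), len(background[0])
    p_h, p_w = len(patch), len(patch[0])
    if p_h > bg_h or p_w > bg_w:
        raise ValueError("patch does not fit inside the background")
    # Pick a top-left corner so that the whole patch stays inside the image.
    x = rng.randint(0, bg_w - p_w)
    y = rng.randint(0, bg_h - p_h)
    out = [row[:] for row in background]  # copy so the background is not modified
    for dy in range(p_h):
        out[y + dy][x:x + p_w] = patch[dy]
    return out, [x, y, p_w, p_h]


# Made-up data: an empty 5x8 background and a 2x2 "object".
background = [[0] * 8 for _ in range(5)]
patch = [[1, 2], [3, 4]]
rng = random.Random(0)  # seeded so the example is repeatable
synthetic, bbox = paste_object(background, patch, rng)
```

Because the paste position is chosen by the code, the bounding box label comes for free, which is the main appeal of this family of methods.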

3D Rendering Engines


CVEDIA builds “synthetic AI models”. Here’s an article that talks about synthetic data in vision.


Synthetic Data combines advanced Machine Learning (ML) techniques and Computer-Generated Imagery (CGI) models to create large sets of perfectly labeled data that are optimized for Computer Vision models by design.

Synthesis AI

A synthetic data platform for ML engineers to enable the development of more capable AI models. Here’s a blog on synthetic data in object detection. The company released the first book on synthetic data, produced the first white paper surrounding facial analysis with synthetic data, published the first industry survey, and launched the first self-serve product (HumanAPI) in the space that has delivered well over 10 million generated images.


Game engine company Unity offers a tool called Unity Perception which provides a toolkit for generating large-scale datasets for computer vision training and validation. It is focused on a handful of camera-based use cases for now and will ultimately expand to other forms of sensors and machine learning tasks.


UnrealROX is an extremely photorealistic virtual reality environment built over Unreal Engine 4 and designed for generating synthetic data for various robotic vision tasks. This virtual reality environment enables robotic vision researchers to generate realistic and visually plausible data with full ground truth for a wide variety of problems such as class and instance semantic segmentation, object detection, depth estimation, visual grasping, navigation, and more.

Data Quality

The quality of data determines the quality of your model. Below are some data quality issues to be aware of and some potential approaches to tackle these issues.

[Thanks to How to Prioritize Data Quality for Computer Vision and papers referenced in it, including From Data Quality to Model Quality, Unbiased Look at Dataset Bias, and A Tour of Visualization Techniques for Computer Vision Datasets.]

Data quality issues

  1. Dataset equilibrium refers to how evenly samples are distributed among classes and how far the sample distribution deviates from balance. For example, an experiment might delete all data of one specific category to measure the effect. The same issue can arise during data collection: say we need data for all digits, but decide not to collect as many “1s” or “7s”.
  2. Dataset size is measured by the number of samples; large-scale datasets usually have better sample diversity. For example, an experiment might change the dataset size by randomly deleting a specific percentage of the training set. The same issue can arise during data collection: if we collect only a very small amount of data of typed numbers and no handwritten numbers, this could lead to problems if inference involves both typed and handwritten numbers.
  3. Selection bias refers to datasets often being composed of particular types of images.
  4. Capture bias refers to how photographers (or those collecting data) tend to capture objects in similar ways. For example, there are usually more images of an object from the front, directly facing the camera.
  5. Quality of label refers to whether the labels of the dataset are complete and accurate. For example, an experiment might randomly change labels to wrong ones to see the effect on model robustness.
  6. Label bias refers to semantic categories being poorly defined or being labeled differently by different labelers. In such cases, establishing ground truth may not be possible.
  7. Negative set bias refers to having an unbalanced dataset where the rest of the world (everything other than the objects of interest) is underrepresented, so models may be overconfident when interacting with such objects.
  8. Dataset contamination refers to the degree of data errors or malicious data artificially added to datasets. For example, an experiment might use methods such as contrast modification and noise injection to add contamination to the images and see the effect on model robustness.
  9. Not investigating datasets visually leads to unintended errors that can negatively impact model performance.
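Two of the issues above, dataset equilibrium and quality of label, are easy to probe with a few lines of code. The sketch below is an assumption-laden illustration rather than a procedure from the referenced papers: it measures imbalance as the ratio of the largest to the smallest class count, and it corrupts a chosen fraction of labels so you can retrain and observe the effect. The function names and the sample labels are made up.

```python
import random
from collections import Counter


def imbalance_ratio(labels):
    """Ratio of the most common class count to the least common one (1.0 = balanced)."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def corrupt_labels(labels, classes, fraction, rng):
    """Return a copy of labels with a given fraction changed to a different class."""
    n_flip = round(fraction * len(labels))
    noisy = list(labels)
    for i in rng.sample(range(len(labels)), n_flip):
        # Choose any class except the current one, so every flip is a real error.
        noisy[i] = rng.choice([c for c in classes if c != labels[i]])
    return noisy


# Made-up labels for illustration.
labels = ["cat"] * 6 + ["dog"] * 3 + ["bird"] * 1
classes = ["cat", "dog", "bird"]

ratio = imbalance_ratio(labels)
noisy_labels = corrupt_labels(labels, classes, fraction=0.2, rng=random.Random(0))
n_changed = sum(a != b for a, b in zip(labels, noisy_labels))
```

Training the same model on `labels` and on `noisy_labels` and comparing the results gives a rough sense of how sensitive the model is to labeling mistakes.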

Potential approaches to resolve data quality issues

  1. Collecting samples among classes so that some classes are not overly represented in the training set.

  2. Suggesting a min/max threshold on the optimal number of samples required to train the model for the specific task.

  3. Collecting data from multiple sources to decrease selection bias.

  4. Collecting a larger representation from the world (or context) for any set of images.

  5. Collecting data from a variety of angles, environments, settings, etc.

  6. Identifying label errors and providing sufficient quality control to fix them.

  7. Designing rigorous labeling guidelines with vetted personnel and building in quality control to negate label bias.

  8. Adding negatives from other datasets or using algorithms to actively mine hard negatives from a huge unlabeled set to remedy negative set bias.

  9. Adding noise to data samples in the training set to help reduce generalization error and improve model accuracy on the test set.

  10. Performing multiple data transformations to reduce capture bias.

  11. Performing visual analyses:

    1. Pixel-component analysis: which image features are behind significant variations in the dataset
    2. Spatial analysis: assess data augmentation methods to mitigate any skewness in the spatial distribution
    3. Average image analysis: compare subsets of images of the same nature and semantics to reveal visual cues in the dataset that the models can use as “shortcuts” instead of learning robust semantic features
    4. Metadata analysis: assess the diversity of a dataset to guide the curation and labeling of a representative dataset
    5. Analysis using trained models

If you have feedback, please review this link (Marketplace - Coming Soon | ReasoNets) and email me. Looking forward to starting a conversation.

Next, Part 4.
