Object Detection Frameworks That Will Dominate 2023 and Beyond

by Vladimir MikhelevJuly 21st, 2023

Too Long; Didn't Read

Vladimir Mikhelev is a Machine Learning Engineer specializing mainly in computer vision. In this article, he reviews the current object-detection frameworks for machine learning.

Hello, I'm Vladimir Mikhelev, a Machine Learning Engineer specializing mainly in computer vision. Today, I present my review of the current object-detection frameworks and provide my personal perspective. Object detection has been a well-established task for many years, with numerous frameworks developed and a wealth of papers and articles written on the subject. You might even be using some of these in your current projects.


For years, I've primarily used TensorFlow as the main framework for various computer vision tasks, such as object detection, image classification, image re-identification, and multi-object tracking, to solve production problems. Recently, however, I've found TensorFlow less comfortable to use due to the difficult transition to TensorFlow 2.0 and the breaking changes relative to TF 1.x. As a result, many TF 1.x repositories have become abandoned and unmaintained. Some of them, still written in Python 2, are practically unusable. When you search for information, it often turns out to be outdated. That said, there could be a resurgence for the TensorFlow Lite library, since it is widely used on mobile devices, but that's a topic for another time.


On the other hand, PyTorch, backed by Meta, has emerged as the most popular framework. When you look at the statistics from recent papers and GitHub repositories, you'll find that PyTorch is the leading framework for machine learning tasks. I believe its popularity is due to its simplicity and eager-style execution. Moreover, the recent release of PyTorch 2.0 not only preserves backward compatibility but also introduces a wealth of features to speed up your training process. Congratulations to the PyTorch team!


Now, formally, object detection can be defined as a "process in computer vision that involves identifying and locating objects in an image or video." With this in mind, let's review the currently available frameworks for object-detection that I've found online.

Requirements for a framework

Next, I'd like to discuss some essential attributes of a suitable framework for production use. Firstly, it should support the latest architectures (state-of-the-art) to ensure compatibility and take advantage of new, efficient methods. Support for FP16 (half-precision floating-point format) is critical for optimizing memory usage and improving computation speed.
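
As a rough illustration (not specific to any framework reviewed below), here is a minimal sketch of mixed-precision training in PyTorch using torch.cuda.amp; the toy model and random tensors are placeholders for a real detector and dataset.

import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
# tiny placeholder model standing in for a real detector
model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.AdaptiveAvgPool2d(1),
                      nn.Flatten(), nn.Linear(16, 2)).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))

images = torch.randn(8, 3, 224, 224, device=device)
labels = torch.randint(0, 2, (8,), device=device)

optimizer.zero_grad()
with torch.cuda.amp.autocast(enabled=(device == "cuda")):  # forward pass runs in FP16 where safe
    loss = nn.functional.cross_entropy(model(images), labels)
scaler.scale(loss).backward()  # loss scaling avoids FP16 gradient underflow
scaler.step(optimizer)
scaler.update()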


The ability to convert models to the Open Neural Network Exchange (ONNX) and TensorRT formats is another crucial feature. ONNX provides an open-source format for AI models, enabling models to be transferred between various ML frameworks. TensorRT, on the other hand, is a high-performance deep learning inference optimizer and runtime library, which delivers low latency and high throughput for deep learning applications.
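
To make this concrete, here is a hedged sketch of an ONNX export with torch.onnx.export, using a plain torchvision backbone purely for illustration (full detection models often need model-specific export handling, and the weights argument assumes a recent torchvision release).

import torch
import torchvision

model = torchvision.models.resnet18(weights="DEFAULT").eval()
dummy_input = torch.randn(1, 3, 224, 224)

torch.onnx.export(
    model, dummy_input, "resnet18.onnx",
    opset_version=13,
    input_names=["images"], output_names=["logits"],
    dynamic_axes={"images": {0: "batch"}, "logits": {0: "batch"}},  # allow variable batch size
)
# The resulting .onnx file can then be handed to TensorRT (e.g. trtexec --onnx=resnet18.onnx) to build an engine.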


Furthermore, multi-GPU support is important for scalability and accelerating computation speed. It allows you to distribute your model's training on multiple GPUs, thus speeding up the training process significantly.
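
For reference, a minimal sketch of a DistributedDataParallel (DDP) setup in PyTorch looks roughly like this; it assumes the script is launched with torchrun (one process per GPU), and the tiny model and random batch are placeholders.

import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")       # torchrun sets the rendezvous env vars
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = nn.Linear(128, 2).cuda(local_rank)    # placeholder for a real detector
    model = DDP(model, device_ids=[local_rank])   # gradients are synchronized across GPUs
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

    x = torch.randn(32, 128).cuda(local_rank)
    y = torch.randint(0, 2, (32,)).cuda(local_rank)
    loss = nn.functional.cross_entropy(model(x), y)
    loss.backward()
    optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()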


The chosen framework should also come with a reasonable license that allows commercial use without restrictive conditions. This is particularly important when considering deploying the models in a commercial product or service.

Frameworks

TensorFlow 2 Object Detection API - Link

The TensorFlow 2 Object Detection API is a potent and adaptable toolkit for object detection. It supports advanced model architectures like CenterNet, SSD, RetinaNet, Faster R-CNN, etc. The API uses protobuf configs and TFRecords as its dataset format. This "zero-code" framework requires minimal coding: simply prepare your data, configure your protobuf configs, and you're ready to start training. While this ease of use is a significant advantage, it can become a challenge if you need to implement custom features or modifications.


Example of training config - https://github.com/tensorflow/models/blob/master/research/object_detection/configs/tf2/ssd_resnet50_v1_fpn_1024x1024_coco17_tpu-8.config


However, due to the current inadequate support for TensorFlow, the framework seems to be abandoned, as evidenced by the lack of recent commits (refer to attached image). Given its seemingly deprecated status, I wouldn't recommend its use for future projects. It's always advisable to opt for actively maintained and supported tools to ensure longevity and timely updates or fixes.


Deprecation message - https://github.com/tensorflow/models/tree/master/research/object_detection


Model Zoo: Model Zoo

License: Apache 2

Backend: Tensorflow 1 and Tensorflow 2

Pros: Easily configurable via protobuf configs; you can start training very quickly.

Cons: Deprecated; tied to TensorFlow; poor ONNX and TensorRT support; lack of SOTA models; TFRecord dataset format, so you have to convert your data; poor mixed-precision (FP16) support.

TF-Vision - Link

TF-Vision, or TensorFlow Vision, is a set of libraries provided by the TensorFlow team to support machine learning in computer vision applications. It offers various pre-trained models and APIs to facilitate the development and deployment of machine learning models for vision tasks.


Training command - https://github.com/tensorflow/models/tree/master/official/vision/examples/starter


Model Zoo: Model Zoo

License: Apache 2

Backend: Tensorflow 2

Pros: I don't see any; I don't recommend it.

Cons: tied to TensorFlow; very few models; TFRecord dataset format, so you have to convert your data.

Scenic - Link

https://github.com/google-research/scenic


From official repo “Scenic is a codebase with a focus on research around attention-based models for computer vision. Scenic has been successfully used to develop classification, segmentation, and detection models for multiple modalities including images, video, audio, and multimodal combinations of them. Scenic is developed in JAX and uses Flax.”


While such projects are great for exploring new techniques and ideas, they may not be the best fit for production-level workloads. Research projects can sometimes lack the robustness, support, or wide-ranging features that more established libraries offer. They may also be subject to significant changes as the research evolves, which could potentially lead to instability in a production setting.


In summary, if your focus is on studying and experimenting with transformer models specifically, Scenic could be a useful library. However, if your goal is to deploy robust models in a production environment, you might want to consider other, more mature libraries or frameworks.


Plenty of models - https://github.com/google-research/scenic/tree/main/scenic/projects


Model Zoo: Model Zoo

License: Apache 2

Backend: JAX

Pros: JAX backend (wow!); transformer models

Cons: seems intended for research purposes only

Torchvision - Link

https://pytorch.org/vision/stable/index.html


Torchvision is indeed a fundamental library that provides utilities for working with image data in PyTorch. It is an extension of PyTorch that has become a must-have when dealing with vision-related tasks in deep learning, such as image classification, object detection, and semantic segmentation.


While torchvision may not have a vast variety of models compared to some other libraries, the models it does offer are widely used in the deep learning community. Moreover, these models serve as excellent starting points for many vision-based projects due to their robust performance on a variety of tasks. They can be used as they are for inference, or they can be fine-tuned for more specific tasks, facilitating the development of customized solutions.


By providing a solid foundation for vision-related tasks, torchvision helps to streamline the development process and allows researchers and developers to focus on designing and fine-tuning models rather than having to implement standard tools and models from scratch.
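
As a quick sketch of that workflow (assuming a recent torchvision release with the weights API; the image path is a placeholder):

import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

weights = torchvision.models.detection.FasterRCNN_ResNet50_FPN_Weights.DEFAULT
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights=weights).eval()

image = to_tensor(Image.open("example.jpg").convert("RGB"))
with torch.no_grad():
    predictions = model([image])[0]      # detection models take a list of image tensors

keep = predictions["scores"] > 0.5       # drop low-confidence detections
print(predictions["boxes"][keep], predictions["labels"][keep])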


Model Zoo: Model Zoo

License: BSD 3

Backend: PyTorch

Pros: a lot of helper functions

Cons: seems useful mainly as a helper library for PyTorch projects

Detectron2 - Link

https://github.com/facebookresearch/detectron2


Detectron2 is an open-source platform developed by Facebook's AI Research team (FAIR), aimed at providing a high-quality, high-performance codebase for object detection and segmentation. This second version of Detectron is built using PyTorch, and it includes features like panoptic segmentation, Densepose, Cascade R-CNN, rotated bounding boxes, and more. It's highly modular and flexible, enabling users to easily add custom components or features to meet their specific needs in research or deployment.


Training is as easy as with the TF Object Detection API: just prepare a config and run the training command.

Prepare config and run training - https://detectron2.readthedocs.io/en/latest/tutorials/getting_started.html
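
A hedged sketch of that flow in code (dataset names, paths, and the number of classes are placeholders for your own data):

from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.data.datasets import register_coco_instances
from detectron2.engine import DefaultTrainer

register_coco_instances("my_train", {}, "annotations/train.json", "images/train")

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file("COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml"))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url("COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml")
cfg.DATASETS.TRAIN = ("my_train",)
cfg.DATASETS.TEST = ()
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 1      # classes in the custom dataset
cfg.SOLVER.MAX_ITER = 1000

trainer = DefaultTrainer(cfg)            # DefaultTrainer handles the standard training loop
trainer.resume_or_load(resume=False)
trainer.train()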


Model Zoo: Model Zoo

License: Apache 2

Backend: PyTorch

Pros: easy training, big community

Cons: relatively few models; geared more toward segmentation tasks

Yolo repositories

YOLO, which stands for 'You Only Look Once,' has become a household name in the realm of object detection due to its real-time performance and excellent accuracy. Over the years, various versions of YOLO (v1-v8) have been released by different developers, with YOLOv7 and YOLOv4 being two noteworthy examples. I won't describe how YOLO itself works; you can find plenty of material on the internet.


YOLOv7, although it offers many state-of-the-art features and is written in PyTorch, has licensing constraints that may limit its broad usage in commercial or other specific applications.


YOLOv4, on the other hand, employs a somewhat less commonly used training framework named Darknet.


Darknet was originally developed for YOLO models, but it may seem unusual to those more accustomed to popular frameworks such as TensorFlow or PyTorch. In particular, the sight of commands ending with '.exe' or screenshots from Visual Studio (characteristic of Windows-based development) may appear out of place for developers used primarily to Linux-based environments.


Model Zoo: Model Zoo (v4), Model Zoo (v7)

License: Various; Free to use (YOLOv4), GPLv3 (YOLOv7)

Backend: DarkNet, PyTorch

Pros: Very well-known; good community; support for ONNX and TensorRT; good trade-off between accuracy and inference speed; different model sizes;

Cons: the latest versions (YOLOv5 and later) have licenses that are not appropriate for commercial use; there are many implementations, and some repositories try to mimic others.

Super-gradients - Link

https://github.com/Deci-AI/super-gradients


Super-gradients is a newly introduced framework that seems to focus on object detection tasks. They offer an intriguing model called YOLO-NAS (Neural Architecture Search), which they claim outperforms the existing versions of YOLO.


Neural Architecture Search (NAS) is an automated approach that assists in the discovery of high-performing models with minimal human intervention. The concept of combining YOLO with NAS could offer significant advancements, as it would theoretically allow the architecture to adapt and optimize itself to better performance.


However, as with any new tool or framework, it's crucial to thoroughly test and validate it according to your specific use case before deciding to use it for critical projects or production environments. It would be beneficial to review their documentation, experiment with their code, and compare the performance of their models with other well-established models and frameworks.


# run training in a few lines

from super_gradients.training import models, Trainer, training_hyperparams, dataloaders
from super_gradients.common.object_names import Models

# a Trainer instance is needed before calling train() (it was missing in the original snippet)
trainer = Trainer(experiment_name="cifar10_resnet18", ckpt_root_dir="./checkpoints")

train_dataloader = dataloaders.get("cifar10_train", dataset_params={}, dataloader_params={"num_workers": 2})
valid_dataloader = dataloaders.get("cifar10_val", dataset_params={}, dataloader_params={"num_workers": 2})

model = models.get(model_name=Models.RESNET18, num_classes=10, pretrained_weights="imagenet")

# you can see more recipes in super_gradients/recipes
training_params = training_hyperparams.get("training_hyperparams/cifar10_resnet_train_params")

trainer.train(model=model,
              training_params=training_params,
              train_loader=train_dataloader,
              valid_loader=valid_dataloader)


Model Zoo: Model Zoo

License: Apache 2

Backend: PyTorch

Pros: Good license; less code; offers state-of-the-art models; integration with ML tools; full ONNX and TensorRT support

Cons: none that I can see; maybe the community isn't big yet. I'd like to try it!

MMdetection - Link

https://github.com/open-mmlab/mmdetection


MMDetection is a comprehensive and open-source toolbox from MMLab that provides a diverse set of object detection and instance segmentation methods. It contains a wide range of models, including popular ones like FCOS and YOLOX, among others.


One unique aspect of MMDetection is its support for rotated object detection. This is useful in scenarios where objects are not necessarily aligned with the image axis, such as satellite imagery or document analysis, making it a versatile choice for various application scenarios.


However, it's worth noting that the installation of MMDetection involves managing quite a number of dependencies. This might complicate the setup process, particularly in environments with specific configurations. Nonetheless, the benefits offered by the MMDetection framework, in terms of model diversity and features, often outweigh this initial setup overhead for many users.

Tons of models - https://github.com/open-mmlab/mmdetection/tree/main/configs
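
For inference, MMDetection exposes a compact high-level API; a minimal sketch (the config and checkpoint paths below are placeholders that follow the repository's naming conventions):

from mmdet.apis import init_detector, inference_detector

config_file = "configs/yolox/yolox_s_8xb8-300e_coco.py"   # placeholder config from the repo
checkpoint_file = "checkpoints/yolox_s.pth"               # matching downloaded checkpoint

model = init_detector(config_file, checkpoint_file, device="cuda:0")
result = inference_detector(model, "demo/demo.jpg")       # detections for a single image
print(result)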


Model Zoo: Model Zoo

License: Apache 2

Backend: PyTorch

Pros: an enormous number of available models (including very recent ones!); ONNX and TensorRT support; a big community and its own ecosystem.

Cons: a lot of dependencies (its own ecosystem); think twice before using it in production, since MMDet relies on companion packages such as `mmcv` for postprocessing model outputs.


IceVision - Link

IceVision is an open-source library that focuses on providing an end-to-end solution for object detection, allowing users to plug and play with different models and backbones. Its flexibility is one of its key strengths, and it claims to be agnostic to the underlying deep learning framework.


While IceVision may not have the same level of support as some larger projects, it may still be a viable option depending on the specific requirements of your project and the level of customizability you require.


https://github.com/airctic/icevision/blob/master/images/icevision-readme.png


Model Zoo: Model Zoo

License: Apache 2

Backend: PyTorch

Pros: an agnostic computer vision framework;

Cons: the repo looks abandoned, with no recent commits and only a small community;

TorchLightning Flash - Link

https://github.com/Lightning-Universe/lightning-flash


PyTorch Lightning Flash is a high-level deep learning framework built on PyTorch and PyTorch Lightning. It's designed to make it easy to get started with sophisticated deep learning models and techniques. This framework is still quite new but has been quickly gaining popularity due to its simplicity and extensibility.


One of the main advantages of PyTorch Lightning Flash is that it abstracts much of the boilerplate code needed for training and inference, allowing you to initiate these processes in just a few lines of code. This can drastically increase the speed of development and experimentation, especially for beginners or in projects with strict timelines.


The claimed support for IceVision models further enhances the versatility of PyTorch Lightning Flash, giving users access to a broader range of models for object detection tasks.


However, as with any high-level framework, there is a trade-off between ease of use and flexibility.

While PyTorch Lightning Flash simplifies a lot of the process, if you need to do something not supported out of the box, you may need to write additional code, potentially even delving into the lower-level aspects of the framework.


Example of OD training - https://lightning-flash.readthedocs.io/en/latest/reference/object_detection.html
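
Based on that example, a rough sketch of Flash-based object detection training might look like this; the COCO-format paths, the efficientdet head/backbone, and the hyperparameters are assumptions, and the exact API can vary between Flash versions.

import flash
from flash.image import ObjectDetectionData, ObjectDetector

datamodule = ObjectDetectionData.from_coco(
    train_folder="data/coco128/images/train2017/",          # placeholder dataset paths
    train_ann_file="data/coco128/annotations/instances_train2017.json",
    val_split=0.1,
    transform_kwargs={"image_size": 512},
    batch_size=4,
)

model = ObjectDetector(head="efficientdet", backbone="d0",
                       num_classes=datamodule.num_classes, image_size=512)

trainer = flash.Trainer(max_epochs=1)
trainer.finetune(model, datamodule=datamodule, strategy="freeze")  # freeze the backbone, train the head
trainer.save_checkpoint("object_detection_model.pt")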


Model Zoo: Model Zoo

License: Apache 2

Backend: PyTorch

Pros: little boilerplate code; you can do whatever you want; good structure; good documentation and community;

Cons: you have to code :)

Outdated frameworks

  • ImageAI: It's a python library that lets programmers and software developers easily build applications and systems with self-contained Computer Vision capabilities.
  • Gluon: It's a clear, concise API for defining machine learning models using a collection of pre-built, optimized neural network components. It is provided by Apache MXNet.
  • Tensorpack: A training interface based on TensorFlow, focusing on speed and efficiency.
  • SimpleDet: As its name suggests, SimpleDet is a simple and powerful toolbox for object detection research built on top of MXNet.


While these libraries might offer a unique approach or specialized functionality, they suffer from a lack of support and infrequent updates. For production-level projects or serious research, these factors can pose significant challenges.


Stale or infrequently updated projects may lack fixes for critical bugs or improvements that enhance efficiency and usability. Additionally, the limited model variety could restrict your ability to experiment with different architectures or could mean that the models provided are no longer state-of-the-art.


As a result, while these libraries may be of interest for certain niche use cases or for historical context, they may not be the best choices for current projects, especially when there are more widely-supported alternatives available. Always choose tools that have active community support, regular updates, and a wide variety of models to ensure the robustness and longevity of your projects.

Your own implementation

Certainly, you could use your own implementation or reimplementations (be careful about licensing). It largely depends on your specific requirements. The most recent papers may not have open implementations yet, or perhaps you have your own ideas, custom loss functions, and custom layers. However, I would recommend not reinventing the wheel. Try to use frameworks that minimize the number of tasks involved, like PyTorch Lightning.
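
For instance, a minimal sketch of wrapping a custom model and custom loss in a PyTorch Lightning module (the toy backbone, head, and loss below are placeholders for your own components):

import torch
import torch.nn as nn
import pytorch_lightning as pl

class LitDetector(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1),
                                      nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.head = nn.Linear(16, 4)          # e.g. a simple box-regression head

    def training_step(self, batch, batch_idx):
        images, targets = batch
        preds = self.head(self.backbone(images))
        loss = nn.functional.smooth_l1_loss(preds, targets)  # swap in your custom loss here
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)

# trainer = pl.Trainer(max_epochs=10); trainer.fit(LitDetector(), train_dataloader)
# Lightning takes care of the device placement, loops, and logging boilerplate.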

Models Repositories

Hugging face - Link

https://huggingface.co/models


Hugging Face is a large library of models. For detection, they reportedly have 375 different models, though many of these may simply be the same architectures retrained on different datasets. It's worth noting that while Hugging Face provides a large array of pre-trained models, these models can be further fine-tuned or retrained on a specific task or dataset as per the user's requirements.


The Hub is generally not a training framework in itself, since the hosted models can use different frameworks under the hood, but it's worth being aware of. For instance, if you need a specific detector, such as a car-wheel detector or a hard-hat detector for people, there's a good chance it's already available here.
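
Many Hub checkpoints can be used with only a few lines through the transformers pipeline API; a minimal sketch (facebook/detr-resnet-50 is just one example checkpoint, and the image path is a placeholder):

from transformers import pipeline

detector = pipeline("object-detection", model="facebook/detr-resnet-50")
results = detector("example.jpg")            # a local path or URL to an image

for det in results:
    print(det["label"], det["score"], det["box"])   # box is a dict of pixel coordinates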


Model Zoo: Model Zoo

License: depends on the model and repository.

Torchhub - Link

https://pytorch.org/hub/


TorchHub, similar to Hugging Face, is a pre-trained model repository designed by PyTorch. It's a go-to place for researchers and developers who wish to access state-of-the-art machine learning models. These models, ready for inference, can be easily integrated into various projects. This resource significantly simplifies the task of applying sophisticated machine learning models and promotes the reproducibility of research. Furthermore, these models can serve as a solid starting point for further fine-tuning or transfer learning, making the development of custom solutions faster and more efficient.
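
A minimal sketch of loading a model from Torch Hub (the ultralytics/yolov5 entry is used purely because it is well known; keep its license in mind):

import torch

model = torch.hub.load("ultralytics/yolov5", "yolov5s", pretrained=True)
results = model("https://ultralytics.com/images/zidane.jpg")   # path, URL, PIL image, or tensor
results.print()                                                # prints a summary of the detections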


Model Zoo: Model Zoo

License: depends on the model and repository.

About licensing:

Open-source licensing can significantly impact the use of software in commercial applications.


The GNU General Public License (GPL) is a widely used free software license that guarantees end users the freedom to run, study, share, and modify the software. One of the key aspects of GPL (including GPLv3) is the copyleft clause, which requires any modifications to the software, or software that includes GPL-licensed components, to be distributed under the same GPL license.


This could be challenging in a commercial setting where the source code is often proprietary and not meant for public release. On the other hand, the MIT License and the Apache License 2.0 are permissive licenses. They allow users more flexibility and are generally more compatible with proprietary software.


The MIT License is one of the simplest open-source licenses. It permits users to do anything with the code, given that the original copyright and license notice is included. This makes it highly suitable for commercial applications, as it doesn't require the disclosure of the source code.


The Apache License 2.0 also allows the user to freely use, modify, distribute, and sell software under the license. However, it's more complex than the MIT License because it includes additional stipulations, such as a patent grant clause, which makes it safer for commercial use.

Choosing a license for a software project depends on the goals of the project.

Conclusions

Frameworks for object detection and computer vision tasks are indeed numerous, and new ones continue to emerge. It can be challenging to keep up with all the available tools, hence it's beneficial to focus on the most active, robust, and well-supported ones.


PyTorch has become the go-to for many researchers and developers in the machine learning field due to its flexibility and ease of use. This has led to many of the latest works being published in the form of PyTorch-based repositories.


COCO (Common Objects in Context) is a widely used dataset format for object detection, segmentation, and captioning tasks. By ensuring your data generator can output in COCO format, you maintain maximum compatibility with various models and frameworks.
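
For reference, the skeleton of a COCO-format detection annotation file looks like this (shown as a Python dict; the values are illustrative only):

coco_annotations = {
    "images": [
        {"id": 1, "file_name": "000001.jpg", "width": 640, "height": 480},
    ],
    "annotations": [
        {
            "id": 1,
            "image_id": 1,
            "category_id": 1,
            "bbox": [100, 50, 80, 120],   # [x, y, width, height] in pixels
            "area": 9600,
            "iscrowd": 0,
        },
    ],
    "categories": [
        {"id": 1, "name": "car"},
    ],
}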


MMDetection is indeed a promising choice due to its extensive model zoo, active community, and robust features. Despite its long dependency list, with careful management it's possible to avoid pulling in unnecessary packages. This makes MMDetection a worthy option to consider.


Super-Gradients and their YOLO-NAS (Neural Architecture Search) model present an exciting opportunity to experiment with state-of-the-art techniques in object detection. Trying out this repository could provide valuable insights and possibly improved model performance.


It's always essential to balance the exploration of new tools and techniques with the stability and support of more established options. Also, remember to match the choice of tool with the specific requirements of your project.