Computer vision (CV) is a sub-domain of Artificial Intelligence aimed at helping machines identify and understand content in digital images or video. Simply put, CV enables machines to “see” the world like we humans do and use that knowledge to augment human efforts. This technology uses cameras and computers instead of the human eye to identify, track, and measure targets for further image processing. The field of computer vision has progressed tremendously in the last decade thanks to:
exponential growth in the ability to train and run complex machine learning or deep learning models, through higher computation efficiency in the form of GPUs or purpose-built hardware accelerators,
decreasing hardware costs,
rapid advancement of new technologies (better algorithms, automation, and easy-to-use tooling), and
advancement of camera technology that allows higher quality images to be captured and used for CV tasks.
CV tasks can be performed using either traditional image processing techniques or modern deep learning (DL) based techniques. Traditional image processing techniques use statistical methods and do not involve learning from patterns (i.e., they do not need training data). OpenCV is a popular image processing library and is widely used today to pre-process images. However, the performance of such techniques is restricted by the scene and lighting conditions, and they don't scale well to new use cases. Deep learning based techniques offer much higher performance across a wide variety of scenes and lighting conditions but rely on large training datasets as well as powerful hardware to run the computations. DL-based techniques constitute the state of the art for CV, and we will focus on them for the rest of the article.
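To make the contrast concrete, here is a minimal sketch of a traditional, non-learning technique in OpenCV: Canny edge detection, whose thresholds are hand-tuned rather than learned (the file name and threshold values below are illustrative):

```python
import cv2

# Load an image in grayscale; "street.jpg" is a placeholder path.
image = cv2.imread("street.jpg", cv2.IMREAD_GRAYSCALE)

# Canny edge detection: the two thresholds are hand-picked, not learned,
# which is why results degrade when scene or lighting changes.
edges = cv2.Canny(image, threshold1=100, threshold2=200)

cv2.imwrite("street_edges.jpg", edges)
```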
On a very high level, computer vision can be described as a three-step workflow: acquiring an image or video, pre-processing it, and analyzing it with a model to understand its contents.
In theory, any image from a camera can be used for CV tasks. However, even with the most lightweight algorithms, CV tasks are computationally intensive and require massive datasets to train the models on (it can cost hundreds of thousands of dollars to train a state-of-the-art DL model). So in many cases where real-time results are needed, the image or video's resolution (and thereby its size) is reduced, along with other pre-processing steps such as contrast enhancement and noise reduction, to speed up computation.
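As a rough illustration, the pre-processing steps above might look like this in OpenCV (the target resolution and parameter values are illustrative, not prescriptive):

```python
import cv2

frame = cv2.imread("frame.jpg")  # placeholder input image

# 1. Downscale to reduce the amount of data the model must process.
small = cv2.resize(frame, (640, 480), interpolation=cv2.INTER_AREA)

# 2. Contrast enhancement with CLAHE on the luminance channel.
lab = cv2.cvtColor(small, cv2.COLOR_BGR2LAB)
l, a, b = cv2.split(lab)
clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
lab = cv2.merge((clahe.apply(l), a, b))
enhanced = cv2.cvtColor(lab, cv2.COLOR_LAB2BGR)

# 3. Noise reduction with a mild Gaussian blur.
preprocessed = cv2.GaussianBlur(enhanced, (3, 3), 0)

cv2.imwrite("frame_preprocessed.jpg", preprocessed)
```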
Modern CV technologies rely heavily on deep learning models, specifically deep Convolutional Neural Network (CNN) models, that have been trained on hundreds of millions of images and offer strong precision and recall (more on this later in the article). A CNN model helps understand the contents of an image by breaking it down into objects with labels that describe what each object is. The labels come from the dataset used to train the model. Today, CV algorithms surpass human performance on many tasks. Some state-of-the-art deep learning models for CV are discussed later in the article.
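For a sense of what this looks like in practice, here is a minimal sketch that runs a pre-trained CNN classifier from torchvision (0.13+ style weights argument) on a single image; the image path is a placeholder, and the normalization constants are the standard ImageNet values:

```python
import torch
from PIL import Image
from torchvision import models, transforms

# Standard ImageNet pre-processing for a pre-trained classifier.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

model = models.resnet18(weights="IMAGENET1K_V1")  # CNN pre-trained on ImageNet
model.eval()

image = Image.open("cat.jpg")  # placeholder input
batch = preprocess(image).unsqueeze(0)  # add batch dimension

with torch.no_grad():
    probs = torch.softmax(model(batch)[0], dim=0)

# The predicted class index maps to a human-readable label from the
# training dataset's label set (ImageNet classes here).
print(probs.argmax().item(), probs.max().item())
```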
There are dozens of tasks that can be performed with computer vision. Some major tasks are described below, the most important being object detection.
Object detection: Object detection includes three sub-tasks:
a) detecting an arbitrary number of objects in images or videos,
b) locating them within the image with a bounding box, and
c) classifying which classes they belong to (humans, animals, vehicles, etc.) and labeling them as such.
Object detection tasks help answer the question of what types of objects are present and where they are located.
Object detection is a superset of tasks like person detection, vehicle detection, animal detection, plant detection, etc. An object detection model can detect many different object types, but detection performance may vary across object types depending on the data used to train the model.
Object detection is the primitive used for many other analytics tasks such as person counting, loitering detection, measuring queue length, average dwell times, etc.
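As an illustration, the following sketch runs a pre-trained object detector from torchvision and prints boxes, labels, and scores; the image path and the 0.5 confidence cutoff are illustrative:

```python
import torch
from PIL import Image
from torchvision import transforms
from torchvision.models import detection

# Faster R-CNN pre-trained on COCO; one of many possible detectors.
model = detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

image = Image.open("street.jpg")  # placeholder input
tensor = transforms.ToTensor()(image)

with torch.no_grad():
    output = model([tensor])[0]  # boxes, labels, scores for one image

for box, label, score in zip(output["boxes"], output["labels"], output["scores"]):
    if score >= 0.5:  # keep reasonably confident detections only
        print(f"class={label.item()} score={score:.2f} box={box.tolist()}")
```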
Face detection: a type of object detection focused on detecting human faces within an image/video.
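A classic way to try this is OpenCV's bundled Haar cascade face detector; a minimal sketch (the image path is a placeholder):

```python
import cv2

# Haar cascade for frontal faces, shipped with OpenCV.
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

image = cv2.imread("group_photo.jpg")  # placeholder input
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# Returns one (x, y, w, h) rectangle per detected face.
faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
for (x, y, w, h) in faces:
    cv2.rectangle(image, (x, y), (x + w, y + h), (0, 255, 0), 2)

cv2.imwrite("faces.jpg", image)
```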
Facial recognition: people present in the image/video are identified by matching their faces against a database of known faces and names.
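One popular way to prototype this is the open-source face_recognition library, which encodes each face as a vector and compares vectors against a known set; a minimal sketch assuming the library is installed, with placeholder image paths:

```python
import face_recognition

# "Database": one known face encoding per person.
known_image = face_recognition.load_image_file("alice.jpg")  # placeholder
known_encoding = face_recognition.face_encodings(known_image)[0]

# Encode each face found in a new image and match against the database.
unknown_image = face_recognition.load_image_file("visitor.jpg")  # placeholder
for encoding in face_recognition.face_encodings(unknown_image):
    match = face_recognition.compare_faces([known_encoding], encoding)[0]
    print("Alice" if match else "Unknown person")
```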
Pose estimation: estimates the position and orientation of humans in the image/video. Pose estimation involves detecting different “parts” of the human body (such as the right elbow, left knee, right shoulder, etc.) and their pose relative to other parts in order to estimate the overall pose of the human body. From this, one can determine the actions or activities (such as sitting, walking, running, hitting, etc.) being performed by the humans in the image/video, and gain a more in-depth understanding of human body language to estimate complex higher-order attributes such as behavior, stress, demeanor, and mood. Pose estimation is used in the world of online and remote fitness classes to determine whether exercises and activities are being performed in the optimal way to prevent injuries. Pose estimation also has other interesting applications such as training robots, motion tracking in augmented and virtual reality, fall detection, etc.
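As a concrete illustration of how pose output gets used, a fitness application might compute the angle at a joint from three detected keypoints; a minimal sketch with hypothetical 2D coordinates standing in for a pose model's output:

```python
import numpy as np

def joint_angle(a, b, c):
    """Angle (degrees) at keypoint b formed by keypoints a-b-c."""
    a, b, c = np.asarray(a), np.asarray(b), np.asarray(c)
    ba, bc = a - b, c - b
    cos = np.dot(ba, bc) / (np.linalg.norm(ba) * np.linalg.norm(bc))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

# Hypothetical (x, y) keypoints for hip, knee, and ankle.
hip, knee, ankle = (310, 220), (320, 330), (305, 440)
angle = joint_angle(hip, knee, ankle)
print(f"knee angle: {angle:.1f} degrees")  # e.g. flag a squat that is too shallow
```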
Optical character recognition (OCR): identification of characters (letters, numbers, and symbols) in an image. Automatic License Plate Recognition is an application of optical character recognition techniques.
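For example, a quick OCR pass can be done with the pytesseract wrapper around the open-source Tesseract engine (assuming Tesseract is installed; the image path is a placeholder):

```python
import cv2
import pytesseract

# Light pre-processing tends to help OCR: grayscale + binarization.
image = cv2.imread("license_plate.jpg", cv2.IMREAD_GRAYSCALE)  # placeholder
_, binary = cv2.threshold(image, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

# Extract the recognized characters as a string.
text = pytesseract.image_to_string(binary)
print(text.strip())
```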
There are many different CV model architectures, with new versions coming out every few months. Each model has its own pros and cons and suits certain use cases, with different tradeoffs between speed, detection accuracy, and efficiency.
YOLOv7: YOLO, which stands for You Only Look Once, is one of the most popular detection algorithms; it can detect objects in real time at the expense of slightly reduced detection accuracy. It is widely used across the industry, from large corporations to startups. The first generation of YOLO (YOLOv1) was launched in 2016 and was the fastest object detector of its time. Since then, there have been multiple iterations, with every new release improving the speed, accuracy, and efficiency of object detection. YOLOv7 was released in July 2022 and is the fastest and most accurate object detection model as of writing this article. YOLO also has many optimized versions for special use cases, such as YOLOv7-tiny, a lightweight version of the main YOLOv7 that is optimized for edge AI and efficient yet performant enough to run on edge GPU devices, and YOLOv7-pose, which delivers state-of-the-art pose estimation performance.
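YOLOv7 ships with its own training and inference scripts in its repository; as an illustration of the typical inference flow in the YOLO family, here is a sketch using YOLOv5's well-known torch.hub entry point (v5 rather than v7, because its hub interface is stable and widely documented; the image path is a placeholder):

```python
import torch

# Small YOLOv5 model fetched via torch.hub (pre-trained on COCO).
model = torch.hub.load("ultralytics/yolov5", "yolov5s")

# Single-pass detection: the whole image is processed in one forward pass,
# which is what makes the YOLO family fast enough for real-time use.
results = model("street.jpg")  # placeholder input

results.print()         # summary of detections
print(results.xyxy[0])  # per-detection [x1, y1, x2, y2, confidence, class]
```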
Cascade-Mask R-CNN: stands for cascade-mask region-based convolutional neural network and is the state of the art for image segmentation and instance segmentation. Image segmentation is the task of clustering parts of an image that belong to the same object class by annotating it pixel by pixel. Image segmentation goes deeper than object detection and gives a richer partitioning of the elements in the image. It has broad applications such as detecting tumors, recognizing road signs from different angles, autonomous driving assistance systems, precision agriculture, and sensing through drones and UAVs. Cascade-Mask R-CNN is optimized for detection accuracy at the expense of detection speed. However, YOLOv7 achieves slightly better accuracy (2% higher than Cascade-Mask R-CNN) at a significantly increased detection speed (509% faster).
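For a flavor of instance segmentation in code, here is a sketch using torchvision's pre-trained Mask R-CNN (plain Mask R-CNN rather than the cascade variant, since that is what torchvision bundles; the image path and 0.5 cutoff are illustrative):

```python
import torch
from PIL import Image
from torchvision import transforms
from torchvision.models.detection import maskrcnn_resnet50_fpn

model = maskrcnn_resnet50_fpn(weights="DEFAULT")  # pre-trained on COCO
model.eval()

image = Image.open("street.jpg")  # placeholder input
tensor = transforms.ToTensor()(image)

with torch.no_grad():
    output = model([tensor])[0]

# In addition to boxes and labels, each detection carries a per-pixel mask.
for mask, score in zip(output["masks"], output["scores"]):
    if score >= 0.5:
        binary_mask = mask[0] > 0.5  # soft mask -> per-pixel boolean mask
        print(f"score={score:.2f}, mask pixels={binary_mask.sum().item()}")
```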
MobileNet-SSD V2: a lightweight architecture designed for mobile CV applications that are constrained by available power, compute, and hardware. MobileNet was a popular model, but recent advances in the YOLO architecture, especially YOLOv7, have significantly improved detection speed even in processing-power-constrained environments while still providing good accuracy.
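Models like MobileNet-SSD are typically deployed through a lightweight runtime such as TensorFlow Lite; a minimal sketch, where the .tflite model path is a placeholder and the exact output tensor layout depends on the specific model:

```python
import numpy as np
import tensorflow as tf

# Load a (hypothetical) MobileNet-SSD model for on-device inference.
interpreter = tf.lite.Interpreter(model_path="mobilenet_ssd_v2.tflite")
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Dummy input with the shape the model expects (often 300x300x3 for SSD).
shape = input_details[0]["shape"]
dummy = np.zeros(shape, dtype=input_details[0]["dtype"])

interpreter.set_tensor(input_details[0]["index"], dummy)
interpreter.invoke()

# SSD models typically emit boxes, classes, and scores as separate outputs.
for detail in output_details:
    print(detail["name"], interpreter.get_tensor(detail["index"]).shape)
```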
OpenPose: detects human body, hand, facial, and foot key points (135 key points in total) in multi-person images in real time and is a must-have for pose estimation. Knowing pose and orientation has many applications, especially when tracking the movement of the human body is involved. For example, in fitness, pose estimation is used to identify whether an athlete is following the right movements and whether their form will prevent injuries. OpenPose detects multiple people with decent accuracy and in real time. OpenPose also enables single-person tracking, which can be used to speed up detection and visually smooth images and video by predicting where the movement will be.
Edge computing is when computing happens near the source of data rather than in data centers optimized for large-scale compute workloads. For example, when you are typing on your mobile phone and it tries to predict the next word, that prediction happens on the phone itself, close to the data (your previous chats or conversations); this is a type of edge computing.
Other examples of edge computing devices are laptops, TVs, special-purpose edge computing hardware, the Raspberry Pi, etc. Edge computing has tremendous value for computer vision due to bandwidth reduction, increased speed, and privacy as well as other security benefits. Edge computing is also more efficient than cloud computing, since data is not passed back and forth between the edge device and the cloud. Because computation happens so close to the user, edge computing allows for real-time, ultra-low-latency performance.
Imagine a self-driving car: if it had to evaluate the scene by sending each frame to the cloud and waiting for analysis, it would never be able to respond in milliseconds to avoid accidents. Edge computation is also resilient to network downtime and outages (imagine the car going through an area where the cellular network dropped, and not being able to process the scene and stop in time for a pedestrian crossing the street).
As a corollary, since data doesn't go to the cloud for processing, edge computation minimizes internet bandwidth usage and reduces the cost of transmitting data back and forth to cloud servers. Also, because computation is more efficient, edge computing consumes less energy. Lastly, edge computing enables techniques such as federated learning, which deliver best-of-breed machine learning while protecting privacy: data never leaves the end devices and is therefore less prone to being intercepted or leaked from central servers. This allows high-accuracy computation on customers' personal and confidential data, giving highly personalized experiences and best-in-class UX without the risk of data being intercepted, breached, or hijacked.
The disadvantages of edge computing have been the complexity of working with various edge devices and compatibility across device classes. The distributed nature makes edge computing hard to standardize and requires heavy lifting to support many different architectures and systems. However, in this world of diverse device classes, a few edge devices stand out in terms of accelerating CV inference. Note that model training is still ideally done in the cloud: even with the most powerful GPUs, training a modern deep neural network can take anywhere from hours to a few days.
NVIDIA Jetson is a family of edge computers optimized for running AI / ML / CV applications on the edge. It is ideal for students, tech enthusiasts, and DIYers as well as professionals who want to quickly build and deploy edge AI applications. It comes with its own SDK and a complete ecosystem, with comprehensive tutorials, training modules, and getting-started guides.
Due to its strong ecosystem support as well as high performance and power efficiency, it is one of the most popular platforms for edge AI applications today. Additionally, the platform has been designed for flexibility, so modularity is built into each component. The SDK comes with all the required libraries, debuggers, and APIs, and supports popular machine learning frameworks like Keras, Caffe, and TensorFlow, which makes development on NVIDIA Jetson fast and easy.
There are different modules available in the market, starting with the Jetson Nano, which is optimized for building and deploying small AI systems and is great for beginners and students. Don't mistake the "nano" in its name to mean underpowered: it can run parallel neural network workloads, process data from high-resolution sensors, and handle hefty workloads, and it is ideal for building proofs of concept. The TX2 raises the performance bar and is designed for beefy workloads with high efficiency and speed; the module carries an NVIDIA Pascal GPU and is 2.5x more powerful than the Nano, allowing deep neural networks to run with higher accuracy. Jetson Xavier is said to be the smallest supercomputer in the world for edge computing workloads and is designed to run parallel neural networks and process data from multiple high-resolution sensors simultaneously. The Jetson AGX Xavier is developed with fully autonomous applications in mind, has 20x the performance of the TX2, and is the highest-performing module in the Jetson series; it is ideal for autonomous mobility applications, warehouse robotics, and large surveillance drone use cases.
Google Coral is a Tensor Processing Unit (TPU)-powered edge computing platform that is ideal for fast neural network inferencing; it is essentially a purpose-built inference engine for TensorFlow Lite workloads. Google Coral has two main components: the Dev Board, an edge compute device slightly larger than a credit card, and a USB accessory that acts as an AI accelerator for an existing host machine. Google Coral is a high-performance, extremely low-power edge computer with a good cost-per-FPS processing ratio.
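As a sketch of what inference on the Edge TPU looks like with Google's pycoral library (the model and image paths are placeholders, and the model must be compiled for the Edge TPU):

```python
from PIL import Image
from pycoral.adapters import classify, common
from pycoral.utils.edgetpu import make_interpreter

# Load a (hypothetical) Edge TPU-compiled classification model.
interpreter = make_interpreter("model_edgetpu.tflite")
interpreter.allocate_tensors()

# Resize the input image to the size the model expects and run inference.
image = Image.open("frame.jpg").resize(common.input_size(interpreter))
common.set_input(interpreter, image)
interpreter.invoke()

# Top-1 result: a (class_id, score) pair.
for c in classify.get_classes(interpreter, top_k=1):
    print(c.id, c.score)
```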
Intel Neural Compute Stick 2 is Intel's answer to edge computing for AI workloads and is powered by the Intel Movidius Myriad X VPU (vision processing unit), a chip optimized for CV applications. Highlights of this device are its high performance and ultra-low power consumption, and it can be used for a wide variety of applications including VR/AR headsets and security solutions. It comes in a tiny USB thumb-drive form factor and acts as an accelerator for speeding up neural network inference on the edge in a cost-effective way.
Essentially, these accelerators can be thought of as purpose-built co-processors that take over from the CPU some of the special computation required for CV applications. As with NVIDIA Jetson, the Intel Neural Compute Stick 2 comes with its own SDK (called the Myriad Development Kit) that contains the necessary libraries, tools, and APIs to quickly implement solutions. Because of the USB interface, it can be plugged into any device with a USB port, making it easy to get started. There are extensive tutorials and guides and a strong community with tons of getting-started examples to help anybody ramp up quickly. The Neural Compute Stick can take in multiple high-resolution video streams at the same time and at high frame rates.
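In practice, the stick is typically driven through Intel's OpenVINO toolkit; a minimal sketch of targeting the Myriad VPU with OpenVINO's Python API (the model path is a placeholder, and the API shown is the OpenVINO 2022-style runtime):

```python
import numpy as np
from openvino.runtime import Core

core = Core()
model = core.read_model("model.xml")  # placeholder OpenVINO IR model

# "MYRIAD" targets the Movidius VPU inside the Neural Compute Stick 2.
compiled = core.compile_model(model, "MYRIAD")

# Dummy input matching the model's expected shape, just to show the flow.
input_port = compiled.input(0)
dummy = np.zeros(list(input_port.shape), dtype=np.float32)

result = compiled([dummy])  # run inference on the stick
print(result[compiled.output(0)].shape)
```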