Part of the broader artificial intelligence and computer vision realms, human pose estimation (HPE) technology has been gradually making its presence felt in all kinds of software apps and hardware solutions. Still, it seemed to be stuck at the edge, failing to cross into mainstream adoption.
But the COVID-19 outbreak pushed entire industries over the technology tipping point. McKinsey reports that the growth of the share of digital products in companies’ portfolios has accelerated by seven years. That affected the computer vision market, too: it was valued at $11.32 billion in 2020 and is projected to grow at a CAGR of 7.3% through 2028.
Such rapid growth in demand signals that it is time for human pose estimation technology to step into widespread commercial use. In this article, we will look under the hood of human pose estimation technology, explore its use cases, and share tips from the ITRex CTO Kirill Stashevsky on how to implement HPE so that it drives real value.
It is easy for a human to tell whether a person is squatting down or waving a hand. For a machine? Not so simple. When shown a photo or a video frame, a computer “sees” a collection of pixels.
Human pose estimation technology enables a computer to model human bodies and recognize body postures in images and videos, including real-time footage.
Traditionally, human pose estimation relied on classical machine learning algorithms, specifically random forests. The essence of the ML-based approach boiled down to representing a human body as a collection of parts arranged in a deformable structure, such as the one below.
But the classical, ML-based method struggled to identify hidden body parts and failed at multi-person pose estimation. That sparked the transition to deep learning, which is what human pose estimation relies on today.
There are many ways to interpret a human posture with deep learning (hence, the many approaches and libraries, which we explore in the subsequent section). Here’s how a typical deep learning model for human pose estimation runs.
A standard deep learning model for human pose estimation has a convolutional neural network (CNN) at its base. And the CNN’s architecture comprises two key components — an encoder (in some approaches referred to as an estimator) and a decoder (or a detector).
An encoder extracts specific features, called keypoints, from an input image. The number of keypoints depends on the approach and usually ranges from 17 to 33. Examples of keypoints include a left elbow, a right knee, or a neck base. The encoder runs the input through a set of narrowing convolution blocks and extracts the features of all people in the image.
A decoder then creates a keypoint probability heatmap. It helps refine the encoder’s predictions and estimates how likely the extracted features are to be located in the marked areas of the image.
So, the CNN maps the pre-estimated keypoints to the heatmap and outputs those with the highest probability rate. It then groups the associated keypoints into a skeleton-like body structure.
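To make the heatmap step more tangible, here is a minimal sketch of how a single keypoint could be read off a probability heatmap: pick the most probable pixel and discard it if the probability is too low. The toy heatmaps, the keypoint names, and the 0.3 threshold are assumptions made for illustration, not values taken from any particular model.

```python
from typing import Dict, List, Optional, Tuple

Heatmap = List[List[float]]  # rows of per-pixel probabilities for one keypoint


def decode_keypoint(heatmap: Heatmap, threshold: float = 0.3) -> Optional[Tuple[int, int, float]]:
    """Return (x, y, score) of the most probable pixel, or None if it is too uncertain."""
    best: Optional[Tuple[int, int, float]] = None
    for y, row in enumerate(heatmap):
        for x, score in enumerate(row):
            if best is None or score > best[2]:
                best = (x, y, score)
    if best is None or best[2] < threshold:
        return None  # the keypoint is probably hidden or outside the frame
    return best


# Toy 3x4 heatmaps for two hypothetical keypoints (assumed values).
heatmaps: Dict[str, Heatmap] = {
    "left_elbow": [
        [0.0, 0.1, 0.0, 0.0],
        [0.1, 0.9, 0.2, 0.0],
        [0.0, 0.2, 0.1, 0.0],
    ],
    "right_knee": [
        [0.0, 0.0, 0.0, 0.1],
        [0.0, 0.1, 0.0, 0.0],
        [0.0, 0.0, 0.2, 0.0],
    ],
}

pose = {name: decode_keypoint(hm) for name, hm in heatmaps.items()}
print(pose)
```

In this toy input the left elbow is found at pixel (1, 1), while the right knee is dropped because no pixel clears the threshold. Real models typically work on lower-resolution heatmaps and scale the coordinates back to the image size.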
For multi-person estimation, when there are several high-probability areas for each keypoint, say, two left elbows, an additional post-processing layer may be added to ensure each keypoint belongs to the right person.
The method described above follows a bottom-up pipeline. It means that a CNN locates every instance of a particular keypoint first and then groups the associated ones into a body structure.
A model may follow a top-down approach, too. In this case, a CNN first locates all the bodies in an image, putting them into bounding boxes, and then detects the keypoints within each bounding box.
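One practical detail of the top-down pipeline is that keypoints are predicted relative to each bounding box and have to be mapped back to the full image. The sketch below shows that remapping under two assumptions: the box is given as (x_min, y_min, width, height) in pixels, and the per-box keypoints are normalized to the 0..1 range. The function name and the sample values are hypothetical.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, width, height) in image pixels
Point = Tuple[float, float]


def crop_to_image(box: Box, crop_keypoints: List[Point]) -> List[Point]:
    """Map keypoints in normalized crop coordinates (0..1) back to image pixels."""
    x_min, y_min, width, height = box
    return [(x_min + nx * width, y_min + ny * height) for nx, ny in crop_keypoints]


# Hypothetical outputs of a person detector and a per-crop keypoint model.
person_box: Box = (100.0, 50.0, 200.0, 400.0)
keypoints_in_crop: List[Point] = [(0.5, 0.25), (0.0, 1.0)]

image_keypoints = crop_to_image(person_box, keypoints_in_crop)
print(image_keypoints)
```

Because every keypoint inherits the position of its box, a misplaced box shifts the whole skeleton, which is why top-down accuracy depends so much on the detector.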
There are many modifications of the aforementioned method. Let’s look at the most popular ones and explore common open-source and proprietary libraries that could ease the implementation of human pose estimation.
OpenPose, one of the best-known models, relies on a convolutional neural network with two branches. One creates confidence maps for each keypoint found in an image. The other predicts the so-called part affinity fields, which estimate the degree of association between the keypoints. Combining the output of the two branches, OpenPose builds a model of the human skeleton. The subsequent layers of the network are used to refine the prediction.
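To illustrate how affinity fields help decide which keypoints belong together, here is a simplified sketch. It scores a candidate limb by sampling the field along the segment between two keypoints and averaging how well the field vectors align with the segment direction. The grid layout, the sample count, and the toy field are assumptions; this is an illustration of the idea, not OpenPose's actual implementation.

```python
import math
from typing import List, Tuple

Field = List[List[float]]  # one vector component per pixel, indexed as [y][x]
Point = Tuple[float, float]


def association_score(paf_x: Field, paf_y: Field, a: Point, b: Point, samples: int = 10) -> float:
    """Average alignment between the affinity field and the segment a -> b."""
    dx, dy = b[0] - a[0], b[1] - a[1]
    length = math.hypot(dx, dy)
    if length == 0 or samples < 2:
        return 0.0
    ux, uy = dx / length, dy / length  # unit vector along the candidate limb
    total = 0.0
    for i in range(samples):
        t = i / (samples - 1)
        x = int(round(a[0] + t * dx))
        y = int(round(a[1] + t * dy))
        total += paf_x[y][x] * ux + paf_y[y][x] * uy  # dot product at the sample
    return total / samples


# Toy 5x5 field (assumed): vectors point right along row 2 and are zero elsewhere.
paf_x = [[1.0 if y == 2 else 0.0 for _ in range(5)] for y in range(5)]
paf_y = [[0.0] * 5 for _ in range(5)]

elbow = (0.0, 2.0)
wrist_same_person = (4.0, 2.0)
wrist_other_person = (4.0, 0.0)

score_same = association_score(paf_x, paf_y, elbow, wrist_same_person)
score_other = association_score(paf_x, paf_y, elbow, wrist_other_person)
print(score_same, score_other)
```

The wrist lying along the field gets the higher score, so it would be paired with the elbow. A full system would compute such scores for all candidate pairs and then solve a matching problem per limb type.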
In contrast to the previous model, wherein the CNN detects individual keypoints and uses them to “assemble” the skeleton, DeepCut handles both tasks simultaneously. It infers all the body parts in an image, classifies them, and assigns those lying close to each other to a specific person.
RMPE helps improve the accuracy of top-down human pose estimation, where a posture is inferred within a bounding box. In that setup, the overall performance of the model depends heavily on how accurately the body is localized.
To reduce slips in bounding box localization, the authors of RMPE introduce an additional Symmetric Spatial Transformer Network (SSTN) that cuts out a highly accurate single-person area from a potentially misplaced bounding box. The keypoints are then detected within this area. Finally, a Spatial De-transformer Network (SDTN) is used to remap the estimated pose to the coordinates of the original image.
With Mask R-CNN, the localization of a human body and keypoint detection run independently. A standard convolutional neural network extracts the keypoints, while a Region Proposal Network (RPN) locates the body. The extracted features are then passed into the parallel branches of the network that refine the predictions and generate body segmentation masks. Based on these masks, the network builds skeleton-like models for each person in the image.
BlazePose is a pose detection library developed and supported by Google. The library makes it possible to build robust human pose estimation engines that run in real time. In contrast to similar approaches that typically rely on COCO topology and detect 17 keypoints, BlazePose’s skeleton structure features 33 keypoints, including hands and feet, which is particularly beneficial for fitness and sports applications.
The keypoints are characterized by x and y coordinates, visibility, and vertical alignment. Unlike similar libraries that only rely on heatmaps for keypoint refinement, BlazePose uses a regression approach that combines heatmaps and offset prediction.
MoveNet is an open-source library for 2D pose estimation that detects 17 keypoints. The library comes in two variants: Lightning and Thunder. The former is intended for latency-critical applications, while the latter is appropriate for apps that require high accuracy. Both Lightning and Thunder run faster than real time on desktop, laptop, and mobile devices.
The MoveNet architecture comprises a feature extractor and a set of four prediction heads — a person center heatmap, a keypoint regression field, a person keypoint heatmap, and a 2D keypoint offset field.
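The interplay between a keypoint heatmap and an offset field can be sketched as follows: the heatmap gives a coarse grid cell, and the offset field nudges the estimate to a more precise position within that cell. This is a generic, simplified illustration rather than MoveNet's actual decoding logic; the grid layout, the assumption that offsets are expressed in cell units, and the stride of 8 are all made up for the example.

```python
from typing import List, Tuple

Grid = List[List[float]]  # values indexed as [row][column]


def refine_keypoint(heatmap: Grid, offset_x: Grid, offset_y: Grid, stride: int) -> Tuple[float, float]:
    """Pick the most probable heatmap cell, then shift it by the predicted sub-cell offset."""
    rows, cols = len(heatmap), len(heatmap[0])
    cells = ((y, x) for y in range(rows) for x in range(cols))
    cy, cx = max(cells, key=lambda cell: heatmap[cell[0]][cell[1]])
    # Offsets are assumed to be in cell units; stride converts cells to image pixels.
    x = (cx + offset_x[cy][cx]) * stride
    y = (cy + offset_y[cy][cx]) * stride
    return x, y


# Toy 2x2 outputs for one keypoint (assumed values).
heatmap = [[0.1, 0.8], [0.2, 0.1]]
offset_x = [[0.0, 0.25], [0.0, 0.0]]
offset_y = [[0.0, 0.5], [0.0, 0.0]]

keypoint = refine_keypoint(heatmap, offset_x, offset_y, stride=8)
print(keypoint)
```

Without the offsets, the keypoint could only land on multiples of the stride. The offset field recovers the precision lost when the network downsamples the image.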
CenterNet is a point-based object detection framework that can be extended to human pose estimation. CenterNet models each person as a single point corresponding to the center of the body’s bounding box. The model then determines other characteristics, such as size, 3D location, orientation, and pose.
Recognizing human posture and movements has long been in focus for major industries, including sports, retail, and entertainment. Here’s a run-through of sectors where human posture estimation is in use.
The pandemic has pushed more people to exercise at home, making the demand for fitness apps with human pose estimation grow rapidly. HPE-powered fitness applications provide detailed live feedback on whether a user performs an exercise correctly. For that, a human pose estimation component compares a model extracted from camera footage with a benchmark, which makes for a safer home workout routine. Fitness apps featuring human pose estimation cater to various activities, from yoga to dancing to weight lifting.
Human pose estimation technology can help athletes improve their performance and assist judges in rating athletes without bias. HPE-powered applications are used for various tasks, from assessing the quality of figure skating elements to helping soccer players strike perfect kicks to allowing high jumpers to polish their techniques.
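One common way to compare an estimated pose with a benchmark is to look at joint angles rather than raw pixel positions, since angles do not depend on how far the user stands from the camera. The sketch below computes the angle at a knee from three keypoints and checks it against a target. The target of 90 degrees, the 15-degree tolerance, and the sample coordinates are assumptions for illustration, not values from any real fitness app.

```python
import math
from typing import Tuple

Point = Tuple[float, float]


def joint_angle(a: Point, b: Point, c: Point) -> float:
    """Angle in degrees at joint b, formed by the segments b -> a and b -> c."""
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    norm = math.hypot(*v1) * math.hypot(*v2)
    if norm == 0:
        raise ValueError("keypoints must not coincide with the joint")
    cosine = (v1[0] * v2[0] + v1[1] * v2[1]) / norm
    cosine = max(-1.0, min(1.0, cosine))  # guard against rounding errors
    return math.degrees(math.acos(cosine))


def matches_benchmark(angle: float, target: float, tolerance: float = 15.0) -> bool:
    """True if the measured angle is close enough to the benchmark angle."""
    return abs(angle - target) <= tolerance


# Hypothetical hip, knee, and ankle keypoints in image coordinates.
squat_angle = joint_angle((0.0, 0.0), (0.0, 1.0), (1.0, 1.0))     # bent knee
standing_angle = joint_angle((0.0, 0.0), (0.0, 1.0), (0.0, 2.0))  # straight leg

print(squat_angle, matches_benchmark(squat_angle, target=90.0))
print(standing_angle, matches_benchmark(standing_angle, target=90.0))
```

An app built along these lines would track several joints per exercise and turn the deviations into feedback such as "bend your knees more".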
Gaming and filmmaking
Character animation has long been an exhausting and complex process. Today, it is facilitated with the help of human pose estimation. Graphics, textures, and other enhancements can now be easily applied to a tracked body, so the graphics render naturally even if the body actively moves.
In interactive video gaming, human pose estimation is used to capture players’ motions and render them into the actions of virtual characters as well.
Whether trying to curb the effect of the pandemic or realize their vision of a supermarket of the future, retailers have started turning to AR and real-time virtual effects. Human pose estimation backs up those aspirations, enabling such experiences as virtual try-on and real-time marketing. An HPE-powered app, whether running on a customer’s mobile phone or integrated into a fitting room’s mirror, scans a person’s body and superimposes 3D virtual elements on the estimated posture. And that works for trying on everything from clothes to shoes to jewelry.
Traditionally, industrial robots were trained with the help of 2D vision systems that called for time- and effort-intensive calibration. Today, human pose estimation provides for faster, more responsive, and more accurate robot training. Instead of programming robots to follow set trajectories, one may teach a robot to recognize the pose and the motions of a human. Having estimated the posture of the demonstrator, a robot then works out how it should move its joints to perform the same motion.
Security and surveillance
Human pose estimation may be applied to analyze the footage from security cameras to prevent potentially alarming situations. By identifying a human posture and estimating its anomaly score, HPE-powered security software may predict suspicious actions or identify people who have fallen down or, say, may be feeling unwell.
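As a rough illustration of how a posture can be turned into a signal for surveillance software, the sketch below measures how far a person's torso is tilted from the vertical and flags strongly tilted poses as a possible fall. This is a deliberately simple heuristic, not a real anomaly-scoring method; the COCO-style keypoint names, the 60-degree threshold, and the sample coordinates are assumptions.

```python
import math
from typing import Dict, Tuple

Point = Tuple[float, float]


def torso_tilt_degrees(pose: Dict[str, Point]) -> float:
    """Angle between the hip-to-shoulder line and the vertical image axis."""
    shoulder_x = (pose["left_shoulder"][0] + pose["right_shoulder"][0]) / 2
    shoulder_y = (pose["left_shoulder"][1] + pose["right_shoulder"][1]) / 2
    hip_x = (pose["left_hip"][0] + pose["right_hip"][0]) / 2
    hip_y = (pose["left_hip"][1] + pose["right_hip"][1]) / 2
    return math.degrees(math.atan2(abs(shoulder_x - hip_x), abs(shoulder_y - hip_y)))


def looks_like_fall(pose: Dict[str, Point], threshold: float = 60.0) -> bool:
    """Flag poses whose torso is closer to horizontal than to vertical."""
    return torso_tilt_degrees(pose) > threshold


# Hypothetical keypoints in image pixels (y grows downward).
standing = {
    "left_shoulder": (100.0, 100.0), "right_shoulder": (140.0, 100.0),
    "left_hip": (105.0, 200.0), "right_hip": (135.0, 200.0),
}
lying = {
    "left_shoulder": (100.0, 300.0), "right_shoulder": (100.0, 320.0),
    "left_hip": (200.0, 300.0), "right_hip": (200.0, 320.0),
}

print(looks_like_fall(standing), looks_like_fall(lying))
```

A production system would likely combine several cues over time, for example how quickly the tilt changes and how long the person stays down, before raising an alert.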
ITRex Group has recently helped a fitness tech startup create a fitness mirror powered by artificial intelligence and human pose estimation. We sat down to talk to Kirill Stashevsky, the ITRex CTO, to discuss the specifics of implementing human pose estimation technology that contribute a lot to the project’s success but are often overlooked.
— How does one embark on the HPE implementation journey to ensure they produce a top-notch solution? What should one watch out for during project planning to make sure that further development efforts are headed in the right direction?
Kirill: Whether you craft a winning human pose estimation product depends largely on the decisions you make at the very beginning of the project. One such decision is selecting the optimum implementation strategy — one may choose to develop a solution from scratch or rely on one of the many human pose estimation libraries.
To choose the best-fitting approach, you need to clearly understand, among other issues, what exactly you aim to achieve with your future product, which platforms it will run on, and how much time you have until releasing your product to the market. Once you’ve clarified the vision, weigh it against the available strategies.
Consider going the custom route if the task you are solving is narrow and non-trivial and requires the ultimate accuracy of human pose estimation. Keep in mind, however, that the development process is likely to be time- and effort-intensive.
In turn, if you are developing a product with mass-market appeal or a product that caters to a typical use case, going for library-based development would help build a quality prototype faster and with less effort. Still, in many cases, you would have to adjust the given model to your specific use case by further training it on data that best reflects reality.
— Suppose I decide to go for library-based development; what factors should I consider to choose the fitting one?
Kirill: You may go for a proprietary or an open-source library. Proprietary libraries could provide more accurate pose estimation and require less customization. But you have to prepare a backup plan in case the owner, say, discontinues support for the library.
Open-source libraries, in turn, often require more effort to configure. But with an experienced team, they may be the optimal choice, balancing the quality of recognition, moderate development costs, and a fair time-to-market.
Pay attention to the number of keypoints a library is able to recognize, too. A solution for dancers or yogis, for instance, may require identifying additional keypoints for hands and feet, so BlazePose might be the more reasonable option. If latency is critical, select a library that runs at 30 FPS or higher, for example, MoveNet.
— Why aren’t the standard datasets lying at the base of most models enough for an accurately performing solution? What data should I then use to further train and test the model?
Kirill: A well-performing human pose estimation model should be trained on data that is authentic and representative of reality. The truth is that even the most expansive datasets often lack diversity and are not enough to yield reliable outcomes in real-life settings. That’s what we faced when developing the human pose estimation component for our client’s fitness mirror.
To mitigate the issue, we had to retrain the model on additional video footage filmed specifically to reflect the surroundings the mirror will be used in. So, we compiled a custom dataset of videos featuring people of different heights, body types, and skin colors exercising in various settings — from poorly lit rooms to spacious fitness studios — filmed at a particular angle. That helped us significantly increase the accuracy of the model.
— Are there any other easy-to-overlook issues that still influence the accuracy of human pose estimation?
Kirill: Focusing on the innards of deep learning, development teams may fail to pay due attention to the cameras. So, make sure that the characteristics of the camera used for filming training data (including its positioning, frame size, frame rate, shooting angle, and whether it shoots statically or dynamically) match those of the cameras real users might employ.
So, you are considering implementing a solution with human pose estimation features. Here are vital things to keep in mind to develop a winning application:
Looking to develop top-notch software that recognizes body postures as accurately as a human eye? Tell ITRex about your vision, and they will help you turn human pose estimation technology to your advantage.