Is it possible for a technology solution to replace fitness coaches? Well, someone still has to motivate you by saying, "Come on, even my grandma can do better!" But from a technology point of view, this high-level requirement led us to 3D human pose estimation technology.
In this article, I will describe our own experience of how 3D human pose estimation can be developed and implemented for an AI fitness coach solution.
Human pose estimation is a computer vision-based technology that detects and analyzes human posture. Its main component is the modeling of the human body. The three most commonly used types of human body model are skeleton-based, contour-based, and volume-based.
A skeleton-based model consists of a set of joints (keypoints), such as ankles, knees, shoulders, elbows, and wrists, plus the limb orientations that together comprise the skeletal structure of the human body. This model is used in both 2D and 3D human pose estimation techniques because of its flexibility.
A contour-based model consists of the contour and rough width of the torso and limbs, where body parts are represented by the boundaries and rectangles of a person's silhouette.
A volume-based model represents 3D human body shapes and poses with geometric meshes and shapes, normally captured with 3D scans.
Here, I am talking about skeleton-based models, which can be detected in either 2D or 3D.
2D pose estimation is based on detecting and analyzing the X, Y coordinates of human body joints in an RGB image.
3D pose estimation is based on detecting and analyzing the X, Y, Z coordinates of human body joints in an RGB image.
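To make the difference concrete, here is a minimal sketch of the two output shapes, assuming a 17-joint, COCO-style skeleton (the joint count is an illustrative assumption, not a requirement):

```python
import numpy as np

# Assumed 17-joint, COCO-style skeleton; other layouts exist.
NUM_JOINTS = 17

# 2D pose estimation output: one (x, y) pixel coordinate per joint.
pose_2d = np.zeros((NUM_JOINTS, 2), dtype=np.float32)

# 3D pose estimation output: one (x, y, z) coordinate per joint,
# usually expressed relative to a root joint such as the pelvis.
pose_3d = np.zeros((NUM_JOINTS, 3), dtype=np.float32)
```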
For fitness applications involving human pose estimation, it's better to use 3D estimation, since it analyzes human poses during physical activity more accurately.
For AI fitness coach apps, the common flow looks as follows:
Here is a visual example of how 3D human pose estimation technology detects keypoints on a human body:
The process usually involves the extraction of the joints of a human body, followed by analysis of the pose by deep learning algorithms. If the human pose estimation system uses video recordings as a data source, keypoints (joint locations) are detected from a sequence of frames rather than a single picture. This yields higher accuracy, since the system analyzes the actual movement of a person, not a static position.
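As a rough sketch of that per-frame extraction step, the loop below reads a video with OpenCV and collects the detected keypoints into a sequence; `detect_keypoints` is a placeholder for whatever 2D detector is used:

```python
import cv2
import numpy as np

def extract_keypoint_sequence(video_path, detect_keypoints):
    """Run a 2D keypoint detector over every frame of a video.
    `detect_keypoints` is a placeholder for any per-frame detector
    returning a (num_joints, 2) array of joint coordinates."""
    capture = cv2.VideoCapture(video_path)
    frames = []
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        frames.append(detect_keypoints(frame))
    capture.release()
    return np.stack(frames)  # shape: (num_frames, num_joints, 2)
```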
There are several ways to develop a 3D human pose estimation system for fitness. The most practical one is to train a deep learning model to extract 3D or 2D keypoints from the given images/frames.
Using video streams from several cameras with different views of the same person doing exercises would certainly give better accuracy. However, multiple cameras are often not available, and analyzing several video streams takes more computing power.
For our research, we used a single video source for the analysis and applied convolutional neural networks (CNNs) with dilated temporal convolutions (see the video below).
We analyzed the existing models and concluded that VideoPose3D is the best fit for fitness app purposes. As input, it takes a set of detected 2D keypoints, produced by a 2D detector pre-trained on the COCO 2017 dataset. To accurately predict the current position of a joint, it processes visual data from several frames captured at different points in time.
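The snippet below is a toy illustration of that core idea: dilated 1D convolutions over time lift a short window of 2D keypoints to a 3D pose for the center frame. It is not VideoPose3D's actual architecture; the layer sizes and the 9-frame window are arbitrary choices made for the example:

```python
import torch
import torch.nn as nn

class TemporalLifter(nn.Module):
    """Toy 2D-to-3D lifting network in the spirit of VideoPose3D:
    dilated temporal convolutions turn a window of 2D poses into
    a 3D pose for the center frame. All sizes are illustrative."""

    def __init__(self, num_joints=17, channels=256):
        super().__init__()
        self.net = nn.Sequential(
            # x, y of every joint laid out along the channel axis
            nn.Conv1d(num_joints * 2, channels, kernel_size=3),
            nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size=3, dilation=3),
            nn.ReLU(),
            nn.Conv1d(channels, num_joints * 3, kernel_size=1),
        )

    def forward(self, keypoints_2d):
        # keypoints_2d: (batch, frames, joints, 2)
        b, f, j, _ = keypoints_2d.shape
        x = keypoints_2d.reshape(b, f, j * 2).permute(0, 2, 1)
        out = self.net(x)  # (batch, joints * 3, remaining_frames)
        return out.permute(0, 2, 1).reshape(b, -1, j, 3)

model = TemporalLifter()
window = torch.randn(1, 9, 17, 2)  # 9 consecutive frames of 2D keypoints
pose_3d = model(window)            # (1, 1, 17, 3): one central 3D pose
```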
Digitalization has not spared the fitness industry. According to a Research and Markets report, the digital fitness market is expected to reach $27.4 billion by 2022.
3D human pose estimation is a relatively new but rapidly evolving technology in digital fitness. After analyzing 3D human pose estimation systems and working with them in practice, we have arrived at our own vision of how such a system can be implemented. Let's review how it may be built so that it can automatically analyze movements using videos of users performing physical exercises.
Assuming that the goal of the system is to inspect the input video for common exercise mistakes and compare it with a reference video in which a professional athlete performs the same exercise, the flow looks as follows:
1. Cutting the input video depending on the exercise start and end
To indicate the start and end points, we can automatically detect the positions of body control points and apply arbitrary thresholds. For example, when squatting, it is possible to detect the angle of the arms and the height of the hands, and then use thresholds on those values to detect the start and end points of the exercise.
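Here is a minimal sketch of such an angle-based check, using the knee angle rather than the arm angle mentioned above (both work the same way); the 160-degree threshold is an arbitrary assumption:

```python
import numpy as np

def joint_angle(a, b, c):
    """Angle in degrees at joint b, formed by the segments b->a and b->c."""
    v1 = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    v2 = np.asarray(c, dtype=float) - np.asarray(b, dtype=float)
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-8)
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

# Arbitrary threshold: treat the squat as "started" once the knee
# angle (hip-knee-ankle) drops below roughly 160 degrees.
SQUAT_START_KNEE_ANGLE = 160.0

def squat_started(hip, knee, ankle):
    return joint_angle(hip, knee, ankle) < SQUAT_START_KNEE_ANGLE
```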
Another way is to ask the user to indicate the start and end of the exercise manually.
2. Detecting 2D and 3D keypoints on the user’s body
3. Decomposing the exercise into phases
Once the positions of the keypoints (joints) are extracted, they should be compared with the positions from the reference video. However, we cannot make a direct comparison, because the speed of the exercise and the total number of repetitions may differ between the input and reference videos.
These discrepancies can be resolved by decomposing the exercise into phases. This is illustrated in the image below, where the squatting exercise is decomposed into two primary phases: squatting down and squatting up.
Photo source: stronglifts.com
The decomposition can be done by analyzing the keypoints detected in the input video frame by frame and then comparing them, by certain criteria, with the keypoints from the reference video.
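A toy version of that frame-by-frame decomposition for squats, assuming we track the vertical height of the hip joint per frame (a real system would smooth the trajectory first):

```python
import numpy as np

def decompose_squat_phases(hip_heights, tolerance=1e-3):
    """Label each frame of a squat as 'down', 'up', or 'hold' from the
    vertical hip trajectory (one height value per frame, larger = higher)."""
    velocity = np.gradient(np.asarray(hip_heights, dtype=float))
    phases = np.full(len(hip_heights), "hold", dtype=object)
    phases[velocity < -tolerance] = "down"  # hip moving downward
    phases[velocity > tolerance] = "up"     # hip moving upward
    return phases
```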
4. Searching for common mistakes
Once the 3D keypoints and the phases of the exercise are detected, it's time to look for common technique mistakes in the input video. For example, in squatting, we can detect moments when the legs are bent when they should be straight, or when the knees drift closer to the center of the torso than the feet (knees caving inward).
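For instance, the knees-caving-in check could be sketched like this, assuming (x, y, z) keypoints where x is the lateral axis; the 0.9 margin is an arbitrary assumption:

```python
def knees_cave_inward(left_knee, right_knee, left_foot, right_foot):
    """Flag the 'knees caving in' squat mistake: in the frontal plane,
    the knees end up noticeably closer together than the feet.
    Each keypoint is an (x, y, z) tuple; x is the lateral axis here."""
    knee_span = abs(left_knee[0] - right_knee[0])
    foot_span = abs(left_foot[0] - right_foot[0])
    return knee_span < 0.9 * foot_span  # 0.9 margin is an arbitrary choice
```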
5. Comparing the input video frames with the reference ones
Here we take a reference video in which the exercise is performed correctly, split it into phases, and detect the keypoints in each frame. Once the keypoints are detected and the exercise phases are defined in both the input and reference videos, we can compare each phase of the exercise as performed by the user and by the professional athlete.
The step-by-step flow looks as follows (a short code sketch follows the list):
a. Slow down or speed up the reference video to match the speed of the input one.
b. Align the skeleton models of the user and the professional athlete so that their rotation angles and origins match.
c. Normalize the sizes of both skeletons, since the reference and input videos can be captured from different distances.
d. Compare keypoints frame by frame and detect motion inconsistencies.
e. Repeat the flow separately for different groups of joints (e.g., feet position, knee position, hands and elbows position, etc.).
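Steps c and d could be sketched as follows (rotation alignment from step b is omitted for brevity); the pelvis/neck joint indices are placeholder assumptions for a real skeleton layout:

```python
import numpy as np

def normalize_skeleton(pose, root=0, ref_a=0, ref_b=8):
    """Center a (joints, 3) pose at a root joint and rescale it so that
    the distance between two reference joints (e.g., pelvis and neck)
    equals 1. The indices 0 and 8 are placeholder assumptions."""
    pose = np.asarray(pose, dtype=float)
    pose = pose - pose[root]
    scale = np.linalg.norm(pose[ref_a] - pose[ref_b]) + 1e-8
    return pose / scale

def pose_deviation(user_pose, reference_pose, joint_group):
    """Mean per-joint distance for one group of joints (e.g., the knees)
    after both skeletons are centered and rescaled, for a single frame."""
    u = normalize_skeleton(user_pose)
    r = normalize_skeleton(reference_pose)
    return float(np.mean(np.linalg.norm(u[joint_group] - r[joint_group], axis=1)))
```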
6. Displaying results and generating recommendations for the user
When the whole analysis cycle is completed, the user gets the results displayed in different formats. For example, the output may include interactive 3D reconstructions with mistake hints, so that the user can zoom in and out, go back and forward, or pause at a specific moment. It is also possible to collect and display movement statistics such as the number of repetitions, the average speed and duration of one repetition, and others.
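As a toy example of the statistics part, a repetition counter can be derived directly from the per-frame phase labels produced by the hypothetical decompose_squat_phases() sketch above:

```python
def count_repetitions(phases):
    """Count repetitions as the number of down -> up transitions
    in per-frame phase labels, ignoring 'hold' frames in between."""
    reps, last_move = 0, None
    for phase in phases:
        if phase == "hold":
            continue
        if last_move == "down" and phase == "up":
            reps += 1
        last_move = phase
    return reps
```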
Visually, the video-based 3D human pose estimation system looks as follows:
Photo sources: stronglifts.com, Men’s Health channel
In this article, I described how a 3D human pose estimation system works from the perspective of AI fitness coach app development, since this example illustrates the technology well. Please note, however, that the flow may change depending on business requirements and other factors.
Written by Maksym Tatariants, Data Science Engineer, MobiDev. This article is based on our technology research and experience providing software development services.
Previously published at https://mobidev.biz/blog/human-pose-estimation-ai-personal-fitness-coach