I hate shopping in brick and mortar stores. However, my interest in virtual shopping is not limited to the buyer experience only. With the MobiDev Data Science department, I’ve gained experience in working on AI technologies for virtual fitting room development. The goal of this article is to describe how these systems work from the inside.
A few years ago, the "Try before you buy" strategy was an efficient customer engagement method in clothing stores. Now, this strategy exists in the form of virtual fitting rooms. Fortune Business Insights predicts that the virtual fitting room market will reach USD 10.00 billion by 2027.
To better understand the logic of virtual fitting room technology, let's review the following example. Some time ago, we worked on an Augmented Reality (AR) footwear fitting room project. The fitting room works in the following way:
When working with ARKit (Apple's Augmented Reality framework), we discovered that it has limitations for this task: the tracking accuracy is too low to use it for footwear positioning. A likely cause is that the framework favors inference speed over tracking accuracy, a trade-off that matters for apps working in real time.
One more issue was the poor identification of body parts by the ARKit algorithm. Since this algorithm is designed to identify the whole body, it doesn't detect any keypoints if the processed image contains only a part of the body. This is exactly the case in a footwear fitting room, where the algorithm is supposed to process only a person's legs.
The conclusion was that virtual fitting room apps might require additional functionality on top of the standard AR libraries. Thus, it's recommended to involve data scientists in developing a custom pose estimation model that detects keypoints on only one or two feet in the frame and operates in real time.
The virtual fitting room technology market provides offerings for accessories, watches, glasses, hats, clothes, and others. Let’s review how some of these solutions work under the hood.
A good example of virtual watch try-on is the AR-Watches app, which allows users to try on various watches. The solution is based on the ARTag technology and uses specific markers printed on a band, which the user wears on the wrist in place of a watch in order to start the virtual try-on.
The computer vision algorithm processes only those markers visible in the frame and identifies the camera's position in relation to them. After that, to render a 3D object correctly, the virtual camera should be placed at the same location.
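To illustrate why the virtual camera has to share the real camera's pose, here is a minimal sketch using a pinhole camera model. It is not the AR-Watches implementation: the function names, the focal length, and the pose values are assumptions made up for the example, and the rotation is limited to a single yaw angle to keep it short.

```python
import math

def marker_to_camera(point, yaw_rad, translation):
    """Transform a point from marker coordinates into camera coordinates.

    The pose is simplified to a rotation around the Y axis plus a translation.
    """
    x, y, z = point
    c, s = math.cos(yaw_rad), math.sin(yaw_rad)
    xr, yr, zr = c * x + s * z, y, -s * x + c * z
    tx, ty, tz = translation
    return (xr + tx, yr + ty, zr + tz)

def project(point_cam, focal_px, cx, cy):
    """Pinhole projection of a camera-space point to pixel coordinates."""
    x, y, z = point_cam
    if z <= 0:
        raise ValueError("point is behind the camera")
    return (focal_px * x / z + cx, focal_px * y / z + cy)

# Hypothetical camera intrinsics and marker pose (illustrative values only).
FOCAL_PX, CX, CY = 800.0, 320.0, 240.0
ESTIMATED_YAW = math.radians(20.0)
ESTIMATED_T = (0.02, -0.01, 0.40)  # metres

marker_center = (0.0, 0.0, 0.0)
watch_crown = (0.02, 0.0, 0.0)  # a point on the watch model, 2 cm from the marker origin

# Rendering with the estimated pose keeps the model attached to the marker.
center_px = project(marker_to_camera(marker_center, ESTIMATED_YAW, ESTIMATED_T), FOCAL_PX, CX, CY)
crown_px = project(marker_to_camera(watch_crown, ESTIMATED_YAW, ESTIMATED_T), FOCAL_PX, CX, CY)

# Rendering with a different (wrong) pose puts the same model point elsewhere.
crown_px_wrong = project(marker_to_camera(watch_crown, 0.0, (0.0, 0.0, 0.40)), FOCAL_PX, CX, CY)
```

If the virtual camera used any pose other than the estimated one, points of the watch model would land on different pixels than the wrist band they are supposed to cover.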
Overall, the technology has its limits (for instance, not everybody has a printer at hand to print out the ARTag band). But if it matches the business use case, it wouldn't be that difficult to create a product of production-ready quality. Probably the most important part would be creating proper 3D objects to use.
Technically, such a solution utilizes a foot pose estimation model based on deep learning. This technology may be considered a particular case of the widespread full-body 3D pose estimation models, which estimate the positions of selected keypoints in 3D either directly or by lifting detected 2D keypoint positions into 3D coordinates.
Once the 3D keypoints of the feet are detected, they can be used to create a parametric 3D model of a human foot and to position and scale a 3D footwear model according to the geometric properties of that parametric model.
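The positioning and scaling step can be sketched roughly as follows. This is a simplification under stated assumptions: it uses only two hypothetical keypoints (heel and toe tip) instead of a full parametric foot model, treats Y as the vertical axis, and `fit_shoe_to_foot` is an illustrative name rather than part of any library.

```python
import math

def distance(a, b):
    """Euclidean distance between two 3D points."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def fit_shoe_to_foot(heel, toe, shoe_model_length):
    """Derive scale, heading, and position for a shoe model from two foot keypoints.

    heel, toe: (x, y, z) keypoints in metres (assumed outputs of a pose model).
    shoe_model_length: heel-to-toe length of the unscaled 3D shoe model.
    """
    foot_length = distance(heel, toe)
    if foot_length == 0 or shoe_model_length <= 0:
        raise ValueError("degenerate keypoints or model length")
    scale = foot_length / shoe_model_length
    # Heading of the heel-to-toe direction in the ground (X-Z) plane.
    yaw = math.atan2(toe[0] - heel[0], toe[2] - heel[2])
    position = tuple((h + t) / 2 for h, t in zip(heel, toe))
    return {"scale": scale, "yaw": yaw, "position": position}

placement = fit_shoe_to_foot(heel=(0.0, 0.0, 0.0), toe=(0.0, 0.0, 0.26), shoe_model_length=0.30)
```

A real fitting room would use more keypoints to recover the foot's width and full 3D orientation, but the idea is the same: geometric properties measured from keypoints drive the transform of the footwear model.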
Compared to the full-body/face pose estimation model, foot pose estimation still has certain challenges. The main issue is the lack of 3D annotation data required for model training.
However, there are practical ways to work around this problem. One is to use synthetic data, that is, to render photorealistic 3D models of human feet with keypoints and train a model on that data. Another is to use photogrammetry, which reconstructs a 3D scene from multiple 2D views and thus reduces the amount of manual labeling needed.
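To show how multiple views can reduce labeling effort, here is a minimal triangulation sketch: a keypoint labeled in two views with known camera positions yields a 3D point, which could then be reprojected into further views instead of labeling them by hand. The midpoint method used here is just one simple choice, and the function name and all values are assumptions for illustration.

```python
def triangulate_midpoint(origin1, dir1, origin2, dir2):
    """Estimate a 3D point as the midpoint of the shortest segment between two rays.

    Each ray starts at a camera center (origin) and points towards the
    labeled keypoint (dir); directions do not need to be normalized.
    """
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    w = tuple(a - b for a, b in zip(origin1, origin2))
    a, b, c = dot(dir1, dir1), dot(dir1, dir2), dot(dir2, dir2)
    d, e = dot(dir1, w), dot(dir2, w)
    denom = a * c - b * b
    if abs(denom) < 1e-12:
        raise ValueError("rays are parallel; cannot triangulate")
    s = (b * e - c * d) / denom  # parameter along ray 1
    t = (a * e - b * d) / denom  # parameter along ray 2
    p1 = tuple(o + s * v for o, v in zip(origin1, dir1))
    p2 = tuple(o + t * v for o, v in zip(origin2, dir2))
    return tuple((u + v) / 2 for u, v in zip(p1, p2))

# Hypothetical setup: a toe keypoint seen from two camera positions.
true_point = (0.1, 0.2, 1.0)
cam1, cam2 = (0.0, 0.0, 0.0), (0.5, 0.0, 0.0)
ray1 = tuple(p - c for p, c in zip(true_point, cam1))
ray2 = tuple(p - c for p, c in zip(true_point, cam2))
estimate = triangulate_midpoint(cam1, ray1, cam2, ray2)
```

With noisy labels the two rays no longer intersect exactly, which is why the midpoint of the shortest connecting segment is returned rather than an exact intersection.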
This kind of solution is considerably more complicated. To enter the market with a ready-to-use product, it is necessary to collect a large enough foot keypoint dataset (using synthetic data, photogrammetry, or a combination of both), train a customized pose estimation model (one that combines sufficiently high accuracy with inference speed), test its robustness in various conditions, and create a foot model. We consider it a medium-complexity project in terms of technologies.
This solution is based on the deep learning-powered pose estimation approach utilized for facial landmarks detection, where the common annotation format includes 68 2D/3D facial landmarks.
Such an annotation format allows the differentiation of face contour, nose, eyes, eyebrows, and lips with a sufficient accuracy level. The data for training the face landmark estimation model might be taken from such open-source libraries as Face Alignment, providing face pose estimation functionality out-of-the-box.
In terms of technologies, this kind of solution is not that complicated, especially when a pre-trained model is used as the basis for the facial landmark detection task. But it's important to consider that low-quality cameras and poor light conditions could be limiting factors.
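As an illustration of how 68 landmarks can drive glasses placement, the sketch below groups landmark indices and derives a position, width, and roll angle from the two eye groups. The index ranges follow the widely used 68-point layout (an assumption here, since the exact ordering depends on the dataset and model), the input is a plain list of 2D points, and `glasses_placement` is a hypothetical helper, not a function of the Face Alignment library.

```python
import math

# Index ranges of the common 68-point layout (assumed, 0-based).
LANDMARK_GROUPS = {
    "face_contour": range(0, 17),
    "eyebrows": range(17, 27),
    "nose": range(27, 36),
    "eye_a": range(36, 42),
    "eye_b": range(42, 48),
    "lips": range(48, 68),
}

def centroid(points):
    """Average of a list of (x, y) points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

def glasses_placement(landmarks):
    """Compute where and how to draw glasses from 68 2D landmarks."""
    if len(landmarks) != 68:
        raise ValueError("expected 68 landmarks")
    eye_a = centroid([landmarks[i] for i in LANDMARK_GROUPS["eye_a"]])
    eye_b = centroid([landmarks[i] for i in LANDMARK_GROUPS["eye_b"]])
    dx, dy = eye_b[0] - eye_a[0], eye_b[1] - eye_a[1]
    return {
        "center": ((eye_a[0] + eye_b[0]) / 2, (eye_a[1] + eye_b[1]) / 2),
        "eye_distance": math.hypot(dx, dy),  # drives the scale of the glasses model
        "roll": math.atan2(dy, dx),          # in-plane head tilt in radians
    }

# Synthetic landmarks for demonstration: only the eye points are meaningful.
demo = [(0.0, 0.0)] * 68
for i in LANDMARK_GROUPS["eye_a"]:
    demo[i] = (100.0, 100.0)
for i in LANDMARK_GROUPS["eye_b"]:
    demo[i] = (160.0, 100.0)
placement = glasses_placement(demo)
```

With 3D landmarks the same idea extends to yaw and pitch, which is what makes the rendered frame follow the head as it turns.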
Amidst the COVID-19 pandemic, ZapWorks launched an AR-based educational app designed to instruct users on how to wear surgical masks properly. Technically, this app is also based on a 3D facial landmark detection method. As in the glasses try-on app, this method provides information about facial features, which is then used for rendering the mask.
Given that facial landmark detection models work well, another frequently simulated AR item is hats. All that is required to render a hat correctly on a person's head is the 3D coordinates of several keypoints indicating the temples and the location of the forehead center. Virtual hat try-on apps have already been launched by QUYTECH, Banuba, and Vertebrae.
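A rough sketch of how those few keypoints could determine a hat's transform is shown below. It is an illustration only: the keypoint names, the coordinate values, and the `fit_hat` helper are assumptions, and a production renderer would also need the head's full orientation.

```python
import math

def fit_hat(left_temple, right_temple, forehead_center, hat_inner_width):
    """Derive scale, anchor, and side-to-side axis for a hat model.

    All points are (x, y, z) in metres; hat_inner_width is the inner
    width of the unscaled 3D hat model.
    """
    diff = tuple(r - l for l, r in zip(left_temple, right_temple))
    head_width = math.sqrt(sum(c * c for c in diff))
    if head_width == 0 or hat_inner_width <= 0:
        raise ValueError("degenerate keypoints or hat width")
    return {
        "scale": head_width / hat_inner_width,
        "anchor": forehead_center,                        # where the brim front sits
        "right_axis": tuple(c / head_width for c in diff),  # unit vector temple to temple
    }

hat = fit_hat(
    left_temple=(-0.07, 0.0, 0.0),
    right_temple=(0.07, 0.0, 0.0),
    forehead_center=(0.0, 0.05, 0.08),
    hat_inner_width=0.20,
)
```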
Compared to shoes, masks, glasses, and watches, virtual try-on of 3D clothes remains a challenge. The reason is that clothes deform as they take the shape of a person's body. Thus, for a proper AR experience, a deep learning model should identify not only the basic keypoints on the body's joints but also the body shape in 3D.
Consider DensePose, one of the most recent deep learning models, which maps the pixels of an RGB image of a person to the 3D surface of the human body. It turns out to be still not quite suitable for augmented reality.
DensePose's inference speed is not appropriate for real-time apps, and its body mesh detections are not accurate enough for fitting 3D clothing items. Improving the results requires collecting more annotated data, which is a time- and resource-consuming task.
The alternative is to use 2D clothing items and 2D silhouettes of people. That's what the company Zeekit does, letting users apply a number of clothing types (dresses, pants, shirts, etc.) to their photo.
Strictly speaking, transferring 2D clothing images cannot be considered Augmented Reality, since the "Reality" aspect implies real-time operation. However, it can still provide an unusual and immersive user experience.
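The very last step of any 2D approach, placing garment pixels over a person's photo, can be sketched as plain alpha compositing. This is a deliberately naive illustration and not how Zeekit or any learned try-on model works: it assumes the garment image is already warped and aligned to the photo, and it represents images as nested lists of RGB tuples to stay dependency-free.

```python
def composite(person, garment, alpha):
    """Blend a garment image over a person image using a per-pixel alpha mask.

    person, garment: rows of (r, g, b) tuples with identical dimensions.
    alpha: rows of floats in [0, 1]; 1.0 means the garment fully covers the pixel.
    """
    if not (len(person) == len(garment) == len(alpha)):
        raise ValueError("images and mask must have the same height")
    result = []
    for p_row, g_row, a_row in zip(person, garment, alpha):
        if not (len(p_row) == len(g_row) == len(a_row)):
            raise ValueError("images and mask must have the same width")
        row = []
        for p, g, a in zip(p_row, g_row, a_row):
            row.append(tuple(round(a * gc + (1 - a) * pc) for pc, gc in zip(p, g)))
        result.append(row)
    return result

# Tiny 1x3 example: untouched pixel, half-blended edge pixel, fully covered pixel.
person_img = [[(100, 100, 100), (100, 100, 100), (100, 100, 100)]]
garment_img = [[(200, 0, 0), (200, 0, 0), (200, 0, 0)]]
mask = [[0.0, 0.5, 1.0]]
blended = composite(person_img, garment_img, mask)
```

The hard parts that this sketch leaves out are exactly the ones that matter in practice: warping the garment to the body pose, producing an accurate mask, and filling in regions of the body that the original clothes were hiding.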
Since there are no ready-made pre-trained models for the virtual dressing room, we researched this field by experimenting with the ACGPN model. The idea was to explore the outputs of this model in practice for 2D cloth transferring using various approaches.
The model was applied to people's images in constrained (samples from the training dataset, VITON) and unconstrained (any environment) conditions. In addition, we tested the limits of the model's capabilities by not only running it on custom persons' images but also using custom clothing images that were quite different from the training data.
Here are examples of results we received during the research:
1. Replication of results described in the "Towards Photo-Realistic Virtual Try-On by Adaptively Generating↔Preserving Image Content" research paper, with the original data and our preprocessing models:
B1 - poor inpainting
B2 - new clothes overlapping
B3 - edge defects
2. Application of custom clothes to default person images:
Row A - no defects
Row B - some defects to be moderated
Row C - critical defects
3. Application of default clothes to the custom person images:
Row A - edge defects (minor)
Row B - masking errors (moderate)
Row C - inpainting and masking errors (critical)
4. Application of custom clothes to the custom person images:
Row A - best results obtained from the model
Row B - many defects to be moderated
Row C - most distorted results
When analyzing the outputs, we discovered that virtual clothes try-on still has certain limitations. The main one is that the training data should contain paired images of the target clothing item and of people wearing it, which may be challenging to obtain in a real-world business scenario. The other takeaways from the research are:
In conclusion, I'd say that current virtual fitting rooms work well for items related to separate body parts like the head, face, feet, and arms. But for items where the human body needs to be fully detected, estimated, and modified, virtual fitting is still in its infancy. However, AI technology evolves in leaps and bounds, and the best strategy is to stay tuned and keep watching the latest AI trends on MobiDev.
Written by Maksym Tatariants, Data Science Engineer at MobiDev
Previously published at https://mobidev.biz/blog/ar-ai-technologies-virtual-fitting-room-development