Machine learning can be complex and overwhelming. Luckily, Google is democratizing machine learning with Google AutoML, a Google Cloud tool that handles the complexity of machine learning for common use cases.
Because building a custom machine learning model with AutoML is so easy, time-to-market is reduced drastically.
If you upload your labelled dataset, AutoML will extract the patterns and create a custom machine learning model. AutoML can be used with tables, text, images, and video. The trained model can then be accessed through an API or client, bringing machine learning capabilities to your specific use case. It’s like putting a bread recipe and all the ingredients into a machine and, within hours, getting a perfectly baked loaf of bread, without knowing how to bake yourself.
Recently, Google released AutoML Vision Edge for image recognition, which opens up many new, innovative opportunities. Previously, you could only access the cloud-hosted model through API calls, and a single call can take more than a second. With AutoML Edge, you can download the model and run it on your local machine, reducing the time to get a prediction to milliseconds. In this way, users get a real-time experience.
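To give an idea of how little is involved, here is a minimal sketch of running an exported Edge model locally with TensorFlow Lite. The file names model.tflite and dict.txt and the test frame are illustrative assumptions, not artifacts from the actual project.

```python
# Minimal sketch: local inference with an AutoML Vision Edge model exported
# as TensorFlow Lite (the file names model.tflite / dict.txt are assumptions).
import numpy as np
import tensorflow as tf
from PIL import Image

interpreter = tf.lite.Interpreter(model_path="model.tflite")
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()[0]
output_details = interpreter.get_output_details()[0]

# Resize the frame to whatever input shape the exported model expects.
# A quantized export takes raw uint8 pixels; a float export would also need scaling.
_, height, width, _ = input_details["shape"]
image = Image.open("frame.jpg").convert("RGB").resize((width, height))
input_data = np.expand_dims(np.array(image, dtype=input_details["dtype"]), axis=0)

interpreter.set_tensor(input_details["index"], input_data)
interpreter.invoke()  # runs in milliseconds on a laptop CPU
scores = interpreter.get_tensor(output_details["index"])[0]

labels = [line.strip() for line in open("dict.txt")]
print(labels[int(np.argmax(scores))])
```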
At Incentro, a digital change company, we always keep an eye out for new techniques that could give our clients a truly innovative advantage. As it turns out, AutoML Edge was the perfect match for a business case we were working on. The customer is a global optical retailer, and the business case was to design a ‘magic mirror’: a mirror placed in physical stores that guides customers through their fitting session when they are shopping for glasses and helps them find the perfect pair. In this case, both the speed and convenience of the model were essential, so we built a Proof of Concept (PoC) to present to the client. In this blog post, I will tell you more about this magic mirror and how it came to be.
Who would ever buy something that’s going to be on their face every single day, without trying it out first? I sure wouldn’t. That’s the main reason buying a new pair of glasses remained a strictly offline experience for so long. At the same time, keeping up with fast-moving technology and innovation is crucial for companies to survive.
Nowadays, countless retail stores are going bankrupt because they did not put enough emphasis on e-commerce.
And even with e-commerce, you’re barely keeping up with the competition. To get a head start, you have to connect online and offline by creating an omnichannel experience: one that allows the company to interact with its customers on multiple levels and combine the strengths of both channels to enhance the customer experience.
In our case, the portal between online and offline would be the ‘magic mirror’: an interactive mirror that guides customers through the fitting process. It would do so by using machine learning to detect the glasses someone is trying on in the store and provide the customer with useful information, such as pricing, the brand story and other available colours. And it doesn’t stop there: the mirror could also recommend glasses based on the types of glasses someone tried on before, the shape of their face, their age, their gender and more. All these characteristics can be extracted using machine learning. This is where the speed and convenience part comes in: for an experience to feel truly immersive, the detection of these characteristics needs to be instant.
Luckily for us, AutoML Vision Edge was released recently. Time to check it out!
When opting to use AutoML Vision, you immediately face some questions: classification or object detection? Edge or cloud-hosted?
At first, I experimented with cloud-hosted object detection. However, I soon found out that labelling the objects (glasses, in our case) by hand was very time-consuming: we had to draw a rectangle around each object, assign a label to it and then move on to the next one, over and over again, more than a thousand times. Luckily, the process could easily be automated using the standard Google Vision API’s object detection. The API would take care of locating the glasses in the pictures, and since I already knew which pictures featured which glasses, I could simply insert the name of the glasses as the label. Problem solved!
Well… not quite. Sometimes the standard Google Vision API didn’t detect the glasses properly or detected multiple glasses in one image. The pictures therefore still needed to be checked manually — a time-consuming task that was quite discouraging. Wasn’t there a way for us to just sit back and enjoy the show?
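For anyone curious what that semi-automatic labelling looks like, here is a rough sketch, assuming the Cloud Vision API’s object localization endpoint. The file names, bucket, glasses name and the ‘Glasses’/‘Sunglasses’ filter are illustrative assumptions, and images with zero or multiple detections are skipped so they can still be checked by hand; the CSV row follows the two-corner layout AutoML Vision object detection accepts.

```python
# Sketch of the semi-automatic labelling step, using Cloud Vision object
# localization; paths, bucket and label names are assumptions for illustration.
import csv
from google.cloud import vision

client = vision.ImageAnnotatorClient()

def label_row(local_path, gcs_uri, glasses_name):
    with open(local_path, "rb") as f:
        image = vision.Image(content=f.read())
    objects = client.object_localization(image=image).localized_object_annotations
    # Keep only detections that look like glasses; images with 0 or >1 hits
    # are left for manual review.
    hits = [o for o in objects if o.name in ("Glasses", "Sunglasses")]
    if len(hits) != 1:
        return None
    v = hits[0].bounding_poly.normalized_vertices
    xs, ys = [p.x for p in v], [p.y for p in v]
    # AutoML object detection CSV: two opposite corners, normalized to 0-1.
    return ["TRAIN", gcs_uri, glasses_name, min(xs), min(ys), "", "", max(xs), max(ys), "", ""]

with open("labels.csv", "w", newline="") as out:
    writer = csv.writer(out)
    row = label_row("img_001.jpg", "gs://my-bucket/img_001.jpg", "model_aviator_black")
    if row:
        writer.writerow(row)
```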
After labelling hundreds of images semi-automatically, uploading them to the cloud and hosting the model, I found that the API was easy to implement and already gave reasonable results. However, you pay for every hour the model is deployed. Maybe not the best solution in our case.
But there were still some options on the table, like switching to classification and experimenting with AutoML Vision Edge. That would make labelling a much easier task, since we wouldn’t have to draw a box around each object: the entire image would simply get one or more labels.
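In practice, the classification input is just a CSV that pairs each image with its whole-image label(s). A tiny sketch, with made-up bucket and label names:

```python
# Sketch of the classification labelling: one CSV row per image, whole-image
# labels, no bounding boxes. Extra labels can simply be appended as extra columns.
import csv

rows = [
    ("gs://my-training-images/aviator_black/0001.jpg", "model_aviator_black"),
    ("gs://my-training-images/round_tortoise/0001.jpg", "model_round_tortoise"),
]
with open("classification_labels.csv", "w", newline="") as out:
    csv.writer(out).writerows(rows)
```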
Using this classification option, our first results were shockingly bad. Unexpected? Not really, because the label only described about 5% of the image: the glasses. The images contained a lot of noise, which was troublesome for the model. However, this was something we could solve.
I asked myself: what if I just cut the heads out of the images and feed those into AutoML? That would reduce the noise drastically! From my object detection experiment, I had learned that I don’t like doing things manually, so I decided to write some code to do the crops automatically. After some browsing on the internet, I found a basic approach that uses a ‘Haar cascade’ to locate a face in an image. Using this, I created a function that crops the head out of the image and saves it as a new image of just one head wearing glasses, which I could feed back into our AutoML machine.
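A minimal sketch of that crop function, using OpenCV’s bundled Haar cascade for frontal faces; the margin factor and file paths are illustrative assumptions.

```python
# Sketch of the automatic head crop with OpenCV's frontal-face Haar cascade.
import cv2

cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def crop_head(src_path, dst_path, margin=0.4):
    image = cv2.imread(src_path)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) != 1:
        return False  # skip images with no face or more than one face
    x, y, w, h = faces[0]
    # Pad the detected face so the glasses and the rest of the head are included.
    pad_w, pad_h = int(w * margin), int(h * margin)
    x0, y0 = max(x - pad_w, 0), max(y - pad_h, 0)
    x1, y1 = min(x + w + pad_w, image.shape[1]), min(y + h + pad_h, image.shape[0])
    cv2.imwrite(dst_path, image[y0:y1, x0:x1])
    return True

crop_head("raw/img_001.jpg", "cropped/img_001.jpg")
```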
How I took the pictures
Just like the bread-baking machine, AutoML needs its ingredients. In our case, that’s images. Loads of them. For the PoC, I decided to train our model on 10 different pairs of glasses, some of them almost identical. A rule of thumb is that if the human eye can see the difference between two pairs of glasses, however small, AutoML should be able to learn it as well.
Filled with confidence, I started taking pictures with my mobile phone, about 20 per pair of glasses. I cropped out the heads to reduce noise, fed the nearly noise-free pictures into AutoML and trained the first model. After an hour, the model was ready to deploy. Filled with excitement and suspense, I created a simple API using the built-in webcam of my MacBook, put one of the pairs of glasses on and … got some very confusing results, to say the least. It felt like the model was making random guesses rather than well-founded predictions. What an anti-climax.
After a few minutes of thinking about what went wrong, I realised my MacBook’s webcam is about 1 MP, while the camera on my phone is around 12 MP. Maybe, I mused, it would help to use a better camera and to train the model with images taken with that same camera.
After purchasing a more capable webcam, I set out to train an AutoML model again. This time, I thought it would be smart to overload AutoML with pictures, and what faster way to take a lot of pictures than shooting a video? I asked about five of my colleagues to act as a model for my model (pun intended) and recorded a video of about 10 seconds for every single pair of glasses. With 30 frames per second, 10 seconds per video and 5 colleagues/models per pair of glasses, this resulted in 1,500 images per pair of glasses, which had to be enough.
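Turning those videos into training images takes only a few lines of OpenCV. The paths and frame step below are assumptions, not the exact script I used.

```python
# Sketch: extract frames from a training video so they can be cropped and uploaded.
import cv2

def video_to_frames(video_path, out_dir, every_nth=1):
    cap = cv2.VideoCapture(video_path)
    index, saved = 0, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % every_nth == 0:
            cv2.imwrite(f"{out_dir}/frame_{saved:05d}.jpg", frame)
            saved += 1
        index += 1
    cap.release()
    return saved

# 30 fps * 10 s = ~300 frames per video, per colleague, per pair of glasses.
video_to_frames("videos/aviator_black_colleague1.mp4", "frames/aviator_black")
```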
After uploading and training the model again, waiting for another hour, downloading the model and implementing it in my code, I was ready to give it another go. When I used the new webcam at my desk, the results were good. But as soon as I left my desk and tried the model somewhere else in the office, the results were “not great, not terrible” (https://www.quora.com/What-does-not-great-not-terrible-mean).
So I got to thinking again. Maybe the problem was the variety within the images I fed into AutoML, or rather the lack of it. The frames from the videos, shot from a stationary spot, were (logically) almost identical. For the next batch of images, I wrote a script that automatically takes about 5 pictures per second, crops out the face, uploads the crops to the cloud, labels them and finally feeds them to AutoML. I kindly asked my colleagues to model again and this time used a greater variety of lighting, locations and head movements. That gave me another couple of hundred pictures per pair of glasses; I also removed a large portion of the video frames to avoid overfitting, and tried my luck again. This AutoML thing was becoming a familiar ritual: uploading, waiting for an hour, downloading the model, implementing it in my code and then, poof, the model worked! The response time was lightning fast and the results were very reliable: the model recognized every single pair of glasses. Time to hook it up to the smart mirror and see how it works out.
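A sketch of that capture script is below. The bucket name, label and frame counts are illustrative assumptions, and the upload uses the standard Cloud Storage client rather than whatever exact plumbing the real script had.

```python
# Sketch of the capture script: grab roughly 5 webcam frames per second, crop
# the head with the same Haar cascade as before, and upload each crop to Cloud
# Storage under its label.
import time
import cv2
from google.cloud import storage

bucket = storage.Client().bucket("my-training-images")
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def capture_session(label, num_frames=300, interval=0.2):
    cam = cv2.VideoCapture(0)
    for i in range(num_frames):
        ok, frame = cam.read()
        if ok:
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
            if len(faces) == 1:  # keep only clean, single-face shots
                x, y, w, h = faces[0]
                local = f"/tmp/{label}_{i:04d}.jpg"
                cv2.imwrite(local, frame[y:y + h, x:x + w])
                bucket.blob(f"{label}/{i:04d}.jpg").upload_from_filename(local)
        time.sleep(interval)  # roughly 5 frames per second
    cam.release()

capture_session("model_aviator_black")
```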
Serving the results to customers
Now that the backend works properly, I can have a look at what happens in the frontend. Every 200 ms, the frontend makes a request to the backend via a REST API call. The backend then takes a picture using the webcam inside the mirror and checks whether there is a person in the image. If there is, the backend determines whether the person is wearing glasses and, if so, classifies which glasses they are, finally sending the data back to the frontend, all in under 200 ms so it is ready for the next request.
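Roughly, the backend looks like the sketch below, written here with Flask purely as an illustration. The endpoint name is an assumption, the wearing-glasses check is folded into the classification step for brevity, and classify_glasses() is a placeholder for the Edge-model inference from the earlier TensorFlow Lite sketch.

```python
# Sketch of the backend the frontend polls every 200 ms.
import cv2
from flask import Flask, jsonify

app = Flask(__name__)
camera = cv2.VideoCapture(0)  # the webcam built into the mirror
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def classify_glasses(frame, face):
    # Placeholder: crop the face region and run the TFLite Edge model here.
    return "model_aviator_black", 0.97

@app.route("/detect")
def detect():
    ok, frame = camera.read()
    if not ok:
        return jsonify({"person": False})
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return jsonify({"person": False})
    label, score = classify_glasses(frame, faces[0])
    return jsonify({"person": True, "glasses": label, "confidence": score})

if __name__ == "__main__":
    app.run(port=8080)
```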
Algorithm to detect the person in question
After using the smart mirror for a couple of minutes, I called in some of my colleagues to show them the magic. They enthusiastically grabbed a pair of glasses from the shelves, put them on their faces and gathered together in front of the mirror. And in that moment, I encountered yet another challenge: with that many people in front of the mirror, it struggled to determine which of all those people to check for glasses.
Therefore, I had to find a smart way to determine which person standing in front of the mirror is most likely to be interacting with it. I based this decision on two metrics: someone’s distance from the mirror and their distance from the centre of the mirror. When we tried the mirror again with this improved algorithm, it worked like a charm.
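In code, that boils down to scoring each detected face, something like the sketch below: the size of the face box serves as a proxy for distance to the mirror, the horizontal offset as the distance from the centre, and the weighting between the two is an illustrative assumption.

```python
# Sketch: pick the face most likely to be interacting with the mirror, i.e.
# the largest (closest) face that is also nearest to the centre of the frame.
def pick_primary_face(faces, frame_width):
    def score(face):
        x, y, w, h = face
        size = (w * h) / (frame_width * frame_width)          # bigger = closer
        face_centre = x + w / 2
        offset = abs(face_centre - frame_width / 2) / (frame_width / 2)
        return size - 0.5 * offset                            # penalise off-centre faces
    return max(faces, key=score)

# Example: three detected faces as (x, y, w, h) boxes in a 1280-pixel-wide frame.
faces = [(100, 200, 120, 120), (560, 180, 200, 200), (1000, 220, 140, 140)]
print(pick_primary_face(faces, 1280))  # -> the large, centred face
```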
Data collection
This is just a Proof of Concept, but I can’t help wondering about the endless possibilities a smart mirror would open up. First and foremost, our objective is to enhance the customer experience by providing customers with additional information about the glasses they’re trying on, all while (in the background) extracting valuable data from the people interacting with the mirror: gender, age, face shape, sentiment, and more. All this collected data could result in the mirror giving personal recommendations later on, helping the customer find their perfect pair of glasses within minutes of entering the store. The store will really know its customers and can use this information to serve them better.
Next steps
The next steps are to get the code into production, enlarge our training set from ten pairs of glasses to about a thousand and see how the model responds. The concept of smart mirrors is not unique to glasses and can be applied to a variety of use cases, both in retail and in other industries.
It’s this speed and convenience that make Google AutoML Edge unique and the possibilities endless.
If this kicked your mind into gear and you’re looking for someone to explore the possibilities with, or if you simply have a question, you know where to find me. For more cases, have a look at www.incentro.com.