Unwrapping Wine Labels - How We Trained A Neural Network To Do It

The previous article described a six-point method for unwrapping wine labels. The anchor points were found with the Hough transform. It gave fair results for good labels, but in many real cases it was quite unstable, and our efforts to tune it didn’t help much. At some point it became clear that the Hough transform alone couldn’t handle the variety of label shapes, so the next step was to train a neural network.

This article describes how we collected and labeled a dataset and tried out neural network architectures. It was lots of fun.

A dataset is probably the most important part of neural network training. In principle, it’s possible to create a synthetic set of pictures, but then it’s not clear how well it corresponds to real images. That’s why at least part of the dataset must consist of labeled photos.

Training a network requires labeled images, and labeling requires a special tool. There are lots of labeling tools available for machine learning, but most of them are limited to classification, segmentation, or bounding boxes. In our case, we had to specify six points and dynamically visualize a 3D mesh, so we created our own tool.

The goals of the first milestone, “Pallas’s cat” (we named each milestone after a different cat):

* Create a simple system to label images
* Label a thousand images
* Try a neural network as a proof of concept

Requirements:

* Ability to process images online by a distributed team
* Each image had to be a separate task
* Two-step processing (processing and reviewing)
* Direct links to each of the tasks
* Visualization

Statistics and permission management were not part of the initial scope, as we were going to process the first thousand images ourselves.

I chose the Django framework to build the labeling system: it has an admin out of the box, there is the great Django REST framework for building APIs, and it’s extensible.

Let’s create the ImageTask model:

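# Imports assume the default Django user model. Image, TaskStatus, TaskType and
# get_enum_choices are project-specific, and the JSONField import depends on
# the Django version, so they are omitted here.
from django.contrib.auth.models import User
from django.db import models
from django.utils.safestring import mark_safe


class ImageTask(models.Model):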
    orig_image = models.OneToOneField(
        Image, blank=True, null=True, on_delete=models.CASCADE, related_name='+',
    )

    status = models.IntegerField(choices=get_enum_choices(TaskStatus))

    type = models.IntegerField(
        choices=get_enum_choices(TaskType), default=TaskType.detect_marker.value
    )

    worker = models.ForeignKey(
        User, null=True, blank=True, on_delete=models.SET_NULL, related_name='workers'
    )
    reviewer = models.ForeignKey(
        User, null=True, blank=True, on_delete=models.SET_NULL, related_name='reviewers'
    )

    data = JSONField(blank=True, default=dict)

    created = models.DateTimeField(auto_now_add=True)
    updated = models.DateTimeField(auto_now=True)

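    def clickable_preview(self):
        # Thumbnail for the admin list that links to the labeling page of this task.
        url = ''
        if self.orig_image:
            url = self.orig_image.full_url
        task_url = f'/static/jsapp/task/{self.id}'

        # object-fit is a CSS property, so it has to go into the style attribute.
        tag = f"""
            <a href="{task_url}" target="_blank">
                <img src="{url}" height="100" width="100" style="object-fit: scale-down">
            </a>
        """
        return mark_safe(tag)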
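
The model above relies on `TaskStatus`, `TaskType`, and `get_enum_choices`, which are not shown in the article. As a minimal sketch, assuming they are plain Python enums plus a small helper that turns an enum into Django-style `choices` pairs, they could look like this. Apart from `detect_marker`, the member names and all numeric values are hypothetical.

```python
from enum import Enum
from typing import List, Tuple, Type


class TaskStatus(Enum):
    # Hypothetical statuses; the article only names "New" and the backlog.
    new = 0
    backlog = 1
    in_progress = 2
    in_review = 3
    done = 4


class TaskType(Enum):
    # detect_marker is the only task type mentioned in the article.
    detect_marker = 0


def get_enum_choices(enum_cls: Type[Enum]) -> List[Tuple[int, str]]:
    """Turn an Enum into (value, human-readable label) pairs for a field's choices."""
    return [
        (member.value, member.name.replace('_', ' ').capitalize())
        for member in enum_cls
    ]
```

With these definitions, `get_enum_choices(TaskStatus)` yields pairs such as `(0, 'New')`, which is the shape Django expects for the `choices` argument of `IntegerField`.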

Then define its admin:

@admin.register(ImageTask)
class ImageTaskAdmin(admin.ModelAdmin):
    list_display = ['id', 'thumb', 'task_status', 'tags_list', 'perform_action', 'updated']
    list_filter = [
        'status',
        TrelloCardFilter,
        'worker',
        'reviewer',
    ]
    readonly_fields = ['orig_image', 'processed_image']
    ordering = ['-updated', '-id']
    list_select_related = ['orig_image']

And override tasks template:

{# templates/admin/labelapp/imagetask/change_list.html #}
{% extends "admin/change_list.html" %}
{% load bootstrap3 %}

{% block extrastyle %}
    {{ block.super }}
    <link rel="stylesheet" type="text/css" href="/static/css/imagetask_admin_changelist.css" />
    <link rel="stylesheet" type="text/css" href="/static/css/toastr.min.css" />
{% endblock extrastyle %}

{% block extrahead %}
    {{ block.super }}
    {% bootstrap_css %}
    <script>
        var csrftoken = '{{ csrf_token }}';
    </script>
    <script type="text/javascript" src="/static/js/jquery-3.3.1.min.js"></script>
    <script type="text/javascript" src="/static/js/toastr.min.js"></script>
    <script type="text/javascript" src="/static/js/const.js"></script>
    <script type="text/javascript" src="/static/js/imagetask_admin_changelist.js"></script>
{% endblock extrahead %}

The code is shortened a little for simplicity. Let’s open the “Image Tasks” page and look at the backlog:

The labeling interface is the most interesting part, of course. The most recent version is depicted below:

The labeled image on the left side of the page has six draggable keypoints. Once their positions change, the mesh is recalculated in real time. The first version of the interface was written in jQuery + Canvas, and some UI elements in the picture above were missing. It took me about three days to write it (Django and simple JavaScript are awesome), and we started labeling the dataset right away.

It took around 30 man-hours, spread across two weeks, to process a thousand items.

The motto of dataset processing is “Move fast, everything is fixable”. The flow must be speed-oriented, with no modal windows: any action is performed with a single click. On the other hand, if an error occurs (and that happens), there must be an easy way to fix it. Here is an example: all uploaded images get the status “New”. We add them to the backlog manually by going through the list and clicking the “Move to backlog” button. If you change your mind, the image can easily be restored with another button, “Set as new”:

I can’t say it’s rocket science, but all that makes the work highly effective.
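
The “everything is fixable” rule can be expressed as a table of one-click status transitions that includes undo actions. The sketch below is an assumption about how such a table might look, not the project's actual code. Only the “New” status and the “Move to backlog” and “Set as new” actions come from the article; all other names are hypothetical.

```python
from enum import Enum


class Status(Enum):
    NEW = 'new'
    BACKLOG = 'backlog'
    IN_PROGRESS = 'in_progress'  # hypothetical
    IN_REVIEW = 'in_review'      # hypothetical
    DONE = 'done'                # hypothetical


# Each one-click action maps a source status to a target status.
ACTIONS = {
    'move_to_backlog': (Status.NEW, Status.BACKLOG),
    'set_as_new': (Status.BACKLOG, Status.NEW),  # undo of move_to_backlog
    'start': (Status.BACKLOG, Status.IN_PROGRESS),
    'return_to_backlog': (Status.IN_PROGRESS, Status.BACKLOG),
    'send_to_review': (Status.IN_PROGRESS, Status.IN_REVIEW),
    'return_to_worker': (Status.IN_REVIEW, Status.IN_PROGRESS),
    'approve': (Status.IN_REVIEW, Status.DONE),
    'reopen': (Status.DONE, Status.IN_REVIEW),
}


def apply_action(current: Status, action: str) -> Status:
    """Return the new status, or raise if the action is not valid right now."""
    source, target = ACTIONS[action]
    if current is not source:
        raise ValueError(f'{action!r} is not allowed from status {current.value!r}')
    return target


def available_actions(current: Status) -> list:
    """Names of the buttons that make sense for a task in the given status."""
    return [name for name, (source, _) in ACTIONS.items() if source is current]
```

Keeping the transitions in one table makes it easy to render only the valid buttons for each row and to check that every forward action has a way back.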

So, once the dataset was ready, we could move on to the most interesting part: a neural network! We took U-Net, trained it, and it worked! Not ideally, of course, but well enough for a proof of concept. So we continued labeling the dataset.

The goals of the next milestone, “Rose panther”:

* Create a working dataset
* Choose an optimal neural network architecture
* Run a REST API to detect wine label keypoints
* Stitch labels

Meanwhile, Kirill (he was responsible for the architecture and stuff) tried to make changes to the labeling interface. His face when he saw my jQuery code:

So the first thing he did was refactor the page to use React.js and add a dynamic preview:

Later the interface got other changes: we added hotkeys, classification tags, Trello integration, and initial markers set by a neural network, but visually the interface didn’t change much.

In general, placing the six markers was not a trivial task. There were a lot of different wine label shapes, and we had to establish basic rules on how to mark them. For example, which of the cases below is correct?

If the dataset is not consistent in similar cases, the neural network will get confused, and the accuracy will suffer. At some point, we had to update the rules and reprocess the existing dataset (it was about 3000 images).

I mentioned Trello integration — we used it to start a discussion thread linked to a task if it was not clear how to tag it:

All the questions from workers go to the first column. The reviewer picks up a card and drags it to the “Conversations” column. Once the discussion is over, the worker archives the card, but the link is still available from the original processing task. The benefit of this approach is that the integration can be done in half a day, so there’s no need to write our own chat. Besides, Trello is free, and it has nice wallpapers :)

Two-step processing, first by a worker and then by a reviewer, is needed not so much to reduce the number of errors as to make the process more parallel. One or two reviewers can review tasks pretty quickly, and they know the markup rules better than the workers do. The workers, in turn, don’t have to worry about the rules too much: if something is wrong, the reviewer will catch the error and explain how to fix it.

Another problem with building a dataset is where to collect the source images. A supermarket alone carries about 2–3 thousand different bottles, although the product ranges of different stores overlap. Taking pictures is not too hard: I could take around a thousand in an hour, and luckily the staff never asked me to leave.

One day I realized I had shot every single bottle in New Jersey, and there was no point in walking around with the camera anymore. Additionally, the dataset we had collected consisted of bottles standing on shelves, which was quite a uniform environment. At some point, we noticed that the neural network detected wine labels against this background very well but was totally confused by anything different. So we decided to extend the dataset with web scraping.

We looked for images with a white background, with text, or with something unexpected. For example, if a picture of beer also contained a glass full of foamy amber, the neural network could not focus well enough. So we trained its stamina with an additional hundred images.

I also want to share a life hack, because it’s important to understand that dataset work is boring. If you are Google and own reCAPTCHA, the boredom is distributed across millions of innocent people, but if you label the dataset yourself, all the monotony lands on you personally. What I want to say is that every step that can be automated has to be automated. For downloading images, we launched an API that took an image URL, downloaded the image, unwrapped it, and visualized the queue.
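
The article does not show how the queue behind that API was built. As a rough illustration only, here is an in-memory sketch of the bookkeeping such a service needs: it accepts image URLs, rejects malformed ones, and skips duplicates, which matters when the same picture turns up in several searches. The class and method names are hypothetical, and the download and unwrap steps are left out.

```python
from collections import OrderedDict
from urllib.parse import urlparse


class ImageQueue:
    """In-memory stand-in for the processing queue (a real service would persist it)."""

    def __init__(self):
        self._items = OrderedDict()  # url -> status, kept in arrival order

    def enqueue(self, url: str) -> bool:
        """Add a URL; return False if it is malformed or already queued."""
        parsed = urlparse(url.strip())
        if parsed.scheme not in ('http', 'https') or not parsed.netloc:
            return False
        key = parsed.geturl()
        if key in self._items:
            return False
        self._items[key] = 'queued'
        return True

    def pending(self) -> list:
        """URLs that still wait for downloading and unwrapping."""
        return [url for url, status in self._items.items() if status == 'queued']

    def mark_processed(self, url: str) -> None:
        # Downloading and unwrapping would happen before this call.
        if url in self._items:
            self._items[url] = 'processed'
```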

On the client side, we used the CopyQ application: it’s cross-platform, runs in the background, and executes actions when a hotkey is pressed. In my case, I had the following configuration:

The first step is to look up images in Google, then, if you like one, copy its URL and hit Win-Z.

CopyQ sends the URL to the API, which adds the image to a processing queue. Once, let’s say, 100 items have accumulated, you can switch back to the admin and add the worst cases to the backlog:

There is no need to add images that are already recognized well, but the example above (a random image from the internet) shows the top left keypoint in the wrong position: it’s placed near the label, while it should be on the left side of the bottle. Now we click “Add to dataset”, and the image goes to the backlog.

We eventually processed 10k images in 3 months. It doesn’t seem like much, but the dataset was well balanced. In reality, a big dataset is not necessarily good, as it’s much harder to balance, i.e. to include the different kinds of examples in equal proportions. If you don’t, it might be easier for the neural network to ignore some of the examples than to generalize. Let’s say the dataset contains 1 million images, all of them taken in good light conditions. Then you realize that 15% of real photos are taken in bad light and the neural network is not able to recognize them. The solution is simple: add low-light samples. But as your dataset is large, you’ll have to add thousands of them before they affect the accuracy. It’s like a bureaucratic system that is designed to handle common cases but ignores the outliers.
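
One way to put a number on this is to assume that low-light photos should make up the same 15% share of the dataset as they do in real usage. The article does not state such a target, so treat it as an illustration. If x samples are added to a dataset of size N, the share is x / (N + x), which gives x = p * N / (1 - p) for a target share p.

```python
import math


def samples_to_add(dataset_size: int, target_share: float) -> int:
    """Samples of a missing kind needed so that they make up target_share
    of the extended dataset: x / (dataset_size + x) = target_share."""
    if not 0 < target_share < 1:
        raise ValueError('target_share must be between 0 and 1')
    return math.ceil(target_share * dataset_size / (1 - target_share))


print(samples_to_add(1_000_000, 0.15))  # 176471
print(samples_to_add(10_000, 0.15))     # 1765
```

Under this assumption, a million-image dataset needs about a hundred times more new low-light photos than a 10k one to reach the same proportion.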

But 10k items can lead to overfitting, so augmentations are required. There are nice libraries like imgaug or albumentations that can modify images in different ways: rotate, reflect, blur, or change the color balance. The only requirements are to keep the image readable and to take into account that even small changes, applied together, may have an unexpected effect. For example, blurring and saturation applied at the same time can visually shift the bottle edges.
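
Geometric augmentations have one more catch for keypoint data: the labels have to be transformed together with the pixels. The sketch below shows this for a horizontal reflection in plain Python, without using imgaug or albumentations. It assumes a hypothetical keypoint order (top-left, top-center, top-right, bottom-left, bottom-center, bottom-right) that is not taken from the article. After mirroring the x coordinates, the left and right points also have to swap places so that each index still means the same thing.

```python
from typing import List, Tuple

Point = Tuple[float, float]

# Hypothetical keypoint order (an assumption, not the article's convention):
# 0 top-left, 1 top-center, 2 top-right, 3 bottom-left, 4 bottom-center, 5 bottom-right
MIRROR_ORDER = [2, 1, 0, 5, 4, 3]


def flip_keypoints_horizontally(points: List[Point], image_width: int) -> List[Point]:
    """Mirror six keypoints the same way a horizontal image flip mirrors pixels."""
    if len(points) != len(MIRROR_ORDER):
        raise ValueError('expected exactly six keypoints')
    # Pixel column x moves to column (image_width - 1 - x).
    mirrored = [(image_width - 1 - x, y) for x, y in points]
    # The old top-right point is now the top-left one, and so on.
    return [mirrored[i] for i in MIRROR_ORDER]


points = [(10, 20), (40, 15), (90, 25), (12, 80), (45, 85), (95, 78)]
flipped = flip_keypoints_horizontally(points, image_width=100)
```

Flipping twice returns the original points, which is a cheap sanity check for this kind of helper.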

So, the dataset is ready: it’s well balanced and large enough thanks to augmentations. There is still a question: what neural network to use? Is it necessary to invent a new architecture, or are there some universal ones? The truth is that no one knows, but one thing is clear: it will be necessary to try many different architectures and settings, and any machine learning project is built around this idea. If it takes 3 weeks to try a new architecture, that’s bad news. Ideally, the pipeline must be designed so that connecting a new architecture takes only a few changes in configuration and, maybe, a couple of adapter classes to change the input/output format.
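
A common way to get this kind of pipeline is a registry that maps a name in the configuration to an adapter class. The sketch below is a generic illustration of that pattern, not the pipeline the team actually built. The adapter classes are empty placeholders, and their names and parameters (`depth`, `backbone`) are hypothetical.

```python
from typing import Callable, Dict

MODEL_REGISTRY: Dict[str, Callable[..., object]] = {}


def register_model(name: str):
    """Class decorator that makes an architecture selectable by name from the config."""
    def decorator(cls):
        MODEL_REGISTRY[name] = cls
        return cls
    return decorator


@register_model('unet')
class UNetAdapter:
    # Placeholder: a real adapter would wrap the network and convert
    # its output into six keypoints.
    def __init__(self, depth: int = 4):
        self.depth = depth


@register_model('regressor')
class RegressorAdapter:
    # Placeholder for a second, differently shaped architecture.
    def __init__(self, backbone: str = 'resnet18'):
        self.backbone = backbone


def build_model(config: dict):
    """Instantiate the architecture named in the config with its parameters."""
    name = config['architecture']
    if name not in MODEL_REGISTRY:
        raise KeyError(f'Unknown architecture {name!r}; known: {sorted(MODEL_REGISTRY)}')
    return MODEL_REGISTRY[name](**config.get('params', {}))


model = build_model({'architecture': 'unet', 'params': {'depth': 5}})
```

With this layout, trying a new architecture means adding one decorated adapter class and changing the `architecture` value in the configuration.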

Our proof of concept was written without those requirements, of course, but once it worked, our next step was to build such a pipeline. Initially, it was built with Keras, but Sergey (our artificial intelligence charmer) eventually switched to PyTorch.

We also made nightly builds of the neural network available via a REST API, which the labeling interface used for automatic image labeling. Django REST Framework has a very convenient feature for this, token authentication: https://www.django-rest-framework.org/api-guide/authentication/#tokenauthentication. The token can be created in the admin:

Then, to get access to the endpoint, you just need to pass the token in the request header.
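
By default, Django REST Framework's `TokenAuthentication` expects the key in the `Authorization` header, prefixed with the word `Token`. Here is a minimal client-side sketch using only the standard library. The endpoint URL and the `image_url` field name are hypothetical, since the article does not show the API itself.

```python
import json
import urllib.request

API_URL = 'https://example.com/api/detect/'  # hypothetical endpoint
API_TOKEN = 'replace-with-your-token'        # the key created in the Django admin


def build_detect_request(image_url: str, token: str = API_TOKEN) -> urllib.request.Request:
    """Build (but do not send) an authenticated POST request."""
    body = json.dumps({'image_url': image_url}).encode('utf-8')
    return urllib.request.Request(
        API_URL,
        data=body,
        method='POST',
        headers={
            # DRF's TokenAuthentication expects the "Token " prefix by default.
            'Authorization': f'Token {token}',
            'Content-Type': 'application/json',
        },
    )


request = build_detect_request('https://example.com/bottle.jpg', token='abc123')
# urllib.request.urlopen(request) would actually send it.
```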

I’d rather stop the article here, halfway through. More articles are coming soon:

* Stitching several images into a long one
* How to register a company in Delaware
* A little about B2B and SaaS promotion