An Introduction to Automation in Vision AI

Written by hastyai | Published 2019/11/25
Tech Story Tags: automation | vision-ai | ai | artificial-intelligence | annotation-automation | automation-in-vision-ai | data-science | data

TLDR This is a high-level exploration of the most common ways of implementing image-based deep learning (often referred to as image-based AI), basic annotation approaches, types of annotation and levels of automation for this task. This article is intended to introduce topics that we will dive deeper into in follow-up posts. For the sake of sanity, we have simplified some of the concepts below. The best annotators in the world have a 4–6% error rate while the average person has around 8–9%. This error rate makes a significant difference in the performance of the resulting AI.

Levels of Annotation Automation

Preamble
This post is a high-level exploration of the most common ways of implementing image-based deep learning (often referred to as image-based Artificial Intelligence or AI), basic annotation approaches, types of annotation and levels of automation for this task.
This article is intended to introduce topics that we will dive deeper into in follow-up posts. It can be used as a helpful guide for people looking to implement image-based AIs, or who are starting their research and coming to grips with the buzzwords being thrown around. For the sake of sanity, we have simplified some of the concepts below.

Introduction to annotation (a.k.a. labelling)

Image-based AIs are trained using labelled data, also referred to as ‘ground truth’, ‘labelled’ or ‘annotated’ data. There are multiple types of annotation for different data science models, including ‘key-point’ annotation, ‘interpolation’, ‘pose estimation’, and so on. For the purpose of this post, we will focus on the four most commonly used types of annotation (Figure 1):
Figure 1 — types of annotation (not an exhaustive list)
  • Classification (often referred to as tagging)
    This is useful to get a quick indication of the attributes of an image, such as the existence of an object, the mood, or the background. It is the simplest form of annotation and the one we see in things like Google’s CAPTCHA. However, its functionality is limited: the position, shape and unique attributes of objects remain unknown, and millions of images would need to be annotated to learn this detail reliably with this method.
  • Object detection (a.k.a. bounding boxes)
    This is useful to locate discrete objects in an image. The annotation is relatively simple, as one simply has to draw a tight box around the intended object. The benefits are that storing this information and the required computations are relatively light. The drawback is that the ‘noise’ in the box — the ‘background’ captured — often interferes with the model learning the shape and size of the object. Thus, this method struggles when there is a high level of ‘occlusion’ (overlapping or obstructed objects), or when there is high variance in the shape of an object and that information is important — think of types of biological cells or dresses.
    Object detection — the ‘noise’ is the sand included in the bounding box
  • Semantic segmentation
    This is useful in indicating the shape of something where the count is not important, such as the sky, the road or simply the background. The benefit here is much richer information on the entire image, as you annotate every pixel; your goal is to know exactly where regions are and what shape they have. The challenge with this method is that every pixel needs to be annotated, and the process is time-consuming and error-prone.
  • Instance segmentation
    This is useful in indicating discrete objects such as car 1, car 2, flower a, flower b or actuator. The benefits are that the shapes and attributes of objects are learnt far faster, having to be shown fewer examples, and occlusions are handled much better than with object detection. The challenge is that this method has [spoiler alert: or should we say had… ;)] a very time-consuming and error-prone annotation process.
    NOTE: the latest method, ‘panoptic’ segmentation, combines semantic and instance segmentation into a single model.
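To make these four annotation types more concrete, here is a minimal sketch of how a single annotated image could be represented for each of them. The field names and values below are illustrative placeholders (loosely COCO-inspired), not the schema of any particular tool.

```python
# Illustrative (hypothetical) annotation records for one 640x480 beach photo.
# Field names are loosely COCO-inspired; real tools each have their own schema.

classification = {
    "image_id": 1,
    "tags": ["beach", "sunny", "starfish"],  # whole-image labels only
}

object_detection = {
    "image_id": 1,
    "boxes": [
        {"label": "starfish", "bbox_xywh": [210, 150, 90, 80]},  # x, y, width, height
        {"label": "starfish", "bbox_xywh": [400, 300, 70, 65]},
    ],
}

semantic_segmentation = {
    "image_id": 1,
    # one class label per pixel, e.g. a 480x640 array of {0: background, 1: sand, 2: sky}
    "mask_shape": [480, 640],
    "class_map": {0: "background", 1: "sand", 2: "sky"},
}

instance_segmentation = {
    "image_id": 1,
    "instances": [
        # each discrete object gets its own polygon (or per-pixel mask)
        {"label": "starfish", "polygon": [[215, 155], [290, 160], [295, 225], [220, 228]]},
        {"label": "starfish", "polygon": [[405, 305], [465, 310], [468, 360], [410, 362]]},
    ],
}
```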

The challenges of Segmentation

Manual segmentation — label an object in a minute
As you can see, instance and semantic segmentation are time-consuming, as one needs to manually outline the exact target object — point for point with a ‘polygon’, or even pixel for pixel with a ‘mask’. This is also why it is so error-prone. In fact, the best annotators in the world have a 4–6% error rate while the average person has around 8–9%. This error rate makes a significant difference in the performance of the resulting AI and is often what blocks projects from making it through the proof-of-concept phase.
Now imagine that the target objects are complex, such as organic cells or mechanical items. Further, what if the margin for error is slim because the consequences of a wrong decision from the model can be dire or even fatal? Usually, it is in these non-trivial cases that segmentation has the most utility and is required for you to achieve a high-performing model.
70% of the work required to build an image-based AI is annotation work. If you see an AI working in practice (e.g. autonomous driving), then know that it has taken millions of hours for people to create enough labelled data to train that neural network to a point where the team felt confident enough to put it into production. Even then, there is more often than not a need to relabel or label additional data after the model is deployed.
The benefit of automating this manual work is highest when experts are needed to annotate the images. Typical use cases include medical and biological imaging, robotics, quality assurance, advanced materials and agriculture. Think of cases where you are building an AI to assist a human who took many years to become an expert in that domain.

Levels of automation

The goal of automation in machine vision is to determine the outline of an object with the fewest inputs possible. For this section, we will largely be referring to automating segmentation tasks, as this is generally the most labour-intensive.
Levels of automation in this context can be outlined as estimating the outline of:

Level 1: a single object in a single image

Level 2: multiple objects in a single image

Level 3: multiple objects in multiple images

The ultimate goal is to accurately estimate the outline of all objects in all images for a given project.

Level 1 — annotate an object in a matter of seconds

Level 1 tools aim to automate the annotation of a single object as much as possible. They range from classic computer vision methods popularised by the well-known ‘OpenCV’ framework and tools known from Photoshop, to some based on novel AI approaches. Examples of Level 1 tools include (see the short code sketch after this list):
  • Contour | looks at outlines based on contrasts
    > great for objects on a contrasting background
  • GrabCut | extracts the background from the foreground for a predefined region
    > great for objects on a monochromatic background
  • Magic wand | selects an area by finding similar pixels near the selected pixel for a given range
    > great for monochromatic (or close to) objects
  • DEXTR | uses a model trained on a large generic dataset to attempt to identify the outline of an object within a defined region
    > great for dynamic objects on dynamic backgrounds
DEXTR — label a full image in minutes
NOTE: annotation tools often claim ‘automated labelling’ with features like DEXTR. However, it is still a manual tool, reliant on having been previously trained on generic datasets, that gives you one suggestion per object. Don’t get us wrong: this tool is great and has its uses in getting to Level 1 automation, but it is a far cry from complete ‘automated labelling’.
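As a rough illustration of the first two tools in the list, the sketch below uses OpenCV’s contour detection and GrabCut. The image path and rectangle coordinates are placeholders, and this is only a sketch of the underlying methods, not any specific annotation tool’s implementation.

```python
import numpy as np
import cv2  # assumes OpenCV 4.x (pip install opencv-python)

# Placeholder image path — replace with your own data.
img = cv2.imread("beach.jpg")

# 1) Contour: threshold on contrast, then trace outlines.
#    Works best for objects on a contrasting background.
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
_, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
print(f"Found {len(contours)} candidate object outlines")

# 2) GrabCut: separate foreground from background inside a user-drawn box.
#    Works best for objects on a (near) monochromatic background.
mask = np.zeros(img.shape[:2], np.uint8)
rect = (50, 50, 300, 300)  # x, y, w, h of the annotator's box (placeholder values)
bgd_model = np.zeros((1, 65), np.float64)
fgd_model = np.zeros((1, 65), np.float64)
cv2.grabCut(img, mask, rect, bgd_model, fgd_model, 5, cv2.GC_INIT_WITH_RECT)
foreground = np.where((mask == cv2.GC_FGD) | (mask == cv2.GC_PR_FGD), 1, 0).astype(np.uint8)
print(f"GrabCut marked {int(foreground.sum())} pixels as probable foreground")
```

Both turn a single click or box into a rough outline of one object, which is exactly the kind of per-object assistance Level 1 describes; magic wand and DEXTR follow the same pattern with different machinery.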

Level 2 — annotate a full image in a matter of seconds

At this level, you try to annotate all objects in an image in one action. This is close to the current cutting edge of deep learning. The time savings compared to Level 1 are drastic, as human input decreases radically. However, this automation requires a higher level of confidence than Level 1. The implication is that one starts an annotation project using Level 1 tools until Level 2 tools are ready to be deployed.
Instance segmentation assistant — label a full image in a few seconds
Level 2 automation is achieved with the use of AI assistants. These assistants learn in the background while you annotate. When they have reached a certain confidence score, you as a user can start to use them and get suggestions not only for individual objects but for a complete image. The assistant retrains and improves as more images are completed.
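A minimal sketch of the idea behind such an assistant follows. The `model` object, its `predict` and `retrain` methods, the `review_fn` callback and the threshold values are all hypothetical assumptions for illustration, not a specific product’s API.

```python
# Hypothetical sketch of a Level 2 assistant loop.

CONFIDENCE_THRESHOLD = 0.8   # only surface suggestions above this score
RETRAIN_EVERY = 50           # retrain after this many newly completed images

def annotate_with_assistant(images, model, review_fn):
    accepted = []
    for i, image in enumerate(images, start=1):
        suggestions = model.predict(image)  # [(polygon, label, confidence), ...]
        confident = [s for s in suggestions if s[2] >= CONFIDENCE_THRESHOLD]

        # The human reviews the whole-image suggestion and corrects it if needed.
        final_annotation = review_fn(image, confident)
        accepted.append((image, final_annotation))

        # The assistant keeps learning in the background as images are completed.
        if i % RETRAIN_EVERY == 0:
            model.retrain(accepted)
    return accepted
```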

Level 3 — annotate a full image batch/project in a matter of seconds

When annotation has been automated to this level, you as a user should be able to annotate a collection of images or even a complete project in a matter of seconds. What is expected here is that you just click a button, and all images in a project get annotated.
Finish an entire dataset in seconds…
Although extremely powerful, using Level 3 tools also comes with challenges. For example, if you annotate a dataset containing 10 000 images of animals where 1 000 have already been annotated, and the Level 3 tool has a hard time differentiating between frogs and toads, the 9 000 images that you auto-annotate with the tool might have serious quality issues. What should be classified as frogs are now toads, and vice versa, and the annotations made are unusable. This is a classification error — only one of four types of error that can occur. The others are generating artefacts, inaccurate segmentations, or missing objects altogether.
Thus, to use a Level 3 tool, you need to be very certain that the results will be accurate and the error percentage very low (<0.5%). This certainty can be reached by taking user behaviour during Level 2 automation into account (for example, whether users make minor or no adjustments to Level 2 suggestions) and by looking at things like model confidence levels.
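One way to reason about that gating decision is sketched below, assuming we can measure how often Level 2 suggestions were accepted without edits and estimate a per-image error rate. The thresholds and inputs are illustrative, not Hasty’s actual criteria.

```python
# Hypothetical gate for enabling Level 3 (whole-project) automation.

MIN_ACCEPTANCE_RATE = 0.95      # share of Level 2 suggestions accepted unedited
MAX_ESTIMATED_ERROR = 0.005     # < 0.5% expected errors, as discussed above

def level3_ready(unedited_accepts: int, total_suggestions: int,
                 estimated_error_rate: float) -> bool:
    if total_suggestions == 0:
        return False
    acceptance_rate = unedited_accepts / total_suggestions
    return (acceptance_rate >= MIN_ACCEPTANCE_RATE
            and estimated_error_rate <= MAX_ESTIMATED_ERROR)

# Example: 970 of 1 000 suggestions accepted without edits, ~0.4% estimated errors.
print(level3_ready(970, 1000, 0.004))  # True
```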
At Hasty, we are working towards a Level 3 tool, but it is still under development and will need a few more months before we introduce it to users. This is where features like our ‘Error finder’ become critical, which will be a topic for a whole new post…

If you want to know more, then please get in touch at [email protected] or start your project now for free at https://hasty.ai/

