Data Labeling for AI Products: How to Process Thousands of Data Labels, by @romangaev

Roman Gaev

Project manager at

Today, the AI industry is on the rise: the global AI market is valued at over $40 billion and continues to grow at a rapid pace. The industry relies on three vital components for its development: (1) the hardware that provides computational power and virtual living space for Machine Learning (ML) algorithms; (2) the ML algorithms themselves, which are code written by data scientists to instruct the machine on how to teach itself; and (3) the data that needs to be labeled and fed into the machine along with any accompanying training instructions, so that it has something to learn from. Without the data, the first two components can do as much as an automobile can with an empty tank: absolutely nothing.

As a result, data labeling has emerged as one of the bedrocks of the current AI boom, a statement supported by Gartner’s Hype Cycle Report for Data Science and ML.

Importantly, whereas both hardware solutions and training algorithms are obtainable commodities that are more or less the same among data scientists, the data annotation component still varies significantly. Different data-labeling approaches and methodologies exist, each one with its fortes and foibles.

Importance of data

The truth is that no matter the labeling approach, whoever can get their hands on more data and label it faster, cheaper, and more accurately has a competitive advantage. This is why the data-labeling market is also growing, at almost 30% per year. Some sources go as far as to predict that the global data-labeling market will exceed $8 billion by 2030. More is yet to come.

Below are a handful of recent case studies that show the power of data labeling in action.

Case study 1: Neatsy / content collection

Content collection can take many shapes – from voice assistants, chatbots, and web content extraction to field tasks for retail. In the case of Neatsy, an app that takes your measurements and finds you the right shoe size, it was about collecting 50,000 images of human feet.

The app’s main feature is a 3D scanner that creates a 3D model of the customer’s foot and then uses it to recommend the best-fitting shoes. The scanner is built on a neural network whose main task is to separate human feet from the floor. The company built its first MVP using around 1,000 labeled images. When Neatsy’s team realized that for their app to work they would need 50 times as many labeled images, they turned to AI and had all of the images labeled within a week.

Neatsy then requested an advanced assignment that required more precise foot outlines. This involved creating a special training pool for performers with a follow-up competency test. Another three weeks later, the company received all 50,000 labeled images.
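A competency test of this kind can be pictured as comparing each performer's outlines against reference outlines on a few control images. The sketch below is only an illustration under stated assumptions: the source does not describe how Neatsy's test was scored, so the intersection-over-union (IoU) metric, the 0.9 threshold, and all function names here are hypothetical.

```python
from statistics import mean


def iou(mask_a, mask_b):
    """Intersection over union of two masks given as sets of (row, col) pixels."""
    union = mask_a | mask_b
    if not union:
        return 1.0  # two empty masks agree perfectly
    return len(mask_a & mask_b) / len(union)


def passes_competency_test(submissions, references, threshold=0.9):
    """Return True if the performer's mean IoU on the control images meets the threshold.

    submissions and references map an image id to a set of foot pixels.
    A control image the performer skipped counts as an empty mask.
    The threshold value is an assumption, not a figure from the article.
    """
    scores = [iou(submissions.get(image_id, set()), ref)
              for image_id, ref in references.items()]
    return mean(scores) >= threshold


# Toy control set: one 2x2 "foot" region on a single image.
references = {"img_1": {(0, 0), (0, 1), (1, 0), (1, 1)}}
careful = {"img_1": {(0, 0), (0, 1), (1, 0), (1, 1)}}
sloppy = {"img_1": {(0, 0), (0, 1)}}

print(passes_competency_test(careful, references))  # True
print(passes_competency_test(sloppy, references))   # False
```

Only performers who pass would be admitted to the main labeling pool; a stricter threshold trades a smaller pool for more precise outlines.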

Neatsy noted that the app’s time to market was accelerated significantly and their 3D scanner became 12% more accurate.


Case study 2: Toloka / brand identity and decision-making

Social media marketing (SMM) teams need to engage and understand their target audience. There are a number of ways to do this: intent classification, sentiment analysis, utterance collection, and surveys.

One example is a company that needs to rebrand as it aims to become a globally recognizable business with its own corporate identity. As our team found when we went through our own rebranding earlier this year, agreeing on the right design can be challenging, especially when you have more than one design variant on your hands and more than one voice within your team, which is often the case.

With crowdsourcing, if you have multiple versions of your logo or other imagery, you can quickly engage with your audience and check in advance what your future customers think about your new corporate identity.

All it takes is getting the crowd to do a simple side-by-side comparison and rank your options from best to worst, with detailed explanations should they be required. This endeavor can take as little as 10 minutes.
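To make the aggregation step concrete, here is a minimal sketch of turning side-by-side votes into a ranking. It assumes each vote is recorded as a (winner, loser) pair and ranks options by simple win rate; the article does not say which aggregation method was used, and the option names and the `rank_by_win_rate` function are hypothetical.

```python
from collections import Counter


def rank_by_win_rate(votes):
    """Rank options from best to worst by the share of comparisons they won.

    votes is a list of (winner, loser) pairs, one per side-by-side comparison.
    Ties in win rate are broken alphabetically so the result is deterministic.
    """
    wins = Counter()
    appearances = Counter()
    for winner, loser in votes:
        wins[winner] += 1
        appearances[winner] += 1
        appearances[loser] += 1
    rates = {option: wins[option] / appearances[option] for option in appearances}
    return sorted(rates, key=lambda option: (-rates[option], option))


# Five crowd votes over three hypothetical logo variants.
votes = [
    ("logo_a", "logo_b"),
    ("logo_a", "logo_c"),
    ("logo_b", "logo_c"),
    ("logo_a", "logo_b"),
    ("logo_c", "logo_b"),
]
ranking = rank_by_win_rate(votes)
print(ranking)  # ['logo_a', 'logo_c', 'logo_b']
```

Win rate is the simplest choice and works when every option is shown roughly equally often. With uneven exposure, a pairwise model such as Bradley-Terry would be a more robust alternative.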


Case study 3: Handl / data extraction

Y Combinator-backed Handl is an algorithm-based solution that helps large companies analyze, categorize, and retrieve customer information from scanned documents in a matter of seconds.

The problem is that for Handl to be able to offer reliable results and maintain a competitive edge, the algorithm needs to be updated and retrained on a regular basis. This means lots of incoming data labeled consistently and accurately.

Handl used a crowdsourced approach to data labeling to its advantage: the company now collects more than 4,500 tasks a day, with each new document requiring 6-8 minutes to process, and all that at a cost that is 50% lower than the in-house route.
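A quick back-of-the-envelope calculation shows why this volume is hard to handle in-house. The sketch assumes that one task corresponds to one document and that an in-house labeler works an 8-hour shift; neither assumption comes from the article.

```python
TASKS_PER_DAY = 4500          # figure from the case study
MINUTES_PER_TASK = (6, 8)     # lower and upper bound from the case study
SHIFT_HOURS = 8               # assumption: one full-time labeling shift


def daily_workload(tasks, minutes_per_task, shift_hours):
    """Return (labeling hours per day, full-time shifts needed per day)."""
    hours = tasks * minutes_per_task / 60
    return hours, hours / shift_hours


low = daily_workload(TASKS_PER_DAY, MINUTES_PER_TASK[0], SHIFT_HOURS)
high = daily_workload(TASKS_PER_DAY, MINUTES_PER_TASK[1], SHIFT_HOURS)

print(f"Labeling hours per day: {low[0]:.0f} to {high[0]:.0f}")
print(f"Full-time shifts per day: {low[1]:.2f} to {high[1]:.2f}")
```

Under these assumptions the workload comes to 450 to 600 labeling hours a day, or roughly 56 to 75 full-time shifts, which is the kind of load a crowd can absorb in parallel.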


Takeaway: AI Data Labeling

In today’s world, ML products and data-driven business decisions require a stable, fast, and reliable flow of high-quality data. It is crucial to have a wide range of options, both unsupervised and supervised, that support your business with transparent production pipelines, high labeling accuracy, and rapid delivery.

