AI and Crowdsourcing: Using Human-in-the-Loop Labeling by @dustalov


Dmitry Ustalov

Ph.D. in Natural Language Processing | Head of Research at Toloka.ai


AI today rests on three pillars – ML algorithms, the hardware on which they’re run, and the data for training and testing the models. While the first two pose no obstacle as such, obtaining high-quality up-to-date data at scale remains a challenge. One of the ways to resolve this is to adopt the data-centric approach to data labeling that entails building human-in-the-loop pipelines, i.e. hybrid pipelines that include both machine and human efforts.

Crowdsourcing case studies

Human-in-the-Loop (HITL) refers to a computational process that combines the efforts of human labelers and code, and that is normally managed by a human architect (a crowd solutions architect, or CSA).


Crowdsourcing is an online activity in which individuals perform tasks assigned to them on a platform. It is becoming more popular because it is cost- and time-effective. Let’s look at two categories of case studies in which crowdsourcing successfully aided AI production.

Case study #1: search relevance evaluation

Many industries today rely on recommender systems to support their business. Recommender systems are typically built on learning-to-rank algorithms, which are used in search engines (documents), e-commerce sites (shopping items), and social networks and sharing apps (images and videos). The main hurdle in testing and improving learning-to-rank systems is obtaining enough relevance data, which (by definition) consists of the subjective opinions of individual users.


We’ve found that the following pipeline makes testing and validation effective: it shortens testing periods from many weeks to just a few hours.


image

To follow it, you need to:


  • Perform stratified sampling of queries and documents.
  • Sample and annotate pairs of documents for each query (pairwise comparison).
  • Recover the ranked lists and compute ERR (Expected Reciprocal Rank), NDCG (Normalized Discounted Cumulative Gain), or any other applicable evaluation metric.
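
The last step can be illustrated with a minimal Python sketch of the two metrics. It assumes that each document in a ranked list already has a graded relevance label (0 = irrelevant, up to some maximum grade) derived from the human judgements; the function names and the grading scale are hypothetical and not part of the original pipeline. The formulas are the commonly used exponential-gain variants of NDCG and ERR.

```python
import math


def dcg(gains):
    """Discounted cumulative gain: (2^g - 1) / log2(rank + 1), summed over ranks."""
    return sum((2 ** g - 1) / math.log2(rank + 1)
               for rank, g in enumerate(gains, start=1))


def ndcg(gains):
    """DCG divided by the DCG of the ideal (descending) ordering of the same gains."""
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0


def err(gains, max_grade):
    """Expected Reciprocal Rank under the cascade model of user behaviour."""
    score, p_continue = 0.0, 1.0
    for rank, g in enumerate(gains, start=1):
        # Probability that the user is satisfied by the document at this rank.
        p_stop = (2 ** g - 1) / 2 ** max_grade
        score += p_continue * p_stop / rank
        p_continue *= 1 - p_stop
    return score


# Hypothetical grades of the documents in the order the system ranked them.
system_ranking_grades = [2, 3, 0, 1]
print(ndcg(system_ranking_grades), err(system_ranking_grades, max_grade=3))
```

Both functions take the grades in the order produced by the system under test, so a perfectly ordered list yields an NDCG of 1.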


In a pairwise comparison, the annotator is shown two objects and picks the one they prefer. While the task looks simple, it allows us to address a number of complex problems, including information retrieval evaluation. Since we’re annotating pairs of objects, we need to aggregate these comparisons into ranked lists for further use.


image

To do that and obtain an improved recommender system based on up-to-date human judgements, we need to:


  • Pick an aggregation method for subjective judgements (for example, Bradley-Terry) and transform the comparisons into a ranked list.
  • Compute evaluation scores by comparing the system output against the aggregated human preferences.
  • Train a new version of your learning-to-rank model.
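
As an illustration of the aggregation step, here is a minimal Python sketch of the Bradley-Terry model fitted with the standard minorization-maximization updates. It is a sketch under stated assumptions, not the exact implementation used in production: the input format (a list of `(winner, loser)` pairs for a single query), the function name, and the fixed number of iterations are all hypothetical.

```python
from collections import defaultdict


def bradley_terry(comparisons, iterations=100):
    """Estimate Bradley-Terry scores from (winner, loser) pairs, winner != loser.

    Returns a dict mapping each item to a score; the scores sum to 1.
    """
    wins = defaultdict(int)
    games = defaultdict(lambda: defaultdict(int))
    for winner, loser in comparisons:
        wins[winner] += 1
        games[winner][loser] += 1
        games[loser][winner] += 1

    items = list(games)
    scores = {item: 1.0 / len(items) for item in items}
    for _ in range(iterations):
        updated = {}
        for i in items:
            # MM update: wins of i divided by the sum of n_ij / (p_i + p_j).
            denom = sum(n / (scores[i] + scores[j])
                        for j, n in games[i].items()
                        if scores[i] + scores[j] > 0)
            updated[i] = wins[i] / denom if denom > 0 else 0.0
        total = sum(updated.values())
        scores = {i: s / total for i, s in updated.items()}
    return scores


# Hypothetical annotations for one query: each tuple is (preferred, other).
votes = ([("doc_a", "doc_b")] * 3 + [("doc_b", "doc_a")]
         + [("doc_b", "doc_c")] * 3 + [("doc_c", "doc_b")]
         + [("doc_a", "doc_c")] * 3 + [("doc_c", "doc_a")])
scores = bradley_terry(votes)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

Note that an item that never wins a comparison gets a score of zero in this unregularized form, so in practice some smoothing or a library implementation may be preferable.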


Case study #2: business listings

Another common task used by many companies is spatial crowdsourcing, also known as field tasks. Spatial crowdsourcing is used to find information about brick-and-mortar stores (i.e. physical retail) for digital maps and directory services. Obtaining up-to-date information about such establishments is normally a huge challenge because of the large number of businesses that come and go or change their location on a regular basis.


Spatial crowdsourcing is a powerful HITL pipeline element that can successfully overcome this issue. Unlike the traditional survey-like crowdsourcing tasks, spatial tasks are shown on a map, so people can sign up to visit any number of locations to gather the latest information about a business required for the task (and, for example, take a photo).


Just like with pairwise comparison, this sounds deceptively simple, but it can actually help us resolve a number of extremely complex problems. We suggest using the following pipeline:


image

If the business in question can be located, the information is transcribed accordingly: the name of the business, telephone number, website, and working hours. ML algorithms are applied to retrieve company codes and other information. We ask the crowd whether this photo can be used as part of a map or directory service. In contrast, if the business in question cannot be located, we choose a different, more suitable photo.


This screenshot shows a typical spatial crowdsourcing task, in which one person takes a picture and another does the transcribing.

image

Important points to consider:


  • Submissions may need to be reviewed if some locations can’t be reached due to bad weather/traffic conditions, etc.
  • Free-form text responses are more difficult to process and may also require reviewing.
  • Fortunately, reviewing can be assigned to the crowd as a classification task that scales very well.
  • Subsequently, classical quality control or aggregation methods can be applied (for example, Dawid-Skene).
  • If done correctly, you’ll be able to update your database quickly and effectively.
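
To make the aggregation point concrete, below is a minimal Python sketch of Dawid-Skene, which uses expectation-maximization to estimate the true label of each task together with a confusion matrix per worker. The input format (a list of `(task, worker, label)` triples), the `accept`/`reject` labels, the smoothing constant, and the iteration count are assumptions made for illustration only.

```python
from collections import defaultdict


def dawid_skene(labels, iterations=20, smoothing=0.01):
    """Aggregate (task, worker, label) triples; returns task -> most probable label."""
    classes = sorted({label for _, _, label in labels})
    workers = {worker for _, worker, _ in labels}
    by_task = defaultdict(list)
    for task, worker, label in labels:
        by_task[task].append((worker, label))

    # Initialize the posteriors with per-task vote shares (soft majority vote).
    posteriors = {
        task: {c: sum(l == c for _, l in votes) / len(votes) for c in classes}
        for task, votes in by_task.items()
    }
    for _ in range(iterations):
        # M-step: class priors and one smoothed confusion matrix per worker.
        priors = {c: sum(p[c] for p in posteriors.values()) / len(posteriors)
                  for c in classes}
        confusion = {w: {c: {l: smoothing for l in classes} for c in classes}
                     for w in workers}
        for task, votes in by_task.items():
            for worker, label in votes:
                for c in classes:
                    confusion[worker][c][label] += posteriors[task][c]
        for worker in workers:
            for c in classes:
                row_total = sum(confusion[worker][c].values())
                for l in classes:
                    confusion[worker][c][l] /= row_total
        # E-step: posterior over the true class of every task.
        for task, votes in by_task.items():
            weights = {}
            for c in classes:
                weight = priors[c]
                for worker, label in votes:
                    weight *= confusion[worker][c][label]
                weights[c] = weight
            total = sum(weights.values())
            posteriors[task] = {c: v / total for c, v in weights.items()}
    return {task: max(p, key=p.get) for task, p in posteriors.items()}


# Hypothetical review votes: w1 and w2 are careful, w3 accepts everything.
review_votes = []
for task, careful in [("t1", "accept"), ("t2", "reject"),
                      ("t3", "accept"), ("t4", "reject")]:
    review_votes += [(task, "w1", careful), (task, "w2", careful),
                     (task, "w3", "accept")]
decisions = dawid_skene(review_votes)
print(decisions)
```

Compared with a plain majority vote, this approach down-weights workers whose answers carry little information, such as the hypothetical `w3` above.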


Takeaway: humans and machines

With human-in-the-loop data labeling, humans and machines complement each other, which results in simple solutions for a variety of difficult problems at scale.
