Effective Management of Data Sources in Machine Learning

Written by kfedorenko | Published 2023/05/22

TL;DR: Machine learning and artificial intelligence have become buzzwords as tools based on these technologies, like ChatGPT, have become widely available. In this article, I aim to provide insights into the backstage of machine learning, focusing on data, the essential fuel for any ML model. I will share the lessons I learned while working with human annotators, enabling ML engineers to gather clean and high-quality data more efficiently.

Machine learning and artificial intelligence have become buzzwords as tools based on these technologies, like ChatGPT, became available to the general public.
In this article, I would like to take a peek into the ML backstage, especially the data it ingests: the essential fuel for any ML model, no matter what it is applied to.
There are two major aspects of data when it comes to selecting and processing it: quality and quantity. There is also a third factor affecting the practical side of ML: economic feasibility. I would like to share my personal experience of working with different data sources and finding the case-tailored balance between data quality, quantity, and available resources.

Types of Data Sources

Two primary data types exist, each requiring a distinct processing approach before it can be used in machine learning. Firstly, there is user-generated data, which entails meticulous logging of user activity and behavior. In domains like online advertising, actions such as clicks and conversions are collected, logged, and transformed into labeled data.
Secondly, there is data that has no inherent labels but is still needed to train an ML model. To address this, manual labeling by human annotators becomes crucial. A prime example is the MNIST dataset, a renowned computer vision dataset compiled and labeled by humans.
Another strategy for handling non-labeled data emerges when resource constraints hinder staffing sufficient annotators for labeling training data. In such cases, the proxy-label method can be employed. For instance, users can report inappropriate content, which can then be labeled accordingly. However, this approach may introduce noise and occasionally require the attention of the annotators' team.
Now, let's delve into the approaches for effectively managing the most challenging data type: non-labeled data.

Gathering Datasets with Human Annotators

Building datasets employing human annotators is a commonly used technique in machine learning. Its two main limitations are cost and time consumption. 

Costs can turn into a major challenge if a dataset is large and processing it means hiring many people. Also, the nature of data may require a high level of expertise and, consequently, employing expensive specialists. For instance, annotating medical images or legal documents can only be done by highly trained staff. 
Time can be an issue for several reasons. First, it is obvious that large datasets take a long while (or an impractical number of people) to process. Sometimes it can also be difficult to recruit annotators willing to engage in long-term projects. Second, annotators should be adequately trained to deliver the desired level of quality. Such training can be time-consuming, especially when high expertise levels are required.

Let me share the lessons I learned while working on an image quality classifier for an online shop. Our solution was built to automatically detect blurry images, cropped images, and images with an unprofessional background. In order to accomplish this, we needed to gather a training sample. Here is what made the process more effective:
Annotate data in batches 
As always, we had a limited budget for human raters, so splitting the annotation process into batches was extremely useful. Initially, our raters weren't producing high-quality results, and when we discovered this, we had only spent about 5% of the budget. To address the issue, we collaborated with the team of annotators to refine the guidelines and the process in order to reduce the number of errors (more on this in the next section).
Detecting the data quality issue early on allowed us to allocate the remaining 95% of the budget to a well-trained team of human annotators.
Another advantage of sending data in batches was that it enabled us to implement active learning. If we had sent the entire dataset to the annotators all at once, we wouldn't have been able to use this strategy.
Sample batches with active learning
Random sampling is generally a good initial approach, but for this particular project it posed two problems:
  • The negative class was overrepresented, because good images were the majority in the database;
  • We could not easily add "difficult" images (those with borderline predictions of about 0.5) to the training sample.
The solution was to sample with active learning.
We followed three steps (a minimal code sketch follows the list):
  1. Randomly sampled some images;
  2. Got predictions for every image using the current model;
  3. Sorted images by predictions in descending order, and sent images with the highest probabilities OR borderline images to human annotators.
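Here is a minimal sketch of that sampling loop, assuming a `predict_proba` callable that returns the current model's score for an image; the pool size, batch size, and borderline band are illustrative placeholders rather than the values we actually used:

```python
import random

def sample_batch_for_annotation(image_ids, predict_proba, batch_size=500,
                                pool_size=10_000, borderline_band=(0.4, 0.6)):
    """Select the next batch for human annotators: a mix of likely-positive
    and borderline ("difficult") images, scored by the current model."""
    # 1. Randomly sample a candidate pool from the full image set.
    ids = list(image_ids)
    pool = random.sample(ids, min(pool_size, len(ids)))

    # 2. Score every candidate with the current model.
    scored = [(img_id, predict_proba(img_id)) for img_id in pool]

    # 3. Sort by predicted probability, descending.
    scored.sort(key=lambda pair: pair[1], reverse=True)

    # Highest-probability candidates help rebalance the rare positive class.
    likely_positive = {img_id for img_id, _ in scored[: batch_size // 2]}

    # Borderline candidates (predictions around 0.5) are the hard cases.
    low, high = borderline_band
    borderline = [img_id for img_id, p in scored
                  if low <= p <= high and img_id not in likely_positive]

    return list(likely_positive) + borderline[: batch_size // 2]
```

Re-training the model on each freshly annotated batch before sampling the next one is what makes this loop "active".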
Switching from random sampling to an active learning strategy enabled us to increase the percentage of the positive class:
  • 2x for images with an unprofessionally made background;
  • 4x for partially displayed images;
  • 33% for blurry images.

Track annotators’ quality
It's quite common when working with human annotators to collect several responses per item and assign the final label by majority vote.
For example, let's assume we collect three responses per image, and if at least two answers indicate that the image is blurry, we assign "blurry" as the final label. Intuitively, it seems like this procedure should significantly increase the accuracy of the data.
However, if we do a simple calculation, we will see that the increase in the probability of the final label being correct is modest, and it is more important to work on the accuracy of the individual annotators.
Let's assume each annotator gives a correct answer to a given question with probability p. Now let us calculate the probability q that at least two of the three raters give a correct answer. q is the sum of the probabilities of two events: all three raters being correct (p^3) and exactly two raters being correct (3 * p^2 * (1 - p)), so q = p^3 + 3 * p^2 * (1 - p). Here is how q changes depending on p:
  • p = 0.70 gives q ≈ 0.78;
  • p = 0.80 gives q ≈ 0.90;
  • p = 0.90 gives q ≈ 0.97;
  • p = 0.95 gives q ≈ 0.99.
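As a quick sanity check, a few lines of Python reproduce these numbers from the formula above (the sample values of p are illustrative):

```python
def majority_correct_prob(p: float) -> float:
    """Probability that at least two of three independent raters are correct,
    given each rater is correct with probability p."""
    return p**3 + 3 * p**2 * (1 - p)

for p in (0.70, 0.80, 0.90, 0.95):
    print(f"p = {p:.2f} -> q = {majority_correct_prob(p):.3f}")
```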
As these numbers show, majority voting over three responses does not improve dramatically on the accuracy of a single annotator. That is why it is very important to track the quality of each annotator.
In our case, the following process helped:
  1. Introducing a "golden set" created by well-trained annotators. We used this set to calculate accuracy for each annotator assigned to our project (a minimal sketch of this check follows the list);
  2. Having a biweekly AMA where annotators could ask questions about controversial images;
  3. Introducing a final exam and a minimum performance threshold: an annotator could only start rating after passing the exam with a score above the threshold.
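Below is a minimal sketch of the golden-set check, assuming responses arrive as simple (annotator, item, label) tuples; the data layout, names, and the 0.9 cut-off are illustrative assumptions, not our production setup:

```python
from collections import defaultdict

def annotator_accuracy(golden_labels, annotator_responses):
    """Accuracy of each annotator against a golden set.

    golden_labels: {item_id: true_label}, produced by well-trained annotators.
    annotator_responses: iterable of (annotator_id, item_id, label) tuples.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for annotator_id, item_id, label in annotator_responses:
        if item_id not in golden_labels:
            continue  # only score items that belong to the golden set
        total[annotator_id] += 1
        correct[annotator_id] += int(label == golden_labels[item_id])
    return {a: correct[a] / total[a] for a in total}

# Illustrative usage: flag annotators whose golden-set accuracy is below the threshold.
THRESHOLD = 0.9
scores = annotator_accuracy(
    {"img_1": "blurry", "img_2": "ok"},
    [("alice", "img_1", "blurry"), ("alice", "img_2", "ok"),
     ("bob", "img_1", "ok")],
)
flagged = [a for a, acc in scores.items() if acc < THRESHOLD]
```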

Reducing human involvement 

Sometimes, a human-based approach may fail or become too impractical to employ. In many cases, an elegant workaround can bail you out. Let us have a look at a couple of them:

Proxy Values
Sometimes we can get creative and use a proxy value for a label instead of building a dataset with human annotators. For instance, I once worked on automation that would allow us to identify family-friendly properties listed on an accommodation-booking website at scale.
There are several proxy signals for this case. We can look at the review ratings left by travelers and label all the hotels with good ratings left by families as "family-friendly"; a "non-family-friendly" dataset can be obtained in a similar way. As an alternative, we can simply measure the share of family bookings: if the share of such bookings over a set period (say, a year) exceeds a certain threshold, we can label the accommodation as family-friendly. This works because people who book hotels usually do extensive research, and we simply reuse their research results as expressed in their booking decisions.
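Here is a minimal sketch of the booking-share proxy, assuming booking logs are available as (property_id, is_family_booking) pairs; the 0.3 threshold and field names are illustrative assumptions, not values from the original project:

```python
from collections import defaultdict

def proxy_label_family_friendly(bookings, threshold=0.3):
    """Derive a proxy label per property from raw booking logs.

    bookings: iterable of (property_id, is_family_booking) tuples covering a
    set period (say, a year).
    Returns {property_id: 1 if labeled family-friendly, else 0}.
    """
    family_count = defaultdict(int)
    total_count = defaultdict(int)
    for property_id, is_family_booking in bookings:
        total_count[property_id] += 1
        family_count[property_id] += int(is_family_booking)
    return {
        pid: int(family_count[pid] / total_count[pid] >= threshold)
        for pid in total_count
    }

# Illustrative usage: "h42" gets label 1 because 2 of its 3 bookings were family bookings.
labels = proxy_label_family_friendly(
    [("h42", True), ("h42", True), ("h42", False), ("h7", False)]
)
```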

Data augmentation
In this case, data from an already processed dataset is reproduced in an altered form and then added back to the training set. Working on my image quality project, I took the existing good photos and used basic graphic tools to make them "bad". In particular, I blurred the good images and added these "bad" images to the training set. I did the same with random cropping, so that an image displayed only part of the product. These simple transformations brought me a +8.08% improvement in ROC AUC.
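A minimal sketch of such transformations with Pillow might look like this; the blur radius, crop ratio, and file path are illustrative, not the exact settings from my project:

```python
import random
from PIL import Image, ImageFilter

def make_blurry(image: Image.Image, radius: float = 4.0) -> Image.Image:
    """Turn a good photo into a synthetic 'blurry' negative example."""
    return image.filter(ImageFilter.GaussianBlur(radius=radius))

def make_cropped(image: Image.Image, keep_ratio: float = 0.6) -> Image.Image:
    """Keep a random window of the photo so only part of the product is visible."""
    w, h = image.size
    new_w, new_h = int(w * keep_ratio), int(h * keep_ratio)
    left = random.randint(0, w - new_w)
    top = random.randint(0, h - new_h)
    return image.crop((left, top, left + new_w, top + new_h))

# Usage: generate synthetic negatives from an existing good product photo.
good = Image.open("good_product_photo.jpg")  # illustrative path
blurry_negative = make_blurry(good)
cropped_negative = make_cropped(good)
```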

As you can see, human input in machine learning is essential. However, managing it may be a matter of survival for many tech projects. The rule of thumb is opting for fewer but more qualified annotators, setting up reference ("golden") datasets, and extracting as much as possible from the actions of users, employing them essentially as free annotators. If you want to entertain yourself with a mind game, I can suggest an interesting task. Imagine you manage a media platform that uses crowdsourcing, publishing written content produced by authors of different backgrounds, skills, and quality levels.
Initially, selecting, editing, and publishing good articles and rejecting bad ones was entrusted to human editors. But as your platform gains popularity, editors become overwhelmed by the workload. Boosting your staff is not an option, as your budget is limited.
Try to think of ways to reduce the number of the editors' incoming tasks by means of ML:
  1. Rejecting AI-generated texts;
  2. Turning down texts that are mere compilations of older ones;
  3. Using historical data to predict articles' readability and user engagement, rate articles accordingly, and prioritize editors' tasks based on this rating.


Written by kfedorenko | ML eng at Meta, biking through London, mom of happy bunny Victor