Finding, creating, and annotating training data is one of the most intricate and painstaking tasks in machine learning (ML) model development. Many crowdsourced data annotation solutions employ inter-annotator agreement checks to make sure their labeling team understands the labeling tasks well and is performing up to the client’s standards. However, research suggests that self-agreement checks are as important as, or even more important than, inter-annotator agreement when evaluating your annotation team for quality.
In this article, we will explain what self-agreement is and introduce an ML study where self-agreement checks were crucial to the quality of the team’s training data and the accuracy of their model.
Simply put, self-agreement is a QA protocol you can use in data annotation to evaluate the abilities of individual annotators. Whereas inter-annotator agreement protocols check to see if two or more annotators agree with each other, self-agreement checks whether or not a single annotator is consistent in their own annotations.
For example, a simple inter-annotator agreement workflow would be to send two separate annotators the same piece of data. Then, you would check to see if their annotations are the same. If they are not the same, you could then bring that piece of data to a supervisor to make a ruling on which annotation is correct.
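The inter-annotator workflow described above can be sketched in a few lines of Python. This is a minimal illustration rather than a production pipeline: the function name `split_by_agreement`, the item IDs, and the dictionary layout (item ID mapped to label) are all assumptions made for the example.

```python
def split_by_agreement(labels_a, labels_b):
    """Compare two annotators' labels for the same items.

    labels_a, labels_b: dicts mapping item ID -> label (hypothetical layout).
    Returns (accepted, escalated): the labels both annotators agree on, and
    the IDs of items that need a supervisor's ruling.
    """
    accepted = {}
    escalated = []
    # Only items labeled by both annotators can be compared.
    for item_id in sorted(labels_a.keys() & labels_b.keys()):
        if labels_a[item_id] == labels_b[item_id]:
            accepted[item_id] = labels_a[item_id]
        else:
            escalated.append(item_id)
    return accepted, escalated


# Made-up labels for three tweets.
annotator_1 = {"t1": "positive", "t2": "negative", "t3": "neutral"}
annotator_2 = {"t1": "positive", "t2": "neutral", "t3": "neutral"}
accepted, escalated = split_by_agreement(annotator_1, annotator_2)
print(accepted)   # {'t1': 'positive', 't3': 'neutral'}
print(escalated)  # ['t2']
```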
On the other hand, with self-agreement protocols, you would send the same annotator the same piece of data twice to see if they provide the same label both times. For example, if they are tasked with annotating 100 images, you could make image 1 and image 35 the same image, compare the two labels, and repeat this process for many other pairs. In theory, you could send an annotator the same data more than twice, but the check becomes less informative once the annotator starts to realize that they’ve seen the data point before.
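One way to set up such a check is to slip a handful of repeated items into an annotator's queue and then compare the two labels given to each repeat. The sketch below assumes unique item IDs; the helper names `add_repeats` and `self_agreement_rate`, the `min_gap` spacing, and the seed are hypothetical choices for illustration. Note that it reports a raw agreement rate, which is simpler than the chance-corrected Alpha used in the study discussed later.

```python
import random


def add_repeats(items, n_repeats, min_gap=10, seed=0):
    """Return a queue in which n_repeats items appear twice.

    Each duplicate is inserted at least min_gap positions after the
    original so the annotator is less likely to notice the repeat.
    Assumes item IDs are unique.
    """
    rng = random.Random(seed)
    queue = list(items)
    # Only items early enough to leave room for the gap are eligible.
    eligible = queue[: max(0, len(queue) - min_gap + 1)]
    for item in rng.sample(eligible, n_repeats):
        first = queue.index(item)
        position = rng.randint(first + min_gap, len(queue))
        queue.insert(position, item)
    return queue


def self_agreement_rate(queue, labels):
    """Fraction of repeated items that received the same label both times.

    labels[i] is the label the annotator gave at queue position i.
    Returns None if the queue contains no repeats.
    """
    first_label = {}
    same = total = 0
    for item, label in zip(queue, labels):
        if item in first_label:
            total += 1
            same += label == first_label[item]
        else:
            first_label[item] = label
    return same / total if total else None


queue = add_repeats(range(100), n_repeats=5)
labels = [item % 3 for item in queue]  # stand-in for a perfectly consistent annotator
print(len(queue))                          # 105
print(self_agreement_rate(queue, labels))  # 1.0
```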
Why Use Self-Agreement in Your Data Annotation Workflow?
The point of self-agreement is to evaluate the consistency of each annotator and make sure they are labeling each piece of data carefully, and not simply rushing through the project to get it done as quickly as possible. Furthermore, a 2016 study provides evidence that self-agreement checks can help weed out low-quality annotators and improve the quality of your dataset.
In a 2016 sentiment analysis study by researchers from the Jozef Stefan Institute, the team found that the quality of human annotators could play a larger role in the accuracy of the model than the type of model itself.
The team’s goal was to create a sentiment classifier for Twitter posts in multiple languages, so they analyzed 1.6 million tweets in 13 different languages. These tweets were all labeled for sentiment by human annotators. Ultimately, the researchers said:
“Our main conclusion is that the choice of a particular classifier type is not so important, but that the training data has a major impact on the results.”
Many of the team’s insights were about the relationship between inter-annotator agreement and self-agreement and how those values relate to the quality of the data. Firstly, they found that self-agreement will almost always be higher than inter-annotator agreement.
Figure 1 from Multilingual Twitter Sentiment Classification
In the above diagram and the rest of this article, Alpha refers to Krippendorff’s Alpha, which is a coefficient used to calculate agreement between observers. An Alpha of 1 is the highest possible score and means perfect agreement, while an Alpha of 0 means agreement is no better than chance.
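For readers who want to see how such a coefficient is computed, here is a minimal sketch of Krippendorff's Alpha for nominal (unordered) labels, built from the standard coincidence-matrix formulation. It is an illustration under stated assumptions, not the study's own code: the function name and the input layout (one list of labels per item) are hypothetical, and a study on ordered sentiment classes may use a different distance function. For self-agreement, each inner list would hold the two labels one annotator gave to the same item; for inter-annotator agreement, it would hold the labels from different annotators.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's Alpha for nominal labels.

    units: list of lists; each inner list holds all labels given to one
    item (missing labels are simply omitted).
    """
    coincidences = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue  # an item with a single label cannot be paired
        # Every ordered pair of labels within an item counts 1 / (m - 1).
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1 / (m - 1)

    totals = Counter()
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())

    # Observed and expected disagreement (pairs with different labels).
    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = n * n - sum(t * t for t in totals.values())
    if expected == 0:
        raise ValueError("Alpha is undefined when only one label value occurs")
    return 1 - (n - 1) * observed / expected


# Made-up example: four items, each labeled twice.
pairs = [["pos", "pos"], ["neg", "neg"], ["pos", "neg"], ["neu", "neu"]]
print(round(krippendorff_alpha_nominal(pairs), 3))  # 0.667
```

In practice, an established library implementation is usually a safer choice than hand-rolled code, especially for ordinal or interval data.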
If you have low self-agreement, you will have even lower inter-annotator agreement in most, if not all, cases. As a result, self-agreement tests may be an easier and quicker way to track the overall quality of your dataset by analyzing the performance of your annotators.
For example, if you are aiming for an inter-annotator agreement Alpha of 0.6, but the self-agreement levels of most of your annotators are at 0.4, chances are you aren’t going to hit the inter-annotator agreement Alpha you were hoping for. Therefore, you may want to focus on raising self-agreement above your desired levels before proceeding with inter-annotator agreement checks.
In the study, one example of this was the low quality of the Spanish tweets sentiment dataset (see image above). The researchers found that the self-agreement was at 0.244, while the inter-annotator agreement was at 0.120.
As part of the overall project, the team created an Emojis Dataset which included tweets from various languages that had emojis. They collected 70,000 tweets in total across various languages. Around 20,000 of these tweets were from the poorly-annotated Spanish dataset mentioned above, which dragged the total self-agreement of the Emojis Dataset down to an Alpha of 0.544.
As a result, the Emojis Dataset (as seen in Figure 1 above) was the only dataset where the self-agreement was lower than the inter-annotator agreement. However, after removing all of the Spanish tweets from the Emojis Dataset, the Alpha of the Emojis Dataset jumped to 0.720.
This insight reconfirmed the team’s conclusion: “Low quality annotators have to be excluded and their annotations removed from the datasets.” This is especially true when you have a large project using multiple annotators or even multiple annotation teams.
Should You Use Self-Agreement or Inter-Annotator Agreement to Improve Training Data Quality?
The safe answer is: both. In most cases, you should not abandon inter-annotator agreement testing entirely. In fact, the research team stated that the two measures can provide you with different insights:
“It turns out that the self-agreement is a good measure to identify low-quality annotators, and that the inter-annotator agreement provides a good estimate of the objective difficulty of the task”
So if the self-agreement of an annotator is extremely low, then they either aren’t prepared for the labeling task or they’re simply the wrong person for your project. If inter-annotator agreement is low, but self-agreement is at acceptable levels, then the task is either too difficult or calls for subjective reasoning, as is often the case with sentiment classification projects.
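The reasoning above amounts to a simple decision rule, sketched below. The function name `diagnose` and both default thresholds are assumptions chosen purely for illustration; the study does not prescribe cutoff values, and yours should depend on the task.

```python
def diagnose(self_alpha, inter_alpha, min_self=0.6, min_inter=0.4):
    """Suggest where a quality problem most likely lies.

    min_self and min_inter are hypothetical thresholds, not values
    taken from the study.
    """
    if self_alpha < min_self:
        # The annotator is inconsistent with their own earlier labels.
        return "annotator"
    if inter_alpha < min_inter:
        # Individually consistent annotators who still disagree with
        # each other: the task is likely difficult or subjective.
        return "task"
    return "ok"


# Alpha values reported for the Spanish dataset in the study.
print(diagnose(self_alpha=0.244, inter_alpha=0.120))  # annotator
```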
There is no one-size-fits-all method when it comes to testing for inter-annotator agreement and self-agreement. It depends on the task and what your acceptable Alpha levels are. Throughout the paper, the researchers consistently emphasized that both measures should be monitored continually throughout the training data creation process.
Annotators should be informed when levels drop too low, and actions should be taken to maintain the quality of the training data. Sometimes that will unfortunately mean removing low-quality annotators from your project and labeling their data again with a better annotator.
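As a rough sketch of that last step, the snippet below flags annotators whose self-agreement falls under a cutoff and lists the items that would need to be labeled again. The function name `plan_relabeling`, the data layout, the 0.5 threshold, and all of the scores are made up for the example.

```python
def plan_relabeling(assignments, self_alpha, threshold=0.5):
    """Flag low-scoring annotators and collect the items they labeled.

    assignments: dict mapping item ID -> annotator ID (hypothetical layout).
    self_alpha: dict mapping annotator ID -> self-agreement score.
    threshold: hypothetical cutoff; choose one that suits your task.
    """
    flagged = {a for a, score in self_alpha.items() if score < threshold}
    to_relabel = sorted(
        item for item, annotator in assignments.items() if annotator in flagged
    )
    return flagged, to_relabel


# Made-up assignments and scores.
assignments = {"t1": "ann_a", "t2": "ann_b", "t3": "ann_a", "t4": "ann_c"}
self_alpha = {"ann_a": 0.31, "ann_b": 0.72, "ann_c": 0.65}
flagged, to_relabel = plan_relabeling(assignments, self_alpha)
print(flagged)     # {'ann_a'}
print(to_relabel)  # ['t1', 't3']
```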
Hopefully, this guide helped you understand the power of self-agreement checks and how they can improve the quality of your data. If you’re looking to learn more about how to improve your data, check out this in-depth training data guide.