Multiple models trained on your data perform surprisingly poorly, despite having decent metrics on the validation set. The code seems fine, so you decide to take a closer look at your training data. You check a random sample - the label is wrong. So is the next. Your stomach sinks and you start looking through your data in batches*. Thirty minutes later, you realize that x% of your data is incorrect.
Unfortunately, I've been in this situation a few too many times. Creating datasets is hard, even for relatively simple tasks like document classification or sentiment analysis. Not only do you have to worry about things like annotator bias, but label noise is insidious. You can measure label noise with annotator agreement metrics**, but even a relatively high agreement score can leave ~10% of your data wrong. A good example of this is SST5, where classification performance is significantly worse than on SST2.
This problem gets even worse if you don't have clearly defined, orthogonal classes. It's easy to come up with a labelling scheme that has a large amount of overlap between classes, and that usually produces a lot of variance between annotators, since each annotator will overrepresent a single class. For example, if you're building sentiment analysis models for survey results, how do you deal with the fact that 30% of your data might have customers complaining about your after-sales support while also complimenting you for being the most affordable option on the market? Do you switch to an aspect-based sentiment approach? If not, do you mark those samples as neutral? Whichever judgement call you make, how do you ensure that your annotators are aligned enough to make the same call?
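One quick way to spot this kind of skew, and this is just a sketch I'd reach for rather than anything rigorous, is to compare each annotator's label distribution: if one annotator hands out "neutral" far more often than the others, the scheme probably isn't well defined. Assuming your annotations can be exported as a flat table of (annotator, label) rows:

```python
import pandas as pd

# Hypothetical annotation export: one row per labelled sample.
annotations = pd.DataFrame({
    "annotator": ["a1", "a1", "a1", "a2", "a2", "a2", "a3", "a3", "a3"],
    "label":     ["pos", "neg", "neutral", "neutral", "neutral", "pos", "pos", "neg", "pos"],
})

# Per-annotator label distribution: each row sums to 1.
dist = pd.crosstab(annotations["annotator"], annotations["label"], normalize="index")
print(dist.round(2))

# Flag annotators whose share of any class deviates a lot from the group mean.
deviation = (dist - dist.mean()).abs()
print(deviation[deviation > 0.2].dropna(how="all"))
```

On a real dataset you'd obviously use more than a handful of rows per annotator, but the shape of the check stays the same.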
A lot of companies I've done contract work for default to MTurk for annotations. I think this is a particularly bad idea - MTurk can be very unreliable and takes a lot of effort before you get high-quality annotations out of it. With MTurk, you don't have much ability to train the annotators and align them with your requirements, and while you can restrict the regions you source workers from, you can't be sure they have the kind of knowledge you need without directly testing for it. Finally, if you do decide to test it, how do you ensure that the care and attention annotators pay to the qualification tests is maintained through every sample they annotate?
In my experience, the best thing to do is hire your own set of annotators. That's hard too, but once you eventually figure out the hiring, here are some things I've learned from managing annotators.
*This is a decent way to get an idea of your data quality: you count the number of errors per batch and keep evaluating batches until the average error rate stabilizes. If you want to be very thorough, you can then run a significance test to make sure the average is representative, but it usually is.
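For what it's worth, here's a rough sketch of that loop, assuming you can manually mark each checked sample as correct or not (the is_mislabelled helper, batch size, and stopping tolerance are all placeholders, not anything prescriptive):

```python
import math
import random
import statistics

# Placeholder: in practice this is you, manually re-checking the label.
def is_mislabelled(sample):
    return sample["label"] != sample["true_label"]

def estimate_error_rate(data, batch_size=50, tolerance=0.01, max_batches=20):
    """Check random batches until the running average error rate stabilizes."""
    batch_rates = []
    for _ in range(max_batches):
        batch = random.sample(data, batch_size)
        errors = sum(is_mislabelled(s) for s in batch)
        batch_rates.append(errors / batch_size)
        if len(batch_rates) >= 3:
            prev_avg = statistics.mean(batch_rates[:-1])
            curr_avg = statistics.mean(batch_rates)
            # Stop once another batch barely moves the running average.
            if abs(curr_avg - prev_avg) < tolerance:
                break
    return statistics.mean(batch_rates), batch_rates

# Usage on a made-up dataset, plus a crude 95% confidence interval
# (normal approximation) as the "is this average representative?" check.
data = [{"label": random.choice(["pos", "neg"]), "true_label": "pos"} for _ in range(5000)]
rate, rates = estimate_error_rate(data)
se = statistics.stdev(rates) / math.sqrt(len(rates))
print(f"estimated error rate: {rate:.1%} +/- {1.96 * se:.1%}")
```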
**Agreement metrics essentially evaluate what proportion of your labels annotators agree on, after adjusting for the agreement you'd expect from pure chance.
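Cohen's kappa is the textbook example of such a chance-corrected metric for two annotators (I'm using it purely as an illustration here). A minimal sketch with scikit-learn and made-up labels:

```python
from sklearn.metrics import cohen_kappa_score

# Labels assigned by two annotators to the same ten samples (made-up data).
annotator_1 = ["pos", "pos", "neg", "neutral", "pos", "neg", "neg", "pos", "neutral", "pos"]
annotator_2 = ["pos", "neg", "neg", "neutral", "pos", "neg", "pos", "pos", "neutral", "pos"]

# Raw agreement: fraction of samples where the two annotators match.
raw = sum(a == b for a, b in zip(annotator_1, annotator_2)) / len(annotator_1)

# Cohen's kappa: agreement after subtracting what chance alone would produce.
kappa = cohen_kappa_score(annotator_1, annotator_2)

print(f"raw agreement: {raw:.2f}, Cohen's kappa: {kappa:.2f}")
```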