“There can never be too much relevant training data! Training machine learning models for scene understanding requires a lot of data, mainly because of the variability of input sensors and the environmental conditions introduced by both nature and man.”
Every company needs its own data to build models for its particular identification tasks. This need arises from the variability within each problem domain.
Options to Scale Training Datasets
- Open-source datasets
- Customized datasets with human validation
- Synthetic data generation
Although obtaining raw datasets is relatively easy nowadays, enriching or annotating this data poses a whole new set of logistical challenges. As mentioned in our previous blog post, there are three ways:
- Built-in image annotators (e.g., Facebook’s image tagging, Google’s reCaptcha)
- Traditional BPOs (e.g., you know better)
- Fully Managed Annotation experts (e.g., Playment)
But, how does one go about making a decision?
I hope that in the future there will be a central system that has seen so much data that fetching whatever data we need, in whatever quantity, in real time is just an API call away. But such a system does not exist today, so there is no escaping the human role in scaling training data.
What to look for when scaling up Training Data!
Scalable Trained Workforce + Seamless API integration + Dedicated Project Manager + Stringent QC to ensure Data Accuracy = High Quality Training Data at Scale
Here’s a use case from one of our customers:
“An autonomous-driving startup, building the AI brain for self-driving cars, needed a huge volume of annotated data for vehicles and pedestrians.”
Breaking this down, we got:
Traffic signs, pedestrians, different types of vehicles… around 45 different classes.
The problem with existing datasets is context. An autonomous vehicle may have been trained on, say, European roads, but what happens when you put the same car on American roads? The data would be contextually irrelevant for building localization and categorization models with high precision and recall.
So after freezing the scope of work, our project manager started analyzing the nuances and clarifying queries.
With our trained workforce in place, the remaining task was to train our users on the new cases and then on the qualifiers.
How does Playment fit in?
Getting hundreds of thousands of annotators is easy. The hard part is training them on the annotation tools and complex tasks.
Solving User Training Complexities
As the number of classes grows, user training becomes a real challenge for our project teams. So when there are many classes, we group them to ease user training.
As mentioned above, the total of M classes is clubbed into N groups. With these groups, we complete all object localization tasks first, and then break the groups back into the individual M classes to perform the categorization tasks.
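The two-stage grouping above can be sketched in a few lines. This is a minimal illustration, not Playment's actual pipeline; the taxonomy, function names, and task strings are all hypothetical examples:

```python
from collections import defaultdict

# Hypothetical taxonomy: fine-grained class -> coarse group.
# A real project would have ~45 classes; a few suffice to show the idea.
CLASS_TO_GROUP = {
    "sedan": "vehicle", "truck": "vehicle", "bus": "vehicle",
    "pedestrian": "person", "cyclist": "person",
    "stop_sign": "traffic_sign", "speed_limit_sign": "traffic_sign",
}

def group_classes(class_to_group):
    """Invert the mapping: coarse group -> list of its fine classes."""
    groups = defaultdict(list)
    for cls, grp in class_to_group.items():
        groups[grp].append(cls)
    return dict(groups)

def localization_tasks(groups):
    """Stage 1: one localization task per coarse group (draw boxes)."""
    return [f"localize:{grp}" for grp in sorted(groups)]

def categorization_tasks(groups):
    """Stage 2: each localized box gets a fine class within its group."""
    return [f"categorize:{grp}->{cls}"
            for grp in sorted(groups) for cls in sorted(groups[grp])]
```

Annotators thus learn only N coarse groups for the localization pass, and each categorization pass asks about a short list of classes instead of all M at once.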
- On an average day, we deliver 3000+ man hours
- Implemented consensus logic to qualify annotations
- We do daily in-house QC, and if there is any discrepancy, we redo the work completely.
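A consensus rule like the one mentioned above can be as simple as majority voting across independent annotators. The sketch below is an assumed illustration of the idea, not Playment's actual logic; the function name and the 0.7 agreement threshold are hypothetical:

```python
from collections import Counter

def consensus_label(labels, min_agreement=0.7):
    """Return the majority label if enough annotators agree, else None.

    `labels` holds the labels that independent annotators assigned to the
    same object. The annotation qualifies only when the most common label
    reaches the `min_agreement` fraction; otherwise it goes back for QC.
    """
    if not labels:
        return None
    label, count = Counter(labels).most_common(1)[0]
    return label if count / len(labels) >= min_agreement else None
```

For example, four annotators voting `["car", "car", "car", "truck"]` reach 75% agreement and qualify as `"car"`, while `["car", "car", "truck"]` at ~67% falls below the threshold and would be rerouted for review.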
We understand how essential large quantities of data are for high-performing machine learning algorithms.
We prepare all the training data you need so that you can focus on innovation instead. For more information on data labeling problems, feel free to get in touch.
Originally published on Playment blogs Feb 19, 2018.