“There cannot be too much relevant training data! Training your machine learning models for scene understanding requires a lot of data. This is mainly due to the variability of input sensors and the environmental conditions introduced both by nature and man.”

Every company needs its own data to build models for its identification tasks. This need arises from the variability within each problem domain.

Options to Scale Training Datasets

- Open source datasets
- Customized datasets with human validation
- Synthetic data generation

Although obtaining raw datasets is much easier nowadays, enriching or annotating this data poses a whole new set of logistical challenges. As mentioned in our previous blog post, there are 3 ways to get data annotated:

- Built-in image annotators (e.g., Facebook’s image tagging, Google’s reCaptcha)
- Traditional BPOs (e.g., you know better)
- Fully managed annotation experts (e.g., Playment)

But how does one go about making a decision? Read on.

In the future, I hope there will be a central system that has seen so much data that fetching whatever data we need, in whatever quantity, is just a real-time API call. But such a system does not exist today, so there is no escaping the human role in scaling training data.

What to look for when scaling up Training Data!

Scalable Trained Workforce + Seamless API Integration + Dedicated Project Manager + Stringent QC to ensure Data Accuracy = High Quality Training Data at Scale

Here’s a use case from one of our customers: “A new autonomous driving startup, building an AI brain for self-driving cars, that needs a huge amount of data annotated for vehicles and pedestrians.”

Breaking this down, we got traffic signs, pedestrians, different types of vehicles… around 45 different classes.

The problem with existing datasets is that an autonomous vehicle may have been trained for, say, European roads, but what if we put the same car on American roads?
The existing data would then be contextually irrelevant for building localization and categorization models with high precision and recall.

So after freezing the scope of work, our project manager started analyzing the nuances and clarifying queries. With our trained workforce, the only issue was training our users for the new cases, followed by the qualifiers.

How does Playment fit in?

Getting hundreds of thousands of annotators is easy. The hard part is training them for the annotation tools and complex tasks.

Solving User Training Complexities

When the number of classes grows, user training becomes a real challenge for our project teams. So when there are many classes, we group them to ease user training. In the use case above, the total M classes are clubbed into N groups. With these groups, we complete all object localization tasks, and then break the groups back into the individual M classes to perform the categorization tasks.

- On an average day, we deliver 3000+ man hours.
- We implemented consensus logic to qualify annotations.
- We do daily in-house QC, and if there is any discrepancy we redo the work completely.

Closing Thoughts

We understand how essential large datasets are for high-performing machine learning algorithms. We prepare all the training data you need so that you can focus on innovation instead.

For more information on data labeling problems, feel free to get in touch.

Originally published on Playment blogs Feb 19, 2018.
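The class grouping and consensus steps described under "Solving User Training Complexities" can be illustrated with a minimal Python sketch. The article does not describe how Playment actually implements either step, so everything here is an assumption made for illustration: the class names, the `CLASS_TO_GROUP` mapping, the helper functions, and the 60% agreement threshold are all hypothetical.

```python
from collections import Counter

# Hypothetical mapping: M fine-grained classes clubbed into N coarse groups.
# Annotators first localize objects by group, then categorize within a group.
CLASS_TO_GROUP = {
    "car": "vehicle",
    "truck": "vehicle",
    "bus": "vehicle",
    "pedestrian": "person",
    "cyclist": "person",
    "stop_sign": "traffic_sign",
    "speed_limit_sign": "traffic_sign",
}


def classes_in_group(group):
    """Return the fine-grained classes offered in a categorization task for a group."""
    return sorted(c for c, g in CLASS_TO_GROUP.items() if g == group)


def consensus_label(votes, min_agreement=0.6):
    """Simple majority-vote consensus (an assumed rule, not Playment's actual logic).

    Returns the most common label if its share of the votes reaches
    min_agreement; otherwise returns None so the item can be sent for rework.
    """
    if not votes:
        return None
    label, count = Counter(votes).most_common(1)[0]
    return label if count / len(votes) >= min_agreement else None


if __name__ == "__main__":
    # Stage 1: localization uses only the coarse groups.
    print(sorted(set(CLASS_TO_GROUP.values())))
    # Stage 2: categorization within one group, qualified by consensus.
    print(classes_in_group("vehicle"))
    print(consensus_label(["car", "car", "truck"]))
    print(consensus_label(["car", "truck", "bus"]))
```

With three annotators, two matching votes pass the assumed threshold and the label is accepted, while a three-way split returns `None` and the item would go back for another round, which mirrors the "redo on discrepancy" policy mentioned above.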


How to Scale Training Data for AI Race
