Busting AI Myths: "You Need Tons of Data for Machine Learning"

by Frederik Bussler, September 16th, 2019

Leading researchers like Karl Friston describe AI as "active inference": building computational statistical models that minimize prediction error. The human brain operates much the same way, also learning from data. A common argument goes:

"AI will never be intelligent because it needs to see something thousands of times to learn it, while a human only needs to see something once."

Others put it simply: "the more, the better." As this Medium article describes, "anyone on a mission to create a first-class AI-powered product needs vast amounts of data to feed to the machines," because multiple parameters and classes need to be learned.

However, we can outline a few ways to circumvent the need for big data in your AI journey:

Transfer learning (including one-shot and zero-shot learning)
Turn-key solutions
(High quality) little data

1. Transfer Learning

“Transfer learning is an up-and-coming technique that allows us to transfer the knowledge learned in one dataset and apply it to another dataset.” - Bradley Arsenault

Transfer learning takes what a model has learned in one domain and applies it to another, so you don't have to start from scratch. This is especially useful in highly specific domains where little data is available.
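To make the idea concrete, here is a minimal toy sketch in plain Python. It is not how any particular library does transfer learning: the datasets, the `frozen_w` direction, and the helper names are all hypothetical. It "pretrains" a projection on a data-rich source task, freezes it, and then fits only a single threshold (standing in for a new output layer) on two labelled examples from a related target task.

```python
# Toy illustration only: every dataset and name here is hypothetical.

def mean_vector(points):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# 1. "Pretrain" on a data-rich source task: label is 1 when x0 + x1 > 0.
source = [((x0, x1), int(x0 + x1 > 0))
          for x0 in range(-3, 4) for x1 in range(-3, 4)]
pos_mean = mean_vector([x for x, y in source if y == 1])
neg_mean = mean_vector([x for x, y in source if y == 0])
# The learned direction plays the role of a frozen, pretrained feature extractor.
frozen_w = [p - n for p, n in zip(pos_mean, neg_mean)]

def feature(x):
    """Project an input onto the direction learned from the source task."""
    return dot(frozen_w, x)

# 2. Target task: related but different (label is 1 when x0 + x1 > 4),
#    with only two labelled examples available.
target_train = [((3, 3), 1), ((1, 1), 0)]

# 3. Fit only a small "head" (one threshold) on the target examples;
#    frozen_w is never updated.
pos_scores = [feature(x) for x, y in target_train if y == 1]
neg_scores = [feature(x) for x, y in target_train if y == 0]
threshold = (sum(pos_scores) / len(pos_scores)
             + sum(neg_scores) / len(neg_scores)) / 2

def predict(x):
    return int(feature(x) > threshold)

# Two unseen target points, one on each side of the new boundary.
print(predict((4, 1)), predict((0, 3)))
```

The two target examples never have to teach the model which direction matters; that knowledge comes from the source task. In practice the frozen part would typically be a large pretrained network rather than a two-number vector.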

This super in-depth guide goes into the nuances of transfer learning. Beyond giving you a head start, related methods such as one-shot and even zero-shot learning make it possible to train models with minimal data.

One-shot learning means inferring the required output from just one or a few training examples, as discussed in this paper: ‘One Shot Learning of Object Categories’.
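One common way to picture one-shot classification is nearest-neighbour matching in an embedding space. The sketch below assumes that approach; it is not the Bayesian method of the paper cited above. The three-number "embeddings" are made up for illustration and stand in for the output of some pretrained encoder. Each class is represented by a single labelled example, and a new input is assigned to whichever example it most resembles.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# One labelled example per class. The vectors are hypothetical embeddings,
# standing in for what a pretrained encoder would produce.
support = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.1, 0.9, 0.1],
    "bird": [0.0, 0.2, 0.9],
}

def classify(query):
    """Return the class whose single example is most similar to the query."""
    return max(support, key=lambda label: cosine(support[label], query))

print(classify([0.8, 0.3, 0.1]))
print(classify([0.0, 0.3, 0.8]))
```

Adding a new class here takes one new entry in `support`, with no retraining. The heavy lifting is assumed to have happened earlier, when the encoder was trained on other data.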

Zero-shot learning is a more extreme version of the above: the model learns a task without any labelled examples for it, typically by relying on other information such as descriptions of the classes.
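One way zero-shot classification can work is by matching predicted attributes against class descriptions. The following is a rough sketch under that assumption, not a definitive implementation. The attribute names, class descriptions, and the `predicted` scores are all hypothetical; the scores stand in for the output of an attribute detector trained on other classes.

```python
# Each class is described by attributes rather than by labelled examples.
# All names and values below are hypothetical.
class_descriptions = {
    "zebra": {"stripes": 1.0, "four_legs": 1.0, "wings": 0.0},
    "horse": {"stripes": 0.0, "four_legs": 1.0, "wings": 0.0},
    "parrot": {"stripes": 0.0, "four_legs": 0.0, "wings": 1.0},
}

def squared_distance(scores, description):
    """Squared distance between attribute scores and a class description."""
    return sum((scores[name] - value) ** 2
               for name, value in description.items())

def classify(scores):
    """Pick the class whose description best matches the attribute scores."""
    return min(class_descriptions,
               key=lambda label: squared_distance(scores, class_descriptions[label]))

# Attribute scores for a new image, as an assumed attribute detector might
# output them. No labelled zebra image was ever used.
predicted = {"stripes": 0.9, "four_legs": 0.8, "wings": 0.1}
print(classify(predicted))
```

The class "zebra" is recognized purely from its description. Supporting a new class means writing a new description, not collecting new labelled data.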

2. Turn-key solutions

The second way to deploy AI with less data is to use turn-key solutions that are already pre-trained on massive quantities of data.

Here are just a few examples:

Google Cloud AI

"Google Cloud’s AI Hub provides enterprise-grade sharing capabilities, including end-to-end AI pipelines and out-of-the-box algorithms, that let your organization privately host AI content to foster reuse and collaboration among internal developers and users..."

Microsoft Azure AI Platform

"Only Azure empowers you with the most advanced machine learning capabilities. Quickly and easily build, train, and deploy your machine learning models using Azure Machine Learning, Azure Databricks and ONNX..."

Amazon Machine Learning

"AWS pre-trained AI Services provide ready-made intelligence for your applications and workflows. AI Services easily integrate with your applications to address common use cases such as personalized recommendations, modernizing your contact center, improving safety and security, and increasing customer engagement..."

3. Little data

More data is not always better, especially if that data is unlabelled, dirty, or not indicative of the problem at hand. You might have millions of rows, but if the data is messy, barely relevant to the problem, and only usable with unsupervised learning, a smaller, highly targeted, clean dataset would serve you much better.

This article lists a few questions to decide between using little data and big data:

Do you already have the data you need, and is it labeled?
What’s your use case and what is the minimum data needed to address it?
How advanced is your organization (really) when it comes to AI/ML?

While big data is all the rage, it's not the only way to fuel ML models. Some would even go so far as to say that small data is the future of AI.

A great Harvard Business Review article discusses a few specific cases of using little data in the real world.

For example, "researchers at Vicarious have developed a model that can break through CAPTCHAs at a far higher rate than deep neural networks and with 300-fold more data efficiency." Their model needed only five training examples per character.

In conclusion, more data can be better, and if you have it available, then great! However, if you don't, there are still ways to deploy AI in your organization.