How Big Should A Dataset Be For An AI Project

by Oleg Kokorin, September 12th, 2023

Too Long; Didn't Read

Each AI project has a unique set of requirements for the size of the dataset. The complexity of the model, the acceptable level of accuracy, and the diversity of the input are the main factors that determine the dataset size. Increasing the dataset size can be achieved by using data augmentation techniques and synthetic data generation.


Ask any data scientist how much data is required for machine learning, and they will likely respond, "It depends" or "The more, the better."


The thing is, both responses are accurate.


To get the best outcome, it is always a good idea to increase the size of the dataset as much as you can. More often than not, however, datasets are not large enough, and people struggle to collect the much-needed images or videos for neural network training.


So, here’s the question: how big should the dataset be, and how do you deal with a data shortage?

What Affects The Size Of The Dataset For An ML Project

Every machine learning (ML) project has a unique set of parameters that affect how large the AI training data sets must be for successful modeling. Below are the ones that are most important.


ML Model Complexity

The complexity of a machine learning model is, in short, how many parameters it can learn. The more distinct features of an object the model needs to recognise, the more parameters it has to learn.
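As a rough illustration, here is a minimal sketch, in PyTorch and with made-up architectures, of counting trainable parameters as a proxy for model complexity:

```python
# A minimal sketch with made-up architectures: counting trainable parameters.
import torch.nn as nn

def count_parameters(model: nn.Module) -> int:
    """Number of trainable parameters, i.e. the degrees of freedom the model learns."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# A simple linear classifier over 64x64 RGB images (hypothetical input size)
simple_model = nn.Sequential(nn.Flatten(), nn.Linear(64 * 64 * 3, 2))

# A deeper multilayer perceptron over the same input
deeper_model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(64 * 64 * 3, 256), nn.ReLU(),
    nn.Linear(256, 128), nn.ReLU(),
    nn.Linear(128, 2),
)

print(count_parameters(simple_model))  # ~24.6K parameters
print(count_parameters(deeper_model))  # ~3.2M parameters
```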


Learning Algorithm Complexity

Complex algorithms require more data; that’s a given. Standard ML algorithms that work with structured data require less data to train, and past a certain point, increasing the size of the dataset won’t improve recognition quality.


Deep learning algorithms, on the other hand, require significantly more data. These algorithms work without a predefined structure, and figure out the parameters themselves when processing data.


In this case, the dataset needs to be much larger to accommodate algorithm-generated categories.
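One practical way to see whether more data will actually improve a given model is to plot a learning curve. The sketch below is a minimal example, assuming a made-up tabular dataset and using scikit-learn’s learning_curve; if validation accuracy plateaus as the training set grows, more of the same data is unlikely to help:

```python
# A minimal sketch on a synthetic dataset: does this model benefit from more data?
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)

train_sizes, _, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5,
)

# If validation accuracy plateaus as the training set grows, adding more of
# the same kind of data is unlikely to improve this model.
for size, scores in zip(train_sizes, val_scores):
    print(f"{size:>5} samples -> validation accuracy {scores.mean():.3f}")
```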

Acceptable Level Of Accuracy

Despite all AI projects claiming ‘high accuracy’, the acceptable level of accuracy can actually vary significantly. Certain machine learning projects can tolerate a lower degree of accuracy.


For example, weather forecasting algorithms can be off by 10-20% without significantly impacting the functionality of a product.


On the other hand, a low level of accuracy of an AI medical application can result in poor patient outcomes, making applications in this field less tolerant to mistakes.

Achieving high accuracy is, in large part, a matter of increasing the dataset size.


Diversity Of Input

When input is highly variable, the dataset needs to reflect as much variety as possible.

For example, detecting animals in the wild can come with a lot of data variability. Depending on the weather and lighting conditions, time of day, and animal age and sex, the animal can look very different.


[Image: diversity of input, illustrated with rabbits]



It’s important to include as much of this variability as possible in the dataset, including blurred, underexposed, and otherwise ‘warped’ images.


The more variety there is in the environment, the more data is going to be required.


What Dataset Size Is Optimal For Neural Network Training?

Many people worry that their ML projects won't be as reliable as they could be because they don't have enough data. But very few people genuinely understand how much data is "enough," "too much," or "too little."


The most common rule of thumb for determining whether a dataset is sufficient is the 10 times rule:


The amount of input data should be ten times the number of degrees of freedom in the model.


Degrees of freedom typically refer to the parameters your model learns.


So if, for instance, your algorithm distinguishes images of cats from images of dogs based on 1,000 parameters, roughly 10,000 images are required to train the model.


Although the "10 times rule" is a well-known concept in machine learning, it can only be applied to small models.


Bigger models don’t follow this rule, since the number of collected examples is no longer a good proxy for the amount of information the model actually trains on.


A better estimate is to multiply the number of images by the size of each image in pixels and by the number of color channels.


Either estimate is good enough to get a project up and running; however, the only reliable way to determine the dataset size is to consult with a machine learning development partner.
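As a back-of-the-envelope illustration of both estimates, here is a short sketch in Python; every number in it is a hypothetical placeholder to replace with your own project’s figures:

```python
# Back-of-the-envelope estimates; all numbers are hypothetical placeholders.

model_parameters = 1_000                      # degrees of freedom the model learns
images_by_10x_rule = 10 * model_parameters
print(f"10 times rule: ~{images_by_10x_rule:,} labeled images")   # ~10,000

# For larger models, estimate the raw data volume instead:
# number of images x pixels per image x color channels.
num_images, width, height, channels = 10_000, 224, 224, 3
input_values = num_images * width * height * channels
print(f"Input values seen during training: ~{input_values:,}")    # ~1.5 billion
```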


Strategies To Increase The Size Of A Dataset

There are several things that often go wrong with a dataset in an AI project, one of them being a low volume of data. A small dataset is detrimental to the final product, as the dataset is the foundation for all subsequent development.


Here’s a list of strategies you can implement to increase the amount of data in a dataset.


Data Augmentation

Data augmentation is the process of extending an input dataset by slightly modifying the original images.


It’s mostly used for augmenting image datasets. Cropping, rotating, zooming, flipping, and color adjustments are common image editing methods.
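As an illustration, here is a minimal augmentation pipeline using torchvision; the specific transforms and parameter values are assumptions to adapt to your own dataset:

```python
# A minimal sketch of the transformations listed above, using torchvision.
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomResizedCrop(224),               # cropping + zooming
    transforms.RandomHorizontalFlip(p=0.5),          # flipping
    transforms.RandomRotation(degrees=15),           # rotating
    transforms.ColorJitter(brightness=0.2,           # color adjustments
                           contrast=0.2,
                           saturation=0.2),
    transforms.ToTensor(),
])

# Applied on the fly during training, for example:
# dataset = torchvision.datasets.ImageFolder("data/train", transform=augment)
```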


[Image: data augmentation applied to a photo of a dog]



Data augmentation gives the model more varied data to learn from. Yet, if the original dataset is skewed, the augmented data will be skewed as well.


Overall, data augmentation helps increase dataset volume, balance dataset classes, and improve a neural network’s ability to generalize.


Synthetic Data Generation

Although some consider data generation a type of data augmentation, the results are quite different. During data augmentation, the original data is changed, while during data generation, completely new data is created.


Synthetic data has several important advantages over ‘regular’ data:


  • Synthetic data can be labeled before it is even generated, while organic data needs to be labeled one picture at a time (see the sketch after this list)
  • Synthetic data can help work around data privacy regulations, e.g. for medical or financial data, in cases where obtaining organic data is difficult
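
To make the first point concrete, here is a deliberately toy sketch of synthetic image generation with Pillow, where each label is decided before its image is rendered; real projects typically rely on simulators or generative models instead:

```python
# A toy sketch of synthetic data generation: the label is chosen before the
# image is rendered, so labeling comes for free.
import random
from PIL import Image, ImageDraw

def generate_sample(label: str, size: int = 64) -> Image.Image:
    """Render one synthetic image whose label is known before generation."""
    img = Image.new("RGB", (size, size), "white")
    draw = ImageDraw.Draw(img)
    x0, y0 = random.randint(4, 20), random.randint(4, 20)
    x1, y1 = random.randint(36, 60), random.randint(36, 60)
    if label == "circle":
        draw.ellipse([x0, y0, x1, y1], fill="black")
    else:  # "square"
        draw.rectangle([x0, y0, x1, y1], fill="black")
    return img

# The dataset and its labels are created in the same step.
dataset = [(generate_sample(label), label)
           for label in ("circle", "square") for _ in range(100)]
```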


Like all good things, synthetic data has drawbacks that are important to keep in mind.


The Balance Of Real And Synthetic Data

Using predominantly synthetic data can introduce bias into your AI project, often inherited from the original dataset it was generated from. This bias can unbalance the classes in your dataset and dramatically lower recognition quality.


Synthetic datasets don’t always capture the complexity of real-world datasets: they often omit important details needed for training a neural network. This is especially important in fields where mistakes are not an option, e.g. the medical field.


Synthetic data is also difficult to validate. It may look realistic and true to life, but it’s difficult to know for sure whether it captures the underlying trends of authentic data.
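
One modest sanity check is to compare simple statistics of the real and synthetic sets. The sketch below assumes you can provide real_images and synthetic_images as NumPy arrays and uses SciPy’s two-sample Kolmogorov-Smirnov test on per-image mean brightness; it can flag obvious distribution drift, but it is not a full validation:

```python
# A minimal sanity check, assuming `real_images` and `synthetic_images` are
# NumPy arrays of shape (N, H, W, C) that you provide.
import numpy as np
from scipy.stats import ks_2samp

def mean_brightness(images: np.ndarray) -> np.ndarray:
    """Per-image mean pixel value for a batch of images."""
    return images.reshape(len(images), -1).mean(axis=1)

def distributions_match(real_images, synthetic_images, alpha=0.05):
    statistic, p_value = ks_2samp(mean_brightness(real_images),
                                  mean_brightness(synthetic_images))
    # A small p-value suggests the synthetic set drifts from the real one on
    # this statistic; a red flag worth investigating, not a definitive verdict.
    return p_value >= alpha, p_value
```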

Summing Up

Machine learning initiatives must carefully consider the size of their training datasets. To determine the ideal amount of data, you must take into account a number of variables: the project type, the algorithm and model complexity, the acceptable error margin, and the diversity of the input.


The 10 times rule is another useful estimate; however, it's not always accurate when dealing with difficult tasks and large models.


If you come to the conclusion that the data currently available is insufficient and that obtaining the necessary real-world data is impractical or prohibitively expensive, try using one of the scaling strategies.


Depending on the needs and financial constraints of your project, that strategy can involve data augmentation, synthetic data generation, or transfer learning.