Any data scientist will likely respond, "It depends" or "The more, the better" if you ask them how much data is required for machine learning.
The thing is, both responses are accurate.
To get the best outcome, it is always a good idea to increase the size of the dataset as much as you can. More often than not, however, datasets are not large enough, and teams struggle to collect the images or videos they need for neural network training.
So, here’s the question: how big should the dataset be and how do you deal with data shortage?
Every machine learning (ML) project has a unique set of factors that affect how large the AI training datasets must be for successful modeling. Below are the most important ones.
The complexity of a machine learning model is, in short, how many parameters it can learn. The more distinct features of an object the model needs to recognize, the more parameters it has to learn.
Complex algorithms require more data; that’s a given. Standard ML algorithms that work with structured data require less data to train, and past a certain point, increasing the size of the dataset won’t improve recognition quality any further.
Deep learning algorithms, on the other hand, require significantly more data. These algorithms work without a predefined structure, and figure out the parameters themselves when processing data.
In this case, the dataset needs to be much larger to accommodate algorithm-generated categories.
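As a rough illustration of what ‘parameters’ means here, a framework like PyTorch lets you count a model’s trainable parameters directly; the tiny network below is invented purely for the example.

```python
# Count trainable parameters as a rough proxy for model complexity.
# The architecture here is made up for illustration only.
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),  # learns simple visual features
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(16 * 224 * 224, 2),                # e.g. a two-class 'cat vs. dog' head
)

num_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {num_params:,}")
```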
Although every AI project claims ‘high accuracy’, the level of accuracy actually required varies significantly. Some machine learning projects can tolerate a lower degree of accuracy than others.
For example, weather forecasting algorithms can be off by 10-20% without significantly impacting the functionality of a product.
On the other hand, a low level of accuracy in an AI medical application can result in poor patient outcomes, so applications in this field are far less tolerant of mistakes.
To a large extent, high accuracy is achieved by increasing the size of the dataset.
When input is highly variable, the dataset needs to reflect as much variety as possible.
For example, detecting animals in the wild can come with a lot of data variability. Depending on the weather and lighting conditions, time of day, and animal age and sex, the animal can look very different.
It’s important to capture as much of this variability as possible, including blurred, underexposed, and otherwise ‘warped’ images in the dataset.
The more variety there is in the environment, the more data is going to be required.
Many people worry that their ML projects won't be as reliable as they could be because they don't have enough data. But very few people genuinely understand how much data is "enough," "too much," or "too little."
The most common technique for determining whether a dataset is sufficient is the 10 times rule:
The number of input examples should be at least ten times the number of degrees of freedom in the model.
Degrees of freedom typically refer to parameters in your data set.
So if, for instance, your algorithm distinguishes images of cats from images of dogs based on 1,000 parameters, you would need around 10,000 images to train the model.
Although the "10 times rule" is a well-known concept in machine learning, it can only be applied to small models.
Bigger models don’t follow this rule, because the number of examples alone is not indicative of the actual amount of training data they receive.
A better estimate is obtained by multiplying the number of images by the size of each image (in pixels) and by the number of color channels.
These rules of thumb are good enough to get a project up and running; however, the only reliable way to determine the right dataset size is to consult with a machine learning development partner.
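For a rough sense of how these back-of-the-envelope estimates work in practice, here is a minimal sketch; every number in it is invented for illustration.

```python
# Two rough dataset-size heuristics; all numbers below are illustrative.

def ten_times_rule(num_parameters: int) -> int:
    """Samples needed = 10x the model's degrees of freedom (parameters)."""
    return 10 * num_parameters

def input_volume(num_images: int, height: int, width: int, channels: int) -> int:
    """For bigger models: images x pixels per image x color channels."""
    return num_images * height * width * channels

print(ten_times_rule(1_000))                # 10,000 samples for a 1,000-parameter model
print(input_volume(10_000, 224, 224, 3))    # raw input volume of 10,000 RGB images at 224x224
```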
Several things often go wrong with a dataset in an AI project, and one of them is a low volume of data. A small dataset is detrimental to the final product, because the dataset is the foundation for all subsequent development.
Here’s a list of strategies you can implement to increase the amount of data in a dataset.
Data augmentation is the process of extending an input dataset by creating slightly modified copies of the original samples.
It’s mostly used for image datasets, where cropping, rotating, zooming, flipping, and color adjustments are common techniques.
Data augmentation increases dataset volume, helps with class imbalance, provides the model with more varied data, and improves a neural network’s ability to generalize. Keep in mind, though, that if the original dataset is biased, the augmented data will be biased as well.
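As an illustrative sketch, an augmentation pipeline built with torchvision could cover the edits listed above; the file name and exact transform settings are hypothetical.

```python
# Minimal image-augmentation pipeline; file name and settings are hypothetical.
from PIL import Image
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomResizedCrop(224),        # crop and zoom
    transforms.RandomRotation(degrees=15),    # rotate
    transforms.RandomHorizontalFlip(p=0.5),   # flip
    transforms.ColorJitter(brightness=0.2,    # color adjustments
                           contrast=0.2,
                           saturation=0.2),
])

original = Image.open("cat_0001.jpg")               # hypothetical source image
augmented = [augment(original) for _ in range(5)]   # five modified copies of one original
```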
Some consider data generation a type of data augmentation, but the results are quite different: during augmentation the original data is modified, while during data generation completely new data is created.
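To make the distinction concrete, here is a toy sketch of data generation: brand-new labelled images are drawn from scratch rather than derived from existing photos. Real projects typically rely on simulation, 3D rendering, or generative models; the shapes and labels below are purely illustrative.

```python
# Toy data generation: create new labelled images from scratch.
import random
from PIL import Image, ImageDraw

def generate_sample(label: str, size: int = 64) -> Image.Image:
    """Draw a randomly placed circle or square on a blank canvas."""
    img = Image.new("RGB", (size, size), "white")
    draw = ImageDraw.Draw(img)
    x, y = random.randint(8, size // 2), random.randint(8, size // 2)
    box = (x, y, x + 24, y + 24)
    if label == "circle":
        draw.ellipse(box, fill="black")
    else:
        draw.rectangle(box, fill="black")
    return img

# 200 completely new, automatically labelled samples.
dataset = [(generate_sample(lbl), lbl) for lbl in ("circle", "square") for _ in range(100)]
```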
Synthetic data has several important advantages over ‘regular’ data.
Like all good things, though, it also has drawbacks that are important to keep in mind.
Using predominantly synthetic data can introduce bias into your AI project. That bias may be inherited from the original dataset, and it can unbalance the classes in your data, dramatically lowering recognition quality.
Synthetic datasets don’t always capture the complexity of real-world data: they often omit important details needed for training a neural network. This is especially problematic in fields where mistakes are not an option, such as medicine.
Synthetic data is also difficult to validate. It may look realistic, but it’s hard to know for sure whether it captures the underlying trends of authentic data.
Machine learning initiatives must carefully consider the size of their training datasets. To determine the ideal amount of data, you need to take into account a number of factors: the project type, the complexity of the algorithm and model, the acceptable error margin, and the diversity of the input.
The 10 times rule is another option, although it’s not always accurate for more difficult tasks.
If you come to the conclusion that the data currently available is insufficient and that obtaining the necessary real-world data is impractical or prohibitively expensive, try using one of the scaling strategies.
Depending on the needs and budget of your project, that can mean data augmentation, synthetic data generation, or transfer learning.