It’s a no-brainer that data is the most important aspect of Data Science, and Computer Vision is no exception. You might be familiar with the popular saying, "garbage in, garbage out".
The quality of the data matters just as much as the quantity. Whether you source your dataset online or take the pictures yourself, these tips should help you collect a better dataset.
Always think about your inference scenario and collect your dataset based on it. If you are going to run inference on security cameras and real-world objects, you don't need to add cartoons or toy images. Try to recreate the inference scenario and collect data as close to real life as possible. At the same time, the dataset still needs to be diverse, so collect data with different lighting, seasons, backgrounds, and even different cameras. That way, your neural net will be ready for most real-world cases.
It's a good idea to collect roughly the same number of samples for every class. Aim for balance, though some imbalance is not always a bad thing: a balanced dataset helps your model avoid bias toward any class, but in some cases you might actually want imbalanced data, if the real-world scenario has that same imbalance. In the worst case, your model might end up predicting just one class, and if you use accuracy as your metric (which is the wrong choice with imbalanced data), you wouldn't even see that there is a problem. You can't always get perfectly balanced data, but you can use different techniques to deal with that.
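One such technique is weighting the loss (or the sampler) by inverse class frequency, so rare classes count more during training. A minimal sketch of the weight computation; the class names are made up for illustration:

```python
from collections import Counter

def class_weights(labels):
    """Compute inverse-frequency weights: classes with fewer
    samples get proportionally larger weights."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cls: total / (len(counts) * n) for cls, n in counts.items()}

# Toy imbalanced label list: 8 'cat' samples vs 2 'dog' samples.
labels = ["cat"] * 8 + ["dog"] * 2
weights = class_weights(labels)
print(weights)  # {'cat': 0.625, 'dog': 2.5} - the rare class weighs more
```

These weights can then be passed to a weighted loss function or a weighted random sampler in your training framework.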
Be consistent in your data labeling. If you have two types of damage as your classes, you should classify damage as type 1 or type 2 the same way every time. Label every object of your target classes in the dataset; it's really important not to leave target objects unlabeled. You also want to label with good precision: your bounding box should be as close to the object's edges as possible.
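A cheap way to catch sloppy labels is an automated sanity check over the annotation files. A minimal sketch, assuming boxes in pixel (x_min, y_min, x_max, y_max) format; adapt the checks to your own format:

```python
def check_box(box, img_w, img_h):
    """Return a list of problems found with one bounding box."""
    x1, y1, x2, y2 = box
    problems = []
    if x1 >= x2 or y1 >= y2:
        problems.append("degenerate box")  # zero or negative area
    if x1 < 0 or y1 < 0 or x2 > img_w or y2 > img_h:
        problems.append("outside image bounds")
    return problems

print(check_box((10, 10, 50, 40), 100, 100))   # [] - a valid box
print(check_box((90, 10, 120, 40), 100, 100))  # ['outside image bounds']
```

Running a check like this over the whole dataset before training catches many labeling mistakes early.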
Of course, more data is better. But we live in the real world, where it's always hard to get more data. The amount of data needed to train a model really depends on your objects, so I can't give a universal number. Pre-trained models make life a lot easier, though: unless your images are very specific, like MRI or X-ray data, a pre-trained model should work well, since it can already extract low-level features. Something like 1,000 images per class may be enough to tune the model for your task. If your data is really specific and you need to train a model from scratch, you will need a lot more data.
If you need to detect an anomaly or something rare, and you process a lot of frames, you are going to get False Positives. The problem gets even worse if your objects are small and you process frames in real time, 24/7.
There are several techniques to deal with this problem, but you should start by adding background images to your dataset. Background images should be similar to the other images in the dataset, but contain no target class at all: you are simply showing the model what is not the target. It's also a good idea to include objects that are visually similar to your target classes; that also helps reduce False Positives.
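In YOLO-style datasets, a background image is usually just an image with an empty label file. A minimal sketch that registers a folder of background images this way; the directory layout is an assumption, so match it to your own dataset structure:

```python
from pathlib import Path

def add_backgrounds(image_dir, label_dir):
    """Create an empty YOLO-style .txt label file for each background
    image, marking it as containing no objects at all."""
    label_dir = Path(label_dir)
    label_dir.mkdir(parents=True, exist_ok=True)
    for img in Path(image_dir).glob("*.jpg"):
        # An empty label file tells the trainer: nothing to detect here.
        (label_dir / f"{img.stem}.txt").write_text("")
```

A common rule of thumb is to keep backgrounds a modest fraction of the dataset rather than letting them dominate.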
Augmentation helps make your dataset more diverse and your model generalize better. For example, you take your image and rotate it a little bit. There are a lot of ways to augment your data; albumentations is a good library for that purpose.
You can use augmentation in two ways:
1 - offline, to artificially make your dataset bigger or to balance classes.
2 - at runtime, applying random augmentations to each batch. The second method is more commonly used and is often implemented in SOTA detectors.
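The runtime approach can be sketched with plain NumPy; a real pipeline would use albumentations, which offers many more transforms. This toy version only flips and rotates:

```python
import random
import numpy as np

def random_augment(img):
    """Apply a random horizontal flip and a random 90-degree rotation."""
    if random.random() < 0.5:
        img = np.fliplr(img)                     # horizontal flip
    img = np.rot90(img, k=random.randint(0, 3))  # 0/90/180/270 degrees
    return img

img = np.zeros((64, 64, 3), dtype=np.uint8)  # dummy 64x64 RGB image
out = random_augment(img)
print(out.shape)  # (64, 64, 3) - a square image keeps its shape
```

Note that in detection tasks the bounding boxes must be transformed together with the image; that bookkeeping is exactly what a library like albumentations handles for you.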
It's a good idea to split your dataset into three parts: train, validation, and test. You use the train data for training, the validation data for choosing the best weights, and, at the end of training, the test data to make sure you didn't 'overfit' to the validation data. You can't overfit on validation data in the usual sense, since you don't train on it, but you do choose the best weights based on it. The model never sees the test data, while the validation data indirectly influences it, so you want to ensure that your metrics are similar on validation and test data.
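A minimal random split can be sketched in a few lines; the 80/10/10 fractions are just a common default, and for imbalanced classes or video frames from the same scene you would want a stratified or grouped split instead:

```python
import random

def split_dataset(items, val_frac=0.1, test_frac=0.1, seed=42):
    """Shuffle once with a fixed seed, then carve off the
    validation and test slices; the rest is the train set."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_val = int(len(items) * val_frac)
    n_test = int(len(items) * test_frac)
    val, test = items[:n_val], items[n_val:n_val + n_test]
    train = items[n_val + n_test:]
    return train, val, test

train, val, test = split_dataset(range(1000))
print(len(train), len(val), len(test))  # 800 100 100
```

Fixing the seed keeps the split reproducible, so later experiments are compared on the same test set.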
When you have a working model and you are still collecting more data to improve it, you can use the model itself to label the new data. Firstly, use it as a labeling tool: it will pre-annotate images for you, though you might need to correct the bounding boxes. Retraining on images where the model already detects the object with high confidence is not that valuable, but it helps a lot where the confidence was low.
Secondly, you can run your model on new data and see in which cases it doesn't detect the object at all, so you can collect more images in that style to help the model in the future.
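This workflow boils down to triaging predictions by confidence. A minimal sketch; the thresholds and the prediction dictionary shape are made-up assumptions, so match them to your model's output format:

```python
def triage(predictions, high=0.8, low=0.3):
    """Sort model predictions into buckets for the labeling workflow."""
    keep, review, discard = [], [], []
    for pred in predictions:
        if pred["score"] >= high:
            keep.append(pred)     # likely fine as a pre-annotation
        elif pred["score"] >= low:
            review.append(pred)   # worth a human look and a correction
        else:
            discard.append(pred)  # probably noise
    return keep, review, discard

preds = [{"score": 0.95}, {"score": 0.5}, {"score": 0.1}]
keep, review, discard = triage(preds)
print(len(keep), len(review), len(discard))  # 1 1 1
```

The 'review' bucket is where model-assisted labeling pays off most, while images with no predictions at all point you at the gaps worth photographing next.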