This article presents one approach to building a computer vision-driven car damage detection solution. We will introduce the popular instance and semantic segmentation architectures Mask R-CNN and U-Net and share insights from our attempt to train and test ML models for car damage detection with these architectures. Then, we will share the datasets used for training and testing. After that, we will explain in depth how our computer vision solution operates and demonstrate how the car damage model works using a demo. We also provide parts of the code of the models used to build the resulting solution.
We considered it reasonable to try both approaches to image segmentation that suit the purposes of our project. This way, we not only get the desired outcomes but also gain meaningful insights into how instance and semantic segmentation architectures perform on the particular task of car damage detection.
Here’s a brief overview of the technologies utilized:
- U-Net, a semantic segmentation architecture, which we implemented with the PyTorch-based segmentation_models_pytorch library;
- Mask R-CNN, an instance segmentation architecture, which we implemented with the detectron2 framework.
The complete procedure for training, testing, and comparing the models built with these algorithms is provided in the sections below.
There is a strong correlation between the volume and quality of data and the outcomes of training AI models. After all, the widely known principle of garbage in, garbage out (GIGO) states that nonsense input data produces nonsense output.
In ideal circumstances, we would prepare extensive datasets of damaged vehicles with varying degrees of damage, shot from diverse angles and under different lighting conditions, and conduct the data annotation ourselves. But for this trial project, we considered it reasonable to use precleaned and prepackaged data from these publicly available datasets:
We recommend using the mentioned datasets, although newer, higher-quality collections may already be available. For business-level projects, such as implementing computer vision technology in an insurance claim processing workflow, we highly recommend the following approach to data preparation:
Let’s walk through our process of choosing, training, and testing the semantic and instance segmentation models. It contains insights and explanations of the solutions we came up with, which can be of great use to anyone who wants to develop their own ML computer vision model for car damage detection.
Semantic segmentation architecture — U-Net. We picked the PyTorch framework and, in particular, chose a U-Net model from the segmentation_models_pytorch library. We tested different backbones (pre-trained models used for feature extraction), and the best tradeoff between performance and accuracy was achieved with an efficientnet-b1 (EfficientNet architecture) backbone pre-trained on the ImageNet dataset.
import segmentation_models_pytorch as smp

CLASSES = ['damage']  # single foreground class

model_params = {
    "encoder_name": 'efficientnet-b1',            # EfficientNet-B1 backbone
    "encoder_weights": 'imagenet',                # initialized with ImageNet weights
    "activation": None,                           # model outputs raw logits
    "encoder_depth": 5,
    "decoder_channels": [256, 128, 64, 32, 16],
}
model = smp.Unet(classes=len(CLASSES), **model_params)
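Since activation is set to None, the network outputs raw logits. For completeness, here is a minimal inference sketch of our own (not part of the original code): the input should be normalized with the same ImageNet statistics the backbone was pre-trained with, which segmentation_models_pytorch exposes via get_preprocessing_fn. The image here is a random stand-in for a real photo.

import numpy as np
import torch

# normalization matching the 'imagenet'-pretrained efficientnet-b1 encoder
preprocess = smp.encoders.get_preprocessing_fn('efficientnet-b1', pretrained='imagenet')

image = np.random.rand(256, 256, 3).astype('float32')  # stand-in for a real (H, W, C) photo
x = torch.from_numpy(preprocess(image)).permute(2, 0, 1).unsqueeze(0).float()  # -> (1, 3, H, W)

model.eval()
with torch.no_grad():
    logits = model(x)                          # (1, 1, H, W) raw logits, since activation=None
    damage_mask = torch.sigmoid(logits) > 0.5  # binary damage mask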
For the loss function, we experimented with several options, such as BCEWithLogitsLoss, DiceLoss, and LovaszHingeLoss. The best scores on the test dataset were achieved with a linear combination of two losses: DiceLoss and BCEWithLogitsLoss. Combining two loss functions is a rather unconventional approach, but it provided the best outcomes for the car damage detection task.
import torch
import torch.nn as nn

class DiceLoss(nn.Module):
    def __init__(self, weight=None, size_average=True):
        super(DiceLoss, self).__init__()

    def forward(self, inputs, targets, smooth=1):
        # comment out if your model contains a sigmoid or equivalent activation layer
        inputs = torch.sigmoid(inputs)
        # flatten label and prediction tensors
        inputs = inputs.view(-1)
        targets = targets.view(-1)
        intersection = (inputs * targets).sum()
        dice = (2. * intersection + smooth) / (inputs.sum() + targets.sum() + smooth)
        return 1 - dice

criterion = DiceLoss()
criterion_2 = nn.BCEWithLogitsLoss()
# …
# combined loss: equally weighted sum of Dice and BCE, both applied to raw logits
loss = criterion(output.squeeze(1), mask) + criterion_2(output.squeeze(1), mask)
We trained this model with the AdamW optimizer and scheduled the learning rate via the OneCycleLR scheduler. Additionally, we adopted an early stopping algorithm that tracks the loss on an evaluation dataset.
optimizer = torch.optim.AdamW(model.parameters(),
                              lr=params['max_lr'],
                              weight_decay=params['weight_decay'])
sched = torch.optim.lr_scheduler.OneCycleLR(optimizer,
                                            params['max_lr'],
                                            epochs=params['epoch'],
                                            steps_per_epoch=len(train_dataloader))
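The early stopping logic itself is not shown above. A minimal sketch of what it might look like, assuming hypothetical train_one_epoch and evaluate helpers and an illustrative patience value:

best_eval_loss = float('inf')
patience = 5                      # illustrative value, not from the original project
epochs_without_improvement = 0

for epoch in range(params['epoch']):
    train_one_epoch(model, train_dataloader, optimizer, sched)  # hypothetical helper
    eval_loss = evaluate(model, eval_dataloader)                # hypothetical helper

    if eval_loss < best_eval_loss:
        best_eval_loss = eval_loss
        epochs_without_improvement = 0
        torch.save(model.state_dict(), 'best_model.pt')  # keep the best checkpoint
    else:
        epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            break  # eval loss has not improved for `patience` consecutive epochs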
Instance segmentation architecture — Mask R-CNN. Here, again, we fine-tuned the model on a fairly small dataset: 59 training images and 11 images for evaluation and final testing. For detectron2, we used the built-in default loss and optimizer.
from detectron2 import model_zoo
from detectron2.config import get_cfg

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file("COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"))
cfg.DATASETS.TRAIN = ("car_damage_dataset_train",)
cfg.DATASETS.TEST = ("car_damage_dataset_val",)
cfg.DATALOADER.NUM_WORKERS = 4
# start from COCO-pretrained Mask R-CNN weights
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url("COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml")
cfg.SOLVER.IMS_PER_BATCH = 4
cfg.SOLVER.BASE_LR = 0.001
cfg.SOLVER.WARMUP_ITERS = 700
cfg.SOLVER.MAX_ITER = 700
cfg.SOLVER.STEPS = (600, 800)
cfg.SOLVER.GAMMA = 0.05
cfg.MODEL.ROI_HEADS.BATCH_SIZE_PER_IMAGE = 128
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 1  # single class: damage
cfg.MODEL.RETINANET.NUM_CLASSES = 1
cfg.TEST.EVAL_PERIOD = 600
cfg.MODEL.DEVICE = device  # 'cuda' or 'cpu', defined elsewhere

trainer = CocoTrainer(cfg)  # custom trainer subclassing DefaultTrainer with COCO evaluation
trainer.resume_or_load(resume=False)
trainer.train()
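Once training finishes, predictions can be obtained with detectron2's standard DefaultPredictor. The snippet below is a usage sketch rather than the project's exact code; the score threshold and image path are illustrative:

import os
import cv2
from detectron2.engine import DefaultPredictor

cfg.MODEL.WEIGHTS = os.path.join(cfg.OUTPUT_DIR, "model_final.pth")  # checkpoint saved by the trainer
cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.5  # confidence cutoff, chosen for illustration

predictor = DefaultPredictor(cfg)
image = cv2.imread("damaged_car.jpg")  # detectron2 expects a BGR image by default
outputs = predictor(image)

instances = outputs["instances"].to("cpu")
masks = instances.pred_masks.numpy()   # (N, H, W) boolean masks, one per predicted instance
scores = instances.scores.numpy()      # confidence score per instance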
A peculiarity of how detectron2 makes predictions is that it can output multiple different instances for one damaged part, as can be seen in the example below.
We tried to improve the outcomes by implementing post-processing that merges all predicted instances that overlap by more than a certain threshold. The results of this attempt can be seen in the example below.
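For illustration, a simplified version of such post-processing could greedily merge binary instance masks whose pairwise IoU exceeds a threshold. This is our own sketch; the merging rule and threshold used in the actual project may differ:

import numpy as np

def merge_overlapping_masks(masks, iou_threshold=0.2):
    """Greedily merge binary masks whose pairwise IoU exceeds the threshold."""
    merged = [m.astype(bool) for m in masks]
    changed = True
    while changed:
        changed = False
        for i in range(len(merged)):
            for j in range(i + 1, len(merged)):
                inter = np.logical_and(merged[i], merged[j]).sum()
                union = np.logical_or(merged[i], merged[j]).sum()
                if union and inter / union > iou_threshold:
                    merged[i] = np.logical_or(merged[i], merged[j])  # fuse into one instance
                    merged.pop(j)
                    changed = True
                    break
            if changed:
                break
    return merged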
ML engineers from Intelliarts AI chose the Mean Intersection over Union (mIoU) metric as the primary one for evaluating the outcomes of the U-Net and Mask R-CNN comparison. Jumping ahead, we also used the Dice coefficient metric, but its results were less representative than those with mIoU.
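For reference, both metrics are straightforward to compute on binary masks. The following is an illustrative sketch, not the project's evaluation code:

import numpy as np

def iou_score(pred, target):
    pred, target = pred.astype(bool), target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    return inter / union if union else 1.0  # both masks empty: perfect match

def dice_score(pred, target):
    pred, target = pred.astype(bool), target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    total = pred.sum() + target.sum()
    return 2 * inter / total if total else 1.0

# mIoU is the mean IoU over all test images:
# miou = np.mean([iou_score(p, t) for p, t in zip(predictions, ground_truths)])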
With the U-Net model, we achieved an mIoU of approximately 0.398. With detectron2, the mIoU ranged from approximately 0.249 to 0.3654 depending on the threshold, which is significantly worse than the U-Net results.
We assume this is because U-Net treats all damage in an image as a single mask, while instance segmentation tries to predict different parts of the same damage as separate damages, which, by the way, we tried to remedy with the post-processing described above.
As for the visual representation of the results, we can also compare the predictions of both models with the ground truth masks, i.e., what the actual masks in the test dataset look like, provided in the image below.
We can observe that U-Net tends to underestimate damage, while detectron2, on the contrary, is more inclined to overestimate it, masking larger segments and in greater quantity.
As stated, we decided to move on with the U-Net model. This choice was justified not only by the better results the U-Net model showed but also by the fact that, in our case, underestimating damage is preferable to overestimating it. The reason is that we aim to automate insurance claim processing by helping inspectors make assessments faster, while also minimizing the business risks of insurers, which include overestimating damage and, consequently, overpaying car owners.
In the end, we built and trained two separate U-Net-based models: one intended for car damage detection and the other for car part detection.
When a user uploads an image of a damaged car, the resulting solution detects the damage and identifies the affected car part separately. If the two masks intersect, the solution outputs something like “left door — dent.” The results are then compared to similar cases in a prepared image database with repair cost estimates.
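Conceptually, combining the outputs of the two models comes down to intersecting the predicted damage mask with each car part mask. The sketch below is our simplified illustration; the part names and the overlap rule are assumptions:

import numpy as np

def label_damaged_parts(damage_mask, part_masks, min_overlap=0.1):
    """part_masks: dict mapping a part name (e.g. 'left door') to its binary mask."""
    results = []
    for part_name, part_mask in part_masks.items():
        inter = np.logical_and(damage_mask, part_mask).sum()
        # report the part if a sufficient share of the damage falls inside it
        if damage_mask.sum() and inter / damage_mask.sum() > min_overlap:
            results.append(f"{part_name} - damage")
    return results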
You can observe the working principles of both models in the image below.
We prepared a demo that shows how the described ML models work. The example does not perform as well as the real-world model, but it provides a good insight into how the technology is implemented. You can check the recording of the demo below:
Link
Or you can try the car damage detection demo yourself here: https://huggingface.co/spaces/intelliarts/Car_parts_damage_detection
Repo link: https://gitlab.com/r-d-machine-learning/researches/Car-damage-detection.git
This ML solution is a non-trivial combination of two distinct models, one for detecting car damage and another for identifying car parts. Combined, they provide detailed information for car inspection. Our research suggests that a semantic segmentation architecture, particularly U-Net, is likely the optimal choice for image processing tasks related to car damage detection.
However, it’s crucial to understand that the technology is intended for handling large batches of minor vehicle assessment cases. We position it as a useful instrument for human inspectors to master. Therefore, we believe insurance firms will find our solution useful for claims processing automation. Finally, our engineering team regards the integration of the described technology into business processes as another step toward digital transformation.