Solving the Car Damage Detection Task Using a Two-Model Computer Vision Solution


TL;DR: Let’s have an in-depth look at how to solve car damage detection tasks using computer vision, drawing on the experience of the Intelliarts team of ML engineers.

This article presents one approach to building a computer vision-driven car damage detection solution. We will introduce the popular instance and semantic segmentation architectures Mask R-CNN and U-Net and share insights derived from our attempts to train and test ML models for car damage detection using these architectures. Then, we will describe the datasets used for training and testing. After that, we will explain in depth how our computer vision solution operates and demonstrate how the car damage model works using a demo. We also provide parts of the code of the models used to build the resulting solution.

Overview of Mask R-CNN and U-Net Architectures and Performance Comparison Based on Observation

We considered it reasonable to try both approaches to image segmentation that suit the purposes of our project. This way, we not only get the desired outcomes but also gain meaningful insights into how instance and semantic segmentation architectures perform on the particular task of car damage detection.

Here’s a brief overview of the technologies utilized:

  • Instance segmentation architecture — Mask R-CNN: Constructed on top of the Faster R-CNN object detection model, this deep learning architecture incorporates a segmentation component. It generates region proposals using a Region Proposal Network (RPN), which suggests regions of the image that are likely to contain objects. Subsequently, it performs object detection and segmentation by predicting class labels and providing bounding boxes and masks for each proposal.
  • Semantic segmentation architecture — U-Net: U-Net is a convolutional neural network architecture designed for image segmentation tasks. It comprises two pathways: the contracting path, which extracts context and decreases the input image size, and the expansive path, which facilitates precise localization and increases the feature maps’ size. This CNN can retrieve in-depth information about segmented objects while capturing the image’s context and overall structure.

The complete procedure for training, testing, and comparing the models built with these algorithms is provided in the below sections.

Datasets Used in Training and Testing

There is a strong correlation between the volume and quality of data and the outcomes of training AI models. After all, the widely known concept of garbage in, garbage out (GIGO) clearly indicates that nonsense input data produces nonsense output.

In ideal circumstances, we would prepare extensive datasets of damaged vehicles with varying deteriorations, from diverse angles, with different lighting, etc., and conduct data annotation ourselves. But, for this trial project, we considered it reasonable to utilize precleaned and prepacked data from these publicly available datasets:

We recommend using the mentioned datasets. Yet, it may already be possible to find newer, higher-quality collections. For business-level projects, such as implementing computer vision technology in an insurance claim processing workflow, we highly recommend the following approach to data preparation:

  1. Decide on the types of damage the ML model is intended to help identify.
  2. Choose high-quality images from the database of past car inspection cases and build an extensive collection.
  3. Perform manual image labeling, assisted by ML algorithms (see the sketch below).

Discover additional information on how to get better datasets for your computer vision tasks.
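
To illustrate step 3, here is a minimal sketch of model-assisted pre-annotation, where an existing segmentation model produces rough masks that human annotators then correct. The labeling code is not part of the original article, and the resize resolution and threshold are illustrative assumptions:

import os

import numpy as np
import torch
from PIL import Image


@torch.no_grad()
def pre_annotate(model, image_dir, out_dir, device="cpu", threshold=0.5):
    """Save rough model-predicted masks as a starting point for human annotators."""
    model.eval().to(device)
    os.makedirs(out_dir, exist_ok=True)
    for name in os.listdir(image_dir):
        if not name.lower().endswith((".jpg", ".jpeg", ".png")):
            continue
        img = Image.open(os.path.join(image_dir, name)).convert("RGB").resize((512, 512))
        x = torch.from_numpy(np.array(img)).permute(2, 0, 1).float().unsqueeze(0) / 255.0
        mask = torch.sigmoid(model(x.to(device)))[0, 0] > threshold  # binarize the logits
        Image.fromarray(mask.cpu().numpy().astype(np.uint8) * 255).save(
            os.path.join(out_dir, name.rsplit(".", 1)[0] + "_mask.png"))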

The Essence of our Tried and Tested Computer Vision Solution for Car Damage Detection

Let’s walk through our process of choosing, training, and testing the semantic and instance segmentation models. It contains insights into, and explanations of, the unique solutions we came up with, which can be of great use to anyone willing to develop their own computer vision model for car damage detection.

Choosing Models

  • Semantic segmentation architecture — U-Net. We picked the PyTorch framework and, in particular, the U-Net model from the segmentation_models_pytorch library. We tested different backbones (pre-trained models that are used for feature extraction). Still, the best tradeoff between performance and accuracy was achieved with an efficientnet-b1 (EfficientNet architecture) backbone pre-trained on the ImageNet dataset.

import segmentation_models_pytorch as smp

# Single foreground class: "damage"
CLASSES = ['damage']

model_params = {"encoder_name": 'efficientnet-b1',   # EfficientNet-B1 backbone
                "encoder_weights": 'imagenet',       # backbone pre-trained on ImageNet
                "activation": None,                  # raw logits; sigmoid is applied in the loss
                "encoder_depth": 5,
                "decoder_channels": [256, 128, 64, 32, 16]
                }
model = smp.Unet(classes=len(CLASSES), **model_params)
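
As a quick sanity check (our addition, not part of the original snippet), a dummy forward pass confirms that the model outputs one logit mask per class:

import torch

# A batch of two 3-channel 256x256 images (input size must be divisible by 32)
dummy = torch.randn(2, 3, 256, 256)
with torch.no_grad():
    out = model(dummy)
print(out.shape)  # torch.Size([2, 1, 256, 256]): one "damage" channel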

  • Instance segmentation architecture — Mask R-CNN. We used the detectron2 library and tried different models from it. mask_rcnn_R_50_FPN_3x proved to be the best performing on the car damage detection task.

Training Models

  • Semantic segmentation architecture — U-Net. We fine-tuned the model on a pretty small dataset consisting of 59 training images and 11 images each for the evaluation and final test sets. We increased the effective dataset size by using different augmentation techniques, including random flipping, scaling, rotations, adding Gaussian noise, changing brightness, blurring, etc.
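
The augmentation code is not shown in the article; assuming a library such as albumentations, a pipeline covering the transforms listed above might look like this (all parameters are illustrative, not the project's settings):

import albumentations as A

# Geometric and photometric augmentations applied jointly to image and mask
train_transform = A.Compose([
    A.HorizontalFlip(p=0.5),                     # random flipping
    A.ShiftScaleRotate(shift_limit=0.05,
                       scale_limit=0.2,          # scaling
                       rotate_limit=20, p=0.5),  # rotations
    A.GaussNoise(p=0.3),                         # Gaussian noise
    A.RandomBrightnessContrast(p=0.3),           # brightness changes
    A.Blur(blur_limit=3, p=0.2),                 # blurring
])

# Usage: augmented = train_transform(image=image, mask=mask)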

For the loss function, we tried different options such as BCEWithLogitsLoss, DiceLoss, and LovaszHingeLoss. The best scores on the test dataset were achieved with a linear combination of two losses: DiceLoss and BCEWithLogitsLoss. Combining two loss functions is a rather unconventional approach, which nevertheless provided the best outcomes in the case of the car damage detection task.

import torch
import torch.nn as nn


class DiceLoss(nn.Module):
    def __init__(self, weight=None, size_average=True):
        super(DiceLoss, self).__init__()

    def forward(self, inputs, targets, smooth=1):
        # Comment this out if your model already ends with a sigmoid (or equivalent) activation
        inputs = torch.sigmoid(inputs)

        # Flatten label and prediction tensors
        inputs = inputs.view(-1)
        targets = targets.view(-1)

        intersection = (inputs * targets).sum()
        dice = (2. * intersection + smooth) / (inputs.sum() + targets.sum() + smooth)

        return 1 - dice


criterion = DiceLoss()
criterion_2 = nn.BCEWithLogitsLoss()

# …

# The combined loss: Dice + BCE computed on the same logits and ground-truth mask
loss = criterion(output.squeeze(1), mask) + criterion_2(output.squeeze(1), mask)

We trained this model using the AdamW optimizer, scheduling the learning rate via the OneCycleLR scheduler. Additionally, we adopted an early stopping algorithm that tracks the loss on the evaluation dataset (a sketch follows the snippet below).

# AdamW with weight decay; the learning rate follows a one-cycle schedule
optimizer = torch.optim.AdamW(model.parameters(),
                              lr=params['max_lr'],
                              weight_decay=params['weight_decay'])

sched = torch.optim.lr_scheduler.OneCycleLR(optimizer,
                                            params['max_lr'],
                                            epochs=params['epoch'],
                                            steps_per_epoch=len(train_dataloader))
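
The early stopping logic is not included in the article; a minimal sketch of the approach described above, with a hypothetical patience value and hypothetical train_one_epoch/evaluate helpers, could look like this:

# Stop training once the evaluation loss hasn't improved for `patience` epochs
best_eval_loss, patience, bad_epochs = float("inf"), 10, 0

for epoch in range(params['epoch']):
    train_one_epoch(model, train_dataloader, optimizer, sched)  # hypothetical helper
    eval_loss = evaluate(model, eval_dataloader)                # hypothetical helper
    if eval_loss < best_eval_loss:
        best_eval_loss, bad_epochs = eval_loss, 0
        torch.save(model.state_dict(), "best_model.pt")  # keep the best checkpoint
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break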

  • Instance segmentation architecture — Mask R-CNN. Again, we fine-tuned the model on a pretty small dataset consisting of 59 training images and 11 images each for the evaluation and final test sets. For detectron2, we used the built-in default loss with the default optimizer.

from detectron2 import model_zoo
from detectron2.config import get_cfg

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file("COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"))
cfg.DATASETS.TRAIN = ("car_damage_dataset_train",)  # note the trailing comma: detectron2 expects a tuple
cfg.DATASETS.TEST = ("car_damage_dataset_val",)
cfg.DATALOADER.NUM_WORKERS = 4
# Start from COCO-pretrained Mask R-CNN weights
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url("COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml")
cfg.SOLVER.IMS_PER_BATCH = 4
cfg.SOLVER.BASE_LR = 0.001
cfg.SOLVER.WARMUP_ITERS = 700
cfg.SOLVER.MAX_ITER = 700
cfg.SOLVER.STEPS = (600, 800)
cfg.SOLVER.GAMMA = 0.05
cfg.MODEL.ROI_HEADS.BATCH_SIZE_PER_IMAGE = 128
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 1  # single class: damage
cfg.MODEL.RETINANET.NUM_CLASSES = 1
cfg.TEST.EVAL_PERIOD = 600
cfg.MODEL.DEVICE = device
# CocoTrainer is a custom DefaultTrainer subclass that adds a COCO-style evaluator
trainer = CocoTrainer(cfg)
trainer.resume_or_load(resume=False)
trainer.train()

A quirk of how detectron2 makes predictions is that it can predict multiple different instances for one damaged part, as can be seen in the example below.

We tried to improve the outcomes by implementing post-processing that merges all predicted instances that overlap by more than a certain threshold. The results of our attempt can be seen in the example below.
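
The post-processing code is not included in the article; the idea can be sketched as greedily taking the union of instance masks whose IoU exceeds a threshold (the threshold value and helper names are our assumptions):

import numpy as np

def mask_iou(a, b):
    """IoU between two boolean masks."""
    intersection = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return intersection / union if union else 0.0

def merge_overlapping(masks, iou_threshold=0.5):
    """Greedily union instance masks that overlap by more than the threshold."""
    merged = []
    for m in masks:
        for i, kept in enumerate(merged):
            if mask_iou(m, kept) > iou_threshold:
                merged[i] = np.logical_or(kept, m)  # fuse into one damage region
                break
        else:
            merged.append(m)
    return merged

# e.g. masks = outputs["instances"].pred_masks.cpu().numpy().astype(bool)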

Testing and Comparing Models

ML engineers from Intelliarts AI chose the Mean Intersection-over-Union (mIoU) metric as the primary one for evaluating the outcomes of the U-Net and Mask R-CNN model comparison. Jumping ahead, we also utilized the Dice coefficient metric, but the results were less representative compared to the ones with mIoU.
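
For reference, here is a minimal sketch of how both metrics can be computed for binarized masks (the evaluation code itself is not part of the article):

import torch

def iou_score(pred, target, eps=1e-7):
    """IoU for binarized 0/1 mask tensors."""
    intersection = (pred * target).sum()
    union = pred.sum() + target.sum() - intersection
    return (intersection + eps) / (union + eps)

def dice_score(pred, target, eps=1e-7):
    """Dice coefficient; related to IoU by dice = 2 * iou / (1 + iou)."""
    intersection = (pred * target).sum()
    return (2 * intersection + eps) / (pred.sum() + target.sum() + eps)

# mIoU over a test set is the mean of the per-image IoU scores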

So, with the U-Net model, we achieved an mIoU of approximately 0.398. With detectron2, the mIoU ranged from approximately 0.249 to 0.3654 depending on the threshold, which is significantly worse than the result of the U-Net model.

We assumed that this is because U-Net treats all the damage in the image as a single segment, while instance segmentation tries to predict different parts of the same damage as separate damages, which, by the way, we tried to address by implementing the post-processing described above.

As for the visual representation of the results, we can also compare the predictions of both models with the ground truth masks, i.e., what the actual masks in the test dataset look like, as shown in the image below.

We can observe that U-Net tends to underestimate damage, while detectron2, on the contrary, is more inclined to overestimate it, masking larger segments and in higher quantity.

The End Result: Two-Model Computer Vision Solution

As stated, we decided to move on with the U-Net model. This was justified not only by the better results the U-Net model showed but also by the fact that, in our case, underestimating damage is better than overestimating it. The reason is that we aim to automate insurance claim processing by helping inspectors make assessments faster, while also minimizing the business risks of insurers, which include overestimating damage and, consequently, overpaying car owners.

In the end, we built and trained two separate U-net-based models, one of which is intended for car damage detection and the other one for car part detection only.

  • Model 1 for car damage detection. Allows for identifying the damaged area. It also provides a rough estimation of the magnitude of damage.
  • Model 2 for car part detection. Allows for identifying distinct car parts. It also provides the name of an affected part.

When a user submits an image of a damaged car, the resulting solution indicates the damage and identifies the affected car part separately. If the two masks intersect, the solution outputs something like “left door — dent.” The results are then compared to similar cases in a prepared image database with repair cost estimates.
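
The combination logic is not shown in the article; a hypothetical sketch of intersecting the damage mask from Model 1 with the per-part masks from Model 2 could look like this:

import numpy as np

def damaged_parts(damage_mask, part_masks, min_overlap=0.01):
    """Return the names of parts whose area overlaps the damage mask.

    part_masks: dict mapping a part name (e.g. "left door") to a boolean HxW mask.
    min_overlap: illustrative fraction of the part that must be covered by damage.
    """
    results = []
    for name, part in part_masks.items():
        intersection = np.logical_and(damage_mask, part)
        if part.sum() and intersection.sum() / part.sum() > min_overlap:
            results.append(name)
    return results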

You can observe the working principles of both models in the image below.

Demo Overview

We prepared a demo that shows how the described ML models work. The example is not as high-performing as the real-world model, but it provides an appropriate insight into the implementation of the technology. You can check the recording of the demo below:

Link

Or you can give a car damage detection demo a try yourself here: https://huggingface.co/spaces/intelliarts/Car_parts_damage_detection

Repo link: https://gitlab.com/r-d-machine-learning/researches/Car-damage-detection.git

Final Take

This ML solution is a non-trivial combination of distinct models: one for detecting car damage and another for identifying car parts. Combined, they allow for obtaining detailed information for car inspection. From our research, it’s evident that a semantic segmentation architecture, particularly U-Net, is probably the optimal choice for image processing tasks related to car damage detection.

However, it’s crucial to understand that the technology is intended for dealing with large batches of minor vehicle assessment cases. We position it as a useful instrument that human inspectors should master. Therefore, we believe that insurance firms will find our solution useful for claims processing automation. Finally, our engineering team regards the integration of the described technology into business processes as another step toward digital transformation.



Written by intelliartsai | ML enthusiasts from the Intelliarts AI team. We help companies to solve challenges by applying AI and leveraging data.
Published by HackerNoon on 2023/05/16