IV. EXPERIMENTS
In this section, we evaluate the one-stage YOLOv5 [15] and two-stage Faster R-CNN [4] networks with varied pre-processing, backbone networks, hyper-parameter tuning and training strategies to achieve a better average F1 score. We do not use an ensemble approach, considering that ensembles are great for competitions but rarely work well when deployed. For backbones, we use Resnet 50, Resnet 101, ResNeXt 101 [9, 20] and CSPNet [16], considering that once trained these weights can be pruned and compressed to run on smaller devices with minor degradation in accuracy.
In our experiments, we start with data pre-processing, where we use image augmentation such as resizing and orientation changes, and DeepLab V3+ [18] based segmentation to isolate the road surface for downstream assessment.
Next, we train a detection model for each country and look at its performance on the submitted F1 score. We also train a single model on the data for all three countries as a generalized approach. We focus more on the generalized approach, considering that the theme of this work and the challenge [23] is to obtain a model that can be transferred to other countries.
Finally, we look at the thresholding and proposal ranking methods applied to the detection results on the test datasets. This is important because the output that we submit to the challenge should contain only the top proposals.
A PyTorch and Detectron2 [14] based framework from Facebook AI Research (FAIR) was used to train and evaluate the Faster R-CNN [4] models, while a PyTorch based YOLOv5 [15] implementation from Ultralytics was used for comparison purposes. Both implementations are available to the community in open-source GitHub repositories. We customized the data loader and mapper objects to set up the codebase for experimentation. Both codebases support TensorBoard for tracking the training accuracy and optimization loss throughout the training process.
The experiments reported in the tables that follow give the model description, with epoch runs and chosen backbone network, in the first column. Hyper-parameters are described in the second column. The average F1 score is reported for the Test 1 and Test 2 datasets, depending on the experiment we ran.
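For readers less familiar with the metric, the F1 score is the harmonic mean of precision and recall. The sketch below computes it from detection counts. The function name and the handling of empty inputs are illustrative assumptions. How detections are matched to ground truth, and how the score is averaged for the challenge, is not described in this section and is not modeled here.

```python
def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Return F1 = 2PR / (P + R) from detection counts.

    Returning 0.0 when precision or recall is undefined is an
    assumption of this sketch, not a rule taken from the paper.
    """
    predicted = true_positives + false_positives  # all detections made
    actual = true_positives + false_negatives     # all ground-truth objects
    if true_positives == 0 or predicted == 0 or actual == 0:
        return 0.0
    precision = true_positives / predicted
    recall = true_positives / actual
    return 2 * precision * recall / (precision + recall)
```

For example, `f1_score(8, 2, 2)` corresponds to a precision and recall of 0.8 each.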
A. Pre-processing Images
We looked at segmentation as a way to eliminate background and noise from the image so that we can analyze features only on the road. A PyTorch and Detectron2 [14] based DeepLab V3+ [18] implementation is used for segmentation contours and image cropping.
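The cropping idea can be illustrated with a minimal sketch, assuming the segmentation output has already been reduced to a binary road mask (1 for road pixels, 0 otherwise). The helper names are hypothetical, and plain nested lists stand in for image arrays. This is not the Detectron2 pipeline used in the experiments.

```python
def road_bounding_box(mask):
    """Return (top, bottom, left, right) of the road pixels, or None if there are none.

    `bottom` and `right` are exclusive, so they can be used directly in slices.
    """
    if not mask or not mask[0]:
        return None
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    if not rows or not cols:
        return None
    return rows[0], rows[-1] + 1, cols[0], cols[-1] + 1


def crop_to_road(image, mask):
    """Crop `image` to the tight bounding box of the road mask.

    The image is returned unchanged when no road pixels are found.
    """
    box = road_bounding_box(mask)
    if box is None:
        return image
    top, bottom, left, right = box
    return [row[left:right] for row in image[top:bottom]]
```

In a real pipeline, any bounding box annotations would also need to be shifted by the crop offset (`left`, `top`) so that they stay aligned with the cropped image.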
We used a standard DeepLab V3+ [18] model trained on the Cityscapes semantic segmentation dataset. The model achieved fair segmentation on most roads in Japan and Czech, but on roads in India with gravel- and mud-like surfaces it did not do a good job of separating the road from the surrounding surfaces. We did a basic analysis, reported in Table II, to verify whether segmentation offered an improvement. This analysis used the annotations from all countries to train a single Faster R-CNN [4] model.
In our experiments, we did not observe any benefit from our segmentation approach. Model performance appears to deteriorate, which could stem from poor segmentation on the India dataset. We therefore proceed without segmentation for the rest of the experiments.
B. Model per country
We trained Faster R-CNN [4] models to fit the data of each country in order to establish a baseline. The expectation was that three separate models dedicated to Czech, Japan and India would achieve better accuracy than a single model. We compare the two approaches in Table III.
We get a 1.5% benefit in the average F1 score when we train per-country models on the baseline Train/Val (T) dataset. However, we take the approach of training a single model across all the countries' datasets, considering the benefits for deployment and model management.
C. Generalized Model
We generalize the model by training it on the data from all the countries in the dataset. Here we compare the two-stage Faster R-CNN [4] and one-stage YOLOv5 [15] detection models. Table IV shows that the two-stage detector outperforms the one-stage detector.
The data used in training these models consists of the Train/Val (T) baseline split described in the dataset section. For the second set in the table, we combine the Train and Test (T+T) data for training. Thereafter, we improve upon this by combining the Train and Val (T+V) data for training the remaining Faster R-CNN model runs. We do gain the expected benefit with this data composition.
The model description in Table IV consists of the model name, epoch runs and backbone network. We observe that the Faster R-CNN model performs better than YOLOv5. The hyper-parameters include the batch size, learning rate (LR) and LR step schedule. The scheduler decreases the LR by a gamma factor of 0.05 at each of the listed epoch values. LRs of 0.01 and 0.015 performed well with step schedules of (23k, 25k, 26k) and (25k, 28k) epochs, respectively.
Table IV shows the best F1 scores for Faster R-CNN [4]: in the Test 2 evaluation, the best score is obtained with a batch size of 640 and Resnet 50 [9], while in the Test 1 evaluation we observe that a batch size of 4096 with Resnet 101 [9] seems to work well.
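To make the step schedule concrete, the sketch below applies a multi-step decay rule in which the base LR is multiplied by gamma once for every milestone already reached. The function name is hypothetical, and warm-up and any other scheduler details of the training frameworks are deliberately omitted, so this illustrates the rule and is not the exact scheduler used in training.

```python
def stepped_lr(base_lr: float, gamma: float, milestones, step: int) -> float:
    """Learning rate at `step` under a multi-step decay schedule.

    The rate is multiplied by `gamma` once for each milestone that
    has been reached (step >= milestone). Warm-up is not modeled.
    """
    reached = sum(1 for m in milestones if step >= m)
    return base_lr * gamma ** reached


# Schedule reported for LR = 0.01: decay at 23k, 25k and 26k.
schedule = (23_000, 25_000, 26_000)
for step in (0, 23_000, 25_000, 26_000):
    print(step, stepped_lr(0.01, 0.05, schedule, step))
```

With gamma at 0.05, each milestone cuts the LR by a factor of 20, so after the third step the rate is roughly four orders of magnitude below its starting value.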
We look at the mean accuracy (IoU=.50:.05:.95) on the 5% split Test (T) dataset to monitor and track the progress of the models' training. Fig. 8 shows that both models achieve high bounding box accuracy on the D20 damage type on this dataset. However, the Resnet 50 model trained with batch size 640 seems to perform well on the D10 and D40 damage types, even though both of those classes have relatively few annotations.
When we look at damage classification accuracy in Fig. 9, the Resnet 101 [9] backbone with the larger batch size demonstrates higher accuracy. We also see that the LR step scheduler has a significant impact on the accuracy at around 23k for the smaller network and around 25k for the larger network. The model stops learning at around 30k epochs, and an early stopping method is used to end the training process. This keeps the model from overfitting the training data.
A generalized approach with a small network may allow the model to transfer across countries and reduce the deployment overhead based on the target conditions. However, a bigger network has higher classification accuracy.
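The notation IoU=.50:.05:.95 denotes ten intersection-over-union thresholds from 0.50 to 0.95 in steps of 0.05, over which the accuracy is averaged. The sketch below shows the IoU of two boxes and how many of those thresholds a single prediction would satisfy. Boxes are assumed to be in (x1, y1, x2, y2) format, and the helper names are illustrative. The full averaged metric also depends on ranking and matching steps that are not shown.

```python
# Ten thresholds: 0.50, 0.55, ..., 0.95
IOU_THRESHOLDS = [(50 + 5 * i) / 100 for i in range(10)]


def box_iou(box_a, box_b) -> float:
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def thresholds_satisfied(predicted_box, ground_truth_box) -> int:
    """Count how many of the ten IoU thresholds the prediction meets."""
    value = box_iou(predicted_box, ground_truth_box)
    return sum(1 for t in IOU_THRESHOLDS if value >= t)
```

A loosely localized box may count as correct at the 0.50 threshold yet fail the stricter ones, which is why this averaged measure rewards tight bounding boxes.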
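The early stopping criterion is not specified here. One common form, shown below purely as an assumption, is patience-based stopping: training ends once a monitored validation score has failed to improve for a fixed number of consecutive evaluations. The function name, `patience` and `min_delta` are all hypothetical parameters of this sketch.

```python
def early_stop_index(scores, patience: int, min_delta: float = 0.0):
    """Return the index of the evaluation at which training would stop.

    `scores` are validation scores where higher is better. Training
    stops after `patience` consecutive evaluations without an
    improvement of more than `min_delta` over the best score so far.
    Returns None if the stopping condition is never met.
    """
    best = float("-inf")
    stale = 0  # evaluations since the last improvement
    for index, score in enumerate(scores):
        if score > best + min_delta:
            best = score
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                return index
    return None
```

In practice, the checkpoint with the best validation score, and not the last one, would usually be kept for evaluation.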
D. Post-processing
In this step, we look at operations applied after detection. The resulting bounding boxes are filtered at a 0.7 confidence threshold. Additionally, the detections are sorted by confidence, and only the top 5 bounding boxes are kept for the submission.
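A minimal sketch of this filtering and ranking step is given below. The detection format (a dictionary with a `score` key) and the function name are assumptions made for illustration, as is treating the threshold as inclusive.

```python
def select_top_detections(detections, score_threshold: float = 0.7, max_detections: int = 5):
    """Keep confident detections and return at most the top few by score.

    Each detection is assumed to be a dict with a "score" entry.
    Detections scoring exactly at the threshold are kept (an
    assumption of this sketch).
    """
    confident = [d for d in detections if d["score"] >= score_threshold]
    ranked = sorted(confident, key=lambda d: d["score"], reverse=True)
    return ranked[:max_detections]


detections = [
    {"label": "D20", "score": 0.91, "box": (10, 20, 110, 90)},
    {"label": "D10", "score": 0.65, "box": (30, 40, 60, 200)},
    {"label": "D40", "score": 0.78, "box": (200, 150, 260, 210)},
]
print(select_top_detections(detections))
```

Here the D10 detection is dropped because its score is below 0.7, and the remaining two are returned in descending order of confidence.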
Authors:
(1) Rahul Vishwakarma, Big Data Analytics & Solutions Lab, Hitachi America Ltd. Research & Development, Santa Clara, CA, USA ([email protected]);
(2) Ravigopal Vennelakanti, Big Data Analytics & Solutions Lab, Hitachi America Ltd. Research & Development, Santa Clara, CA, USA ([email protected]).
This paper is