Authors:
(1) Ning Wang, Huawei Inc.;
(2) Jiangrong Xie, Huawei Inc.;
(3) Hang Luo, Huawei Inc.;
(4) Qinglin Cheng, Huawei Inc.;
(5) Jihao Wu, Huawei Inc.;
(6) Mingbo Jia, Huawei Inc.;
(7) Linlin Li, Huawei Inc.
Table of Links
3 Methodology and 3.1 Model Architecture
4 Experiments
4.1 Datasets and Metrics and 4.2 Implementation Details
4.4 Inference on the Mobile Device and 4.5 State-of-the-art Comparison
4.3 Ablation Study
Model Pre-training. It is well recognized that pre-training on a large-scale image-text corpus benefits image captioning. As shown in Table 1, for the student model with limited capacity, model pre-training significantly improves performance by 8.0 CIDEr score.
Visual Concept Extractor. The proposed visual concept extractor provides valuable clues for image captioning through efficient image-text retrieval. As shown in Table 1, for the student model, the visual concept extractor improves the captioning performance by 3.4 CIDEr score on the COCO dataset. This mechanism also improves the strong teacher model by 3.7 CIDEr score.
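For illustration, a minimal PyTorch-style sketch of concept retrieval via CLIP image-text similarity is given below; the concept vocabulary, the number of retrieved concepts, and the function names are assumptions for exposition rather than the paper's exact implementation.

```python
import torch.nn.functional as F

def retrieve_visual_concepts(image_feat, concept_embeds, top_k=5):
    """Return the indices of the top-k concepts most similar to the image.

    image_feat:     (d,)   CLIP image embedding
    concept_embeds: (V, d) pre-computed CLIP text embeddings of a concept vocabulary
    Shapes, vocabulary, and top_k are illustrative assumptions.
    """
    image_feat = F.normalize(image_feat, dim=-1)
    concept_embeds = F.normalize(concept_embeds, dim=-1)
    sims = concept_embeds @ image_feat      # (V,) cosine similarities
    return sims.topk(top_k).indices         # retrieved concept indices
```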
Cross-modal Modulator. The cross-modal modulator exploits the retrieved visual concepts to modulate the raw CLIP features. As shown in Table 1, on top of the student model with a visual concept extractor, the proposed cross-modal modulator further improves the captioning performance by 1.8 CIDEr score. This lightweight block also boosts the strong teacher model by 2.1 CIDEr score.
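A minimal sketch of one possible modulation scheme follows, assuming a channel-wise gating of the CLIP visual tokens by the pooled concept embeddings; the actual block in the paper may differ.

```python
import torch.nn as nn

class CrossModalModulator(nn.Module):
    """Sketch: gate CLIP visual tokens with pooled concept embeddings.
    The sigmoid channel-wise gating is an assumption, not the paper's exact design."""

    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

    def forward(self, clip_feats, concept_embeds):
        # clip_feats:     (B, N, d) raw CLIP visual tokens
        # concept_embeds: (B, K, d) embeddings of the K retrieved visual concepts
        ctx = concept_embeds.mean(dim=1)        # (B, d) pooled concept context
        scale = self.gate(ctx).unsqueeze(1)     # (B, 1, d) channel-wise gate
        return clip_feats * scale               # modulated visual features
```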
Sequential Model Distillation. In Table 2, we ablate the knowledge distillation (KD) techniques in our approach. First, we investigate KD in the pre-training stage in Table 2 (top). In these experiments, we adopt only the standard cross-entropy optimization without any KD in the fine-tuning stage. In the pre-training stage, “attention & representation distillation” improves the CIDEr score by 0.8, and distillation of the output token probabilities improves it by 2.0. Considering the characteristics of cross-modal training, we further propose to distill the soft predictions of the anchor words (i.e., visual concepts), which brings an additional 1.2 CIDEr gain. This indicates that concept distillation facilitates cross-modal alignment.
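A minimal PyTorch-style sketch of the three pre-training KD terms discussed above is shown below; the loss forms (MSE on intermediate states, softened KL for token and concept distillation), the equal weighting, and the assumption that student and teacher states are already projected to the same shape are illustrative.

```python
import torch
import torch.nn.functional as F

def pretrain_kd_loss(s_attn, t_attn, s_hidden, t_hidden,
                     s_logits, t_logits, concept_mask=None, tau=1.0):
    """s_*/t_*: student/teacher attention maps, hidden states, and output logits
    (assumed to share shapes, e.g., after a projection layer).
    concept_mask: (B, T) bool mask marking anchor-word (visual concept) positions."""
    # attention & representation distillation on intermediate states
    l_attn = F.mse_loss(s_attn, t_attn)
    l_repr = F.mse_loss(s_hidden, t_hidden)

    # output token probability distillation with softened distributions
    l_token = F.kl_div(F.log_softmax(s_logits / tau, dim=-1),
                       F.softmax(t_logits / tau, dim=-1),
                       reduction="batchmean") * tau ** 2

    # concept distillation: the same KL, restricted to anchor-word positions
    l_concept = torch.zeros((), device=s_logits.device)
    if concept_mask is not None and concept_mask.any():
        l_concept = F.kl_div(F.log_softmax(s_logits[concept_mask] / tau, dim=-1),
                             F.softmax(t_logits[concept_mask] / tau, dim=-1),
                             reduction="batchmean") * tau ** 2

    return l_attn + l_repr + l_token + l_concept
```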
Next, we investigate KD in the model fine-tuning stage. As shown in Table 2, starting from the distilled fusion model of the pre-training stage, “attention & representation distillation” and “output token distillation” in the fine-tuning stage further improve the CIDEr score by 1.1 and 2.6, respectively. Combining the above KD techniques achieves the best result, a 3.3 CIDEr gain. Finally, with model distillation in both pre-training and fine-tuning, our lightweight student model achieves a promising captioning performance of 37.1 BLEU@4 and 124.1 CIDEr, and closely matches the strong teacher model (i.e., 37.5 BLEU@4 and 126.3 CIDEr in Table 1).
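As a sketch, the fine-tuning objective can be viewed as the standard captioning cross-entropy plus the same softened distillation term; the weighting factor and temperature below are assumptions.

```python
import torch.nn.functional as F

def finetune_loss(s_logits, t_logits, target_ids, kd_weight=1.0, tau=1.0):
    """s_logits/t_logits: (B, T, V) student/teacher token logits;
    target_ids: (B, T) ground-truth caption tokens (pad positions set to -100).
    The 1:1 weighting between CE and KD is illustrative."""
    ce = F.cross_entropy(s_logits.flatten(0, 1), target_ids.flatten(),
                         ignore_index=-100)
    kd = F.kl_div(F.log_softmax(s_logits / tau, dim=-1),
                  F.softmax(t_logits / tau, dim=-1),
                  reduction="batchmean") * tau ** 2
    return ce + kd_weight * kd
```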
Ensemble Model Distillation. The above experiments use the single-head setting, whereas our full model adopts an ensemble head for superior performance. To encourage prediction diversity, we prepare three teachers that individually distill these heads. As shown in Table 2, the ensemble head module together with ensemble KD improves the CIDEr score by 1.7.
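A minimal sketch of the ensemble distillation is given below, where each prediction head is paired with its own teacher so the heads stay diverse; the three-way pairing comes from the text, while the softened-KL loss form is an assumption.

```python
import torch.nn.functional as F

def ensemble_kd_loss(head_logits, teacher_logits, tau=1.0):
    """head_logits, teacher_logits: lists of (B, T, V) tensors,
    one entry per student head and its dedicated teacher."""
    loss = 0.0
    for s, t in zip(head_logits, teacher_logits):
        loss = loss + F.kl_div(F.log_softmax(s / tau, dim=-1),
                               F.softmax(t / tau, dim=-1),
                               reduction="batchmean") * tau ** 2
    return loss / len(head_logits)
```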
This paper is available on arxiv under CC BY 4.0 DEED license.