
Efficient Image Captioning for Edge Devices: Ablation Study


Authors:

(1) Ning Wang, Huawei Inc.;

(2) Jiangrong Xie, Huawei Inc.;

(3) Hang Luo, Huawei Inc.;

(4) Qinglin Cheng, Huawei Inc.;

(5) Jihao Wu, Huawei Inc.;

(6) Mingbo Jia, Huawei Inc.;

(7) Linlin Li, Huawei Inc.;

Table of Links

Abstract and 1 Introduction

2 Related Work

3 Methodology and 3.1 Model Architecture

3.2 Model Training

3.3 Knowledge Distillation

4 Experiments

4.1 Datasets and Metrics and 4.2 Implementation Details

4.3 Ablation Study

4.4 Inference on the Mobile Device and 4.5 State-of-the-art Comparison

5 Conclusion and References

A Implementation Details

B Visualization Results

C Results on Nocaps

D Limitations and Future Work

4.3 Ablation Study

Model Pre-training. It is well recognized that model pre-training on a large-scale image-text corpus benefits image captioning. As shown in Table 1, for the student model with limited capacity, model pre-training significantly improves performance by 8.0 CIDEr score.


Visual Concept Extractor. The proposed visual concept extractor provides valuable clues for image captioning via efficient image-text retrieval. As shown in Table 1, for the student model, the visual concept extractor improves captioning performance by 3.4 CIDEr score on the COCO dataset. This mechanism also improves the strong teacher model by 3.7 CIDEr score.
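As a rough illustration, the following Python sketch shows one way such a retrieval-based concept extractor could work, assuming CLIP-style image and concept-word embeddings. The function and variable names here are hypothetical; the actual LightCap implementation may differ.

```python
import torch
import torch.nn.functional as F

def retrieve_visual_concepts(image_embedding: torch.Tensor,
                             concept_embeddings: torch.Tensor,
                             concept_vocab: list,
                             top_k: int = 20) -> list:
    """Return the top-k concept words whose (pre-computed) text embeddings
    are most similar to the image embedding under cosine similarity."""
    # Normalize so that the dot product equals cosine similarity.
    img = F.normalize(image_embedding, dim=-1)      # shape: (d,)
    txt = F.normalize(concept_embeddings, dim=-1)   # shape: (V, d)
    scores = txt @ img                              # shape: (V,)
    top_idx = scores.topk(top_k).indices
    return [concept_vocab[i] for i in top_idx.tolist()]
```

Because the concept embeddings can be pre-computed offline, the retrieval at inference time reduces to a single matrix-vector product, which keeps this step cheap on edge devices.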


Cross-modal Modulator. The cross-modal modulator takes advantage of the retrieved visual concepts to modulate the raw CLIP features. As shown in Table 1, based on the student model with a visual concept extractor, the proposed cross-modal modulator further improves captioning performance by 1.8 CIDEr score. This tiny block also boosts the strong teacher model by 2.1 CIDEr score.
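A minimal, FiLM-style sketch of such a modulation block is given below. The layer names and the scale/shift formulation are illustrative assumptions rather than the exact LightCap design.

```python
import torch
import torch.nn as nn

class CrossModalModulator(nn.Module):
    """Tiny block that rescales raw image features using the retrieved
    concept embeddings (a FiLM-style sketch, not the exact LightCap block)."""

    def __init__(self, feat_dim: int, concept_dim: int):
        super().__init__()
        self.to_scale = nn.Linear(concept_dim, feat_dim)
        self.to_shift = nn.Linear(concept_dim, feat_dim)

    def forward(self,
                image_feats: torch.Tensor,     # (B, N, feat_dim) raw CLIP features
                concept_embeds: torch.Tensor   # (B, K, concept_dim) retrieved concepts
                ) -> torch.Tensor:
        # Pool the retrieved concepts into a single conditioning vector.
        ctx = concept_embeds.mean(dim=1)               # (B, concept_dim)
        scale = torch.sigmoid(self.to_scale(ctx))      # (B, feat_dim)
        shift = self.to_shift(ctx)                     # (B, feat_dim)
        # Channel-wise modulation of every spatial/grid feature.
        return image_feats * scale.unsqueeze(1) + shift.unsqueeze(1)
```

Note that a block like this adds only two linear layers, which is consistent with the paper's description of the modulator as a "tiny" component.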


Table 1: Ablation study of the proposed LightCap. To better investigate the contribution of each component, the student model does not employ any knowledge distillation and uses a single-head model. The evaluation metrics are BLEU@4 (B@4), METEOR (M), CIDEr (C), and SPICE (S) scores on the COCO-caption Karpathy test split (Lin et al. 2014).


Table 2: Ablation study of the proposed LightCap method using different distillation techniques. “Atten&Rep”, “Caption”, and “Concept” denote knowledge distillation on attention weights and hidden representations, token probabilities, and concept probabilities, respectively. Finally, we adopt the ensemble head block and leverage ensemble distillation to optimize the overall model.


Sequential Model Distillation. In Table 2, we ablate the model knowledge distillation (KD) techniques in our approach. First, we investigate KD in the pre-training stage in Table 2 (top). In these experiments, we only adopt the standard cross-entropy optimization without any KD in the fine-tuning stage. In the pre-training stage, the “attention & representation distillation” improves the CIDEr score by 0.8, and the distillation of the output token probability improves it by 2.0. Considering the characteristics of cross-modal training, we further propose to distill the soft predictions of the anchor words (i.e., visual concepts), which brings an additional 1.2 CIDEr gain. This indicates that concept distillation facilitates cross-modal alignment.
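For illustration, the sketch below combines the three distillation terms described above (attention & hidden representation, output token probability, and concept probability). The dictionary keys, equal loss weights, and temperature are assumptions for the sake of a self-contained example, not the paper's exact choices.

```python
import torch
import torch.nn.functional as F

def kd_loss(student: dict, teacher: dict, T: float = 2.0) -> torch.Tensor:
    """Combine the three distillation terms.
    `student` / `teacher` map names to intermediate tensors of matching shape."""
    # 1) Attention & hidden-representation distillation (MSE between tensors).
    l_attn = F.mse_loss(student["attn"], teacher["attn"])
    l_rep = F.mse_loss(student["hidden"], teacher["hidden"])

    # 2) Output-token probability distillation (temperature-scaled KL).
    l_token = F.kl_div(F.log_softmax(student["token_logits"] / T, dim=-1),
                       F.softmax(teacher["token_logits"] / T, dim=-1),
                       reduction="batchmean") * T * T

    # 3) Concept (anchor-word) probability distillation.
    l_concept = F.kl_div(F.log_softmax(student["concept_logits"] / T, dim=-1),
                         F.softmax(teacher["concept_logits"] / T, dim=-1),
                         reduction="batchmean") * T * T

    return l_attn + l_rep + l_token + l_concept
```

In practice this KD term would be added to the standard captioning cross-entropy loss, with the relative weighting tuned per stage (pre-training vs. fine-tuning).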


Next, we investigate KD in the model fine-tuning stage. As shown in Table 2, based on the distilled fusion model from the pre-training stage, “attention & representation distillation” and “output token distillation” in the fine-tuning stage further improve the CIDEr score by 1.1 and 2.6, respectively. Combining the above KD techniques yields the best result, a 3.3 CIDEr gain. Finally, by virtue of the model distillation in both pre-training and fine-tuning, our lightweight student model achieves a promising captioning performance of 37.1 BLEU@4 and 124.1 CIDEr, and even matches the strong teacher model (i.e., 37.5 BLEU@4 and 126.3 CIDEr in Table 1).


Table 3: Model details of the proposed LightCap, including the number of parameters (in M), model size (in MB), and computational complexity (FLOPs, in G).


Ensemble Model Distillation. The above experiments are based on the single-head setting. In practice, our model adopts the ensemble head for superior performance. To encourage prediction diversity, we prepare three teachers to individually distill these heads. As shown in Table 2, the ensemble head module and ensemble KD improve the CIDEr score by 1.7.
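A minimal sketch of this one-teacher-per-head ensemble distillation could look as follows; the KL-based loss form and the temperature are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def ensemble_kd_loss(student_head_logits: list,
                     teacher_logits: list,
                     T: float = 2.0) -> torch.Tensor:
    """One-to-one distillation: the i-th student head learns from the i-th
    teacher, which keeps the heads from collapsing to the same prediction."""
    assert len(student_head_logits) == len(teacher_logits)
    total = 0.0
    for s_logits, t_logits in zip(student_head_logits, teacher_logits):
        total = total + F.kl_div(F.log_softmax(s_logits / T, dim=-1),
                                 F.softmax(t_logits / T, dim=-1),
                                 reduction="batchmean") * T * T
    return total / len(student_head_logits)
```

At inference time the head predictions would be averaged (or otherwise ensembled), so the diversity encouraged during distillation translates into the 1.7 CIDEr gain reported above.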


This paper is available on arXiv under the CC BY 4.0 DEED license.

