
Efficient Image Captioning for Edge Devices: Visualization Results


Authors:

(1) Ning Wang, Huawei Inc.;

(2) Jiangrong Xie, Huawei Inc.;

(3) Hang Luo, Huawei Inc.;

(4) Qinglin Cheng, Huawei Inc.;

(5) Jihao Wu, Huawei Inc.;

(6) Mingbo Jia, Huawei Inc.;

(7) Linlin Li, Huawei Inc.

Table of Links

Abstract and 1 Introduction

2 Related Work

3 Methodology and 3.1 Model Architecture

3.2 Model Training

3.3 Knowledge Distillation

4 Experiments

4.1 Datasets and Metrics and 4.2 Implementation Details

4.3 Ablation Study

4.4 Inference on the Mobile Device and 4.5 State-of-the-art Comparison

5 Conclusion and References

A Implementation Details

B Visualization Results

C Results on Nocaps

D Limitations and Future Work

B Visualization Results

B.1 Visualization of Visual Concept Extractor

Table 7: Inference latency of the proposed LightCap on the CPU device.


We visualize the image concept retrieval results in Figure 4. In the second column, we exhibit the foreground detection results of the tiny detector YOLOv5n. Although this detector is relatively weak and cannot match state-of-the-art two-stage detection methods, it is extremely light, with only 1.9M parameters. Moreover, accurate bounding boxes are not necessary for our framework: based on the roughly predicted foreground ROIs, we focus on retrieving the visual concepts of the image. As shown in the third column, our visual concept extractor predicts accurate and dense object tags that form the image concept.
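To illustrate this detect-then-retrieve idea, below is a hypothetical PyTorch sketch using the public yolov5n hub model and OpenAI's clip package. The CONCEPT_VOCAB list, the extract_concepts helper, and the top-k retrieval are illustrative assumptions, not the paper's exact implementation.

import torch
import clip                     # OpenAI's CLIP package
from PIL import Image

device = "cpu"
clip_model, preprocess = clip.load("ViT-B/32", device=device)
detector = torch.hub.load("ultralytics/yolov5", "yolov5n", pretrained=True)

# Illustrative tag vocabulary; the actual vocabulary is far larger.
CONCEPT_VOCAB = ["dessert", "cake", "spoon", "table", "person", "dog"]
with torch.no_grad():
    tag_emb = clip_model.encode_text(clip.tokenize(CONCEPT_VOCAB).to(device))
    tag_emb /= tag_emb.norm(dim=-1, keepdim=True)

def extract_concepts(image_path, top_k=3):
    """Retrieve the top-k concept tags for each rough foreground ROI."""
    image = Image.open(image_path).convert("RGB")
    boxes = detector(image_path).xyxy[0]          # (N, 6): x1, y1, x2, y2, conf, cls
    concepts = []
    for x1, y1, x2, y2, *_ in boxes.tolist():
        # A rough box suffices: we only need the crop, not a precise localization.
        crop = image.crop((int(x1), int(y1), int(x2), int(y2)))
        with torch.no_grad():
            feat = clip_model.encode_image(preprocess(crop).unsqueeze(0).to(device))
            feat /= feat.norm(dim=-1, keepdim=True)
        sims = (feat @ tag_emb.T).squeeze(0)      # cosine similarity to every tag
        concepts.append([CONCEPT_VOCAB[i] for i in sims.topk(top_k).indices])
    return concepts

Because both the crops and the tags live in the same CLIP embedding space, retrieval reduces to a cosine-similarity lookup, which is why imprecise boxes from a 1.9M-parameter detector are good enough.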


Figure 4: From left to right: input image, foreground detection results, and concept retrieval results. All testing images are from the COCO dataset (Lin et al. 2014).

B.2 Visualization of Cross-modal Modulator

In Figure 5, we further visualize the channel attentions of the retrieved visual concepts. For the given image, the first three visual concepts are Dessert, Cake, and Spoon. These visual concepts are projected to channel attentions that modulate the raw CLIP features. As shown in the bottom part of Figure 5, the activated channels are sparse: only a few channels yield attention values above 0.8, and most channel weights fall below 0.5. This supports our assumption that the raw CLIP features are redundant along the channel dimension. Moreover, the channel attentions of Dessert and Cake are similar, likely because these concepts lie close together in the semantic space, whereas the attention weights generated by Spoon differ markedly from both. Since different feature channels are widely understood to encode distinct semantics, our approach activates the informative channels using the retrieved concepts for effective image captioning.


Figure 5: Top: the predicted image caption, the ground-truth (GT) captions, and our predicted visual concepts. Bottom: the channel attention weights of the first three concepts (i.e., Dessert, Cake, and Spoon).


Figure 6: Uncurated image captioning examples on the first four images of the COCO Karpathy test split (Karpathy and Fei-Fei 2015), coupled with the corresponding ground-truth (GT) sentences.
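To make the cross-modal modulation described in B.2 concrete, here is a minimal PyTorch sketch of a channel-attention modulator. The class name ChannelModulator, the layer sizes, and the mean-fusion of the K concepts are illustrative assumptions rather than the paper's exact design.

import torch
import torch.nn as nn

class ChannelModulator(nn.Module):
    def __init__(self, d_txt=512, d_vis=768):
        super().__init__()
        # Project each concept embedding to a per-channel attention vector.
        self.proj = nn.Sequential(
            nn.Linear(d_txt, d_vis),
            nn.Sigmoid(),          # weights in (0, 1); most stay below 0.5
        )

    def forward(self, clip_feat, concept_emb):
        # clip_feat:    (B, N, d_vis) raw CLIP patch features
        # concept_emb:  (B, K, d_txt) retrieved concept embeddings
        attn = self.proj(concept_emb)           # (B, K, d_vis) channel attentions
        attn = attn.mean(dim=1, keepdim=True)   # fuse the K concepts (illustrative)
        return clip_feat * attn                 # re-weight the redundant channels

# Usage: three concepts (e.g., Dessert, Cake, Spoon) modulating 50 patch tokens.
mod = ChannelModulator()
feats = torch.randn(1, 50, 768)
concepts = torch.randn(1, 3, 512)
print(mod(feats, concepts).shape)   # torch.Size([1, 50, 768])

The sigmoid gate mirrors the behaviour visualized in Figure 5: each concept suppresses most channels and amplifies only the few that carry its semantics.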

B.3 Qualitative Evaluation

Finally, we exhibit the captioning results of our approach on the COCO-caption dataset (Karpathy and Fei-Fei 2015) in Figure 6, coupled with the ground-truth (GT) sentences. Figure 6 also showcases the results of the state-of-the-art OscarB method (Li et al. 2020b). Overall, on these uncurated images from the COCO Karpathy test split, our LightCap generates accurate captions that are comparable with those of the strong OscarB. Our approach even yields a more accurate caption than OscarB for the third picture, where OscarB predicts woman instead of man. Notably, this robust model achieves such promising results while requiring only 2% of the FLOPs of current state-of-the-art captioners.


This paper is available on arXiv under a CC BY 4.0 DEED license.

