The Dual-Edged Sword of Vision-LLMs in AD: Reasoning Capabilities vs. Attack Vulnerabilities

Table of Links

Related Work

2.1 Vision-LLMs

2.2 Transferable Adversarial Attacks
Preliminaries

3.1 Revisiting Auto-Regressive Vision-LLMs

3.2 Typographic Attacks in Vision-LLMs-based AD Systems
Methodology

4.1 Auto-Generation of Typographic Attack

4.2 Augmentations of Typographic Attack

4.3 Realizations of Typographic Attacks
Experiments
Conclusion and References

3.2 Typographic Attacks in Vision-LLMs-based AD Systems

The integration of Vision-LLMs into end-to-end AD systems has brought promising results thus far [9], where Vision-LLMs can enhance user trust through explicit reasoning steps of the scene. On the one hand, language reasoning in AD systems can elevate their capabilities by utilizing the learned commonsense of LLMs, while being able to proficiently communicate to users. On the other hand, exposing Vision-LLMs to public traffic scenarios not only makes them more vulnerable to typographic attacks that misdirect the reasoning process but can also prove harmful if their results are connected with decision-making, judgment, and control processes.

Authors:

(1) Nhat Chung, CFAR and IHPC, A*STAR, Singapore and VNU-HCM, Vietnam;

(2) Sensen Gao, CFAR and IHPC, A*STAR, Singapore and Nankai University, China;

(3) Tuan-Anh Vu, CFAR and IHPC, A*STAR, Singapore and HKUST, HKSAR;

(4) Jie Zhang, Nanyang Technological University, Singapore;

(5) Aishan Liu, Beihang University, China;

(6) Yun Lin, Shanghai Jiao Tong University, China;

(7) Jin Song Dong, National University of Singapore, Singapore;

(8) Qing Guo, CFAR and IHPC, A*STAR, Singapore and National University of Singapore, Singapore.

This paper is available on arxiv under CC BY 4.0 DEED license.