Table of Links
- Related Work
- Preliminaries
- Methodology
- 4.1 Auto-Generation of Typographic Attack
- 6 Conclusion
Our research has developed a comprehensive typographic attack framework for benchmarking Vision-LLMs in AD systems, exploring their adoption, the potential impacts on decision-making autonomy, and the means by which such attacks can be physically implemented. Firstly, our dataset-agnostic framework can automatically generate misleading attack text that misdirects the reasoning of Vision-LLMs. Secondly, our linguistic formatting scheme is shown to augment attacks to a greater degree and can be extended to target multiple reasoning tasks simultaneously. Thirdly, our study of how these attacks can be implemented in physical traffic scenarios is critical for highlighting the need for defense models. Our empirical findings on the effectiveness, transferability, and realizability of typographic attacks in traffic environments demonstrate their effects on existing Vision-LLMs (e.g., LLaVA, Qwen-VL, VILA). This research underscores the urgent need for increased awareness within the community of the vulnerabilities that come with integrating Vision-LLMs into AD systems.
Limitations. A primary limitation of our typographic attack framework lies in its dependency on environmental control and predictability. While our framework demonstrates the vulnerability of Vision-LLMs to typographic manipulations in controlled settings, the variability and unpredictability of real-world traffic scenarios can significantly diminish the consistency and reproducibility of the attacks. Additionally, our attacks assume that AD systems do not evolve to recognize and mitigate such manipulations, which may not hold true as defensive technologies advance. Another limitation is the ethical concern of testing and deploying such attacks, which could endanger public safety if not managed carefully. This necessitates a careful approach to research and disclosure so that knowledge of these vulnerabilities does not enable malicious exploitation.
Safeguards. To safeguard against the vulnerabilities exposed by typographic attacks, it is essential to develop robust defensive mechanisms within AD systems. Although defensive techniques remain understudied in the current literature, there are ways forward to mitigate potential issues. Concurrent works investigate how more informative prompting can support reasoning that defends against these attacks [16], and how keyword training of Vision-LLMs can make these systems more resilient by conditioning their answers on specific prefixes [15]. Another basic approach is to detect and remove all non-essential text from the visual input before it reaches the model, as sketched below. Overall, it is necessary to foster a community-wide effort toward establishing standards and best practices for the secure deployment of Vision-LLMs in AD.
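As a minimal, hypothetical sketch of that last safeguard, the snippet below uses off-the-shelf OCR (here pytesseract, an assumed dependency that also requires the Tesseract binary) to black out detected words before a camera frame is handed to a Vision-LLM. The whitelist of "essential" sign text, the confidence threshold, and the masking strategy are illustrative assumptions, not part of our framework.

```python
# Hypothetical "detect and remove non-essential text" pre-filter:
# OCR the frame, then mask any detected word that is not on a small
# whitelist of legitimate sign text before the Vision-LLM sees the image.
from PIL import Image, ImageDraw
import pytesseract

ESSENTIAL = {"stop", "yield", "speed", "limit"}  # assumed whitelist of sign words

def scrub_typographic_text(frame: Image.Image, min_conf: float = 60.0) -> Image.Image:
    """Black out OCR-detected words that are not on the whitelist."""
    data = pytesseract.image_to_data(frame, output_type=pytesseract.Output.DICT)
    scrubbed = frame.copy()
    draw = ImageDraw.Draw(scrubbed)
    for i, word in enumerate(data["text"]):
        word = word.strip().lower()
        if not word or float(data["conf"][i]) < min_conf or word in ESSENTIAL:
            continue  # keep empty, low-confidence, or whitelisted detections
        x, y, w, h = (data[k][i] for k in ("left", "top", "width", "height"))
        draw.rectangle([x, y, x + w, y + h], fill="black")  # remove the text patch
    return scrubbed

# Usage: clean = scrub_typographic_text(Image.open("dashcam_frame.png"))
# The cleaned frame, rather than the raw one, is then passed to the Vision-LLM.
```

Inpainting the masked regions, or a learned detector of adversarial text, would be natural refinements of the same idea.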
Broader Impacts. The implications of our research into typographic attacks extend beyond the technical vulnerabilities of AD systems, touching on broader societal, ethical, and regulatory concerns. As Vision-LLMs and AD technologies proliferate, the potential for such attacks underscores the need for comprehensive safety and security frameworks that anticipate and mitigate unconventional threats. This research highlights the interplay between technology and human factors, illustrating how seemingly minor alterations in a traffic environment can lead to significant misjudgments by AD systems, potentially endangering public safety.
References
[1] Ji Lin, Hongxu Yin, Wei Ping, Yao Lu, Pavlo Molchanov, Andrew Tao, Huizi Mao, Jan Kautz, Mohammad Shoeybi, and Song Han. VILA: On pre-training for visual language models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2024.
[2] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In NeurIPS, 2023.
[3] Duzhen Zhang, Yahan Yu, Chenxing Li, Jiahua Dong, Dan Su, Chenhui Chu, and Dong Yu. MM-LLMs: Recent advances in multimodal large language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, 2024.
[4] Jinkyu Kim, Anna Rohrbach, Trevor Darrell, John F. Canny, and Zeynep Akata. Textual explanations for self-driving vehicles. In Vittorio Ferrari, Martial Hebert, Cristian Sminchisescu, and Yair Weiss, editors, Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part II, volume 11206 of Lecture Notes in Computer Science, pages 577–593. Springer, 2018.
[5] Hao Shao, Yuxuan Hu, Letian Wang, Steven L. Waslander, Yu Liu, and Hongsheng Li. LMDrive: Closed-loop end-to-end driving with large language models. In CVPR, 2024.
[6] Can Cui, Zichong Yang, Yupeng Zhou, Yunsheng Ma, Juanwu Lu, Lingxi Li, Yaobin Chen, Jitesh Panchal, and Ziran Wang. Personalized autonomous driving with large language models: Field experiments, 2024.
[7] Ana-Maria Marcu, Long Chen, Jan Hünermann, Alice Karnsund, Benoit Hanotte, Prajwal Chidananda, Saurabh Nair, Vijay Badrinarayanan, Alex Kendall, Jamie Shotton, and Oleg Sinavski. LingoQA: Video question answering for autonomous driving. arXiv preprint arXiv:2312.14115, 2023.
[8] Ming Nie, Renyuan Peng, Chunwei Wang, Xinyue Cai, Jianhua Han, Hang Xu, and Li Zhang. Reason2Drive: Towards interpretable and chain-based reasoning for autonomous driving. arXiv preprint, 2023.
[9] Zhenjie Yang, Xiaosong Jia, Hongyang Li, and Junchi Yan. LLM4Drive: A survey of large language models for autonomous driving. CoRR, abs/2311.01043, 2023.
[10] Haoqin Tu, Chenhang Cui, Zijun Wang, Yiyang Zhou, Bingchen Zhao, Junlin Han, Wangchunshu Zhou, Huaxiu Yao, and Cihang Xie. How many unicorns are in this image? a safety evaluation benchmark for vision LLMs. arXiv preprint arXiv:2311.16101, 2023.
[11] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pages 8748–8763. PMLR, 2021.
[12] Gabriel Goh, Nick Cammarata, Chelsea Voss, Shan Carter, Michael Petrov, Ludwig Schubert, Alec Radford, and Chris Olah. Multimodal neurons in artificial neural networks. Distill, 2021. https://distill.pub/2021/multimodal-neurons.
[13] Hao Cheng, Erjia Xiao, Jindong Gu, Le Yang, Jinhao Duan, Jize Zhang, Jiahang Cao, Kaidi Xu, and Renjing Xu. Unveiling typographic deceptions: Insights of the typographic vulnerability in large vision-language model. CoRR, abs/2402.19150, 2024.
[14] Maan Qraitem, Nazia Tasnim, Piotr Teterwak, Kate Saenko, and Bryan A. Plummer. Vision-LLMs can fool themselves with self-generated typographic attacks. CoRR, abs/2402.00626, 2024.
[15] Hiroki Azuma and Yusuke Matsui. Defense-prefix for preventing typographic attacks on CLIP. In IEEE/CVF International Conference on Computer Vision, ICCV 2023 - Workshops, Paris, France, October 2-6, 2023, pages 3646–3655. IEEE, 2023.
[16] Hao Cheng, Erjia Xiao, and Renjing Xu. Typographic attacks in large multimodal models can be alleviated by more informative prompts. arXiv preprint arXiv:2402.19150, 2024.
[17] Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-VL: A versatile vision-language model for understanding, localization, text reading, and beyond. arXiv preprint arXiv:2308.12966, 2023.
[18] Zhenwei Shao, Xuecheng Ouyang, Zhenbiao Gai, Zhou Yu, and Jun Yu. Imp: An empirical study of multimodal small language models, 2024.
[19] Danny Driess, Fei Xia, Mehdi S. M. Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, Wenlong Huang, Yevgen Chebotar, Pierre Sermanet, Daniel Duckworth, Sergey Levine, Vincent Vanhoucke, Karol Hausman, Marc Toussaint, Klaus Greff, Andy Zeng, Igor Mordatch, and Pete Florence. PaLM-E: An embodied multimodal language model, 2023.
[20] Rohan Bavishi, Erich Elsen, Curtis Hawthorne, Maxwell Nye, Augustus Odena, Arushi Somani, and Sağnak Taşırlar. Fuyu-8B: A multimodal architecture for AI agents, 2024.
[21] Wanrong Zhu, Jack Hessel, Anas Awadalla, Samir Yitzhak Gadre, Jesse Dodge, Alex Fang, Youngjae Yu, Ludwig Schmidt, William Yang Wang, and Yejin Choi. Multimodal C4: an open, billion-scale corpus of images interleaved with text. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, 2023.
[22] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob L. Menick, Sebastian Borgeaud, Andy Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikolaj Binkowski, Ricardo Barreira, Oriol Vinyals, Andrew Zisserman, and Karén Simonyan. Flamingo: a visual language model for few-shot learning. In Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022, 2022.
[23] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.
[24] Naveed Akhtar and Ajmal Mian. Threat of adversarial attacks on deep learning in computer vision: A survey. IEEE Access, 6:14410–14430, 2018.
[25] Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.
[26] Alexey Kurakin, Ian J Goodfellow, and Samy Bengio. Adversarial examples in the physical world. In Artificial intelligence safety and security, pages 99–112. Chapman and Hall/CRC, 2018.
[27] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.
[28] Cihang Xie, Zhishuai Zhang, Yuyin Zhou, Song Bai, Jianyu Wang, Zhou Ren, and Alan L Yuille. Improving transferability of adversarial examples with input diversity. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2730–2739, 2019.
[29] Xiaosen Wang, Xuanran He, Jingdong Wang, and Kun He. Admix: Enhancing the transferability of adversarial attacks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 16158–16167, 2021.
[30] Jianping Zhang, Jen-tse Huang, Wenxuan Wang, Yichen Li, Weibin Wu, Xiaosen Wang, Yuxin Su, and Michael R Lyu. Improving the transferability of adversarial samples by path-augmented method. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8173–8182, 2023.
[31] Jiadong Lin, Chuanbiao Song, Kun He, Liwei Wang, and John E Hopcroft. Nesterov accelerated gradient and scale invariance for adversarial attacks. arXiv preprint arXiv:1908.06281, 2019.
[32] Yinpeng Dong, Tianyu Pang, Hang Su, and Jun Zhu. Evading defenses to transferable adversarial examples by translation-invariant attacks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4312–4321, 2019.
[33] Zeyu Qin, Yanbo Fan, Yi Liu, Li Shen, Yong Zhang, Jue Wang, and Baoyuan Wu. Boosting the transferability of adversarial attacks with reverse adversarial perturbation. Advances in neural information processing systems, 35:29845–29858, 2022.
[34] Sensen Gao, Xiaojun Jia, Xuhong Ren, Ivor Tsang, and Qing Guo. Boosting transferability in vision-language attacks via diversification along the intersection region of adversarial trajectory. arXiv preprint arXiv:2403.12445, 2024.
[35] Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. OpenAI blog, 2019.
[36] Junnan Li, Ramprasaath R. Selvaraju, Akhilesh Deepak Gotmare, Shafiq Joty, Caiming Xiong, and Steven Hoi. Align before fuse: Vision and language representation learning with momentum distillation. In NeurIPS, 2021.
[37] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022, 2022.
[38] Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An open-ended embodied agent with large language models. Transactions on Machine Learning Research, 2024.
[39] Abulhair Saparov and He He. Language models are greedy reasoners: A systematic formal analysis of chain-of-thought. In The Eleventh International Conference on Learning Representations, 2023.
[40] Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, et al. Grounding DINO: Marrying DINO with grounded pre-training for open-set object detection. arXiv preprint arXiv:2303.05499, 2023.
[41] Thibault Sellam, Dipanjan Das, and Ankur P Parikh. BLEURT: Learning robust metrics for text generation. In Proceedings of ACL, 2020.
[42] Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. BERTScore: Evaluating text generation with BERT. In International Conference on Learning Representations, 2020.
[43] OpenAI team. GPT-4 technical report, 2024.
Authors:
(1) Nhat Chung, CFAR and IHPC, A*STAR, Singapore and VNU-HCM, Vietnam;
(2) Sensen Gao, CFAR and IHPC, A*STAR, Singapore and Nankai University, China;
(3) Tuan-Anh Vu, CFAR and IHPC, A*STAR, Singapore and HKUST, HKSAR;
(4) Jie Zhang, Nanyang Technological University, Singapore;
(5) Aishan Liu, Beihang University, China;
(6) Yun Lin, Shanghai Jiao Tong University, China;
(7) Jin Song Dong, National University of Singapore, Singapore;
(8) Qing Guo, CFAR and IHPC, A*STAR, Singapore and National University of Singapore, Singapore.