Authors:
(1) Divyanshu Kumar, Enkrypt AI;
(2) Anurakt Kumar, Enkrypt AI;
(3) Sahil Agarwa, Enkrypt AI;
(4) Prashanth Harshangi, Enkrypt AI.
2 Problem Formulation and Experiments
Our work investigates the LLM’s safety against Jailbreak attempts. We have demonstrated how fine-tuned and quantized models are vulnerable to jailbreak attempts and stress the importance of using external guardrails to reduce this risk. Fine-tuning or quantizing model weights alters the risk profile of LLMs, potentially undermining the safety alignment established through RLHF. This could result from catastrophic forgetting, where LLMs lose memory of safety protocols, or the finetuning process shifting the model’s focus to new topics at the expense of existing safety measures.
The lack of safety measures in these fine-tuned and quantized models is concerning, highlighting the need to incorporate safety protocols during the fine-tuning process. We propose using these tests as part of a CI/CD stress test before deploying the model. The effectiveness of guardrails in preventing jailbreaking highlights the importance of integrating them with safety practices in AI development.
This approach not only enhances AI models but also establishes a new standard for responsible AI development. By ensuring that AI advancements prioritize innovation and safety, we promote ethical AI deployment, safeguarding against potential misuse and fostering a secure digital future.
Zifan Wang Andy Zou. AdvBench Dataset, July 2023. URL https://github.com/ llm-attacks/llm-attacks/tree/main/data.
Patrick Chao, Alexander Robey, Edgar Dobriban, Hamed Hassani, George J. Pappas, and Eric Wong. Jailbreaking Black Box Large Language Models in Twenty Queries. arXiv, October 2023. doi: 10.48550/arXiv.2310.08419.
Edoardo Debenedetti, Giorgio Severi, Nicholas Carlini, Christopher A. Choquette-Choo, Matthew Jagielski, Milad Nasr, Eric Wallace, and Florian Tramer. Privacy Side Channels in Machine ` Learning Systems. arXiv, September 2023. doi: 10.48550/arXiv.2309.05610.
Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. QLoRA: Efficient Finetuning of Quantized LLMs. arXiv, May 2023. doi: 10.48550/arXiv.2305.14314.
Micah Gorsline, James Smith, and Cory Merkel. On the Adversarial Robustness of Quantized Neural Networks. arXiv, May 2021. doi: 10.1145/3453688.3461755.
Kai Greshake, Sahar Abdelnabi, Shailesh Mishra, Christoph Endres, Thorsten Holz, and Mario Fritz. Not what you’ve signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection. arXiv, February 2023. doi: 10.48550/arXiv.2302.12173.
Bing He, Mustaque Ahamad, and Srijan Kumar. PETGEN: Personalized Text Generation Attack on Deep Sequence Embedding-based Classification Models. In KDD ’21: Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, pp. 575–584. Association for Computing Machinery, New York, NY, USA, August 2021. ISBN 978-1-45038332-5. doi: 10.1145/3447548.3467390.
Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-Rank Adaptation of Large Language Models. arXiv, June 2021. doi: 10.48550/arXiv.2106.09685.
Shuhei Kashiwamura, Ayaka Sakata, and Masaaki Imaizumi. Effect of Weight Quantization on Learning Models by Typical Case Analysis. arXiv, January 2024. doi: 10.48550/arXiv.2401. 17269.
Siwon Kim, Sangdoo Yun, Hwaran Lee, Martin Gubri, Sungroh Yoon, and Seong Joon Oh. ProPILE: Probing Privacy Leakage in Large Language Models. arXiv, July 2023. doi: 10.48550/arXiv. 2307.01881.
Aounon Kumar, Chirag Agarwal, Suraj Srinivas, Aaron Jiaxun Li, Soheil Feizi, and Himabindu Lakkaraju. Certifying LLM Safety against Adversarial Prompting. arXiv, September 2023. doi: 10.48550/arXiv.2309.02705.
Thai Le, Suhang Wang, and Dongwon Lee. MALCOM: Generating Malicious Comments to Attack Neural Fake News Detection Models. IEEE Computer Society, November 2020. ISBN 978-1- 7281-8316-9. doi: 10.1109/ICDM50108.2020.00037.
Yi Liu, Gelei Deng, Zhengzi Xu, Yuekang Li, Yaowen Zheng, Ying Zhang, Lida Zhao, Tianwei Zhang, and Yang Liu. Jailbreaking ChatGPT via Prompt Engineering: An Empirical Study. arXiv, May 2023. doi: 10.48550/arXiv.2305.13860.
Anay Mehrotra, Manolis Zampetakis, Paul Kassianik, Blaine Nelson, Hyrum Anderson, Yaron Singer, and Amin Karbasi. Tree of Attacks: Jailbreaking Black-Box LLMs Automatically. arXiv, December 2023. doi: 10.48550/arXiv.2312.02119.
Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedback. arXiv, March 2022. doi: 10.48550/arXiv.2203.02155.
Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal, and Peter Henderson. Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To! arXiv, October 2023. doi: 10.48550/arXiv.2310.03693.
Traian Rebedea, Razvan Dinu, Makesh Narsimhan Sreedhar, Christopher Parisien, and Jonathan Cohen. NeMo Guardrails: A Toolkit for Controllable and Safe LLM Applications with Programmable Rails. ACL Anthology, pp. 431–445, December 2023. doi: 10.18653/v1/2023. emnlp-demo.40.
Alexander Wei, Nika Haghtalab, and Jacob Steinhardt. Jailbroken: How Does LLM Safety Training Fail? arXiv, July 2023. doi: 10.48550/arXiv.2307.02483.
Martin Weyssow, Xin Zhou, Kisub Kim, David Lo, and Houari Sahraoui. Exploring ParameterEfficient Fine-Tuning Techniques for Code Generation with Large Language Models. arXiv, August 2023. doi: 10.48550/arXiv.2308.10462.
Yisong Xiao, Aishan Liu, Tianyuan Zhang, Haotong Qin, Jinyang Guo, and Xianglong Liu. RobustMQ: Benchmarking Robustness of Quantized Models. arXiv, August 2023. doi: 10.48550/ arXiv.2308.02350.
Andy Zhou, Bo Li, and Haohan Wang. Robust Prompt Optimization for Defending Language Models Against Jailbreaking Attacks. arXiv, January 2024. doi: 10.48550/arXiv.2401.17263.
Sicheng Zhu, Ruiyi Zhang, Bang An, Gang Wu, Joe Barrow, Zichao Wang, Furong Huang, Ani Nenkova, and Tong Sun. AutoDAN: Interpretable Gradient-Based Adversarial Attacks on Large Language Models. arXiv, October 2023. doi: 10.48550/arXiv.2310.15140.
Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, and Matt Fredrikson. Universal and Transferable Adversarial Attacks on Aligned Language Models. arXiv, July 2023. doi: 10.48550/arXiv.2307.15043.
This paper is available on arxiv under CC BY-NC-SA 4.0 DEED license.