Authors:
(1) Divyanshu Kumar, Enkrypt AI;
(2) Anurakt Kumar, Enkrypt AI;
(3) Sahil Agarwa, Enkrypt AI;
(4) Prashanth Harshangi, Enkrypt AI. Table of Links Abstract and 1 Introduction 2 Problem Formulation and Experiments 3 Experiment Set-up & Results 4 Conclusion and References A. Appendix 4 CONCLUSION Our work investigates the LLM’s safety against Jailbreak attempts. We have demonstrated how fine-tuned and quantized models are vulnerable to jailbreak attempts and stress the importance of using external guardrails to reduce this risk. Fine-tuning or quantizing model weights alters the risk profile of LLMs, potentially undermining the safety alignment established through RLHF. This could result from catastrophic forgetting, where LLMs lose memory of safety protocols, or the finetuning process shifting the model’s focus to new topics at the expense of existing safety measures. The lack of safety measures in these fine-tuned and quantized models is concerning, highlighting the need to incorporate safety protocols during the fine-tuning process. We propose using these tests as part of a CI/CD stress test before deploying the model. The effectiveness of guardrails in preventing jailbreaking highlights the importance of integrating them with safety practices in AI development. This approach not only enhances AI models but also establishes a new standard for responsible AI development. By ensuring that AI advancements prioritize innovation and safety, we promote ethical AI deployment, safeguarding against potential misuse and fostering a secure digital future. REFERENCES Zifan Wang Andy Zou. AdvBench Dataset, July 2023. URL https://github.com/ llm-attacks/llm-attacks/tree/main/data. Patrick Chao, Alexander Robey, Edgar Dobriban, Hamed Hassani, George J. Pappas, and Eric Wong. Jailbreaking Black Box Large Language Models in Twenty Queries. arXiv, October 2023. doi: 10.48550/arXiv.2310.08419. Edoardo Debenedetti, Giorgio Severi, Nicholas Carlini, Christopher A. Choquette-Choo, Matthew Jagielski, Milad Nasr, Eric Wallace, and Florian Tramer. Privacy Side Channels in Machine ` Learning Systems. arXiv, September 2023. doi: 10.48550/arXiv.2309.05610. Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. QLoRA: Efficient Finetuning of Quantized LLMs. arXiv, May 2023. doi: 10.48550/arXiv.2305.14314. Micah Gorsline, James Smith, and Cory Merkel. On the Adversarial Robustness of Quantized Neural Networks. arXiv, May 2021. doi: 10.1145/3453688.3461755. Kai Greshake, Sahar Abdelnabi, Shailesh Mishra, Christoph Endres, Thorsten Holz, and Mario Fritz. Not what you’ve signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection. arXiv, February 2023. doi: 10.48550/arXiv.2302.12173. Bing He, Mustaque Ahamad, and Srijan Kumar. PETGEN: Personalized Text Generation Attack on Deep Sequence Embedding-based Classification Models. In KDD ’21: Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, pp. 575–584. Association for Computing Machinery, New York, NY, USA, August 2021. ISBN 978-1-45038332-5. doi: 10.1145/3447548.3467390. Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-Rank Adaptation of Large Language Models. arXiv, June 2021. doi: 10.48550/arXiv.2106.09685. Shuhei Kashiwamura, Ayaka Sakata, and Masaaki Imaizumi. Effect of Weight Quantization on Learning Models by Typical Case Analysis. arXiv, January 2024. doi: 10.48550/arXiv.2401. 17269. Siwon Kim, Sangdoo Yun, Hwaran Lee, Martin Gubri, Sungroh Yoon, and Seong Joon Oh. ProPILE: Probing Privacy Leakage in Large Language Models. arXiv, July 2023. doi: 10.48550/arXiv. 2307.01881. Aounon Kumar, Chirag Agarwal, Suraj Srinivas, Aaron Jiaxun Li, Soheil Feizi, and Himabindu Lakkaraju. Certifying LLM Safety against Adversarial Prompting. arXiv, September 2023. doi: 10.48550/arXiv.2309.02705. Thai Le, Suhang Wang, and Dongwon Lee. MALCOM: Generating Malicious Comments to Attack Neural Fake News Detection Models. IEEE Computer Society, November 2020. ISBN 978-1- 7281-8316-9. doi: 10.1109/ICDM50108.2020.00037. Yi Liu, Gelei Deng, Zhengzi Xu, Yuekang Li, Yaowen Zheng, Ying Zhang, Lida Zhao, Tianwei Zhang, and Yang Liu. Jailbreaking ChatGPT via Prompt Engineering: An Empirical Study. arXiv, May 2023. doi: 10.48550/arXiv.2305.13860. Anay Mehrotra, Manolis Zampetakis, Paul Kassianik, Blaine Nelson, Hyrum Anderson, Yaron Singer, and Amin Karbasi. Tree of Attacks: Jailbreaking Black-Box LLMs Automatically. arXiv, December 2023. doi: 10.48550/arXiv.2312.02119. Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedback. arXiv, March 2022. doi: 10.48550/arXiv.2203.02155. Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal, and Peter Henderson. Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To! arXiv, October 2023. doi: 10.48550/arXiv.2310.03693. Traian Rebedea, Razvan Dinu, Makesh Narsimhan Sreedhar, Christopher Parisien, and Jonathan Cohen. NeMo Guardrails: A Toolkit for Controllable and Safe LLM Applications with Programmable Rails. ACL Anthology, pp. 431–445, December 2023. doi: 10.18653/v1/2023. emnlp-demo.40. Alexander Wei, Nika Haghtalab, and Jacob Steinhardt. Jailbroken: How Does LLM Safety Training Fail? arXiv, July 2023. doi: 10.48550/arXiv.2307.02483. Martin Weyssow, Xin Zhou, Kisub Kim, David Lo, and Houari Sahraoui. Exploring ParameterEfficient Fine-Tuning Techniques for Code Generation with Large Language Models. arXiv, August 2023. doi: 10.48550/arXiv.2308.10462. Yisong Xiao, Aishan Liu, Tianyuan Zhang, Haotong Qin, Jinyang Guo, and Xianglong Liu. RobustMQ: Benchmarking Robustness of Quantized Models. arXiv, August 2023. doi: 10.48550/ arXiv.2308.02350. Andy Zhou, Bo Li, and Haohan Wang. Robust Prompt Optimization for Defending Language Models Against Jailbreaking Attacks. arXiv, January 2024. doi: 10.48550/arXiv.2401.17263. Sicheng Zhu, Ruiyi Zhang, Bang An, Gang Wu, Joe Barrow, Zichao Wang, Furong Huang, Ani Nenkova, and Tong Sun. AutoDAN: Interpretable Gradient-Based Adversarial Attacks on Large Language Models. arXiv, October 2023. doi: 10.48550/arXiv.2310.15140. Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, and Matt Fredrikson. Universal and Transferable Adversarial Attacks on Aligned Language Models. arXiv, July 2023. doi: 10.48550/arXiv.2307.15043. This paper is available on arxiv under CC BY-NC-SA 4.0 DEED license. Authors: (1) Divyanshu Kumar, Enkrypt AI; (2) Anurakt Kumar, Enkrypt AI; (3) Sahil Agarwa, Enkrypt AI; (4) Prashanth Harshangi, Enkrypt AI. Authors: Authors: (1) Divyanshu Kumar, Enkrypt AI; (2) Anurakt Kumar, Enkrypt AI; (3) Sahil Agarwa, Enkrypt AI; (4) Prashanth Harshangi, Enkrypt AI. Table of Links Abstract and 1 Introduction Abstract and 1 Introduction 2 Problem Formulation and Experiments 2 Problem Formulation and Experiments 3 Experiment Set-up & Results 3 Experiment Set-up & Results 4 Conclusion and References 4 Conclusion and References A. Appendix A. Appendix 4 CONCLUSION Our work investigates the LLM’s safety against Jailbreak attempts. We have demonstrated how fine-tuned and quantized models are vulnerable to jailbreak attempts and stress the importance of using external guardrails to reduce this risk. Fine-tuning or quantizing model weights alters the risk profile of LLMs, potentially undermining the safety alignment established through RLHF. This could result from catastrophic forgetting, where LLMs lose memory of safety protocols, or the finetuning process shifting the model’s focus to new topics at the expense of existing safety measures. The lack of safety measures in these fine-tuned and quantized models is concerning, highlighting the need to incorporate safety protocols during the fine-tuning process. We propose using these tests as part of a CI/CD stress test before deploying the model. The effectiveness of guardrails in preventing jailbreaking highlights the importance of integrating them with safety practices in AI development. This approach not only enhances AI models but also establishes a new standard for responsible AI development. By ensuring that AI advancements prioritize innovation and safety, we promote ethical AI deployment, safeguarding against potential misuse and fostering a secure digital future. REFERENCES Zifan Wang Andy Zou. AdvBench Dataset, July 2023. URL https://github.com/ llm-attacks/llm-attacks/tree/main/data. Patrick Chao, Alexander Robey, Edgar Dobriban, Hamed Hassani, George J. Pappas, and Eric Wong. Jailbreaking Black Box Large Language Models in Twenty Queries. arXiv, October 2023. doi: 10.48550/arXiv.2310.08419. Edoardo Debenedetti, Giorgio Severi, Nicholas Carlini, Christopher A. Choquette-Choo, Matthew Jagielski, Milad Nasr, Eric Wallace, and Florian Tramer. Privacy Side Channels in Machine ` Learning Systems. arXiv, September 2023. doi: 10.48550/arXiv.2309.05610. Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. QLoRA: Efficient Finetuning of Quantized LLMs. arXiv, May 2023. doi: 10.48550/arXiv.2305.14314. Micah Gorsline, James Smith, and Cory Merkel. On the Adversarial Robustness of Quantized Neural Networks. arXiv, May 2021. doi: 10.1145/3453688.3461755. Kai Greshake, Sahar Abdelnabi, Shailesh Mishra, Christoph Endres, Thorsten Holz, and Mario Fritz. Not what you’ve signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection. arXiv, February 2023. doi: 10.48550/arXiv.2302.12173. Bing He, Mustaque Ahamad, and Srijan Kumar. PETGEN: Personalized Text Generation Attack on Deep Sequence Embedding-based Classification Models. In KDD ’21: Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, pp. 575–584. Association for Computing Machinery, New York, NY, USA, August 2021. ISBN 978-1-45038332-5. doi: 10.1145/3447548.3467390. Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-Rank Adaptation of Large Language Models. arXiv, June 2021. doi: 10.48550/arXiv.2106.09685. Shuhei Kashiwamura, Ayaka Sakata, and Masaaki Imaizumi. Effect of Weight Quantization on Learning Models by Typical Case Analysis. arXiv, January 2024. doi: 10.48550/arXiv.2401. 17269. Siwon Kim, Sangdoo Yun, Hwaran Lee, Martin Gubri, Sungroh Yoon, and Seong Joon Oh. ProPILE: Probing Privacy Leakage in Large Language Models. arXiv, July 2023. doi: 10.48550/arXiv. 2307.01881. Aounon Kumar, Chirag Agarwal, Suraj Srinivas, Aaron Jiaxun Li, Soheil Feizi, and Himabindu Lakkaraju. Certifying LLM Safety against Adversarial Prompting. arXiv, September 2023. doi: 10.48550/arXiv.2309.02705. Thai Le, Suhang Wang, and Dongwon Lee. MALCOM: Generating Malicious Comments to Attack Neural Fake News Detection Models. IEEE Computer Society, November 2020. ISBN 978-1- 7281-8316-9. doi: 10.1109/ICDM50108.2020.00037. Yi Liu, Gelei Deng, Zhengzi Xu, Yuekang Li, Yaowen Zheng, Ying Zhang, Lida Zhao, Tianwei Zhang, and Yang Liu. Jailbreaking ChatGPT via Prompt Engineering: An Empirical Study. arXiv, May 2023. doi: 10.48550/arXiv.2305.13860. Anay Mehrotra, Manolis Zampetakis, Paul Kassianik, Blaine Nelson, Hyrum Anderson, Yaron Singer, and Amin Karbasi. Tree of Attacks: Jailbreaking Black-Box LLMs Automatically. arXiv, December 2023. doi: 10.48550/arXiv.2312.02119. Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedback. arXiv, March 2022. doi: 10.48550/arXiv.2203.02155. Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal, and Peter Henderson. Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To! arXiv, October 2023. doi: 10.48550/arXiv.2310.03693. Traian Rebedea, Razvan Dinu, Makesh Narsimhan Sreedhar, Christopher Parisien, and Jonathan Cohen. NeMo Guardrails: A Toolkit for Controllable and Safe LLM Applications with Programmable Rails. ACL Anthology, pp. 431–445, December 2023. doi: 10.18653/v1/2023. emnlp-demo.40. Alexander Wei, Nika Haghtalab, and Jacob Steinhardt. Jailbroken: How Does LLM Safety Training Fail? arXiv, July 2023. doi: 10.48550/arXiv.2307.02483. Martin Weyssow, Xin Zhou, Kisub Kim, David Lo, and Houari Sahraoui. Exploring ParameterEfficient Fine-Tuning Techniques for Code Generation with Large Language Models. arXiv, August 2023. doi: 10.48550/arXiv.2308.10462. Yisong Xiao, Aishan Liu, Tianyuan Zhang, Haotong Qin, Jinyang Guo, and Xianglong Liu. RobustMQ: Benchmarking Robustness of Quantized Models. arXiv, August 2023. doi: 10.48550/ arXiv.2308.02350. Andy Zhou, Bo Li, and Haohan Wang. Robust Prompt Optimization for Defending Language Models Against Jailbreaking Attacks. arXiv, January 2024. doi: 10.48550/arXiv.2401.17263. Sicheng Zhu, Ruiyi Zhang, Bang An, Gang Wu, Joe Barrow, Zichao Wang, Furong Huang, Ani Nenkova, and Tong Sun. AutoDAN: Interpretable Gradient-Based Adversarial Attacks on Large Language Models. arXiv, October 2023. doi: 10.48550/arXiv.2310.15140. Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, and Matt Fredrikson. Universal and Transferable Adversarial Attacks on Aligned Language Models. arXiv, July 2023. doi: 10.48550/arXiv.2307.15043. This paper is available on arxiv under CC BY-NC-SA 4.0 DEED license. This paper is available on arxiv under CC BY-NC-SA 4.0 DEED license. available on arxiv

Part of HackerNoon's growing list of open-source research papers, promoting free access to academic material.

Increased LLM Vulnerabilities from Fine-tuning and Quantization: Conclusion and References

About Author

Comments

TOPICS

THIS ARTICLE WAS FEATURED IN

Related Stories

A Review of Top Open-Source Code LLMs and Quantization Techniques

What Are Large Language Models Capable Of: The Vulnerability of LLMs to Adversarial Attacks

What is Training Data Security and Why Does it Matter?

The Power of MEME: Adversarial Malware Creation with Model-Based Reinforcement Learning

Understanding the Threat Model: Black-Box Attacks on Malware Detection Systems

Increased LLM Vulnerabilities from Fine-tuning and Quantization: Abstract and Introduction

A Review of Top Open-Source Code LLMs and Quantization Techniques

What Are Large Language Models Capable Of: The Vulnerability of LLMs to Adversarial Attacks

What is Training Data Security and Why Does it Matter?

The Power of MEME: Adversarial Malware Creation with Model-Based Reinforcement Learning

Understanding the Threat Model: Black-Box Attacks on Malware Detection Systems

Increased LLM Vulnerabilities from Fine-tuning and Quantization: Abstract and Introduction

Light-Mode

Classic

Newspaper

Minty

Dark-Mode

Neon Noir

Minty

HN StartUps