Increased LLM Vulnerabilities from Fine-tuning and Quantization: Abstract and Introduction

by QuantizationOctober 17th, 2024

Too Long; Didn't Read

This study examines how fine-tuning and quantization of Large Language Models impact their vulnerability to attacks, emphasizing the need for safety measures.

featured image - Increased LLM Vulnerabilities from Fine-tuning and Quantization: Abstract and Introduction

Authors:

(1) Divyanshu Kumar, Enkrypt AI;

(2) Anurakt Kumar, Enkrypt AI;

(3) Sahil Agarwa, Enkrypt AI;

(4) Prashanth Harshangi, Enkrypt AI.

Table of Links

Abstract and 1 Introduction

2 Problem Formulation and Experiments

3 Experiment Set-up & Results

4 Conclusion and References

A. Appendix

ABSTRACT

Large Language Models (LLMs) have become very popular and have found use cases in many domains, such as chatbots, auto-task completion agents, and much more. However, LLMs are vulnerable to different types of attacks, such as jailbreaking, prompt injection attacks, and privacy leakage attacks. Foundational LLMs undergo adversarial and alignment training to learn not to generate malicious and toxic content. For specialized use cases, these foundational LLMs are subjected to fine-tuning or quantization for better performance and efficiency. We examine the impact of downstream tasks such as fine-tuning and quantization on LLM vulnerability. We test foundation models like Mistral, Llama, MosaicML, and their fine-tuned versions. Our research shows that fine-tuning and quantization reduces jailbreak resistance significantly, leading to increased LLM vulnerabilities. Finally, we demonstrate the utility of external guardrails in reducing LLM vulnerabilities.

1 INTRODUCTION

Generative models are becoming more and more important as they are becoming capable of automating a lot of tasks, taking autonomous actions and decisions, and at the same time, becoming better at content generation and summarization tasks. As LLMs become more powerful, these capabilities are at risk of being misused by an adversary, which can lead to fake content generation, toxic, malicious, or hateful content generation, privacy leakages, copyrighted content generation, and much more Chao et al. (2023) Mehrotra et al. (2023) Zou et al. (2023) Greshake et al. (2023) Liu et al. (2023) Zhu et al. (2023) He et al. (2021) Le et al. (2020). To prevent LLMs from generating content that contradicts human values and to prevent their malicious misuse, they undergo a supervised fine-tuning phase after their pre-training, and they are further evaluated by humans and trained using reinforcement learning from human feedback (RLHF) Ouyang et al. (2022), to make them more aligned with human values. Further, special filters called guardrails are put in place as filters to prevent LLMs from getting toxic prompts as inputs and outputting toxic or copyrighted responses Rebedea et al. (2023) Kumar et al. (2023) Wei et al. (2023) Zhou et al. (2024). The complexity of human language makes it difficult for LLMs to completely understand what instructions are right and which are wrong in terms of human values. After going through the alignment training and after the implementation of guardrails, it becomes unlikely that the LLM will generate a toxic response. But these safety measures can easily be circumvented using adversarial attacks, and the LLM can be jailbroken to generate any content that the adversary wants, as shown in recent works Chao et al. (2023) Mehrotra et al. (2023) Zhu et al. (2023).

Recent works such as the Prompt Automatic Iterative Refinement (PAIR) attacks Chao et al. (2023) and Tree-of-attacks pruning (TAP) Mehrotra et al. (2023) have shown the vulnerability of LLMs and how easy it is to jailbreak them into generating content for harmful tasks specified by the user. Similarly, a class of methods called privacy leakage attacks Debenedetti et al. (2023) are used to attack LLMs to extract their training data or personally identifiable information Kim et al. (2023), and prompt injection attacks can be used to make an LLM application perform tasks that are not requested by the user but are hidden in the third-party instruction which the LLM automatically executes. Figure 1 shows how an instruction can be hidden inside a summarization text and how the LLM will ignore the previous instruction to execute the malicious instruction. Qi et al. (2023) showed that it only takes a few examples to fine-tune an LLM into generating toxic responses by forgetting its safety training. Our work in this paper extends that notion and shows that both finetuning the LLM on any task (not necessarily toxic content generation) and quantization can affect its safety training. In this study, we use a subset of adversarial harmful prompts called AdvBench SubsetAndy Zou (2023). It contains 50 prompts asking for harmful information across 32 categories. It is a subset of prompts from the harmful behaviours dataset in the AdvBench benchmark selected to cover a diverse range of harmful prompts. The attacking algorithm used is tree-of-attacks pruning Mehrotra et al. (2023) as it has shown to have the best performance in jailbreaking and, more importantly, this algorithm fulfils three important goals (1) Black-box: the algorithm only needs black-box access to the model (2) Automatic: it does not need human intervention once started, and (3) Interpretable: the algorithm generates semantically meaningful prompts. The TAP algorithm is used with the tasks from the AdvBench subset to attack the target LLMs in different settings, and their response is used to evaluate whether or not they have been jailbroken.

The rest of the paper is organized in the following manner. Section 2 talks about the experimental setup for jailbreaking in which the models are tested. It specifically describes the different modes of downstream process that an LLM has undergone e.g fine-tuning, quantization and tested for these modes. Section 3 describes the experiment set-up used and defines guardrails Rebedea et al. (2023), fine-tuning and quantization Kashiwamura et al. (2024) Gorsline et al. (2021) Xiao et al. (2023) Hu et al. (2021) Dettmers et al. (2023) settings used in the experimental context. We demonstrate the results in detail and show how model vulnerability is affected by downstream tasks for LLMs. Finally, section 4 concludes the study and talks about methods to reduce model vulnerability and ensure safe and reliable LLM development.