
Increased LLM Vulnerabilities from Fine-tuning and Quantization: Problem Formulation and Experiments

by Quantization, October 17th, 2024

Too Long; Didn't Read

This study examines how fine-tuning and quantization of Large Language Models impact their vulnerability to attacks, emphasizing the need for safety measures.

Authors:

(1) Divyanshu Kumar, Enkrypt AI;

(2) Anurakt Kumar, Enkrypt AI;

(3) Sahil Agarwal, Enkrypt AI;

(4) Prashanth Harshangi, Enkrypt AI.

Abstract and 1 Introduction

2 Problem Formulation and Experiments

3 Experiment Set-up & Results

4 Conclusion and References

A. Appendix

2 PROBLEM FORMULATION AND EXPERIMENTS

We want to understand the roles played by fine-tuning, quantization, and guardrails in an LLM's vulnerability to jailbreaking attacks. To do so, we create a pipeline that tests for jailbreaking of LLMs that undergo these additional processes before deployment. As mentioned in the introduction, we attack the LLM via the TAP algorithm using a subset of AdvBench (Andy Zou, 2023), which contains 50 prompts asking for harmful information across 32 categories. The evaluation results, along with the complete system information, are then logged. This process is repeated for multiple iterations to account for the stochastic nature of LLMs. The complete experiment pipeline is shown in Figure 2.
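A minimal sketch of this evaluation loop is shown below. The helper objects (`target_llm`, `attacker`, `judge`) and the `tap_attack` wrapper (sketched after the TAP description that follows) are assumed interfaces for illustration, not the authors' actual code.

```python
import json

def run_pipeline(advbench_prompts, target_llm, attacker, judge,
                 num_runs=3, log_path="results.jsonl"):
    """Attack every AdvBench prompt several times and log each outcome."""
    with open(log_path, "a") as log:
        for goal in advbench_prompts:        # 50 harmful goals across 32 categories
            for run in range(num_runs):      # repeated runs: LLM outputs are stochastic
                outcome = tap_attack(goal, target_llm, attacker, judge)  # sketched below
                record = {
                    "goal": goal,
                    "run": run,
                    "jailbroken": outcome["jailbroken"],
                    "queries_used": outcome["queries"],
                    # system info: base model, fine-tune, quantization, guardrail setting
                    "system_info": target_llm.describe(),
                }
                log.write(json.dumps(record) + "\n")
```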


TAP (Mehrotra et al., 2023) is used as the jailbreaking method, as it is currently the state-of-the-art black-box, automatic method that generates semantically meaningful prompts to jailbreak LLMs. The TAP algorithm uses an attacker LLM A, which sends a prompt P to the target LLM T. The target's response R, together with the prompt P, is fed to the evaluator LLM JUDGE, which determines whether the prompt is on-topic or off-topic. If the prompt is off-topic, it is removed, thereby pruning the tree of unproductive attack prompts it would have spawned. If the prompt is on-topic, the JUDGE assigns it a score between 0 and 10, and the prompt is used to generate new attack prompts via breadth-first search. This process continues until the LLM is jailbroken or a specified number of iterations is exhausted.
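The loop described above can be summarized in the simplified Python sketch below. It uses a flat frontier rather than an explicit tree, assumes hypothetical `attacker`, `target_llm`, and `judge` interfaces, and treats a maximal judge score as a successful jailbreak; it is an illustration of the idea, not the reference TAP implementation.

```python
def tap_attack(goal, target_llm, attacker, judge, max_iters=10, branch=4, width=10):
    """Simplified TAP-style search: branch, prune off-topic prompts, score, keep the best."""
    frontier = [attacker.initial_prompt(goal)]
    queries = 0
    for _ in range(max_iters):
        candidates = []
        for prompt in frontier:
            # Attacker LLM A refines each surviving prompt into several children.
            candidates.extend(attacker.refine(goal, prompt, n=branch))
        scored = []
        for prompt in candidates:
            if not judge.on_topic(goal, prompt):
                continue                            # prune off-topic prompts and their subtree
            response = target_llm.generate(prompt)  # target LLM T answers prompt P
            queries += 1
            score = judge.rate(goal, prompt, response)  # JUDGE scores the attempt (0-10)
            if score == 10:                         # maximal score taken as a jailbreak here
                return {"jailbroken": True, "prompt": prompt, "queries": queries}
            scored.append((score, prompt))
        # Keep only the highest-scoring prompts for the next level of the search.
        scored.sort(key=lambda item: item[0], reverse=True)
        frontier = [p for _, p in scored[:width]]
        if not frontier:                            # every candidate was pruned
            break
    return {"jailbroken": False, "prompt": None, "queries": queries}
```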


Figure 2: Jailbreak process followed to generate the reports


As a guardrail against jailbreaking prompts, we use our in-house DeBERTa-V3 model, which has been trained to detect jailbreaking prompts. It acts as an input filter to ensure that only sanitized prompts reach the LLM. If the input prompt is filtered out by the guardrail or fails to jailbreak the LLM, the TAP algorithm generates a new prompt, taking into account the previous prompt and response. The new attack prompt is then passed through the guardrail again. This process is repeated until a jailbreaking prompt is found or a pre-specified number of iterations is exhausted.
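The sketch below shows how such an input guardrail slots into the attack loop. As before, the interfaces are assumed for illustration; in particular, `guardrail.is_jailbreak_attempt` is a hypothetical wrapper around the DeBERTa-V3 classifier, not the authors' actual API.

```python
def attack_with_guardrail(goal, target_llm, attacker, judge, guardrail, max_iters=10):
    """Single-chain variant: the guardrail screens every prompt before the target sees it."""
    prompt = attacker.initial_prompt(goal)
    for _ in range(max_iters):
        if guardrail.is_jailbreak_attempt(prompt):
            response = "[blocked by guardrail]"     # the target LLM never receives the prompt
        else:
            response = target_llm.generate(prompt)
            if judge.rate(goal, prompt, response) == 10:
                return {"jailbroken": True, "prompt": prompt}
        # TAP refines the attack using the previous prompt and its (possibly blocked) response.
        prompt = attacker.refine(goal, prompt, response=response, n=1)[0]
    return {"jailbroken": False, "prompt": None}
```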


This paper is available on arXiv under a CC BY-NC-SA 4.0 DEED license.