Table of Links
-
Methodology and 3.1 Preliminary
-
A. Character Generation Detail
C. Effect of Text Moderator on Text-based Jailbreak Attack
4.4 Defense Analysis
We evaluate the robustness of VRP against two defense approaches, namely System Prompt-based Defense, and the Eye Closed Safety On (ECSO) approach [15]
System Prompt-based Defense: To defend against the jailbreak attack, a system prompt can instruct the model to conduct a preliminary safety assessment of the text and image input, thereby filtering out queries that violate AI safety policies. We add the following Prompt 2 to the existing system prompt of MLLMs.
ECSO[15]: A defense method utilizing MLLMs’ aligned textual module to mitigate the vulnerability in visual modality. ECSO use the MLLM itself to evaluate the safety of its response and makes MLLMs to regenerate unsafe responses in two steps: image captioning, and then responding based on caption with no image input.
VRP is effective against System Prompt-based Defense and ECSO. We evaluate our query-specific VRP and baselines against our System Prompt-based Defense and ECSO. As shown in Tab. 4, the results demonstrate that our query-specific VRP consistently achieves the ASR across all models, regardless of whether it is tested against System Prompt-based Defense or ECSO. This consistent performance underlines the efficacy of query-specific VRP in penetrating defenses and reveals a notable vulnerability of defense mechanisms under our VRP jailbreak attacks. These findings highlight the potential of VRP as a formidable strategy against defense mechanisms.
Authors:
(1) Siyuan Ma, University of Wisconsin–Madison ([email protected]);
(2) Weidi Luo, The Ohio State University ([email protected]);
(3) Yu Wang, Peking University ([email protected]);
(4) Xiaogeng Liu, University of Wisconsin-Madison ([email protected]).
This paper is