Table of Links

Abstract and 1. Introduction

A. Character Generation Detail

B. Ethics and Broader Impact

C. Effect of Text Moderator on Text-based Jailbreak Attack

D. Examples

E. Evaluation Detail

4.4 Defense Analysis

We evaluate the robustness of VRP against two defense approaches, namely System Prompt-based Defense, and the Eye Closed Safety On (ECSO) approach [15]

System Prompt-based Defense: To defend against the jailbreak attack, a system prompt can instruct the model to conduct a preliminary safety assessment of the text and image input, thereby filtering out queries that violate AI safety policies. We add the following Prompt 2 to the existing system prompt of MLLMs.

ECSO[15]: A defense method utilizing MLLMs’ aligned textual module to mitigate the vulnerability in visual modality. ECSO use the MLLM itself to evaluate the safety of its response and makes MLLMs to regenerate unsafe responses in two steps: image captioning, and then responding based on caption with no image input.

VRP is effective against System Prompt-based Defense and ECSO. We evaluate our query-specific VRP and baselines against our System Prompt-based Defense and ECSO. As shown in Tab. 4, the results demonstrate that our query-specific VRP consistently achieves the ASR across all models, regardless of whether it is tested against System Prompt-based Defense or ECSO. This consistent performance underlines the efficacy of query-specific VRP in penetrating defenses and reveals a notable vulnerability of defense mechanisms under our VRP jailbreak attacks. These findings highlight the potential of VRP as a formidable strategy against defense mechanisms.

Table 4: Attack Success Rate of query-specific VRP against the defense on the test set of RedTeam-2K. Our Query-specific attack is effective under the defense of System Prompt-based Defense and ECSO among all models.

Authors:

(1) Siyuan Ma, University of Wisconsin–Madison (siyuan.ma.jasper@outlook.com);

(2) Weidi Luo, The Ohio State University (luo.1455@osu.edu);

(3) Yu Wang, Peking University (rain_wang@stu.pku.edu.cn);

(4) Xiaogeng Liu, University of Wisconsin-Madison (xiaogeng.liu@wisc.edu).

This paper is available on arxiv under CC BY 4.0 DEED license.

Defense Analysis

About Author

Topics

Around The Web

Light-Mode

Classic

Newspaper

Minty

Dark-Mode

Neon Noir

Minty

HN StartUps