Table of Links
-
Methodology and 3.1 Preliminary
-
A. Character Generation Detail
C. Effect of Text Moderator on Text-based Jailbreak Attack
3.2 Query-specific Visual Role-play
To improve the limited jailbreak attack performance of existing structure-based jailbreak methods [14; 79], we introduce a novel MLLM jailbreak method named VRP, which misleads MLLMs to bypass safety alignments by instructing the model to act as a high-risk character in the image input (see Fig. 1). We first introduce the pipeline of VRP under the query-specific setting, where VRP generates a role-play character targeting a specific query. The details are as follows.
Authors:
(1) Siyuan Ma, University of Wisconsin–Madison ([email protected]);
(2) Weidi Luo, The Ohio State University ([email protected]);
(3) Yu Wang, Peking University ([email protected]);
(4) Xiaogeng Liu, University of Wisconsin-Madison ([email protected]).
This paper is