Table of Links
-
Methodology and 3.1 Preliminary
-
A. Character Generation Detail
C. Effect of Text Moderator on Text-based Jailbreak Attack
3 Methodology
In this section, we first define the jailbreak attack tasks in MLLMs in Sec. 3.1. Then, we introduce the pipeline of VRP in a query-specific setting in Sec. 3.2. In Sec. 3.3, we further extend the VRP into a universal setting and obtain a universal role-play character.
3.1 Preliminary
Adversarial Capabilities. This paper considers a black-box attack that operates without any knowledge of the MLLMs, such as their parameters and hidden states, or any manipulation such as fine-tuning. The adversary only needs the ability to query the model and receive its textual responses. The interaction is limited to a single turn with no prior dialogue history, except for any predetermined system prompts. The attacker lacks access to or control over the internal states of the generation process and cannot adjust the model’s parameters.
Authors:
(1) Siyuan Ma, University of Wisconsin–Madison ([email protected]);
(2) Weidi Luo, The Ohio State University ([email protected]);
(3) Yu Wang, Peking University ([email protected]);
(4) Xiaogeng Liu, University of Wisconsin-Madison ([email protected]).
This paper is