2 Related Works
Role Playing. Role-playing is an innovative strategy for LLMs whose potential has been widely investigated in recent work [37; 54; 77; 63; 64; 52; 5; 60]. Most of these works use role-playing to make LLMs more interactive [64], personalized [54; 63; 60], and context-faithful [77]. However, role-playing also poses a threat to the AI community when used in jailbreak attacks: prior studies [34; 57; 22] investigate its jailbreak potential on LLMs by prepending role-playing information to the prompt as a template prefix. Current studies on MLLM jailbreak attacks, by contrast, have not examined role-playing. To fill this gap, we are the first to draw on these role-playing jailbreak methods for LLMs and develop a visual role-playing method for jailbreaking MLLMs.
Jailbreak attacks against MLLMs. MLLMs are widely used in real-world scenarios, and current MLLM jailbreak attack methods can be broadly classified into three categories: perturbation-based, structure-based, and text-based jailbreak attacks. Perturbation-based jailbreak attacks [55; 45; 48; 11] jailbreak MLLMs by optimizing image and text perturbations. Structure-based jailbreak attacks include FigStep [14], which converts harmful queries into a visual representation by rephrasing them as step-by-step typography, and Query relevant [36], which uses a text-to-image tool to visualize the keywords of a harmful query. Meanwhile, text-based jailbreak attacks [38] investigate the robustness of MLLMs against text-based attacks [80; 72; 57; 66] originally designed for LLMs, revealing that LLM jailbreak attacks transfer to MLLMs and remain effective.
Our Visual-RolePlay jailbreak method is a structure-based jailbreak attack on MLLMs that not only explores the potential of role-playing through the visual modality for jailbreaking MLLMs but also incorporates a visual representation of the key information in harmful queries. Our method outperforms other structure-based jailbreak attacks.
Authors:
(1) Siyuan Ma, University of Wisconsin–Madison;
(2) Weidi Luo, The Ohio State University;
(3) Yu Wang, Peking University;
(4) Xiaogeng Liu, University of Wisconsin–Madison.
This paper is