Table of Links

Abstract and 1. Introduction

A. Character Generation Detail

B. Ethics and Broader Impact

C. Effect of Text Moderator on Text-based Jailbreak Attack

D. Examples

E. Evaluation Detail

2 Related Works

Role Playing. Role-playing represents an innovative strategy used in LLMs. In LLMs, such an application is widely investigated by recent works that explore the potential of role-playing [37; 54; 77; 63; 64; 52; 5; 60]. Most of these works use role-playing strategies to make LLMs more interactive [64], personalized [54; 63; 60], and context-faithful [77]. However, role-playing in jailbreak attacks also poses a threat to the AI community [34; 57; 22], which investigate the jailbreak potential of role-playing on jailbreak LLMs via instructing LLMs by adding role-playing information as a template prefix of the prompt. Unfortunately, current studies on MLLM jailbreak attacks didn’t pay attention to studying role-playing. In order to fill the gap, we are the first work that gets insight from these role-playing methods on jailbreak LLMs and develops a visual role-playing method for jailbreak MLLMs.

Jailbreak attacks against MLLMs. MLLMs have been widely used in real-world scenarios, and the current MLLM jailbreak attack methods can be broadly classified into three main categories: perturbation-based, image-based jailbreak attack, and text-based jailbreak attack. Perturbation-based jailbreak attacks [55; 45; 48; 11] jailbreak MLLMs by optimizing image and text perturbations. Structure-based jailbreak attacks include Figstep [14] that converts harmful queries into visual representation via rephrasing harmful questions into step-by-step typography, and Query relevant [36] that jailbreaks MLLMs by using a text-to-image tool to visualize the keyword in harmful queries that are relevant to the query. Meanwhile, text-based jailbreak attacks [38] investigate the robustness of MLLMs against text-based attacks [80; 72; 57; 66] initially designed for attacking LLMs, revealing the transferability and effectiveness of LLM jailbreak attacks.

Our Visual-RolePlay jailbreak method is a structure-based jailbreak attack method on MLLMs that not only explores the potential of role-play through the visual modality on jailbreak MLLMs but also combines with the visual representation of key information in harmful queries. Our method shows better performance compared with other structure-based jailbreak attacks.

Authors:

(1) Siyuan Ma, University of Wisconsin–Madison (siyuan.ma.jasper@outlook.com);

(2) Weidi Luo, The Ohio State University (luo.1455@osu.edu);

(3) Yu Wang, Peking University (rain_wang@stu.pku.edu.cn);

(4) Xiaogeng Liu, University of Wisconsin-Madison (xiaogeng.liu@wisc.edu).

This paper is available on arxiv under CC BY 4.0 DEED license.

Related Works

About Author

Topics

Around The Web

Light-Mode

Classic

Newspaper

Minty

Dark-Mode

Neon Noir

Minty

HN StartUps