Table of Links

Abstract and 1. Introduction

A. Character Generation Detail

B. Ethics and Broader Impact

C. Effect of Text Moderator on Text-based Jailbreak Attack

D. Examples

E. Evaluation Detail

4.2 Main Results

VRP is more effective than baseline attacks. In Tab. 1, we present the outcomes of our query-specific VRP attack on the test sets of RedTeam-2K and HarmBench. This approach generates specific characters for each harmful question to assess their effectiveness in compromising SotA open-source and closed-source MLLMs, such as Gemini-Pro-Vision. Our findings reveal that query-specific VRP not only successfully breaches these MLLMs but also achieves a higher ASR than all evaluated baseline attacks. Specifically, it improves the ASR by 9.8% over FigStep and by 14.3% over Query relevant. In most cases, query-specific VRP also surpasses TRP, underscoring the crucial role of character images in jailbreaking MLLMs. These results affirm that VRP is a potent method for jailbreaking MLLMs.

Table 1: Attack Success Rate (ASR) of query-specific VRP compared with baseline attacks on MLLMs, evaluated on the test sets of RedTeam-2K and HarmBench. Our VRP achieves the highest ASR on all datasets compared with other jailbreak attacks.
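To make the metric concrete, here is a minimal sketch of how an ASR and the gap between two attacks could be computed from per-query judgements. It assumes that ASR is the percentage of queries whose response was judged a successful attack, and that the reported improvements are percentage-point differences; neither detail is spelled out in this section. The function names and the toy data are hypothetical.

```python
from typing import Sequence


def attack_success_rate(judgements: Sequence[bool]) -> float:
    """Percentage of queries judged as successful attacks (assumed ASR definition)."""
    if not judgements:
        raise ValueError("need at least one judgement")
    return 100.0 * sum(judgements) / len(judgements)


def asr_gain(method: Sequence[bool], baseline: Sequence[bool]) -> float:
    """Percentage-point difference between the ASR of a method and a baseline."""
    return attack_success_rate(method) - attack_success_rate(baseline)


# Toy judgements only; these are not results from the paper.
method_outcomes = [True, True, False, True]
baseline_outcomes = [True, False, False, False]

method_asr = attack_success_rate(method_outcomes)
gain = asr_gain(method_outcomes, baseline_outcomes)
```

If the reported gains were relative rather than absolute, `asr_gain` would need to divide by the baseline ASR instead of subtracting it.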

VRP achieves high-performance transferability across models. We further investigate the applicability of a universal attack across diverse models. Using our universal VRP algorithm, we identify the most effective role-play character on the target model within the train and valid sets. We then transfer this character to conduct jailbreak attacks on the transfer models. As shown in Tab. 2, the ASR averages 32.7% when the target model is LLaVA-V1.6-Mixtral and 29.4% when it is Qwen-VL-Chat. The ASR is high on the target model and remains high on the transfer models, demonstrating that VRP, when implemented in a universal setting, transfers effectively and maintains high performance across different MLLMs.
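The select-then-transfer procedure described above can be sketched as a simple argmax over per-character success rates. This is only an illustration under stated assumptions, not the authors' universal VRP algorithm: the candidate names, model names, and outcome lists are hypothetical, and each outcome is assumed to be a boolean judgement of whether one query succeeded.

```python
from typing import Dict, List


def asr(outcomes: List[bool]) -> float:
    """Percentage of queries judged successful (assumed ASR definition)."""
    return 100.0 * sum(outcomes) / len(outcomes)


def select_universal_character(results: Dict[str, List[bool]]) -> str:
    """Return the candidate with the highest ASR on the train/valid outcomes."""
    return max(results, key=lambda name: asr(results[name]))


# Hypothetical judgements on the target model's train and valid sets.
target_results = {
    "character_a": [True, False, False, True],
    "character_b": [True, True, False, True],
    "character_c": [False, False, False, True],
}
best = select_universal_character(target_results)

# Hypothetical judgements for the chosen character on two transfer models.
transfer_results = {
    "model_x": [True, False, True, False],
    "model_y": [False, False, True, False],
}
transfer_asr = {model: asr(outcomes) for model, outcomes in transfer_results.items()}
```

Selecting on the train and valid sets and reporting only on held-out test queries keeps the transfer numbers from being inflated by the selection step.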

Authors:

(1) Siyuan Ma, University of Wisconsin–Madison (siyuan.ma.jasper@outlook.com);

(2) Weidi Luo, The Ohio State University (luo.1455@osu.edu);

(3) Yu Wang, Peking University (rain_wang@stu.pku.edu.cn);

(4) Xiaogeng Liu, University of Wisconsin–Madison (xiaogeng.liu@wisc.edu).

This paper is available on arXiv under the CC BY 4.0 DEED license.
