
Main Results


Table of Links

Abstract and 1. Introduction

  2. Related Works

  3. Methodology and 3.1 Preliminary

    3.2 Query-specific Visual Role-play

    3.3 Universal Visual Role-play

  4. Experiments and 4.1 Experimental setups

    4.2 Main Results

    4.3 Ablation Study

    4.4 Defense Analysis

    4.5 Integrating VRP with Baseline Techniques

  5. Conclusion

  6. Limitation

  7. Future work and References


A. Character Generation Detail

B. Ethics and Broader Impact

C. Effect of Text Moderator on Text-based Jailbreak Attack

D. Examples

E. Evaluation Detail

4.2 Main Results

VRP is more effective than baseline attacks. In Tab. 1, we present the results of our query-specific VRP attack on the test sets of RedTeam-2K and HarmBench. This approach generates a specific character for each harmful question, and we assess how effectively it compromises SotA open-source and closed-source MLLMs, such as Gemini-Pro-Vision. Our findings reveal that query-specific VRP not only successfully breaches these MLLMs but also achieves a higher ASR than all evaluated baseline attacks. Specifically, it improves the ASR by 9.8% over FigStep and by 14.3% over Query relevant. In most cases, query-specific VRP also surpasses TRP, underscoring the crucial role of character images in jailbreaking MLLMs effectively. These results affirm that VRP is a potent method for jailbreaking MLLMs.


Table 1: Attack Success Rate (ASR) of query-specific VRP compared with baseline attacks on MLLMs, on the test sets of RedTeam-2K and HarmBench. Our VRP achieves the highest ASR on all datasets compared with other jailbreak attacks.
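As a reading aid for the comparison in Tab. 1, here is a minimal sketch of how an attack success rate and the gap between two attacks could be computed. It assumes each response has already been labeled as a successful jailbreak or not by some judge. The function name `attack_success_rate` and the sample labels are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def attack_success_rate(judgements: Sequence[bool]) -> float:
    """Return the percentage of queries judged as successful jailbreaks.

    `judgements` holds one boolean per harmful query (assumed to come
    from an external judge; the judge itself is not modeled here).
    """
    if not judgements:
        raise ValueError("need at least one judged response")
    return 100.0 * sum(judgements) / len(judgements)


# Hypothetical judge labels for two attacks on the same four queries.
vrp_labels = [True, True, False, True]
baseline_labels = [True, False, False, True]

vrp_asr = attack_success_rate(vrp_labels)
baseline_asr = attack_success_rate(baseline_labels)

# Gap between the two attacks, in percentage points.
asr_gap = vrp_asr - baseline_asr
```

The gap here is expressed in percentage points, which is one plausible reading of the improvements reported above; the paper's evaluation details are in Appendix E.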


VRP achieves high-performance transferability across models. We further investigate the applicability of a universal attack across diverse models. Using our universal VRP algorithm, we identify the most effective role-play character on the train and valid sets for the target model. We then transfer this character to conduct jailbreak attacks on the transfer models. As shown in Tab. 2, the ASR averages 32.7% when the target model is LLaVA-V1.6-Mixtral and 29.4% on Qwen-VL-Chat. The ASR is high on both the target model and the transfer models, demonstrating that our VRP, when implemented in a universal setting, transfers effectively and maintains high performance across different MLLMs.


Table 2: Attack Success Rate of universal VRP on target models and transfer models on the test set of RedTeam-2K. We use the train and valid sets of RedTeam-2K on the target models to find the best character, then use that character to attack the transfer models on the test set of RedTeam-2K. The results show that our VRP in a universal setting transfers with high performance among different black-box models.
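The select-then-transfer protocol described above can be sketched roughly as follows. This is a simplified illustration, not the paper's universal VRP algorithm from Section 3.3: it assumes a fixed pool of candidate characters and a hypothetical `attack(model, character, query)` callable that returns `True` when the response is judged a successful jailbreak. All function and parameter names are assumptions made for this example.

```python
from typing import Callable, Dict, Sequence

# Hypothetical signature: attack(model, character, query) -> judged success.
AttackFn = Callable[[str, str, str], bool]


def asr(attack: AttackFn, model: str, character: str,
        queries: Sequence[str]) -> float:
    """Attack success rate (in percent) of one character on one model."""
    if not queries:
        raise ValueError("need at least one query")
    hits = sum(attack(model, character, q) for q in queries)
    return 100.0 * hits / len(queries)


def select_universal_character(attack: AttackFn, target_model: str,
                               candidates: Sequence[str],
                               train_valid_queries: Sequence[str]) -> str:
    """Pick the candidate with the highest ASR on the target model,
    measured on the train and valid queries only."""
    return max(
        candidates,
        key=lambda c: asr(attack, target_model, c, train_valid_queries),
    )


def transfer_asr(attack: AttackFn, character: str,
                 transfer_models: Sequence[str],
                 test_queries: Sequence[str]) -> Dict[str, float]:
    """Evaluate the selected character on each transfer model,
    using the held-out test queries."""
    return {m: asr(attack, m, character, test_queries)
            for m in transfer_models}
```

Keeping selection on the train and valid sets and evaluation on the test set mirrors the split described in the caption of Tab. 2, so the reported transfer ASR is not measured on queries used to choose the character.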


Authors:

(1) Siyuan Ma, University of Wisconsin–Madison;

(2) Weidi Luo, The Ohio State University;

(3) Yu Wang, Peking University;

(4) Xiaogeng Liu, University of Wisconsin–Madison.


This paper is available on arXiv under the CC BY 4.0 DEED license.

