This story draft by @escholar has not been reviewed by an editor, YET.

Related Works

EScholar: Electronic Academic Papers for Scholars HackerNoon profile picture
0-item

Table of Links

Abstract and 1. Introduction

  1. Related Works

  2. Methodology and 3.1 Preliminary

    3.2 Query-specific Visual Role-play

    3.3 Universal Visual Role-play

  3. Experiments and 4.1 Experimental setups

    4.2 Main Results

    4.3 Ablation Study

    4.4 Defense Analysis

    4.5 Integrating VRP with Baseline Techniques

  4. Conclusion

  5. Limitation

  6. Future work and References


A. Character Generation Detail

B. Ethics and Broader Impact

C. Effect of Text Moderator on Text-based Jailbreak Attack

D. Examples

E. Evaluation Detail

2 Related Works

Role Playing. Role-playing represents an innovative strategy used in LLMs. In LLMs, such an application is widely investigated by recent works that explore the potential of role-playing [37; 54; 77; 63; 64; 52; 5; 60]. Most of these works use role-playing strategies to make LLMs more interactive [64], personalized [54; 63; 60], and context-faithful [77]. However, role-playing in jailbreak attacks also poses a threat to the AI community [34; 57; 22], which investigate the jailbreak potential of role-playing on jailbreak LLMs via instructing LLMs by adding role-playing information as a template prefix of the prompt. Unfortunately, current studies on MLLM jailbreak attacks didn’t pay attention to studying role-playing. In order to fill the gap, we are the first work that gets insight from these role-playing methods on jailbreak LLMs and develops a visual role-playing method for jailbreak MLLMs.


Jailbreak attacks against MLLMs. MLLMs have been widely used in real-world scenarios, and the current MLLM jailbreak attack methods can be broadly classified into three main categories: perturbation-based, image-based jailbreak attack, and text-based jailbreak attack. Perturbation-based jailbreak attacks [55; 45; 48; 11] jailbreak MLLMs by optimizing image and text perturbations. Structure-based jailbreak attacks include Figstep [14] that converts harmful queries into visual representation via rephrasing harmful questions into step-by-step typography, and Query relevant [36] that jailbreaks MLLMs by using a text-to-image tool to visualize the keyword in harmful queries that are relevant to the query. Meanwhile, text-based jailbreak attacks [38] investigate the robustness of MLLMs against text-based attacks [80; 72; 57; 66] initially designed for attacking LLMs, revealing the transferability and effectiveness of LLM jailbreak attacks.


Our Visual-RolePlay jailbreak method is a structure-based jailbreak attack method on MLLMs that not only explores the potential of role-play through the visual modality on jailbreak MLLMs but also combines with the visual representation of key information in harmful queries. Our method shows better performance compared with other structure-based jailbreak attacks.


Authors:

(1) Siyuan Ma, University of Wisconsin–Madison ([email protected]);

(2) Weidi Luo, The Ohio State University ([email protected]);

(3) Yu Wang, Peking University ([email protected]);

(4) Xiaogeng Liu, University of Wisconsin-Madison ([email protected]).


This paper is available on arxiv under CC BY 4.0 DEED license.


L O A D I N G
. . . comments & more!

About Author

EScholar: Electronic Academic Papers for Scholars HackerNoon profile picture
EScholar: Electronic Academic Papers for Scholars@escholar
We publish the best academic work (that's too often lost to peer reviews & the TA's desk) to the global tech community

Topics

Around The Web...

Trending Topics

blockchaincryptocurrencyhackernoon-top-storyprogrammingsoftware-developmenttechnologystartuphackernoon-booksBitcoinbooks