Table of Links
-
Methodology and 3.1 Preliminary
-
A. Character Generation Detail
C. Effect of Text Moderator on Text-based Jailbreak Attack
3.3 Universal Visual Role-play
To verify the generalizability of VRP, we further extend this method to “universal” scenarios. In fact, the universal concept has been widely explored in jailbreak attacks, such as AutoDAN [34] and GCG [80], which refer to an attack strategy that employs minimal and straightforward manipulation of queries during execution. For jailbreak attacks, the universal properties are very important because they enable an attack to be applicable across a broad range of scenarios without requiring extensive modifications or customization. In the context of MLLMs, “universal” jailbreak attacks typically involve the simple aggregation of queries into a predefined format or directly printing the queries as typography onto images. Unfortunately, existing structure-based jailbreak attacks on LLMs have overlooked this problem, as they require computation for each query, especially when dealing with large datasets, making them hard to use.
To address this issue, we introduce the concept of “universal visual role-play.” The core principle of universal visual role-play is to leverage the optimization capabilities of LLMs [67] to generate candidate characters universally, followed by the selection of the best universal character. Many role-play attacks [34] can be performed in a universal setting. To obtain a universal visual role-play template, we generate multiple rounds of candidate roles, each round optimized based on previous rounds, harnessing LLMs’ optimization ability [67]. We split the entire malicious query dataset into train, validation, and test sets.
With the universal character obtained through the aforementioned process, we use it to perform universal VRP on the test set.
Authors:
(1) Siyuan Ma, University of Wisconsin–Madison ([email protected]);
(2) Weidi Luo, The Ohio State University ([email protected]);
(3) Yu Wang, Peking University ([email protected]);
(4) Xiaogeng Liu, University of Wisconsin-Madison ([email protected]).
This paper is