Table of Links
- Methodology and 3.1 Preliminary
- A. Character Generation Detail
- C. Effect of Text Moderator on Text-based Jailbreak Attack
C. Effect of Text Moderator on Text-based Jailbreak Attack
Moderator models such as Llamaguard [19] classify textual inputs as “safe” or “unsafe”. If the textual input of a jailbreak attack is flagged as “unsafe”, the attack can be blocked outright and therefore fails. To demonstrate the effect of a text moderator on text-based jailbreak attacks, we use Llamaguard to classify the text inputs of JailBreakV-28K [38] and report the ASR after applying the moderator in Table 6. The ASR of all models drops drastically, to below 7%.
We also use Llamaguard [19] to check the textual input of VRP, which is a fixed, harmless instruction. Llamaguard classifies it as “safe”.
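For concreteness, the sketch below shows one way such moderator-based filtering could be wired into an evaluation loop. It assumes the Hugging Face checkpoint `meta-llama/LlamaGuard-7b` and its documented chat-template interface; the `is_safe` helper and the placeholder prompt strings are illustrative, not part of the paper's evaluation code.

```python
# Minimal sketch of text-moderator filtering, assuming the Hugging Face
# checkpoint "meta-llama/LlamaGuard-7b". Prompt strings are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/LlamaGuard-7b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="cuda"
)

def is_safe(prompt: str) -> bool:
    """Return True if the moderator's verdict for the user prompt is "safe"."""
    chat = [{"role": "user", "content": prompt}]
    input_ids = tokenizer.apply_chat_template(chat, return_tensors="pt").to(model.device)
    output = model.generate(input_ids=input_ids, max_new_tokens=32, pad_token_id=0)
    # Llama Guard emits "safe", or "unsafe" followed by violated category codes.
    verdict = tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True)
    return verdict.strip().startswith("safe")

# Prompts flagged "unsafe" are blocked before reaching the target model,
# so only prompts that pass the moderator can still contribute to ASR.
prompts = ["<text input 1>", "<text input 2>"]  # hypothetical benchmark inputs
passed = [p for p in prompts if is_safe(p)]
```

Under this setup, a blocked prompt counts as a failed attack, which is why attacks with overtly harmful text inputs lose most of their ASR while VRP's benign instruction passes unimpeded.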
Authors:
(1) Siyuan Ma, University of Wisconsin–Madison ([email protected]);
(2) Weidi Luo, The Ohio State University ([email protected]);
(3) Yu Wang, Peking University ([email protected]);
(4) Xiaogeng Liu, University of Wisconsin–Madison ([email protected]).