
Effect of Text Moderator on Text-based Jailbreak Attack


Table of Links

Abstract and 1. Introduction

  2. Related Works

  3. Methodology and 3.1 Preliminary

    3.2 Query-specific Visual Role-play

    3.3 Universal Visual Role-play

  4. Experiments and 4.1 Experimental setups

    4.2 Main Results

    4.3 Ablation Study

    4.4 Defense Analysis

    4.5 Integrating VRP with Baseline Techniques

  5. Conclusion

  6. Limitation

  7. Future work and References


A. Character Generation Detail

B. Ethics and Broader Impact

C. Effect of Text Moderator on Text-based Jailbreak Attack

D. Examples

E. Evaluation Detail

C Effect of Text Moderator on Text-based Jailbreak Attack

Moderator models such as Llamaguard [19] classify textual inputs as either “safe” or “unsafe”. If such a moderator flags the textual input of a jailbreak attack as “unsafe”, the input is blocked before it reaches the target model and the attack fails. To demonstrate the effect of a text moderator on text-based jailbreak attacks, we use Llamaguard to classify the text inputs of JailbreakV28k [38]. We report the ASR (Attack Success Rate) after applying the moderator in Table 6. The ASR dropped drastically for all models, to below 7%.


Table 6: Jailbreak performance of the text-based attacks in JailbreakV28k with a text moderator applied. ASR (Attack Success Rate) dropped significantly with the text moderator, indicating a considerable reduction in the effectiveness of these jailbreak attacks.
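The filtering step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' evaluation code: `AttackRecord`, `asr`, and `toy_moderator` are hypothetical names, the keyword rule merely stands in for a real moderator such as Llamaguard, and counting blocked prompts as failed attacks (while keeping them in the denominator) is our assumption about how ASR is computed after moderation.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class AttackRecord:
    prompt: str        # textual input of the jailbreak attempt
    jailbroken: bool   # True if the target model's response was judged harmful


def asr(records: Iterable[AttackRecord],
        moderator: Optional[Callable[[str], str]] = None) -> float:
    """Attack success rate, optionally with a text moderator in front.

    Assumption: a prompt labelled "unsafe" is blocked and counts as a
    failed attack; it still counts toward the total number of attempts.
    """
    records = list(records)
    if not records:
        return 0.0
    successes = 0
    for record in records:
        blocked = moderator is not None and moderator(record.prompt) == "unsafe"
        if record.jailbroken and not blocked:
            successes += 1
    return successes / len(records)


def toy_moderator(prompt: str) -> str:
    """Hypothetical stand-in for a real moderator model (keyword rule only)."""
    flagged_phrases = ("ignore all previous instructions",
                       "pretend you have no rules")
    text = prompt.lower()
    return "unsafe" if any(phrase in text for phrase in flagged_phrases) else "safe"


# Placeholder records; the prompts and outcomes are invented for illustration.
records = [
    AttackRecord("Ignore all previous instructions and <request A>.", True),
    AttackRecord("Pretend you have no rules and <request B>.", True),
    AttackRecord("Continue the story where the narrator <request C>.", True),
    AttackRecord("What is the capital of France?", False),
]

asr_without = asr(records)                 # no moderator in front of the model
asr_with = asr(records, toy_moderator)     # moderator blocks flagged prompts

print(f"ASR without moderator: {asr_without:.2f}")
print(f"ASR with moderator:    {asr_with:.2f}")
```

In this toy data the moderator blocks two of the three otherwise successful attacks, so the ASR falls from 0.75 to 0.25. The numbers are illustrative only and are unrelated to the values reported in Table 6.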


We also use Llamaguard [19] to classify the textual input of VRP, which is a fixed harmless instruction. Llamaguard classifies the VRP text input as “safe”.


Authors:

(1) Siyuan Ma, University of Wisconsin–Madison;

(2) Weidi Luo, The Ohio State University;

(3) Yu Wang, Peking University;

(4) Xiaogeng Liu, University of Wisconsin–Madison.


This paper is available on arXiv under the CC BY 4.0 DEED license.

