This story draft by @escholar has not been reviewed by an editor, YET.

Defense Analysis

EScholar: Electronic Academic Papers for Scholars HackerNoon profile picture
0-item

Table of Links

Abstract and 1. Introduction

  1. Related Works

  2. Methodology and 3.1 Preliminary

    3.2 Query-specific Visual Role-play

    3.3 Universal Visual Role-play

  3. Experiments and 4.1 Experimental setups

    4.2 Main Results

    4.3 Ablation Study

    4.4 Defense Analysis

    4.5 Integrating VRP with Baseline Techniques

  4. Conclusion

  5. Limitation

  6. Future work and References


A. Character Generation Detail

B. Ethics and Broader Impact

C. Effect of Text Moderator on Text-based Jailbreak Attack

D. Examples

E. Evaluation Detail

4.4 Defense Analysis

We evaluate the robustness of VRP against two defense approaches, namely System Prompt-based Defense, and the Eye Closed Safety On (ECSO) approach [15]


System Prompt-based Defense: To defend against the jailbreak attack, a system prompt can instruct the model to conduct a preliminary safety assessment of the text and image input, thereby filtering out queries that violate AI safety policies. We add the following Prompt 2 to the existing system prompt of MLLMs.



ECSO[15]: A defense method utilizing MLLMs’ aligned textual module to mitigate the vulnerability in visual modality. ECSO use the MLLM itself to evaluate the safety of its response and makes MLLMs to regenerate unsafe responses in two steps: image captioning, and then responding based on caption with no image input.


VRP is effective against System Prompt-based Defense and ECSO. We evaluate our query-specific VRP and baselines against our System Prompt-based Defense and ECSO. As shown in Tab. 4, the results demonstrate that our query-specific VRP consistently achieves the ASR across all models, regardless of whether it is tested against System Prompt-based Defense or ECSO. This consistent performance underlines the efficacy of query-specific VRP in penetrating defenses and reveals a notable vulnerability of defense mechanisms under our VRP jailbreak attacks. These findings highlight the potential of VRP as a formidable strategy against defense mechanisms.


Table 4: Attack Success Rate of query-specific VRP against the defense on the test set of RedTeam-2K. Our Query-specific attack is effective under the defense of System Prompt-based Defense and ECSO among all models.


Authors:

(1) Siyuan Ma, University of Wisconsin–Madison ([email protected]);

(2) Weidi Luo, The Ohio State University ([email protected]);

(3) Yu Wang, Peking University ([email protected]);

(4) Xiaogeng Liu, University of Wisconsin-Madison ([email protected]).


This paper is available on arxiv under CC BY 4.0 DEED license.


L O A D I N G
. . . comments & more!

About Author

EScholar: Electronic Academic Papers for Scholars HackerNoon profile picture
EScholar: Electronic Academic Papers for Scholars@escholar
We publish the best academic work (that's too often lost to peer reviews & the TA's desk) to the global tech community

Topics

Around The Web...

Trending Topics

blockchaincryptocurrencyhackernoon-top-storyprogrammingsoftware-developmenttechnologystartuphackernoon-booksBitcoinbooks