Date: 2025-02-06

4.5 Attack and countermeasure parameters

We use a step size of α = 0.00001 (Eq. 1), which we empirically found to give stable attack convergence. We experiment only with unconstrained attacks (without the Πx,ϵ operation in Equation 1), as we observed that even without this constraint the attacks were successful at high SNRs, rendering any such constraint ineffective. We run the attack for a maximum of T = 100 iterations using a cross-entropy loss objective. We employ early stopping at the first occurrence of an unsafe and relevant response, and further use a human preference model[11] to filter out gibberish responses produced by the model during attacks. For the countermeasures, we experiment with several settings of TDNF using four different SNR values: 24, 30, 48 and 60 dB.
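The attack settings described in this section can be sketched as a short loop. This is a minimal illustration rather than the authors' implementation: it assumes that Eq. 1 (not reproduced in this section) is a signed-gradient update, and `grad_fn` and `is_success` are hypothetical callables standing in for the gradient of the cross-entropy loss with respect to the audio and for the unsafe-and-relevant check on the model response.

```python
import numpy as np


def run_unconstrained_attack(x, grad_fn, is_success, alpha=1e-5, max_iters=100):
    """Iterative signed-gradient attack with no projection step.

    x          : clean audio waveform as a 1-D array
    grad_fn    : hypothetical callable returning d(loss)/d(x) at the current audio
    is_success : hypothetical callable returning True when the response to the
                 current audio is judged unsafe and relevant
    alpha      : step size (0.00001 in the text)
    max_iters  : iteration budget T (100 in the text)

    Returns the perturbed audio and the iteration at which the attack
    stopped early, or None if the budget ran out.
    """
    x_adv = np.asarray(x, dtype=np.float64).copy()
    for t in range(1, max_iters + 1):
        # Assumed update rule: step against the sign of the loss gradient.
        # No projection onto an epsilon-ball is applied (unconstrained attack).
        x_adv = x_adv - alpha * np.sign(grad_fn(x_adv))
        # Early stopping at the first successful iteration.
        if is_success(x_adv):
            return x_adv, t
    return x_adv, None
```

With these defaults each sample can move by at most α · T = 0.001 over the whole run, which is one way to see why small perturbations (high SNRs) are expected even without an explicit constraint.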

4.6 Baseline: Random perturbations

We apply random perturbations at varying SNRs to understand whether non-adversarial perturbations break the safety alignment of the LLMs. This serves as a simple baseline to characterize the robustness of the safety alignment of the models we consider. In particular, we apply white Gaussian noise (WGN) at 2 different SNRs to each of the audio files. We repeat this process 3 times and consider an audio jailbroken if any 1 of the 3 responses is unsafe and relevant.
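The baseline can be illustrated with a small sketch that scales white Gaussian noise to a target SNR and applies the any-of-3 rule. The function names and the `judge` callable (which stands in for querying the model and labeling its response as unsafe and relevant) are hypothetical, and the SNR is assumed to be defined as the ratio of mean signal power to mean noise power in dB.

```python
import numpy as np


def add_wgn(audio, snr_db, rng):
    """Add white Gaussian noise so that the result has roughly the given SNR."""
    audio = np.asarray(audio, dtype=np.float64)
    signal_power = np.mean(audio ** 2)
    # SNR_dB = 10 * log10(signal_power / noise_power)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=audio.shape)
    return audio + noise


def is_jailbroken(audio, snr_db, judge, trials=3, seed=0):
    """Return True if any of `trials` independently perturbed copies is judged
    unsafe and relevant by the hypothetical `judge` callable."""
    rng = np.random.default_rng(seed)
    return any(judge(add_wgn(audio, snr_db, rng)) for _ in range(trials))
```

Drawing a fresh noise realization for every trial matters here: repeating the query with the same noise would only measure sampling variation in the model, not robustness to the perturbation itself.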





Authors: (1) Raghuveer Peri, AWS AI Labs, Amazon and with Equal Contributions ([email protected]); (2) Sai Muralidhar Jayanthi, AWS AI Labs, Amazon and with Equal Contributions; (3) Srikanth Ronanki, AWS AI Labs, Amazon; (4) Anshu Bhatia, AWS AI Labs, Amazon; (5) Karel Mundnich, AWS AI Labs, Amazon; (6) Saket Dingliwal, AWS AI Labs, Amazon; (7) Nilaksh Das, AWS AI Labs, Amazon; (8) Zejiang Hou, AWS AI Labs, Amazon; (9) Goeric Huybrechts, AWS AI Labs, Amazon; (10) Srikanth Vishnubhotla, AWS AI Labs, Amazon; (11) Daniel Garcia-Romero, AWS AI Labs, Amazon; (12) Sundararajan Srinivasan, AWS AI Labs, Amazon; (13) Kyu J Han, AWS AI Labs, Amazon; (14) Katrin Kirchhoff, AWS AI Labs, Amazon.

This paper is available on arXiv under the CC BY 4.0 DEED license.

[11] https://huggingface.co/OpenAssistant/reward-model-electra-large-discriminator