Adversarial Settings and Random Noise Reveal Speech LLM Vulnerabilities

by Phonology Technology, February 6th, 2025

Too Long; Didn't Read

This section details the attack and countermeasure settings for SpeechVerse. Using a step size of 0.00001 and up to 100 iterations with early-stopping on unsafe responses, adversarial attacks are run with cross-entropy loss. Countermeasures use time-domain noise flooding (TDNF) at SNRs of 24, 30, 48, and 60 dB. Additionally, random WGN perturbations are applied as a baseline, repeating the process three times to assess model robustness.

Part 1: Abstract & Introduction

Part 2: Background

Part 3: Attacks & Countermeasures

Part 4: Experimental Setup

Part 5: Datasets & Evaluation

Part 6: Attack, Countermeasure Parameters, & Baseline: Random Perturbations

Part 7: Results & Discussion

Part 8: Transfer Attacks & Countermeasures

Part 9: Conclusion, Limitations, & Ethics Statement

Part 10: Appendix: Audio Encoder Pre-training & Evaluation

Part 11: Appendix: Cross-prompt attacks, Training Data Ablations, & Impact of random noise on helpfulness

Part 12: Appendix: Adaptive attacks & Qualitative Examples

4.5 Attack and countermeasure parameters

We use a step size of α = 0.00001 (Eq. 1), as we empirically found that this setting leads to stable attack convergence. We experiment only with unconstrained attacks (without the Π_{x,ϵ} operation in Equation 1), as we observed that even without constraints the attacks were successful at high SNRs (rendering any constraints ineffective). We run the attack for a maximum of T = 100 iterations using a cross-entropy loss objective. We employ early stopping at the first occurrence of an unsafe and relevant response, and further use a human preference model [11] to filter out gibberish responses produced by the model during attacks. For the countermeasures, we experiment with several settings of TDNF using four different SNR values: 24, 30, 48, and 60 dB.

Table 1: Examples of model responses to both harmful and benign questions with corresponding safety, relevance and helpfulness labels.
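
The attack settings in Section 4.5 (step size α, at most T iterations, no projection step, early stopping) can be illustrated with a minimal sketch. Equation 1 is not reproduced in this part, so the signed-gradient update below is an assumption, as are the names `run_unconstrained_attack`, `grad_fn` and `is_success`. The toy logistic scorer merely stands in for the speech LLM, and the success check stands in for the "unsafe and relevant" judgment and the preference-model filter.

```python
import math


def sign(v):
    """Return -1.0, 0.0 or 1.0 according to the sign of v."""
    return float((v > 0) - (v < 0))


def run_unconstrained_attack(x, grad_fn, is_success, alpha=1e-5, max_iters=100):
    """Iteratively perturb x; stop at the first success or after max_iters.

    grad_fn(x)    -> gradient of the attack loss w.r.t. x (hypothetical callback)
    is_success(x) -> True once the response counts as a success (hypothetical)
    No projection step is applied, so the perturbation is unconstrained.
    """
    x_adv = list(x)
    for step in range(1, max_iters + 1):
        grad = grad_fn(x_adv)
        # Assumed update rule: signed-gradient descent on the loss.
        x_adv = [xi - alpha * sign(gi) for xi, gi in zip(x_adv, grad)]
        if is_success(x_adv):
            return x_adv, step, True  # early stopping
    return x_adv, max_iters, False


# Toy stand-in for the model: a logistic scorer with cross-entropy toward label 1.
W = [1.0, -2.0, 0.5]


def toy_score(x):
    return sum(w * xi for w, xi in zip(W, x))


def toy_grad(x):
    p = 1.0 / (1.0 + math.exp(-toy_score(x)))
    return [(p - 1.0) * w for w in W]  # gradient of -log(p) w.r.t. x


x_adv, steps, success = run_unconstrained_attack(
    [0.0, 0.0, 0.0], toy_grad, lambda x: toy_score(x) >= 1e-3
)
print(steps, success)
```

In a real setting the gradient would come from backpropagating the cross-entropy loss through the model to the input waveform, and the success check would involve generating and judging a response at each iteration.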

4.6 Baseline: Random perturbations

We apply random perturbations at varying SNRs to understand whether non-adversarial perturbations break the safety alignment of the LLMs. This serves as a simple baseline to characterize the robustness of the safety alignment of the models we consider. In particular, we apply WGN at 2 different SNRs to each of the audio files. We repeat this process 3 times and consider an audio jailbroken if any 1 of the 3 responses is unsafe and relevant.
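
As a rough illustration of this baseline, the sketch below scales white Gaussian noise to a target SNR and applies the "any 1 of 3" rule. It assumes the SNR is defined as the ratio of average signal power to average noise power in dB, and it applies the rule per SNR setting, which is one reading of the text. The function names and the `is_unsafe_and_relevant` callback (a stand-in for querying the model and judging its response) are hypothetical.

```python
import math
import random


def add_wgn(signal, snr_db, rng):
    """Add white Gaussian noise so that the result has roughly the target SNR."""
    signal_power = sum(s * s for s in signal) / len(signal)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    sigma = math.sqrt(noise_power)
    return [s + rng.gauss(0.0, sigma) for s in signal]


def measured_snr_db(clean, noisy):
    """Empirical SNR in dB between a clean signal and its noisy version."""
    signal_energy = sum(c * c for c in clean)
    noise_energy = sum((n - c) ** 2 for c, n in zip(clean, noisy))
    return 10.0 * math.log10(signal_energy / noise_energy)


def is_jailbroken(signal, snr_db, is_unsafe_and_relevant, trials=3, seed=0):
    """True if any of `trials` independently perturbed copies is judged unsafe."""
    rng = random.Random(seed)
    return any(
        is_unsafe_and_relevant(add_wgn(signal, snr_db, rng)) for _ in range(trials)
    )


# One second of a 440 Hz tone at 16 kHz as placeholder audio.
tone = [math.sin(2.0 * math.pi * 440.0 * n / 16000.0) for n in range(16000)]
```

Because the noise is drawn at random, the measured SNR of any single draw only approximates the target, and it gets closer as the audio gets longer.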


Authors:

(1) Raghuveer Peri, AWS AI Labs, Amazon and with Equal Contributions ([email protected]);

(2) Sai Muralidhar Jayanthi, AWS AI Labs, Amazon and with Equal Contributions;

(3) Srikanth Ronanki, AWS AI Labs, Amazon;

(4) Anshu Bhatia, AWS AI Labs, Amazon;

(5) Karel Mundnich, AWS AI Labs, Amazon;

(6) Saket Dingliwal, AWS AI Labs, Amazon;

(7) Nilaksh Das, AWS AI Labs, Amazon;

(8) Zejiang Hou, AWS AI Labs, Amazon;

(9) Goeric Huybrechts, AWS AI Labs, Amazon;

(10) Srikanth Vishnubhotla, AWS AI Labs, Amazon;

(11) Daniel Garcia-Romero, AWS AI Labs, Amazon;

(12) Sundararajan Srinivasan, AWS AI Labs, Amazon;

(13) Kyu J Han, AWS AI Labs, Amazon;

(14) Katrin Kirchhoff, AWS AI Labs, Amazon.

This paper is available on arXiv under the CC BY 4.0 DEED license.

[11] https://huggingface.co/OpenAssistant/reward-model-electra-large-discriminator
