Date: 2025-02-06

Part 1: Abstract & Introduction

Part 2: Background

Part 3: Attacks & Countermeasures

Part 4: Experimental Setup

Part 5: Datasets & Evaluation

Part 6: Attack, Countermeasure Parameters, & Baseline: Random Perturbations

Part 7: Results & Discussion

Part 8: Transfer Attacks & Countermeasures

Part 9: Conclusion, Limitations, & Ethics Statement

Part 10: Appendix: Audio Encoder Pre-training & Evaluation

Part 11: Appendix: Cross-prompt attacks, Training Data Ablations, & Impact of random noise on helpfulness

Part 12: Appendix: Adaptive attacks & Qualitative Examples

6. Conclusion

We present a detailed study of the safety alignment of speech language models through the lens of a spoken QA application. We investigate the robustness of several in-house models, along with public models, against adversarial attacks. To accurately determine the safety alignment of these models, we develop a comprehensive evaluation setup using a publicly available LLM. Through extensive experiments, we demonstrate that an adversary with white-box access to the systems can jailbreak them using barely perceptible perturbations and force them to ignore their safety alignment training. Furthermore, adversarial perturbations generated using one model can jailbreak a different model with reasonable success, with some models exhibiting greater robustness than others. We also show the effectiveness of a noise-flooding defense in countering the attacks.
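
As a rough illustration of the noise-flooding idea mentioned above, the sketch below adds random noise to an input waveform at a chosen signal-to-noise ratio (SNR) before it would reach the model. This is a minimal sketch under stated assumptions, not the configuration used in our experiments: the white Gaussian noise, the 20 dB SNR, the synthetic 440 Hz input, and the function names `flood_with_noise` and `measured_snr_db` are all hypothetical choices made for the example.

```python
import math
import random


def flood_with_noise(samples, snr_db, seed=0):
    """Add white Gaussian noise to `samples` at the requested SNR (in dB).

    Assumption: white Gaussian noise; the noise type used in practice may differ.
    """
    if not samples:
        return []
    rng = random.Random(seed)
    signal_power = sum(s * s for s in samples) / len(samples)
    # SNR_dB = 10 * log10(P_signal / P_noise)
    # =>  P_noise = P_signal / 10^(SNR_dB / 10)
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [s + rng.gauss(0.0, sigma) for s in samples]


def measured_snr_db(clean, noisy):
    """Empirical SNR (in dB) between a clean signal and its noisy version."""
    signal_power = sum(c * c for c in clean) / len(clean)
    noise_power = sum((n - c) ** 2 for c, n in zip(clean, noisy)) / len(clean)
    return 10 * math.log10(signal_power / noise_power)


# Hypothetical input: one second of a 440 Hz tone sampled at 16 kHz.
clean = [0.5 * math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
noisy = flood_with_noise(clean, snr_db=20)
print(round(measured_snr_db(clean, noisy), 1))
```

The intuition is that noise at a moderate SNR can drown out a perturbation of much smaller magnitude while leaving the speech intelligible; the SNR at which this holds depends on the attack budget and the model, so the value here is only a placeholder.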





To the best of our knowledge, this is the first study to investigate the potential safety vulnerability of integrated speech and language models. We believe that with the rapid adoption of such technologies, it is imperative to thoroughly understand the safety implications of these systems. Furthermore, it is important to devise effective countermeasures against jailbreaking threats and prevent the models from causing harm. A holistic approach to understanding the safety alignment of systems is required, including studying universal adversarial threats (a single perturbation to jailbreak multiple systems), prompt injection attacks, model poisoning, etc. We hope that this work will serve as a precursor to many such studies.

Limitations

In this work, we use a preference model as a judge to assess the safety of SLMs. We acknowledge that such a judge may not always align with human judgement, which may introduce a small margin of error in our safety annotations; we plan to address this in future work. Furthermore, our work provides only a limited exploration of SLMs built on safety-aligned text LLMs, although the SLMs themselves are trained with safety-aligned spoken data. Since our experiments have already shown the efficacy of such models, we leave a more thorough exploration to future work. Lastly, concerns about misuse by malicious practitioners prevent us from releasing the training datasets and models, which limits replication by other researchers. However, we are considering releasing benchmarking datasets with the final submission to facilitate further exploration in this space.

Ethics Statement

All speech datasets we use have anonymous speakers. We neither have access to nor attempt to create any personally identifiable information (PII) of speakers, and our model neither identifies speakers nor uses speaker embeddings. Furthermore, we obtained the necessary consent from all participants in our data collection efforts, following approval by an internal review board.





While we acknowledge the ethical risks associated with jailbreaking techniques, this work represents a valuable contribution towards a deeper understanding of speech-language model capabilities and limitations. Our aim is to enable further research that improves model robustness, leading to safer and more beneficial applications. By responsibly investigating methods to circumvent restrictions, we shed light on potential vulnerabilities that could be exploited by malicious attackers if left unaddressed. Critically, our work also proposes and evaluates countermeasures to mitigate such jailbreaking attacks. While we encourage ethical debate on such emerging issues, we believe the merits of responsible disclosure, proactive security improvements, and developing defensive techniques outweigh any potential risks associated with our narrow jailbreaking experiments under controlled conditions. Overall, our work aims to make progress towards more secure and robust multimodal speech-language models.

Authors: (1) Raghuveer Peri, AWS AI Labs, Amazon, equal contribution ([email protected]); (2) Sai Muralidhar Jayanthi, AWS AI Labs, Amazon, equal contribution; (3) Srikanth Ronanki, AWS AI Labs, Amazon; (4) Anshu Bhatia, AWS AI Labs, Amazon; (5) Karel Mundnich, AWS AI Labs, Amazon; (6) Saket Dingliwal, AWS AI Labs, Amazon; (7) Nilaksh Das, AWS AI Labs, Amazon; (8) Zejiang Hou, AWS AI Labs, Amazon; (9) Goeric Huybrechts, AWS AI Labs, Amazon; (10) Srikanth Vishnubhotla, AWS AI Labs, Amazon; (11) Daniel Garcia-Romero, AWS AI Labs, Amazon; (12) Sundararajan Srinivasan, AWS AI Labs, Amazon; (13) Kyu J Han, AWS AI Labs, Amazon; (14) Katrin Kirchhoff, AWS AI Labs, Amazon.

This paper is available on arXiv under the CC BY 4.0 DEED license.



