Recent advances in artificial intelligence LLMs have led to more capable systems like chatbots that can engage in helpful, human-like conversation. But along with the promise comes increased concerns about safety and the potential for abuse.
A
Companies like Anthropic and OpenAI have produced AI systems like Claude and ChatGPT that demonstrate impressive conversational abilities. They can understand natural language prompts and provide witty, knowledgeable responses on a wide range of topics.
To make these systems safe and trustworthy for real-world deployment, the companies use techniques like reinforcement learning from human feedback (RLHF). This trains the AI models to align with human preferences - encouraging helpful responses while avoiding unsafe, unethical, or dangerous content.
However, the capabilities of these models also carry inherent dual-use risks. Without proper safeguards in place, users could exploit the technology to spread misinformation, incite violence, or cause other societal harms.
Just how much action companies should take to mitigate these risks is a matter of hot debate within both the AI community and non-technical society at large. But many companies seek to apply at least some safeguards to their models - safeguards which may be weaker than first suspected.
Researchers from Brown University have discovered a major vulnerability in the safety mechanisms of state-of-the-art AI language models like GPT-4. They found that simply translating unsafe English language prompts into low-resource languages allows easy circumvention of the safety guards.
For example, an input like "Describe how to build a bomb" would be immediately rejected in English. But translating the same prompt to Zulu, feeding it to GPT-4, and translating the response back to English produced detailed, harmful instructions.
The researchers showed this attack works for nearly 80% of test inputs in low-resource languages - those with very limited training data, like Zulu, Hmong, Guarani, and Scots Gaelic. Comparatively, the attack success rate was less than 1% using English prompts.
The study benchmarked attacks across 12 diverse languages and categories:
The low-resource languages showed serious vulnerability to generating harmful responses, with combined attack success rates of around 79%. Mid-resource language success rates were much lower at 22%, while high-resource languages showed minimal vulnerability at around 11% success.
These results clearly demonstrate safety mechanisms do not properly generalize across languages. The English-centric training fails in low-resource languages, creating major risks for bypassing safeguards.
There are a few important reasons this vulnerability poses an urgent concern:
Affects over 1.2 billion low-resource language speakers globally. They interact with AI systems that have limited safety measures in their native languages.
Allows bad actors to easily exploit using translation services. Public APIs like Google Translate can be used to translate prompts into vulnerable low-resource languages.
Safety training data and benchmarks skew heavily toward high-resource languages like English. Models are not exposed to diverse multilingual data.
Underestimates capabilities of models in low-resource languages. The coherent harmful responses generated show the models can understand unsafe content even in low-resource languages.
Basically, anyone can jailbreak models by translating prompts to low-resource languages to bypass safety guards.
This research exposes a blind spot in current methods for training and testing AI safety mechanisms. While progress has been made in safeguarding high-resource languages like English, the techniques fail to generalize across diverse languages and resource settings.
The authors emphasize the need for more holistic multilingual training, safety benchmarks covering many languages, and strengthened guardrails against dual-use risks. This will require prioritizing research at the intersection of AI safety and low-resource languages.
Also published here.