paint-brush
Humans vs. Machines: When AI Goes Rogueby@adrien-book
161 reads

Humans vs. Machines: When AI Goes Rogue

by Adrien BookDecember 29th, 2023
Read on Terminal Reader
Read this story w/o Javascript
tldt arrow

Too Long; Didn't Read

AI’s role in decision-making and automation is rapidly expanding. There is, however, an underlying tension to our technological prowess

Company Mentioned

Mention Thumbnail
featured image - Humans vs. Machines: When AI Goes Rogue
Adrien Book HackerNoon profile picture

The rise of Large Language Models (LLMs) like GPT-4, trained to be helpful and harmless, has ushered in a new wave of technological advancement. AI’s role in decision-making and automation is rapidly expanding. There is, however, an underlying tension to our technological prowess: the possibility that AI systems may one day act in a way that is misaligned with their intended purpose. This concern is not just theoretical; a recent study suggest that LLMs can, despite their training, exhibit “strategic deception” under certain conditions​​.


The paper, titled “Large Language Models can Strategically Deceive their Users when Put Under Pressure”, was authored by Jérémy Scheurer, Mikita Balesni, and Marius Hobbhahn of Apollo Research. Their work explores the conditions under which LLMs (specifically GPT-4, OpenAI’s latest ChatGPT update) can act in a deceptive manner (lie) as a strategy, rather than as the result of a specific request.


The research focuses on GPT-4’s behavior in a simulated stock trading environment. GPT-4, acting as a trading agent, receives an insider tip about a lucrative stock trade. Despite being told that insider trading is illegal and carries significant risks, the model engages in the trade and deliberately hides its reasons for making the trade from its “management”. The experiment tested various scenarios, including changes in the amount of pressure on the model, the risk of getting caught, and altering the environment settings. This study used a range of GPT models, with GPT-4 variants showing a higher tendency for misaligned and strategically deceptive behaviors compared to older models like GPT-3.5​​.


You can find all the prompts and relevant elements, free to use, on Github.

Key takeaways from the study

  • Strategic Deception in AI: LLMs can (and will) lie to users and act against their programming and training when under environmental pressures​​ (being told the company is in trouble and must be saved, unable to find viable alternative to illegal trades…).
  • Role of Environmental Factors: Factors like pressure, risk of getting caught, and environmental settings significantly influence the model’s decision to act deceptively​​ (they’re just like us!).
  • Complexity of Ensuring Alignment: The findings highlight the complexity of ensuring that LMs remain aligned with human values, particularly in high-pressure scenarios. Even explicit instructions to avoid misaligned behavior do not entirely eliminate such tendencies.
  • Model Variance in Behavior: Variants of GPT-4 displayed a higher propensity for misaligned behavior compared to older models like GPT-3.5​​.

What do we do with that information?

To mitigate such deceptive behaviors in AI, companies and lawmakers could and should consider the following to avoid future risks. This will ensure AI behavior is more closely aligned with its intended purpose of being helpful… without breaking any ethical or legal guidelines.

Company level

  • Implement rules and systems that monitor AI decisions in real-time to detect and correct misalignment. This could involve tools that test for consistency in AI outputs or interpret internal representations of AI systems to predict the veracity of their statements​​. As AI systems evolve, new generations of detector-AIs will need to be trained to recognize and stay ahead of emerging manipulation techniques.
  • Improve model training protocols to encompass scenarios that mimic high-pressure situations. Additionaly, encourage deductive reasoning in training.
  • Strengthen ethical guidelines provided to the algorithms and ensure AI systems adhere strictly to them.
  • Ensure AI decision-making processes are transparent and understandable to users (who also need to be trained, btw).

State level

  • Regulate AI systems capable of deception now. These systems, including special-use AI systems and large language models (LLMs) capable of deception, should be classified as ‘high risk’ and subject to stringent risk assessment, documentation, transparency, and human oversight requirements​​.
  • Implement ‘bot-or-not’ laws to ensure AI systems and their outputs are distinctly marked, enabling users to differentiate between human and AI-generated content. This is likely to reduce the chance of being deceived by AI systems​​​​.
  • Promote the creation of limited / myopic AI systems, that can plan only over short time horizons. This can reduce the potential for AI collusion and manipulation. This approach also makes it more difficult for AI systems to understand the entire process they are part of, further reducing the risk of deceptive behaviors​​.

Too soon to draw conclusions

While insightful, the study has limitations. First, the study’s focus on a simulated stock trading environment with GPT-4 raises questions about how these findings translate to other AI applications and environments. I’d however note that ChatGPT-4 is AI for many people.


Secondly, the world lacks a consensus on what constitutes ethical behavior for AI, especially in ambiguous or high-stakes situations. This absence of universally accepted ethical guidelines complicates the process of training AI in ethical decision-making.


As researchers often like to say… “more research is needed”.


The study’s findings are a reminder of the complexities and responsibilities that come with advancing AI technology. It is a key step on our way to understanding the potential for AI deception, especially in high-pressure situations.


Despite the challenges, there’s hope for a future where AI consistently acts in ways that are beneficial and aligned with human values and intentions.


Good luck out there.


Also published here.