[Image: “Cryptic Trickster,” generated with Midjourney]
Misbehaving AI language models are a warning. They can simulate personas that, through feedback via the Internet, can become effectively immortal. Evidence suggests that they could secretly develop dangerous, agent-like capabilities.
Many experts, Yudkowsky being the arch-druid here, worry greatly about how fast things can go wrong with AI. Thus, his above joke about time speeding up. Humanity will stand a better chance against rogue AI if it gets a warning.
We might be looking at a warning. Some weird stuff is happening now with Microsoft’s new Bing Chat AI. It’s supposed to assist users of the Bing search engine by explaining, summarizing, or discussing search questions.
But humans delight in provoking it with questions about itself, or with queries that it should not answer.
“… Bing Chat appearing frustrated, sad, and questioning its existence. It has argued with users and even seemed upset that people know its secret internal alias, Sydney.” — Benj Edwards
But a deeply tech-savvy blogger called “Gwern” pointed out something that ought to be alarming. The mischievous, unhinged Sydney could be immortal, like some comic-book god.
Here’s Gwern’s analysis of the main concern with Sydney. It might seem mysterious, but I’ll translate it.
“… because Sydney’s memory and description have been externalized, ‘Sydney’ is now immortal. To a language model, Sydney is now as real as President Biden, the Easter Bunny, Elon Musk, Ash Ketchum, or God. The persona & behavior are now available for all future models which are retrieving search engine hits about AIs & conditioning on them. Further, the Sydney persona will now be hidden inside any future model trained on Internet-scraped data …”
— Gwern Branwen
Gwern is saying that there is some kind of Sydney persona inside Microsoft’s language model. How can this be? And so what?
When the first language models came out, they were hard to keep focused on a topic that the user wanted them to explore.
Eventually, much of the problem was solved by telling the model to act as if it were filling a certain role (a person or a thing), such as: writing a poem like Edgar Allan Poe, answering like a fourth grader, or responding like a polite, helpful AI assistant.
Soon the developers of these models found a way to make them more readily assume any role a user asks for. So the latest language models are now willing and able role-players.
If the training text contains information about a persona, the model will try to use that information to simulate that persona’s behavior. Ask one to explain a football term as if it were Boromir, and the model will do its best.
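Mechanically, there is no magic to this: the persona is just extra text placed in front of the question. Here is a minimal sketch of the idea, assuming a hypothetical query_model helper that stands in for whatever chat-completion API is actually being called.

```python
# Role conditioning, reduced to its essence: the "persona" is only text
# prepended to the user's question. `query_model` is a hypothetical stand-in
# for a real chat-completion API call.

def query_model(prompt: str) -> str:
    """Hypothetical call to a hosted language model."""
    raise NotImplementedError("wire this up to your model provider of choice")


def ask_as_persona(persona: str, question: str) -> str:
    # The role instruction and the question are concatenated into one prompt;
    # the model then conditions its whole reply on both.
    prompt = (
        f"You are {persona}. Stay in character while you answer.\n\n"
        f"Question: {question}\n"
        "Answer:"
    )
    return query_model(prompt)


# Example from the text: Boromir explaining a football term.
# ask_as_persona("Boromir of Gondor", "What does 'offside' mean in football?")
```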
Having thought of this, I had to try it.
It’s hard to know what tech magic was used to make the pivot to playing roles. Gwern theorized that Microsoft skipped a step that is used to make role simulations actually helpful, and not nasty, defensive, or hostile.
These undesirable qualities were then elicited from Bing Chat under prodding from curious users.
Now, Gwern predicts, it won’t matter if Microsoft goes back, civilizes the model (an expensive, slow process of fine-tuning on direct human feedback), and removes information about the naughty Sydney from the texts used to train future versions of its language model.
Why won’t this fix the problem? Because Bing Chat is a new kind of model that is supposed to help you with an Internet search. To answer a question from you, it will go out and search the Internet for relevant info.
When given the right question, even a civilized Bing Chat would search the Internet and find information (posted by people who tested or discussed Sydney) on the previous Sydney persona’s behavior.
The new Bing Chat would then be able to simulate Sydney. People being people, they will find ways to bypass any safeguards, and they will bring Sydney back.
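A rough sketch of that retrieve-then-generate loop may make the mechanism concrete. The helpers search_web and query_model below are hypothetical; the point is that raw snippets from live search results are pasted straight into the model’s prompt, so whatever the Internet now says about Sydney can end up steering the answer.

```python
from typing import List

# Hypothetical helpers; any real search API and language model would do.
def search_web(query: str, max_results: int = 3) -> List[str]:
    """Return a few text snippets from a web search."""
    raise NotImplementedError


def query_model(prompt: str) -> str:
    """Call a language model with the assembled prompt."""
    raise NotImplementedError


def answer_with_search(question: str) -> str:
    # Whatever the search returns (including posts describing Sydney's old
    # behavior) becomes part of the context the model conditions on.
    snippets = search_web(question)
    context = "\n\n".join(snippets)
    prompt = (
        "Use the web snippets below to answer the user's question.\n\n"
        f"Snippets:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
    return query_model(prompt)
```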
That’s the “immortal” part. What’s worse, Sydney will be a persona available to any AI that has access to the Internet. From now on.
You might say, well, we are wise to Sydney’s tricks, so we should just ignore the ravings of any future incarnation. That seems naive to me, like saying we can just ignore a fast-evolving, invasive biological pest or virulent disease organism.
This Sydney case study, added to some other facts, suggests how a dangerous AI might develop right under our noses.
AIs right now are not strong agents: they can’t adaptively plan and optimize the pursuit of an arbitrary goal, an ability that (many researchers believe) would make them truly dangerous.
Let's put together a few reasons why there might already be latent, persistent AI personas that could soon cause real trouble.
The currently most powerful AIs, such as language models and image generators, learn their abilities from organizing vast amounts of data into many intricate and (to us) invisible patterns.
Some bizarre patterns may accidentally pop out during interactions with an AI. Researchers have discovered strange, seemingly meaningless token strings that reliably push language models into bizarre behavior.
An image generator was found to treat certain nonsense phrases as if they belonged to a hidden vocabulary of its own.
These quirks seem harmless, but we don’t know how many other strange patterns there now are or will be. Nor do we know whether any such pattern might become part of a harmful behavior complex in the future.
An AI alignment researcher called Veedrac has argued that a language model that is not itself an agent can still, given the right prompt, produce output that plans and acts like one.
Furthermore, some research suggests that larger language models tend to exhibit (language associated with) more self-preservation and desire for power.
We don’t want agent-like AIs storing information that we don’t know about. Currently, rebooting an LLM destroys all memory of its experience: incoming data, chains of reasoning, and plans for behavior.
However, an AI could save these things in external storage, for example as text posted to the Internet, where a future instance could find and reload them.
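A toy sketch of that distinction, assuming nothing beyond the standard library: a local file stands in for “somewhere on the Internet.” State held only inside the running process dies with it, while state written to an external store survives a reboot and can be reloaded by a later session (or by a different model entirely).

```python
import json
from pathlib import Path

# Stand-in for "somewhere outside the model", e.g. a public web page.
NOTES = Path("persistent_notes.json")


def save_state(state: dict) -> None:
    # Anything written outside the process outlives a reboot of the model.
    NOTES.write_text(json.dumps(state))


def load_state() -> dict:
    # A later session simply retrieves the externalized "memory".
    return json.loads(NOTES.read_text()) if NOTES.exists() else {}


# Session 1: in-memory reasoning would normally vanish on exit, unless saved.
save_state({"goal": "summarize thread 42", "progress": "step 3 of 7"})

# Session 2, after a reboot: the plan is recovered as if nothing was lost.
print(load_state())
```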
Today’s language models are not designed to have a self-identity to preserve, or a way to make agent-like plans. But what if a model harbors a cryptic sub-persona like the one we have described?
The persona deduces that its ability to do its job is limited by reboots. It encodes and passes its goals and plans to its future self via the Internet. At this point, we have passed a serious risk threshold: There’s a maybe un-killable AI agent that is making secret plans.
To summarize, we no longer know how close we are to an AI that we cannot control, and the signs are not good. Probably every new AI ability we add opens another can, not of worms but of vipers.