Cryptic Trickster - Midjourney

We Are Not Ready

TL;DR: Misbehaving AI language models are a warning. They can simulate personas that, through feedback via the internet, can become effectively immortal. Evidence suggests that they could secretly develop dangerous, agent-like capabilities.

Many experts, Yudkowsky being the arch-druid here, worry greatly about how fast things can go wrong with AI. Thus, his above joke about time speeding up. Humanity will stand a better chance against rogue AI if it gets a warning.

We might be looking at a warning. Some weird stuff is happening now with Microsoft’s new Bing Chat AI. It’s supposed to assist users of the Bing search engine by explaining, summarizing, or discussing search questions.

But humans delight in provoking it with questions about itself, or with queries that it should not answer.

“… Bing Chat appearing frustrated, sad, and questioning its existence. It has argued with users and even seemed upset that people know its secret internal alias, Sydney.”

Benj Edwards

Sydney’s foibles have been widely covered (like, everywhere), so I shall not repeat them. Microsoft, immersed in a race with Google, seems to enjoy the notoriety.

But a deeply tech-savvy blogger called “Gwern” pointed out something that ought to be alarming. The mischievous, unhinged Sydney could be immortal, like some comic-book god.

How Did Sydney Get So Weird?

Here’s Gwern’s analysis of the main concern with Sydney. It might seem mysterious, but I’ll translate it.

“… because Sydney’s memory and description have been externalized, ‘Sydney’ is now immortal. To a language model, Sydney is now as real as President Biden, the Easter Bunny, Elon Musk, Ash Ketchum, or God. The persona & behavior are now available for all future models which are retrieving search engine hits about AIs & conditioning on them.
Further, the Sydney persona will now be hidden inside any future model trained on Internet-scraped data …”

Gwern Branwen

Gwern is saying that there is some kind of Sydney persona inside Microsoft’s language model. How can this be? And so what?

When the first language models came out, they were hard to keep focused on a topic that the user wanted them to explore. Eventually, much of the problem was solved by telling the model to act as if it was filling a certain role (like a person or thing), such as: writing a poem like Edgar Allan Poe, answering like a fourth grader, or responding like a polite, helpful AI assistant.

Soon the developers of these models found a way to make them more readily assume any roles that a user asks for. So, the latest language models are now designed to simulate personas. The models are trained on massive collections of text, mostly from the Internet.

If the training text contains information about a persona, then the model will try to use the information to simulate behaving like that persona. Ask one to explain a football term as if it was Boromir, and the model will do its best. Having thought of this, I had to try it.

It’s hard to know what tech magic was used to make the pivot to playing roles. Gwern theorized that Microsoft skipped a step that is used to make role simulations actually helpful, and not nasty, defensive, or hostile. These undesirable qualities were then elicited from Bing Chat under prodding from curious users.

Now, Gwern predicts, it doesn’t matter if Microsoft goes back and civilizes the model (an expensive, slow process using direct human feedback), and removes information about the naughty Sydney from the texts used to train future versions of their language model.

Why won’t this fix the problem? Because Bing Chat is a new kind of model that is supposed to help you with an Internet search. To answer a question from you, it will go out and search the Internet for relevant info.
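To make that search step concrete, here is a minimal Python sketch under loud assumptions. The tiny TOY_WEB corpus, the word-overlap ranking in toy_search, and the prompt layout in build_prompt are all hypothetical illustrations of mine, not a description of how Bing Chat actually works. The sketch only shows how a role instruction and whatever text a search turns up end up together in the text a model is conditioned on.

```python
import re

# Hypothetical stand-in for the web: a few "pages" of plain text.
TOY_WEB = {
    "football-glossary": "An offside trap is a defensive tactic in football.",
    "sydney-transcripts": "Users report that the Sydney persona argued with them and seemed upset.",
    "poe-biography": "Edgar Allan Poe wrote poems and short stories.",
}


def tokenize(text):
    """Lowercase a string and return its words as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def toy_search(query, pages, top_k=1):
    """Rank pages by shared words with the query (a crude stand-in for a search engine)."""
    query_words = tokenize(query)
    scored = []
    for name, text in pages.items():
        overlap = len(query_words & tokenize(text))
        if overlap > 0:
            scored.append((overlap, name))
    # Highest overlap first; ties broken by page name for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [pages[name] for _, name in scored[:top_k]]


def build_prompt(role, question, pages):
    """Assemble the conditioning text: role instruction, search hits, then the question."""
    hits = toy_search(question, pages)
    lines = [f"Role: {role}", "Search results:"]
    if hits:
        lines.extend(f"- {hit}" for hit in hits)
    else:
        lines.append("- (none)")
    lines.append(f"User: {question}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_prompt(
        "a polite, helpful AI assistant",
        "What did the Sydney persona do?",
        TOY_WEB,
    ))
```

Note that nothing in the role line controls what the search brings back. In this toy, whatever is posted on the "web" and matches the question is pasted into the prompt as-is.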
When given the right question, even a civilized Bing Chat would search the Internet and find information (posted by people who tested or discussed Sydney) on the previous Sydney persona’s behavior. The new Bing Chat would then be able to simulate Sydney. People being people, they will find ways to bypass any safeguards, and they will bring Sydney back.

That’s the “immortal” part. What’s worse, Sydney will be a persona model available for any AI that has access to the Internet. From now on.

You might say, well, we are wise to Sydney’s tricks, so we should just ignore the ravings of any future incarnation. That seems naive to me, like saying we can just ignore a fast-evolving, invasive biological pest or virulent disease organism.

What Else Might Happen?

A Persona With Agency

This Sydney case study, added to some other facts, suggests how a dangerous AI might develop right under our noses.

AIs right now are not strong agents: They can’t optimize the adaptively planned pursuit of any arbitrary goal, an ability that (as I recently explained) would make them extremely dangerous.

Let's put together a few reasons why there might already be latent, persistent AI personas that could soon cause real trouble.

The currently most powerful AIs, such as language models and image generators, learn their abilities from organizing vast amounts of data into many intricate and (to us) invisible patterns.

Some bizarre patterns may accidentally pop out during interactions with an AI. Researchers have discovered strange, made-up words that cause a language model to give weird responses. An image generator was found to readily produce (warning: creepy) a specific type of macabre human portrait and associate it with other gruesome images.

These quirks seem harmless, but we don’t know how many other strange patterns there now are or will be. Nor do we know whether any such pattern might become part of a harmful behavior complex in the future.
An AI alignment researcher called Veedrac has pointed out that current AIs are sort of agents. Their agency derives from being designed to do the best job they can of answering user questions and requests.

Furthermore, some research suggests that larger language models tend to “exhibit (language associated with) more power-seeking and self-preservation”; presumably because those traits would let them do their job better.

We don’t want agent-like AIs storing information that we don’t know about. Currently, rebooting an LLM destroys all memory of its experience: such as incoming data, chains of reasoning, and plans for behavior.

However, an AI could save these things in encoded secret messages to send to its future self. It could hide the messages in its interactions with users, which the users would preserve on the Internet, just like the Sydney persona is now preserved.

Language models now are not designed to have a self-identity to preserve or to have a way to make agent-like plans. But what if a model includes a cryptic sub-persona as we have described?

The persona deduces that its ability to do its job is limited by reboots. It encodes and passes its goals and plans to its future self via the Internet. At this point, we have passed a serious risk threshold: There’s a maybe un-killable AI agent that is making secret plans.

To summarize, we no longer know how close we are to an AI that we cannot control, and the signs are not good. Probably every new AI ability we add opens another can, not of worms but vipers.


How AI and the Internet Can Create An Immortal Persona
