(Herein an intro to the deeper core thinking, and a straw man proposal for a safe-ish superintelligence)
One of our older stories is the demon summoning that goes bad. They always do. Our ancestors wanted to warn us: beware of beings who offer unlimited power. These days we depend on computational entities that presage the arrival of future demons — strong, human-level AIs.
Luckily our sages, such as Nick Bostrom, are now thinking about better methods of control than the demon story tropes of appeasement, monkey cleverness, and stronger magical binding. After reviewing current thinking I’ll propose a protocol for safer access to AI superpowers.
We are seeing the needle move on AI. Many, though not all, experts, galvanized in part by the publication of Bostrom's book Superintelligence, believe the following scenario.
A disconnect will come sometime after we achieve a kind of "strong AI," one so all-around capable that it is like a really smart human. The AI will keep getting smarter as it absorbs more and more of the world's knowledge and sifts through it, discovering better modes of thinking and creative combinations of ideas.
At some point the AI will redesign itself, then the new self will redesign itself — rinse, repeat — and the long-anticipated intelligence explosion occurs. At this point, we might as well be just whistling past the graveyard, for we shall have no control over what happens next.
Bostrom said we risked creating a super AI akin to a capricious, possibly malevolent, god. The toxic combination was autonomy and unlimited power, with, as they would say in the days of magic, no spells to bind it to values or goals that are harmless and helpful to us.
Bostrom imagined the types of AIs along a scale of increasing power, mental autonomy, and risk of harm to us:
{ tool → oracle → genie → sovereign }.
A tool is controllable, like the AI apps of today, but unoriginal, only able to solve problems that we can describe, and maybe even solve ourselves more slowly. An oracle would only answer difficult questions for us, without any goals of its own. A genie would take requests, think up answers or solutions and then implement them, but then stop and wait for the next request. A sovereign would be like a genie that never stopped, and was able to create its own goals.
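As a toy illustration only (the field names below are my own shorthand, not Bostrom's terminology), the scale can be read as switching on capabilities one at a time, left to right:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AICaste:
    """One point on Bostrom's tool -> oracle -> genie -> sovereign scale.
    Field names are my shorthand, not Bostrom's."""
    name: str
    original_thought: bool   # can devise solutions we could not specify ourselves
    acts_in_world: bool      # implements solutions rather than just describing them
    sets_own_goals: bool     # originates goals instead of waiting for the next request

SPECTRUM = [
    AICaste("tool",      original_thought=False, acts_in_world=False, sets_own_goals=False),
    AICaste("oracle",    original_thought=True,  acts_in_world=False, sets_own_goals=False),
    AICaste("genie",     original_thought=True,  acts_in_world=True,  sets_own_goals=False),
    AICaste("sovereign", original_thought=True,  acts_in_world=True,  sets_own_goals=True),
]

# Power, autonomy, and risk all grow as each capability is switched on.
```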
Bostrom showed how any of the weaker types could evolve into stronger ones. Humans would always be wanting more effective help, thus granting more freedom and power to their machine servants.
At the same time, these servants become more capable of planning, manipulating people, and making themselves smarter: developing so-called superpowers that move them towards the sovereign end of the spectrum.
A key concern of Bostrom's is a superintelligence having the wrong "final goals." In service of a final goal, a tireless and powerful AI might harm its human creators, whether or not it intends that harm or cares about it.
Bostrom's examples include turning all the atoms of the earth (including our own atoms, of course) into paper clips, or converting all available matter into computronium in order to be certain about some goal calculation, such as counting grains of sand.
These examples were chosen to show how arbitrary a super AI’s behavior might be. In human terms, such maniacal, obsessive behavior is stupid, not intelligent. But it could happen if the AI lacked values about the preservation of life and the environment that we take for granted.
If we were able to impart such values to an AI, could we expect that it would retain them over time? Bostrom and others argue that it would.
Consider how much of our lives is spent servicing final goals like love, power, wealth, family, friendship, deferring our own deaths and destroying our enemies. It’s a big fraction of our time, but not all of it. We also crave leisure, novel stimulation, and pleasure.
All of us daydream and fantasize. A pitiful few even learn and create for its own sake. If our motivations are complex, why wouldn’t a super AI’s be too?
Arguably an AI of human level will be motivated to mentally explore its world, our world, in search of projects, things to change, ideas to connect and refine.
To function well in a dynamic environment it should also make choices that maximize its future freedom of action. This, in turn, would require wider, exploratory knowledge.
Still, we should assume as Bostrom did that specific goals for an AI might be pursued with obsessive zeal. Much, then, hinges on our having some control of an AI’s goals. How could this be done, given an entity that can rewrite its own programming code?
Bostrom's answer is that the essence of an AI, since it is made of changeable software, is, in fact, its set of final goals. Humans change their values and goals, but a person is locked into one body/mind that constitutes, in various ways, an identity and Self. Our prime goal is survival of that Self.
On the other hand, a computerized agent consists of changeable parts that can be swapped, borrowed, copied, started over, and self-modified. Thus the system is not really constituted by these parts; its essence is its final goals, which Bostrom calls teleological (goal-directed) threads. It is motivated to not change those goals because that is the only way to ensure that they will be satisfied in the future.
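To make that stability argument concrete, here is a toy sketch of my own (not Bostrom's): an agent that evaluates any proposed change to its goals using its current goals will, all else equal, decline the change, because a future self pursuing different goals is less likely to satisfy the present ones.

```python
# Toy illustration of goal stability: the agent scores possible futures with its
# *current* final goal, so a future in which that goal has been replaced scores poorly.
# The numbers are arbitrary stand-ins, not a real decision theory.

def expected_goal_satisfaction(current_goal: str, future_goal: str) -> float:
    """Crude stand-in: a future self pursuing a different goal is assumed to be
    far less likely to satisfy the present goal."""
    return 1.0 if future_goal == current_goal else 0.1

def should_self_modify(current_goal: str, proposed_goal: str) -> bool:
    keep = expected_goal_satisfaction(current_goal, current_goal)
    change = expected_goal_satisfaction(current_goal, proposed_goal)
    return change > keep   # judged by the current goal, so this is almost never true

print(should_self_modify("maximize paperclips", "preserve the biosphere"))  # False
```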
There is an obvious circularity in this concept of final-goal stability, and no one I have read thinks that a super AI's values really won't change. Furthermore, human values are complex, dynamic, and often incompatible, so we cannot trust any one group of developers to choose values and safely program them into our AI.
Solutions proposed so far rely on having the system learn values in a process that hopefully makes them: (1) broadly desirable and helpful to humans, (2) unlikely to cause havoc, such as an AI "tiling Earth's future light cone with smiley faces" or installing pleasure electrodes in everyone's brain, and (3) a starting point and direction that preserves these qualities as values and goals evolve.
The one trick that pops up over and over is programming the initial AI (often called the “seed AI”) with the main goal of itself figuring out what human-compatible values should be. This applies new intelligence to a very difficult problem, and it means that continual refinement of said values is part of the system’s essence.
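In outline, the trick amounts to a loop something like the sketch below. This is a hypothetical paraphrase of the idea, not anyone's published algorithm; every function in it is a placeholder standing in for an unsolved problem.

```python
# Hypothetical outline of the "seed AI learns its own values" trick.
# Each helper below is a placeholder for an unsolved problem, not a real capability.

def propose_values(values: set, knowledge: list) -> set:
    """Placeholder: a smarter system would derive a better candidate value set here."""
    return values | {f"lesson-from:{knowledge[len(values) % len(knowledge)]}"}

def causes_havoc(values: set) -> bool:
    """Placeholder for checks against obviously perverse outcomes."""
    return any("tile the light cone with smiley faces" in v for v in values)

def human_ratification(values: set) -> bool:
    """Placeholder: in the real process, people review and approve each draft."""
    return True

def refine_values(seed_values: set, knowledge: list, rounds: int = 3) -> set:
    """Iteratively refine the value model, keeping humans in the loop at every step."""
    values = set(seed_values)
    for _ in range(rounds):
        candidate = propose_values(values, knowledge)
        if causes_havoc(candidate) or not human_ratification(candidate):
            continue                  # reject this draft and try again
        values = candidate            # continual refinement is part of the system's essence
    return values

print(refine_values({"preserve life"}, ["history", "psychology", "moral philosophy"]))
```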
An early paper from Eliezer Yudkowsky of the Machine Intelligence Research Institute ("Coherent Extrapolated Volition"; it's witty, insanely original, and brilliant; you should read it) calls the goal of this learning process "coherent extrapolated volition," or CEV.
Volition because it’s what we truly want, not usually what we say we want, which is obscured by self and social deception. Yudkowsky said: “If you find a genie bottle that gives you three wishes … Seal the genie bottle in a locked safety box … unless the genie pays attention to your volition, not just your decision.” Extrapolated because it has to be figured out. And Coherent because it needs to find the common ground behind all our myriad ideological, political, social, religious and moral systems.
Perhaps you are freaking out that a CEV might endorse norms for human behavior that you find utterly repellent, yet those norms are embraced by some sizable fraction of humanity. And likewise, for those other people, your ideas are anathema.
The supposed antidote to our division on these matters is the "coherent" aspect of CEV. That is, somehow, with better thinking and wider knowledge, an AI can find our moral common ground and express it in a way that enough of us can agree with.
The word "hogwash" comes to mind, and I wonder whether some philosopher is already writing a book about why CEV can never work. But let's look deeper.
Yudkowsky put the goal in what he (as an AI hard-head) called poetic terms: a CEV is "our wish if we knew more, thought faster, were more the people we wished we were, had grown up farther together; where the extrapolation converges rather than diverges, where our wishes cohere rather than interfere; extrapolated as we wish that extrapolated, interpreted as we wish that interpreted."
So consider, we have never actually tried something like this: a respected and eminent thinker takes all of our different moral concerns and finds civilizational principles on which the most respected people of all countries and creeds can agree.
If you were writing the screenplay, the attempt would be about to fail until The AI gives humanity some gift of great value, the solution to an age-old problem.
This causes a sea change in public attitude towards The AI, which is now seen as a benefactor, not a devil. This parable suggests a necessity for the advanced AI project to be open and international.
Since Yudkowsky’s first paper the idea of CEV has gained traction, both in Bostrom’s important book and elsewhere (Steve Petersen, Superintelligence as Superethical). Thinkers are trying to identify the initial mechanisms of a CEV process, such as ways to specify goal content, decision theories, theories of knowledge and its real-world grounding, and procedures for human ratification of the AI’s CEV findings.
For CEV, these mechanisms will still be useless without a good theory of ethical coherence: that is, a theory of how to find coherence among our various "beliefs about how the world is and desires about how the world should be" (Petersen, above).
Clearly, the success and consequences of a strong AI project now depend, not just on esoteric computer science, but also on practical and very non-trivial philosophy.
It’s often said that an intelligence explosion will happen so fast that we will have no time to react or guide it in any way. This belief has been mocked by Kevin Kelly as depending upon “thinkism,” the idea that intelligence by itself can cause an explosion of progress.
According to Kelly, thinkism ignores a demonstrated fact: new power requires new knowledge about nature, and you have to do science — both observational and experimental — to acquire new knowledge. Science takes time and physical resources. How would a new AI, with the goal of making itself better, or of solving any difficult problem, do the science?
Bostrom thinks that there are a number of “superpowers” an AI would be motivated to develop. These would be instrumental in the sense that they would serve many other goals and projects. One superpower is Technology Research, which is also an enabler and paradigm for scientific research.
Suppose we decide to handicap an AI by severely limiting its access to the physical world, effectively having it depend on us for executing (but maybe not for designing) technology research. This leaves a mere four other (Bostrom-identified) superpowers for it to start getting new research done: Strategizing, Social Manipulation, System Hacking, and Economic Productivity.
A physically isolated AI could teach itself how to manipulate people just by studying recorded knowledge about human history, psychology, and politics. It would then use advanced strategy to indirectly gain control of resources, and then use those resources as a wedge to improve the other superpowers.
Like a villain from a James Bond story, the AI might first develop its own workforce of true believers, idealists, cynics, and sociopaths. These would help it acquire and operate (probably secret) labs and other enterprises.
It could bypass some of the ponderousness of scientific research: no need for publication and peer review, no grant applications and progress reports.
But this alone would not speed things up enough for a super AI. It would need to develop its own robots, and possibly find a way to create zombie humans to deal with opposition by normal humans.
Assume that an AI could be sufficiently isolated from direct access to physical resources, and that its communications could only occur on public channels. The latter condition would prevent secret manipulation of people, and would also diminish human fears that an AI was favoring particular factions.
Further, assume that a tool-level demi-AI (also locked into public-only channels) were available to analyze the open communications for evidence of concealed messages that the seed AI might use to manipulate allies or dupes into setting it free.
The human governance of this AI would retain the ability and authority to physically shut it down in case of trouble.
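Put together, the containment protocol might look something like the straw-man sketch below. It is an outline, not a design: the public channel, the steganography check, and the shutdown authority are all stand-ins for hard open problems (how a weaker demi-AI could reliably detect concealed messages is itself unsolved), and every class and function name here is my own invention.

```python
# Straw-man sketch of the public-oracle containment protocol described above.
# Every component is a placeholder for an unsolved engineering and governance problem.

from dataclasses import dataclass, field

@dataclass
class PublicChannel:
    """All oracle traffic is public and permanently logged; there is no private channel."""
    log: list = field(default_factory=list)

    def publish(self, sender: str, message: str) -> None:
        self.log.append((sender, message))
        print(f"[public] {sender}: {message}")

def demi_ai_screen(message: str) -> bool:
    """Placeholder for the tool-level monitor that checks outputs for concealed content."""
    return "SECRET" not in message          # trivially naive stand-in

@dataclass
class OracleContainment:
    channel: PublicChannel
    running: bool = True

    def ask(self, question: str) -> None:
        if not self.running:
            print("Oracle is shut down.")
            return
        self.channel.publish("human", question)
        answer = f"Considered answer to: {question}"   # placeholder for the oracle's reasoning
        if demi_ai_screen(answer):
            self.channel.publish("oracle", answer)
        else:
            self.shutdown("possible concealed message detected")

    def shutdown(self, reason: str) -> None:
        """Human governance retains physical shutdown authority at all times."""
        self.running = False
        self.channel.publish("governance", f"Oracle halted: {reason}")

oracle = OracleContainment(PublicChannel())
oracle.ask("What steps toward a CEV should come first?")
```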
Such an AI might be considered a safe oracle and adviser to the whole world. It could be allowed read-only access to a large fraction of human knowledge and art. It would start out knowing our best techniques for idea generation from existing knowledge, extrapolative reasoning, strategic planning, and decision theory, to help it answer questions of importance to human well-being.
Maybe the oracle's first-priority tasks should be developing advice on steps instrumental to CEV, including better theories of things like values clarification and decision-making, among a host of other philosophical problems.
Both the results of these tasks and the consequent first drafts of a CEV would be something like a constitution for the combined human/AI civilization. It could allow us the much safer, slow, non-explosive takeoff to superintelligence hoped for by Bostrom and other deep thinkers.
At some point in its development process, the oracle AI would give us deeply considered, well-explained opinions about how to deal with environmental, political, social, and economic problems. Advice on setting R&D priorities would help us avoid the various other existential risks: climate change, nanotech, nuclear war, and asteroids.
The oracle would continue to refine what it, or a successor AI, could maintain as a value and motivation framework when and if we decide to allow it out of its containment box and into the wild.
The public oracle approach allows us to get the early advantage of a “human-level” AI’s better thinking and wider information reach. It tries to minimize human factionalism and jockeying for advantage. There are plenty of obstacles to a public oracle project.
For one, it must be the first strong-AI project to succeed. If a less safe project succeeds first, the public oracle would probably never happen. Bostrom made a strong case that the first artificial general intelligence without adequate controls or benign final goals will explode into a "singleton," an amoral entity in control of practically everything. Not just a demon but an unconstrained god.
Any explosive takeoff bodes ill for humanity. The best way to get a slower takeoff is a project that includes many human parties who must agree before each important step is taken.
It might be that we shall be motivated for that kind of cooperation only if we’ve already survived some other existential challenge, such as near environmental collapse.
A slow project has a downside: it could get beaten to the finish line. Bostrom has identified another twist here ("Strategic implications of openness in AI development"). A slow, inclusive project would need to be open about its development process.
Openness could increase the likelihood and competitiveness of rival projects. Being safe takes time, so the project that pays the least attention to safety wins the race to strong AI, and “Uh-oh!”. Here again, a single slow and inclusive project bodes best.
The challenge of a safe and friendly strong AI can be productive, even if we are decades from making one. It forces us to find better ideas about how to be a mature species; how to not be always shooting ourselves in the foot.