Senior Data Scientist
There were many advances in Deep Learning and AI in 2017, but few generated as much publicity and interest as DeepMind’s AlphaGo Zero. This program was truly a shocking breakthrough: not only did it beat the prior version of AlphaGo — the program that beat 17 time world champion Lee Sedol just a year and a half earlier — 100–0, it was trained without any data from real human games. Xavier Amatrain called it “more [significant] than anything…in the last 5 years” in Machine Learning.
So how did DeepMind do it? In this essay, I’ll try to give an intuitive idea of the techniques AlphaGo Zero used, what made them work, and what the implications for future AI research are. Let’s start with the general approach that both AlphaGo and AlphaGo Zero took to playing Go.
Both AlphaGo and AlphaGo Zero evaluated the Go board and chose moves using a combination of two methods:
AlphaGo and AlphaGo Zero both worked by cleverly combining these two methods. Let’s look at each one in turn:
Go is a sufficiently complex game that computers can’t simply search all possible moves using a brute force approach to find the best one (indeed, they can’t even come close).
The best Go programs prior to AlphaGo overcame this by using “Monte Carlo Tree Search” or MCTS. At a high level, this method involves initially exploring many possible moves on the board, and then focusing this exploration over time as certain moves are found to be more likely to lead to wins than others.
Both AlphaGo and AlphaGo Zero use a relatively straightforward version of MCTS for their “lookahead”, simply using many of the best practices listed in the Monte Carlo Tree Search Wikipedia page to properly manage the tradeoff between exploring new sequences of move or more deeply explore already-explored sequences (for more, see the details in the “Search” section under “Methods” in the original AlphaGo Paper published in Nature).
Though, MCTS had been the core of all successful Go programs prior to AlphaGo, it was DeepMind’s clever combination of this technique with a neural network-based “intuition” that allowed it to surpass human performance.
DeepMind’s major innovation with AlphaGo was to use deep neural networks to understand the state of the game, and then use this understanding to intelligently guide the search of the MCTS. More specifically: they trained networks that could look at
Given this information, the neural networks could recommend:
How did DeepMind train neural networks to do this? Here, AlphaGo and AlphaGo Zero used very different approaches; we’ll start first with AlphaGo’s:
AlphaGo had two separately trained neural networks.
DeepMind then combined these two neural networks with MCTS — that is, the program’s “intuition” with its brute force “lookahead” search— in a very clever way: it used the network that had been trained to predict moves to guide which branches of the game tree to search and used the network that had been trained to predict whether a position was “winning” to evaluate the positions it encountered during its search. This allowed AlphaGo to intelligently search upcoming moves and ultimately allowed it to beat Lee Sedol.
AlphaGo Zero, however, took this to a whole new level.
At a high level, AlphaGo Zero works the same way as AlphaGo: specifically, it plays Go by using MCTS-based lookahead search, intelligently guided by a neural network.
However, AlphaGo Zero’s neural network — its “intuition” — was trained completely differently from that of AlphaGo:
Let’s say you have a neural network that is attempting to “understand” the game of Go: that is, for every board position, it is using a deep neural network to generate evaluations of what the best moves are. What DeepMind realized is that no matter how intelligent this neural network is — whether it is completely clueless or a Go master — its evaluations can always be made better by MCTS.
Fundamentally, MCTS performs the kind of lookahead search that we would imagine a human master would perform if given enough time: it intelligently guesses which variations— sequences of future moves — are most promising, simulates those variations, evaluates how good they actually are, and updates its assessments of its current best moves accordingly.
An illustration of this is below. Suppose we have a neural network that is reading the board and determining that a given move results in a game being even, with an evaluation of 0.0. Then, the network intelligently looks ahead a few moves and finds a sequence of moves that can be forced from the current position that ends up resulting in an evaluation of 0.5. It can then update its evaluation of the current board position to reflect that it leads to a more favorable position down the road.
This lookahead search, therefore, can always give us improved data on how good the various moves in the current position that the neural network is evaluating are. This is true whether our neural network is playing at an amateur level or an expert level: we can always generate improve evaluations for it by looking ahead and seeing which of its current options actually lead to better positions.
In addition, just as in AlphaGo, we would also want our neural network to learn which moves are likely to lead to wins. So, also as before, our agent—using its MCTS-improved evaluations and the current state of its neural network — could play games against itself, winning some and losing others.
This data, generated purely via lookahead and self-play, is what DeepMind used to train AlphaGo Zero. More specifically:
Much was made of the fact that no games between humans were used to train AlphaGo Zero, and this first “trick” was the reason why: for a given state of a Go agent, it can always be made smarter by performing MCTS-based lookahead and using the results of that lookahead to improve the agent. This is how AlphaGo Zero was able to continuously improve, from when it was an amateur all the way up to when it better than the best human players.
The second trick was a novel neural network structure that I’ll call the “Two Headed Monster”.
AlphaGo Zero’s was its neural network architecture, a “two-headed” architecture. Its first 20 layers or so were layer “blocks” of a type often seen in modern neural net architecures. These layers were followed by two “heads”: one head that took the output of the first 20 layers and produced probabilities of the Go agent making certain moves, and another that took the output of the first 20 layers and outputted a probability of the current player winning.
This is quite unusual. In almost all applications, neural networks output a single, fixed output — such as the probability of an image containing a dog, or a vector containing the probabilities of an image containing one of 10 types of objects. How can a net learn if it is receiving two sets of signals: one on how good its evaluations of the board are, and another how good the specific moves it is selecting are?
The answer is simple: remember that neural networks are fundamentally just mathematical functions with a bunch of parameters that determine the predictions that they make; we “teach” them by repeatedly showing them “correct answers” and having them update their parameters so the answers they produce more closely match these correct answers.
So, when we use the two headed neural net to make a prediction using Head #1, we simply update the parameters that led to making that prediction, namely the parameters in the “Body” and in “Head #1”. Similarly, when we make a prediction using Head #2, we update the parameters in the “Body” and in “Head #2”.
This is how DeepMind trained its single, “two-headed” neural network that it used to guide MCTS during its search, just as AlphaGo did with two separate neural networks. This trick accounted for half of AlphaGo Zero’s increase in playing strength over AlphaGo.
(this trick is known more technically as Multi-Task Learning with Hard Parameter Sharing. Sebastian Ruder has a great overview here).
The other half of the increase in playing strength simply came from bringing the neural network architecture up-to-date with the latest advances in the field:
AlphaGo Zero used a more “cutting edge” neural network architecture than AlphaGo. Specifically, they used a “residual” neural network architecture instead of a purely “convolutional” architecture. Residual nets were pioneered by Microsoft Research in late 2015, right around the time work on the first version of AlphaGo would have wrapped up, so it both understandable that DeepMind did not use them in the original AlphaGo program.
Interestingly, as the chart below shows, each of these two neural network-related tricks — switching from convolutional to residual architecture and using the “Two Headed Monster” neural network architecture instead of separate neural networks — would have resulted in about half of the increase in playing strength as was achieved when both were combined.
These three tricks are what enabled AlphaGo Zero to achieve its incredible performance that blew away even Alpha Go:
It is worth noting that AlphaGo did not use any classical or even “cutting edge” reinforcement learning concepts — no Deep Q Learning, Asynchronous Actor-Critic Agents, or anything else we typically associate with reinforcement learning. It simply used simulations to generate training data for its neural nets to then learn from in a supervised fashion. Denny Britz sums this idea up well in this Tweet from just after when the AlphaGo Zero paper was released:
Here’s a “step-by-step” timeline of how AlphaGo Zero was trained:
3. As these self-play games are happening, sample 2,048 positions from the most recent 500,000 games, along with whether the game was won or lost. For each move, record both A) the results of the MCTS evaluations of those positions — how “good” the various moves in these positions were based on lookahead — and B) whether the current player won or lost the game.
4. Train the neural network, using both A) the move evaluations produced by the MCTS lookahead search and B) whether the current player won or lost.
5. Finally, every 1,000 iterations of steps 3–4, evaluate the current neural network against the previous best version; if it wins at least 55% of the games, begin using it to generate self-play games instead of the prior version.
Repeat steps 3–4 700,000 times, while the self-play games are continuously being played — after three days, you’ll have yourself an AlphaGo Zero!
There are many implications of DeepMind’s incredible achievement for the future of AI research. Here are a couple of key ones:
First, the fact that self-play data generated from simulations was “good enough” to be able to train the network suggests that simulated self-play data can train agents to surpass human performance in extremely complex tasks, even starting completely from scratch — data generated from human experts may not be needed.
Second, the “Two Headed Monster” trick seems to significantly help agents learn to perform several related tasks in many domains, since it seems to prevent the agents from overfitting their behavior to any individual task. DeepMind seems to really like this trick, and has used it and more advanced versions of it to build agents that can learn multiple tasks in several different domains.
Many projects in robotics, especially the burgeoning field of using simulations to teach robotic agents to use their limbs to accomplish tasks, are using these two tricks to great effect. Pieter Abbeel’s recent NIPS keynote highlights many impressive new results that use these tricks along with many bleeding edge reinforcement learning techniques. Indeed, locomotion seems like a perfect use case for the “Two Headed Monster” trick in particular: for example, robotic agents could be simultaneously trained to hit a baseball using a bat and to throw a punch to hit a moving target, since the two tasks require learning some common skills (e.g. balance, torso rotation).
DeepMind’s AlphaGo Zero was one of the most intriguing advancements in AI and Deep Learning in 2017. I can’t wait to see what 2018 brings!