David Sweet



If you have an interest in AI, you’ve likely read about AlphaZero [ Silver, et. al. ]. This algorithm learned to play Go, chess, and shogi at super-human levels “tabula rasa”. In other words, AlphaZero’s game-playing agent did not look at examples of well-played games or good moves. Nor did it know heuristics that could help it evaluate a position (ex., “a rook is worth more than a pawn”). It simply played games against itself millions of times.

This is mysterious to me. If it only played against itself, where did new information come from? How did it know if it was doing well? If I play a game of chess against myself, should I say I did well if I beat myself? But when I beat myself, I also lose to myself. And how could I ever know if I’d do well against someone else?

Self-play isn’t completely understood, as far as I can tell. There are no guarantees about performance, for example, and no accepted recipe for it. But people have made it work many times.

Below I’ll discuss some famous examples of self-play, problems researchers run into when building agents that learn this way, and what seems to be needed to make it work.

Examples of Self-Play

Self-play has a long history in the training of game-play agents. Agents have been constructed to play checkers, backgammon, chess, go.

In 1959, Arthur Samuel [ Samuel ] wrote a program that learned to play checkers. The system learned from samples like “board configuration, game outcome”. Samples were collected from human play and self play. To decide its next move the system used a combination of methods to evaluate a board: (i) a limited-depth, pruned (alpha-beta [ Wikipedia ]) search tree, (ii) a lookup table (consisting of the raw samples), and (iii) an evaluation function comprised of manually engineered features. The parameters of (iii) were learned from the samples.

In 1992, Gerald Tesauro [ Tesauro ] created TD-Gammon, a neural network trained to play backgammon at a world-champion level. This NN learned to evaluate a backgammon board configuration entirely by self-play. Its search tree was more of a search stump — consisting of only one or two plies (a ply is a single player’s move) depending on the version. An important difference between backgammon and checkers (or chess or go, for that matter) is that backgammon uses dice which introduces randomness into its gameplay.

Deepmind’s AlphaZero [ Silver, et. al. ] took this line of research to an extreme by learning to play chess — which is fantastically more complex than checkers — and Go — which is fantastically more complex than chess. More precisely, checkers has 10¹⁴ board states, chess has 10⁵³ [ Kaspar ], and Go has 10¹⁷⁰ [ Tromp ]. (Actually, checkers is so “small” that it’s been solved exactly [ Hirshon ]!)

AlphaZero’s game-playing agent consists of (i) monte-carlo tree search [ Wikipedia ] (more efficient than alpha-beta pruning), and (ii) an evaluation function consisting of a deep neural network. AlphaZero learns entirely by self-play. No samples from human play. No manually-engineered features. A predecessor to AlphaZero (AlphaGo), in 2016 beat Lee Sedol a top-ranked professional go player [ DeepMind ].

OpenAI is currently working on a deep neural network agent — a five-agent team, actually — that plays Dota 2 [ OpenAI ]. The ultimate goal is to beat the world’s top professional team. The state space of Dota 2 is huge, continuous, and only partially observable. Dota 2’s game state consists of 20,000 dimensions. Compare this to chess’s 8x8-square board with 12 different pieces (768 bits) or Go’s 19x19-square board with 2 pieces (722 bits). In chess and Go the full board is always visible to both players. In Dota 2 only a small portion of the full game state is visible. Combine all of that with about 1000 actions — for each of the five players — and you have an extremely complex game. OpenAI is solving this game via self-play [ Rodriguez ].

Optimal Tic-tac-toe for player X

Learning by Building

To try to really understand self-play, I posed the following problem: Train a neural network to play tic-tac-toe perfectly via self-play, and do it with an evolution strategy. Tic-tac-toe is a small, simple game (9 x 9 board, 2 pieces) that can be solved quickly and exactly using the minimax* algorithm. This simplicity enabled me to run experiments quickly and on cheap hardware.

By “perfectly”, I mean that I wanted it to beat a minimax agent 100 times in a row (so, maybe “really, really good” would be a better description than “perfect”). Also, I wanted the agent to continue to be able to win 100 games as training progressed — to convince me that whatever it’s learning from self-play is reliably guiding it toward perfect play. To evaluate the training process I periodically paused training, played the NN agent vs. minimax (from Dwayne Crooks’s Python package xo) 100 times, and recorded the number of NN’s losses.

I chose to train the neural network (NN) with an evolution strategy (CMA-ES [ Hansen ], specifically) rather than backpropagation in part because I had recently read OpenAI’s Evolution Strategies as a Scalable Alternative to Reinforcement Learning [ Salimans, et. al. ] and Uber AI’s papers about deep neuroevolution [ Stanley & Clune ], in part because I have a computer with multiple CPUs and no GPU, and in part because it would force the algorithm to be different enough from the ones discussed above that I’d really have to delve into the idea of self-play to get anywhere.

* Minimax [ Wikipedia ] computes every possible game realization starting from a given board configuration, then determines which next move has the best expected outcome. Other methods discussed above (alpha-beta pruning, monte-carlo tree search) are approximations to minimax. They are used in checkers, chess, and go because those games have too many realizations to calculate in a realistic amount of time.

Self-Play Fitness

To optimize an NN’s parameters with CMA-ES, you need to specify a fitness function. It’s the value or quality you assign to a set of NN parameters. As an example, if you didn’t care about self-play, you could just play the NN against a minimax (perfect) player, say, 100 times and count the number of losses as the fitness (well, negative fitness). Then ask CMA-ES to find NN parameters that minimize the number of losses. I tried that. It works. It produces an NN player that always draws against minimax. To learn via “self-play”, however — without knowledge of the minimax solution or any other strategy hints — what, exactly should the NN play against to determine its fitness?

After much experimentation, I found that this worked: Construct an “omniscient agent” for the NN player to compete against. What makes the omniscient agent (OA) omniscient is that it embeds a copy of the previous generation’s “xfavorite” NN and knows what move it would make for any board configuration. Before making a move, the OA simulates games against xfavorite to see which move would work best against it.

By knowing xfavorite’s moves, OA can play better than xfavorite without actually knowing how to play tic-tac-toe or doing any tree search. It’s pretty simple.

As a test of the OA’s ability, I took it aside, embedded the minimax player in it (instead of the xfavorite NN) and had it play 100 games against the minimax player. The OA was able to draw the minimax player every time. Knowing that, we can say, “If NN converged to minimax play, OA would draw it”. In other words, the optimization won’t get stuck because OA isn’t good enough at tic-tac-toe. It could get stuck for other reasons, though.

Note that the OA isn’t useful for actual competitive play on its own because you could never embed an opponent in OA and call it a fair or meaningful game. But that doesn’t mean you can’t train an agent against the OA. That’s what I tried — and it worked. To track its progress, I paused training every 10 minutes, played the current xfavorite against xo’s minimax player 100 times, and recorded the number of losses:

(And yes, Run #4 was the best. You may see all the runs, though.) Notice how it gets to exactly zero losses and stays there as the optimization continues. Training against OA is forcing the agent towards perfect, minimax play.

What’s more enlightening, though, is all of the things that *didn’t* work.


Training an NN against itself — literal self-play — makes no sense in this context. What would the fitness function be? It made sense in AlphaZero because the algorithm recorded games and learned the quality of the moves via supervised learning. But if an NN plays itself, it simultaneously wins and loses. How would you score that?

Training an NN by playing against the previous generation’s best can result is the roshambo problem: The NN learns a strategy B to beat strategy A and becomes the “champion”. Then a new NN learns a strategy C that beats B. Then another learns that A beats C. Think of A, B, and C like rock, paper, and scissors, and it makes sense. Such strategy cycles were shown to be present in backgammon [ Pollack, et. al. ] and are discussed (afaict) by OpenAI under the term “strategy collapse” [ OpenAI ].

Using a “hall-of-fame” of several previous champion NNs might mitigate the roshamo problem. OpenAI uses that for OpenAI Five, but I didn’t have any luck with any of the several variations I tried on this problem.

Training against OA with a large numSims didn’t work. The OA was so good that the fitness function rarely had a gradient. It just lost, and lost, and lost. That’s not much information to go on. The best numSims was actually numSims=1 — a slight bias towards “better” with a lot of noise for exploration.


It seems like if you want to train an agent by self-play you’re going to need to create an algorithm that explores the state space (board configurations) well and maybe has some mechanism to encourage monotonic improvement.

AlphaGo’s agent used randomness in its moves to explore, and employed MCTS to always look for better moves. OpenAI Five plays a game with natural randomness and uses a hall of fame to keep from backtracking. TD-Gammon had dice rolls as a source of exploration. It’s not clear how TD-Gammon maintained improvement, but [ Pollack, et. al. ] studies a hypothesis that the dynamics of backgammon naturally lend themselves to this.


Self-play is an exciting idea because it holds the promise of relieving the engineer not only of having to specify the solution to a problem (as optimization of parameters does) but of even specifying the objective. It’s a higher level of autonomy than we usually consider when studying reinforcement learning.

While self-play doesn’t seem to be completely understood, it certainly works. This should be an interesting topic to watch for many years.

Thanks for reading along. If you’d like to read more about self-play, check out the references linked to throughout the post. If you’re interested in training a tic-tac-toe agent, the Python source is available as is a more detailed writeup of its workings.

Topics of interest

More Related Stories