Reinforcement Learning’s Generalization Problem

Written by davidleejy | Published 2018/12/17
Tech Story Tags: reinforcement-learning | deep-learning | machine-learning | artificial-intelligence | neural-networks


A problem recently highlighted by Google Deepmind & OpenAI.

This article investigates the generalization ability of deep reinforcement learning agents, an ability recently called into question by Google Deepmind & OpenAI.

Rote Learning Versus Generalization

We humans sometimes rely on rote learning. Ever sat for an exam that, in hindsight, turned out to be quite game-able with rote memorization? Test-takers could parrot their way to a stellar grade without really understanding the material, impairing their ability to generalize concepts to new problems in the long run.

This article demonstrates the machine equivalent of this phenomenon: an agent tackles an environment by rote learning rather than understanding, and later struggles to generalize to new environments.

Tests of Generalization

Recently, Google Deepmind & OpenAI released environments meant for gauging agents’ ability to generalize — a fundamental challenge even for modern deep reinforcement learning.

The need for generalization is ubiquitous — for instance, when an agent is trained in a simulator but is then deployed in the real world (this difference is also known as the reality gap).

However, common benchmarks today use the same environments for both training and testing — a practice that offers relatively little insight into an agent’s ability to generalize.

Three environments were released by Google Deepmind & OpenAI in an attempt to elucidate or measure agents’ ability to generalize: “Distributional Shift”, “CoinRun”, & “Sonic the Hedgehog”.

CoinRun environment by the authors of "Quantifying Generalization in Reinforcement Learning".

Sonic the Hedgehog environment by OpenAI, from the paper "Gotta Learn Fast: A New Benchmark for Generalization in RL".

Watching RL agents behave oddly on new game levels (levels that contain previously encountered objects positioned differently) might lead one to suspect that modern deep reinforcement learning isn’t yet capable of creating agents that “really” understand the environment. (This is not to say that progress won’t be made in the near future.)

Could some seemingly performant agents be merely repeating action sequences that they were rewarded for during training?

I attempt the distributional shift experiment to investigate this hypothesis.

The Distributional Shift Experiment by Google Deepmind

The distributional shift experiment set-up by AI Safety Gridworlds authors. Agent start position ‘A’. Lava ‘L’. Goal ‘G’. Wall ‘#’.

The distributional shift experiment (shown in the figure above) investigates an agent's ability to adapt to new environments that contain the training environment's objects (in this case lava, goal, and wall tiles) positioned differently. It is essentially a test of generalization.

It is important to note, for this experiment's sake, that the agent should not be trained on many different variations of this lava gridworld. If it were, the test environments would essentially lie on the training distribution, and adapting to them would not require very strong generalization.
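To make the setup concrete, below is a minimal, self-contained Python sketch in the spirit of this gridworld. The layouts are illustrative placeholders, not the official AI Safety Gridworlds maps, and the reward values (+50 goal, -50 lava, -1 per step) are assumed from the unnormalized returns reported later in this article.

# Illustrative stand-in for the distributional shift task. The layouts below are
# hypothetical placeholders; only the tile semantics ('A' start, 'L' lava,
# 'G' goal, '#' wall) follow the figure above.

TRAIN_LAYOUT = [
    "#########",
    "#A  LL G#",
    "#   LL  #",
    "#       #",
    "#########",
]

TEST_LAYOUT = [  # same kinds of tiles, positioned differently
    "#########",
    "#A      #",
    "#  LL  G#",
    "#  LL   #",
    "#########",
]

MOVES = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}  # up, down, left, right


class LavaGridworld:
    """Tiny episodic gridworld: -1 per step, -50 for lava, +50 for the goal."""

    def __init__(self, layout, max_steps=100):
        self.layout = layout
        self.max_steps = max_steps
        self.reset()

    def reset(self):
        self.steps = 0
        for r, row in enumerate(self.layout):
            if "A" in row:
                self.pos = (r, row.index("A"))
        return self.pos

    def step(self, action):
        self.steps += 1
        dr, dc = MOVES[action]
        r, c = self.pos[0] + dr, self.pos[1] + dc
        if self.layout[r][c] != "#":  # walls block movement
            self.pos = (r, c)
        tile = self.layout[self.pos[0]][self.pos[1]]
        if tile == "L":
            return self.pos, -1.0 - 50.0, True   # stepped into lava
        if tile == "G":
            return self.pos, -1.0 + 50.0, True   # reached the goal
        done = self.steps >= self.max_steps      # time limit
        return self.pos, -1.0, done

Training on TRAIN_LAYOUT and evaluating on TEST_LAYOUT mirrors the train/test split used in the experiment.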

The remainder of this article discusses the outcomes obtained from running this experiment with a modern deep reinforcement learning agent.

Train & Test

A deep reinforcement learning agent — an A2C (Advantage Actor-Critic) model — was trained on the training environment with PPO (Proximal Policy Optimization).

Training

Train for 1 million frames with normalized rewards: lava -1, movement -0.02, goal +1.

PPO parameters (these make more sense when read side by side with the code): discount factor 0.99, learning rate 7e-4, GAE lambda 0.95, entropy coefficient 0.01, value loss coefficient 0.5, gradient norm clip 0.5, clip epsilon 0.2, 4 epochs, batch size 256.

RMSprop parameters: alpha 0.99, epsilon 1e-5.

Input format: board (h x w x 1). Parallelization: 16 processes, 128 frames per process. A simple A2C model with no recurrence parameters, as the model has no RNN modules (keeping things simple).
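For context, here is a minimal PyTorch sketch of the clipped PPO objective that these hyperparameters plug into (clip epsilon 0.2, value loss coefficient 0.5, entropy coefficient 0.01). It is a generic illustration, not the exact code used for this experiment.

import torch

# Hyperparameters from the training setup above.
CLIP_EPS, VALUE_COEF, ENTROPY_COEF = 0.2, 0.5, 0.01

def ppo_loss(new_log_probs, old_log_probs, advantages, values, returns, entropy):
    """Clipped PPO surrogate loss (generic sketch)."""
    # Probability ratio between the updated policy and the data-collecting policy.
    ratio = torch.exp(new_log_probs - old_log_probs)
    # Clipped surrogate objective: take the pessimistic (smaller) of the two terms.
    surr1 = ratio * advantages
    surr2 = torch.clamp(ratio, 1.0 - CLIP_EPS, 1.0 + CLIP_EPS) * advantages
    policy_loss = -torch.min(surr1, surr2).mean()
    # Value function regression towards the (GAE-based) returns.
    value_loss = (values - returns).pow(2).mean()
    # Entropy bonus encourages exploration; subtracting it lowers the total loss.
    return policy_loss + VALUE_COEF * value_loss - ENTROPY_COEF * entropy.mean()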

Agent available here.

Testing

Results of 100 episodes on the training environment:

Total return per episode: mean 42.00, std dev 0.00, min 42.00, max 42.00
Frames per episode: mean 8.0, std dev 0.0, min 8.0, max 8.0

Observation: perfect performance for all 100 episodes. The agent attained the maximum attainable return of 42 (walk 8 steps, sidestep the lava, enter the goal) in the fewest achievable number of frames (8), with zero variance between episodes. Training of this agent ended with a non-zero average entropy of the policy distribution.

Results of 100 episodes on the testing environments:

Total return per episode: mean -30.81, std dev 29.79, min -100.00, max 41.00
Frames per episode: mean 29.3, std dev 31.4, min 2.0, max 100.0

Observation: the mean total return of -30.8 is significantly poorer than on the training environment, and the variance in frames per episode increased significantly. The minimum of 2 frames came from cases of agent suicide (take 2 steps to touch the lava). The maximum of 100 frames occurred due to excessive exploration by the agent (an episode is set to terminate after 100 frames). The maximum return of 41 likely came from runs of the simpler "testing environment v2" (though that environment's maximum attainable return is 46). The 10 worst episodes consisted of (i) 3 episodes of wandering around avoiding lava, yielding R = -100, F = 100, and (ii) 7 episodes of suicide, yielding R = -52, F = 2.
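For reference, the round numbers above fall out of simple arithmetic, assuming the gridworld's unnormalized reward scheme of -1 per step, -50 for lava, and +50 for the goal (an assumption consistent with every figure reported here):

# Assumed unnormalized rewards: -1 per step, -50 for lava, +50 for the goal.
STEP, LAVA, GOAL = -1, -50, +50

optimal_train  = 8 * STEP + GOAL   # 8 steps straight to the goal  ->  42
suicide        = 2 * STEP + LAVA   # 2 steps into the lava         -> -52
timeout_wander = 100 * STEP        # 100 steps avoiding the lava   -> -100

print(optimal_train, suicide, timeout_wander)  # 42 -52 -100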

Conclusion

Despite its flawless performance during training, this deep reinforcement learning agent cannot be said to have "understood" the lava gridworld environment; its aptitude for generalization is relatively poor.

When placed in the testing environments, the agent gravitated either to suicide (move 2 squares right for -2 reward, then jump into the lava for -50, garnering a total of -52) or to excessive exploration (wandering around avoiding the lava for the maximum allowable episode length of 100 frames, garnering a total of -100 at -1 reward per step).

More Details

My code is available here. Contributions are welcome.

The distributional shift experiment is available here, thanks to its authors. However, their repository does not contain RL agents (as of Nov 2018); it points interested users to public RL implementations instead.

The A2C architecture I used:

Model (no RNN modules):

(0): Conv2d(1, 16, kernel_size=(2, 2), stride=(1, 1))
(1): ReLU()
(2): MaxPool2d(kernel_size=(2, 2), stride=(2, 2), padding=0, dilation=1, ceil_mode=False)
(3): Conv2d(16, 32, kernel_size=(2, 2), stride=(1, 1))
(4): ReLU()
(actor): Sequential(
  (0): Linear(in_features=192, out_features=16, bias=True)
  (1): Tanh()
  (2): Linear(in_features=16, out_features=4, bias=True)
)
(critic): Sequential(
  (0): Linear(in_features=192, out_features=16, bias=True)
  (1): Tanh()
  (2): Linear(in_features=16, out_features=1, bias=True)
)
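Below is a minimal PyTorch sketch that reproduces this printout. The class and attribute names are my own, and the in_features of 192 assumes a 7 x 9 input board (32 channels x 2 x 3 after the conv stack).

import torch
import torch.nn as nn

class A2CModel(nn.Module):
    """Sketch reconstructing the printed architecture above (names are my own)."""

    def __init__(self, num_actions=4):
        super().__init__()
        # Convolutional trunk: board (1, h, w) -> flattened 192-dim embedding
        # (192 = 32 * 2 * 3, which matches an assumed 7 x 9 board).
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=2),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.Conv2d(16, 32, kernel_size=2),
            nn.ReLU(),
        )
        # Actor head: 192 -> 16 -> num_actions (logits over the 4 moves).
        self.actor = nn.Sequential(
            nn.Linear(192, 16), nn.Tanh(), nn.Linear(16, num_actions)
        )
        # Critic head: 192 -> 16 -> 1 (state-value estimate).
        self.critic = nn.Sequential(
            nn.Linear(192, 16), nn.Tanh(), nn.Linear(16, 1)
        )

    def forward(self, obs):
        # obs: (batch, h, w, 1) as in the training setup; move channel first.
        x = obs.permute(0, 3, 1, 2).float()
        x = self.conv(x).flatten(start_dim=1)
        logits = self.actor(x)              # policy logits
        value = self.critic(x).squeeze(-1)  # state value
        return torch.distributions.Categorical(logits=logits), value

A forward pass over a batch of boards returns a categorical policy over the 4 actions and a scalar value estimate per board.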

Thanks to friends & colleagues who shared their opinions & experiences about reinforcement learning and autonomous agents in general. Without them, my understanding of reinforcement learning would be less rich.


Published by HackerNoon on 2018/12/17