Using Optuna to Search for Tiny RL Policies by@jfpettit

Using Optuna to Search for Tiny RL Policies

Jacob Pettit HackerNoon profile picture

Jacob Pettit

ML researcher, blogging about reinforcement learning, machine learning, and AI.

All image credit to maintainers of the CMA-ES library on GitHub. Image source.

Link to code on GitHub

Recently, I needed to search for trivial policies in an environment. These policies were simple things, like always take the same action, or every K steps take the same action, etc. In order to optimize these trivial policies as much as possible, I turned to the Optuna framework. But as I dove into their documentation, I found that they’ve got a nice set of optimizers and I began to wonder: What if I searched for not-so-trivial policies?

Could I search for all of the parameters for one small neural network layer and get good performance? The answer, we’ll see, is yes… in some environments.

The Setup

I decided to keep it simple and experiment with two policies: one I’m calling a “dense Gaussian policy” and the other I’m calling a “dense policy”. I think these names are technically correct but they also can sound pretty opaque, so I’ll explain what I mean. Check out my code at this link if you’re interested.

The Policies

Both policies are similar to a 1-layer neural network: we compute the output by taking the dot-product of the input with an array of weights, W. In this case, the input is the observations (S) from the environment, and the output is the action we want to take, so the weight array (W) has the shape (observation_dimension, action_dimension).

In the “dense policy” the output of the dot product is the action, so we’re done! In the “dense gaussian policy”, the resulting vector of the dot product between S and W has shape equal to action_dimension and is used as the µ (mu, or mean) vector for a Gaussian distribution. We also store a variance (σ) vector in the layer and then we sample our actions from the resulting Gaussian distribution. Like this:

µ =, observations)
action = µ + np.random.randn(env.action_space.shape[0]) * σ

I’ll break down the two lines of code above: In line 1, we take the dot product of W and the input observations and get µ (the mean of our distribution). In line 2, we use NumPy’s randn function to sample from a unit Gaussian distribution (mean 0 and variance 1). By adding µ to the sample from randn, we shift the mean from 0 to µ. Then, when we multiply the sample from randn by σ, its variance gets scaled to match σ.

So now, what am I optimizing? I used Optuna to directly optimize each parameter in the weight array. This approach scales really poorly, since Optuna is designed to optimize hyperparameters and it suggests one number at a time. This means you have to use a for loop to fill your weight array one-by-one. But in environments with small vector observations, it’s fast enough. Just know that if you’re here looking for performant, well-written, scalable code, you’re in the wrong place! I cobbled all this together for fun and gave little thought to how good the code actually is. I decided to use CMA-ES1 as my optimization method.

The Algorithm: CMA-ES

CMA-ES stands for “Covariance Matrix Adaptation Evolution Strategy”. It’s a black-box numerical optimization method. In fact, all evolution strategies are black-box optimization methods. The phrase “black-box” refers to the fact that these methods don’t require derivatives of an objective function in order to optimize it. They’re called “evolution strategies” because they loosely model biological evolution. They stochastically generate a population, evaluate the population on the objective function, and move the population-generating distribution towards the higher-performing members of the population.

If you squint, this kind of mimics natural selection, where the fittest members of a population propagate their traits onwards to the next generation. CMA-ES can also change the variance of the population it generates to find a solution faster, and that’s what the gif at the top is showing. If you want to read more about this topic, David Ha has two great posts talking about evolutionary strategies and using them in RL environments. I also recommend reading the Wikipedia page. If you google around, you’ll find a bunch of resources about the CMA-ES algorithm and evolution strategies in general.

The Results

This is a mixed bag of results; some great, some not so great. Working with one layer with shape (observation_dimension, action_dimension) really forces us to work with limited representational capacity. Since the policy has so few parameters, it can’t represent very complex relationships. Surprisingly, though, it can still succeed in chaotic and nonlinear environments.


Check out the inverted double pendulum. All 3-d robotics environments courtesy of Pybullet.

Even though this is a highly unstable and nonlinear system, the optimizer was still able to find a set of weights which can stably balance the pendulum. The policy here gets about 9,300 reward. I believe the maximum possible is 10,000. So it’s not perfect performance, but it’s still really good. I think this is super cool because it’s literally controlled with a 9-parameter fully-connected layer and that layer is optimized using a black-box (i.e. derivative free) optimizer. Compared to the massive networks we’re used to seeing, with thousands, millions, and sometimes billions of parameters it feels almost shocking that so much can be done with 9 parameters.

Just above here is the inverted pendulum swing-up problem. Again, this is an unstable, nonlinear problem. The goal here is to start with the pole below the cart and then to swing it up and balance it on top of the cart. This policy only has 5 parameters! It gets about 700 reward and the maximum possible is 1,000. So it’s definitely not perfect, but it seems clear that the policy learned something useful since in the video it makes multiple attempts to swing the pole up and keep it balanced.

This is a 16 parameter policy trained on Lunar Lander. The environment is modeled after the classic lunar lander game. The goal is simple: land the spaceship between the flags and don’t crash! It has 16 parameters because the action-space is 2 dimensional and the observation space is 8 dimensional. The 2 actions are real-valued numbers with one representing left-right thrust and the other representing bottom rocket thrust. This policy actually solves the environment; the reward here was about 275 and this environment is considered solved after the reward earned is over 200.

This is our last pendulum, but it's also our most boring one. This is just the standard single inverted pendulum, where the pole starts on top of the cart and the goal is to keep it there. This system is near-linear and, while still unstable, it is certainly more stable than the other pendulums we looked at. This policy has 4 parameters and it solves the environment, earning the maximum reward of 1,000.


This is a 96 parameter policy that totally failed to learn how to achieve its goal. The environment is called “BipedalWalker” and the goal is for the robot to walk to the end of the level. Instead, our policy just learned to fall forwards. If I gave it more update steps or ran more experiments where I carefully control and anneal the variance of the policy and/or CMA-ES over time, it could maybe solve this environment. Giuseppe Cuccu solved this environment using a tiny, one-layer RNN with 116 parameters. I’m not confident that this environment can be solved with fewer parameters than that, but it might be a fun area to experiment in.

To give you an idea of what a successful BipedalWalker agent looks like, see the video below (I didn’t train this one):

The video below here is arguably a more interesting case of policy failure. This is on the hopper robot environment. The goal is to move forward as far as possible in the time limit. This policy has 45 parameters, since the observation space has 15 dimensions and the action space has 3 dimensions. Obviously, the policy fails at moving forward as far as possible, but it still gets about 1000 reward on the environment (this environment is considered solved above 2500 reward). It earns this reward simply by leaning forward!

Closing thoughts

I think this was a fun experiment, and if you’re curious and want to try weird stuff like this, I recommend it. But, if you want a performant implementation of CMA-ES and other evolution strategies for RL and for optimization of larger numbers of parameters, you should check out David Ha’s ESTool, the PGPElib library, and the CMA-ES library. I’m sure there are other strong libraries out there for this stuff, but those are three off the top of my head.

These policies could definitely be pushed further with more parameters, and I don’t think it would be very difficult to add more. Since it’s all just dot products, one could write a TransformLayer (or name it whatever you want) whose entire job is to take in the observations and project them to some n-dimensional vector. Then use that vector as input to the policy layer. That would definitely increase the capability of the layers to represent complex relationships and could still be kept small enough to work nicely with Optuna. I didn’t want to do that here because I had too much fun keeping these policies as tiny as possible.

Another way to probably improve the performance is to consider fixing the variance of the CMA-ES algorithm or annealing it on a set schedule over training. This might be hard to do in Optuna, so potentially another option is to schedule the variance of the Gaussian layer. It would be interesting to scale the variance in such a way that when the algorithm seems to be plateauing, turn up the variance and force more exploration in the environment. That might help break free from performance plateaus and maximize reward.

If I were going to push forward on these experiments, I’d stop using Optuna and instead switch to one of the libraries I mentioned above. The policies can still be kept tiny, but using a library with a full implementation of evolution strategies will help accelerate experiments and will likely allow greater control over the optimizer.

Link to code on GitHub

I really hope you enjoyed this post! 😃

Previously published at


Signup or Login to Join the Discussion


Related Stories