*This story was first written in 2016 and since then the machine learning field has advanced a lot, but it still serves as a good introduction to reinforcement learning. I’m now leading **Scaleout** where we help businesses take machine learning from R&D to real production.*

Recently there has been a lot of attention around Google DeepMind's victory over Lee Sedol in the board game Go. This is a remarkable achievement, since Go has long been considered unbreakable by artificial intelligence, mainly because the game was thought to rely largely on human intuition to handle the extremely large number of possible states during gameplay.

The technique DeepMind used was a combination of deep neural networks and reinforcement learning. In machine learning, deep neural networks have for the past few years achieved remarkable results in a number of different fields, such as image recognition, speech recognition and language processing (for some cool examples see The Unreasonable Effectiveness of Recurrent Neural Networks). Reinforcement learning has not had the same amount of academic and public attention and has often been used to solve various toy problems. But recently the combination of deep neural nets and reinforcement learning has proven to be very powerful. Before DeepMind turned their attention to Go, they showed that a combination of these techniques could achieve better-than-human results in a number of Atari computer games, using only the input and output that a regular human player would have access to.

The reason for combining a neural net with reinforcement learning is that a neural net can handle a large number of possible states. In plain reinforcement learning you often use a lookup table, and as long as the number of possible states is finite and not too large this is fine. But when the number of possible states grows, or continuous inputs are used, something that can handle a large state space is needed.
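To make the lookup-table idea concrete, here is a minimal sketch of a tabular Q store in plain Python. The state labels, action names and stored values are made up for illustration and do not come from the code used in this post.

```python
from collections import defaultdict

ACTIONS = ("buy", "sell", "hold")  # hypothetical action set

# Q-table: maps (state, action) -> estimated reward, defaulting to 0.0
q_table = defaultdict(float)

def best_action(state):
    """Return the action with the highest stored value for this state."""
    return max(ACTIONS, key=lambda a: q_table[(state, a)])

# With a handful of discrete states a table works fine
q_table[("price_rising", "buy")] = 1.0
q_table[("price_rising", "sell")] = -1.0
print(best_action("price_rising"))  # buy
```

A continuous input such as a raw price would need one table entry per distinct value, which is where a function approximator like a neural net comes in.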

To go beyond toy examples, video games and board games, this post is a tutorial on combining (deep) neural nets, self reinforcement learning and some real data to see if it is possible to create a simple self learning quant (or algorithmic financial trader).

A serious warning for anyone thinking about copy/pasting this to make a live algorithmic trading robot: Whatever the result will be in the end, a real algorithmic trader will be a very different beast to implement as there are numerous other factors that must be handled in live trading with real assets.

Ok? Let’s go!

I’m doing this in Python (2.7) with a few imported libraries. Again, my goal is to explain and show the concept of self reinforcement learning combined with a neural network. Once you understand the basic concepts, search the internet for better and more mathematically correct explanations.

In reinforcement learning there are a few basic notations and concepts:

- State S, this is a representation of the current world as the algorithm sees it
- State S’, a new state one time step later than S.
- Action A, one of the possible actions that can be taken in state S.
- Q, a function that approximates the reward for taking action A in state S. Can be written as Q(s,a). In our case Q is a neural network.
- Reward R, the actual reward received on reaching state S’ after taking action A.
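These pieces are usually tied together by the standard Q-learning update, shown here as a reference point. The post does not spell this formula out, so treat it as an assumption about the general method. The learning rate α and the discount factor γ are symbols introduced for this sketch (gamma is discussed under Example 2).

```latex
Q(s, a) \leftarrow Q(s, a) + \alpha \left[ R + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]
```

In words: the estimate for taking action A in state S is moved a small step towards the observed reward R plus the discounted value of the best action available in the new state S’.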

Now, I will go through a few different cases where the complexity of the data and the algorithm gradually increases. Some key parts of the code are copied, but to keep the post readable and at a reasonable length the entire code for each example is not explained or copied here. Instead the code can be found on Github at https://github.com/danielzak/sl-quant.

Example 1: Straight line

In the first example I will see if I can teach the system to recognise an asset with a linearly increasing price. In terms of a quant trader this means that the trader should always buy (or go long).

The data is simply created by a function that returns a straight line:
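The original listing is in the repository linked above. A minimal sketch of such a function, assuming NumPy, could look like this. The function name, the series length and the slope are illustrative choices, not values taken from the repository.

```python
import numpy as np

def make_line_prices(n_steps=200, start=0.0, slope=0.1):
    """Return a linearly increasing price series of length n_steps."""
    # n_steps, start and slope are assumed values for illustration
    return start + slope * np.arange(n_steps)

prices = make_line_prices()
print(len(prices), prices[0], prices[-1])
```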

So what is this self reinforcement thing then? What we have here is a pretty basic system with a set of states, some actions that can be taken and some way of measuring rewards based on these actions. We also have a Q function that should learn to approximate the reward. In a simple world we could just let Q be a table of all possible states and then find a way to explore all possible states, actions and rewards, save these to the table and then look up the best action for a given state when needed. In a more complex world we need a way to generalise our knowledge and to be able to handle a very large number of different states.

The self learning comes from looping through a number of different states and actions many times, each time updating the Q function a little bit. So in each loop the Q function will know a little bit more about the world around it and should be able to approximate the real reward a little bit better for each possible action. Also, one very important thing in the learning process is to add a bit of randomness in order to explore as much of the world as possible. In our case we do this by adding a chance of selecting a random action instead of the action suggested by the Q function. This is the epsilon value in the code below.

In my code this main loop looks like this:
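The full loop is in the repository linked above. As a stripped-down sketch of the same idea (not the repository code), the loop below runs epsilon-greedy Q-learning on a straight-line price series. To stay short it uses a small lookup table in place of the neural net. The state encoding, the action set, the reward and all hyperparameter values are assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
prices = 0.1 * np.arange(200)        # straight-line prices, as in Example 1
ACTIONS = np.array([-1, 0, 1])       # short, hold, long (assumed action set)

def get_state(t):
    """Assumed encoding: 1 if the last price move was up, else 0."""
    return int(prices[t] > prices[t - 1])

q = np.zeros((2, len(ACTIONS)))      # lookup table standing in for the neural net
epsilon, gamma, alpha, epochs = 1.0, 0.9, 0.1, 10

for epoch in range(epochs):
    for t in range(1, len(prices) - 1):
        s = get_state(t)
        # Explore with probability epsilon, otherwise follow the current Q estimate
        if rng.random() < epsilon:
            a = int(rng.integers(len(ACTIONS)))
        else:
            a = int(np.argmax(q[s]))
        # Assumed reward: position times the next price change
        r = ACTIONS[a] * (prices[t + 1] - prices[t])
        s_next = get_state(t + 1)
        # Nudge Q(s, a) towards reward plus discounted best future value
        target = r + gamma * np.max(q[s_next])
        q[s, a] += alpha * (target - q[s, a])
    epsilon = max(0.1, epsilon - 1.0 / epochs)  # explore less as training goes on

print(ACTIONS[np.argmax(q[1])])      # 1, i.e. go long when prices are rising
```

Starting with epsilon at 1.0 means the first epoch is pure exploration. Lowering it each epoch lets the learned Q values take over gradually.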

There is also a Q function, in this case the neural net. This is a simple three-layer neural network with just 4 neurons in each layer, which should be sufficient to learn what a straight line looks like.
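To give a feel for the size of such a network, here is a forward pass for three dense layers of 4 units each, written in plain NumPy. This is only a sketch: the input size, the ReLU activation, the weight initialisation and the reading of the last layer as one Q-value per action are all assumptions, and training (backpropagation) is left out because a deep learning library would normally handle it.

```python
import numpy as np

rng = np.random.default_rng(0)
N_INPUTS, N_HIDDEN, N_ACTIONS = 2, 4, 4  # input size and action count are assumptions

# Three dense layers of 4 units each: two hidden layers plus a linear output layer
sizes = [N_INPUTS, N_HIDDEN, N_HIDDEN, N_ACTIONS]
weights = [rng.normal(0, 0.5, (m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

def q_values(state):
    """Forward pass: state vector in, one estimated Q-value per action out."""
    x = np.asarray(state, dtype=float)
    for w, b in zip(weights[:-1], biases[:-1]):
        x = np.maximum(0.0, x @ w + b)      # ReLU hidden layers
    return x @ weights[-1] + biases[-1]     # linear output, since Q can be negative

state = [1.0, 0.1]                          # hypothetical features: price, last change
print(q_values(state).shape)                # (4,)
```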

Ok, those are the basic components of the system. Let’s run the code and see what happens.

After one epoch (one training loop in the main loop above) I ask the system to suggest trades for each time step. This is the result:

Clearly our self learning quant has no clue what it is doing. Let’s run the code for 10 epochs and see what the output is:

Wow! Only long trades! Without modelling anything or giving any prior knowledge we have a system that has learnt what a straight line looks like :-) Good stuff.

Example 2: Sine wave

Now we can make things a little more complicated by replacing the straight line with a sine wave. Please note! In examples 1 and 2 there is no separate training and testing set of data. That is of course outrageous for anyone interested in machine learning, but it is done only to keep things simple. A train/test split will come in example 3.

Let’s also introduce another concept in self reinforcement learning: the gamma parameter. Remember that we evaluate each action based on its reward. By default this means that on each time step the system will learn which action maximizes its reward at the next time step. Gamma is chosen between 0 and 1, and by setting a large gamma we will value a high long term reward as well, so the system can learn to value a path of choices that will give a high reward several time steps into the future.
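A small worked example shows the effect of gamma. The helper below computes the discounted sum of a reward sequence. The function name and the reward values are made up for illustration.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards where a reward k steps ahead is weighted by gamma**k."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total

# A payoff that only arrives three steps into the future
rewards = [0.0, 0.0, 0.0, 1.0]
print(discounted_return(rewards, 0.0))            # 0.0
print(round(discounted_return(rewards, 0.9), 3))  # 0.729
```

With gamma at 0 the future payoff is invisible, while with gamma at 0.9 it still counts for 0.9 × 0.9 × 0.9 of its value.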

One thing that hasn’t been covered yet is the reward function. I will just copy it here to give an example of what it looks like. The reward function will give a reward if the action (or signal) is in the same direction as the price movement, and it will also give a small extra reward if the action is the same as the last action.
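The original listing is in the repository linked above. As a rough sketch of a reward with those two properties (not the repository code), one could write something like the following. The function signature, the action encoding and the two reward sizes are assumptions.

```python
def reward(price_now, price_next, action, last_action,
           match_reward=1.0, repeat_bonus=0.1):
    """Score one action; actions are +1 (long), -1 (short) or 0 (hold).

    match_reward and repeat_bonus are assumed values for illustration.
    """
    move = price_next - price_now
    r = 0.0
    if action * move > 0:          # action points the same way as the price move
        r += match_reward
    if action == last_action:      # small bonus for sticking with the last action
        r += repeat_bonus
    return r

print(reward(10.0, 11.0, action=1, last_action=1))   # 1.1
print(reward(10.0, 11.0, action=-1, last_action=1))  # 0.0
```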

So what does this look like after one epoch?

Again, the system has no clue what is going on. What it has learned in one epoch is that it gets a reward if the action is the same as the previous action.

Let’s do 100 epochs:

Ahh, it can learn what a smooth wavy line looks like. With this simple neural network the result will not be much better even if we increase the number of epochs far beyond this.

Example 3: Bitcoin price data

Now let’s try to dip our toes a little deeper into the water. In this final example I have made a few changes to the basic code used above (remember, the full code is available at https://github.com/danielzak/sl-quant).

Notable changes:

1. Daily Bitcoin price data is used as input data (source Kraken via Quandl)

2. I have made a train/test split of the data (600 data points for training, 200 for testing).

3. I have added some common technical indicators to the input data as well (Simple Moving Average (15 and 60 periods), Relative Strength Index, Average True Range)

4. The neural network is now a two layer recurrent neural network (LSTM) with 64 neurons in each layer.

5. The self reinforcement learning loop uses a trick called experience replay, which greatly improves the speed of learning by making each update batch bigger. This is computationally efficient when updating a neural network.
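The experience replay in item 5 can be sketched as a fixed-size buffer of past transitions from which random mini-batches are drawn. This is a generic illustration rather than the repository code: the class name, the capacity and the batch size are assumptions, and the network update that would consume each batch is left out.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size store of past (state, action, reward, next_state) transitions."""

    def __init__(self, capacity=1000):
        self.memory = deque(maxlen=capacity)  # oldest transitions drop off first

    def add(self, state, action, reward, next_state):
        self.memory.append((state, action, reward, next_state))

    def sample(self, batch_size):
        """Return a random mini-batch, or everything stored if there is less."""
        k = min(batch_size, len(self.memory))
        return random.sample(list(self.memory), k)

buffer = ReplayBuffer(capacity=80)   # capacity and batch size are assumed values
for t in range(100):
    buffer.add(state=t, action=0, reward=0.0, next_state=t + 1)

batch = buffer.sample(batch_size=40)
print(len(buffer.memory), len(batch))  # 80 40
```

Each sampled batch would then be turned into inputs and Q-value targets and passed to the network in a single update, instead of updating once per time step.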

Ok, so in the end the best result for this system using this input data was to buy and hold rather than to do shorter trades during this time frame. This is a very basic example using just a few common financial indicators, and real traders use much more sophisticated tools. But if nothing else I would say this shows the potential for self learning systems, and hopefully you’ve learnt a bit as well.

End notes: For those interested in more information and a tutorial-based approach to learning the concepts of self reinforcement learning, I recommend reading the 3 part blog post series at http://outlace.com/Reinforcement-Learning-Part-1/