Most beginners in Machine Learning start with Supervised Learning techniques such as classification and regression. However, one of the most important paradigms in Machine Learning is Reinforcement Learning (RL), which can tackle many challenging tasks. One example is the game of Go, where an RL agent managed to beat the world’s best players.

Many have heard about RL but don’t actually know what makes it different from Supervised Learning. They are confused about how the two paradigms differ and why they co-exist. This post is intended to clarify the differences and introduce how Deep Learning fits into the picture.

Let’s start off with the most important question: Why should we care about RL?

#### Why RL in the first place?

Supervised Learning can address a lot of interesting problems, from classifying images to translating text. Now consider problems like playing games or teaching a robot arm to grasp objects. Why can’t we solve these properly with Supervised Learning?

Consider the case of playing Go. Suppose we had a data set containing the history of all Go games played by humans. Then we could use the game state as input *X* and the move taken in that state as output label *Y*. In theory that sounds nice, but in practice a few issues arise.

1. Data sets like this don’t exist for all domains we care about

2. Creating such a data set might be expensive or even infeasible

3. The approach learns to imitate a human expert instead of actually learning the best possible strategy

RL comes to the rescue here. Intuitively, RL attempts to learn actions by **trial and error**. We learn the optimal strategy by sampling actions and observing which ones lead to our desired outcome. In contrast to the supervised approach, we learn this optimal action not from a label but from a time-delayed label called a **reward**. This scalar value tells us whether the outcome of whatever we did was good or bad. Hence, the goal of RL is to take actions that **maximize reward**.

#### What is the formal definition of the problem?

Mathematically, an RL problem can be framed as a **Markov Decision Process** (MDP). An MDP is memoryless: everything we need to know about the past is captured by the current state. The RL setup can be visualized like this:

There is an agent in an environment that takes actions and in turn receives rewards. Let’s briefly review the supervised learning task to clarify the difference.
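This interaction loop can be sketched in a few lines of Python. Note that `env` and `agent` here are hypothetical objects with a Gym-style `reset`/`step`/`act` interface, used purely for illustration:

```python
# A minimal sketch of the agent-environment loop, assuming hypothetical
# env/agent objects with a Gym-style interface (names are illustrative).
def run_episode(env, agent):
    state = env.reset()
    total_reward = 0.0
    done = False
    while not done:
        action = agent.act(state)               # agent picks an action
        state, reward, done = env.step(action)  # environment responds
        total_reward += reward                  # accumulate the reward signal
    return total_reward
```

The agent's only feedback is the scalar `reward` handed back by the environment on each step.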

In Supervised Learning, given input data *X* and labels *Y*, we learn a function *f: X → Y* that maps *X* (e.g. images) to *Y* (e.g. class labels). If the training process converged, the function will predict *Y* from novel input data with a certain accuracy.
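As a minimal illustration of learning such an *f*, here is a toy 1-nearest-neighbour "model" in plain Python (a deliberately simple stand-in, not any particular library's API):

```python
# Toy supervised learning: "learn" f: X -> Y by memorising labelled
# examples and predicting the label of the nearest training input.
def fit_predict(train_X, train_Y, x):
    nearest = min(range(len(train_X)), key=lambda i: abs(train_X[i] - x))
    return train_Y[nearest]

X = [0.0, 1.0, 2.0, 3.0]            # inputs (a 1-D feature for simplicity)
Y = ["low", "low", "high", "high"]  # labels
```

For example, `fit_predict(X, Y, 0.4)` returns `"low"`: every prediction is grounded in an explicit label, which is exactly what RL lacks.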

Now let’s move on to the RL setup, which is defined by the 5-tuple *(S, A, P, R, 𝛾)*. We are given a set of states *S* and a set of actions *A*. *P* is the state transition probability. The reward is a scalar that tells us how well we did in terms of the goal we want to optimize towards; it is given by a reward function *R: S×A → ℝ*. We will come to 𝛾 in a bit.

The task is to learn a function *π: S → A* that maps from states to actions. This function is called the **policy function**. The objective is now to find an optimal policy that maximizes the expected sum of rewards. This is also called the **control problem**.
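Go’s state space is far too large to tabulate, but a toy MDP written out in this 5-tuple form, together with a policy *π*, might look like the following sketch (all states, actions, and values here are made up for illustration):

```python
# A tiny deterministic MDP written out as the 5-tuple (S, A, P, R, gamma).
S = ["s0", "s1"]        # set of states
A = ["stay", "move"]    # set of actions

# Transition function P: (state, action) -> next state
# (deterministic here; in general this is a probability distribution).
P = {("s0", "stay"): "s0", ("s0", "move"): "s1",
     ("s1", "stay"): "s1", ("s1", "move"): "s0"}

# Reward function R: S x A -> scalar reward
R = {("s0", "stay"): 0.0, ("s0", "move"): 1.0,
     ("s1", "stay"): 0.0, ("s1", "move"): 0.0}

gamma = 0.9  # discount factor (explained below)

# A (deterministic) policy pi: S -> A is just a mapping states -> actions.
pi = {"s0": "move", "s1": "move"}
```

Solving the control problem means finding the `pi` whose actions collect the most (discounted) reward over time.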

The game of Go can be modeled with this approach in the following way:

**State**: Positions of all stones on the board

**Actions**: Where the player puts a stone down

**Reward**: 1 if the player wins at the end of the game, 0 otherwise

#### How do we interpret the reward?

The problem with the rewards in RL is that we don’t know which action had the deciding effect on the outcome. Was it the move we made three actions before or the current one? We call this the **credit assignment problem**.

To deal with this problem, the discount factor 𝛾 ∈ (0, 1] is introduced to calculate the optimal policy π*. Our optimization problem becomes maximizing the expected sum of **discounted** rewards, so the optimal policy is the solution of π* = argmax_π E[ Σₜ 𝛾ᵗ rₜ ].

Intuitively, we assign blame to each action under the assumption that its effects have an **exponentially decaying** impact on the future.
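Written as code, the discounted sum of rewards is a one-liner. A `gamma` close to 0 makes the agent short-sighted; a `gamma` close to 1 makes it value future rewards almost as much as immediate ones:

```python
# Discounted return: the sum over t of gamma^t * r_t for a reward sequence.
def discounted_return(rewards, gamma=0.99):
    return sum(gamma ** t * r for t, r in enumerate(rewards))
```

For example, with `gamma=0.5` the reward sequence `[0, 0, 1]` yields a return of 0.25: the win three steps away is worth only a quarter of an immediate one.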

To learn the optimal policy, there are different approaches such as policy gradient and Q-Learning. While policy gradient methods try to learn the policy directly, Q-Learning learns a value function over state-action pairs. I will leave a detailed explanation of these algorithms for a future post.
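As a small taste of what’s coming, here is a sketch of the classic tabular Q-Learning update rule (deep variants replace the table with a neural network, but the update has the same shape):

```python
from collections import defaultdict

# Tabular Q-Learning update:
#   Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    best_next = max(Q[(s_next, a2)] for a2 in actions)  # max over next actions
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

Q = defaultdict(float)  # unseen state-action pairs default to a value of 0
```

Repeatedly applying this update while interacting with the environment drives `Q` towards the expected discounted return of each state-action pair.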

#### Why do we use Deep Learning in RL?

In Supervised Learning, we use Deep Learning because it is infeasible to manually engineer features for unstructured data such as images or text. In RL, we use Deep Learning for largely the same reason. With neural networks, RL problems can be tackled without much hand-engineered domain knowledge.

To exemplify this, consider the game of Pong. With traditional feature engineering, we would need to extract meaningful features from the game positions by hand. Using neural networks, we can feed the **raw game pixels** into the algorithm and let it build high-level non-linear representations of the data.

To do this, we construct a policy network that is trained **end-to-end**: we input our game states, and out comes a probability distribution over the possible actions we can take.

If we consider the example of Pong, the action is either going UP or DOWN. This is an example setup for learning how to play Pong:

At first glance this might look just like a typical supervised learning setup, for example image classification. Remember, however, that we are not given a label for each game state, so we can’t train this network in the same straightforward fashion.
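To make this concrete, here is a deliberately tiny sketch of such a policy: a single linear layer plus a sigmoid mapping flattened pixels to the probability of going UP. Real policy networks have hidden layers and tens of thousands of pixel inputs; the sizes and weights here are made up for illustration:

```python
import math
import random

random.seed(0)
N_PIXELS = 4  # stand-in for a flattened game frame (real frames are far larger)
weights = [random.uniform(-0.1, 0.1) for _ in range(N_PIXELS)]

def policy(pixels):
    # One linear layer followed by a sigmoid: outputs P(action = UP).
    logit = sum(w * x for w, x in zip(weights, pixels))
    return 1.0 / (1.0 + math.exp(-logit))
```

During play we would sample UP with this probability (and DOWN otherwise); training then nudges `weights` towards actions that led to reward rather than towards fixed labels.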

#### Conclusion

I hope this post gave you a better intuition about the difference between Supervised Learning and RL. Both approaches have their rightful place, and there are many success stories for each. In a future post, I’ll explain in more depth how RL systems are trained. For anyone who wants to learn more, I’ve attached some resources that I personally found useful.

#### Further Reading

https://arxiv.org/abs/1701.07274

http://karpathy.github.io/2016/05/31/rl/