As humans, we learn from our experiences, and experience is gained through prolonged interaction with our environment. Be it designing, planning, playing, developing, or coding, the longer we engage in these activities, the better we become. And as the popular saying goes,
Practice makes a man perfect.
Inspired by this trait of human learning, a separate branch of machine learning has emerged. This branch, which focuses on learning through interaction, is called Reinforcement Learning. From playing simple games to self-driving vehicles, reinforcement learning has found utility in a wide range of applications.
To put it in a single statement: “Reinforcement Learning is learning what to do, in an uncertain environment, so as to maximize a reward signal.”
Let us break this statement down to understand it better.
“Learning what to do” - Reinforcement learning is goal-directed learning. Any learning problem in the reinforcement setting is framed around achieving a certain goal, be it winning a game of chess or balancing a pole. The learning agent aims to excel at this single task.
“In an uncertain environment” - Any reinforcement learning problem is set in an uncertain environment. With more and more interactions, the uncertainty about the environment is reduced. With better information about the environment, the learning agent can make informed decisions. A reward signal evaluates how desirable any state of the environment is.
“Maximizing a reward signal” - For any action that the learning agent takes, it receives feedback. This feedback tells the agent how good the performed action was. An action that moves the environment to a more desirable state yields a higher reward.
Thus, the learning agent interacts with its environment to maximize a reward signal. The reward signal is chosen in such a way that maximizing it aligns with the goal of the agent. Agent-environment interaction is a cycle: the agent senses the environment, takes an appropriate action, and receives a reward from the environment for that action.
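To make this cycle concrete, here is a minimal sketch of the sense-act-reward loop in Python. The ToyEnvironment and RandomAgent classes are purely illustrative assumptions, not part of any standard library: the agent wanders along a number line and receives a reward of 1 only when it reaches position 5.

```python
import random

class ToyEnvironment:
    """Illustrative toy environment: the agent tries to reach position 5 on a line."""
    def __init__(self):
        self.position = 0

    def step(self, action):
        # action is -1 (move left) or +1 (move right)
        self.position += action
        reward = 1.0 if self.position == 5 else 0.0   # reward only at the goal
        done = self.position == 5
        return self.position, reward, done

class RandomAgent:
    """An agent that senses the state and picks an action (randomly here)."""
    def act(self, state):
        return random.choice([-1, +1])

env = ToyEnvironment()
agent = RandomAgent()
state, total_reward = env.position, 0.0

# The interaction cycle: sense -> act -> receive reward, repeated.
for t in range(100):
    action = agent.act(state)
    state, reward, done = env.step(action)
    total_reward += reward
    if done:
        break

print(f"Episode finished after {t + 1} steps, total reward = {total_reward}")
```

A smarter agent would replace the random choice with a learned rule, but the surrounding loop stays exactly the same.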
Reinforcement learning is also a closed-loop learning problem. By this, we mean that the actions taken by the learning agent at any time impact its future decisions as well. Any action taken by the agent changes the environment, and the changed environment then serves as input for the agent's further decisions. Hence, we have a closed loop.
Any reinforcement learning task involves an explore-exploit dilemma. To maximize the reward signal, the agent exploits its existing knowledge. But to expand that knowledge, the agent must explore, and exploring comes at the cost of diminished immediate rewards. In the long run, however, the agent might benefit from exploring.
And thus, there is a dilemma: neither exploration nor exploitation can be pursued exclusively without failing at the task itself.
The decision between exploring and exploiting is generally dependent on many factors. All these factors must be accounted for before choosing one or the other path. Let us take an example to understand it better.
Suppose you are playing a game where N boxes are kept, each containing some amount of reward (think of it as money or gold). You have K moves, and in a single move, you can open any one box. You receive a reward equal to the worth found in the box you open. (Note that the box is not emptied; you will receive the same reward if you choose the same box again in a later move.) Your goal is to maximize the accumulated reward.
In the game above, you can easily experience the explore-exploit dilemma. If you know that box 1 gives you a reward of 100, would you choose it for all K moves? Or would you risk opening box 2, which might contain a smaller or larger reward?
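Here is one possible sketch of the boxes game in Python, played by an epsilon-greedy agent that mostly exploits its best estimate but occasionally explores a random box. The number of boxes, their hidden rewards, and the value of epsilon are all illustrative assumptions, not part of the original game description.

```python
import random

N, K, EPSILON = 5, 1000, 0.1
box_rewards = [100, 40, 250, 10, 75]       # hidden worth of each box (unknown to the agent)

estimates = [0.0] * N                      # running estimate of each box's reward
counts = [0] * N                           # how often each box has been opened
total = 0.0

for move in range(K):
    if random.random() < EPSILON:
        box = random.randrange(N)              # explore: open a random box
    else:
        box = estimates.index(max(estimates))  # exploit: open the best-looking box

    reward = box_rewards[box]
    total += reward

    # Update the running-average estimate for the chosen box.
    counts[box] += 1
    estimates[box] += (reward - estimates[box]) / counts[box]

print(f"Accumulated reward over {K} moves: {total}")
print(f"Estimated box values: {estimates}")
```

Try varying EPSILON: with a value of 0, the agent never explores and can get stuck on a mediocre box; with a very large value, it wastes too many moves exploring boxes it already knows are poor.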
We have been talking in abstract terms like agent and environment until now. Let us now formally define some of the elements of reinforcement learning. Broadly speaking, they are the following:
Agent is anything that makes the decision to take an action. An agent senses the environment and uses its knowledge to decide which action will yield the maximum reward signal. The ultimate goal of any reinforcement learning task is to make the agent learn how to achieve a goal.
Environment is everything outside of the agent. This does not necessarily mean the physical environment. It is more formally defined as anything over which the agent does not have arbitrary control. However, the agent can take actions that change the state of the environment. It is this state of the environment that the agent uses to make decisions.
Policy is the decision-making rule inside the agent, which it uses to choose an action given the state of the environment. As learning progresses, the agent learns the optimal policy, the one that gives it the maximum reward.
Reward Signal is the feedback that the agent receives from the environment for any action it takes. It is a numerical value that the agent wants to maximize over time. A higher reward signal means a step closer to the target.
Value function is an estimate of the total reward that can be accumulated in the future, starting from the current state of the environment. Unlike the immediate reward, the value function is much harder to estimate, and this is something the agent learns to predict over time.
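As a rough illustration of how an agent can learn a value function from experience, here is a sketch of a TD(0)-style update on a small random-walk task. The environment, the step size (alpha), and the discount factor (gamma) are assumptions chosen for demonstration only.

```python
import random

N_STATES, ALPHA, GAMMA = 5, 0.1, 1.0
values = [0.0] * N_STATES                  # value estimate for each state

for episode in range(1000):
    state = N_STATES // 2                  # start each episode in the middle
    while True:
        next_state = state + random.choice([-1, +1])
        if next_state < 0:                 # fell off the left edge: reward 0
            reward, done = 0.0, True
        elif next_state >= N_STATES:       # reached the right edge: reward 1
            reward, done = 1.0, True
        else:
            reward, done = 0.0, False

        # TD(0) update: nudge the estimate toward reward + discounted next value.
        target = reward + (0.0 if done else GAMMA * values[next_state])
        values[state] += ALPHA * (target - values[state])

        if done:
            break
        state = next_state

print("Estimated state values:", [round(v, 2) for v in values])
```

After enough episodes, the estimated values increase from left to right, reflecting that states closer to the rewarding right edge are more valuable.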
With each passing second, reinforcement learning is finding its use in more and more applications. The fact that it does not need a labeled dataset makes it suitable for many tasks that other branches of machine learning struggle to solve.
With that, this article comes to an end. I hope you got to learn something new today!