Reinforcement Learning (Introduction)

Written by mukulmalik | Published 2017/01/06
Tech Story Tags: machine-learning | reinforcement-learning | evolutionary-algorithms | data-science | deep-learning


“Reinforcement Learning is simply Science of Decision Making. That’s what makes it so General.” — David Silver

Abstract

You might have observed a level of saturation in Machine Learning recently. Well, that's actually saturation in 'Supervised Learning' (poor Kaggle).

Most of us don't know any learning algorithm other than Back-Propagation. There are a few recent ones, like 'Equilibrium Propagation' and 'Synthetic Gradients', but they more-or-less fall under the same paradigm as Back-Propagation.

Meanwhile, Unsupervised Learning and Reinforcement Learning remain barely scratched (comparatively, in the mainstream at least).

In this blog we'll be diving into Reinforcement Learning, or as I like to call it, 'Stupidity-followed-by-Regret' or 'What-If' learning. Those are actually pretty accurate names.

Intro

Reinforcement learning is interaction-based learning in a 'cause-effect' environment.

The effect (or consequence) is based upon the actions, and the cause is the goal to be achieved.

Much like talking to your crush. The goal is to impress her but every word you say will have a consequence.

She’ll either:

  • find you charming or
  • randomly remember she has to be somewhere

#BelieveMeIAmAGuy #IntrovertProblems

RL is goal-oriented learning, where the learner (referred to as the 'agent' in RL) is not told which action to take; rather, it is encouraged to explore and discover the actions that yield the best results. Vaguely speaking, exploration is based on a trial-and-error approach.

In RL, your present action affects not only your immediate reward (getting that kiss) but also your future rewards (her agreeing to a second date). So the agent might opt for higher future rewards instead of a high current reward.

Remember the following statement, you’ll understand it better later:

RL is not defined by characterising a learning method but by characterising the learning problem itself.

Agent

The learner in RL is referred to as agent.

An 'agent' has a sense of its environment and is able to take actions, in the form of interactions with the environment, in order to maximise the reward received from the environment.
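
To make that concrete, here's a minimal sketch of the agent-environment loop in Python. The `Environment` and `Agent` classes (and their toy dynamics) are hypothetical stand-ins, not any particular library's API:

```python
# A minimal sketch of the agent-environment interaction loop.
# Environment and Agent are hypothetical stand-ins, not a real library API.

class Environment:
    def reset(self):
        """Return the initial state of an episode."""
        return 0

    def step(self, action):
        """Apply the action; return (next_state, reward, done)."""
        next_state = action            # toy dynamics: the action becomes the state
        reward = 1.0 if action == 1 else 0.0
        done = True                    # single-step episodes keep the sketch short
        return next_state, reward, done


class Agent:
    def act(self, state):
        """Pick an action given the current state (here: always action 1)."""
        return 1


env, agent = Environment(), Agent()
state = env.reset()
total_reward, done = 0.0, False
while not done:
    action = agent.act(state)                 # agent interacts with the environment
    state, reward, done = env.step(action)    # environment responds with a consequence
    total_reward += reward                     # this is what the agent tries to maximise
print("total reward:", total_reward)
```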

Q) Who is a Supervisor?

Ans) In Supervised learning, the training set supervises the learning of the learner. Hence it acts like a supervisor.

Q) How and when is an RL Agent better than a Supervisor?

Ans) In uncharted territory, a supervisor (your ever-single friend) does not have much idea about the correct actions (the steps to impress your crush). In this case, much better learning can be done from the agent's own experiences.

Exploration-Exploitation Trade-Off

Obviously there is a trade-off, there is always a trade-off!

Supervised learning has Bias-Variance trade-off and RL has Exploration-Exploitation trade-off.

To obtain high reward, the agent can either:

  • prefer actions that were fruitful in the past, or
  • explore the possibility of better actions

Let’s break it down

  • Exploitation : use pre-learned knowledge to obtain a GOOD reward
  • Exploration : try something new to obtain either a GREAT or a WORSE reward

In stochastic tasks, an action has to be tried multiple times to gain a reliable estimate of its reward. Yet, this can't ensure which is the best possible action. The agent can't explore and exploit at the same step, hence the trade-off.

Information gained during exploration can be exploited multiple times during future steps.
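
One common way to balance the two is epsilon-greedy action selection: with a small probability the agent explores a random action, otherwise it exploits the best-looking action so far. Here's a minimal sketch on a toy 3-armed bandit; the win probabilities are made up purely for illustration:

```python
import random

# Minimal epsilon-greedy sketch on a toy 3-armed bandit.
true_win_prob = [0.2, 0.5, 0.8]   # made-up probabilities, unknown to the agent
estimates = [0.0, 0.0, 0.0]       # running estimate of each arm's reward
counts = [0, 0, 0]
epsilon = 0.1                      # 10% of the time: explore

for step in range(10_000):
    if random.random() < epsilon:
        arm = random.randrange(3)                 # explore: try something new
    else:
        arm = estimates.index(max(estimates))     # exploit: best known arm
    reward = 1.0 if random.random() < true_win_prob[arm] else 0.0
    counts[arm] += 1
    # incremental average: each arm must be tried many times for a reliable estimate
    estimates[arm] += (reward - estimates[arm]) / counts[arm]

print("estimates:", [round(e, 2) for e in estimates])   # rough estimates of the true probabilities
```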

Extra

At times RL involves supervised learning. This is done to determine which capabilities are critical for achieving rewards.

Example:

Evolution makes sure you know how to breathe; otherwise, no matter how awesome you are, you won't be able to achieve anything.

Growing up, whether you achieve a Nobel Prize or keep playing with fidget spinners is not the concern of the supervised part.

These algorithms are categorised as Deep Reinforcement Learning.

More on that later; for now, just have a look at how well they can perform!

Fundamentals of RL

RL has 4 fundamental concepts:

  • Policy
  • Reward Function
  • Value Function
  • Model (of Environment) [ideally]

Policy

It defines the behaviour of the agent at any given time.

Roughly, a policy is a mapping from the currently known states to the actions to be taken when in those states.

In terms of psychology, it corresponds to stimulus-response rules.

A policy can be in the form of a look-up table or an extensive search process.
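
For a tiny state space, that look-up table can literally be a dictionary from states to actions. A hypothetical sketch (the states and actions are made up, obviously):

```python
# A policy as a plain look-up table: state -> action.
# States and actions here are hypothetical, chosen only to illustrate the mapping.
policy = {
    "crush_is_smiling": "keep_talking",
    "crush_checks_phone": "change_topic",
    "crush_remembers_appointment": "retreat_gracefully",
}

def act(state):
    # the stimulus-response rule: given a state, the policy dictates the action
    return policy.get(state, "stay_quiet")   # default action for unknown states

print(act("crush_is_smiling"))                # -> keep_talking
```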

Reward & Value

Reward is the gain that indicates how good an action is in the immediate sense.

Value is the cumulative gain that estimates how good an action will be in the long run.

Example:

Eating cake will make me happy right now. Though in long run, cake is bad for my health.

The Reward System combines reward and value under a single concept and conveys:

  • how rewarding an action is, and
  • hence, also how desirable a state is

In a biological sense, it refers to pleasure and pain.

In relationship sense it refers to her saying “you are looking cute” and “you should shave that beard off” (it still hurts).

Mostly, the reward system is unaltered by the agent, though it may serve as a basis for altering the policy.

Example:

Falling off the cycle hurts, so it is very low-rewarding. I can't alter the reward system to take pleasure in that pain. BUT. I can alter the way I ride the cycle so I don't fall again.

Our main concern is the Value and not Reward.

Q) Why?

Ans) Reward only makes sure that the immediate gain is maximised; future gains could be way worse. Value makes sure that we get the maximum overall gain.

Q) What’s the problem then?

Ans) Usually immediate gains have more surety than future gains.

Q) What are value functions?

Ans) As the name suggests, value functions are functions used to compute the value of an action. The tricky part is how they do it. They usually use complex inference-based algorithms.

Example:

You buy a stock. The current market status is good, so with much surety you can sell it tomorrow and get a good reward. Now if you keep the shares for another year, the value might sky-rocket multiple folds, but one can never be very sure about that.
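
A common way to formalise this trade-off is to define value as a discounted sum of future rewards: a discount factor below 1 encodes that nearer rewards are more certain than distant ones, which is exactly the stock example above. A minimal sketch, with made-up reward numbers:

```python
# Value as the discounted sum of future rewards.
# gamma < 1 encodes that immediate gains are more certain than distant ones.
def discounted_return(rewards, gamma=0.9):
    value = 0.0
    for t, r in enumerate(rewards):
        value += (gamma ** t) * r
    return value

# Made-up numbers: sell tomorrow for a sure small gain...
sell_now = discounted_return([10.0])
# ...or hold for a year hoping for a big (but heavily discounted) payoff.
hold = discounted_return([0.0] * 12 + [200.0], gamma=0.9)

print(round(sell_now, 2), round(hold, 2))   # compare the two values
```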

Evolutionary Algorithms

Although they sound very similar, RL and Evolutionary algorithms are not the same thing.

Evolutionary algorithms don't use a Value function. They search directly in the space of policies (a minimal sketch follows the pros and cons below).

Pros:

  • Effective when agent doesn’t have much sense of the environment

Cons:

  • They don't notice which states the agent passes through in its lifetime
  • They do not exploit the fact that the policy they are searching for is a function from states to actions.
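
To make the contrast concrete, here's a minimal sketch of direct policy search in the evolutionary spirit: no value function, just mutate the policy parameters and keep whichever policy earns the higher lifetime reward. The toy fitness function is a made-up stand-in for running the policy in a real environment:

```python
import random

# Direct policy search, evolutionary style: no value function, no credit to
# individual states, just "keep the policy whose whole lifetime scored better".

def episode_return(policy_params):
    # Toy fitness: reward peaks when the single parameter is near 0.7.
    return -abs(policy_params[0] - 0.7)

best = [random.random()]
best_score = episode_return(best)
for generation in range(1000):
    candidate = [best[0] + random.gauss(0.0, 0.1)]   # mutate the policy directly
    score = episode_return(candidate)
    if score > best_score:                            # selection: keep the fitter policy
        best, best_score = candidate, score

print("best policy parameter:", round(best[0], 3))    # should approach 0.7
```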

This was just the first chapter, a basic intro to RL. We still have a world to explore in upcoming chapters.


