David Silver UCL-RL Course: Lecture 1 Notes

Written by init_27 | Published 2018/01/03
Tech Story Tags: reinforcement-learning | david-silver | lecture-notes | introduction


You can find me on Twitter @bhutanisanyam1, connect with me on Linkedin here

You can find the Lecture Notes Markdown Here. Feel free to Contribute and improve them further.

Reinforcement Learning: It sits at the intersection of many fields of science. It is the science of decision making, a method for understanding optimal decisions.

Ex:

  • Engineering: Optimal Control works on the same problem with different terminology.
  • Neuroscience: studies of the human brain's reward system, which runs on dopamine, are very relevant.
  • Psychology: much work has been done on why animals make certain decisions.
  • Math: Operations Research.
  • Economics: Game Theory, Utility Theory and Bounded Rationality all work on the same question.

So it’s really common to a lot of branches, and it is a general approach to solving reward-based problems.

RL vs. Other Learning:

  • There is no supervisor; the agent learns by trial and error and improves using the reward signal.
  • The reward signal may or may not be instantaneous.
  • Time is relevant: the data is sequential, and what happens at one timestep strongly influences the next.
  • In an RL setting, the ‘Agent’ gets to influence the data that it sees: its actions affect the environment.

Real Life use cases:

  • Helicopter stunts (aerobatic manoeuvres)
  • Backgammon Game
  • Investment Management
  • Power Station Control
  • Teaching a Humanoid to Walk
  • Atari Games

RL Problem

  • Reward: Rt is just a scalar feedback signal. It indicates how well the ‘Agent’ is doing at timestep t. The agent’s goal is to maximise cumulative reward, i.e. the sum of rewards.

(Informal) Reward Hypothesis: all goals can be described by the maximisation of expected cumulative reward.

Even when a goal seems to involve several objectives, they can be weighed against one another by comparing the actions they favour. All goals can therefore be reduced to a scale of comparisons, and hence ultimately expressed as a single scalar reward.

By definition, the goal may be an intermediate goal, a final goal, time-based, etc.

First step is understanding the Reward Signal.
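As a rough, made-up illustration of the reward hypothesis (the reward values below are not from the lecture), the quantity the agent ultimately cares about is just the sum of these scalar signals; discounting is introduced later in these notes:

```python
# Hypothetical scalar rewards received at successive timesteps.
rewards = [0.0, 1.0, -0.5, 2.0]

# The (undiscounted) cumulative reward the agent tries to maximise.
total_reward = sum(rewards)
print(total_reward)  # 2.5
```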

Sequential Decision Making

  • Goal: select actions to maximise total future reward.
  • Actions have long-term consequences, so we have to think ahead; greedy approaches may not be useful here.
  • Rewards may be delayed (not immediate).
  • It may be better to sacrifice an immediate reward for a (greater) long-term reward.

Formalism

Agent

We control the brain here: the brain is the agent.

  • The agent can take actions.
  • At each step, the action is based on the observation at that timestep.
  • It receives a reward signal.
  • We have to figure out the algorithm that decides actions based on these.

Environment:

The environment exists outside the agent.

At every step, the agent receives observations which are generated by the environment and the agent itself influences the environment by making actions.

The machine learning problem in RL is about the stream of data coming from this trial-and-error interaction.
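A minimal sketch of this interaction loop. The `Environment` and `Agent` classes below are made-up placeholders (not from the lecture), just to make the observation/reward/action cycle concrete:

```python
import random

# Hypothetical placeholder environment and agent, only to illustrate the loop.
class Environment:
    def step(self, action):
        """Apply the agent's action; emit the next observation and a scalar reward."""
        observation = random.random()         # stand-in for whatever the agent sees
        reward = 1.0 if action == 1 else 0.0  # stand-in for the reward signal R_t
        return observation, reward

class Agent:
    def act(self, observation, reward):
        """Choose the next action based on the latest observation and reward."""
        return random.choice([0, 1])          # stand-in for the RL algorithm

env, agent = Environment(), Agent()
observation, reward = 0.0, 0.0
for t in range(10):
    # Agent acts -> environment responds -> agent sees a new observation and reward.
    action = agent.act(observation, reward)
    observation, reward = env.step(action)
```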

History and State

The history is the stream of experience: the sequence of observations, actions and rewards.

H(t) is:

  • The history of all observations, actions and rewards up to time t.
  • Our next action depends on the history.
  • We build a mapping (algorithm) from the history H(t) to the next action.
  • The environment emits observations and rewards based on H(t).

State: it’s a summary of the information needed to determine the next action. It captures whatever we need from the history to determine what should happen next.

  • State is a function of History.
  • S(t) = f[H(t)]
  • Def 1: Environment State: the information used inside the environment to determine what happens next (the next observation and reward), based on the environment’s history. This is usually not visible to the agent, and even when it is, much of it is irrelevant to the agent’s algorithm.

Note: For a multi-agent problem, an agent can consider other agents as part of the Environment.

  • Def 2: Agent State: the set of numbers living inside our algorithm. It summarises the agent’s information and internal state, is used by our RL algorithm, and determines the next action. The function that builds it is defined by us (see the toy sketch below).
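As a toy illustration of the agent state being a function we define ourselves, here is one deliberately simple (and entirely illustrative) choice of f: keep only the most recent observation from the history.

```python
# History: a made-up sequence of (observation, reward, action) tuples.
history = [
    ("obs_1", 0.0, "left"),
    ("obs_2", 1.0, "right"),
    ("obs_3", 0.5, "left"),
]

def agent_state(history):
    """One possible agent state S(t) = f[H(t)]: just the latest observation."""
    return history[-1][0]

print(agent_state(history))  # obs_3
```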

Information State: Markov State

An information state contains all useful information from the history.

Markov Property: a state is Markov if and only if the probability of the next state given the current state is the same as the probability of the next state given all previous states. In other words, only the current state determines the next state; the rest of the history is not relevant.
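Written out as a formula, this is just a restatement of the definition above:

```latex
% Markov property: the next state depends only on the current state,
% not on the rest of the history.
P\left[S_{t+1} \mid S_t\right] = P\left[S_{t+1} \mid S_1, S_2, \ldots, S_t\right]
```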

In other words, if the Markov property holds, the future is independent of the past given the present, since the state characterises everything relevant about the past.

Another definition: the state is a sufficient statistic of the future.

By definition: Environment state is Markov.

And the entire history is also a Markov state. (Not a useful one)

  1. Fully Observable Environment: the agent gets to see the complete environment state. Agent State = Environment State = Information State. This is known as an MDP (Markov Decision Process).
  2. Partially Observable Environment: the agent observes the environment indirectly. In this case, Agent State != Environment State. This is known as a Partially Observable MDP (POMDP). Possible agent state representations:
  • Naive approach: Agent State = complete history.

  • Bayesian approach: we develop beliefs, where the agent state is a vector of probabilities, used to select the next action.
  • RNN: a combination of the previous state and the latest observation, i.e. a linear transformation of the old state and the new observation, with a non-linearity applied on top (see the sketch after this list).
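A minimal numpy sketch of the RNN-style state update from the last bullet. The weight matrices and the tanh non-linearity are illustrative choices, not something specified in the lecture:

```python
import numpy as np

state_dim, obs_dim = 4, 3
rng = np.random.default_rng(0)

# Illustrative (randomly initialised) weights; in practice these would be learned.
W_s = rng.normal(size=(state_dim, state_dim))  # mixes in the previous state
W_o = rng.normal(size=(state_dim, obs_dim))    # mixes in the new observation

def update_state(prev_state, observation):
    """New agent state: linear combination of old state and observation,
    pushed through a non-linearity."""
    return np.tanh(W_s @ prev_state + W_o @ observation)

state = np.zeros(state_dim)
observation = rng.normal(size=obs_dim)
state = update_state(state, observation)
```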

Inside an RL Agent

An RL Agent may (or may not) include one of these:

  • Policy: it’s how the agent picks its actions. It maps states to actions.
  • Value Function: how good each state and/or action is.
  • Model: the agent’s representation of how the environment works.

Policy:

It’s a map from state to action. It determines what the agent will do if it’s in a given state.

  • Deterministic Policy: a = F(S). We want to learn this policy from experience, and we want to maximise the reward via it.
  • Stochastic Policy: allows random exploratory actions. It’s a probability distribution (stochastic map) over actions given a state. (A minimal sketch of both follows below.)
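A minimal sketch of the two kinds of policy on a made-up two-state, two-action problem (the states, actions and probabilities are illustrative only):

```python
import random

# Deterministic policy: a fixed mapping from state to action, a = F(s).
deterministic_policy = {"s0": "left", "s1": "right"}

# Stochastic policy: a probability distribution over actions for each state.
stochastic_policy = {
    "s0": {"left": 0.9, "right": 0.1},
    "s1": {"left": 0.2, "right": 0.8},
}

def act_deterministic(state):
    return deterministic_policy[state]

def act_stochastic(state):
    probs = stochastic_policy[state]
    # Sample an action according to its probability (allows exploration).
    return random.choices(list(probs), weights=list(probs.values()))[0]

print(act_deterministic("s0"), act_stochastic("s0"))
```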

Value Function:

It’s a prediction of expected future reward. We choose between actions by going for the highest expected reward; the value function gives an estimate of this.

The value function depends on the way in which we are behaving, i.e. on the policy. It tells us how much reward we can expect if we follow the policy, and thus helps us optimise our behaviour.

Gamma: the discount factor. It controls how much we care about immediate versus later rewards, and it sets the horizon for evaluating the future (the horizon is how far ahead we need to account for the outcomes of our actions).
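A minimal sketch of the quantity the value function estimates: the discounted sum of future rewards. The reward sequence and gamma below are made up for illustration:

```python
# Hypothetical future rewards R_{t+1}, R_{t+2}, ...
rewards = [1.0, 0.0, 2.0, -1.0]
gamma = 0.9  # discount factor: how much we care about later rewards

# Discounted return: R_{t+1} + gamma*R_{t+2} + gamma^2*R_{t+3} + ...
discounted_return = sum(gamma**k * r for k, r in enumerate(rewards))

# The value of a state is (an estimate of) the expectation of this quantity
# when starting from that state and following the current policy.
print(discounted_return)
```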

Model:

It’s used to learn the environment: it predicts what the environment will do next. It isn’t necessary to build a model of the environment, but it is often useful when we do.

It can be divided into two parts (a minimal sketch follows the list):

  1. State Transition Model: predicts the dynamics of the environment, i.e. the next state, given the current state and action.
  2. Reward Model: predicts the expected immediate reward, given the current state (and action).
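A minimal sketch of a tabular model with the two parts above; the states, actions and numbers are made up:

```python
# Tabular model for a toy problem: keyed by (state, action).
transition_model = {
    ("s0", "right"): "s1",   # predicted next state
    ("s1", "right"): "s0",
}
reward_model = {
    ("s0", "right"): 0.0,    # predicted immediate reward
    ("s1", "right"): 1.0,
}

def predict(state, action):
    """Model prediction: next state and expected immediate reward."""
    return transition_model[(state, action)], reward_model[(state, action)]

print(predict("s0", "right"))  # ('s1', 0.0)
```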

RL Agents:

We categorise our agents based on which of the three components above they use. For example, a value-based agent stores a value function, and its policy is implicit (derived from the value function).

  • Value Based
  • Policy Based
  • Actor Critic

Policy Based: explicitly maintains a data structure for the policy (the mapping from states to actions) without storing a value function.

Actor Critic: combines both the policy and the value function.

So RL Problems can be categorized as:

  • Model Free: we don’t try to model the environment; we go directly from experience to a policy and/or value function.
  • Model Based: the agent first builds a model of the environment and uses it to decide how to act.

Problems within RL

Learning and Planning:

There are two problems when it comes to sequential decision making.

  • RL Problem:
  1. The environment is initially unknown (the agent learns about it via trial and error).
  2. The agent interacts with the environment.
  3. The agent improves its policy.
  • Planning:
  1. A model of the environment is known to the agent.
  2. The agent performs computations on its model (without any external interaction) and improves its policy.

Exploration Vs Exploitation

Another key aspect of RL.

  • RL is like trial and error learning.
  • We might miss out on rewards when we are exploring.
  • We want to figure out the best policy.

Exploration: choosing to give up some known reward in order to find out more about the environment.

Exploitation: Exploits known information to maximise reward.

There is an Exploration Vs Exploitation Tradeoff.

Prediction and Control

Prediction: An estimate of the future, given the current policy

Control: Find the Best policy.

In RL, we typically need to evaluate our policies (prediction) in order to find the best one (control).
