From zero to hero, step by step.

Kai (Image by the author)

Welcome to my reinforcement learning course! ❤️

Let’s walk this beautiful path from the fundamentals to cutting-edge reinforcement learning (RL), step by step, with coding examples and tutorials in Python, together!

This first part covers the bare minimum of concepts and theory you need to embark on this journey. Then, in each following chapter, we will solve a different problem, with increasing difficulty.

Ultimately, the most complex RL problems involve a mixture of reinforcement learning algorithms, optimization, and deep learning. You do not need to know deep learning (DL) to follow along with this course. I will give you enough context to get you familiar with the DL philosophy and understand how it becomes a crucial ingredient in modern reinforcement learning.

Part 1

In this first lesson, we will cover the fundamentals of reinforcement learning with examples, zero maths, and a bit of Python.

Contents
- What is a reinforcement learning problem? 🤔
- Policies 👮🏽 and value functions.
- How to generate the training data? 📊
- Python boilerplate code. 🐍
- Recap ✨
- Homework 📚
- What’s next? ❤️

Let’s start!

1. What is a reinforcement learning problem? 🤔

Reinforcement Learning (RL) is an area of Machine Learning (ML) concerned with learning problems where

An intelligent agent 🤖 needs to learn, through trial and error, how to take actions inside an environment 🌎 in order to maximize a cumulative reward.

Reinforcement learning is the kind of machine learning closest to how humans and animals learn.

What is an agent? And an environment? What exactly are these actions the agent can take? And the reward? Why do you say cumulative reward?

If you are asking yourself these questions, you are on the right track.

The definition I just gave introduces a bunch of terms that you might not be familiar with. In fact, they are ambiguous on purpose.
This generality is what makes RL applicable to a wide range of seemingly different learning problems. This is the philosophy behind mathematical modelling, which lies at the roots of RL.

Let’s take a look at a few learning problems, and see them through the lens of reinforcement learning 🔍.

Example 1: learning to walk 🚶🏽‍♀️🚶🏿🚶‍♀️

As a father of a baby who recently started walking, I cannot stop asking myself: how did he learn that?

Kai and Pau (Image by the author)

As a Machine Learning engineer, I fantasize about understanding and replicating that incredible learning curve with software and hardware.

Let’s try to model this learning problem using the RL ingredients:

- The agent is my son, Kai. He wants to stand up and walk, and his muscles are now strong enough to give him a chance at it. The learning problem for him is how to sequentially adjust his body position, including several angles on his legs, waist, back, and arms, to balance his body and not fall.
- The environment is the physical world surrounding him, including the laws of physics. The most important of these is gravity. Without gravity the learning-to-walk problem would drastically change, and even become irrelevant: why would you want to walk in a world where you can simply fly? Another important law in this learning problem is Newton’s third law, which in plain words says that if you fall on the floor, the floor is going to hit you back with the same strength. Ouch!
- The actions are all the updates in these body angles, which determine his body position and speed as he starts chasing things around. Sure, he can do other things at the same time, like imitating the sound of a cow, but these probably do not help him accomplish his goal. We ignore these actions in our framework. Adding unnecessary actions does not change the modeling step, but it makes the problem harder to solve later on.
- The reward he receives is a stimulus coming from the brain that makes him happy or makes him feel pain. There is the negative reward he experiences when falling on the floor, which is physical pain, maybe followed by frustration. On the other side, several things contribute positively to his happiness, like getting to places faster 👶🚀, or the external stimulus that comes from my wife Jagoda and me when we say “Good job!” or “Bravo!” to each attempt and marginal improvement he shows.

An important (and obvious) remark is that Kai does not need to learn Newtonian physics to stand up and walk. He will learn by observing the state of the environment, taking an action, and collecting a reward from this environment. He does not need to learn a model of the environment to achieve his goal.

💰 A little bit more about rewards

The reward is a signal to Kai that what he has been doing is good or bad for his learning. As he takes new actions and experiences pain or happiness, he starts to adjust his behavior to collect more positive feedback and less negative feedback. In other words, he learns.

Some actions might seem very appealing to the baby at the beginning, like trying to run to get a boost of excitement. However, he soon learns that in some (or most) cases he ends up falling on his face, and experiencing an extended period of pain and tears.

This is why intelligent agents maximize cumulative reward, and not marginal reward. They trade short-term rewards for long-term ones. An action that gives an immediate reward, but leaves the body in a position about to fall, is not an optimal one.

Great happiness followed by greater pain is not a recipe for long-term well-being. This is something that babies often learn more easily than we grown-ups do.

The frequency and intensity of the rewards are key for helping the agent learn. Very infrequent (sparse) feedback means harder learning.
Think about it: if you do not know whether what you do is good or bad, how can you learn? This is one of the main reasons why some RL problems are harder than others.

Reward shaping is a tough modeling decision for many real-world RL problems.

Example 2: learning to play Monopoly like a pro 🎩🏨💰

As a kid, I spent a lot of time playing Monopoly with friends and relatives. Well, who hasn’t? It is an exciting game that combines luck (you roll the dice) and strategy.

Monopoly is a real-estate board game for two to eight players. You roll two dice to move around the board, buying and trading properties, and developing them with houses and hotels. You collect rent from your opponents, with the goal being to drive them into bankruptcy.

Photo by Suzy Hazelwood from Pexels

If you were so into this game that you wanted to find intelligent ways to play it, you could use some reinforcement learning.

What would the 4 RL ingredients be?

- The agent is you, the one who wants to win at Monopoly.
- Your actions are the ones you see in the screenshot below:

Action space in Monopoly. Credits to aleph aseffa

- The environment is the current state of the game, including the list of properties, positions, and cash amounts each player has. There is also the strategy of your opponent, which is something you cannot predict and lies outside of your control.
- And the reward is 0, except in your last move, where it is +1 if you win the game, and -1 if you go bankrupt. This reward formulation makes sense but makes the problem hard to solve. As we said above, a sparser reward means a harder problem. Because of this, there are other ways to model the reward, making it noisier but less sparse.

When you play against another person in Monopoly, you do not know how she or he will play. What you can do is play against yourself. As you learn to play better, your opponent does too (because it is you), forcing you to level up your game to keep on winning. You see the positive feedback loop.
This trick is called self-play. It gives us a path to bootstrap intelligence without using the external advice of an expert player.

Learning exclusively through self-play, without human game records, is what sets AlphaGo Zero apart from AlphaGo, the two models developed by DeepMind that play the game of Go better than any human.

Example 3: learning to drive 🚗

In a matter of decades (maybe less), machines will drive our cars, trucks, and buses.

Photo by Ruiyang Zhang from Pexels

But, how?

Learning to drive a car is not easy. The goal of the driver is clear: to get from point A to point B, comfortably for her and any passengers on board.

Many aspects external to the driver make driving challenging, including:

- other drivers’ behavior
- traffic signs
- pedestrian behavior
- pavement conditions
- weather conditions
- ... even fuel optimization (who wants to spend extra on this?)

How would we approach this problem with reinforcement learning?

- The agent is the driver who wants to get from A to B, comfortably.
- The state of the environment the driver observes has lots of things, including the position, speed and acceleration of the car, all other cars, passengers, road conditions or traffic signs. Transforming such a big vector of inputs into an appropriate action is challenging, as you can imagine.
- The actions are basically three: the direction of the steering wheel, throttle intensity and brake intensity.
- The reward after each action is a weighted sum of the different aspects you need to balance when driving. A decrease in distance to point B brings a positive reward, while an increase brings a negative one. To ensure there are no collisions, getting too close to (or even colliding with) another car or a pedestrian should carry a very large negative reward. Also, in order to encourage smooth driving, sharp changes in speed or direction contribute a negative reward.
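The weighted-sum reward described for the driving example can be sketched in a few lines of Python. This is only an illustration under made-up assumptions: the DrivingStep fields, the weights, the safe-gap threshold and the collision penalty are hypothetical values chosen for readability, not numbers from a real driving simulator.

```python
from dataclasses import dataclass


@dataclass
class DrivingStep:
    """Hypothetical summary of what changed during one time step."""
    distance_before: float  # metres to point B before the action
    distance_after: float   # metres to point B after the action
    min_gap: float          # metres to the closest car or pedestrian
    speed_change: float     # absolute change in speed (m/s)
    steering_change: float  # absolute change in steering angle (radians)


# Illustrative weights and thresholds (assumptions, not tuned values).
W_PROGRESS = 1.0
W_SMOOTH_SPEED = 0.1
W_SMOOTH_STEER = 0.5
SAFE_GAP = 2.0
COLLISION_PENALTY = 100.0


def driving_reward(step: DrivingStep) -> float:
    """Weighted sum of progress, smoothness and safety terms."""
    # Positive when the car got closer to B, negative when it moved away.
    progress = step.distance_before - step.distance_after
    reward = W_PROGRESS * progress
    # Sharp changes in speed or direction are penalized.
    reward -= W_SMOOTH_SPEED * step.speed_change
    reward -= W_SMOOTH_STEER * step.steering_change
    # Getting too close to anything carries a very large negative reward.
    if step.min_gap < SAFE_GAP:
        reward -= COLLISION_PENALTY
    return reward
```

With these numbers, a smooth step that brings the car 2 metres closer to B yields a reward of 2.0, while the same step taken less than 2 metres from a pedestrian yields -98.0. Choosing the weights is exactly the kind of reward-shaping decision that makes real-world RL problems tricky.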
After these 3 examples, I hope the following representation of RL elements and how they play together makes sense:

Reinforcement learning ingredients (Image by the author)

Now that we understand how to formulate an RL problem, we need to solve it. How? Keep on reading!

2. Policies and value functions

Policies

The agent picks the action she thinks is best based on the current state of the environment. This is the agent’s strategy, commonly referred to as the agent’s policy.

A policy is a learned mapping from states to actions.

Solving a reinforcement learning problem means finding the best possible policy.

Policies are either deterministic, when they map each state to one action, or stochastic, when they map each state to a probability distribution over all possible actions.

Stochastic is a word you often read and hear in Machine Learning, and it essentially means uncertain, random. In environments with high uncertainty, like Monopoly where you are rolling dice, stochastic policies tend to work better than deterministic ones.

There exist several methods to actually compute this optimal policy. These are called policy optimization methods.

Value functions

Sometimes, depending on the problem, instead of directly trying to find the optimal policy, one can try to find the value function associated with that optimal policy.

But, what is a value function? And before that, what does value mean in this context?

The value is a number associated with each state s of the environment that estimates how good it is for the agent to be in state s. It is the cumulative reward the agent expects to collect when starting at state s and choosing actions according to policy π.

A value function is a learned mapping from states to values.

The value function of a policy π is commonly denoted as vπ(s).

Value functions can also map pairs of (action, state) to values. In this case, they are called q-value functions.
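To make the ideas of policy and value more concrete, here is a minimal sketch on a made-up problem: a corridor of five cells (states 0 to 4) where reaching the rightmost cell pays +10 and every other move costs -1. The environment, the reward numbers and all function names are assumptions invented for this illustration. The value of a state is estimated by simply averaging the cumulative reward over many episodes that start there.

```python
import random
from typing import Callable, Tuple

GOAL = 4  # rightmost cell of a hypothetical 5-cell corridor (states 0..4)


def step(state: int, action: str) -> Tuple[int, int, bool]:
    """Toy environment: returns (next_state, reward, done)."""
    if action == "right":
        next_state = min(state + 1, GOAL)
    else:
        next_state = max(state - 1, 0)
    if next_state == GOAL:
        return next_state, 10, True
    return next_state, -1, False


def deterministic_policy(state: int) -> str:
    """Maps each state to exactly one action."""
    return "right"


def stochastic_policy(state: int) -> str:
    """Maps each state to a distribution over actions, then samples from it."""
    return random.choices(["left", "right"], weights=[0.2, 0.8])[0]


def cumulative_reward(start: int, policy: Callable[[int], str],
                      max_steps: int = 1000) -> int:
    """Total reward collected in one episode that starts at `start`."""
    state, total = start, 0
    for _ in range(max_steps):
        state, reward, done = step(state, policy(state))
        total += reward
        if done:
            break
    return total


def estimate_value(start: int, policy: Callable[[int], str],
                   n_episodes: int = 2000) -> float:
    """Average cumulative reward over many episodes starting at `start`."""
    returns = [cumulative_reward(start, policy) for _ in range(n_episodes)]
    return sum(returns) / n_episodes
```

For example, cumulative_reward(0, deterministic_policy) returns 7 (three moves at -1, then +10), while estimate_value(0, stochastic_policy) comes out lower, because every occasional step to the left costs extra moves. In this toy corridor, with no uncertainty in the environment, the deterministic policy is the better one.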
The optimal value function (or q-value function) satisfies a mathematical equation, called the Bellman equation. This equation is useful because it can be transformed into an iterative procedure to find the optimal value function.

But, why are value functions useful? Because you can infer an optimal policy from an optimal q-value function.

How? The optimal policy is the one where at each state s the agent chooses the action a that maximizes the q-value function.

So, you can jump from optimal policies to optimal q-functions, and vice versa 😎.

There are several RL algorithms that focus on finding optimal q-value functions. These are called Q-learning methods.

The zoo of reinforcement learning algorithms 🐘🐅🦒

There are lots of different RL algorithms. Some try to directly find optimal policies, others q-value functions, and others both at the same time.

The zoo of RL algorithms is diverse and a bit intimidating.

There is no one-size-fits-all when it comes to RL algorithms. You need to experiment with a few of them each time you solve an RL problem and see what works for your case.

As you follow along this course, you will implement several of these algorithms and gain an insight into what works best in each situation.

3. How to generate training data? 📊

Reinforcement learning agents are VERY data-hungry.

Photo by Karsten Winegeart

To solve RL problems you need a lot of data. A way to overcome this hurdle is by using simulated environments. Writing the engine that simulates the environment usually requires more work than solving the RL problem. Also, changes between different engine implementations can render comparisons between algorithms meaningless.

This is why the folks at OpenAI released the Gym toolkit back in 2016.
OpenAI’s Gym offers a standardized API for a collection of environments for different problems, including the classic Atari games, robotic arms, or landing on the Moon (well, a simplified version).

There are proprietary environments too, like MuJoCo (recently bought by DeepMind). MuJoCo is an environment where you can solve continuous control tasks in 3D, like learning to walk 👶.

OpenAI Gym also defines a standard API to build environments, allowing third parties (like you) to create your own environments and make them available to others.

If you are interested in self-driving cars, then you should check out CARLA, the most popular open urban driving simulator.

4. Python boilerplate code 🐍

You might be thinking:

What we covered so far is interesting, but how do I actually write all this in Python?

And I completely agree with you 😊

Let’s see what all this looks like in Python.

```python
import random


def train(n_episodes: int):
    """
    Pseudo-code of a Reinforcement Learning agent training loop.

    load_env, get_rl_agent and get_epsilon are placeholders that
    you would implement for your own problem.
    """

    # python object that wraps all environment logic. Typically you will
    # be using OpenAI gym here.
    env = load_env()

    # python object that wraps all agent policy (or value function)
    # parameters, and action generation methods.
    agent = get_rl_agent()

    for episode in range(0, n_episodes):

        # random start of the environment
        state = env.reset()

        # epsilon is a parameter that controls the exploitation-exploration
        # trade-off. It is good practice to set a decaying value for epsilon.
        epsilon = get_epsilon(episode)

        done = False
        while not done:

            if random.uniform(0, 1) < epsilon:
                # Explore action space
                action = env.action_space.sample()
            else:
                # Exploit learned values (or policy)
                action = agent.get_best_action(state)

            # environment transitions to next state and maybe rewards the agent.
            next_state, reward, done, info = env.step(action)

            # adjust agent parameters. We will see how later in the course.
            agent.update_parameters(state, action, reward, next_state)

            state = next_state
```

Did you find something unclear in this code? What about the line that calls get_epsilon? What is this epsilon?

Don’t panic. I didn’t mention this before, but I won’t leave you without an explanation.

Epsilon is a key parameter to ensure our agent explores the environment enough before drawing definite conclusions about the best action to take in each state.

It is a value between 0 and 1, and it represents the probability that the agent chooses a random action instead of the one she thinks is best.

This tradeoff between exploring new strategies and sticking to already known ones is called the exploration-exploitation problem. This is a key ingredient in RL problems and something that distinguishes RL problems from supervised machine learning.

Technically speaking, we want the agent to find the global optimum, not a local one.

It is good practice to start your training with a large value (e.g. 50%) and progressively decrease it after each episode. This way the agent explores a lot at the beginning, and less as she perfects her strategy.

5. Recap ✨

The key takeaways for this 1st part are:

- Every RL problem has an agent (or agents), environment, actions, states and rewards.
- The agent sequentially takes actions with the goal of maximizing total rewards. For that she needs to find the optimal policy.
- Value functions are useful as they give us an alternative path to find the optimal policy.
- In practice, you need to try different RL algorithms for your problem, and see what works best.
- RL agents need a lot of training data to learn. OpenAI Gym is a great tool to re-use existing environments and create your own.
- Balancing exploration and exploitation is necessary when training RL agents, to ensure the agent does not get stuck in local optima.

6. Homework 📚

A course without a bit of homework would not be a course.

I want you to pick a real-world problem that interests you and that you could model and solve using reinforcement learning. Pick a problem you care about. These are the ones you want to spend your precious time on.

Define what the agent(s), actions, states, and rewards are.

Feel free to send me an e-mail at plabartabajo@gmail.com with your problem, and I will give you feedback.

7. What’s next?

In part 2 we solve our first reinforcement learning problem using Q-learning.

See you there!

Do you want to become an (even) better data scientist, and access top courses about machine learning and data science?

👉🏽 Subscribe to the datamachines newsletter.

Have a great day 🧡❤️💙

Pau
