In this episode: Q-Values, Reinforcement Learning, and more. Make sure to check out part 1 and part 2 too!

Introduction

Today, we're gonna learn how to create a virtual agent that discovers how to interact with the world. The technique we're going to use is called Q-Learning, and it's super cool.

The agent, the state and the goal

Let's take a look at our agent!

Henry, the robot

This is Henry, he's a young virtual robot who has a dream: travel the world. The problem is he doesn't know much about the world. In fact, he doesn't even know how to move! He only knows his GPS location, the position of his feet, and whether they are on the ground.

What Henry perceives of the world

These elements can be split into two parts: the state, and the goal. The state of our agent is the set of information relative to his body, while the goal is what the agent wants to increase. In our case, the agent wants to use his feet to change his position.

State: [Right foot extension, Right foot on ground, Left foot extension, Left foot on ground]
Goal: [Position]

The actions

Fortunately, Henry is capable of performing actions. Even though he cannot perceive much of the world, he knows what he is capable of doing, and that depends on his state. For example, when both his feet are on the ground, he can lift one. And when one foot is in the air, he can move it forward or backward.

The actions available to our robot, depending on his state. NB: In reality, the diagram is a bit more complex, as the robot takes into account the extension of the foot.

Here's Henry performing random actions

Policies

To reach his goal, the agent follows a set of instructions called a policy. These instructions can be learnt by the agent or written down manually by humans.

Following a hard-coded policy

Let's write down a simple policy for our agent to follow.

A simple, hard-coded policy

The agent following the simple policy. Look at him go!

This looks promising, although the robot is losing a lot of time lifting and putting down his feet. It's not really efficient.

In fact, we rarely write good policies manually. This is either because we don't know any (e.g. for a complex task), or because they are often not robust (if the agent somehow starts with the left foot up, the policy above would immediately fail as the agent cannot execute the desired action).

Q-Learning

To help our agent fulfil his dream, we will use a reinforcement learning technique called Q-Learning to help the robot learn a robust and efficient policy. This technique consists in attributing a number, or Q-Value, to the event of performing a certain action when the agent is in a certain state. This value represents the progress made with this action.

Some State-Action pairs associated with their Q-Value

This value is determined by a reward function, indicating whether the action had a positive or negative impact on reaching the goal. In our case, if Henry moves away from where he was, it's good; if he's going back, it's bad.

Our agent's basic reward function

The key point here is that it's not the action itself that has value, but rather the fact of performing an action when the agent is in a specific state.

With the knowledge of all Q-Values, the agent could, at each step, pick the action which would bring the best reward depending on his state. This would be his policy.
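To make this concrete, here is a minimal sketch of what a Q-table and the corresponding greedy policy could look like in Python. The state encoding, action names and values are made up for illustration; they are not taken from the actual project code linked at the end of the article.

```python
# A state is a tuple: (right foot extension, right foot on ground,
#                      left foot extension, left foot on ground)
# A tiny, made-up Q-table mapping (state, action) pairs to Q-Values.
q_table = {
    ((0, True, 0, True), "lift_right"): 0.0,
    ((0, True, 0, True), "lift_left"): 0.0,
    ((0, False, 0, True), "move_right_forward"): 1.0,
    ((0, False, 0, True), "move_right_backward"): -1.0,
}

def available_actions(state):
    """The legal actions depend on the state (which feet are on the ground)."""
    _, right_on_ground, _, left_on_ground = state
    if right_on_ground and left_on_ground:
        return ["lift_right", "lift_left"]
    if not right_on_ground:
        return ["move_right_forward", "move_right_backward", "put_right_down"]
    return ["move_left_forward", "move_left_backward", "put_left_down"]

def greedy_policy(state):
    """If all Q-Values were known, the policy would simply pick the best-valued action."""
    actions = available_actions(state)
    return max(actions, key=lambda action: q_table.get((state, action), 0.0))

print(greedy_policy((0, False, 0, True)))  # -> move_right_forward
```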
The issue here is that Henry doesn't have a clue of what the Q-Values are! This means he is not able to pick an action that will bring him closer to his goal! Fortunately, we have a way of making him discover the Q-Values by himself.

Learning the Q-Values with the Epsilon-Greedy algorithm

Training the agent

To learn the Q-Values of his State-Action pairs, Henry must try these actions by himself. The problem is that there might be billions of possible combinations, and the program cannot try them all. We want Henry to try as many combinations as possible, but focus on the best ones.

To achieve that, we are going to use the Epsilon-Greedy algorithm to train the robot. It works like this: at each step, the robot has a probability epsilon of performing a random available action; otherwise it picks the best action according to the Q-Values it knows. The result of the action is used to update the Q-Values.

Updating the Q-Values and the problem of distant rewards

A simple way to update the Q-Values would be to replace the value the robot has in memory by the value he has just experienced by making the action. However, this brings a really problematic issue: the robot cannot see more than one step in the future. This makes the robot blind to any future reward.

Lifting a foot is necessary to walk, but how could Henry discover that it is a good action, when it doesn't bring any immediate reward?

The solution to this problem is given by the Bellman equation, which proposes a way to account for future rewards.

Bellman equation

First, instead of replacing the old value by the new value, the old value fades away at a certain rate (alpha, the learning rate). This enables the robot to take noise into account. Some actions might work sometimes, and sometimes not. With this progressive evolution of the Q-Value, one faulty reward doesn't mess up the whole system.

Also, the new value is calculated using not only the immediate reward, but also the expected maximal value. This value consists of the best possible reward we expect to receive from the actions available next. This has a dramatic impact on the effectiveness of the learning process. With this, the rewards are propagated back in time.

With this change, the Q-Value of raising a foot has become positive, as it benefits from the optimal expected future value of taking a step forward.

Q-Values updated with the Bellman equation. Note how putting a foot down is now seen as positive instead of neutral, as it benefits from the reward of taking the next step.

Exploration vs Exploitation dilemma

By playing on the value of epsilon, we are facing the dilemma of Exploration vs Exploitation. An agent with a high epsilon will try mostly random actions (exploration), while an agent with a low epsilon will rarely try new actions (exploitation). We need to find a sweet spot that will enable our robot to try many new things, without wasting too much time on unpromising leads.

Two agents, trained with different epsilons, following their best policy. Here, the agent that focused more on exploration discovered a very effective technique to move. In contrast, the yellow robot doesn't take full advantage of the extension of his feet because it didn't spend enough time trying random actions. Note that a balance must be found, as spending too much time on exploration prevents the agent from learning complex policies.
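Putting the pieces together, here is a minimal sketch of how an Epsilon-Greedy training step with the Bellman-style update could look in Python. The environment object, names and hyperparameter values are hypothetical placeholders rather than the actual code of the project linked below, and the discount factor gamma (which weights the expected future value) is a standard addition the article doesn't name explicitly.

```python
import random

alpha = 0.1    # learning rate: how fast the old Q-Value fades away
gamma = 0.9    # discount factor: how much the expected future value counts
epsilon = 0.2  # probability of picking a random action (exploration)

q_table = {}   # (state, action) -> Q-Value, 0 by default for unseen pairs

def choose_action(state, actions):
    """Epsilon-Greedy: explore with probability epsilon, otherwise exploit."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda action: q_table.get((state, action), 0.0))

def update_q_value(state, action, reward, next_state, next_actions):
    """Q(s, a) <- (1 - alpha) * Q(s, a) + alpha * (reward + gamma * max Q(s', a'))"""
    old_value = q_table.get((state, action), 0.0)
    best_future = max((q_table.get((next_state, a), 0.0) for a in next_actions), default=0.0)
    q_table[(state, action)] = (1 - alpha) * old_value + alpha * (reward + gamma * best_future)

# Training loop sketch, assuming a hypothetical environment with reset(),
# available_actions() and step(); the reward is the change in GPS position.
def train(env, episodes=1000, steps_per_episode=200):
    for _ in range(episodes):
        state = env.reset()
        for _ in range(steps_per_episode):
            actions = env.available_actions(state)
            action = choose_action(state, actions)
            next_state, reward = env.step(action)
            update_q_value(state, action, reward, next_state, env.available_actions(next_state))
            state = next_state
```

A common trick, by the way, is to start with a high epsilon and let it decay over training, so the agent explores a lot at first and exploits its knowledge later on.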
Conclusion

Henry's travelling the world at an astonishing pace! He managed to find by himself a better way to move than the simple policy we gave him, plus he can recover from small errors, as from any state he only has to follow his trusted Q-Values to guide him!

Look at that!

You can play with the code here: QLearningWalkingRobot - A simple Q-Learning example with a cute walking robot (github.com/despoisj/QLearningWalkingRobot)

🎉 You've reached the end! I hope you enjoyed this article. If you did, please like it, share it, subscribe to the newsletter, send me pizzas, follow me on Medium, or do whatever you feel like doing! 🎉

If you like Artificial Intelligence, subscribe to the newsletter to receive updates on articles and much more!