Code and demo are available on Github. This article explores what states, actions and rewards are in reinforcement learning, and how an agent can learn through simulation to determine the best actions to take in any given state.

Intuition

After a long day at work, you are deciding between 2 choices: to head home and write an article, or to hang out with friends at a bar. If you choose to hang out with friends, your friends will make you feel happy; whereas if you head home to write an article, you'll end up feeling tired after a long day at work. In this example, enjoying yourself is a reward and feeling tired is viewed as a negative reward, so why write articles?

Because in life, we don't just think about immediate rewards; we plan a course of actions to determine the possible future rewards that may follow. Perhaps writing an article will brush up your understanding of a particular topic, get you recognised, and ultimately land you that dream job you've always wanted. In this scenario, getting your dream job is a delayed reward from a list of actions you took, so we want to assign some value for being at those states (for example, "going home to write an article"). The measure we use to determine the value of a state is called the "value function".

So how do we learn from our past? Let's say you made some great decisions and are in the best state of your life. Now look back at the various decisions you made to reach this stage: what do you attribute your success to? What are the previous states that led you to this success? What actions did you take in the past that led you to this state and this reward? How is the action you are taking now related to the potential reward you may receive in the future?

Reward vs Value Function

A reward is immediate. It can be scoring points in a game for collecting coins, winning a match of tic-tac-toe or securing your dream job. The reward is what you (or the agent) want to acquire.

In order to acquire the reward, the value function is an efficient way to determine the value of being in a state. Denoted by V(s), the value function measures the potential future rewards we may get from being in state s.

Define the Value Function

In figure 1, how do we determine the value of state A? There is a 50–50 chance to end up in either of the next 2 possible states, state B or state C. The value of state A is simply the sum, over all next states, of the probability of reaching that state multiplied by its reward. The value of state A is 0.5.

In figure 2, you find yourself in state D with only 1 possible route, to state E. Since state E gives a reward of 1, state D's value is also 1, since the only possible outcome is to receive that reward.

If you are in state F (also in figure 2), it can only lead to state G, followed by state H. Since state H has a negative reward of -1, state G's value will also be -1, and likewise for state F.

In this game of tic-tac-toe, getting 2 Xs in a row (state J in figure 3) does not win the game, hence there is no reward. But being at state J places you one step closer to reaching state K and completing the row of Xs to win the game, thus being in state J yields a good value.

In figure 4, you find yourself in state L, contemplating where to place your next X. You can place it at the top, bringing you to state M with 2 Xs in the same row, or you can place it in the bottom row, bringing you to state N. State M should have a higher significance and value compared to state N because it results in a higher possibility of victory.
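To make the figure 1 computation concrete, here is a minimal sketch in Python. Since the figure itself is not reproduced here, the rewards of states B and C are assumptions (1 and 0 respectively), chosen to be consistent with the stated result that the value of state A is 0.5.

```python
# Minimal sketch of the figure 1 computation. The rewards of states B and C
# (1 and 0) are assumptions consistent with the stated value of 0.5 for A.

def state_value(transitions):
    """Value of a state: sum over next states of P(next state) * reward."""
    return sum(prob * reward for prob, reward in transitions)

# From state A there is a 50-50 chance of reaching B (reward 1) or C (reward 0).
value_of_A = state_value([(0.5, 1), (0.5, 0)])
print(value_of_A)  # 0.5
```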
Therefore, at any given state, we can perform the action that brings us (or the agent) closer to receiving a reward, by picking the next state that yields the highest value.

Tic Tac Toe — Initialise the Value Function

The value function V(s) for a tic-tac-toe game is the probability of winning for achieving state s. This initialisation is done to define the winning and losing states. We initialise the states as follows:

- V(s) = 1 — if the agent won the game in state s; it is a terminal state
- V(s) = 0 — if the agent lost or tied the game in state s; it is a terminal state
- V(s) = 0.5 — otherwise, for non-terminal states, which will be fine-tuned during training

Tic Tac Toe — Update the Value Function

Updating the value function is how the agent learns from past experience, by updating the value of those states it has been through during training.

State s' is the next state of the current state s. We can update the value of the current state s by adding a fraction α (the learning rate) of the difference in value between state s' and state s:

V(s) ← V(s) + α [V(s') − V(s)]

Multiple actions can be taken at any given state, so constantly picking the one action that used to bring success might mean missing out on other, better states to be in. In reinforcement learning, this is the explore-exploit dilemma.

With the explore strategy, the agent takes random actions to try unexplored states, which may uncover other ways to win the game. With the exploit strategy, the agent increases its confidence in the actions that worked in the past to gain rewards. With a good balance between exploring and exploiting, and by playing infinitely many games, the value of every state will approach its true probability of winning. This balance between exploring and exploiting is determined by the epsilon-greedy parameter.

We can only update the value of each state that has been played in a particular game when the game has ended, after knowing whether the agent has won (reward = 1) or lost/tied (reward = 0). A terminal state can only be 0 or 1, and we know exactly which states are terminal, as defined during the initialisation.

The goal of the agent is to update the value function after a game is played, to learn from the list of actions that were executed. As every state's value is updated using the next state's value, at the end of each game the update process reads the state history of that particular game backwards and fine-tunes the value of each state.

Tic Tac Toe — Exploit the Value Function

Given enough training, the agent will have learnt the value (or probability of winning) of any given state. So, when we play a game against our trained agent, the agent uses the exploit strategy to maximise its winning rate. See if you can win against the agent.

At each state of the game, the agent loops through every possibility, picking the next state with the highest value and therefore selecting the best course of action. In figure 6, the agent would pick the bottom-right corner to win the game.
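Here is a minimal sketch of the training pieces described above, not the author's actual implementation: the state representation (any hashable board encoding), the 0.5 default, and the values chosen for α and ε are assumptions for illustration. It shows the epsilon-greedy choice between exploring and exploiting, and the backward pass over a game's state history that applies V(s) ← V(s) + α [V(s') − V(s)].

```python
import random

# Minimal sketch of the value-function training pieces described above.
# The state representation, the 0.5 default, and the values of ALPHA and
# EPSILON are illustrative assumptions.

ALPHA = 0.1      # learning rate
EPSILON = 0.1    # probability of exploring instead of exploiting

values = {}      # V(s): estimated probability of winning from state s

def value_of(state):
    # Non-terminal states start at 0.5; terminal states are overwritten
    # with 1 (win) or 0 (loss/tie) when a game ends.
    return values.get(state, 0.5)

def choose_state(candidate_states):
    """Epsilon-greedy: explore with probability EPSILON, otherwise exploit
    by picking the candidate next state with the highest value."""
    if random.random() < EPSILON:
        return random.choice(candidate_states)   # explore
    return max(candidate_states, key=value_of)   # exploit

def update_from_game(state_history, final_reward):
    """At the end of a game, read the state history backwards and apply
    V(s) <- V(s) + ALPHA * (V(s') - V(s))."""
    values[state_history[-1]] = final_reward     # terminal state: 1 or 0
    for s, s_next in zip(reversed(state_history[:-1]),
                         reversed(state_history[1:])):
        values[s] = value_of(s) + ALPHA * (value_of(s_next) - value_of(s))
```

When playing against a human after training, the same selection can be used with epsilon set to 0, so that the agent always exploits the learnt values.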
Conclusion

At any state except the terminal stage (where a win, loss or draw is recorded), the agent takes an action which leads to the next state, which may not yield any reward but brings the agent a move closer to receiving one.

The value function is the algorithm to determine the value of being in a state: the probability of receiving a future reward.

The value of each state is updated in reverse chronological order through the state history of a game. With enough training, using both the explore and exploit strategies, the agent will be able to determine the true value of each state in the game.

There are many ways to define a value function; this is just one that is suitable for a tic-tac-toe game.

Explore the demo on Github. View the source code on Github.

Hi! I'm Hong Jing (Jingles), currently a data scientist at Alibaba Group, a PhD student at Nanyang Technological University, and a passionate writer on Towards Data Science and Hackernoon. Follow me on Medium or connect with me on LinkedIn.

This article was originally published on Towards Data Science.