
Reinforcement Learning — Part 5

by Prakhar Mishra, April 29th, 2018


We will be talking about how to train your player to go up the victory hill, without prior rules, using OpenAI Gym.


If you are wondering what OpenAI Gym is, please refer to Reinforcement Learning — Part 4. Today we will see some reinforcement learning in practice: we will try to teach our agent (a car, in this case) to climb up the right-hand hill, where the goal point is situated, in the minimum number of steps (a.k.a. minimum time). Officially, you can read the MountainCar-v0 Gym environment page for more clarity on the problem statement we are trying to solve.

We will be using the good old epsilon-greedy approach for this problem. Just to make sure we are on the same page, please go through the basics of Gym first and then come back here.
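Before wiring it into Gym, here is a minimal standalone sketch of the epsilon-greedy coin flip (the names below are purely illustrative and not part of the final code): with epsilon = 0.2 we expect roughly 20% of the decisions to be random exploration and the remaining 80% to be greedy exploitation.

    import numpy as np

    epsilon = 0.2
    explore_count = 0
    trials = 10000

    for _ in range(trials):
        if np.random.uniform(0, 1) < epsilon:
            explore_count += 1  # this is where a random action would be picked

    # should print a fraction close to 0.2
    print('Explore fraction: {:.3f}'.format(explore_count / float(trials)))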

Let’s Code…



    import gym
    import numpy as np
    import operator




    # create environment
    env = gym.make('MountainCar-v0')
    possible_actions = env.action_space.n
    print 'Possible actions are {}'.format(possible_actions)

In the above snippet, we instantiate our environment by calling the make method and passing in the environment name as an argument. You can find the full list of possible Gym environments here. Next, we would like to see how many actions our agent can take in this particular environment (env.action_space.n gives us that number). This action set is needed to make an informed decision at any time-step during the game-play.
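It can also help to peek a little further at the environment's interface. The short sketch below simply continues from the snippet above, using the same classic Gym API: each observation is a 2-dimensional vector of car position and velocity, the three discrete actions correspond to pushing left, not pushing and pushing right, and each step returns a reward of -1 until the episode ends, which is why finishing in the minimum number of steps is the same as maximizing the total reward.

    # inspect the environment a little further (assumes `env` from the snippet above)
    print 'Action space: {}'.format(env.action_space)            # Discrete(3)
    print 'Observation space: {}'.format(env.observation_space)  # Box(2,)
    print 'Lowest observation:  {}'.format(env.observation_space.low)
    print 'Highest observation: {}'.format(env.observation_space.high)

    # sample a starting state and a random action
    observation = env.reset()
    print 'Initial (position, velocity): {}'.format(observation)
    print 'A randomly sampled action: {}'.format(env.action_space.sample())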

    class EpsilonGreedy():

        def __init__(self, episodes=1000, epsilon=0.2):
            self.episodes = episodes
            self.epsilon = epsilon
            self.values = {0: 0.0, 1: 0.0, 2: 0.0}
            self.counts = {0: 0, 1: 0, 2: 0}

        def explore(self):
            return np.random.choice(self.counts.keys())

        def exploit(self):
            return max(self.values.items(),
                       key=operator.itemgetter(1))[0]

        def select_action(self, observation):
            if np.random.uniform(0, 1) < self.epsilon:
                # explore
                return self.explore()
            else:
                # exploit
                return self.exploit()

        def update_counts(self, action):
            self.counts[action] = self.counts[action] + 1

        def update_values(self, action, reward):
            current_value = self.values[action]
            n = self.counts[action]
            self.values[action] = ((n - 1) / float(n)) * current_value + \
                                  (1 / float(n)) * reward

        def update_all(self, action, reward):
            self.update_counts(action)
            self.update_values(action, reward)

Here, we define a class EpsilonGreedy whose instances hold episodes, epsilon, values and counts, because these are the only pieces of state we need. You can think of an episode as one full player–environment interaction from the moment the game starts until it ends (maybe because the player died or a time-up state was reached), and epsilon as the probability of taking a random action: our agent explores the environment (takes a random action) with a probability of 20% and exploits (takes the highest-valued action) with a probability of 80% at every time-step. A certain level of exploration is required for the model to find the best possible action in a given situation and avoid the trap of a local optimum.

values holds, for each action, the weighted average of the previously estimated value and the reward we just received, while counts tracks how many times each action has been taken. We also define six methods: explore, exploit, select_action, update_counts, update_values and update_all. select_action is responsible for deciding, at any time-step in the game, whether to explore or exploit, and update_all is responsible for updating the counts and values dictionaries. If you look carefully, you will see that every possible action starts with a value of 0 (zero) in both the counts and values dictionaries, simply because we need to start somewhere with the memory.
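To see what update_values is actually computing, here is a small, hypothetical sanity check (the agent name and the made-up rewards below are purely illustrative): after feeding the class a few rewards for the same action, the stored value is simply the running average of those rewards.

    # quick sanity check of the incremental average in update_values
    agent = EpsilonGreedy()
    for reward in [-1.0, -1.0, -0.5, -1.0]:
        agent.update_all(action=1, reward=reward)

    print 'Count for action 1: {}'.format(agent.counts[1])  # 4
    print 'Value for action 1: {}'.format(agent.values[1])  # -0.875, the mean of the four rewards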

    epsilonlearn = EpsilonGreedy()

    for episode in xrange(epsilonlearn.episodes):
        observation = env.reset()  # get a fresh starting state for this episode

        while True:
            env.render()
            action = epsilonlearn.select_action(observation)
            next_observation, reward, done, _ = env.step(action)
            epsilonlearn.update_all(action, reward)

            observation = next_observation

            if done:
                break  # episode over (goal reached or time limit hit)

    env.close()  # close the environment once all episodes are done

The above snippet is the final piece of code we will write for this task. We first create an instance of the EpsilonGreedy class (defined in one of the snippets above). Next, we iterate through all the episodes (possible lives in a game) and, for each of them, reset the environment. We then run a loop until the episode ends, in which we select an action at every time-step and use the reward and the next-state observation to choose the following actions wisely (in order to maximize our total reward). Finally, once all the episodes are done, we close the environment.
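One optional tweak, not part of the original snippet, is to keep a per-episode total of the reward the agent collects; watching that number over the episodes is an easy way to check whether the car is reaching the goal faster over time. A sketch of how the loop above could be extended:

    # same loop as above, but tracking the total reward collected per episode
    for episode in xrange(epsilonlearn.episodes):
        observation = env.reset()
        episode_reward = 0.0

        while True:
            env.render()
            action = epsilonlearn.select_action(observation)
            next_observation, reward, done, _ = env.step(action)
            epsilonlearn.update_all(action, reward)

            episode_reward += reward
            observation = next_observation

            if done:
                break

        print 'Episode {} finished with total reward {}'.format(episode, episode_reward)

    env.close()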

That’s what it took to create an agent that can now learn from its environment and take educated actions at each time-step in order to maximize the total reward.

My series on Reinforcement Learning can be found @ 1/2/3/4

Feel free to comment and share your thoughts. Do share and clap if you ❤‍ it.