Reinforcement Learning — Part 4: MountainCar-v0 Environment

We will be talking about how to train your player to go up the victory hill, without any prior knowledge of the rules, using OpenAI Gym. If you are wondering what OpenAI Gym is, please refer to Gym. Today we will see some reinforcement learning in practice, where we try to teach our agent (a car in this case) to climb up to the right-hand hill, where the goal point is situated, in the minimum number of steps (a.k.a. minimum time). Officially, you can read the MountainCar-v0 description for more clarity on the problem statement we are trying to solve.

We will be using the good old epsilon-greedy technique for this problem. Just to make sure we are on the same page, please go through the basics of Gym and then come back here.

Let's Code…

```python
import gym
import numpy as np
import operator

# create the environment
env = gym.make('MountainCar-v0')
possible_actions = env.action_space.n  # number of discrete actions available
print('Possible actions are {}'.format(possible_actions))
```

In the above snippet, we instantiate our environment by calling the make method and passing in the environment name as an argument. You can find the full list of possible Gym environments here. Next, we would like to see the possible actions that our agent can take in this particular environment. This action set is needed to make an informed decision at any time-step during the game-play.

```python
class EpsilonGreedy():

    def __init__(self, episodes=1000, epsilon=0.2):
        self.episodes = episodes
        self.epsilon = epsilon
        self.values = {0: 0.0, 1: 0.0, 2: 0.0}
        self.counts = {0: 0, 1: 0, 2: 0}

    def explore(self):
        # pick a random action
        return np.random.choice(list(self.counts.keys()))

    def exploit(self):
        # pick the action with the highest estimated value
        return max(self.values.items(), key=operator.itemgetter(1))[0]

    def select_action(self, observation):
        if np.random.uniform(0, 1) < self.epsilon:
            # explore
            return self.explore()
        else:
            # exploit
            return self.exploit()

    def update_counts(self, action):
        self.counts[action] = self.counts[action] + 1

    def update_values(self, action, reward):
        current_value = self.values[action]
        n = self.counts[action]
        self.values[action] = ((n - 1) / float(n)) * current_value \
                              + (1 / float(n)) * reward

    def update_all(self, action, reward):
        self.update_counts(action)
        self.update_values(action, reward)
```

Here, we define a class EpsilonGreedy with instance attributes episodes, epsilon, values and counts, because these are the only ones needed. You can think of episodes as player-environment interactions from the moment the game starts until it ends (maybe because the player died or a time-up state was reached), and epsilon as the probability of taking a random action: our agent will explore the environment (take a random action) with a probability of 20% and exploit (take the highest-weighted action) with a probability of 80% at every time-step. A certain level of exploration is required for the model to find the best possible action in a given situation and avoid the trap of local minima. values holds, for each action, a weighted average of the previously estimated value and the reward we just received, and counts holds how many times each action has been taken.

We also define six methods: explore, exploit, select_action, update_counts, update_values and update_all. select_action is the method responsible for deciding, at any time-step in the game, whether to explore or exploit, and update_all is responsible for updating the action counts and values dictionaries. If you look carefully, you will see that we initialize every possible action to 0 (zero) in both the counts and values dictionaries, just because we need to start somewhere with the memory.
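Before wiring the class into the environment, here is a small optional sanity check (my own addition, not part of the original post) that feeds a few made-up rewards to action 0 and prints how the count and the running weighted average evolve, then confirms that with epsilon = 0.2 roughly 80% of select_action calls return the current best action. The reward numbers are purely hypothetical.

```python
# Optional sanity check with made-up rewards (illustrative only).
agent = EpsilonGreedy(episodes=10, epsilon=0.2)

for fake_reward in [-1.0, -1.0, 0.0, 1.0]:      # hypothetical reward stream
    agent.update_all(action=0, reward=fake_reward)
    print('count={}, value={:.3f}'.format(agent.counts[0], agent.values[0]))

# observation is unused by the class, so None is fine here
picks = [agent.select_action(observation=None) for _ in range(1000)]
for a in agent.values:
    print('action {} picked {} times out of 1000'.format(a, picks.count(a)))
```

With the reward history -1, -1, 0, 1 the running value settles at their plain mean (-0.25), which is exactly what the incremental update computes, one reward at a time.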
```python
epsilonlearn = EpsilonGreedy()

for episode in range(epsilonlearn.episodes):
    observation = env.reset()  # get a fresh starting state
    while True:
        env.render()
        action = epsilonlearn.select_action(observation)
        next_observation, reward, done, _ = env.step(action)
        epsilonlearn.update_all(action, reward)
        observation = next_observation
        if done:
            break  # break if the goal is reached or the episode ends

env.close()  # close the environment when all episodes are finished
```

The above snippet is the final piece of code we will be writing for this task. We first get an instance/object of the class EpsilonGreedy (we defined this in one of the above snippets). Next, we iterate through all the episodes (possible lives in a game) and, for each of them, we reset the environment properties. We then run a loop (till the goal is achieved) in which we select an action at every time-step and use the reward and the next state observation to choose the consecutive action wisely (in order to maximize our total rewards). At last, once an episode is done we break out and restart with the next episode, closing the environment when everything is finished.

That's what it took to create an agent that can now learn its environment and take educated actions at each time-step in order to maximize the total reward.

My series on Reinforcement Learning can be found @ 1 / 2 / 3 / 4

Feel free to comment and share your thoughts. Do share and clap if you ❤ it.
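As a small optional extra (my own addition, not something from the original post): since the whole point is to maximize total reward, it can be handy to log the reward collected in each episode and then watch what the purely greedy policy does once training is over (epsilon set to 0). Below is a minimal sketch along those lines, reusing the EpsilonGreedy class from above and assuming the classic Gym API used throughout this post; newer Gym/Gymnasium releases changed the reset and step signatures, so it would need small tweaks there.

```python
import gym

# Optional extra: log the total reward per episode and finish with one
# purely greedy roll-out (epsilon = 0, i.e. exploration switched off).
env = gym.make('MountainCar-v0')
agent = EpsilonGreedy(episodes=200, epsilon=0.2)   # class defined earlier

for episode in range(agent.episodes):
    observation = env.reset()
    total_reward = 0.0
    while True:
        action = agent.select_action(observation)
        observation, reward, done, _ = env.step(action)
        agent.update_all(action, reward)
        total_reward += reward
        if done:
            break
    print('episode {}: total reward {}'.format(episode, total_reward))

# greedy roll-out using only the learned values
agent.epsilon = 0.0
observation = env.reset()
done = False
while not done:
    env.render()
    observation, reward, done, _ = env.step(agent.select_action(observation))

env.close()
```

Rendering is skipped during training here purely to keep the episodes fast; it is switched back on for the final greedy roll-out so you can watch the car's behaviour.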