If you are wondering what OpenAI Gym is, please refer to Reinforcement Learning — Part 4. Today we will see some reinforcement learning in practice: we will try to teach our agent (a car, in this case) to climb up the right-hand hill, where the goal point is situated, in the minimum number of steps (a.k.a. minimum time). Officially, you can read the MountainCar-v0 environment page on Gym for more clarity on the problem statement we are trying to solve.
We will use the good old-school epsilon-greedy technique to approach this problem. Just to make sure we are on the same page, please go through the basics of Gym first and then come back here.
import gym
import numpy as np
import operator
# create the environment
env = gym.make('MountainCar-v0')
possible_actions = env.action_space.n
print('Possible actions are {}'.format(possible_actions))
In the above snippet, we instantiate our environment by calling the make method and passing in the environment name as an argument. You can find the full list of available Gym environments here. Next, we check how many actions our agent can take in this particular environment. This action set is needed to make an informed decision at any time-step during game-play.
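Before moving on, it also helps to inspect the observation space, so we know what the agent actually senses. The snippet below is a quick sketch using the same Gym API; for MountainCar-v0 the observation is a two-value vector (position and velocity of the car) and the three discrete actions are push left, do nothing and push right.
# inspect what the agent senses and what it can do
print(env.observation_space)        # Box(2,): [position, velocity]
print(env.observation_space.low)    # lower bounds of the state
print(env.observation_space.high)   # upper bounds of the state
print(env.action_space)             # Discrete(3): left, no push, right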
class EpsilonGreedy():
    def __init__(self, episodes=1000, epsilon=0.2):
        self.episodes = episodes
        self.epsilon = epsilon
        # running value estimate and selection count for each of the
        # three possible actions (0: push left, 1: no push, 2: push right)
        self.values = {0: 0.0, 1: 0.0, 2: 0.0}
        self.counts = {0: 0, 1: 0, 2: 0}

    def explore(self):
        # pick any action uniformly at random
        return np.random.choice(list(self.counts.keys()))

    def exploit(self):
        # pick the action with the highest estimated value so far
        return max(self.values.items(), key=operator.itemgetter(1))[0]

    def select_action(self, observation):
        if np.random.uniform(0, 1) < self.epsilon:
            # explore
            return self.explore()
        else:
            # exploit
            return self.exploit()

    def update_counts(self, action):
        self.counts[action] = self.counts[action] + 1

    def update_values(self, action, reward):
        # incremental average of all rewards received for this action
        current_value = self.values[action]
        n = self.counts[action]
        self.values[action] = ((n - 1) / float(n)) * current_value + \
            (1 / float(n)) * reward

    def update_all(self, action, reward):
        self.update_counts(action)
        self.update_values(action, reward)
Here, we define a class EpsilonGreedy whose instances are initialized with episodes, epsilon, values and counts, because these are the only things we need. You can think of episodes as complete player–environment interactions, from the moment the game starts until it ends (say, because the player died or a time-up state was reached), and epsilon as the probability of taking a random action: our agent explores the environment (takes a random action) with a probability of 20% and exploits (takes the highest-valued action) with a probability of 80% at every time-step. A certain level of exploration is required for the model to find the best possible action in a given situation and avoid getting stuck with a suboptimal one. values holds the weighted average of the previously estimated value and the reward we just received, and counts holds how many times each action has been taken.
We also define six methods: explore, exploit, select_action, update_counts, update_values and update_all. select_action is responsible for deciding whether to explore or exploit at any time-step in the game, and update_all is responsible for updating the counts and values dictionaries. If you look carefully, you will see that we initially assign a value of 0 (zero) to every possible action in the counts and values dictionaries, simply because we need to start the memory somewhere.
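To convince yourself that update_values really just keeps a running mean of all rewards seen for an action, here is a tiny standalone check (the rewards below are made up purely for illustration and have nothing to do with the actual environment):
# hypothetical rewards for a single action, just to trace the update rule
rewards = [-1.0, -1.0, -1.0, 0.0]
value, count = 0.0, 0
for r in rewards:
    count += 1
    value = ((count - 1) / float(count)) * value + (1 / float(count)) * r
    print(count, value)
# after the last step, value is -0.75, identical to the plain average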
epsilonlearn = EpsilonGreedy()
for episode in range(epsilonlearn.episodes):
    observation = env.reset()  # get a fresh initial state
    while True:
        env.render()
        action = epsilonlearn.select_action(observation)
        next_observation, reward, done, _ = env.step(action)
        epsilonlearn.update_all(action, reward)
        observation = next_observation
        if done:
            break  # break if the goal (or the step limit) is reached
env.close()  # close the environment once all episodes are done
The above snippet is the final piece of code we write for this task. We first create an instance of the EpsilonGreedy class that we defined earlier. Next, we iterate through all the episodes (think of them as lives in a game) and, for each one, reset the environment. We then run a loop until the episode terminates, in which we select an action at every time-step and use the reward it earns to update our action-value estimates, so that subsequent actions are chosen wisely (in order to maximize our total reward). At last, once all episodes are done, we close the environment.
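If you would like a quick sanity check that the agent is actually improving, a small variation of the same loop (sketched below; episode_rewards is an extra list introduced here purely for monitoring, it is not part of the original code) records the total reward per episode. Since MountainCar-v0 gives a reward of -1 for every step, a higher (less negative) total means the car reached the flag in fewer steps.
episode_rewards = []
for episode in range(epsilonlearn.episodes):
    observation = env.reset()
    total_reward = 0.0
    while True:
        action = epsilonlearn.select_action(observation)
        next_observation, reward, done, _ = env.step(action)
        epsilonlearn.update_all(action, reward)
        total_reward += reward
        observation = next_observation
        if done:
            break
    episode_rewards.append(total_reward)
print('Average total reward over the last 100 episodes: {}'.format(
    sum(episode_rewards[-100:]) / 100.0))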
That’s what it took to create an agent that can now learn its environment and take educated actions at each time-step in order to maximize the total reward.
Feel free to comment and share your thoughts. Do share and clap if you ❤ it.