Where next? After SVMs, CNNs and Word Embeddings

Written by dataturks | Published 2018/07/28
Tech Story Tags: machine-learning | reinforcement-learning | neural-networks | openai | ai

TLDR The plethora of knowledge involved in Machine Learning is the most fabulous thing about the subject. In this five-part tutorial series we covered CNNs, word embeddings and Support Vector Machines, all of which fit the supervised learning approach. Beyond that lie unsupervised learning, with algorithms ranging from clustering to auto-encoders in deep learning, and reinforcement learning. The theoretical and coding balance requires a steady and disciplined approach.

The plethora of knowledge involved in Machine Learning is the most fabulous thing about the subject. Balancing the theory and the coding requires a steady and disciplined approach. In this five-part tutorial series, we started with CNNs, exploring various approaches to different scenarios, then worked on word embeddings, our gateway to Natural Language Processing, and finally ended with Support Vector Machines (SVMs), which at the time of their inception were as powerful as Artificial Neural Networks.
Is this all there is to Machine Learning?
If this question is running through your mind, you are absolutely right to ask it! The topics mentioned above are just a glimpse into the supervised learning models available across various domains. CNNs cover deep learning, embeddings cover NLP and SVMs cover classical supervised algorithms, but broadly, all of these fit the supervised learning approach. Recall that when we are first introduced to Machine Learning, we are told that there are three classes of ML:
Supervised Learning
Unsupervised Learning
Reinforcement Learning
Unsupervised Learning:
It’s supervised learning revisited, but this time with no ground-truth labels. The dataset carries no information about the correct output, so the algorithm has to find patterns in the data on its own. There are various types of unsupervised learning algorithms, ranging from clustering to auto-encoders in deep learning.
k-means clustering
The goal of clustering is to create groups of data points such that points in different clusters are dissimilar while points within a cluster are similar, k being the number of clusters. In k-means, these groups are defined by creating a centroid for each group: points are assigned to the nearest centroid, and the centroids are recomputed until the assignments stop changing.
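Here is a minimal k-means sketch using scikit-learn; the synthetic blob data and the choice of k = 3 are assumptions made purely for illustration.

import numpy as np
from sklearn.cluster import KMeans

# toy 2D data: three Gaussian blobs (illustrative only)
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(50, 2) + c for c in ([0, 0], [5, 5], [0, 5])])

kmeans = KMeans(n_clusters=3, random_state=0).fit(X)  # k = 3 centroids
print(kmeans.cluster_centers_)   # one centroid per cluster
print(kmeans.labels_[:10])       # cluster assignment for the first 10 points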
Hierarchical Clustering:
The challenge here is unique, as we intend to build a hierarchy of clusters, that is, to get a sense of granularity across the various classes. The (agglomerative) algorithm goes like this…
Start with each data point as its own cluster
Merge the two clusters that are closest to each other
Recompute the distances between the clusters
Repeat until a single cluster containing all the data points is obtained
Dimensionality Reduction:
Principal component analysis (PCA)
First, a little linear algebra refresher — let’s talk about spaces and bases.
You’re familiar with the coordinate plane, with origin O(0,0) and basis vectors i(1,0) and j(0,1). The basis vectors tell us the span of the coordinate plane they cover, and the same space can be covered by many different choices of basis.
This means we can change the basis of a space. Now imagine a much higher-dimensional space, say 50K dimensions. You can select a basis for that space, and then select only the 200 most significant vectors of that basis. These basis vectors are called principal components, and the subset you select constitutes a new space that is smaller in dimensionality than the original space but maintains as much of the complexity of the data as possible.
Another way of thinking about this is that PCA remaps the space in which our data exists to make it more compressible. The transformed dimension is smaller than the original dimension.
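As a minimal sketch of this remapping with scikit-learn (the random 50-dimensional data and the choice of 2 components are assumptions for illustration):

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.randn(100, 50)                 # 100 samples living in a 50-dimensional space

pca = PCA(n_components=2)              # keep only the 2 most significant directions
X_reduced = pca.fit_transform(X)       # project the data onto the principal components
print(X_reduced.shape)                 # (100, 2)
print(pca.explained_variance_ratio_)   # fraction of variance retained by each component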
Singular Value Decomposition (SVD):
SVD is a robust and efficient matrix decomposition technique. Unlike eigenvalue decomposition, it works on rectangular matrices as well. One of its uses is computing the pseudoinverse, also known as the Moore-Penrose inverse: the generalization of the matrix inverse from square matrices to rectangular matrices where the numbers of rows and columns are not equal.
from numpy import array
from scipy.linalg import svd
# define a matrix
A = array([[1, 2], [3, 4], [5, 6]])
U, s, VT = svd(A)
NumPy provides the function pinv() for calculating the pseudoinverse of a rectangular matrix.
from numpy import array
from numpy.linalg import pinv
A = array([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]])
# calculate pseudoinverse
B = pinv(A)
Application in Dimensionality Reduction
Data with a large number of features, such as more features than observations (rows), may be reduced to a smaller subset of features that are most relevant to the prediction problem. The result is a matrix of lower rank that is said to approximate the original matrix. To do this, we perform an SVD on the original data and keep only the top k largest singular values in Sigma, selecting the corresponding columns of Sigma and rows of V^T.
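A minimal sketch of this truncation with numpy and scipy follows; the matrix A and the choice of k = 2 are assumptions made up for illustration.

from numpy import array, diag, zeros
from scipy.linalg import svd

A = array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10], [11, 12, 13, 14, 15]])
U, s, VT = svd(A)

k = 2                                     # number of singular values to keep
Sigma = zeros((A.shape[0], A.shape[1]))   # rebuild the full m x n Sigma matrix
Sigma[:A.shape[0], :A.shape[0]] = diag(s)
Sigma_k = Sigma[:, :k]                    # top-k columns of Sigma
VT_k = VT[:k, :]                          # top-k rows of V^T

B = U.dot(Sigma_k.dot(VT_k))              # rank-k approximation of the original matrix
T = U.dot(Sigma_k)                        # the data expressed in the reduced k-dimensional space
print(B)
print(T)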
What next?
Under unsupervised learning there are also energy-based models, which are fascinating because these probabilistic models are paralleled to the energy distributions of particles in physics… Do explore these areas: Restricted Boltzmann Machines, Deep Belief Nets and Auto-Encoders.
Reinforcement Learning:
This type of ML is one of the most interesting, for it represents the way we ourselves learn to act: dopamine is the reward we get for performing an action successfully. In technical terms, the reinforcement learning agent decides how to perform a given task, be it playing games, finding the optimal path between two points, or learning how to fly a drone and do backflips. Trial and error is the best way to put it.
For readers interested in going deep into RL and the math behind it, I suggest going through David Silver’s lectures and Richard Sutton’s book on RL. Some amazing math backs this beautiful subdomain of Machine Learning.
Terminologies:
Markov Decision Processes (MDPs): Inspired by finite state machines, these are models with a finite set of states. The process is modelled probabilistically, so that from each state the next state is determined by transition probabilities. Makes sense, right? During the training of the agent, a reward is associated with each transition that takes place. The defining property of an MDP is that the next state depends only on the current state and not on the previous ones. Hence, it is memoryless.
Q-Learning: We have a function Q that takes a state and an action as input and returns the expected long-term reward of taking that action in that state. Before we explore the environment, Q gives the same (arbitrary) fixed value. But then, as we explore the environment more, Q gives us a better and better approximation of the value of an action a at a state s. We update our function Q as we go.
The discount factor, another hyperparameter, decides how much future rewards count relative to immediate ones, and thus which states and rewards matter most.
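To tie these terminologies together, here is a minimal tabular Q-learning sketch on OpenAI Gym’s FrozenLake-v0, which is itself a tiny MDP. The environment choice and the hyperparameter values are assumptions for illustration (not tuned), and it uses the same classic Gym API as the script below.

import numpy as np
import gym

env = gym.make('FrozenLake-v0')
Q = np.zeros((env.observation_space.n, env.action_space.n))  # one Q-value per (state, action)

alpha, gamma, epsilon = 0.1, 0.99, 0.1   # learning rate, discount factor, exploration rate

for episode in range(5000):
    state = env.reset()
    done = False
    while not done:
        # epsilon-greedy: mostly exploit the current Q, sometimes explore at random
        if np.random.rand() < epsilon:
            action = env.action_space.sample()
        else:
            action = np.argmax(Q[state])
        next_state, reward, done, info = env.step(action)
        # Q-learning update: move Q(s, a) towards reward + discounted best future value
        Q[state, action] += alpha * (reward + gamma * np.max(Q[next_state]) - Q[state, action])
        state = next_state

print(Q)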
Coding Time:
Let’s have a look at Andrej Karpathy’s 130-line script, which implements Reinforcement Learning from scratch using numpy only, built on OpenAI Gym’s ATARI 2600 Pong. I suggest you read his blog post to gain insights into RL.
""" Trains an agent with (stochastic) Policy Gradients on Pong. Uses OpenAI Gym. """
import numpy as np
import cPickle as pickle
import gym

# hyperparameters
H = 200 # number of hidden layer neurons
batch_size = 10 # every how many episodes to do a param update?
learning_rate = 1e-4
gamma = 0.99 # discount factor for reward
decay_rate = 0.99 # decay factor for RMSProp leaky sum of grad^2
resume = False # resume from previous checkpoint?
render = False

# model initialization
D = 80 * 80 # input dimensionality: 80x80 grid
if resume:
  model = pickle.load(open('save.p', 'rb'))
else:
  model = {}
  model['W1'] = np.random.randn(H,D) / np.sqrt(D) # "Xavier" initialization
  model['W2'] = np.random.randn(H) / np.sqrt(H)
  
grad_buffer = { k : np.zeros_like(v) for k,v in model.iteritems() } # update buffers that add up gradients over a batch
rmsprop_cache = { k : np.zeros_like(v) for k,v in model.iteritems() } # rmsprop memory

def sigmoid(x): 
  return 1.0 / (1.0 + np.exp(-x)) # sigmoid "squashing" function to interval [0,1]

def prepro(I):
  """ prepro 210x160x3 uint8 frame into 6400 (80x80) 1D float vector """
  I = I[35:195] # crop
  I = I[::2,::2,0] # downsample by factor of 2
  I[I == 144] = 0 # erase background (background type 1)
  I[I == 109] = 0 # erase background (background type 2)
  I[I != 0] = 1 # everything else (paddles, ball) just set to 1
  return I.astype(np.float).ravel()

def discount_rewards(r):
  """ take 1D float array of rewards and compute discounted reward """
  discounted_r = np.zeros_like(r)
  running_add = 0
  for t in reversed(xrange(0, r.size)):
    if r[t] != 0: running_add = 0 # reset the sum, since this was a game boundary (pong specific!)
    running_add = running_add * gamma + r[t]
    discounted_r[t] = running_add
  return discounted_r

def policy_forward(x):
  h = np.dot(model['W1'], x)
  h[h<0] = 0 # ReLU nonlinearity
  logp = np.dot(model['W2'], h)
  p = sigmoid(logp)
  return p, h # return probability of taking action 2, and hidden state

def policy_backward(eph, epdlogp):
  """ backward pass. (eph is array of intermediate hidden states) """
  dW2 = np.dot(eph.T, epdlogp).ravel()
  dh = np.outer(epdlogp, model['W2'])
  dh[eph <= 0] = 0 # backpro prelu
  dW1 = np.dot(dh.T, epx)
  return {'W1':dW1, 'W2':dW2}

env = gym.make("Pong-v0")
observation = env.reset()
prev_x = None # used in computing the difference frame
xs,hs,dlogps,drs = [],[],[],[]
running_reward = None
reward_sum = 0
episode_number = 0
while True:
  if render: env.render()

  # preprocess the observation, set input to network to be difference image
  cur_x = prepro(observation)
  x = cur_x - prev_x if prev_x is not None else np.zeros(D)
  prev_x = cur_x

  # forward the policy network and sample an action from the returned probability
  aprob, h = policy_forward(x)
  action = 2 if np.random.uniform() < aprob else 3 # roll the dice!

  # record various intermediates (needed later for backprop)
  xs.append(x) # observation
  hs.append(h) # hidden state
  y = 1 if action == 2 else 0 # a "fake label"
  dlogps.append(y - aprob) # grad that encourages the action that was taken to be taken (see http://cs231n.github.io/neural-networks-2/#losses if confused)

  # step the environment and get new measurements
  observation, reward, done, info = env.step(action)
  reward_sum += reward

  drs.append(reward) # record reward (has to be done after we call step() to get reward for previous action)

  if done: # an episode finished
    episode_number += 1

    # stack together all inputs, hidden states, action gradients, and rewards for this episode
    epx = np.vstack(xs)
    eph = np.vstack(hs)
    epdlogp = np.vstack(dlogps)
    epr = np.vstack(drs)
    xs,hs,dlogps,drs = [],[],[],[] # reset array memory

    # compute the discounted reward backwards through time
    discounted_epr = discount_rewards(epr)
    # standardize the rewards to be unit normal (helps control the gradient estimator variance)
    discounted_epr -= np.mean(discounted_epr)
    discounted_epr /= np.std(discounted_epr)

    epdlogp *= discounted_epr # modulate the gradient with advantage (PG magic happens right here.)
    grad = policy_backward(eph, epdlogp)
    for k in model: grad_buffer[k] += grad[k] # accumulate grad over batch

    # perform rmsprop parameter update every batch_size episodes
    if episode_number % batch_size == 0:
      for k,v in model.iteritems():
        g = grad_buffer[k] # gradient
        rmsprop_cache[k] = decay_rate * rmsprop_cache[k] + (1 - decay_rate) * g**2
        model[k] += learning_rate * g / (np.sqrt(rmsprop_cache[k]) + 1e-5)
        grad_buffer[k] = np.zeros_like(v) # reset batch gradient buffer

    # boring book-keeping
    running_reward = reward_sum if running_reward is None else running_reward * 0.99 + reward_sum * 0.01
    print 'resetting env. episode reward total was %f. running mean: %f' % (reward_sum, running_reward)
    if episode_number % 100 == 0: pickle.dump(model, open('save.p', 'wb'))
    reward_sum = 0
    observation = env.reset() # reset env
    prev_x = None

  if reward != 0: # Pong has either +1 or -1 reward exactly when game ends.
    print ('ep %d: game finished, reward: %f' % (episode_number, reward)) + ('' if reward == -1 else ' !!!!!!!!')
Well, OpenAI was founded to hasten the progress of RL and bring it on par with supervised learning. We shall talk about using OpenAI Gym in another blog; its documentation is easy to understand too. I suggest you go through Andrej Karpathy’s blog post, the best article to get you started with RL. This was tutorial 5 in the 5-part series.
For any queries, suggestions or feedback, contact me at [email protected].

Published by HackerNoon on 2018/07/28