Welcome to the third part of this five-part tutorial series on Machine Learning and its applications. Check out Dataturks, a data annotation tool to make your ML life simpler and smoother.

Word embeddings are vectorial representations that are assigned to words that have similar contextual usages. What is the use of word embeddings, you might ask? Well, if I am talking about Messi, you immediately know that the context is football… How does that happen? Our brains have associative memories, and we associate Messi with football…

To achieve the same, that is, to group similar words, we use embeddings. Embeddings initially started off with the one-hot encoding approach, where each sentence of the text is represented using a binary array whose length is equal to the number of unique words in the vocabulary.

Ex: Sentence 1: The mangoes are yellow. Sentence 2: The apples are red. The unique words are {The, mangoes, are, yellow, apples, red}. Hence sentence 1 will be represented as [1,1,1,1,0,0] and sentence 2 as [1,0,1,0,1,1].

This approach works well for small datasets but is inefficient for very large ones. Hence, several n-gram models have been implemented instead; we shall not explore that area in this tutorial. Our topic of interest is the word2vec model for the generation of word embeddings. It covers many concepts of machine learning: we shall learn about a single-hidden-layer neural network, embeddings, and various optimisation techniques.

Any machine learning algorithm needs three domains to work hand in hand: representation of the classifier, evaluation of the hypothesis, and optimization of the model for higher accuracy.

In the word2vec model, we have a neural network with a single hidden layer of size N, which is used to obtain the word embeddings of dimension N. The way to visualise the embeddings is as follows…

Let's understand the various terminologies…

Continuous Bag of Words Model (CBOW)

Introduced by Tomas Mikolov in his paper, this model assumes that only one word is considered per context; hence the model predicts one target word given one context word. Let the vocabulary size be V.

(CBOW model with only one word in context)

The weight matrix between the input layer and the hidden layer can be represented by a V×N matrix, where each row is the embedding vector of one word. Note that the activation function in this case is a linear function. The objective function is the conditional probability of observing the actual output word given the input context word. We need to maximise this objective function, that is, maximise the probability of predicting a word given its context. Simple, right?

CBOW also has a multi-word-context variant where, instead of taking a single context word, it averages the words within a window of a certain size and sends that average as the input to the neural net.

Skip-Gram Model

The skip-gram model, introduced in Mikolov et al., is the opposite of the CBOW model. The target word is now at the input layer, and the context words are at the output layer. The objective function is the probability of the output words given the target word, where w_O,c is the actual output word in the c-th group of output words.

(Objective function)

Our word2vec model implements skip-gram, and now… let's have a look at the code. Gensim also offers a faster, ready-made word2vec implementation.
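For reference, here is a minimal sketch of that ready-made route. It assumes the Gensim 4.x API (in older Gensim versions the vector_size parameter was called size), and the toy sentences are just a stand-in for a real corpus:

from gensim.models import Word2Vec

# Toy corpus: each sentence is a list of tokens (a stand-in for real data).
sentences = [['the', 'mangoes', 'are', 'yellow'],
             ['the', 'apples', 'are', 'red']]

# sg=1 trains a skip-gram model; sg=0 would train CBOW instead.
model = Word2Vec(sentences, vector_size=32, window=2, min_count=1, sg=1)

print(model.wv['mangoes'])                       # the learned embedding vector
print(model.wv.most_similar('mangoes', topn=3))  # neighbours by cosine similarity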
In this tutorial, however, we shall look at the source code for word2vec itself. Let's import all the required libraries and load the dataset available in nltk.corpus, which contains a small selection of texts from Project Gutenberg.

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import collections
import math
import random

import numpy as np
from six.moves import xrange
import tensorflow as tf
import nltk  # this is where the dataset of interest lives

# Jane Austen's "Emma" is our corpus; run nltk.download('gutenberg') once if it is missing.
vocabulary = list(nltk.corpus.gutenberg.words('austen-emma.txt'))
vocabulary_size = len(vocabulary)
# print(vocabulary, vocabulary_size)

Let's preprocess the dataset by getting rid of uncommon words and marking them as UNK tokens.

def build_dataset(words, n_words):
  """Process raw inputs into a dataset."""
  count = [['UNK', -1]]
  count.extend(collections.Counter(words).most_common(n_words - 1))
  dictionary = dict()
  for word, _ in count:
    dictionary[word] = len(dictionary)
  data = list()
  unk_count = 0
  for word in words:
    if word in dictionary:
      index = dictionary[word]
    else:
      index = 0  # dictionary['UNK']
      unk_count += 1
    data.append(index)
  count[0][1] = unk_count
  reversed_dictionary = dict(zip(dictionary.values(), dictionary.keys()))
  return data, count, dictionary, reversed_dictionary

data, count, dictionary, reverse_dictionary = build_dataset(vocabulary, vocabulary_size)
del vocabulary  # Hint to reduce memory.
print('Most common words (+UNK)', count[:5])
print('Sample data', data[:10], [reverse_dictionary[i] for i in data[:10]])

data_index = 0

Implementing the skip-gram batch generator is the next part.

# Step 3: Function to generate a training batch for the skip-gram model.
def generate_batch(batch_size, num_skips, skip_window):
  global data_index
  assert batch_size % num_skips == 0
  assert num_skips <= 2 * skip_window
  batch = np.ndarray(shape=(batch_size), dtype=np.int32)
  labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32)
  span = 2 * skip_window + 1  # [ skip_window target skip_window ]
  buffer = collections.deque(maxlen=span)
  for _ in range(span):
    buffer.append(data[data_index])
    data_index = (data_index + 1) % len(data)
  for i in range(batch_size // num_skips):
    target = skip_window  # target label at the center of the buffer
    targets_to_avoid = [skip_window]
    for j in range(num_skips):
      while target in targets_to_avoid:
        target = random.randint(0, span - 1)
      targets_to_avoid.append(target)
      batch[i * num_skips + j] = buffer[skip_window]
      labels[i * num_skips + j, 0] = buffer[target]
    buffer.append(data[data_index])
    data_index = (data_index + 1) % len(data)
  # Backtrack a little bit to avoid skipping words at the end of a batch.
  data_index = (data_index + len(data) - span) % len(data)
  return batch, labels

batch, labels = generate_batch(batch_size=8, num_skips=2, skip_window=1)
for i in range(8):
  print(batch[i], reverse_dictionary[batch[i]],
        '->', labels[i, 0], reverse_dictionary[labels[i, 0]])
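To make concrete what generate_batch yields, here is a tiny, self-contained sketch (the sentence is made up purely for illustration) of the (target, context) pairs that skip-gram trains on, with one word of context on each side:

# Illustrative only: enumerate skip-gram (target, context) pairs for a toy sentence.
words = ['the', 'mangoes', 'are', 'yellow']
skip_window = 1

pairs = []
for i, target in enumerate(words):
  for j in range(max(0, i - skip_window), min(len(words), i + skip_window + 1)):
    if j != i:  # a word is not its own context
      pairs.append((target, words[j]))

print(pairs)
# [('the', 'mangoes'), ('mangoes', 'the'), ('mangoes', 'are'),
#  ('are', 'mangoes'), ('are', 'yellow'), ('yellow', 'are')]

generate_batch produces exactly such pairs, except as integer word IDs and with num_skips context words sampled per target.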
Training the skip-gram model results in the model understanding the structure of the language.

# Step 4: Build and train a skip-gram model.

batch_size = 128
embedding_size = 128  # Dimension of the embedding vector.
skip_window = 1       # How many words to consider left and right.
num_skips = 2         # How many times to reuse an input to generate a label.

# We pick a random validation set to sample nearest neighbors. Here we limit
# the validation samples to the words that have a low numeric ID, which by
# construction are also the most frequent.
valid_size = 16     # Random set of words to evaluate similarity on.
valid_window = 100  # Only pick dev samples in the head of the distribution.
valid_examples = np.random.choice(valid_window, valid_size, replace=False)
num_sampled = 64    # Number of negative examples to sample.

graph = tf.Graph()

with graph.as_default():

  # Input data.
  train_inputs = tf.placeholder(tf.int32, shape=[batch_size])
  train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])
  valid_dataset = tf.constant(valid_examples, dtype=tf.int32)

  # Ops and variables pinned to the CPU because of missing GPU implementation.
  with tf.device('/cpu:0'):
    # Look up embeddings for inputs.
    embeddings = tf.Variable(
        tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
    embed = tf.nn.embedding_lookup(embeddings, train_inputs)

    # Construct the variables for the NCE loss.
    nce_weights = tf.Variable(
        tf.truncated_normal([vocabulary_size, embedding_size],
                            stddev=1.0 / math.sqrt(embedding_size)))
    nce_biases = tf.Variable(tf.zeros([vocabulary_size]))

  # Compute the average NCE loss for the batch.
  # tf.nce_loss automatically draws a new sample of the negative labels each
  # time we evaluate the loss.
  loss = tf.reduce_mean(
      tf.nn.nce_loss(weights=nce_weights,
                     biases=nce_biases,
                     labels=train_labels,
                     inputs=embed,
                     num_sampled=num_sampled,
                     num_classes=vocabulary_size))

  # Construct the SGD optimizer using a learning rate of 1.0.
  optimizer = tf.train.GradientDescentOptimizer(1.0).minimize(loss)

  # Compute the cosine similarity between minibatch examples and all embeddings.
  norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keep_dims=True))
  normalized_embeddings = embeddings / norm
  valid_embeddings = tf.nn.embedding_lookup(
      normalized_embeddings, valid_dataset)
  similarity = tf.matmul(
      valid_embeddings, normalized_embeddings, transpose_b=True)

  # Add variable initializer.
  init = tf.global_variables_initializer()

# Step 5: Begin training.
num_steps = 100001

with tf.Session(graph=graph) as session:
  # We must initialize all variables before we use them.
  init.run()
  print('Initialized')

  average_loss = 0
  for step in xrange(num_steps):
    batch_inputs, batch_labels = generate_batch(
        batch_size, num_skips, skip_window)
    feed_dict = {train_inputs: batch_inputs, train_labels: batch_labels}

    # We perform one update step by evaluating the optimizer op (including it
    # in the list of returned values for session.run()).
    _, loss_val = session.run([optimizer, loss], feed_dict=feed_dict)
    average_loss += loss_val

    if step % 2000 == 0:
      if step > 0:
        average_loss /= 2000
      # The average loss is an estimate of the loss over the last 2000 batches.
      print('Average loss at step ', step, ': ', average_loss)
      average_loss = 0

    # Note that this is expensive (~20% slowdown if computed every 500 steps).
    if step % 100000 == 0:
      sim = similarity.eval()
      for i in xrange(valid_size):
        valid_word = reverse_dictionary[valid_examples[i]]
        top_k = 8  # number of nearest neighbors
        nearest = (-sim[i, :]).argsort()[1:top_k + 1]
        log_str = 'Nearest to %s:' % valid_word
        for k in xrange(top_k):
          close_word = reverse_dictionary[nearest[k]]
          log_str = '%s %s,' % (log_str, close_word)
        print(log_str)
  final_embeddings = normalized_embeddings.eval()
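With training done, final_embeddings holds one unit-length vector per word, and we can query it directly. The helper below is our own addition (not part of the original script); it relies on the rows being L2-normalised, so a plain dot product equals cosine similarity:

def most_similar(word, top_k=5):
  # Rows of final_embeddings have unit norm, so dot products are cosine similarities.
  sims = np.dot(final_embeddings, final_embeddings[dictionary[word]])
  nearest = (-sims).argsort()[1:top_k + 1]  # index 0 is the query word itself
  return [reverse_dictionary[i] for i in nearest]

print(most_similar('Emma'))  # works for any token present in the corpus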
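Note that the graph above trains with plain SGD; we compare several optimizers below, and switching between them is a one-line change in the graph definition. A sketch using the TF 1.x optimizer classes (the learning rates here are illustrative, not tuned values):

# Drop-in alternatives for the GradientDescentOptimizer line above (TF 1.x API).
optimizer = tf.train.ProximalAdagradOptimizer(1.0).minimize(loss)
# optimizer = tf.train.AdamOptimizer(0.001).minimize(loss)
# optimizer = tf.train.RMSPropOptimizer(0.001).minimize(loss)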
Let's visualise the embeddings.

# Step 6: Visualize the embeddings with t-SNE.
def plot_with_labels(low_dim_embs, labels, filename='tsne.png'):
  assert low_dim_embs.shape[0] >= len(labels), 'More labels than embeddings'
  plt.figure(figsize=(18, 18))  # in inches
  for i, label in enumerate(labels):
    x, y = low_dim_embs[i, :]
    plt.scatter(x, y)
    plt.annotate(label,
                 xy=(x, y),
                 xytext=(5, 2),
                 textcoords='offset points',
                 ha='right',
                 va='bottom')
  plt.savefig(filename)

try:
  # pylint: disable=g-import-not-at-top
  from sklearn.manifold import TSNE
  import matplotlib.pyplot as plt

  tsne = TSNE(perplexity=30, n_components=2, init='pca', n_iter=5000)
  plot_only = 500
  low_dim_embs = tsne.fit_transform(final_embeddings[:plot_only, :])
  labels = [reverse_dictionary[i] for i in xrange(plot_only)]
  plot_with_labels(low_dim_embs, labels)

except ImportError:
  print('Please install sklearn, matplotlib, and scipy to show embeddings.')

Optimisation is used to refine the embeddings obtained. Let's review the various techniques that we know and use; I suggest you go through this, owing to the limitations of typing math on Medium.

(Results for comparison of various optimisers)

Hence, we can conclude that RMSProp and Adam, despite being state of the art, do not work well on these models. On the other hand, Proximal Adagrad and SGD work really well. Let's see the results of Proximal Adagrad and SGD.

(Proximal Adaptive Gradient Descent Optimizer)

Check how words that often go together are represented close to each other in the images. Also, compare the locations of the numbers in the two images, and decide which one is the better one accordingly!

(Stochastic Gradient Descent Optimizer)

This is the third tutorial in a five-part series… Excited for the next two… Share your thoughts and feedback at lalith@dataturks.com.