Word2Vec: the Steroids for Natural Language Processing

Let's start with the basics.

Word Vectors

Q) What are word vectors?
Ans) Representations of words with numbers.

Q) Why word vectors?
Ans) I'll sum it up with three main reasons:
1. Computers cannot do computations on strings.
2. Strings don't hold much explicit information themselves.
3. Word vectors are usually dense vector representations.

Q) So what is explicit information?
Ans) The word itself doesn't say much about what it represents in real life. Example: the string "cat" just tells us it has three letters, "c", "a" and "t". It has no information about the animal it represents, or the count, or the context in which it is being used.

Q) Dense vector representation?
Ans) Short answer (for now): these vectors can hold enormous information compared to their size.

Q) Types of word vectors?
Ans) There are two main categories:
Frequency based: use statistics to compute the probability of a word co-occurring with respect to its neighbouring words.
Prediction based: use predictive analysis to make a weighted guess of a word co-occurring with respect to its neighbouring words.

Predictions can be of two types:
Semantic: try to guess a word w.r.t. the context of the text.
Syntactic: try to guess a word w.r.t. the syntax of the text.

Q) Difference between syntax and context based vectors?
Ans) Let's look at an example. Consider the sentence "Newton does not like apples."
Semantic vectors are concerned with 'Who or what are the entities a text is about?' In this case, "Newton" and "apples".
Syntactic vectors are concerned with 'What about those entities is being said?' In this case, "not" and "like".

Word2Vec

Word2Vec is one of the most widely used forms of word vector representation, first coined by Google in Mikolov et al. It has two variants:
1. CBOW (Continuous Bag of Words): this model tries to predict a word on the basis of its neighbours.
2. SkipGram: this model tries to predict the neighbours of a word.
"Statistically it has the effect that CBOW smoothes over a lot of the distributional information (by treating an entire context as one observation). For the most part, this turns out to be a useful thing for smaller datasets. However, skip-gram treats each context-target pair as a new observation, and this tends to do better when we have larger datasets." - Tensorflow

In simpler words, CBOW tends to find the probability of a word occurring in a neighbourhood (context). So it generalises over all the different contexts in which a word can be used. SkipGram, on the other hand, tends to learn the different contexts separately. So SkipGram needs enough data w.r.t. each context; it requires more data to train, but (given enough data) it contains more knowledge about the context.

NOTE: These techniques do not need a tagged dataset (though a tagged dataset can be used to include additional information, as we'll see later). Any large text corpus is effectively a dataset, since the tags to be predicted are the words already present in the text.

We will focus on SkipGram, as large enough datasets (Wikipedia, Reddit, Stackoverflow etc.) are available for download.

SkipGram

First we decide what context we are looking for, in terms of: our target words (to be predicted), our source words (on the basis of which we predict), and how far we are looking for context (window size).

Example: considering the window size to be 3.

Type 1: the middle word is the source word; the next and previous words are the target words. (fig no. 1, fig no. 2)

Type 2: the first word is the source word; the following two words are the target words. (fig no. 3, fig no. 4)

In both types, the source word is surrounded by words which are relevant to a context of that source word. 'Messi', for example, will usually be surrounded by words related to 'Football'.
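The windowing scheme above can be sketched in a few lines of plain Python. This is a minimal illustration of how (source, target) training pairs are generated for SkipGram; the function name and the symmetric-window convention are my own, not taken from any particular library:

```python
def skipgram_pairs(tokens, window=3):
    """Generate (source, target) training pairs for SkipGram.

    Each word acts as a source, paired with every neighbour
    within `window` positions on either side (the targets).
    """
    pairs = []
    for i, source in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own neighbour
                pairs.append((source, tokens[j]))
    return pairs

sentence = "apple fell on newton".split()
print(skipgram_pairs(sentence, window=1))
# → [('apple', 'fell'), ('fell', 'apple'), ('fell', 'on'),
#    ('on', 'fell'), ('on', 'newton'), ('newton', 'on')]
```

Every pair becomes one training observation, which is exactly why SkipGram "treats each context-target pair as a new observation" in the Tensorflow quote above.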
So after seeing a few examples, the word vector of 'Messi' will start incorporating context related to 'Football', 'Goals', 'Matches' etc. In the case of 'Apple', its word vector would do the same, but for both the company and the fruit (see fig no. 6).

fig no. 5: Word2Vec's Neural Network

W1 and W2 contain information about the words. The information in W1 and W2 is combined/averaged to obtain the Word2Vec representations. Say the size of the vectors was 400; the Word2Vec representation of 'apple' would look something like:

array([-2.56660223e-01, -7.96796158e-02, -2.04517767e-02, -7.34366626e-02,
        3.26843783e-02, -1.90244913e-02,  7.93217495e-02,  4.07200940e-02,
       -1.74737453e-01, .....
        1.86899990e-01, -4.33036387e-02, -2.66942739e-01, -1.00671440e-01],
      dtype=float32)

Now a simple sentence like "Apple fell on Newton", containing 4 words, can with the help of Word2Vec be converted into 4*400 (1600) numbers, each containing explicit information [1]. So now we also know that the text is talking about a person, science, fruit etc.

[1]: hence the term dense vector representation.

Visualisation

Visualising Word2Vec directly is currently impossible for mankind (because of high dimensionality, like 400). Instead we use techniques like dimensionality reduction, multidimensional scaling, sammon's mapping, nearest neighbour graphs etc. The most widely used algorithm is t-Distributed Stochastic Neighbour Embedding (t-SNE). Christopher Olah has an amazing blog about dimensionality reduction.

The end result of t-SNE on Word2Vec looks something like fig no. 6.

fig no. 6: Multiple contexts of Apple

This figure shows that Apple lies between companies (IBM, Microsoft) and fruits (Mango). That's because the Word2Vec representation of Apple contains information about both the company Apple and the fruit Apple.

Distance between:
Apple and Mango : 0.505
Apple and IBM : 0.554
Mango and IBM : 0.902
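The distances above are cosine distances between word vectors: 0 means the vectors point in the same direction, 2 means opposite directions. A minimal sketch in plain Python (libraries like scipy or gensim provide this out of the box; the function here is just for illustration):

```python
import math

def cosine_distance(u, v):
    """Cosine distance = 1 - cosine similarity.

    Measures the angle between two vectors, ignoring their
    magnitude: 0 for identical direction, 2 for opposite.
    """
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

# Parallel vectors → distance 0; orthogonal vectors → distance 1
print(cosine_distance([1.0, 2.0], [2.0, 4.0]))  # → 0.0 (up to float error)
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))  # → 1.0
```

So "Apple and Mango : 0.505" says the two vectors sit at roughly the same angle apart as "Apple and IBM : 0.554", while "Mango and IBM : 0.902" are much further apart, matching the layout in fig no. 6.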
fig no. 7: Combining contexts of multiple words

This figure shows that by combining the directions of two vectors, 'State' and 'America', the resultant vector 'Dakota' is relevant to the original vectors. So effectively:

State + America = Dakota
State + Germany = Bavaria

Other examples are:

German + Airlines = Lufthansa
King + Woman - Man = Queen

Implementations

Gensim and Tensorflow both have pretty impressive implementations of Word2Vec. There is an excellent blog about Gensim's implementation, and Tensorflow has a tutorial.

Issues

By default, a Word2Vec model has one representation per word. A vector can try to accumulate all contexts, but that just ends up generalising all the contexts to at least some extent, so the precision of each context is compromised. This is especially a problem for words which have very different contexts, as it might lead to one context overpowering the others. For example, there will be only one Word2Vec representation for 'apple' the company and 'apple' the fruit.

Example 1: 'maiden' can be used for a woman, a band (Iron Maiden), in sports etc. When you try to find the most similar words to 'maiden':

[(u'odi_debut', 0.43079674243927), (u'racecourse_debut', 0.42960068583488464), ..... (u'marathon_debut', 0.40903717279434204), (u'one_day_debut', 0.40729495882987976), (u'test_match_debut', 0.4013477563858032)]

It is clearly visible that the context related to 'sports' has overpowered the others. Even combining 'iron' and 'maiden' doesn't resolve the issue, as now the context of 'iron' overpowers:

[(u'steel', 0.5581518411636353), (u'copper', 0.5266575217247009), ..... (u'bar_iron', 0.49549400806427)]

Example 2: a word could be used as a verb and a noun, but with completely different meanings. Take the word 'iron': as a verb it is usually used for smoothing things with an electric iron, but as a noun it mostly denotes the metal.
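The "King + Woman - Man = Queen" style arithmetic can be illustrated with hand-picked toy vectors. The 2-d vectors below are made up purely for illustration (real models use hundreds of dimensions and learn the vectors from data); the `nearest` helper is my own, roughly what gensim's `most_similar(positive=..., negative=...)` does internally:

```python
import math

def nearest(word_vecs, query_vec, exclude=()):
    """Return the word whose vector is most similar (by cosine) to query_vec."""
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        return dot / (math.sqrt(sum(a * a for a in u)) *
                      math.sqrt(sum(b * b for b in v)))
    return max((w for w in word_vecs if w not in exclude),
               key=lambda w: cos(word_vecs[w], query_vec))

# Toy 2-d vectors: one axis loosely encodes "royalty", the other "male"
vecs = {
    "king":  [0.9, 0.9],
    "queen": [0.9, 0.1],
    "man":   [0.1, 0.9],
    "woman": [0.1, 0.1],
}

# king + woman - man, component-wise
q = [k + w - m for k, w, m in zip(vecs["king"], vecs["woman"], vecs["man"])]
print(nearest(vecs, q, exclude={"king", "woman", "man"}))  # → queen
```

Excluding the input words from the search is standard practice, since the query vector usually stays closest to its own ingredients.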
When we find the nearest neighbours of 'iron':

[(u'steel', 0.5581518411636353), (u'copper', 0.5266575217247009), ..... (u'bar_iron', 0.49549400806427)]

there is negligible reference to the verb counterpart.

Variants

Variant 1: Compound-noun based Word2Vec

Replace nouns (like 'iron' and 'maiden') with compound nouns (like 'iron_maiden') in the training set:

"Iron Maiden is an amazing band" becomes "Iron_Maiden is an amazing band"

So the context of the compound noun stands out, and is remarkably accurate!

Result: the most relevant words w.r.t 'iron_maiden' are:

[(u'judas_priest', 0.8176089525222778), (u'black_sabbath', 0.7859792709350586), (u'megadeth', 0.7748109102249146), (u'metallica', 0.7701393961906433), .....

That's hardcore, literally!

fig no. 8 shows Python code for converting nouns into compound nouns (with adj-noun pairing as well), in order to create the training set for Word2Vec. fig no. 9 shows Python code for converting nouns into compound nouns (noun-noun pairing only).

Variant 2: Sense2Vec

(Note: this is a non-NER implementation of Sense2Vec.)

This takes the above variant one step further by adding Part Of Speech (P.O.S.) tags to the training set.

Example: "I iron my shirt with class" becomes

"I/PRP iron/VBP my/PRP$ shirt/NN with/IN class/NN ./."

or

"I/NOUN iron/VERB my/ADJ shirt/NOUN with/ADP class/NOUN ./PUNCT"

Result: now the most relevant words w.r.t 'iron/VERB' are:

[(u'ironing/VERB', 0.818801760673523), (u'polish/VERB', 0.794084906578064), (u'smooth/VERB', 0.7590495347976685), .....

(Refer to 'Example 2' of the 'Issues' section for comparison.)

fig no. 10 shows a visualisation of Sense2Vec, and fig no. 11 shows the Python code for preparing the training dataset for Sense2Vec. These codes are available at github.

Conclusion

Word2Vec representations can hold enormous information compared to their size!
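The Sense2Vec-style preprocessing above boils down to rewriting each token as word/TAG before training. Here is a minimal sketch; in practice the tags come from a real POS tagger (spaCy, NLTK), so the `pos_lookup` dictionary below is a hypothetical stand-in used only to keep the example self-contained:

```python
def tag_tokens(tokens, pos_lookup):
    """Rewrite tokens Sense2Vec-style: 'iron' -> 'iron/VERB'.

    `pos_lookup` maps a token to its POS tag; in a real pipeline
    this would be replaced by a tagger's per-token output.
    Unknown tokens fall back to 'NOUN' here, purely for the demo.
    """
    return ["{}/{}".format(tok, pos_lookup.get(tok, "NOUN")) for tok in tokens]

# Toy tag table standing in for a real tagger's output
toy_tags = {"I": "PRON", "iron": "VERB", "my": "ADJ", "shirt": "NOUN"}
print(" ".join(tag_tokens("I iron my shirt".split(), toy_tags)))
# → I/PRON iron/VERB my/ADJ shirt/NOUN
```

Because 'iron/VERB' and 'iron/NOUN' are now distinct tokens, Word2Vec learns a separate vector for each sense, which is why the verb neighbours ('ironing/VERB', 'polish/VERB', 'smooth/VERB') finally surface.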
They can learn both semantics and syntax.
The one problem is generalisation over multiple contexts, but that too can be tackled with additional modification of the training text.
They are computation friendly, as they are all arrays of numbers.
Relationships between vectors can be discovered with just linear algebra.

Next: Word2Vec (Part 2): Use Cases
Prev: Natural Language Processing (NLP)