Word2Vec: the Steroids for Natural Language Processing
Let’s start with the Basics.
Q) What are word vectors?
Ans) Representation of words with numbers.
Q) Why Word Vectors?
Ans) I’ll sum it up with three main reasons:
1. Computers cannot do computations on strings.
2. Strings don’t hold much explicit information themselves.
3. Word vectors are usually dense vector representations.
Q) So what is Explicit information?
Ans) The word itself doesn’t say much about what it represents in real life. Example:
The string “cat” just tells us it has three letters “c”, “a” and “t”.
It carries no information about the animal it represents, its count, or the context in which it is being used.
Q) Dense Vector Representation?
Ans) Short answer (for now): these vectors can hold an enormous amount of information compared to their size.
Q) Types of Word Vectors?
Ans) There are two main categories: frequency-based (count) vectors and prediction-based vectors.
Predictions can be of two types: syntax-based and context-based.
Q) Difference between syntax and context based vectors?
Ans) Let’s look at an example:
Consider the following sentence: “Newton does not like apples.”
A syntax-based vector would capture grammatical information, e.g. that ‘Newton’ is the subject and ‘apples’ is the object, whereas a context-based vector would capture the fact that ‘Newton’ occurs near words like ‘apples’.
Word2Vec is one of the most widely used forms of word vector representation. It was first introduced by Google in Mikolov et al.
It has two variants:
1. CBOW (Continuous Bag Of Words): This model tries to predict a word given its neighbours (context).
2. SkipGram: This model tries to predict the neighbours of a word.
Statistically it has the effect that CBOW smoothes over a lot of the distributional information (by treating an entire context as one observation). For the most part, this turns out to be a useful thing for smaller datasets. However, skip-gram treats each context-target pair as a new observation, and this tends to do better when we have larger datasets.
- TensorFlow
In simpler words, CBOW tends to find the probability of a word occurring in a neighbourhood (context). So it generalises over all the different contexts in which a word can be used.
Whereas SkipGram tends to learn the different contexts separately, so it needs enough data for each context. Hence SkipGram requires more data to train, but (given enough data) it captures more knowledge about each context.
NOTE: These techniques do not need a tagged dataset (though a tagged dataset can be used to include additional information, as we’ll see later). Any large text corpus is effectively a dataset, since the labels to be predicted are the words already present in the text.
We will focus on SkipGram as large enough datasets (Wikipedia, Reddit, Stackoverflow etc.) are available for download.
First we decide what context we are looking for: which words will be our target words (to be predicted), which will be our source words (on the basis of which we predict), and how far we look for context (the size of the window).
Example:
Consider the window size to be 3.
Consider the middle word as the source word, and the previous and next words as the target words (a code sketch of this pair generation follows the figures below).
fig no. 1
fig no. 2
Consider the first word as the source word, and the following two words as the target words.
fig no. 3
fig no. 4
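To make this concrete, below is a minimal sketch (not the original pipeline) of how such (source, target) pairs could be generated for the middle-word variant, assuming naive whitespace tokenisation; the function name is illustrative.

# Illustrative sketch: generate (source, target) SkipGram training pairs,
# treating each word as the source and the words within the window around it as targets.
def skipgram_pairs(sentence, window_size=3):
    tokens = sentence.lower().split()      # naive whitespace tokenisation
    half = window_size // 2                # words to look at on each side
    for i, source in enumerate(tokens):
        for j in range(max(0, i - half), min(len(tokens), i + half + 1)):
            if j != i:
                yield (source, tokens[j])

print(list(skipgram_pairs("apple fell on newton", window_size=3)))
# [('apple', 'fell'), ('fell', 'apple'), ('fell', 'on'), ('on', 'fell'), ...]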
In both types, the source word is surrounded by words that are relevant to one of its contexts. For example, ‘Messi’ will usually be surrounded by words related to ‘Football’. So after seeing a few examples, the word vector of ‘Messi’ will start incorporating context related to ‘Football’, ‘Goals’, ‘Matches’, etc.
In the case of ‘Apple’, its word vector does the same, but for both the company and the fruit (see fig no. 6).
fig no. 5 Word2Vec’s Neural Network
The weight matrices W1 and W2 contain information about the words. The information in W1 and W2 is combined/averaged to obtain the Word2Vec representations.
Say the size of the W matrices (the embedding dimension) is 400; the Word2Vec representation of ‘apple’ would look something like:
array([-2.56660223e-01, -7.96796158e-02, -2.04517767e-02,
-7.34366626e-02, 3.26843783e-02, -1.90244913e-02,
7.93217495e-02, 4.07200940e-02, -1.74737453e-01,
.....
1.86899990e-01, -4.33036387e-02, -2.66942739e-01,
-1.00671440e-01], dtype=float32)
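For reference, a vector like the one above can be obtained with Gensim’s Word2Vec implementation. This is only a sketch: the toy corpus is a placeholder for a real one like Wikipedia, and the dimensionality parameter is called vector_size in recent Gensim versions (size in older ones).

from gensim.models import Word2Vec

# Toy corpus (placeholder): in practice, a large corpus such as Wikipedia,
# already tokenised into lists of words.
sentences = [
    ["apple", "fell", "on", "newton"],
    ["newton", "does", "not", "like", "apples"],
]

# sg=1 selects SkipGram; vector_size=400 matches the example above.
model = Word2Vec(sentences, vector_size=400, window=3, min_count=1, sg=1)

print(model.wv["apple"])        # a 400-dimensional float32 vector
print(model.wv["apple"].shape)  # (400,)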
Now a simple sentence like “Apple fell on Newton”, containing 4 words, can be converted with the help of Word2Vec into 4*400 (1600) numbers, each[1] carrying explicit information. So now we also know that the text is talking about a person, science, a fruit, etc.
[1] : hence Dense Vector representation
Visualising Word2Vec vectors directly is not feasible because of their high dimensionality (e.g. 400). Instead we use dimensionality reduction techniques like multidimensional scaling, Sammon’s mapping, nearest-neighbour graphs, etc.
The most widely used algorithm is t-Distributed Stochastic Neighbour Embedding (t-SNE). Christopher Olah has an amazing blog about dimensionality reduction.
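As an example of what such a reduction could look like in code, here is a small sketch using scikit-learn’s t-SNE and matplotlib; it assumes a trained Word2Vec model (as above) whose vocabulary contains the listed words, which are purely illustrative.

from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
import numpy as np

# Assumes `model` is a trained Gensim Word2Vec model whose vocabulary
# contains these words; the word list is illustrative.
words = ["apple", "mango", "ibm", "microsoft", "newton"]
vectors = np.array([model.wv[w] for w in words])

# Reduce the 400-dimensional vectors to 2 dimensions for plotting.
coords = TSNE(n_components=2, perplexity=2, random_state=0).fit_transform(vectors)

plt.scatter(coords[:, 0], coords[:, 1])
for (x, y), word in zip(coords, words):
    plt.annotate(word, (x, y))
plt.show()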
The end result of t-SNE on Word2Vec looks something like
fig no. 6 Multiple contexts of Apple
This figure shows that Apple lies between Companies (IBM, Microsoft) and Fruits (Mango).
That’s because the Word2Vec representation of Apple contains information about both the Company Apple and the Fruit Apple.
fig no. 7 Combining contexts of multiple words
This figure shows that by combining the directions of the two vectors ‘State’ and ‘America’, we get a resultant vector that lands near ‘Dakota’, which is relevant to both original vectors.
So effectively: ‘State’ + ‘America’ ≈ ‘Dakota’.
Other examples are the classic ‘King’ - ‘Man’ + ‘Woman’ ≈ ‘Queen’ and ‘Paris’ - ‘France’ + ‘Italy’ ≈ ‘Rome’.
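Such relations can be checked directly with Gensim’s most_similar method. This is a sketch and assumes a model trained on a large corpus (e.g. Wikipedia) that contains these words; the printed neighbours will depend on the corpus.

# Vector arithmetic with a trained Gensim model (outputs depend on the corpus).
print(model.wv.most_similar(positive=["state", "america"], topn=3))
# words like 'dakota' are expected near the combined direction

print(model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
# the classic King - Man + Woman ≈ Queen example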
Gensim and TensorFlow both have pretty impressive implementations of Word2Vec.
There is an excellent blog about Gensim’s implementation, and TensorFlow has a tutorial.
By default, the Word2Vec model has one representation per word. A single vector can try to accumulate all contexts, but that ends up generalising over them to at least some extent, so the precision of each context is compromised. This is especially a problem for words which have very different contexts, as it might lead to one context overpowering the others.
Like : There will be only one Word2Vec representation for ‘apple’ the company and ‘apple’ the fruit.
Example 1:
‘maiden’ can be used for a woman, for a band (Iron Maiden), in sports, etc.
When you try to find the most similar words to ‘maiden’:
[(u'odi_debut', 0.43079674243927),(u'racecourse_debut', 0.42960068583488464),.....(u'marathon_debut', 0.40903717279434204),(u'one_day_debut', 0.40729495882987976),(u'test_match_debut', 0.4013477563858032)]
It is clearly visible that the context related to ‘sports’ has overpowered others.
Even combining ‘iron’ and ‘maiden’ doesn’t resolve the issue, as now the context of ‘iron’ overpowers.
[(u'steel', 0.5581518411636353),(u'copper', 0.5266575217247009),.....(u'bar_iron', 0.49549400806427)]
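For reference, neighbour lists like the ones above can be obtained with Gensim (a sketch, assuming a model trained on a corpus like the one described):

print(model.wv.most_similar("maiden", topn=5))                     # the sports sense dominates
print(model.wv.most_similar(positive=["iron", "maiden"], topn=5))  # the metal sense of 'iron' dominates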
Example 2:
A word can be used as both a verb and a noun, with completely different meanings. Take the word ‘iron’: as a verb it usually means smoothing things with an electric iron, but as a noun it mostly denotes the metal.
When we find the nearest neighbours of ‘iron’
[(u'steel', 0.5581518411636353),(u'copper', 0.5266575217247009),.....(u'bar_iron', 0.49549400806427)]
There is negligible reference to the verb counterpart.
A workaround is to replace nouns (like ‘iron’ and ‘maiden’) with compound nouns (like ‘iron_maiden’) in the training set.
“Iron Maiden is an amazing band”
becomes
“Iron_Maiden is an amazing band”
So the context of the compound noun stands out and is remarkably accurate!
Result:
The most relevant words w.r.t. ‘iron_maiden’ are:
[(u'judas_priest', 0.8176089525222778),(u'black_sabbath', 0.7859792709350586),(u'megadeth', 0.7748109102249146),(u'metallica', 0.7701393961906433),.....
That’s hardcore, literally!
Here is the Python code for converting nouns into compound nouns (with adj-noun pairing as well), in order to create the training set for training Word2Vec.
fig no. 8
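The code in fig no. 8 is not reproduced here, so below is a minimal sketch of the idea using spaCy (an assumption; the original may differ): multi-word noun phrases, including adjective-noun pairs, are merged into single underscore-joined tokens before training Word2Vec.

import spacy

# Requires: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def compound_nouns(text):
    """Merge multi-word noun phrases (dropping determiners) into
    underscore-joined tokens, e.g. 'Iron Maiden' -> 'Iron_Maiden'."""
    doc = nlp(text)
    out = text
    for chunk in doc.noun_chunks:
        # keep only the noun/adjective part of the chunk (drop 'a', 'the', ...)
        words = [t.text for t in chunk if t.pos_ in ("NOUN", "PROPN", "ADJ")]
        if len(words) > 1:
            out = out.replace(" ".join(words), "_".join(words))
    return out

print(compound_nouns("Iron Maiden is an amazing band"))
# Iron_Maiden is an amazing_band   (adjective-noun pairs are merged too)

In this sketch, the noun-noun-only variant described next would simply drop ADJ from the filter.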
This Python code is for converting nouns into compound nouns (noun-noun pairing only).
fig no. 9
(Note: this is a non-NER implementation of Sense2Vec.)
This takes the above-mentioned variant one step further by adding Part Of Speech (P.O.S.) tags to the training set.
Example:
“I iron my shirt with class”
becomes
“I/PRP iron/VBP my/PRP$ shirt/NN with/IN class/NN ./.”
Or
“I/NOUN iron/VERB my/ADJ shirt/NOUN with/ADP class/NOUN ./PUNCT”
Result:
Now the most relevant words w.r.t. ‘iron/VERB’ are:
[(u'ironing/VERB', 0.818801760673523),(u'polish/VERB', 0.794084906578064),(u'smooth/VERB', 0.7590495347976685),.....
(Refer to ‘Example 2’ of the ‘Issues’ section for comparison.)
Below is a visualisation of Sense2Vec:
fig no. 10
Below is the Python code for preparing the training dataset for Sense2Vec:
fig no. 11
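The code in fig no. 11 is not reproduced here either; below is a minimal sketch of the idea using spaCy (an assumption; the original may differ): each token gets its coarse part-of-speech tag appended, so that e.g. ‘iron/VERB’ and ‘iron/NOUN’ become distinct vocabulary items for Word2Vec.

import spacy

# Requires: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def tag_sentence(text):
    """Append each token's coarse POS tag, producing Sense2Vec-style tokens."""
    doc = nlp(text)
    return " ".join(f"{token.text}/{token.pos_}" for token in doc)

print(tag_sentence("I iron my shirt with class"))
# e.g. I/PRON iron/VERB my/PRON shirt/NOUN with/ADP class/NOUN
# (exact tags depend on the spaCy version)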
The code shown in the figures above is available on GitHub.
Next : Word2Vec (Part 2) Use Cases