Word2Vec (Part 1)

Written by mukulmalik | Published 2016/10/15


Word2Vec: the Steroids for Natural Language Processing

Let’s start with the Basics.

Word Vectors

Q) What are word vectors?

Ans) A representation of words as numbers.

Q) Why Word Vectors?

Ans) I’ll sum it up with three main reasons:

1. Computers cannot do computations on strings.

2. Strings don’t hold much explicit information themselves.

3. Word vectors are usually dense vector representations.

Q) So what is Explicit information?

Ans) Information about what a word represents in real life. The word itself doesn’t carry much of it. Example:

The string “cat” just tells us it is made up of the three letters “c”, “a” and “t”.

It has no information about the animal it represents or the count or the context in which it is being used.

Q) Dense Vector Representation?

Ans) Short answer (for now): these vectors can hold an enormous amount of information compared to their size.

Q) Types of Word Vectors?

Ans) There are two main categories:

  • Frequency based : Use statistics to compute the probability of a word co-occurring with its neighbouring words (see the sketch after this list).
  • Prediction based : Use predictive analysis to make a weighted guess about a word co-occurring with its neighbouring words.
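
To make the frequency-based idea concrete, here is a tiny sketch (my own toy example, not part of the original article) that counts co-occurrences within a window; real systems do this over huge corpora and then apply a weighting such as PMI:

    from collections import Counter

    # Toy corpus; frequency-based vectors are normally built from millions of sentences.
    corpus = [
        "newton does not like apples".split(),
        "apples fell on newton".split(),
    ]

    window = 2          # how many words on each side count as "co-occurring"
    cooc = Counter()    # (word, neighbour) -> co-occurrence count

    for sentence in corpus:
        for i, word in enumerate(sentence):
            for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
                if i != j:
                    cooc[(word, sentence[j])] += 1

    # The row of counts for "newton" is its (unweighted) frequency-based vector.
    vocab = sorted({w for s in corpus for w in s})
    print([cooc[("newton", w)] for w in vocab])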

Predictions can be of two types:

  • Semantic : Try to guess a word w.r.t. the context of the text
  • Syntactic : Try to guess a word w.r.t. the syntax of the text

Q) What is the difference between syntax-based and context-based vectors?

Ans) Let’s look at an example:

Consider the following sentence “Newton does not like apples.”

  • Semantic Vectors are concerned with ‘Who or what are the entities the text is about?’ In this case, “Newton” and “apples”.
  • Syntactic Vectors are concerned with ‘What is being said about those entities?’ In this case, “not” and “like”.

Word2Vec

Word2Vec is one of the most widely used forms of word vector representation. It was first introduced by Mikolov et al. at Google.

It has two variants:

  1. CBOW (Continuous Bag of Words) : This model tries to predict a word on the basis of its neighbours.

  2. SkipGram : This model tries to predict the neighbours of a word.

Statistically it has the effect that CBOW smoothes over a lot of the distributional information (by treating an entire context as one observation). For the most part, this turns out to be a useful thing for smaller datasets. However, skip-gram treats each context-target pair as a new observation, and this tends to do better when we have larger datasets.

- TensorFlow

In simpler words, CBOW tends to find the probability of a word occurring in a neighbourhood (context). So it generalises over all the different contexts in which a word can be used.

SkipGram, on the other hand, tends to learn the different contexts separately, so it needs enough data for each context. Hence SkipGram requires more data to train, but (given enough data) it captures more knowledge about each context.

NOTE : These techniques do not need a tagged dataset (though tags can be used to include additional information, as we’ll see later). Any large text corpus is effectively a dataset, since the labels to be predicted are words already present in the text.

We will focus on SkipGram, as large enough datasets (Wikipedia, Reddit, Stack Overflow etc.) are available for download.

SkipGram

First we decide what context we are looking for: which words will be our target words (to be predicted), which will be our source words (on the basis of which we predict), and how far we look for context (the window size).

Example:

Consider a window size of 3.

Type 1

The middle word is the source word; the previous and next words are the target words.

fig no. 1

fig no. 2

Type 2

The first word is the source word; the following two words are the target words.

fig no. 3

fig no. 4
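
To make the two schemes concrete, here is a small sketch (my own illustration; the sentence and window size are arbitrary) that prints the (source, target) training pairs SkipGram would see under each type:

    def type1_pairs(tokens, window=3):
        """Type 1: the middle word of each window is the source; the rest are targets."""
        pairs = []
        for i in range(len(tokens) - window + 1):
            chunk = tokens[i:i + window]
            mid = window // 2
            pairs += [(chunk[mid], t) for j, t in enumerate(chunk) if j != mid]
        return pairs

    def type2_pairs(tokens, window=3):
        """Type 2: the first word of each window is the source; the following words are targets."""
        pairs = []
        for i in range(len(tokens) - window + 1):
            chunk = tokens[i:i + window]
            pairs += [(chunk[0], t) for t in chunk[1:]]
        return pairs

    tokens = "apple fell on newton".split()
    print(type1_pairs(tokens))  # [('fell', 'apple'), ('fell', 'on'), ('on', 'fell'), ('on', 'newton')]
    print(type2_pairs(tokens))  # [('apple', 'fell'), ('apple', 'on'), ('fell', 'on'), ('fell', 'newton')]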

In both types, the source word is surrounded by words that are relevant to some context of that source word. ‘Messi’, for example, will usually be surrounded by words related to ‘Football’. So after seeing a few examples, the word vector of ‘Messi’ will start incorporating context related to ‘Football’, ‘Goals’, ‘Matches’ etc.

In case of ‘Apple’, its word vector would do the same but for both the company and the fruit (see fig no. 6).

fig no. 5 Word2Vec’s Neural Network

W1(s) and W2(s) contain information about the words. The information in W1 and W2 is combined/averaged to obtain the Word2Vec representations.
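
As a toy sketch of that combining step (the shapes, random weights and averaging choice are placeholders of my own; many implementations simply keep W1):

    import numpy as np

    vocab_size, dim = 10000, 400

    # W1: input (embedding) weights, one row per word; W2: output weights, one column per word.
    W1 = np.random.randn(vocab_size, dim).astype(np.float32)   # stand-ins for trained weights
    W2 = np.random.randn(dim, vocab_size).astype(np.float32)

    # One way to "combine": average each word's input row with its transposed output column.
    word_vectors = (W1 + W2.T) / 2.0

    apple_index = 42                        # made-up vocabulary index for 'apple'
    print(word_vectors[apple_index].shape)  # (400,)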

Say the size of the W(s) is 400; the Word2Vec representation of ‘apple’ would look something like:

array([-2.56660223e-01, -7.96796158e-02, -2.04517767e-02,
       -7.34366626e-02,  3.26843783e-02, -1.90244913e-02,
        7.93217495e-02,  4.07200940e-02, -1.74737453e-01,
       .....
        1.86899990e-01, -4.33036387e-02, -2.66942739e-01,
       -1.00671440e-01], dtype=float32)

Now a simple sentence like “Apple fell on Newton”, containing 4 words, can with the help of Word2Vec be converted into 4*400 (1600) numbers, each[1] carrying explicit information. So now we also know that the text is talking about a person, science, fruit etc.

[1] : hence Dense Vector representation
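
With a trained Gensim model (the file name here is hypothetical), that conversion is just a lookup per word:

    import numpy as np
    from gensim.models import Word2Vec

    model = Word2Vec.load("word2vec.model")   # hypothetical model trained with 400 dimensions

    sentence = "apple fell on newton".split()
    matrix = np.stack([model.wv[word] for word in sentence])  # one 400-dim row per word
    print(matrix.shape)                       # (4, 400) -> 1600 numbers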

Visualisation

Visualising Word2Vec vectors directly is currently impossible for mankind (because of their high dimensionality, e.g. 400). Instead we use dimensionality reduction techniques like multidimensional scaling, Sammon’s mapping, nearest-neighbour graphs etc.

The most widely used algorithm is t-Distributed Stochastic Neighbour Embedding (t-SNE). Christopher Olah has an amazing blog post about dimensionality reduction.
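
A minimal sketch of that reduction with scikit-learn’s t-SNE (the model file and word list are placeholders of my own):

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.manifold import TSNE
    from gensim.models import Word2Vec

    model = Word2Vec.load("word2vec.model")            # hypothetical trained model
    words = ["apple", "mango", "ibm", "microsoft"]     # example words to plot
    vectors = np.array([model.wv[w] for w in words])   # one high-dimensional vector per word

    # Project the 400-dimensional vectors down to 2 dimensions for plotting.
    points = TSNE(n_components=2, perplexity=2, random_state=0).fit_transform(vectors)

    for (x, y), word in zip(points, words):
        plt.scatter(x, y)
        plt.annotate(word, (x, y))
    plt.show()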

The end result of t-SNE on Word2Vec looks something like

fig no. 6 Multiple contexts of Apple

This figure shows that Apple lies between Companies (IBM, Microsoft) and Fruits (Mango).

That’s because the Word2Vec representation of Apple contains information about both the Company Apple and the Fruit Apple.

Distance Between

  • Apple and Mango : 0.505
  • Apple and IBM : 0.554
  • Mango and IBM : 0.902
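
These read like cosine distances (1 minus the cosine similarity); assuming that metric, they can be reproduced with Gensim roughly like this (model file hypothetical):

    from gensim.models import Word2Vec

    model = Word2Vec.load("word2vec.model")   # hypothetical trained model

    for a, b in [("apple", "mango"), ("apple", "ibm"), ("mango", "ibm")]:
        distance = 1.0 - model.wv.similarity(a, b)   # cosine distance between the two words
        print(a, b, round(float(distance), 3))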

And

fig no. 7 Combining contexts of multiple words

This figure shows that by combining the directions of the two vectors ‘State’ and ‘America’, we get a resultant vector that lands near ‘Dakota’, which is relevant to both of the original vectors.

So effectively

  • State + America = Dakota
  • State + Germany = Bavaria

Other examples are :

  • German + Airlines = Lufthansa
  • King + Woman - Man = Queen
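
In Gensim, such vector arithmetic is done with most_similar, passing the added words as positive and the subtracted ones as negative (a sketch; the model file is hypothetical):

    from gensim.models import Word2Vec

    model = Word2Vec.load("word2vec.model")   # hypothetical trained model

    # "State + America": which words lie closest to the sum of the two vectors?
    print(model.wv.most_similar(positive=["state", "america"], topn=3))

    # "King + Woman - Man": the classic analogy query.
    print(model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))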

Implementations

Gensim and TensorFlow both have pretty impressive implementations of Word2Vec.

There is an excellent blog post about Gensim’s implementation, and TensorFlow has a tutorial.
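
For reference, a bare-bones SkipGram training run with Gensim looks roughly like this (parameter names follow recent Gensim releases, which renamed size to vector_size; the toy corpus is mine):

    from gensim.models import Word2Vec

    # Each training example is a tokenised sentence; a real corpus would be Wikipedia-scale.
    sentences = [
        ["apple", "fell", "on", "newton"],
        ["newton", "does", "not", "like", "apples"],
    ]

    model = Word2Vec(
        sentences,
        vector_size=400,   # dimensionality of the word vectors (`size` in Gensim < 4.0)
        window=3,          # how many neighbouring words count as context
        sg=1,              # 1 = SkipGram, 0 = CBOW
        min_count=1,       # keep every word; raise this for real corpora
    )

    print(model.wv["apple"][:5])   # first few dimensions of the learned vector
    model.save("word2vec.model")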

Issues

By default, a Word2Vec model has one representation per word. A single vector can try to accumulate all contexts, but that ends up generalising over them to at least some extent, so the precision of each context is compromised. This is especially a problem for words which have very different contexts, and it can lead to one context overpowering the others.

For example: there will be only one Word2Vec representation for ‘apple’ the company and ‘apple’ the fruit.

Example 1:

‘maiden’ can be used for a woman, for a band (Iron Maiden), in sports, etc.

When you try to find the most similar words to ‘maiden’:

[(u'odi_debut', 0.43079674243927),(u'racecourse_debut', 0.42960068583488464),.....(u'marathon_debut', 0.40903717279434204),(u'one_day_debut', 0.40729495882987976),(u'test_match_debut', 0.4013477563858032)]

It is clearly visible that the context related to ‘sports’ has overpowered the others.

Even combining ‘iron’ and ‘maiden’ doesn’t resolve the issue, as now the context of ‘iron’ overpowers.

[(u'steel', 0.5581518411636353),(u'copper', 0.5266575217247009),.....(u'bar_iron', 0.49549400806427)]

Example 2

A word can be used as both a verb and a noun, with completely different meanings. Take the word ‘iron’: as a verb it usually means smoothing things with an electric iron, but as a noun it mostly denotes the metal.

When we find the nearest neighbours of ‘iron’

[(u'steel', 0.5581518411636353),(u'copper', 0.5266575217247009),.....(u'bar_iron', 0.49549400806427)]

There is negligible reference to the verb sense.

Variants

Variant 1 : Compound Noun based Word2Vec

We replace nouns (like ‘iron’ and ‘maiden’) with compound nouns (like ‘iron_maiden’) in the training set.

“Iron Maiden is an amazing band”

becomes

“Iron_Maiden is an amazing band”

So the context of the compound noun stands out and is remarkably accurate!

Result:

Most relevant words w.r.t. ‘iron_maiden’ are:

[(u'judas_priest', 0.8176089525222778),(u'black_sabbath', 0.7859792709350586),(u'megadeth', 0.7748109102249146),(u'metallica', 0.7701393961906433),.....

That’s hardcore, literally!

Here is Python code for converting nouns into compound nouns (with adjective-noun pairing as well), in order to create the training set for Word2Vec.

fig no. 8

This Python code converts nouns into compound nouns (noun-noun pairing only).

fig no. 9
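
Since the code in the figures is only available as images, here is a rough sketch of the noun-noun pairing idea using NLTK’s POS tagger (my own illustration, not the article’s original code):

    import nltk  # needs NLTK's tokenizer and tagger data (e.g. 'punkt', 'averaged_perceptron_tagger')

    def compound_nouns(sentence):
        """Join consecutive nouns (tags starting with NN) with '_' so they become one token."""
        tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
        out, run = [], []
        for word, tag in tagged:
            if tag.startswith("NN"):
                run.append(word)
            else:
                if run:
                    out.append("_".join(run))
                    run = []
                out.append(word)
        if run:
            out.append("_".join(run))
        return " ".join(out)

    print(compound_nouns("Iron Maiden is an amazing band"))
    # expected: "Iron_Maiden is an amazing band"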

Variant 2 : Sense2Vec

(Note : a non-NER implementation of Sense2Vec)

This takes the above-mentioned variant one step further by adding Part-Of-Speech (POS) tags to the training set.

Example:

“I iron my shirt with class”

becomes

“I/PRP iron/VBP my/PRP$ shirt/NN with/IN class/NN ./.”

Or

“I/NOUN iron/VERB my/ADJ shirt/NOUN with/ADP class/NOUN ./PUNCT”

Result:

Now the most relevant words w.r.t. ‘iron/VERB’ are:

[(u'ironing/VERB', 0.818801760673523),(u'polish/VERB', 0.794084906578064),(u'smooth/VERB', 0.7590495347976685),.....

(Refer ‘Example 2’ of ‘Issues’ section for comparison)

Below is a visualisation of Sense2Vec.

fig no. 10

Below is the Python code for preparing the training dataset for Sense2Vec.

fig no. 11
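
That code is also only shown as a figure; a minimal sketch of the same idea with NLTK (this produces Penn Treebank tags, as in the first tagged example above, rather than spaCy-style coarse tags):

    import nltk  # needs NLTK's tokenizer and tagger data

    def append_pos(sentence):
        """Turn each word into word/TAG so that different senses become different tokens."""
        return " ".join(
            "{}/{}".format(word, tag)
            for word, tag in nltk.pos_tag(nltk.word_tokenize(sentence))
        )

    print(append_pos("I iron my shirt with class"))
    # expected (matching the article's example): I/PRP iron/VBP my/PRP$ shirt/NN with/IN class/NN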

This code is available on GitHub.

Conclusion

  • Word2Vec vectors can hold an enormous amount of information compared to their size!
  • They can learn both semantics and syntax
  • The one problem is generalisation over multiple contexts, but that too can be tackled with additional modification of the training text
  • They are computation friendly, as they are just arrays of numbers
  • Relationships between vectors can be discovered with just linear algebra

Next : Word2Vec (Part 2) Use Cases

Prev : Natural Language Processing (NLP)

