Enriching Word Vectors with Subword Information [PAPER SUMMARY]

Written by rajp4480 | Published 2019/02/15
Tech Story Tags: machine-learning | nlp | data-science | artificial-intelligence | deep-learning


About the Authors:

This paper was published by a group of researchers from FAIR (Facebook AI Research). The authors are Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov.

The ready-to-run code for this paper is available here on Google Colab.

The Basic Idea behind Word Vectors:

For most Natural Language Processing tasks, such as text classification, text summarization, and text generation, we need to perform various computations in order to achieve maximum precision. To perform these computations, we need numerical representations for the various components of language, such as words, sentences, and syllables.

We assign multi-dimensional vectors to represent the words in the language to get a vector-based representation of the language.

[Image: word vectors as multi-dimensional representations of words. Source: deeplearning.ai by Andrew Ng]

Each dimension of such a vector can capture different characteristics and features of the words.

Word vectors are also referred to as word embeddings in the Natural Language scientific community.

You can read more about word vectors here.

Earlier Work:

Over the last two decades, many approaches for computing word vectors have been introduced. Many of them relied on computationally expensive models to calculate the word embeddings.

In 2013, Tomas Mikolov et al. introduced a simpler and more efficient way of calculating word embeddings using a computationally cheaper model: word2vec, with the skip-gram or CBOW approach.

This paper is an extension of Mikolov’s word2vec SkipGram model.

The Basic idea behind Word Vectors with Subword Information:

Mikolov’s word2vec model provided an excellent set of word embeddings for most of the large publicly available datasets. Its one huge limitation was that it ignored the morphological structure of words and assigned features based only on the semantic context of each word; the major source of learning for word2vec was the external neighbors of a word. This also limited it to learning word vectors only for the words present in the training vocabulary.

So, in languages like Turkish, German and Czech, where the internal morphological structure of words carries important information, the word2vec model failed to capture all the features of the available text data.

This limitation has been addressed in this paper by using subword information to cover the information from morphological structures of the words.

GOALS:

Our goals can be listed as follows:

  1. For a vocabulary of size ‘W’, we need to learn a vector representation for every word ‘w’.
  2. The word vector representation should consider the context of the surrounding words.
  3. The word vector representation should also consider the internal morphological structure of the word.

APPROACH:

The first thing to be done here is setting up a word dictionary with an index for every word.

Next, we define a window size, which gives us the context words for every target word. For example, with a window size of 2 and the target word “army” in the sentence “I have an army. We have a Hulk.”, the context words would be:

( “have”, “an”, “we”, “have” )
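As a minimal sketch (not the authors' code), building the word dictionary and extracting context words with a symmetric window could look like this in Python; the tokenized sentence and the window size follow the example above:

# Minimal sketch of a word dictionary plus context-window extraction (illustrative only).
def context_words(tokens, target_index, window=2):
    # Return the words within `window` positions to the left and right of the target.
    left = tokens[max(0, target_index - window):target_index]
    right = tokens[target_index + 1:target_index + 1 + window]
    return left + right

tokens = "i have an army we have a hulk".split()
word2index = {w: i for i, w in enumerate(dict.fromkeys(tokens))}  # an index for every word
print(word2index)
print(context_words(tokens, tokens.index("army"), window=2))
# prints ['have', 'an', 'we', 'have']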

The next step is to calculate a score for the context between words. Let's assume a scoring function s(w, c) → R that numerically scores how well the context word c fits the target word w, where w and c are the vector representations of these words and R denotes the real numbers.

Further, we can define the probability of c occurring as a context word of the target word t as a softmax over these scores:

p(c | t) = exp( s(w_t, w_c) ) / Σ_{j=1..W} exp( s(w_t, j) )

Here w_t and w_c are the vectors representing the target and context words, and W is the vocabulary size.

But this probability distribution only considers the probability of occurrence of a single context word against the whole vocabulary, so it cannot be used directly as the objective function to train the model. Predicting context words can instead be seen as a set of independent binary classification problems. Therefore, we use the binary logistic loss, which gives the following negative log-likelihood as the objective function:

The final objective function for training:

Σ_{t=1..T} [ Σ_{c ∈ C_t} ℓ( s(w_t, w_c) ) + Σ_{n ∈ N_{t,c}} ℓ( −s(w_t, n) ) ]

where ℓ(x) = log(1 + e^(−x)) is the binary logistic loss, C_t is the set of context word positions around the target at position t, and N_{t,c} is the set of negative examples sampled from the vocabulary for that pair.

Now, to parameterize the model, we define the context score s(w, c) as the scalar product between the vectors of the target and context words: s(w, c) = w_t · w_c.

Up to this point, the approach is exactly the same as the one given by Mikolov. Clearly, it only considers the neighboring context words for calculating the features and completely ignores the structure of the word itself.
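To make the objective concrete, here is a minimal numpy sketch (not the authors' implementation) of the negative-sampling logistic loss for a single (target, context) pair with a dot-product score; the vector size and the random vectors are purely illustrative assumptions:

import numpy as np

def logistic_loss(x):
    # l(x) = log(1 + exp(-x)), computed stably with logaddexp
    return np.logaddexp(0.0, -x)

def pair_loss(w_t, w_c, negatives):
    # Negative-sampling loss for one (target, context) pair:
    # pull the true context word closer, push sampled negatives away.
    loss = logistic_loss(np.dot(w_t, w_c))
    for w_n in negatives:
        loss += logistic_loss(-np.dot(w_t, w_n))
    return loss

rng = np.random.default_rng(0)
dim = 300
w_t, w_c = rng.normal(size=dim), rng.normal(size=dim)
negatives = [rng.normal(size=dim) for _ in range(5)]  # 5 negative samples per positive
print(pair_loss(w_t, w_c, negatives))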

The Subword Model:

To address this issue, we represent the word “w” as a bag of all possible character n-grams in the word. The word is padded with a pair of special boundary symbols, the angle brackets, giving <WORD>. This helps distinguish prefixes and suffixes from the rest of the character sequences. We also add the complete word itself as a sequence to this bag of n-grams.

EXAMPLE: For n = 3, the word “where” gives us the following bag of n-grams, represented as Gw:

Gw = [ <wh, whe, her, ere, re>, <where> ]
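As a minimal sketch (illustrative, with n = 3 as in the example above), the n-gram bag can be built like this:

def char_ngrams(word, n=3):
    # Pad the word with the boundary symbols < and >, collect all
    # character n-grams, and add the full padded word as a special sequence.
    padded = "<" + word + ">"
    grams = [padded[i:i + n] for i in range(len(padded) - n + 1)]
    grams.append(padded)
    return grams

print(char_ngrams("where"))
# prints ['<wh', 'whe', 'her', 'ere', 're>', '<where>']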

Now, every character sequence g in the n-gram bag is denoted by a vector z_g, and the word vector w is given by the sum of the vectors of all the n-grams in that word. Accordingly, we also change the context score function to

s(w, c) = Σ_{g ∈ Gw} z_g · v_c

where v_c is the vector of the context word.

This allows the sharing of representations across words, which makes it possible to learn reliable representations for rare words. In this way, subword information is utilized while learning the word embeddings.
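A small numpy sketch of this subword scoring (the n-gram vectors here are random, purely to illustrate the sum and the dot product):

import numpy as np

def char_ngrams(word, n=3):
    padded = "<" + word + ">"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)] + [padded]

rng = np.random.default_rng(0)
dim = 300
z = {g: rng.normal(size=dim) for g in char_ngrams("where")}  # one vector per n-gram

word_vector = np.sum(list(z.values()), axis=0)  # word vector = sum of its n-gram vectors
context_vector = rng.normal(size=dim)           # vector of some context word
score = np.dot(word_vector, context_vector)     # s(w, c) = sum over n-grams of z_g . v_c
print(score)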

This model can be costly in terms of memory. So, the authors use the Fowler-Noll-Vo hashing function (the FNV-1a variant) to hash the character sequences into a fixed number of buckets. Ultimately, a word is represented by its index in the word dictionary and the set of hashed n-grams it contains.
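A sketch of the hashing step (the constants are the standard 32-bit FNV-1a parameters; the bucket count of 2,000,000 mirrors the fastText default and is an assumption here, not something stated in this summary):

def fnv1a_32(data: bytes) -> int:
    # Standard 32-bit FNV-1a hash.
    h = 0x811C9DC5                          # FNV offset basis
    for byte in data:
        h ^= byte
        h = (h * 0x01000193) & 0xFFFFFFFF   # FNV prime, truncated to 32 bits
    return h

def ngram_bucket(ngram: str, num_buckets: int = 2_000_000) -> int:
    # Map a character n-gram to one of num_buckets shared embedding rows.
    return fnv1a_32(ngram.encode("utf-8")) % num_buckets

print(ngram_bucket("<wh"), ngram_bucket("whe"))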

Optimization:

Given the negative log-likelihood objective function defined earlier, we can minimize it (and hence maximize the likelihood) with an optimization method. The authors use Stochastic Gradient Descent with linear learning-rate decay for all their experiments; the same optimization method was used by Mikolov et al. in their word2vec model.
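As a tiny illustration of linear learning-rate decay (the starting rate and token counts are assumptions, not values given in this summary):

def linear_decay(lr_start, tokens_seen, total_tokens):
    # Decay the learning rate linearly from lr_start down to 0 over training.
    return lr_start * (1.0 - tokens_seen / total_tokens)

lr_start, total_tokens = 0.05, 1_000_000
for seen in (0, 250_000, 500_000, 1_000_000):
    print(seen, round(linear_decay(lr_start, seen, total_tokens), 4))
# learning rate goes 0.05 -> 0.0375 -> 0.025 -> 0.0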

Implementation Details:

The authors have tried to maintain maximum similarity between their approach and Mikolov et al’s approach.

  1. Word vector size = 300
  2. Negative Sampling Size = 5 samples per positive sample
  3. Word rejection criterion = If a word occurs fewer than 5 times in the corpus, it is removed from the dictionary.

This implementation is 1.5x slower than Mikolov et al’s SkipGram implementation.
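For reference, here is a hedged usage sketch with the released fastText Python package (this is not the authors' exact training setup; corpus.txt is a hypothetical plain-text training file, the first three hyperparameters mirror the list above, and the n-gram length range of 3 to 6 characters is the paper's choice, not something listed in this summary):

# pip install fasttext
import fasttext

model = fasttext.train_unsupervised(
    "corpus.txt",       # hypothetical plain-text corpus
    model="skipgram",   # the skip-gram variant extended by this paper
    dim=300,            # word vector size
    neg=5,              # 5 negative samples per positive sample
    minCount=5,         # drop rare words
    minn=3,             # shortest character n-gram
    maxn=6,             # longest character n-gram
)
print(model.get_word_vector("preadolescent"))  # subwords give a vector even for OOV words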

EXPERIMENTS:

Datasets: Wikipedia dumps in nine languages: Arabic, Czech, German, English, Spanish, French, Italian, Romanian and Russian.

Human similarity judgment: computing Spearman's rank correlation coefficient between human judgment scores and the cosine similarity of the corresponding vector representations.
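A hedged sketch of this evaluation (the word pairs, human scores, and vectors below are made up purely for illustration; scipy's spearmanr computes the rank correlation):

import numpy as np
from scipy.stats import spearmanr

def cosine(u, v):
    # Cosine similarity between two word vectors.
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

rng = np.random.default_rng(0)
vectors = {w: rng.normal(size=300) for w in ["cat", "dog", "car", "truck"]}  # stand-in embeddings

pairs = [("cat", "dog"), ("car", "truck"), ("cat", "truck")]   # hypothetical word pairs
human_scores = [9.0, 8.5, 2.0]                                 # hypothetical human judgments
model_scores = [cosine(vectors[a], vectors[b]) for a, b in pairs]

rho, _ = spearmanr(human_scores, model_scores)
print(rho)  # rank correlation between human judgment and cosine similarity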

Word analogy tasks: performance here is tightly related to the choice of character n-gram lengths considered. The analogy accuracy is split into Semantic (meaning-based) and Syntactic (syntax- and grammar-based) categories.
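As a hedged sketch of how such analogies are typically scored (the standard vector-offset method with made-up vectors; this detail of the evaluation is not spelled out in the summary):

import numpy as np

rng = np.random.default_rng(0)
vocab = ["king", "queen", "man", "woman"]
vectors = {w: rng.normal(size=300) for w in vocab}   # stand-in embeddings

def analogy(a, b, c):
    # Answer "a is to b as c is to ?" with the nearest word to b - a + c.
    target = vectors[b] - vectors[a] + vectors[c]
    candidates = [w for w in vocab if w not in (a, b, c)]
    sims = {w: np.dot(target, vectors[w]) /
               (np.linalg.norm(target) * np.linalg.norm(vectors[w]))
            for w in candidates}
    return max(sims, key=sims.get)

print(analogy("man", "king", "woman"))  # would be "queen" with good embeddings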

Other Experiments: The authors also experiment with language modelling tasks, variation of the n-gram size, comparisons with morphemes, and qualitative analysis.

CONCLUSION:

Some very interesting conclusions can be drawn, which justify the inclusion of subword information when generating word vectors.

Effect of Training Size: Since we exploit character-level similarities between words, we are able to better model infrequent words. Therefore, we should also be more robust to the size of the training data that we use.

[Image: effect of training size on the performance of the model. SISG performs better with smaller training data sizes.]

Word similarity for OOV words: the model is capable of building word vectors for words that do not appear in the training set. For such words, we simply average the vector representations of their n-grams.
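A minimal sketch of this OOV lookup (the n-gram table below is random, purely to show the averaging step):

import numpy as np

def char_ngrams(word, n=3):
    padded = "<" + word + ">"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

rng = np.random.default_rng(0)
dim = 300
ngram_vectors = {}  # stand-in for the learned n-gram embedding table

def oov_vector(word):
    # Average the vectors of the word's n-grams to embed an unseen word.
    grams = char_ngrams(word)
    vectors = [ngram_vectors.setdefault(g, rng.normal(size=dim)) for g in grams]
    return np.mean(vectors, axis=0)

print(oov_vector("preadolescent")[:5])  # a vector despite never seeing the word in training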

[Image: out-of-vocabulary words as queries.]

Some of the interesting out-of-vocabulary vector predictions can be explained as follows:

  1. ‘rarity’ & ‘scarceness’: ‘scarce’ roughly matches ‘rarity’, while the suffix ‘-ness’ matches ‘-ity’ very well.
  2. ‘preadolescent’ & ‘young’: thanks to the subwords ‘pre-’ and ‘-adolescent’, semantic similarity with the word ‘young’ is obtained.

REFERENCES:

  1. Enriching Word Vectors with Subword Information | TACL 2017
  2. Original Paper
  3. Mikolov et al., word2vec | Original Paper
