About the Authors: This paper was published by a group of researchers from FAIR (Facebook AI Research). The original authors are Piotr Bojanowski, Edouard Grave, Armand Joulin and Tomas Mikolov. The ready-to-run code for this paper is available here on Google Colab.

The Basic Idea behind Word Vectors: For most Natural Language Processing tasks like text classification, text summarization and text generation, we need to perform various computations in order to achieve maximum precision on these tasks. In order to perform these computations, we need a numerical representation for the various components of language, like words, sentences and syllables. We assign multi-dimensional vectors to the words of the language to get a vector-based representation of the language.

IMG SRC: deeplearning.ai by Andrew Ng

Each dimension of such a vector can capture a different characteristic or feature of the word. Word vectors are also referred to as word embeddings in the Natural Language Processing community. You can learn more about word vectors here.

Earlier Work: Over the last two decades, many approaches to calculating word vectors were introduced. Many of them used computationally expensive models to calculate the word embeddings. In 2013, Tomas Mikolov et al. introduced a simpler and more efficient way of calculating word embeddings using a computationally cheaper model: word2vec, with the skip-gram or CBOW approach. This paper is an extension of Mikolov's word2vec skip-gram model.

The Basic Idea behind Word Vectors with Subword Information: Mikolov's word2vec model provided an excellent set of word embeddings for most of the large publicly available datasets. The one huge limitation of the word2vec model was that it ignored the morphological structure of the words and assigned features based only on the semantic context of the word. The major source of learning for Mikolov's word2vec model was the external neighbours of a word. This also limited it to learning word vectors only for the words present in the vocabulary. So, in languages like Turkish, German and Czech, where the internal morphological structure of words holds certain importance, the word2vec model failed to capture all the features of the available text data. This limitation is addressed in this paper by using subword information to capture the information contained in the morphological structure of words.

GOALS: Our goals can be listed as follows:
- For a vocabulary of size W, we need to learn a vector representation for every word w.
- The word vector representation should consider the context of the surrounding words.
- The word vector representation should also consider the internal morphological structure of the word.

APPROACH: The first thing to be done here is setting up a word dictionary with an index for every word. Next, we define a window size which gives us the context words for every target word. For example, with a window size of 2, for the target word "army" in the sentences "I have an army. We have a hulk", the context words would be: ("have", "an", "we", "have"). A small sketch of this step is given below.
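To make the windowing step concrete, here is a minimal Python sketch, assuming a lowercased, whitespace-tokenized corpus; build_vocab and context_words are illustrative helper names and are not part of the authors' released code.

```python
# Minimal sketch: build a word dictionary and collect the context words
# of a target word for a fixed window size (here, 2).

def build_vocab(tokens):
    """Assign an integer index to every distinct word."""
    vocab = {}
    for token in tokens:
        if token not in vocab:
            vocab[token] = len(vocab)
    return vocab

def context_words(tokens, target_index, window=2):
    """Return the words within `window` positions of the target word."""
    left = tokens[max(0, target_index - window):target_index]
    right = tokens[target_index + 1:target_index + 1 + window]
    return left + right

tokens = "i have an army we have a hulk".split()
vocab = build_vocab(tokens)

print(vocab["army"])                                # index of 'army' in the dictionary
print(context_words(tokens, tokens.index("army")))  # ['have', 'an', 'we', 'have']
```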
The next step is to calculate a score for the context between words. Let's assume a function S(w, c) → R that numerically scores the context between the target word <w> and the context word <c>, where <w> and <c> are the vector representations of these words and R is the single-dimensional space of real numbers.

Further, we can define the probability of <c> occurring as a context word as a softmax function as follows:

p(Wc | Wt) = exp(S(Wt, Wc)) / Σ_{j = 1..W} exp(S(Wt, j))

Here <Wc> and <Wt> are the vectors representing the context and target words, and W is the vocabulary size. But this probability distribution only considers the probability of occurrence of a single context word against the whole vocabulary, so it cannot be used directly as our objective function to train the model. Predicting context words can instead be seen as a set of independent binary classification problems. Therefore, we use the binary logistic loss with negative sampling to get the following negative log-likelihood as the objective function:

Σ_{t = 1..T} [ Σ_{c ∈ Ct} log(1 + exp(-S(Wt, Wc))) + Σ_{n ∈ Nt,c} log(1 + exp(S(Wt, n))) ]

where Ct is the set of context words of the target word Wt and Nt,c is a set of negative examples sampled from the vocabulary. This is the final objective function for training.

Now, to parameterize the model, we define the context score S(w, c) as the scalar product between the vectors <w> and <c>. Up to this point, the approach is exactly the same as the one given by Mikolov. Clearly, this approach only considers the neighbouring context words for calculating the features and completely ignores the structure of the word itself.

The Subword Model: To address this issue, we represent the word "w" as a bag of all possible character n-grams in the word. The word is padded with a pair of unique symbols, the angled brackets: <WORD>. This helps in distinguishing suffixes and prefixes from the rest of the character sequences. We also add the complete padded word as a sequence to this bag of n-grams.

EXAMPLE: For n = 3, the word "where" gives us the following n-grams bag, represented as Gw:

Gw = [ <wh, whe, her, ere, re>, <where> ]

Now, every character sequence g in the n-grams bag is denoted by a vector <z_g>, and the word vector <w> is given by the sum of the vectors of all the n-grams in that word. We also change the context score function accordingly:

S(w, c) = Σ_{g ∈ Gw} <z_g> ⋅ <c>

This allows the sharing of representations across words, thus allowing the model to learn reliable representations for rare words. This way, the subword information is utilized in the learning process for calculating word embeddings.

This model can be costlier memory-wise. So, the authors use the Fowler-Noll-Vo hashing function (FNV-1a variant) to hash the character sequences. Ultimately, a word is represented by its index in the word dictionary and the set of hashed n-grams it contains.
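To make the subword model concrete, here is a small Python sketch (not the authors' implementation): it extracts the padded character 3-grams of a word, hashes each n-gram into a fixed number of buckets with FNV-1a, and builds the word vector as the sum of the hashed n-gram vectors. The helper names, the bucket count and the randomly initialised embedding table are all illustrative assumptions; in the real model the table is much larger and its rows are learned during training.

```python
# Small sketch of the subword model: padded character n-grams, FNV-1a
# hashing of the n-grams, and a word vector built from the n-gram vectors.

import numpy as np

def char_ngrams(word, n=3):
    """Pad the word with '<' and '>' and return its character n-grams plus
    the padded word itself, e.g. 'where' -> ['<wh', 'whe', 'her', 'ere',
    're>', '<where>']."""
    padded = f"<{word}>"
    grams = [padded[i:i + n] for i in range(len(padded) - n + 1)]
    grams.append(padded)
    return grams

def fnv1a_hash(text, num_buckets):
    """32-bit FNV-1a hash of an n-gram, mapped to a bucket index."""
    h = 0x811C9DC5
    for byte in text.encode("utf-8"):
        h ^= byte
        h = (h * 0x01000193) & 0xFFFFFFFF
    return h % num_buckets

# Illustrative (untrained) embedding table: one row per hashed n-gram bucket.
num_buckets, dim = 100_000, 100
rng = np.random.default_rng(0)
ngram_vectors = rng.normal(scale=0.1, size=(num_buckets, dim))

def word_vector(word):
    """Sum the vectors of the word's hashed n-grams. Because the vector is
    assembled from n-grams, this also works for out-of-vocabulary words."""
    rows = [fnv1a_hash(g, num_buckets) for g in char_ngrams(word)]
    return ngram_vectors[rows].sum(axis=0)

def score(word, context_vector):
    """Context score S(w, c): dot product of the word and context vectors."""
    return word_vector(word) @ context_vector

print(char_ngrams("where"))                # ['<wh', 'whe', 'her', 'ere', 're>', '<where>']
print(word_vector("preadolescent").shape)  # (100,) even though the word is OOV
```

Note that only the word's dictionary index and its set of hashed n-grams need to be stored per word, which is what keeps the memory cost bounded.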
Optimization: Given the negative log-likelihood as the objective function defined before, we can minimize it (and hence maximize the likelihood) with an optimization method. The authors have used Stochastic Gradient Descent with linear learning-rate decay for all their experiments. The same optimization method was used by Mikolov et al. in their word2vec model.

Implementation Details: The authors have tried to maintain maximum similarity between their approach and Mikolov et al.'s approach.
- Word vector size = 300
- Negative sampling size = 5 negative samples per positive sample
- Word rejection criteria = if a word occurs fewer than 5 times in the corpus, it is removed from the dictionary
This implementation is 1.5x slower than Mikolov et al.'s skip-gram implementation.

EXPERIMENTS:
Datasets: Wikipedia dumps in nine languages: Arabic, Czech, German, English, Spanish, French, Italian, Romanian and Russian.
Human similarity judgement: Computing Spearman's rank correlation coefficient between human judgement and the cosine similarity of the vector representations.
Word analogy tasks: Performance here is tightly related to the choice of the length of the character n-grams that we consider. The analogy accuracy is split into Semantic (meaning-based) and Syntactic (syntax- and grammar-based) accuracy.
Other Experiments: Further, the authors also experiment with language modelling tasks, n-gram size variation, morphemes and qualitative analysis.

CONCLUSION: Some very interesting results can be drawn, which completely justify the inclusion of subword information for generating word vectors.

Effect of Training Size: Since we exploit character-level similarities between words, we are able to better model infrequent words. Therefore, we should also be more robust to the size of the training data that we use.

Effect of training size on the performance of the model: SISG performs better with smaller training data sizes.

Word similarity for OOV words: Our model is capable of building word vectors for words that do not appear in the training set. For such words, we simply average the vector representations of their n-grams. Some of the interesting out-of-vocabulary vector predictions (out-of-vocabulary words as queries) can be explained as follows:
- 'rarity' & 'scarceness': 'scarce' roughly matches 'rarity', while the suffix '-ness' matches '-ity' very well.
- 'preadolescent' & 'young': due to the subwords 'pre-' and '-adolescent', semantic similarity is obtained with the word 'young'.

REFERENCES:
Enriching Word Vectors with Subword Information (Bojanowski et al., TACL 2017) | Original Paper
Mikolov et al., word2vec | Original Paper