Natural language processing (NLP) is an area of computer science and artificial intelligence concerned with the interaction between computers and humans in natural language. The goal is to enable systems to understand language as well as we do. It is the driving force behind NLP products and techniques like virtual assistants, speech recognition, machine translation, sentiment analysis, automatic text summarization, and much more. In this post, we'll work on a word embedding technique called Word2Vec using the Gensim framework.
Word embedding is an NLP technique capable of capturing the context of a word in a document, its semantic and syntactic similarity, its relation with other words, and so on. In general, word embeddings are vector representations of particular words. There are many techniques to create word embeddings; we'll discuss the different ones in future posts. For now, we'll stick with Word2Vec.
Word2Vec is one of the most widely used models to produce word embeddings. These models are shallow, two-layer neural networks that are trained to reconstruct the linguistic contexts of words. Word2Vec can be implemented in two ways: one is Skip-Gram and the other is Continuous Bag of Words (CBOW).
CBOW learns to predict a word from its context. Here the input is the context (the neighboring words) and the output is the target word. The number of words taken as context is determined by a parameter called the "window size".
Example: The quick brown fox jumps over the lazy dog (yes, the same example :-))
Model: CBOW
Input layer: the surrounding context words
Target layer: the center word
Window size: 5
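To make this concrete, here's a small pure-Python sketch (not from the original post) that lists the (context -> target) pairs CBOW would train on for the example sentence, taking two neighboring words on each side of the target for a compact five-word span (note that in Gensim itself, window counts the words on each side):

sentence = "The quick brown fox jumps over the lazy dog".lower().split()
window = 2  # two words on each side of the target

for i, target in enumerate(sentence):
    # the neighboring words form the input, the center word is the output
    context = sentence[max(0, i - window):i] + sentence[i + 1:i + 1 + window]
    print(context, '->', target)

# e.g. ['the', 'quick', 'fox', 'jumps'] -> brown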
Skip-Gram learns to predict the context from a word. Here the input is the word and the output is the target context (the neighboring words). The number of words taken as context is again determined by the "window size" parameter.
Example: The quick brown fox jumps over the lazy dog (yes, the same example :-))
Model: Skip-Gram
Input layer: the center word
Target layer: the surrounding context words
Window size: 5
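The same sketch flipped for Skip-Gram (again, an illustration rather than the original post's code): each word is the input, and every neighboring word becomes a training target. In Gensim, the choice between the two architectures is made with the sg parameter (sg=0 for CBOW, which is the default, and sg=1 for Skip-Gram).

sentence = "The quick brown fox jumps over the lazy dog".lower().split()
window = 2

for i, target in enumerate(sentence):
    # the center word is the input, each neighboring word is an output
    for context_word in sentence[max(0, i - window):i] + sentence[i + 1:i + 1 + window]:
        print(target, '->', context_word)

# e.g. brown -> the, brown -> quick, brown -> fox, brown -> jumps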
Word2Vec Embeddings
INPUT CORPUS
1. this is the first sentence for word2vec
2. this is the second sentence
3. yet another sentence
4. one more sentence
5. and the final sentence
That's it for the explanation; I believe you now have some understanding of Word2Vec. If not, don't worry! You'll get a clearer idea after going through the example below. Let's dive into some Python 🐍.
Let's Add Some Python
As we discussed earlier, we'll implement Word2Vec using the Gensim framework in Python. Gensim is a robust open-source vector space modeling and topic modeling toolkit implemented in Python. It uses NumPy, SciPy, and optionally Cython for performance.
First, we import the requirements and define our text corpus.
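A minimal sketch of this setup step (Gensim can be installed with pip install gensim; the variable names here are my own, not necessarily those of the original gist):

from gensim.models import Word2Vec
from sklearn.decomposition import PCA
from matplotlib import pyplot

# Gensim expects pre-tokenized input: each sentence is a list of words.
sentences = [
    ['this', 'is', 'the', 'first', 'sentence', 'for', 'word2vec'],
    ['this', 'is', 'the', 'second', 'sentence'],
    ['yet', 'another', 'sentence'],
    ['one', 'more', 'sentence'],
    ['and', 'the', 'final', 'sentence'],
]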
In this step, we define our Word2Vec model.
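Something along these lines (the parameter values are illustrative assumptions; in Gensim 4.0+ the size parameter is called vector_size):

# train a basic Word2Vec model on the tiny corpus
model_1 = Word2Vec(sentences, size=100, window=5, min_count=1)

# the learned vocabulary
words = list(model_1.wv.vocab)  # in Gensim 4.0+: model_1.wv.index_to_key
print(words)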
Here, size is the dimensionality of the vectors: the higher the size, the denser the embeddings (ideally, size should be lower than the vocabulary length, since using a higher dimensionality than the vocabulary size would more or less guarantee overfitting). min_count drops words whose total frequency is lower than this value before training occurs (as we have only a few lines of input corpus, we keep every word). A PCA model is then used to reduce the n-dimensional vector of each word in our vocabulary to a 2-D vector (we do this to plot and visualize our results).
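A sketch of that reduction step, reusing the model and words defined above:

# collect the vector of every word in the vocabulary into one matrix
X = model_1.wv[words]  # shape: (vocabulary size, size)

# project the vectors down to 2 dimensions for plotting
pca = PCA(n_components=2)
result = pca.fit_transform(X)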
We'll use a scatter plot in matplotlib to plot the word embeddings.
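Roughly like this:

# scatter the 2-D projections and label each point with its word
pyplot.scatter(result[:, 0], result[:, 1])
for i, word in enumerate(words):
    pyplot.annotate(word, xy=(result[i, 0], result[i, 1]))
pyplot.show()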
As we can see, the word embeddings are mapped relative to each other. Note that this result was obtained using the pre-trained model; for simplicity, only the basic model is discussed here. The original model can be found here.
Using these word embeddings, we can find the similarity between words in our corpus.
>>> model_1.most_similar(positive=['first'], topn=1)
[('second', 0.8512464761734009)]
The most_similar function finds the words in our embeddings most similar to the target word.
>>> model_1.similarity('one', 'another')
0.80782
Here we computed the similarity between two words in our embeddings. There are many more useful functions like these to work with; they can be found here.
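For instance, a couple of other helpers worth trying (a sketch; the exact outputs depend on the training run, and on older Gensim versions these are also reachable directly on the model, as in the snippets above):

# words similar to 'sentence' but dissimilar to 'first'
print(model_1.wv.most_similar(positive=['sentence'], negative=['first'], topn=2))

# pick the word that least belongs in the list
print(model_1.wv.doesnt_match(['first', 'second', 'another']))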
We'll end here; I hope this has given you some introduction to Word2Vec embeddings. Check out my other work here.
If you think so too, we're on the same page. Let's connect on Medium, LinkedIn, Facebook.