Positional Embedding: The Secret behind the Accuracy of Transformer Neural Networks

Written by sanjaykn170396 | Published 2022/12/04
Tech Story Tags: artificial-intelligence | nlp | transformers | machine-learning | data-science | natural-language-processing | text-data-analytics | hackernoon-top-story

TLDR: An article explaining the intuition behind “positional embedding” in transformer models from the renowned research paper - “Attention Is All You Need”. Embedding in NLP is a process used for converting raw text into mathematical vectors, because a machine learning model cannot directly consume an input in text format for its various computational processes. Positional embedding makes the neural network understand the ordering and positional dependencies in a sentence.

An article explaining the intuition behind the “positional embedding” in transformer models from the renowned research paper - “Attention Is All You Need”.

Table of Contents

  • Introduction
  • Concept of embedding in NLP
  • Need for positional embedding in Transformers
  • Various types of initial trial and error experiments
  • Frequency-based positional embedding
  • Conclusion
  • References

Introduction

The introduction of the transformer architecture in the field of deep learning has undoubtedly paved the way for a silent revolution, especially in the branches of NLP. One of the most integral parts of the transformer architecture is “positional embedding”, which gives neural networks the ability to understand the order of words and their dependencies in a long sentence.
However, we know that RNNs and LSTMs, which were introduced much before transformers, were capable of understanding word ordering even without positional embedding. So an obvious doubt arises: why was this concept introduced in transformers, and what is the real edge behind it? Let us conceptualize all of this information in this article.

Concept of embedding in NLP

Embedding is a process used in natural language processing to convert raw text into mathematical vectors. This is because a machine learning model cannot directly consume an input in text format for its various internal computational processes.
The embedding process carried out by algorithms such as Word2Vec, GloVe, etc., is called word embedding or static embedding.
Here, a large text corpus containing a lot of words is passed into a model for the training process. The model assigns a corresponding mathematical value to each word by assuming that words which appear close to each other more frequently are similar. After this process, the derived mathematical values are used for further calculations.
For example,
Consider that our text corpus had 3 sentences as mentioned here-
  • The British government, which awarded a large annual subsidy to the king and queen at Palermo, claimed to have some control over the administration.
  • The royal party included, besides the king and queen, their daughter Marie Therese Charlotte (Madame Royale), the king's sister Madame Elisabeth, the valet Clery and others.
  • This is interrupted by the tidings of Mordred's treachery, and Lancelot, taking no part in the last fatal conflict, outlives both king and queen, and the downfall of the Round Table.
Here, we can see that the words “King” and “Queen” are appearing frequently. Hence, the model will assume that there could be some similarities among these words. When these words are transformed into mathematical values, they'll be placed at a small distance when represented in a multidimensional space.
Image source: Illustrated by the author
Imagine there is another word “Road” then logically it won't be appearing more frequently with “King” and “Queen” in a large text corpus. Hence, that word will be placed far apart in the space. 
Image source: Illustrated by the author
Mathematically, a vector is represented using a sequence of numbers where each number represents the word’s magnitude in a particular dimension.
For example, 
We represented the word “King” in 3 dimensions here. Hence, it can be hypothetically represented as [0.21,0.45,0.67] in that space.
The word “Queen” can be hypothetically represented as [0.24,0.41,0.62].
The word “Road” can be hypothetically represented as [0.97,0.72,0.36].
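To make this concrete, here is a minimal sketch in Python (using NumPy) that measures the distance and cosine similarity between the hypothetical vectors above. The vector values are the illustrative numbers from this article, not outputs of a real embedding model, and the helper functions are my own.

```python
import numpy as np

# Hypothetical 3-dimensional word embeddings from the example above
embeddings = {
    "king":  np.array([0.21, 0.45, 0.67]),
    "queen": np.array([0.24, 0.41, 0.62]),
    "road":  np.array([0.97, 0.72, 0.36]),
}

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: close to 1 means "similar direction"
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def euclidean_distance(a, b):
    # Straight-line distance in the embedding space: small means "placed close together"
    return float(np.linalg.norm(a - b))

print(cosine_similarity(embeddings["king"], embeddings["queen"]))  # high (~0.998)
print(cosine_similarity(embeddings["king"], embeddings["road"]))   # lower (~0.73)
print(euclidean_distance(embeddings["king"], embeddings["queen"])) # small (~0.07)
print(euclidean_distance(embeddings["king"], embeddings["road"]))  # large (~0.86)
```

Running this shows “King” and “Queen” landing very close together while “Road” sits far away, which is exactly the intuition behind static word embeddings.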

Need for positional embedding in Transformers

As we discussed in the introduction part, the need for positional embedding is to make the neural network understand the ordering and positional dependency in the sentence.
For example, Let us consider the following sentences-
Sentence 1 - “Although Sachin Tendulkar did not hit a century today, he took the team to a winning position”.
Sentence 2 - “Although Sachin Tendulkar hit a century today, he was not able to take the team to a winning position”.
Both of the sentences look similar since they share most of their words, but their intrinsic meanings are very different. The ordering and position of a word like “not” changes the entire context of the information conveyed here.
Hence, understanding the positional information is very critical while working on NLP projects. If the model misunderstands the context by just using the numbers in a multidimensional space, it can lead us to severe consequences, especially in predictive models.
In order to overcome this challenge, neural network architectures such as RNN (Recurrent Neural Network) and LSTM (Long Short-Term Memory) were introduced. To an extent, these architectures were very successful in understanding positional information. The main secret behind their success is that they learn long sentences by preserving the sequential order of words. In addition to that, they hold information about the words that are placed very near to the “word of interest” and the words that are placed very far from it.
For example, 
Consider the following sentence-
“Sachin is the greatest cricketer of all time.”
Image source : Illustrated by the author
The word underlined in red colour is the “word of interest”. It is the word that the neural network (RNN/LSTM) tries to learn through complex mathematical processes such as embedding. We can see here that the “word of interest” is traversed sequentially, following the order of the original text.
Also, these networks can memorize the dependency among the words by remembering the “context words”. Here, the context words are those which are placed near the “word of interest”. As a simple demonstration, we can consider the context words to be the words underlined in green colour in the following image while learning each “word of interest”.
Image source : Illustrated by the author
Through these techniques, RNN/LSTM can understand the positional information in a large text corpus.
All is going well so far, right?
Then, what's the real problem here?
The real problem is the sequential traversal of the words in a large text corpus. Imagine that we have a really large text corpus with 1 million words; it will take a really long time to traverse each word sequentially. Sometimes, it is not feasible to afford that much computation time for training the models.
To overcome this challenge, a new advanced architecture was introduced - “Transformers”.
One of the important characteristics of the transformer architecture is that it can learn a text corpus by processing all of the words in parallel. Whether you have 10 words or 1 million words, it doesn't really care about the length of the corpus.
Image source : Illustrated by the author
Image source : Illustrated by the author
Now, there is one challenge associated with this parallel processing of words. Since all of the words are accessed simultaneously, the dependency information is lost. Hence, the model won't be able to remember the “context” of a particular word, and the information regarding the relationships between the words cannot be preserved accurately. This problem leads us back to the initial challenge of preserving the contextual dependency, although the computation/training time of the model is considerably reduced.
Now, how can we tackle this situation?
The solution is “Positional embedding”.

Various types of initial trial and error experiments

Initially, when this concept was introduced, the researchers were very eager to derive an optimized method that could preserve the positional information in a transformer architecture.
The first method tried as a part of this trial and error experiment was “Positional embedding based on Index of words”.
Here, the idea was to introduce a new mathematical vector, alongside the word vector, containing the index of the particular word.
Image source : Illustrated by the author
Assume that this is the representation of words in the multidimensional space-
Image source: Illustrated by the author
After adding the positional vector, the magnitude and direction of each word vector might change, shifting the word's position like this:
Image source: Illustrated by the author
One of the big disadvantages associated with this technique is that if the sentence is very long, then the magnitude of the positional vector will also increase proportionally. Let's say a sentence has 25 words; then the first word will be added with a positional vector of magnitude 0 and the last word will be added with a positional vector of magnitude 24. This large disparity might cause a problem when we project these values in higher dimensions.
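As a rough sketch of this index-based idea (the function name and the choice to repeat the index across every dimension are my own illustrative assumptions, not something prescribed by any paper), you can see how the magnitude of the positional vector grows without bound as the position increases:

```python
import numpy as np

def index_positional_vector(position, dim):
    # Naive idea: the positional vector is just the raw word index repeated in every dimension
    return np.full(dim, float(position))

sentence = ["Sachin", "is", "the", "greatest", "cricketer", "of", "all", "time"]
dim = 3

for pos, word in enumerate(sentence):
    pe = index_positional_vector(pos, dim)
    print(word, pe, "magnitude:", np.linalg.norm(pe))
# The magnitude grows linearly with the position: in a 25-word sentence, the last word
# would get a vector of magnitude ~24 * sqrt(3), dwarfing word-embedding components
# like the hypothetical ones above (all below 1).
```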
Another technique, tried in order to reduce the large magnitude of the positional vector, is “positional embedding based on the fraction of the length of the sentence”.
Here, the fractional value of each word with respect to the length of the sentence is calculated and used as the magnitude of the positional vector.
The fractional value is calculated using the formula-
Value = pos / (N - 1)
  • Where “pos” is the position of a particular word (starting from 0) and “N” is the total number of words in the sentence.
For example,
Let's consider this sentence-
Image source: Illustrated by the author
In this technique, the maximum magnitude of the positional vector can be bounded to 1 irrespective of the length of the sentence. But, there is a big loophole in this system.
If we compare 2 sentences with different lengths, the embedding value for a word at a particular position will differ. A particular word or position should possess the same embedding value throughout the text corpus for easy understanding of its context. If the same word in various sentences possesses different embedding values, then representing the information of the entire text corpus in a multidimensional space becomes a very complex task. Even if we achieve such a complex space, there is a high chance that the model will collapse at some point due to the distortion of too much information. Hence, this technique was not taken forward for positional embedding in transformers.
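A minimal sketch of the fraction-of-length idea, assuming the value pos / (N - 1) given above: the values stay bounded between 0 and 1, but the same position receives different values in sentences of different lengths, which is exactly the loophole described.

```python
def fractional_position(pos, sentence_length):
    # Value is bounded to [0, 1] regardless of how long the sentence is
    return pos / (sentence_length - 1)

short_sentence = ["King", "and", "Queen", "are", "walking"]                                 # 5 words
long_sentence  = ["King", "and", "Queen", "are", "walking", "on", "the", "road", "today"]   # 9 words

# The word at index 2 ("Queen") gets different positional values in the two sentences
print(fractional_position(2, len(short_sentence)))  # 0.5
print(fractional_position(2, len(long_sentence)))   # 0.25
```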
Finally, the researchers came up with a system of “frequency-based positional embeddings” that received critical acclaim across the globe and was ultimately incorporated into the transformer architecture, as described in the renowned paper - “Attention Is All You Need”.

Frequency-based positional embedding

According to this technique, the researchers recommend a unique way of embedding the words based on wave frequency using the following formula-
Image source : Illustrated by the author
Where,
  • “pos” is the position or index value of the particular word in the sentence
  • “d” is the maximum length/dimension of the vector that represents a particular word in the sentence.
  • “i” represents the index of each positional embedding dimension. It also determines the frequency: i = 0 corresponds to the highest frequency, and the frequency decreases for subsequent values of i.
    Image source : Illustrated by the author
    Image source : Illustrated by the author
    Image source : Illustrated by the author
Since the height of the curve depends upon the position of the word depicted on the x-axis, the curve’s height can be used as a proxy for the word positions.
If 2 words are of a similar height then we can consider that their proximity in the sentence is very high.
Similarly, If 2 words are of drastically different heights then we can consider that their proximity in the sentence is very low.
According to our example text - “Sachin is a great cricketer”,
For the word “Sachin”,
  • pos = 0
  • d = 3
  • i takes the values 0, 1 and 2, one for each dimension of the hypothetical word vector [0.21, 0.45, 0.67]
While applying the formula,
Image source : Illustrated by the author
For i = 0,
PE(0,0) = sin(0 / 10000^(2·0/3))
PE(0,0) = sin(0)
PE(0,0) = 0
For i = 1,
PE(0,1) = cos(0 / 10000^(2·1/3))
PE(0,1) = cos(0)
PE(0,1) = 1
For i = 2,
PE(0,2) = sin(0 / 10000^(2·2/3))
PE(0,2) = sin(0)
PE(0,2) = 0
For the word “Great”,
  • pos = 3
  • d = 3
  • i takes the values 0, 1 and 2, one for each dimension of the hypothetical word vector [0.78, 0.64, 0.56]
While applying the formula,
Image source : Illustrated by the author
For i = 0,
PE(3,0) = sin(3 / 10000^(2·0/3))
PE(3,0) = sin(3 / 1)
PE(3,0) ≈ 0.14
For i = 1,
PE(3,1) = cos(3 / 10000^(2·1/3))
PE(3,1) = cos(3 / 464.16)
PE(3,1) ≈ 1.00
For i = 2,
PE(3,2) = sin(3 / 10000^(2·2/3))
PE(3,2) = sin(3 / 215,443.47)
PE(3,2) ≈ 0.00
Image source : Illustrated by the author
Here, the magnitude of each positional value is capped at 1 (since we are using sin/cos functions). Hence, there is no scope for high-magnitude positional vectors, which was a problem in the earlier techniques.
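The following sketch reproduces the hand calculations above, using the same convention as this article's worked example (sine for even dimension indices, cosine for odd ones, each with the exponent 2i/d, angles in radians). Note that the original paper pairs the dimensions so that dimensions 2i and 2i+1 share the exponent 2i/d_model; the overall behaviour is similar.

```python
import math

def positional_value(pos, i, d):
    # Frequency-based positional embedding, following this article's convention:
    # even dimension index -> sine, odd dimension index -> cosine
    angle = pos / (10000 ** (2 * i / d))
    return math.sin(angle) if i % 2 == 0 else math.cos(angle)

def positional_vector(pos, d):
    return [positional_value(pos, i, d) for i in range(d)]

d = 3
print(positional_vector(0, d))  # "Sachin": [0.0, 1.0, 0.0]
print(positional_vector(3, d))  # "Great":  approximately [0.14, 1.00, 0.00]
```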
Moreover, words with high proximity to each other might fall at similar heights at lower frequencies, and their heights will be a little bit dissimilar at higher frequencies.
If the words have low proximity to each other, then their heights will be highly dissimilar even at lower frequencies, and their height difference will increase as the frequency increases.
For example,
Consider the sentence - "King and Queen are walking on the road.”
The words “King” and “Road” are placed far apart.
Suppose these 2 words have approximately similar heights after applying the wave frequency formula. When we reach higher frequencies (such as i = 0), their heights will become more dissimilar.
Image source : Illustrated by the author
Image source : Illustrated by the author
Image source : Illustrated by the author
The words “King” and “Queen” are placed at a near distance.
These 2 words will be placed at similar heights at lower frequencies (such as i = 2 here). When we reach higher frequencies (such as i = 0), their height difference increases a little bit, which is enough for differentiation.
Image source : Illustrated by the author
But we need to note that if the words have low proximity, their heights will differ drastically as we progress to higher frequencies. If the words have high proximity, their heights will differ only a little bit as we progress to higher frequencies.
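As a short numeric sketch you can run to inspect this behaviour (the positions and the dimensionality d = 8 are arbitrary choices for demonstration), compare the per-dimension height differences for a nearby pair of positions and a distant pair:

```python
import math

def positional_vector(pos, d):
    # Same convention as the worked example: sin for even i, cos for odd i
    return [
        math.sin(pos / (10000 ** (2 * i / d))) if i % 2 == 0
        else math.cos(pos / (10000 ** (2 * i / d)))
        for i in range(d)
    ]

d = 8  # a slightly larger dimension so the spread of frequencies is visible
near = [abs(a - b) for a, b in zip(positional_vector(0, d), positional_vector(1, d))]
far  = [abs(a - b) for a, b in zip(positional_vector(0, d), positional_vector(7, d))]

# Per-dimension "height" differences; i = 0 is the highest frequency
for i in range(d):
    print(f"i={i}: nearby diff={near[i]:.4f}, distant diff={far[i]:.4f}")
```

Running it shows that the nearby pair of positions differs noticeably only in the very highest-frequency dimension, while the distant pair differs across several dimensions.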

Conclusion

Through this write-up, I hope you have gained an intuitive understanding of the mathematical computations behind positional embedding in machine learning. In short, we discussed the postulation behind the concept of “embedding”, some of its types, and the need for implementing positional embedding to achieve certain objectives.
For tech enthusiasts whose area of interest is “natural language processing”, I think this content will be helpful for comprehending some of the sophisticated calculations in a nutshell. For more detailed information, you can refer to the renowned research paper - “Attention Is All You Need” (I have added the URL for accessing this research paper in the reference section).

References

  • Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). “Attention Is All You Need”. https://arxiv.org/abs/1706.03762