<TLDR> BERT is certainly a significant step forward in NLP. Business tasks such as topic detection and sentiment analysis will become much easier to build and run, and the results much more accurate. But how did we get to BERT, how exactly does the model work, and why is it so powerful? Last but not least, what benefits can it bring to a business, and why did we decide to integrate it into the sandsiv+ Customer Experience platform?</TLDR>
This time it is difficult: I have set myself an ambitious goal, to explain Transformers to people who have a background neither in programming nor in artificial intelligence. The challenge is great, but I will try to do my best.
This morning I got it into my head to learn a new language, Portuguese. I like the sound of the language and I said to myself: let's learn it! The first thing that came to mind was to take some Portuguese words translated from my mother tongue, Italian, in order to build a first elementary vocabulary.
It was amusing because some of the words sounded a lot like Italian, and by leaning on Italian I tried to work out synonyms and antonyms, so that the new words felt a little more familiar than they really are. In doing so, I tried to understand the semantics, that is, the meaning of the relationships between words.
In practice, I used a language I know well - my mother tongue - associated the new Portuguese terms with it, and slowly learned the new language. A similar process, thanks to deep learning and the considerable increase in computing power, has taken place in the computational field.
The computer knows only one language, mathematics, so you have to rely on that if you want to "teach" the machine how to interpret a human language.
It is important to remember that any problem solved with Deep Learning is a mathematical problem. The computer, for example, "sees" thanks to a Convolutional Neural Network (CNN). The CNN receives images in the form of mathematical matrices, whether they are black and white or color, and then applies the rules of linear algebra. The same happens in tasks such as topic detection, sentiment analysis, and so on.
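To make the "images are just matrices" idea concrete, here is a minimal sketch of my own (not taken from any particular CNN library): a toy black-and-white image, represented as a matrix of numbers, filtered with a small convolution using nothing but linear algebra.

```python
import numpy as np

# A black-and-white "image" is just a matrix of pixel intensities (toy 6x6 example)
image = np.random.randint(0, 256, size=(6, 6))

# A 3x3 filter that responds to vertical edges
kernel = np.array([[-1, 0, 1],
                   [-1, 0, 1],
                   [-1, 0, 1]])

def convolve(img, k):
    """Slide the filter over the image: multiply elementwise and sum each patch."""
    h, w = k.shape
    out = np.zeros((img.shape[0] - h + 1, img.shape[1] - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + h, j:j + w] * k)
    return out

print(convolve(image, kernel))  # the "feature map" a CNN layer would compute
```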
The problem is mathematical, not linguistic. If someone offers you an NLP solution that is language-sensitive, know that it is probably four generations old, or even worse: a keyword search solution.
In the computer world, taking an unknown word - a Portuguese word, in my case - and relating it to a known language (Italian) in order to learn has been tackled by embedding words into vectors. Algorithms like fastText, Word2Vec, and GloVe do exactly this: they transform the words of any language into mathematical vectors that computers can "understand" by applying linear algebra. Once again, a mathematical problem, not a linguistic one.
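If you want to see word vectors in action, here is a minimal sketch using the gensim library and its downloadable pre-trained GloVe vectors. The model name and the example words are my own choices for illustration; the vectors are downloaded on first use.

```python
# pip install gensim
import gensim.downloader as api

# Load small pre-trained GloVe word vectors (downloaded on first use)
vectors = api.load("glove-wiki-gigaword-50")

# Every word is now a 50-dimensional vector of numbers
print(vectors["bank"][:5])

# Linear algebra on those vectors captures similarity in meaning
print(vectors.similarity("money", "bank"))
print(vectors.most_similar("king", topn=3))
```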
The next step in my effort to learn Portuguese was to translate small sentences. I listened carefully to each new Portuguese word and translated it into Italian: a sequential operation, one word at a time. In the computer world, the same operation is done with algorithms called Encoders and Decoders.
The system, "listening" sequentially to the "words" - mathematical vectors - translates them into new instructions, both language-to-language translations and computational models for "understanding". ...the tongue.
Of course, this allowed me to translate small sentences from Portuguese into Italian, but when the sentences became longer, or even became a whole document, the word-for-word system no longer worked very well. I had to increase my ability to concentrate, try to better understand the context in which each new Portuguese word appeared, and of course relate it to my knowledge of Italian.
This was one step beyond the word-by-word approach I used before. In the computing world, too, the next step was to add to the Encoder-Decoder models what is called the "attention mechanism", which allows the computer to pay more attention to words in their context, practically applying the semantic rules that are missing from a purely sequential Encoder-Decoder process.
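For the curious, the heart of the attention mechanism can be written in a few lines. This is a minimal numpy sketch of the scaled dot-product form that Transformers later adopted; the matrices here are random placeholders, just to show the mechanics: every word asks a "question" (query), compares it with every other word (keys), and takes a weighted mix of their information (values).

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # How much does each word (query) relate to every other word (key)?
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    # Softmax turns the scores into weights that sum to 1 for each word
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # Each word's output is a weighted mix of all the values
    return weights @ V, weights

# Toy example: 4 "words", each represented by an 8-dimensional vector
rng = np.random.default_rng(0)
Q = K = V = rng.normal(size=(4, 8))
output, weights = scaled_dot_product_attention(Q, K, V)
print(weights.round(2))  # each row shows where one word "pays attention"
```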
The approach, despite the increased level of attention, is still sequential, and it clearly shows its limits. In my case, with every new word that arrives in Portuguese, I try to render it in Italian while paying close attention, but I have to admit that certain ambiguities of the language are hard for me to interpret correctly. The same happens in the computer, where complex semantic rules are hard for the model to capture.
I have to say that my level of translation from Portuguese to Italian has improved considerably; I can translate, albeit with some errors, sentences much longer than with the previous methods. At this point, however, I need more: I want to be faster and more precise, I want to understand the context much better, and I want to reduce ambiguities. I need some kind of parallel processing as well as knowledge of the context, and finally, I need to understand long-term dependencies.
My computer has exactly the same needs as I do, and that's where Transformers come in.
Let's take a little example and look at these two sentences: "I deposited money at the bank." and "We sat on the bank of the river."
The exact same word, "bank", has two meanings that change with the context. You need to look at the sentence as a whole to understand the syntax and semantics. ELMo (Embeddings from Language Models) looks at the entire sentence to understand syntax, semantics, and context, and so increases the accuracy of NLP tasks.
My next step in learning Portuguese was to read lots of books, listen to Portuguese television, watch movies in Portuguese, and so on. I tried to significantly increase my vocabulary and to understand the language and its dependencies.
My computer did the same thing. It has, for example, "read and memorized" all of Wikipedia in Portuguese; technically, it did what is called Transfer Learning. In this way, my computer no longer starts from scratch when it has to perform a linguistic task in Portuguese, but has already built a fairly vast level of knowledge of that language.
The model that "learns" from a large body of words to have a strong initial understanding of the language is called Generative Pre-Trained Transformers (GPT). The model uses only the decoder part of the Transformer. It uses what it has learned from reading, for example, Wikipedia (Transfer Learning) and "reads" words from left to right (Uni-directional).
When you learn different aspects of a language, you realize that exposure to a variety of texts is very useful, which is exactly what Transfer Learning exploits. You read books to build a strong vocabulary and understanding of the language. Then, when some words in a sentence are masked or hidden, you rely on your knowledge of the language and read the entire sentence from left to right and from right to left (bidirectional).
Now you can predict the masked words more accurately (Masked Language Modeling); it is like filling in the blanks. You can also predict whether two sentences are related or not (Next Sentence Prediction). This, in simple terms, is BERT, which stands for Bidirectional Encoder Representations from Transformers: a technique for NLP pre-training developed by Google. Its job is quite self-explanatory from the acronym: bidirectional representations of encoders from transformers.
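This "fill in the blanks" behavior is easy to try. Here is a minimal sketch using the fill-mask pipeline of the transformers library with a standard pre-trained BERT model; the sentence is my own example.

```python
from transformers import pipeline

# BERT looks at the words on BOTH sides of the blank before guessing
fill_in = pipeline("fill-mask", model="bert-base-uncased")

for prediction in fill_in("I went to the [MASK] to deposit some money."):
    print(prediction["token_str"], round(prediction["score"], 3))
```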
Are you confused? Don't worry, so am I. I will only try to explain the real advantages of BERT from a practical point of view.
Compared with everything we have seen before, BERT is a big evolution. It collects all the features of the previous models, from word embeddings to Transformers, with all the advantages they bring. But it also introduces other very interesting practical innovations:
BERT is bidirectional: it doesn't just "read" from left to right, it also reads from right to left. This allows it to better "understand" words in their context, not only ambiguous words but also related ones. An example: "Mike has gone to the stage. He had a great time!" BERT understands that "he" refers to Mike, which is no small thing when solving language problems.
While training, BERT not only "reads": it hides 15% of the words and tries to "guess" them. In this way it builds knowledge that goes beyond "reading"; it learns to anticipate a word from the surrounding context, and even to predict a sentence from the previous one. That is no small thing for an automatic question-and-answer system or a chatbot, for instance.
BERT comes with several generic pre-trained models that can be loaded and then fine-tuned to a specific case (e.g. topic detection or sentiment analysis) without needing a huge mass of data for the fine-tuning. This is no small thing for anyone who has already tried to train NLP models by labeling data.
I've been working in Natural Language Processing for several years - the real thing, not the keyword search that my competitors pass off as Text Mining - and I was impressed by BERT. Let me give you a small practical example. I built a sentiment model with a final accuracy of F1 = 89%, from a dataset composed as follows:
All this is possible only because, by using Transfer Learning and the generic models available for BERT, even very small classes (in our case FRUSTRATED) can be fine-tuned. It is practically as if you could load into your brain a model that summarizes the linguistic knowledge obtained by reading all of Wikipedia in Portuguese, and then just do a little fine-tuning for the specific case you want to solve. A leap forward in NLP!
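In practice, the fine-tuning step looks roughly like the sketch below, written with the Hugging Face transformers and datasets libraries. Everything here is illustrative: the three toy sentences, the two labels, and the training settings are placeholders standing in for a real labeled dataset.

```python
# pip install transformers datasets torch
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

# Placeholder labeled data; in reality this would be your (small) annotated dataset
texts = ["I love this product", "This is really frustrating", "Works exactly as expected"]
labels = [1, 0, 1]  # 1 = positive, 0 = negative

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Turn the texts into the fixed-length token IDs the model expects
dataset = Dataset.from_dict({"text": texts, "label": labels})
dataset = dataset.map(lambda row: tokenizer(row["text"], truncation=True,
                                            padding="max_length", max_length=64))

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="sentiment-bert",
                           num_train_epochs=3,
                           per_device_train_batch_size=8),
    train_dataset=dataset,
)
trainer.train()  # adjusts the pre-trained weights to the small labeled set
```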
I almost forgot: BERT provides a number of pre-trained models for Transfer Learning, and in a large number of languages. For example, the BERT-Base Multilingual Cased model "has read texts" in 104 different languages and can be fine-tuned with your own small dataset in each of them.
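The multilingual model is loaded in exactly the same way; only the model name changes. A tiny sketch, where the Portuguese sentence is just my example:

```python
from transformers import pipeline

# One model, pre-trained on 104 languages
fill_in = pipeline("fill-mask", model="bert-base-multilingual-cased")

# "Good morning, I want to learn [MASK]." in Portuguese
for prediction in fill_in("Bom dia, eu quero aprender [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))
```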
BERT will soon be available in our sandsiv+ solution, and our customers will be able to take advantage of all the benefits that this great innovation brings to topic detection and sentiment analysis.
Previously published at https://www.linkedin.com/pulse/natural-language-processing-explaining-bert-business-people-cesconi/