Text Embedding Explained: How AI Understands Words

by Louis Bouchard, December 3rd, 2022


Large language models.

You must’ve heard these words before. They represent a specific type of machine learning-based algorithm that understands and can generate language, a field often called natural language processing, or NLP.

You’ve certainly heard of the best-known and most powerful language model: GPT-3.

GPT-3, as I’ve described in the video covering it, is able to take language, understand it, and generate language in return. But be careful here; it doesn’t really understand it. In fact, it’s far from understanding. GPT-3 and other language-based models merely use what we call dictionaries of words to represent words as numbers, remember their positions in the sentence, and that’s it.
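
To make the "dictionary of words" idea concrete, here is a minimal sketch in Python. It is a toy, not how GPT-3 actually tokenizes text (real models typically use subword tokenizers). The sentences, the `build_vocab` and `encode` helpers, and the resulting IDs are all made up for illustration.

```python
# Toy "dictionary of words": each known word maps to a fixed integer ID.
# Assumption: whitespace splitting stands in for a real tokenizer.
def build_vocab(sentences):
    vocab = {}
    for sentence in sentences:
        for word in sentence.lower().split():
            if word not in vocab:
                vocab[word] = len(vocab)  # next free ID
    return vocab


def encode(sentence, vocab):
    """Return (token_id, position) pairs for every word in the sentence."""
    words = sentence.lower().split()
    return [(vocab[word], pos) for pos, word in enumerate(words)]


# Hypothetical example sentences using two meanings of "bank".
sentences = [
    "I sat on the river bank",
    "I opened a bank account",
]
vocab = build_vocab(sentences)
encoded = [encode(s, vocab) for s in sentences]

for sentence, pairs in zip(sentences, encoded):
    print(sentence, "->", pairs)
```

Note that "bank" gets the same ID in both sentences even though the meaning differs. That limitation of a plain one-for-one dictionary is exactly what learned embeddings are meant to address.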

Let's dive into those powerful machine learning models and try to understand what they see instead of words, called word embeddings, and how to produce them with an example provided by Cohere.

Learn more in the video...

References

►Read the full article: https://www.louisbouchard.ai/text-embedding/
►BERT Word Embeddings Tutorial: https://mccormickml.com/2019/05/14/BERT-word-embeddings-tutorial/#why-bert-embeddings
►Cohere's Notebook from the code example: https://colab.research.google.com/github/cohere-ai/notebooks/blob/main/notebooks/Basic_Semantic_Search.ipynb
►Cohere Repos focused on embeddings: https://github.com/cohere-ai/notebooks
►My Newsletter (A new AI application explained weekly to your emails!): https://www.louisbouchard.ai/newsletter/

Video Transcript

0:07

language models you must have heard

0:10

these words before they represent a

0:13

specific type of machine learning

0:14

algorithms that understand and can

0:16

generate language a field often called

0:19

natural language processing or NLP

0:22

you've certainly heard of the most known

0:24

and powerful language models like GPT-3

0:26

GPT-3 as I've described in the video

0:28

covering it is able to take language

0:30

understand it and generate language in

0:33

return but be careful here it doesn't

0:35

really understand it in fact it's far

0:38

from understanding GPT-3 and other

0:41

language-based models merely use what we

0:44

call dictionaries of words to represent

0:46

them as numbers remember their positions

0:49

in the sentence and that's it using a

0:52

few numbers and positional numbers

0:53

called embeddings they are able to

0:55

regroup similar sentences which also

0:58

means that they are able to kind of

1:00

understand sentences by comparing them

1:02

to known sentences like our data set
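
One common way to compare embeddings is cosine similarity. The sketch below assumes that measure (the video only says "distance") and uses hand-written three-dimensional vectors as stand-ins for real embeddings, which usually have hundreds of dimensions. The sentences and numbers are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings of "known" sentences (our tiny data set).
known = {
    "how do I train a neural network": [0.9, 0.1, 0.0],
    "best pasta recipe": [0.0, 0.2, 0.9],
}

# Pretend this is the embedding of "tips for training a model".
query = [0.8, 0.2, 0.1]

scores = {text: cosine_similarity(query, vec) for text, vec in known.items()}
best = max(scores, key=scores.get)
print(best, round(scores[best], 3))
```

The query lands much closer to the machine learning sentence than to the recipe, which is all the "understanding" amounts to here: nearby vectors, nothing more.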

1:05

it's the same process for image synthesis

1:07

models that take your sentence to

1:10

generate an image they do not really

1:11

understand it but they can compare it to

1:13

similar images producing some sort of

1:16

understanding of the concepts in your

1:18

sentence in this video we will have a

1:20

look at what those powerful machine

1:22

learning models see instead of words

1:24

called word embeddings and how to

1:27

produce them with an example provided by

1:29

the sponsor of this video a great

1:31

company in the NLP field Cohere which I

1:35

will talk about at the end of the video

1:36

as they have a fantastic platform for

1:39

NLP we've talked about embeddings and

1:42

GPT-3 but what's the link between the two

1:44

embeddings are what is seen by the models

1:47

and how they process the words we know

1:50

and why use embeddings well because as

1:53

of now machines cannot process words and

1:56

we need numbers in order to train those

1:59

large models thanks to our carefully

2:01

built data set we can use mathematics to

2:04

measure the distance between embeddings

2:06

and correct our Network based on this

2:08

distance iteratively getting our

2:10

predictions closer to the real meaning

2:12

and improving the results embeddings

2:15

are also what the models like CLIP

2:17

Stable Diffusion or DALL·E use to

2:19

understand sentences and generate images

2:21

this is done by comparing both images

2:24

and text in the same embedding space

2:26

meaning that the model does not

2:28

understand either text or images but it

2:31

can understand if an image is similar to

2:33

a specific text or not so if we find

2:36

enough image caption pairs we can train

2:38

a huge and powerful model like DALL·E to

2:41

take a sentence embed it find its

2:43

nearest image clone and generate it in

2:46

return so machine learning with text is

2:48

all about comparing embeddings but how

2:51

do we get those embeddings we get them

2:53

using another model trained to find the

2:56

best way to generate similar embeddings

2:58

for similar sentences while keeping the

3:01

differences in meaning for similar words

3:03

compared to using a straight one for one

3:06

dictionary the sentences are usually

3:08

represented with special tokens marking

3:10

the beginning and end of our text then

3:13

as I said we have our positional

3:15

embeddings which indicate the position

3:17

of each word relative to each other

3:19

often using sinusoidal functions I

3:22

linked a great article about this in the

3:25

description if you'd like to learn more
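
As a rough illustration of sinusoidal positional embeddings, the sketch below follows the sine/cosine formulation from the original Transformer paper. That choice is an assumption: the video only says "sinusoidal functions", and the sizes used here (4 positions, 6 dimensions) are arbitrary.

```python
import math


def positional_encoding(num_positions, dim):
    """Build a table where row `pos` encodes that position in the sentence.

    Even indices use sine, odd indices use cosine, and each pair of
    indices shares one wavelength (Transformer-style; an assumption here).
    """
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(dim):
            angle = pos / (10000 ** ((2 * (i // 2)) / dim))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


pe = positional_encoding(num_positions=4, dim=6)
for pos, row in enumerate(pe):
    print(pos, [round(v, 3) for v in row])
```

Each position gets its own pattern of values, and these vectors are typically added to the word embeddings so the model can tell where each token sits in the sentence.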

3:26

finally we have our word embeddings we

3:29

start with all our words being split

3:31

into an array just like a table of words

3:34

starting now there are no longer words

3:36

they are just tokens or numbers from the

3:40

whole English dictionary you can see

3:42

here that all the words now are

3:44

represented by a number indicating where

3:46

they are in the dictionary thus having

3:49

the same number for the word bank even

3:51

though their meanings are different in

3:53

the sentence we have now we need to add

3:56

a little bit of intelligence to that but

3:58

not too much this is done thanks to a

4:00

model trained to take this new list of

4:03

numbers and further encode it into

4:05

another list of numbers that better

4:08

represent the sentence for example it

4:10

will no longer have the same embedding

4:13

for the two words bank here this is

4:15

possible because the model used to do

4:17

that has been trained on a lot of

4:19

annotated text data and learned to

4:21

encode similar meaning sentences next to

4:24

each other and opposite sentences far

4:27

from each other thus allowing our

4:29

embeddings to be less biased by our

4:31

choice of words than the initial simple

4:34

one for one word embedding we initially

4:37

had here's what using embeddings looks

4:39

like in a very short NLP example there

4:42

are more links below to learn more about

4:44

embeddings and how to code it yourself
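
The retrieval example walked through in the next part of the transcript (embed a data set, keep the embeddings in memory, embed the query with the same network, compare distances) can be sketched as follows. This is not Cohere's notebook code. `toy_embed` is a hypothetical stand-in that just counts words over a fixed vocabulary, where a real system would call a trained embedding network, and the three posts are invented.

```python
import math


def toy_embed(text, vocab):
    """Hypothetical stand-in for an embedding network.

    Returns a unit-length vector of word counts over a fixed vocabulary.
    """
    words = text.lower().split()
    vec = [float(words.count(w)) for w in vocab]
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0  # avoid dividing by zero
    return [x / norm for x in vec]


def search(query, memory, vocab, k=2):
    """Return the k stored posts whose embeddings are closest to the query."""
    q = toy_embed(query, vocab)  # same embedding function as the memory
    scored = [(sum(a * b for a, b in zip(q, vec)), text) for text, vec in memory]
    scored.sort(reverse=True)  # highest similarity first
    return scored[:k]


# Invented posts standing in for the pre-embedded Hacker News data set.
posts = [
    "ask hn what is the best life advice you received",
    "show hn a fast json parser in rust",
    "ask hn how do you learn a new programming language",
]
vocab = sorted({w for p in posts for w in p.split()})

# The "memory": every post stored together with its embedding.
memory = [(p, toy_embed(p, vocab)) for p in posts]

results = search("what is your most profound life insight", memory, vocab, k=1)
print(results)
```

The memory and the query go through the same `toy_embed` function. Swapping in a different embedding function for the query alone would put it in a different space, and the similarity scores would stop meaning anything.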

4:46

here we will take some Hacker News posts

4:49

and build a model able to retrieve the

4:51

most similar post of a new input

4:53

sentence to start we need a data set in

4:56

this case it is a pre-embedded set of

4:58

3000 Hacker News posts that have already

5:01

been embedded into numbers then we build

5:04

a memory saving all those embeddings for

5:07

future comparison we basically just

5:09

saved these embeddings in an efficient

5:11

way when a new query is done for example

5:13

here asking what is your most profound

5:16

life insight you can generate its

5:18

embedding using the same embedding

5:20

network usually it is BERT or a version

5:23

of it and we compare the distance

5:25

between the embedding space to all other

5:27

Hacker News posts in our memory note

5:30

that it's really important here to

5:32

always use the same network whether for

5:34

generating your data set or for querying

5:36

it as I said there is no real

5:38

intelligence here nor that it actually

5:40

understands the words it just has been

5:42

trained to embed similar sentences

5:45

nearby in the embedding space nothing

5:47

more if you send your sentence to a

5:50

different network to generate an

5:51

embedding and compare the embedding to

5:53

the ones you had from another Network

5:55

nothing will work it will just be like

5:58

the nice people that try to talk to me

5:59

in Hebrew at ECCV last week it just

6:02

wasn't in an embedding space my brain

6:04

could understand fortunately for us our

6:06

brain can learn to transfer from one

6:08

embedding space to another as I can with

6:11

French and English but it requires a lot

6:13

of work and practice and it's the same

6:16

for machines anyways coming back to our

6:18

problem we could find the most similar

6:21

posts that's pretty cool but how could

6:23

we achieve this as I mentioned it's

6:25

because of the network BERT in this

6:28

case it learns to create similar

6:30

embeddings from similar sentences we can

6:32

even visualize it in two Dimensions like

6:35

this where you can see how two similar

6:37

points represent similar subjects you

6:39

can do many other things once you have

6:41

those embeddings like extracting

6:43

keywords performing a semantic search

6:45

doing sentiment analysis or even

6:47

generating images as we said and

6:49

demonstrated in previous videos I have a

6:52

lot of videos covering those and listed

6:55

a few interesting notebooks to learn to

6:57

play with encodings thanks to the cohere

6:59

team now let me talk a little bit about

7:02

Cohere as they are highly relevant to

7:05

this video Cohere provides

7:07

everything you need if you are working

7:09

in the NLP field including a super

7:11

simple way to use embedding models in

7:14

your application literally with just an

7:16

API call you can embed the text without

7:18

knowing anything about how the embedding

7:21

models work the API does it for you in

7:23

the background here you can see the

7:25

semantic search notebook that uses

7:27

Cohere API to create embeddings of an

7:30

archive of questions and question

7:32

queries to later perform search of

7:34

similar questions using Cohere you

7:37

can easily do anything related to text

7:39

generate categorize and organize at

7:42

pretty much any scale you can integrate

7:44

large language models trained on

7:46

billions of words with a few lines of

7:48

code and it works in any library you

7:51

don't even need machine learning skills

7:53

to get started they even have learning

7:55

resources like the recent Cohere For

7:57

AI's Scholars Program that I really like

8:00

this program is an incredible

8:01

opportunity for emerging talent in NLP

8:04

research around the world if selected

8:06

you will work alongside their team

8:08

and have access to a large-scale

8:10

experimental framework and Cohere

8:12

experts which is pretty cool I also

8:15

invite you to join their great Discord

8:17

Community ingeniously called Co Unity I

8:21

hope you've enjoyed this video and will

8:23

try out Cohere for yourself with the

8:25

first link below I am sure you will

8:27

benefit from it thank you very much for

8:29

watching the whole video and thanks to

8:31

anyone supporting my work by leaving a

8:33

like comment or trying out our sponsors

8:36

that I carefully select for these videos