Yann LeCun on "The Danger of Concentration of Power Through Proprietary AI Systems"
by @videoman · 133 reads


1 min read · 2024/04/04

Too Long; Didn't Read

Yann LeCun is the chief AI scientist at Meta, a professor at NYU, a Turing Award winner, and one of the most influential AI researchers. Lex is not Lex Luthor.



- I see the danger of this concentration of power

[00:00:00] : [00:00:02]

through proprietary AI systems

[00:00:02] : [00:00:06]

as a much bigger danger than everything else.

[00:00:06] : [00:00:08]

What works against this

[00:00:08] : [00:00:11]

is people who think that for reasons of security,

[00:00:11] : [00:00:15]

we should keep AI systems under lock and key

[00:00:15] : [00:00:18]

because it's too dangerous

[00:00:18] : [00:00:19]

to put it in the hands of everybody.

[00:00:19] : [00:00:22]

That would lead to a very bad future

[00:00:22] : [00:00:25]

in which all of our information diet

[00:00:25] : [00:00:27]

is controlled by a small number of companies

[00:00:27] : [00:00:30]

through proprietary systems.

[00:00:30] : [00:00:32]

- I believe that people are fundamentally good

[00:00:32] : [00:00:34]

and so if AI, especially open source AI

[00:00:34] : [00:00:38]

can make them smarter,

[00:00:38] : [00:00:41]

it just empowers the goodness in humans.

[00:00:41] : [00:00:44]

- So I share that feeling.

[00:00:44] : [00:00:45]

Okay?

[00:00:45] : [00:00:46]

I think people are fundamentally good. (laughing)

[00:00:46] : [00:00:50]

And in fact a lot of doomers are doomers

[00:00:50] : [00:00:52]

because they don't think that people are fundamentally good.

[00:00:52] : [00:00:55]

- The following is a conversation with Yann LeCun,

[00:00:55] : [00:01:01]

his third time on this podcast.

[00:01:01] : [00:01:02]

He is the chief AI scientist at Meta,

[00:01:02] : [00:01:05]

professor at NYU,

[00:01:05] : [00:01:07]

Turing Award winner

[00:01:07] : [00:01:08]

and one of the seminal figures

[00:01:08] : [00:01:10]

in the history of artificial intelligence.

[00:01:10] : [00:01:13]

He and Meta AI

[00:01:13] : [00:01:15]

have been big proponents of open sourcing AI development,

[00:01:15] : [00:01:19]

and have been walking the walk

[00:01:19] : [00:01:21]

by open sourcing many of their biggest models,

[00:01:21] : [00:01:24]

including LLaMA 2 and eventually LLaMA 3.

[00:01:24] : [00:01:28]

Also, Yann has been an outspoken critic

[00:01:28] : [00:01:31]

of those people in the AI community

[00:01:31] : [00:01:34]

who warn about the looming danger

[00:01:34] : [00:01:36]

and existential threat of AGI.

[00:01:36] : [00:01:39]

He believes that AGI will be created one day,

[00:01:39] : [00:01:43]

but it will be good.

[00:01:43] : [00:01:45]

It will not escape human control

[00:01:45] : [00:01:47]

nor will it dominate and kill all humans.

[00:01:47] : [00:01:52]

At this moment of rapid AI development,

[00:01:52] : [00:01:54]

this happens to be somewhat of a controversial position.

[00:01:54] : [00:01:58]

And so it's been fun

[00:01:58] : [00:02:00]

seeing Yann get into a lot of intense

[00:02:00] : [00:02:02]

and fascinating discussions online

[00:02:02] : [00:02:04]

as we do in this very conversation.

[00:02:04] : [00:02:08]

This is the Lex Fridman podcast.

[00:02:08] : [00:02:10]

To support it,

[00:02:10] : [00:02:11]

please check out our sponsors in the description.

[00:02:11] : [00:02:13]

And now, dear friends, here's Yann LeCun.

[00:02:13] : [00:02:17]

You've had some strong statements,

[00:02:17] : [00:02:21]

technical statements

[00:02:21] : [00:02:22]

about the future of artificial intelligence recently,

[00:02:22] : [00:02:25]

throughout your career actually but recently as well.

[00:02:25] : [00:02:28]

You've said that autoregressive LLMs

[00:02:28] : [00:02:31]

are not the way we're going to make progress

[00:02:31] : [00:02:36]

towards superhuman intelligence.

[00:02:36] : [00:02:38]

These are the large language models

[00:02:38] : [00:02:41]

like GPT-4, like LLaMA 2 and 3 soon and so on.

[00:02:41] : [00:02:44]

How do they work

[00:02:44] : [00:02:45]

and why are they not going to take us all the way?

[00:02:45] : [00:02:47]

- For a number of reasons.

[00:02:47] : [00:02:49]

The first is that there is a number of characteristics

[00:02:49] : [00:02:51]

of intelligent behavior.

[00:02:51] : [00:02:53]

For example, the capacity to understand the world,

[00:02:53] : [00:02:58]

understand the physical world,

[00:02:58] : [00:03:00]

the ability to remember and retrieve things,

[00:03:00] : [00:03:05]

persistent memory,

[00:03:05] : [00:03:08]

the ability to reason and the ability to plan.

[00:03:08] : [00:03:12]

Those are four essential characteristics

[00:03:12] : [00:03:14]

of intelligent systems or entities,

[00:03:14] : [00:03:18]

humans, animals.

[00:03:18] : [00:03:19]

LLMs can do none of those,

[00:03:19] : [00:03:23]

or they can only do them in a very primitive way.

[00:03:23] : [00:03:26]

And they don't really understand the physical world,

[00:03:26] : [00:03:29]

they don't really have persistent memory,

[00:03:29] : [00:03:31]

they can't really reason

[00:03:31] : [00:03:32]

and they certainly can't plan.

[00:03:32] : [00:03:34]

And so if you expect the system to become intelligent

[00:03:34] : [00:03:38]

just without having the possibility of doing those things,

[00:03:38] : [00:03:43]

you're making a mistake.

[00:03:43] : [00:03:44]

That is not to say that autoregressive LLMs are not useful,

[00:03:44] : [00:03:50]

they're certainly useful.

[00:03:50] : [00:03:52]

That they're not interesting,

[00:03:52] : [00:03:55]

that we can't build

[00:03:55] : [00:03:56]

a whole ecosystem of applications around them,

[00:03:56] : [00:04:00]

of course we can.

[00:04:00] : [00:04:00]

But as a path towards human level intelligence,

[00:04:00] : [00:04:05]

they're missing essential components.

[00:04:05] : [00:04:08]

And then there is another tidbit or fact

[00:04:08] : [00:04:11]

that I think is very interesting;

[00:04:11] : [00:04:14]

those LLMs are trained on enormous amounts of text.

[00:04:14] : [00:04:16]

Basically the entirety

[00:04:16] : [00:04:18]

of all publicly available text on the internet, right?

[00:04:18] : [00:04:21]

That's typically on the order of 10 to the 13 tokens.

[00:04:21] : [00:04:26]

Each token is typically two bytes.

[00:04:26] : [00:04:28]

So that's two times 10 to the 13 bytes of training data.

[00:04:28] : [00:04:31]

It would take you or me 170,000 years

[00:04:31] : [00:04:34]

to just read through this at eight hours a day. (laughs)

[00:04:34] : [00:04:37]
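As a sanity check, the figures quoted here can be reproduced with a few lines of arithmetic (the ~250 words-per-minute reading rate and ~1.3 tokens per word are assumptions, not numbers from the conversation):

```python
# Back-of-the-envelope check of the training-data figures quoted above.
tokens = 1e13                    # ~10^13 tokens of public text
bytes_per_token = 2
total_bytes = tokens * bytes_per_token          # 2 x 10^13 bytes

# Assumed human reading rate: ~250 words/min, ~1.3 tokens per word.
tokens_per_second = 250 * 1.3 / 60
seconds_to_read = tokens / tokens_per_second
years = seconds_to_read / (8 * 3600 * 365)      # reading 8 hours/day
print(f"{total_bytes:.0e} bytes, ~{years:,.0f} years of reading")
```

With those assumptions the result lands in the same ballpark as the 170,000-year figure in the quote.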

So it seems like an enormous amount of knowledge, right?

[00:04:37] : [00:04:41]

That those systems can accumulate.

[00:04:41] : [00:04:43]

But then you realize it's really not that much data.

[00:04:43] : [00:04:48]

If you talk to developmental psychologists,

[00:04:48] : [00:04:51]

and they tell you a 4-year-old

[00:04:51] : [00:04:53]

has been awake for 16,000 hours in his or her life,

[00:04:53] : [00:04:57]

and the amount of information

[00:04:57] : [00:05:01]

that has reached the visual cortex of that child

[00:05:01] : [00:05:06]

in four years

[00:05:06] : [00:05:07]

is about 10 to the 15 bytes.

[00:05:07] : [00:05:12]

And you can compute this

[00:05:12] : [00:05:12]

by estimating that the optical nerve

[00:05:12] : [00:05:16]

carries about 20 megabytes per second, roughly.

[00:05:16] : [00:05:19]

And so 10 to the 15 bytes for a 4-year-old

[00:05:19] : [00:05:22]

versus two times 10 to the 13 bytes

[00:05:22] : [00:05:25]

for 170,000 years worth of reading.

[00:05:25] : [00:05:28]
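The child's-visual-input side of the comparison works out the same way (20 MB/s is the optic-nerve figure quoted in the conversation; the per-eye vs. combined-bandwidth distinction is glossed over here):

```python
# Bytes reaching a 4-year-old's visual cortex vs. an LLM's text corpus.
awake_hours = 16_000                  # quoted waking hours by age four
optic_nerve_bytes_per_s = 20e6        # ~20 MB/s, as quoted
visual_bytes = awake_hours * 3600 * optic_nerve_bytes_per_s   # ~1.15e15
text_bytes = 2e13                     # 2 x 10^13 bytes of training text
print(f"visual: {visual_bytes:.2e} bytes, ratio: {visual_bytes / text_bytes:.0f}x")
```

That is roughly 10^15 bytes, about fifty times the text corpus, which is the gap LeCun's argument rests on.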

What that tells you is that through sensory input,

[00:05:28] : [00:05:33]

we see a lot more information

[00:05:33] : [00:05:35]

than we do through language.

[00:05:35] : [00:05:37]

And that despite our intuition,

[00:05:37] : [00:05:40]

most of what we learn and most of our knowledge

[00:05:40] : [00:05:43]

is through our observation and interaction

[00:05:43] : [00:05:46]

with the real world,

[00:05:46] : [00:05:47]

not through language.

[00:05:47] : [00:05:49]

Everything that we learn in the first few years of life,

[00:05:49] : [00:05:51]

and certainly everything that animals learn

[00:05:51] : [00:05:54]

has nothing to do with language.

[00:05:54] : [00:05:57]

- So it would be good

[00:05:57] : [00:05:57]

to maybe push against some of the intuition

[00:05:57] : [00:06:00]

behind what you're saying.

[00:06:00] : [00:06:01]

So it is true there's several orders of magnitude

[00:06:01] : [00:06:05]

more data coming into the human mind, much faster,

[00:06:05] : [00:06:10]

and the human mind is able to learn very quickly from that,

[00:06:10] : [00:06:13]

filter the data very quickly.

[00:06:13] : [00:06:14]

Somebody might argue

[00:06:14] : [00:06:16]

your comparison between sensory data versus language.

[00:06:16] : [00:06:19]

That language is already very compressed.

[00:06:19] : [00:06:23]

It already contains a lot more information

[00:06:23] : [00:06:25]

than the bytes it takes to store them,

[00:06:25] : [00:06:27]

if you compare it to visual data.

[00:06:27] : [00:06:29]

So there's a lot of wisdom in language.

[00:06:29] : [00:06:31]

There's words and the way we stitch them together,

[00:06:31] : [00:06:33]

it already contains a lot of information.

[00:06:33] : [00:06:36]

So is it possible that language alone

[00:06:36] : [00:06:40]

already has enough wisdom and knowledge in there

[00:06:40] : [00:06:45]

to be able to, from that language, construct a world model

[00:06:45] : [00:06:50]

and understanding of the world,

[00:06:50] : [00:06:52]

an understanding of the physical world

[00:06:52] : [00:06:54]

that you're saying LLMs lack?

[00:06:54] : [00:06:56]

- So it's a big debate among philosophers

[00:06:56] : [00:07:00]

and also cognitive scientists,

[00:07:00] : [00:07:01]

like whether intelligence needs to be grounded in reality.

[00:07:01] : [00:07:05]

I'm clearly in the camp

[00:07:05] : [00:07:07]

that yes, intelligence cannot appear

[00:07:07] : [00:07:10]

without some grounding in some reality.

[00:07:10] : [00:07:14]

It doesn't need to be physical reality,

[00:07:14] : [00:07:17]

it could be simulated

[00:07:17] : [00:07:18]

but the environment is just much richer

[00:07:18] : [00:07:20]

than what you can express in language.

[00:07:20] : [00:07:22]

Language is a very approximate representation of percepts

[00:07:22] : [00:07:27]

and/or mental models, right?

[00:07:27] : [00:07:29]

I mean, there's a lot of tasks that we accomplish

[00:07:29] : [00:07:32]

where we manipulate a mental model of the situation at hand,

[00:07:32] : [00:07:37]

and that has nothing to do with language.

[00:07:37] : [00:07:40]

Everything that's physical,mechanical, whatever,

[00:07:40] : [00:07:43]

when we build something,

[00:07:43] : [00:07:44]

when we accomplish a task,

[00:07:44] : [00:07:46]

a moderate task of grabbing something, et cetera,

[00:07:46] : [00:07:50]

we plan our action sequences,

[00:07:50] : [00:07:52]

and we do this

[00:07:52] : [00:07:53]

by essentially imagining the result

[00:07:53] : [00:07:55]

of the outcome of a sequence of actions that we might imagine.

[00:07:55] : [00:08:00]

And that requires mental models

[00:08:00] : [00:08:03]

that don't have much to do with language.

[00:08:03] : [00:08:06]

And that's, I would argue,

[00:08:06] : [00:08:07]

most of our knowledge

[00:08:07] : [00:08:09]

is derived from that interaction with the physical world.

[00:08:09] : [00:08:13]

So a lot of my colleagues

[00:08:13] : [00:08:15]

who are more interested in things like computer vision

[00:08:15] : [00:08:19]

are really in that camp

[00:08:19] : [00:08:20]

that AI needs to be embodied, essentially.

[00:08:20] : [00:08:25]

And then other people coming from the NLP side

[00:08:25] : [00:08:28]

or maybe some other motivation

[00:08:28] : [00:08:32]

don't necessarily agree with that.

[00:08:32] : [00:08:34]

And philosophers are split as well.

[00:08:34] : [00:08:37]

And the complexity of the world is hard to imagine.

[00:08:37] : [00:08:42]

It's hard to represent all the complexities

[00:08:42] : [00:08:49]

that we take completely for granted in the real world

[00:08:49] : [00:08:53]

that we don't even imagine require intelligence, right?

[00:08:53] : [00:08:55]

This is the old Moravec's paradox

[00:08:55] : [00:08:57]

from the pioneer of robotics, Hans Moravec,

[00:08:57] : [00:09:01]

who said, how is it that with computers,

[00:09:01] : [00:09:03]

it seems to be easy to do high level complex tasks

[00:09:03] : [00:09:05]

like playing chess and solving integrals

[00:09:05] : [00:09:08]

and doing things like that,

[00:09:08] : [00:09:09]

whereas the thing we take for granted that we do every day,

[00:09:09] : [00:09:13]

like, I don't know,learning to drive a car

[00:09:13] : [00:09:16]

or grabbing an object,

[00:09:16] : [00:09:18]

we can't do with computers. (laughs)

[00:09:18] : [00:09:21]

And we have LLMs that can pass the bar exam,

[00:09:21] : [00:09:26]

so they must be smart.

[00:09:26] : [00:09:29]

But then they can't learn to drive in 20 hours

[00:09:29] : [00:09:33]

like any 17-year-old.

[00:09:33] : [00:09:35]

They can't learn to clear out the dinner table

[00:09:35] : [00:09:37]

and fill up the dishwasher

[00:09:37] : [00:09:40]

like any 10-year-old can learn in one shot.

[00:09:40] : [00:09:42]

Why is that?

[00:09:42] : [00:09:44]

Like what are we missing?

[00:09:44] : [00:09:45]

What type of learning

[00:09:45] : [00:09:47]

or reasoning architectureor whatever are we missing

[00:09:47] : [00:09:52]

that basically prevents us

[00:09:52] : [00:09:55]

from having level five self-driving cars

[00:09:55] : [00:09:58]

and domestic robots?

[00:09:58] : [00:10:00]

- Can a large language model construct a world model

[00:10:00] : [00:10:05]

that does know how to drive

[00:10:05] : [00:10:07]

and does know how to fill a dishwasher,

[00:10:07] : [00:10:09]

but just doesn't know

[00:10:09] : [00:10:10]

how to deal with visual data at this time?

[00:10:10] : [00:10:12]

So it can operate in a space of concepts.

[00:10:12] : [00:10:17]

- So yeah, that's what a lot of people are working on.

[00:10:17] : [00:10:19]

So the answer,

[00:10:19] : [00:10:20]

the short answer is no.

[00:10:20] : [00:10:22]

And the more complex answer is

[00:10:22] : [00:10:24]

you can use all kind of tricks

[00:10:24] : [00:10:26]

to get an LLM to basically digest visual representations

[00:10:26] : [00:10:31]

of images or video or audio for that matter.

[00:10:31] : [00:10:40]

And a classical way of doing this

[00:10:40] : [00:10:45]

is you train a vision system in some way,

[00:10:45] : [00:10:48]

and we have a number of waysto train vision systems,

[00:10:48] : [00:10:51]

either supervised,unsupervised, self-supervised,

[00:10:51] : [00:10:53]

all kinds of different ways.

[00:10:53] : [00:10:55]

That will turn any image into a high level representation.

[00:10:55] : [00:11:01]

Basically, a list of tokens

[00:11:01] : [00:11:03]

that are really similar to the kind of tokens

[00:11:03] : [00:11:05]

that a typical LLM takes as an input.

[00:11:05] : [00:11:10]

And then you just feed that to the LLM

[00:11:10] : [00:11:15]

in addition to the text,

[00:11:15] : [00:11:17]

and you just expect the LLM during training

[00:11:17] : [00:11:21]

to kind of be able to use those representations

[00:11:21] : [00:11:25]

to help make decisions.

[00:11:25] : [00:11:27]

I mean, there's been work along those lines

[00:11:27] : [00:11:29]

for quite a long time.

[00:11:29] : [00:11:30]

And now you see those systems, right?

[00:11:30] : [00:11:32]

I mean, there are LLMs that have some vision extension.

[00:11:32] : [00:11:36]

But they're basically hacks

[00:11:36] : [00:11:37]

in the sense that those things

[00:11:37] : [00:11:39]

are not like trained to handle,

[00:11:39] : [00:11:41]

to really understand the world.

[00:11:41] : [00:11:43]

They're not trainedwith video, for example.

[00:11:43] : [00:11:46]

They don't really understand intuitive physics,

[00:11:46] : [00:11:48]

at least not at the moment.

[00:11:48] : [00:11:50]

- So you don't think

[00:11:50] : [00:11:52]

there's something special to you about intuitive physics,

[00:11:52] : [00:11:54]

about sort of common sense reasoning

[00:11:54] : [00:11:55]

about the physical space, about physical reality?

[00:11:55] : [00:11:58]

That to you is a giant leap

[00:11:58] : [00:12:00]

that LLMs are just not able to do?

[00:12:00] : [00:12:02]

- We're not gonna be able to do this

[00:12:02] : [00:12:03]

with the type of LLMs that we are working with today.

[00:12:03] : [00:12:07]

And there's a number of reasons for this,

[00:12:07] : [00:12:09]

but the main reason is

[00:12:09] : [00:12:10]

the way LLMs are trained is that you take a piece of text,

[00:12:10] : [00:12:16]

you remove some of the words in that text, you mask them,

[00:12:16] : [00:12:20]

you replace them by blank markers,

[00:12:20] : [00:12:22]

and you train a gigantic neural net

[00:12:22] : [00:12:24]

to predict the words that are missing.

[00:12:24] : [00:12:26]

And if you build this neural net in a particular way

[00:12:26] : [00:12:30]

so that it can only look at words

[00:12:30] : [00:12:32]

that are to the left of the one it's trying to predict,

[00:12:32] : [00:12:36]

then what you have is a system

[00:12:36] : [00:12:37]

that basically is trying to predict

[00:12:37] : [00:12:38]

the next word in a text, right?

[00:12:38] : [00:12:40]

So then you can feed it a text, a prompt,

[00:12:40] : [00:12:43]

and you can ask it to predict the next word.

[00:12:43] : [00:12:45]

It can never predict the next word exactly.

[00:12:45] : [00:12:47]

And so what it's gonna do

[00:12:47] : [00:12:49]

is produce a probability distribution

[00:12:49] : [00:12:52]

of all the possible words in the dictionary.

[00:12:52] : [00:12:54]

In fact, it doesn't predict words,

[00:12:54] : [00:12:56]

it predicts tokens that are kind of subword units.

[00:12:56] : [00:12:58]

And so it's easy to handle the uncertainty

[00:12:58] : [00:13:01]

in the prediction there

[00:13:01] : [00:13:02]

because there's only a finite number

[00:13:02] : [00:13:04]

of possible words in the dictionary,

[00:13:04] : [00:13:07]

and you can just compute a distribution over them.

[00:13:07] : [00:13:10]

Then what the system does

[00:13:10] : [00:13:12]

is that it picks a word from that distribution.

[00:13:12] : [00:13:16]

Of course, there's a higher chance of picking words

[00:13:16] : [00:13:18]

that have a higher probability within that distribution.

[00:13:18] : [00:13:21]

So you sample from that distribution

[00:13:21] : [00:13:22]

to actually produce a word,

[00:13:22] : [00:13:24]

and then you shift that word into the input.

[00:13:24] : [00:13:27]

And so that allows the system now

[00:13:27] : [00:13:29]

to predict the second word, right?

[00:13:29] : [00:13:32]

And once you do this,

[00:13:32] : [00:13:33]

you shift it into the input, et cetera.

[00:13:33] : [00:13:35]

That's called autoregressive prediction,

[00:13:35] : [00:13:38]

which is why those LLMs

[00:13:38] : [00:13:39]

should be called autoregressive LLMs,

[00:13:39] : [00:13:42]

but we just call them LLMs.

[00:13:42] : [00:13:46]
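The loop just described — predict a distribution over next tokens, sample one, shift it back into the input — can be sketched with a toy stand-in for the trained network (the tiny hand-written transition table is purely illustrative; a real LLM replaces it with a neural net over tens of thousands of subword tokens):

```python
import random

# Toy next-token distributions, standing in for a trained neural net's output.
model = {
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 1.0},
    "dog": {"ran": 1.0},
}

def sample(dist):
    # Draw a token with probability proportional to its weight.
    tokens, weights = zip(*dist.items())
    return random.choices(tokens, weights=weights)[0]

def generate(prompt, steps):
    seq = list(prompt)
    for _ in range(steps):
        dist = model[seq[-1]]      # predict a distribution over the next token
        seq.append(sample(dist))   # sample one and shift it into the input
    return seq

print(generate(["the"], 2))  # e.g. ['the', 'cat', 'sat']
```

This is autoregressive generation in miniature: the uncertainty is handled only because the output space is a finite set of tokens, which is exactly the property LeCun says video lacks.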

And there is a difference between this kind of process

[00:13:46] : [00:13:50]

and a process by which before producing a word,

[00:13:50] : [00:13:53]

when you talk.

[00:13:53] : [00:13:55]

When you and I talk,

[00:13:55] : [00:13:56]

you and I are bilinguals.

[00:13:56] : [00:13:58]

We think about what we're gonna say,

[00:13:58] : [00:14:00]

and it's relatively independent

[00:14:00] : [00:14:01]

of the language in which we're gonna say it.

[00:14:01] : [00:14:04]

When we talk about like, I don't know,

[00:14:04] : [00:14:06]

let's say a mathematical concept or something.

[00:14:06] : [00:14:09]

The kind of thinking that we're doing

[00:14:09] : [00:14:10]

and the answer that we're planning to produce

[00:14:10] : [00:14:13]

is not linked to whether we're gonna say it

[00:14:13] : [00:14:16]

in French or Russian or English.

[00:14:16] : [00:14:19]

- Chomsky just rolled his eyes, but I understand.

[00:14:19] : [00:14:21]

So you're saying that there's a bigger abstraction

[00:14:21] : [00:14:24]

that goes before language-

[00:14:24] : [00:14:28]

- [Yann] Yeah.
- And maps onto language.

[00:14:28] : [00:14:30]

- Right.

[00:14:30] : [00:14:31]

It's certainly true for a lot of thinking that we do.

[00:14:31] : [00:14:33]

- Is that obvious that we don't?

[00:14:33] : [00:14:35]

Like you're saying your thinking is the same in French

[00:14:35] : [00:14:39]

as it is in English?

[00:14:39] : [00:14:40]

- Yeah, pretty much.

[00:14:40] : [00:14:42]

- Pretty much or is this...

[00:14:42] : [00:14:43]

Like how flexible are you,

[00:14:43] : [00:14:45]

like if there's a probability distribution?

[00:14:45] : [00:14:48]

(both laugh)

[00:14:48] : [00:14:49]

- Well, it depends what kind of thinking, right?

[00:14:49] : [00:14:50]

If it's like producing puns,

[00:14:50] : [00:14:53]

I get much better in French than English about that (laughs)

[00:14:53] : [00:14:56]

or much worse-

[00:14:56] : [00:14:58]

- Is there an abstract representation of puns?

[00:14:58] : [00:15:00]

Like is your humor an abstract...

[00:15:00] : [00:15:01]

Like when you tweet

[00:15:01] : [00:15:03]

and your tweets are sometimes a little bit spicy,

[00:15:03] : [00:15:06]

is there an abstract representation in your brain of a tweet

[00:15:06] : [00:15:09]

before it maps onto English?

[00:15:09] : [00:15:11]

- There is an abstract representation

[00:15:11] : [00:15:13]

of imagining the reaction of a reader to that text.

[00:15:13] : [00:15:18]

- Oh, you start with laughter

[00:15:18] : [00:15:19]

and then figure out how to make that happen?

[00:15:19] : [00:15:22]

- Figure out like a reaction you wanna cause

[00:15:22] : [00:15:25]

and then figure out how to say it

[00:15:25] : [00:15:26]

so that it causes that reaction.

[00:15:26] : [00:15:29]

But that's like really close to language.

[00:15:29] : [00:15:30]

But think about like a mathematical concept

[00:15:30] : [00:15:34]

or imagining something you want to build out of wood

[00:15:34] : [00:15:38]

or something like this, right?

[00:15:38] : [00:15:40]

The kind of thinking you're doing

[00:15:40] : [00:15:41]

has absolutely nothing to do with language, really.

[00:15:41] : [00:15:43]

Like it's not like you have necessarily

[00:15:43] : [00:15:44]

like an internal monologue in any particular language.

[00:15:44] : [00:15:47]

You're imagining mental models of the thing, right?

[00:15:47] : [00:15:51]

I mean, if I ask you to like imagine

[00:15:51] : [00:15:54]

what this water bottle will look like

[00:15:54] : [00:15:56]

if I rotate it 90 degrees,

[00:15:56] : [00:15:59]

that has nothing to do with language.

[00:15:59] : [00:16:01]

And so clearly

[00:16:01] : [00:16:04]

there is a more abstract level of representation

[00:16:04] : [00:16:08]

in which we do most of our thinking

[00:16:08] : [00:16:11]

and we plan what we're gonna say

[00:16:11] : [00:16:13]

if the output is uttered words

[00:16:13] : [00:16:18]

as opposed to an output being muscle actions, right?

[00:16:18] : [00:16:24]

We plan our answer before we produce it.

[00:16:24] : [00:16:29]

And LLMs don't do that,

[00:16:29] : [00:16:30]

they just produce one word after the other,

[00:16:30] : [00:16:32]

instinctively if you want.

[00:16:32] : [00:16:35]

It's a bit like the subconscious actions where you don't...

[00:16:35] : [00:16:40]

Like you're distracted.

[00:16:40] : [00:16:42]

You're doing something,

[00:16:42] : [00:16:43]

you're completely concentrated

[00:16:43] : [00:16:45]

and someone comes to youand asks you a question.

[00:16:45] : [00:16:47]

And you kind of answer the question.

[00:16:47] : [00:16:49]

You don't have time to think about the answer,

[00:16:49] : [00:16:51]

but the answer is easy

[00:16:51] : [00:16:52]

so you don't need to pay attention

[00:16:52] : [00:16:54]

and you sort of respond automatically.

[00:16:54] : [00:16:55]

That's kind of what an LLM does, right?

[00:16:55] : [00:16:58]

It doesn't think about its answer, really.

[00:16:58] : [00:17:01]

It retrieves it because it's accumulated a lot of knowledge,

[00:17:01] : [00:17:04]

so it can retrieve some things,

[00:17:04] : [00:17:06]

but it's going to just spit out one token after the other

[00:17:06] : [00:17:10]

without planning the answer.

[00:17:10] : [00:17:13]

- But you're making it sound just one token after the other,

[00:17:13] : [00:17:17]

one token at a time generationis bound to be simplistic.

[00:17:17] : [00:17:22]

But if the world model is sufficiently sophisticated,

[00:17:22] : [00:17:28]

that one token at a time,

[00:17:28] : [00:17:30]

the most likely thing it generates as a sequence of tokens

[00:17:30] : [00:17:35]

is going to be a deeply profound thing.

[00:17:35] : [00:17:39]

- Okay.

[00:17:39] : [00:17:39]

But then that assumes that those systems

[00:17:39] : [00:17:42]

actually possess an internal world model.

[00:17:42] : [00:17:44]

- So it really goes to the...

[00:17:44] : [00:17:46]

I think the fundamental question is

[00:17:46] : [00:17:48]

can you build a really complete world model?

[00:17:48] : [00:17:53]

Not complete,

[00:17:53] : [00:17:54]

but one that has a deep understanding of the world.

[00:17:54] : [00:17:58]

- Yeah.

[00:17:58] : [00:17:59]

So can you build this first of all by prediction?

[00:17:59] : [00:18:03]

- [Lex] Right.

[00:18:03] : [00:18:04]

- And the answer is probably yes.

[00:18:04] : [00:18:06]

Can you build it by predicting words?

[00:18:06] : [00:18:10]

And the answer is most probably no,

[00:18:10] : [00:18:14]

because language is very poor in terms of...

[00:18:14] : [00:18:17]

Or weak or low bandwidth if you want,

[00:18:17] : [00:18:19]

there's just not enough information there.

[00:18:19] : [00:18:21]

So building world models means observing the world

[00:18:21] : [00:18:26]

and understanding why the world is evolving the way it is.

[00:18:26] : [00:18:32]

And then the extra component of a world model

[00:18:32] : [00:18:38]

is something that can predict

[00:18:38] : [00:18:41]

how the world is going to evolve

[00:18:41] : [00:18:42]

as a consequence of an action you might take, right?

[00:18:42] : [00:18:45]

So a world model really is,

[00:18:45] : [00:18:47]

here is my idea of the state of the world at time T,

[00:18:47] : [00:18:49]

here is an action I might take.

[00:18:49] : [00:18:51]

What is the predicted state of the world

[00:18:51] : [00:18:53]

at time T plus one?

[00:18:53] : [00:18:55]

Now, that state of the world

[00:18:55] : [00:18:57]

does not need to represent everything about the world,

[00:18:57] : [00:19:01]

it just needs to represent

[00:19:01] : [00:19:02]

enough that's relevant for this planning of the action,

[00:19:02] : [00:19:06]

but not necessarily all the details.

[00:19:06] : [00:19:08]
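That predictive structure can be written down as a simple interface (all names here are illustrative, not from any specific library or from LeCun's own systems):

```python
# A world model maps (state at time T, action) -> predicted state at T+1.
# The state need not capture every detail of the world, only what matters
# for scoring the outcome of each candidate action.
def plan(model, state, candidate_actions, score):
    """Pick the action whose predicted next state scores best."""
    return max(candidate_actions, key=lambda a: score(model(state, a)))

# Toy example: the "world" is a position on a line, actions shift it,
# and the goal is to end up near position 3.
step = lambda state, action: state + action
best = plan(step, 0, [-1, 1, 2], score=lambda s: -abs(s - 3))
print(best)  # 2 (the shift that lands closest to the goal)
```

Planning by imagining outcomes, as described above, is just this loop run over sequences of actions instead of single ones.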

Now, here is the problem.

[00:19:08] : [00:19:09]

You're not going to be able to do this

[00:19:09] : [00:19:11]

with generative models.

[00:19:11] : [00:19:14]

So a generative model that's trained on video,

[00:19:14] : [00:19:16]

and we've tried to do this for 10 years.

[00:19:16] : [00:19:18]

You take a video,

[00:19:18] : [00:19:20]

show a system a piece of video

[00:19:20] : [00:19:22]

and then ask it to predict the remainder of the video.

[00:19:22] : [00:19:25]

Basically predict what's gonna happen.

[00:19:25] : [00:19:27]

- One frame at a time.

[00:19:27] : [00:19:29]

Do the same thing as sort of the autoregressive LLMs do,

[00:19:29] : [00:19:33]

but for video.

[00:19:33] : [00:19:34]

- Right.

[00:19:34] : [00:19:35]

Either one frame at a time or a group of frames at a time.

[00:19:35] : [00:19:37]

But yeah, a large video model, if you want. (laughing)

[00:19:37] : [00:19:42]

The idea of doing this

[00:19:42] : [00:19:45]

has been floating around for a long time.

[00:19:45] : [00:19:46]

And at FAIR,

[00:19:46] : [00:19:48]

some colleagues and I

[00:19:48] : [00:19:51]

have been trying to dothis for about 10 years.

[00:19:51] : [00:19:53]

And you can't really do the same trick as with LLMs,

[00:19:53] : [00:19:58]

because LLMs, as I said,

[00:19:58] : [00:20:02]

you can't predict exactly which word is gonna follow

[00:20:02] : [00:20:05]

a sequence of words,

[00:20:05] : [00:20:06]

but you can predict the distribution of the words.

[00:20:06] : [00:20:09]

Now, if you go to video,

[00:20:09] : [00:20:11]

what you would have to do

[00:20:11] : [00:20:12]

is predict the distribution

[00:20:12] : [00:20:13]

of all possible frames in a video.

[00:20:13] : [00:20:16]

And we don't really know how to do that properly.

[00:20:16] : [00:20:19]

We do not know how to represent distributions

[00:20:19] : [00:20:21]

over high dimensional continuous spaces

[00:20:21] : [00:20:24]

in ways that are useful.

[00:20:24] : [00:20:25]

And there lies the main issue.

[00:20:25] : [00:20:31]

And the reason we can't do this

[00:20:31] : [00:20:33]

is because the world

[00:20:33] : [00:20:34]

is incredibly more complicated and richer

[00:20:34] : [00:20:38]

in terms of information than text.

[00:20:38] : [00:20:40]

Text is discrete.

[00:20:40] : [00:20:41]

Video is high dimensional and continuous.

[00:20:41] : [00:20:45]

A lot of details in this.

[00:20:45] : [00:20:47]

So if I take a video of this room,

[00:20:47] : [00:20:49]

and the video is a camera panning around,

[00:20:49] : [00:20:54]

there is no way I can predict

[00:20:54] : [00:20:57]

everything that's gonna be in the room as I pan around,

[00:20:57] : [00:21:00]

the system cannot predict what's gonna be in the room

[00:21:00] : [00:21:02]

as the camera is panning.

[00:21:02] : [00:21:03]

Maybe it's gonna predict,

[00:21:03] : [00:21:06]

this is a room where there's a light and there is a wall

[00:21:06] : [00:21:08]

and things like that.

[00:21:08] : [00:21:09]

It can't predict what the painting on the wall looks like

[00:21:09] : [00:21:11]

or what the texture of the couch looks like.

[00:21:11] : [00:21:14]

Certainly not the texture of the carpet.

[00:21:14] : [00:21:16]

So there's no way it can predict all those details.

[00:21:16] : [00:21:19]

So the way to handle this

[00:21:19] : [00:21:22]

or one possible way to handle this,

[00:21:22] : [00:21:24]

which we've been working on for a long time,

[00:21:24] : [00:21:26]

is to have a model that has what's called a latent variable.

[00:21:26] : [00:21:29]

And the latent variable is fed to a neural net,

[00:21:29] : [00:21:33]

and it's supposed to represent

[00:21:33] : [00:21:34]

all the information about the world

[00:21:34] : [00:21:35]

that you don't perceive yet.

[00:21:35] : [00:21:37]

And that you need to augment the system

[00:21:37] : [00:21:42]

for the prediction to do a good job at predicting pixels,

[00:21:42] : [00:21:47]

including the fine texture of the carpet and the couch

[00:21:47] : [00:21:52]

and the painting on the wall.

[00:21:52] : [00:21:54]

That has been a complete failure, essentially.

[00:21:54] : [00:22:00]

And we've tried lots of things.

[00:22:00] : [00:22:01]

We tried just straight neural nets,

[00:22:01] : [00:22:03]

we tried GANs,

[00:22:03] : [00:22:04]

we tried VAEs,

[00:22:04] : [00:22:08]

all kinds of regularized autoencoders,

[00:22:08] : [00:22:10]

we tried many things.

[00:22:10] : [00:22:13]

We also tried those kind of methods

[00:22:13] : [00:22:15]

to learn good representations of images or video

[00:22:15] : [00:22:20]

that could then be used as input

[00:22:20] : [00:22:24]

to, for example, an image classification system.

[00:22:24] : [00:22:26]

And that also has basically failed.

[00:22:26] : [00:22:29]

Like all the systems that attempt to predict missing parts

[00:22:29] : [00:22:33]

of an image or a video

[00:22:33] : [00:22:34]

from a corrupted version of it, basically.

[00:22:34] : [00:22:40]

So, right, take an image or a video,

[00:22:40] : [00:22:41]

corrupt it or transform it in some way,

[00:22:41] : [00:22:44]

and then try to reconstruct the complete video or image

[00:22:44] : [00:22:47]

from the corrupted version.

[00:22:47] : [00:22:48]

And then hope that internally,

[00:22:48] : [00:22:52]

the system will develop good representations of images

[00:22:52] : [00:22:54]

that you can use for object recognition,

[00:22:54] : [00:22:57]

segmentation, whatever it is.

[00:22:57] : [00:22:58]

That has been essentially a complete failure.

[00:22:58] : [00:23:01]

And it works really well for text.

[00:23:01] : [00:23:04]

That's the principle that is used for LLMs, right?

[00:23:04] : [00:23:07]

- So where's the failure exactly?

[00:23:07] : [00:23:08]

Is it that it is very difficult to form

[00:23:08] : [00:23:11]

a good representation of an image,

[00:23:11] : [00:23:14]

like a good embedding

[00:23:14] : [00:23:16]

of all the important information in the image?

[00:23:16] : [00:23:19]

Is it in terms of the consistency

[00:23:19] : [00:23:21]

of image to image to image to image that forms the video?

[00:23:21] : [00:23:24]

If we do a highlight reel of all the ways you failed.

[00:23:24] : [00:23:28]

What's that look like?

[00:23:28] : [00:23:30]

- Okay.

[00:23:30] : [00:23:31]

So the reason this doesn't work is...

[00:23:31] : [00:23:35]

First of all, I have to tell you exactly what doesn't work

[00:23:35] : [00:23:37]

because there is something else that does work.

[00:23:37] : [00:23:40]

So the thing that does not work

[00:23:40] : [00:23:41]

is training the system to learn representations of images

[00:23:41] : [00:23:46]

by training it to reconstruct a good image

[00:23:46] : [00:23:51]

from a corrupted version of it.

[00:23:51] : [00:23:53]

Okay.

[00:23:53] : [00:23:54]

That's what doesn't work.

[00:23:54] : [00:23:55]

And we have a whole slewof techniques for this

[00:23:55] : [00:23:58]

that are variants of denoising autoencoders.

[00:23:58] : [00:24:02]

Something called MAE,

[00:24:02] : [00:24:03]

developed by some of my colleagues at FAIR,

[00:24:03] : [00:24:05]

masked autoencoder.

[00:24:05] : [00:24:06]

So it's basically like the LLMs or things like this

[00:24:06] : [00:24:11]

where you train the system by corrupting text,

[00:24:11] : [00:24:13]

except you corrupt images.

[00:24:13] : [00:24:15]

You remove patches from it

[00:24:15] : [00:24:16]

and you train a gigantic neural network to reconstruct it.

[00:24:16] : [00:24:19]
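The masking recipe he describes, remove patches and reconstruct them, can be sketched in a few lines. This is a hedged toy (hypothetical patch lists and ratios); the real MAE masks image patches and reconstructs them with a large vision transformer:

```python
import random

def mask_patches(patches, mask_ratio=0.75, seed=0):
    """Split image patches into a visible set and a masked set, MAE-style:
    a large fraction of patches is removed before encoding."""
    rng = random.Random(seed)
    idx = list(range(len(patches)))
    rng.shuffle(idx)
    n_masked = int(len(patches) * mask_ratio)
    return sorted(idx[n_masked:]), sorted(idx[:n_masked])  # visible, masked

def reconstruction_loss(pred, target, masked):
    """MAE computes its reconstruction loss only on the masked patches."""
    return sum((pred[i] - target[i]) ** 2 for i in masked) / len(masked)

patches = [0.1 * i for i in range(16)]     # a flattened 4x4 grid of patches
visible, masked = mask_patches(patches)
loss = reconstruction_loss([0.0] * 16, patches, masked)  # score an all-zero guess
```

Training drives this loss down; the transcript's point is that the encoder features this objective produces turn out to be weak compared with supervised training of the same architecture.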

The features you get are not good.

[00:24:19] : [00:24:20]

And you know they're not good

[00:24:20] : [00:24:22]

because if you now train the same architecture,

[00:24:22] : [00:24:25]

but you train it supervised with labeled data,

[00:24:25] : [00:24:30]

with textual descriptions of images, et cetera,

[00:24:30] : [00:24:34]

you do get good representations.

[00:24:34] : [00:24:35]

And the performance on recognition tasks is much better

[00:24:35] : [00:24:39]

than if you do this self-supervised pre-training.

[00:24:39] : [00:24:41]

- So the architecture is good.

[00:24:41] : [00:24:44]

- The architecture is good.

[00:24:44] : [00:24:45]

The architecture of the encoder is good.

[00:24:45] : [00:24:47]

Okay?

[00:24:47] : [00:24:48]

But the fact that you train the system to reconstruct images

[00:24:48] : [00:24:51]

does not lead it to produce

[00:24:51] : [00:24:53]

good generic features of images.

[00:24:53] : [00:24:56]

- [Lex] When you train it in a self-supervised way.

[00:24:56] : [00:24:58]

- Self-supervised by reconstruction.

[00:24:58] : [00:25:00]

- [Lex] Yeah, by reconstruction.

[00:25:00] : [00:25:01]

- Okay, so what's the alternative?

[00:25:01] : [00:25:02]

(both laugh)

[00:25:02] : [00:25:04]

The alternative is joint embedding.

[00:25:04] : [00:25:07]

- What is joint embedding?

[00:25:07] : [00:25:08]

What are these architectures that you're so excited about?

[00:25:08] : [00:25:11]

- Okay, so now instead of training a system

[00:25:11] : [00:25:13]

to encode the image

[00:25:13] : [00:25:14]

and then training it to reconstruct the full image

[00:25:14] : [00:25:17]

from a corrupted version,

[00:25:17] : [00:25:20]

you take the full image,

[00:25:20] : [00:25:21]

you take the corrupted or transformed version,

[00:25:21] : [00:25:25]

you run them both through encoders,

[00:25:25] : [00:25:27]

which in general are identical, but not necessarily.

[00:25:27] : [00:25:30]

And then you train a predictor on top of those encoders

[00:25:30] : [00:25:36]

to predict the representation of the full input

[00:25:36] : [00:25:42]

from the representation of the corrupted one.

[00:25:42] : [00:25:45]

Okay?

[00:25:45] : [00:25:47]

So joint embedding,

[00:25:47] : [00:25:48]

because you're taking the full input

[00:25:48] : [00:25:51]

and the corrupted version or transformed version,

[00:25:51] : [00:25:54]

run them both through encoders

[00:25:54] : [00:25:55]

so you get a joint embedding.

[00:25:55] : [00:25:57]

And then you're saying

[00:25:57] : [00:25:59]

can I predict the representation of the full one

[00:25:59] : [00:26:02]

from the representation of the corrupted one?

[00:26:02] : [00:26:04]

Okay?

[00:26:04] : [00:26:05]

And I call this a JEPA,

[00:26:05] : [00:26:07]

so that means joint embedding predictive architecture

[00:26:07] : [00:26:09]

because there's joint embedding

[00:26:09] : [00:26:11]

and there is this predictor

[00:26:11] : [00:26:12]

that predicts the representation

[00:26:12] : [00:26:13]

of the good guy from the bad guy.

[00:26:13] : [00:26:15]
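In toy form, the JEPA he describes computes its loss in representation space rather than input space: encode the full input, encode the corrupted input, and train a predictor to map one representation to the other. The scalar "encoders" and the brute-force parameter search below are hypothetical stand-ins for real networks and gradient descent:

```python
# Toy JEPA: the loss lives in representation space, not pixel space.
def encoder(x, w=0.5):
    """A stand-in for an encoder network: a fixed linear map."""
    return [w * v for v in x]

def corrupt(x):
    """Toy 'corruption': zero out the second half of the input."""
    return x[: len(x) // 2] + [0.0] * (len(x) - len(x) // 2)

def predictor(rep, p):
    """A one-parameter predictor on top of the corrupted representation."""
    return [p * v for v in rep]

def jepa_loss(x, p):
    s_full = encoder(x)              # representation of the full input
    s_corr = encoder(corrupt(x))     # representation of the corrupted one
    pred = predictor(s_corr, p)
    return sum((a - b) ** 2 for a, b in zip(pred, s_full))

x = [1.0, 1.0, 1.0, 1.0]
# crude grid search for the predictor parameter instead of gradient descent
best_p = min((p / 10 for p in range(0, 31)), key=lambda p: jepa_loss(x, p))
```

Nothing here ever reconstructs pixels; the prediction target is the other branch's representation, which is the architectural difference from reconstruction-based training.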

And the big question is

[00:26:15] : [00:26:18]

how do you train something like this?

[00:26:18] : [00:26:20]

And until five years ago or six years ago,

[00:26:20] : [00:26:23]

we didn't have particularly good answers

[00:26:23] : [00:26:26]

for how you train those things,

[00:26:26] : [00:26:27]

except for one called contrastive learning.

[00:26:27] : [00:26:31]

And the idea of contrastive learning

[00:26:31] : [00:26:36]

is you take a pair of images

[00:26:36] : [00:26:38]

that are, again, an image and a corrupted version

[00:26:38] : [00:26:42]

or degraded version somehow

[00:26:42] : [00:26:44]

or transformed version of the original one.

[00:26:44] : [00:26:47]

And you train the predicted representation

[00:26:47] : [00:26:49]

to be the same as that of the original.

[00:26:49] : [00:26:51]

If you only do this,

[00:26:51] : [00:26:52]

this system collapses.

[00:26:52] : [00:26:53]

It basically completely ignores the input

[00:26:53] : [00:26:55]

and produces representations that are constant.

[00:26:55] : [00:26:58]

So the contrastive methods avoid this.

[00:26:58] : [00:27:02]

And those things have beenaround since the early '90s,

[00:27:02] : [00:27:05]

I had a paper on this in 1993,

[00:27:05] : [00:27:07]

is you also show pairs of images that you know are different

[00:27:07] : [00:27:13]

and then you push away the representations from each other.

[00:27:13] : [00:27:17]

So you say not only do representations of things

[00:27:17] : [00:27:20]

that we know are the same,

[00:27:20] : [00:27:22]

should be the same or should be similar,

[00:27:22] : [00:27:23]

but representations of things that we know are different

[00:27:23] : [00:27:25]

should be different.

[00:27:25] : [00:27:26]

And that prevents the collapse,

[00:27:26] : [00:27:29]

but it has some limitations.

[00:27:29] : [00:27:30]
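A minimal sketch of that contrastive idea (hedged: toy 2-D representations and a margin loss, not any specific published method). The attract term alone is minimized by mapping everything to one point, which is exactly the collapse he describes; the repel term on known-different pairs is what prevents it:

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def contrastive_loss(anchor, positive, negative, margin=1.0):
    """Pull the positive pair together, push the negative pair apart
    up to a margin. Without the repel term, mapping every input to the
    same constant representation would give zero loss: collapse."""
    attract = sq_dist(anchor, positive)
    repel = max(0.0, margin - sq_dist(anchor, negative))
    return attract + repel

anchor   = [0.0, 0.0]   # representation of an image
positive = [0.1, 0.0]   # representation of its corrupted version
negative = [2.0, 0.0]   # representation of a different image
loss = contrastive_loss(anchor, positive, negative)
```

Here the negative pair is already farther apart than the margin, so only the small attract term remains.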

And there's a whole bunch of techniques

[00:27:30] : [00:27:31]

that have appeared over the last six, seven years

[00:27:31] : [00:27:35]

that can revive this type of method.

[00:27:35] : [00:27:38]

Some of them from FAIR,

[00:27:38] : [00:27:40]

some of them from Google and other places.

[00:27:40] : [00:27:44]

But there are limitations to those contrastive methods.

[00:27:44] : [00:27:47]

What has changed in the last three, four years

[00:27:47] : [00:27:51]

is now we have methods that are non-contrastive.

[00:27:51] : [00:27:54]

So they don't require those negative contrastive samples

[00:27:54] : [00:27:59]

of images that we know are different.

[00:27:59] : [00:28:01]

You train them only with images

[00:28:01] : [00:28:04]

that are different versions

[00:28:04] : [00:28:06]

or different views of the same thing.

[00:28:06] : [00:28:08]

And you rely on some other tweaks

[00:28:08] : [00:28:10]

to prevent the system from collapsing.

[00:28:10] : [00:28:12]

And we have half a dozen different methods for this now.

[00:28:12] : [00:28:16]

- So what is the fundamental difference

[00:28:16] : [00:28:17]

between joint embedding architectures and LLMs?

[00:28:17] : [00:28:22]

So can JEPA take us to AGI?

[00:28:22] : [00:28:26]

Though we should say that you don't like the term AGI

[00:28:26] : [00:28:31]

and we'll probably argue,

[00:28:31] : [00:28:33]

I think every single time I've talked to you

[00:28:33] : [00:28:34]

we've argued about the G in AGI.

[00:28:34] : [00:28:36]

- [Yann] Yes.

[00:28:36] : [00:28:38]

- I get it, I get it, I get it. (laughing)

[00:28:38] : [00:28:40]

Well, we'll probably continue to argue about it.

[00:28:40] : [00:28:42]

It's great.

[00:28:42] : [00:28:43]

Because you're like French,

[00:28:43] : [00:28:48]

and ami is I guess friend in French-

[00:28:48] : [00:28:51]

- [Yann] Yes.

[00:28:51] : [00:28:52]

- And AMI stands for advanced machine intelligence-

[00:28:52] : [00:28:55]

- [Yann] Right.

[00:28:55] : [00:28:56]

- But either way, can JEPA take us to that,

[00:28:56] : [00:29:00]

towards that advanced machine intelligence?

[00:29:00] : [00:29:02]

- Well, so it's a first step.

[00:29:02] : [00:29:04]

Okay?

[00:29:04] : [00:29:05]

So first of all, what's the difference

[00:29:05] : [00:29:07]

with generative architectures like LLMs?

[00:29:07] : [00:29:10]

So LLMs or vision systems that are trained by reconstruction

[00:29:10] : [00:29:15]

generate the inputs, right?

[00:29:15] : [00:29:20]

They generate the original input

[00:29:20] : [00:29:22]

that is non-corrupted, non-transformed, right?

[00:29:22] : [00:29:27]

So you have to predict all the pixels.

[00:29:27] : [00:29:28]

And there is a huge amount of resources spent in the system

[00:29:28] : [00:29:33]

to actually predict all those pixels, all the details.

[00:29:33] : [00:29:36]

In a JEPA, you're not trying to predict all the pixels,

[00:29:36] : [00:29:40]

you're only trying to predict

[00:29:40] : [00:29:42]

an abstract representation of the inputs, right?

[00:29:42] : [00:29:47]

And that's much easier in many ways.

[00:29:47] : [00:29:49]

So what the JEPA system

[00:29:49] : [00:29:50]

when it's being trained is trying to do,

[00:29:50] : [00:29:52]

is extract as much information as possible from the input,

[00:29:52] : [00:29:56]

but yet only extract information

[00:29:56] : [00:29:58]

that is relatively easily predictable.

[00:29:58] : [00:30:00]

Okay.

[00:30:00] : [00:30:02]

So there's a lot of things in the world

[00:30:02] : [00:30:03]

that we cannot predict.

[00:30:03] : [00:30:04]

Like for example, if you have a self-driving car

[00:30:04] : [00:30:07]

driving down the street or road.

[00:30:07] : [00:30:08]

There may be trees around the road.

[00:30:08] : [00:30:13]

And it could be a windy day,

[00:30:13] : [00:30:14]

so the leaves on the tree are kind of moving

[00:30:14] : [00:30:17]

in kind of semi-chaotic random ways

[00:30:17] : [00:30:19]

that you can't predict and you don't care,

[00:30:19] : [00:30:22]

you don't want to predict.

[00:30:22] : [00:30:23]

So what you want is your encoder

[00:30:23] : [00:30:25]

to basically eliminate all those details.

[00:30:25] : [00:30:27]

It'll tell you there's moving leaves,

[00:30:27] : [00:30:28]

but it's not gonna keep the details

[00:30:28] : [00:30:30]

of exactly what's going on.

[00:30:30] : [00:30:32]

And so when you do the prediction in representation space,

[00:30:32] : [00:30:35]

you're not going to have to predict

[00:30:35] : [00:30:37]

every single pixel of every leaf.

[00:30:37] : [00:30:38]

And that not only is a lot simpler,

[00:30:38] : [00:30:43]

but also it allows the system

[00:30:43] : [00:30:45]

to essentially learn an abstract representation of the world

[00:30:45] : [00:30:49]

where what can be modeled and predicted is preserved

[00:30:49] : [00:30:54]

and the rest is viewed as noise

[00:30:54] : [00:30:57]

and eliminated by the encoder.

[00:30:57] : [00:30:59]

So it kind of lifts the level of abstraction

[00:30:59] : [00:31:00]

of the representation.

[00:31:00] : [00:31:02]

If you think about this,

[00:31:02] : [00:31:03]

this is something we do absolutely all the time.

[00:31:03] : [00:31:05]

Whenever we describe a phenomenon,

[00:31:05] : [00:31:07]

we describe it at a particular level of abstraction.

[00:31:07] : [00:31:10]

And we don't always describe every natural phenomenon

[00:31:10] : [00:31:13]

in terms of quantum field theory, right?

[00:31:13] : [00:31:15]

That would be impossible, right?

[00:31:15] : [00:31:17]

So we have multiple levels of abstraction

[00:31:17] : [00:31:19]

to describe what happens in the world.

[00:31:19] : [00:31:22]

Starting from quantum field theory

[00:31:22] : [00:31:24]

to like atomic theory and molecules in chemistry,

[00:31:24] : [00:31:27]

materials,

[00:31:27] : [00:31:29]

all the way up to kind of concrete objects in the real world

[00:31:29] : [00:31:33]

and things like that.

[00:31:33] : [00:31:34]

So we can't just model everything at the lowest level.

[00:31:34] : [00:31:39]

And that's what the idea of JEPA is really about.

[00:31:39] : [00:31:44]

Learn abstract representations in a self-supervised manner.

[00:31:44] : [00:31:49]

And you can do it hierarchically as well.

[00:31:49] : [00:31:52]

So that I think is an essential component

[00:31:52] : [00:31:54]

of an intelligent system.

[00:31:54] : [00:31:56]

And in language, we canget away without doing this

[00:31:56] : [00:31:58]

because language is already to some level abstract

[00:31:58] : [00:32:02]

and already has eliminated a lot of information

[00:32:02] : [00:32:05]

that is not predictable.

[00:32:05] : [00:32:07]

And so we can get away without doing the joint embedding,

[00:32:07] : [00:32:11]

without lifting the abstraction level

[00:32:11] : [00:32:13]

and by directly predicting words.

[00:32:13] : [00:32:15]

- So joint embedding.

[00:32:15] : [00:32:17]

It's still generative,

[00:32:17] : [00:32:20]

but it's generative in this abstract representation space.

[00:32:20] : [00:32:23]

- [Yann] Yeah.

[00:32:23] : [00:32:24]

- And you're saying language,

[00:32:24] : [00:32:25]

we were lazy with language

[00:32:25] : [00:32:27]

'cause we already got the abstract representation for free

[00:32:27] : [00:32:30]

and now we have to zoom out,

[00:32:30] : [00:32:32]

actually think about generally intelligent systems,

[00:32:32] : [00:32:34]

we have to deal with the full mess

[00:32:34] : [00:32:37]

of physical reality.

[00:32:37] : [00:32:40]

And you do have to do this step

[00:32:40] : [00:32:42]

of jumping from the full, rich, detailed reality

[00:32:42] : [00:32:47]

to an abstract representation of that reality

[00:32:47] : [00:32:54]

based on which you can then reason

[00:32:54] : [00:32:56]

and all that kind of stuff.

[00:32:56] : [00:32:57]

- Right.

[00:32:57] : [00:32:58]

And the thing is those self-supervised algorithms

[00:32:58] : [00:33:00]

that learn by prediction,

[00:33:00] : [00:33:02]

even in representation space,

[00:33:02] : [00:33:04]

they learn more concepts

[00:33:04] : [00:33:09]

if the input data you feed them is more redundant.

[00:33:09] : [00:33:12]

The more redundancy there is in the data,

[00:33:12] : [00:33:14]

the more they're able to capture

[00:33:14] : [00:33:15]

some internal structure of it.

[00:33:15] : [00:33:17]

And so there,

[00:33:17] : [00:33:18]

there is way more redundancy in the structure

[00:33:18] : [00:33:21]

in perceptual inputs, sensory input like vision,

[00:33:21] : [00:33:26]

than there is in text,

[00:33:26] : [00:33:28]

which is not nearly as redundant.

[00:33:28] : [00:33:30]

This is back to the question you were asking

[00:33:30] : [00:33:32]

a few minutes ago.

[00:33:32] : [00:33:33]

Language might represent more information really

[00:33:33] : [00:33:35]

because it's already compressed,

[00:33:35] : [00:33:36]

you're right about that.

[00:33:36] : [00:33:38]

But that means it's also less redundant.

[00:33:38] : [00:33:40]

And so self-supervised learning alone will not work as well.

[00:33:40] : [00:33:43]

- Is it possible to join

[00:33:43] : [00:33:45]

the self-supervised training on visual data

[00:33:45] : [00:33:49]

and self-supervised training on language data?

[00:33:49] : [00:33:53]

There is a huge amount of knowledge

[00:33:53] : [00:33:56]

even though you talk down about those 10 to the 13 tokens.

[00:33:56] : [00:34:00]

Those 10 to the 13 tokens

[00:34:00] : [00:34:01]

represent the entirety,

[00:34:01] : [00:34:03]

a large fraction of what we humans have figured out.

[00:34:03] : [00:34:08]

Both the shit talk on Reddit

[00:34:08] : [00:34:11]

and the contents of all the books and the articles

[00:34:11] : [00:34:14]

and the full spectrum of human intellectual creation.

[00:34:14] : [00:34:18]

So is it possible to join those two together?

[00:34:18] : [00:34:22]

- Well, eventually, yes,

[00:34:22] : [00:34:23]

but I think if we do this too early,

[00:34:23] : [00:34:27]

we run the risk of being tempted to cheat.

[00:34:27] : [00:34:30]

And in fact, that's whatpeople are doing at the moment

[00:34:30] : [00:34:32]

with vision-language models.

[00:34:32] : [00:34:33]

We're basically cheating.

[00:34:33] : [00:34:35]

We are using language as a crutch

[00:34:35] : [00:34:38]

to compensate for the deficiencies of our vision systems

[00:34:38] : [00:34:42]

to kind of learn good representations from images and video.

[00:34:42] : [00:34:46]

And the problem with this

[00:34:46] : [00:34:47]

is that we might improve our vision-language systems a bit,

[00:34:47] : [00:34:52]

I mean our language models by feeding them images.

[00:34:52] : [00:34:58]

But we're not gonna get to the level

[00:34:58] : [00:34:59]

of even the intelligence

[00:34:59] : [00:35:01]

or level of understanding of the world

[00:35:01] : [00:35:03]

of a cat or a dog, which doesn't have language.

[00:35:03] : [00:35:06]

They don't have language

[00:35:06] : [00:35:08]

and they understand the world much better than any LLM.

[00:35:08] : [00:35:12]

They can plan really complex actions

[00:35:12] : [00:35:14]

and sort of imagine the result of a bunch of actions.

[00:35:14] : [00:35:17]

How do we get machines to learn that

[00:35:17] : [00:35:20]

before we combine that with language?

[00:35:20] : [00:35:22]

Obviously, if we combinethis with language,

[00:35:22] : [00:35:24]

this is gonna be a winner,

[00:35:24] : [00:35:26]

but before that we have to focus

[00:35:26] : [00:35:30]

on like how do we get systems to learn how the world works?

[00:35:30] : [00:35:33]

- So this kind of joint embeddingpredictive architecture,

[00:35:33] : [00:35:37]

for you, that's gonna be able to learn

[00:35:37] : [00:35:40]

something like common sense,

[00:35:40] : [00:35:41]

something like what a cat uses

[00:35:41] : [00:35:43]

to predict how to mess with its owner most optimally

[00:35:43] : [00:35:47]

by knocking over a thing.

[00:35:47] : [00:35:49]

- That's the hope.

[00:35:49] : [00:35:51]

In fact, the techniques we're using are non-contrastive.

[00:35:51] : [00:35:54]

So not only is the architecture non-generative,

[00:35:54] : [00:35:57]

the learning procedures we're using are non-contrastive.

[00:35:57] : [00:36:00]

We have two sets of techniques.

[00:36:00] : [00:36:02]

One set is based on distillation

[00:36:02] : [00:36:05]

and there's a number of methods that use this principle.

[00:36:05] : [00:36:10]

One by DeepMind called BYOL.

[00:36:10] : [00:36:12]

A couple by FAIR,

[00:36:12] : [00:36:14]

one called VICReg and another one called I-JEPA.

[00:36:14] : [00:36:19]

And VICReg, I should say,

[00:36:19] : [00:36:21]

is not a distillation method actually,

[00:36:21] : [00:36:23]

but I-JEPA and BYOL certainly are.

[00:36:23] : [00:36:25]

And there's another one also called DINO or Dino,

[00:36:25] : [00:36:28]

also produced at FAIR.

[00:36:28] : [00:36:31]

And the idea of those things

[00:36:31] : [00:36:32]

is that you take the full input, let's say an image.

[00:36:32] : [00:36:35]

You run it through an encoder,

[00:36:35] : [00:36:37]

which produces a representation.

[00:36:37] : [00:36:41]

And then you corrupt that input or transform it,

[00:36:41] : [00:36:43]

run it through essentially what amounts to the same encoder

[00:36:43] : [00:36:46]

with some minor differences.

[00:36:46] : [00:36:48]

And then train a predictor.

[00:36:48] : [00:36:50]

Sometimes a predictor is very simple,

[00:36:50] : [00:36:51]

sometimes it doesn't exist.

[00:36:51] : [00:36:53]

But train a predictor to predict a representation

[00:36:53] : [00:36:55]

of the first uncorrupted input from the corrupted input.

[00:36:55] : [00:37:00]

But you only train the second branch.

[00:37:00] : [00:37:04]

You only train the part of the network

[00:37:04] : [00:37:07]

that is fed with the corrupted input.

[00:37:07] : [00:37:10]

The other network, you don't train.

[00:37:10] : [00:37:12]

But since they share the same weights,

[00:37:12] : [00:37:14]

when you modify the first one,

[00:37:14] : [00:37:15]

it also modifies the second one.

[00:37:15] : [00:37:18]

And with various tricks,

[00:37:18] : [00:37:19]

you can prevent the system from collapsing

[00:37:19] : [00:37:21]

into the type of collapse I was explaining before

[00:37:21] : [00:37:24]

where the system basically ignores the input.

[00:37:24] : [00:37:26]
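One such trick, used by BYOL-style distillation methods, is a momentum (EMA) target: gradient flows only through the branch fed the corrupted input, while the target branch's weights slowly track the trained ones. The scalar "encoders" below are a hypothetical sketch of the mechanism, not the actual implementation:

```python
def distill_step(x_full, x_corr, w_online, w_target, lr=0.2, ema=0.9):
    """One training step of a toy distillation setup: only the online
    branch (fed the corrupted input) receives gradient; the target
    branch is treated as a constant (stop-gradient)."""
    target = w_target * x_full     # target branch output: no gradient here
    online = w_online * x_corr     # online branch output: this one is trained
    grad = 2.0 * (online - target) * x_corr   # d/dw of (online - target)^2
    w_online = w_online - lr * grad
    # the target weights slowly follow the online weights (momentum encoder)
    w_target = ema * w_target + (1 - ema) * w_online
    return w_online, w_target

x_full, x_corr = 1.0, 0.5
w_online, w_target = 0.0, 1.0
gap0 = abs(w_online * x_corr - w_target * x_full)   # initial prediction gap
for _ in range(20):
    w_online, w_target = distill_step(x_full, x_corr, w_online, w_target)
gap = abs(w_online * x_corr - w_target * x_full)    # gap after training
```

The asymmetry between the two branches is one of the "various tricks" that keeps the toy system from simply ignoring its input.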

So that works very well.

[00:37:26] : [00:37:31]

The two techniques we've developed at FAIR,

[00:37:31] : [00:37:33]

DINO and I-JEPA work really well for that.

[00:37:33] : [00:37:38]

- So what kind of data are we talking about here?

[00:37:38] : [00:37:41]

- So there's several scenarios.

[00:37:41] : [00:37:43]

One scenario is you take an image,

[00:37:43] : [00:37:47]

you corrupt it by changing the cropping, for example,

[00:37:47] : [00:37:52]

changing the size a little bit,

[00:37:52] : [00:37:54]

maybe changing the orientation, blurring it,

[00:37:54] : [00:37:56]

changing the colors,

[00:37:56] : [00:37:58]

doing all kinds of horrible things to it-

[00:37:58] : [00:38:00]

- But basic horrible things.

[00:38:00] : [00:38:01]

- Basic horrible things

[00:38:01] : [00:38:02]

that sort of degrade the quality a little bit

[00:38:02] : [00:38:04]

and change the framing,

[00:38:04] : [00:38:05]

crop the image.

[00:38:05] : [00:38:08]

And in some cases, in the case of I-JEPA,

[00:38:08] : [00:38:12]

you don't need to do any of this,

[00:38:12] : [00:38:13]

you just mask some parts of it, right?

[00:38:13] : [00:38:16]

You just basically remove some regions

[00:38:16] : [00:38:19]

like a big block, essentially.

[00:38:19] : [00:38:21]

And then run through the encoders

[00:38:21] : [00:38:24]

and train the entire system,

[00:38:24] : [00:38:26]

encoder and predictor,

[00:38:26] : [00:38:27]

to predict the representation of the good one

[00:38:27] : [00:38:29]

from the representation of the corrupted one.

[00:38:29] : [00:38:31]

So that's the I-JEPA.

[00:38:31] : [00:38:35]

It doesn't need to know that it's an image, for example,

[00:38:35] : [00:38:38]

because the only thing it needs to know

[00:38:38] : [00:38:39]

is how to do this masking.

[00:38:39] : [00:38:40]

Whereas with DINO,

[00:38:40] : [00:38:43]

you need to know it's an image

[00:38:43] : [00:38:44]

because you need to do things

[00:38:44] : [00:38:45]

like geometric transformations and blurring

[00:38:45] : [00:38:48]

and things like that that are really image-specific.

[00:38:48] : [00:38:51]

A more recent version of this that we have is called V-JEPA.

[00:38:51] : [00:38:53]

So it's basically the same idea as I-JEPA

[00:38:53] : [00:38:56]

except it's applied to video.

[00:38:56] : [00:38:59]

So now you take a whole video

[00:38:59] : [00:39:00]

and you mask a whole chunk of it.

[00:39:00] : [00:39:02]

And what we mask is actually kind of a temporal tube.

[00:39:02] : [00:39:04]

So like a whole segment of each frame in the video

[00:39:04] : [00:39:07]

over the entire video.

[00:39:07] : [00:39:10]

- And that tube is like statically positioned

[00:39:10] : [00:39:12]

throughout the frames?

[00:39:12] : [00:39:14]

It's literally just a straight tube?

[00:39:14] : [00:39:15]

- Throughout the tube, yeah.

[00:39:15] : [00:39:17]

Typically it's 16 frames or something,

[00:39:17] : [00:39:18]

and we mask the same region over the entire 16 frames.

[00:39:18] : [00:39:22]

It's a different one for every video, obviously.

[00:39:22] : [00:39:24]
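That tube masking can be sketched as follows (hypothetical frame sizes and coordinates; the real V-JEPA operates on patch tokens, not raw pixels):

```python
def tube_mask(n_frames=16, height=8, width=8, top=2, left=2, size=4):
    """Build a per-frame mask for a video clip; 1 marks masked pixels.
    The masked block sits at the same spatial position in every frame,
    forming a 'tube' through time."""
    frame = [[1 if top <= r < top + size and left <= c < left + size else 0
              for c in range(width)]
             for r in range(height)]
    # the same frame mask is reused for every frame of the clip
    return [frame for _ in range(n_frames)]

mask = tube_mask()
masked_per_frame = sum(sum(row) for row in mask[0])
```

A different tube position would be drawn per training video; here the position is fixed for clarity.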

And then again, train that system

[00:39:24] : [00:39:28]

so as to predict the representation of the full video

[00:39:28] : [00:39:31]

from the partially masked video.

[00:39:31] : [00:39:34]

And that works really well.

[00:39:34] : [00:39:35]

It's the first system that we have

[00:39:35] : [00:39:36]

that learns good representations of video

[00:39:36] : [00:39:39]

so that when you feed those representations

[00:39:39] : [00:39:41]

to a supervised classifier head,

[00:39:41] : [00:39:44]

it can tell you what action is taking place in the video

[00:39:44] : [00:39:47]

with pretty good accuracy.

[00:39:47] : [00:39:49]

So it's the first time we get something of that quality.

[00:39:49] : [00:39:55]

- So that's a good test

[00:39:55] : [00:39:57]

that a good representation is formed.

[00:39:57] : [00:39:58]

That means there's something to this.

[00:39:58] : [00:40:00]

- Yeah.

[00:40:00] : [00:40:01]

We also have preliminary results

[00:40:01] : [00:40:03]

that seem to indicate

[00:40:03] : [00:40:05]

that the representation allows our system to tell

[00:40:05] : [00:40:09]

whether the video is physically possible

[00:40:09] : [00:40:12]

or completely impossible

[00:40:12] : [00:40:13]

because some object disappeared

[00:40:13] : [00:40:15]

or an object suddenly jumped from one location to another

[00:40:15] : [00:40:19]

or changed shape or something.

[00:40:19] : [00:40:21]

- So it's able to capture some physics-based constraints

[00:40:21] : [00:40:26]

about the reality represented in the video?

[00:40:26] : [00:40:29]

- [Yann] Yeah.

[00:40:29] : [00:40:30]

- About the appearance and the disappearance of objects?

[00:40:30] : [00:40:32]

- Yeah.

[00:40:32] : [00:40:34]

That's really new.

[00:40:34] : [00:40:35]

- Okay, but can this actually

[00:40:35] : [00:40:38]

get us to this kind of world model

[00:40:38] : [00:40:43]

that understands enough about the world

[00:40:43] : [00:40:46]

to be able to drive a car?

[00:40:46] : [00:40:48]

- Possibly.

[00:40:48] : [00:40:50]

And this is gonna take a while

[00:40:50] : [00:40:51]

before we get to that point.

[00:40:51] : [00:40:52]

And there are systems already, robotic systems,

[00:40:52] : [00:40:56]

that are based on this idea.

[00:40:56] : [00:40:58]

What you need for this

[00:40:58] : [00:41:02]

is a slightly modified version of this

[00:41:02] : [00:41:04]

where imagine that you have a video,

[00:41:04] : [00:41:09]

a complete video,

[00:41:09] : [00:41:12]

and what you're doing to this video

[00:41:12] : [00:41:13]

is that you are either translating it in time

[00:41:13] : [00:41:17]

towards the future.

[00:41:17] : [00:41:18]

So you'll only see the beginning of the video,

[00:41:18] : [00:41:19]

but you don't see the latter part of it

[00:41:19] : [00:41:21]

that is in the original one.

[00:41:21] : [00:41:23]

Or you just mask the second half of the video, for example.

[00:41:23] : [00:41:27]

And then you train this I-JEPA system

[00:41:27] : [00:41:30]

of the type I described,

[00:41:30] : [00:41:32]

to predict the representation of the full video

[00:41:32] : [00:41:33]

from the shifted one.

[00:41:33] : [00:41:36]

But you also feed the predictor with an action.

[00:41:36] : [00:41:39]

For example, the wheel is turned

[00:41:39] : [00:41:42]

10 degrees to the right or something, right?

[00:41:42] : [00:41:45]

So if it's a dash cam in a car

[00:41:45] : [00:41:49]

and you know the angle of the wheel,

[00:41:49] : [00:41:51]

you should be able to predict to some extent

[00:41:51] : [00:41:53]

what's going to happen to what you see.

[00:41:53] : [00:41:56]

You're not gonna be able to predict all the details

[00:41:56] : [00:41:59]

of objects that appear in the view, obviously,

[00:41:59] : [00:42:02]

but at an abstract representation level,

[00:42:02] : [00:42:05]

you can probably predict what's gonna happen.

[00:42:05] : [00:42:08]

So now what you have is an internal model

[00:42:08] : [00:42:12]

that says, here is my idea

[00:42:12] : [00:42:13]

of the state of the world at time T,

[00:42:13] : [00:42:15]

here is an action I'm taking,

[00:42:15] : [00:42:17]

here is a prediction

[00:42:17] : [00:42:18]

of the state of the world at time T plus one,

[00:42:18] : [00:42:20]

T plus delta T,

[00:42:20] : [00:42:22]

T plus two seconds, whatever it is.

[00:42:22] : [00:42:24]
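In toy form, such an action-conditioned world model maps an abstract state plus an action to the next abstract state. The dash-cam-flavored state and dynamics below are invented purely for illustration:

```python
import math

def world_model(state, action_deg, speed=1.0):
    """Toy action-conditioned world model: state is (heading, lateral
    offset); the action is a steering change in degrees. Returns the
    predicted state one step later."""
    heading, offset = state
    heading = heading + action_deg
    offset = offset + speed * math.sin(math.radians(heading))
    return (heading, offset)

state = (0.0, 0.0)
state = world_model(state, 10.0)   # turn the wheel 10 degrees to the right
state = world_model(state, 0.0)    # hold the wheel; the car keeps drifting
```

The state is deliberately abstract: heading and offset, not pixels, matching the point that prediction happens at a representation level.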

If you have a model of this type,

[00:42:24] : [00:42:26]

you can use it for planning.

[00:42:26] : [00:42:27]

So now you can do what LLMs cannot do,

[00:42:27] : [00:42:31]

which is planning what you're gonna do

[00:42:31] : [00:42:34]

so that you arrive at a particular outcome

[00:42:34] : [00:42:37]

or satisfy a particular objective, right?

[00:42:37] : [00:42:40]

So you can have a number of objectives, right?

[00:42:40] : [00:42:44]

I can predict that if I have an object like this, right?

[00:42:44] : [00:42:49]

And I open my hand,

[00:42:49] : [00:42:52]

it's gonna fall, right?

[00:42:52] : [00:42:54]

And if I push it with a particular force on the table,

[00:42:54] : [00:42:57]

it's gonna move.

[00:42:57] : [00:42:58]

If I push the table itself,

[00:42:58] : [00:43:00]

it's probably not gonna move with the same force.

[00:43:00] : [00:43:03]

So we have this internal model of the world in our mind,

[00:43:03] : [00:43:07]

which allows us to plan sequences of actions

[00:43:07] : [00:43:11]

to arrive at a particular goal.

[00:43:11] : [00:43:13]

And so now if you have this world model,

[00:43:13] : [00:43:18]

we can imagine a sequence of actions,

[00:43:18] : [00:43:21]

predict what the outcome

[00:43:21] : [00:43:22]

of the sequence of action is going to be,

[00:43:22] : [00:43:25]

measure to what extent the final state

[00:43:25] : [00:43:28]

satisfies a particular objective

[00:43:28] : [00:43:30]

like moving the bottle to the left of the table.

[00:43:30] : [00:43:35]

And then plan a sequence of actions

[00:43:35] : [00:43:38]

that will minimize this objective at runtime.

[00:43:38] : [00:43:41]

We're not talking about learning,

[00:43:41] : [00:43:43]

we're talking about inference time, right?

[00:43:43] : [00:43:44]

So this is planning, really.

[00:43:44] : [00:43:46]

And in optimal control,

[00:43:46] : [00:43:47]

this is a very classical thing.

[00:43:47] : [00:43:48]

It's called model predictive control.

[00:43:48] : [00:43:50]

You have a model of the system you want to control

[00:43:50] : [00:43:53]

that can predict the sequence of states

[00:43:53] : [00:43:55]

corresponding to a sequence of commands.

[00:43:55] : [00:43:58]

And you are planning a sequence of commands

[00:43:58] : [00:44:02]

so that according to your world model,

[00:44:02] : [00:44:04]

the end state of the system

[00:44:04] : [00:44:06]

will satisfy any objectives that you fix.

[00:44:06] : [00:44:10]

This is the way rocket trajectories have been planned

[00:44:10] : [00:44:15]

since computers have been around.

[00:44:15] : [00:44:17]

So since the early '60s, essentially.

[00:44:17] : [00:44:20]
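The plan-with-a-world-model loop Yann describes can be sketched in a few lines. Everything below is an illustrative toy, not his actual system: the 1-D point-mass dynamics, the horizon, and the random-shooting search over action sequences are all my assumptions.

```python
import random

# Toy world model: state = (position, velocity), action = acceleration.
def world_model(state, action, dt=0.1):
    pos, vel = state
    vel = vel + action * dt
    pos = pos + vel * dt
    return (pos, vel)

def rollout(state, actions):
    # Predict the sequence of states corresponding to a sequence of commands.
    for a in actions:
        state = world_model(state, a)
    return state

def plan(state, goal, horizon=20, candidates=500, seed=0):
    """Random-shooting planner: imagine many action sequences, simulate each
    with the world model, keep the one whose predicted end state best
    satisfies the objective (distance to goal). This is inference-time
    search, not learning."""
    rng = random.Random(seed)
    best_seq, best_cost = None, float("inf")
    for _ in range(candidates):
        seq = [rng.uniform(-1.0, 1.0) for _ in range(horizon)]
        end_pos, _ = rollout(state, seq)
        cost = abs(end_pos - goal)  # objective measured on the predicted outcome
        if cost < best_cost:
            best_cost, best_seq = cost, seq
    return best_seq, best_cost

seq, cost = plan(state=(0.0, 0.0), goal=1.0)
```

Real model predictive control re-plans at every step from the newly observed state; this sketch only shows the inner "imagine, predict, score" loop.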

- So yes, for a model predictive control,

[00:44:20] : [00:44:21]

but you also often talk about hierarchical planning.

[00:44:21] : [00:44:26]

- [Yann] Yeah.

[00:44:26] : [00:44:26]

- Can hierarchical planning emerge from this somehow?

[00:44:26] : [00:44:28]

- Well, so no.

[00:44:28] : [00:44:29]

You will have to build a specific architecture

[00:44:29] : [00:44:32]

to allow for hierarchical planning.

[00:44:32] : [00:44:34]

So hierarchical planning is absolutely necessary

[00:44:34] : [00:44:36]

if you want to plan complex actions.

[00:44:36] : [00:44:39]

If I wanna go, let's say, from New York to Paris,

[00:44:39] : [00:44:43]

this is the example I use all the time.

[00:44:43] : [00:44:45]

And I'm sitting in my office at NYU.

[00:44:45] : [00:44:48]

My objective that I need to minimize

[00:44:48] : [00:44:50]

is my distance to Paris.

[00:44:50] : [00:44:52]

At a high level,

[00:44:52] : [00:44:52]

a very abstract representation of my location,

[00:44:52] : [00:44:57]

I would have to decompose this into two sub-goals.

[00:44:57] : [00:44:59]

First one is go to the airport,

[00:44:59] : [00:45:02]

second one is catch a plane to Paris.

[00:45:02] : [00:45:04]

Okay.

[00:45:04] : [00:45:05]

So my sub-goal is now going to the airport.

[00:45:05] : [00:45:09]

My objective function is my distance to the airport.

[00:45:09] : [00:45:11]

How do I go to the airport?

[00:45:11] : [00:45:14]

Well, I have to go in the street and hail a taxi,

[00:45:14] : [00:45:18]

which you can do in New York.

[00:45:18] : [00:45:19]

Okay, now I have another sub-goal.

[00:45:19] : [00:45:22]

Go down on the street.

[00:45:22] : [00:45:24]

Well, that means going to the elevator,

[00:45:24] : [00:45:27]

going down the elevator,

[00:45:27] : [00:45:28]

walk out to the street.

[00:45:28] : [00:45:30]

How do I go to the elevator?

[00:45:30] : [00:45:32]

I have to stand up from my chair,

[00:45:32] : [00:45:36]

open the door of my office,

[00:45:36] : [00:45:38]

go to the elevator, push the button.

[00:45:38] : [00:45:40]

How do I get up from my chair?

[00:45:40] : [00:45:42]

Like you can imagine going down all the way down

[00:45:42] : [00:45:45]

to basically what amounts

[00:45:45] : [00:45:47]

to millisecond by millisecond muscle control.

[00:45:47] : [00:45:50]

Okay?

[00:45:50] : [00:45:51]

And obviously you're not going to plan your entire trip

[00:45:51] : [00:45:55]

from New York to Paris

[00:45:55] : [00:45:56]

in terms of millisecond by millisecond muscle control.

[00:45:56] : [00:46:00]

First, that would be incredibly expensive,

[00:46:00] : [00:46:02]

but it will also be completely impossible

[00:46:02] : [00:46:03]

because you don't know all the conditions

[00:46:03] : [00:46:06]

of what's gonna happen.

[00:46:06] : [00:46:07]

How long it's gonna take to catch a taxi

[00:46:07] : [00:46:10]

or to go to the airport with traffic.

[00:46:10] : [00:46:13]

I mean, you would have to know exactly

[00:46:13] : [00:46:16]

the condition of everything

[00:46:16] : [00:46:18]

to be able to do this planning,

[00:46:18] : [00:46:19]

and you don't have the information.

[00:46:19] : [00:46:21]

So you have to do this hierarchical planning

[00:46:21] : [00:46:23]

so that you can start acting

[00:46:23] : [00:46:25]

and then sort of re-planning as you go.

[00:46:25] : [00:46:27]

And nobody really knows how to do this in AI.

[00:46:27] : [00:46:32]

Nobody knows how to train a system

[00:46:32] : [00:46:35]

to learn the appropriate multiple levels of representation

[00:46:35] : [00:46:38]

so that hierarchical planning works.

[00:46:38] : [00:46:41]
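The New-York-to-Paris decomposition above can be written down as a recursive expansion over a plan library. To be clear about what this is: the plan library and goal names below are a hypothetical hand-built table for illustration; the open problem Yann states is precisely that nobody knows how to *learn* these levels of representation.

```python
# Hypothetical hand-written plan library: each abstract goal expands into
# sub-goals; expansion stops at actions we can execute directly.
PLAN_LIBRARY = {
    "go to Paris": ["go to the airport", "catch a plane to Paris"],
    "go to the airport": ["go down to the street", "hail a taxi"],
    "go down to the street": ["go to the elevator", "ride elevator down",
                              "walk out to the street"],
    "go to the elevator": ["stand up from chair", "open office door",
                           "walk to elevator"],
}

def expand(goal, depth=0, max_depth=10):
    """Depth-first expansion of a goal into primitive actions.
    A real agent would expand lazily and re-plan as conditions change,
    rather than unrolling everything down to muscle control up front."""
    if depth >= max_depth or goal not in PLAN_LIBRARY:
        return [goal]  # primitive enough: act on it directly
    steps = []
    for sub in PLAN_LIBRARY[goal]:
        steps.extend(expand(sub, depth + 1, max_depth))
    return steps

plan = expand("go to Paris")
```

The first primitive action that falls out is "stand up from chair", exactly the bottom of the hierarchy described in the conversation.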

- Does something like that already emerge?

[00:46:41] : [00:46:42]

So like can you use an LLM,

[00:46:42] : [00:46:45]

state-of-the-art LLM,

[00:46:45] : [00:46:48]

to get you from New York to Paris

[00:46:48] : [00:46:50]

by doing exactly the kind of detailed

[00:46:50] : [00:46:54]

set of questions that you just did?

[00:46:54] : [00:46:56]

Which is can you give me a list of 10 steps I need to do

[00:46:56] : [00:47:01]

to get from New York to Paris?

[00:47:01] : [00:47:02]

And then for each of those steps,

[00:47:02] : [00:47:05]

can you give me a list of 10 steps

[00:47:05] : [00:47:07]

how I make that step happen?

[00:47:07] : [00:47:09]

And for each of those steps,

[00:47:09] : [00:47:10]

can you give me a list of 10 steps

[00:47:10] : [00:47:12]

to make each one of those,

[00:47:12] : [00:47:13]

until you're moving your individual muscles?

[00:47:13] : [00:47:15]

Maybe not.

[00:47:15] : [00:47:17]

Whatever you can actually act upon

[00:47:17] : [00:47:19]

using your own mind.

[00:47:19] : [00:47:20]

- Right.

[00:47:20] : [00:47:22]

So there's a lot of questions

[00:47:22] : [00:47:23]

that are also implied by this, right?

[00:47:23] : [00:47:24]

So the first thing is LLMs will be able to answer

[00:47:24] : [00:47:27]

some of those questions

[00:47:27] : [00:47:28]

down to some level of abstraction.

[00:47:28] : [00:47:30]

Under the condition that they've been trained

[00:47:30] : [00:47:34]

with similar scenarios in their training set.

[00:47:34] : [00:47:37]

- They would be able to answer all of those questions.

[00:47:37] : [00:47:40]

But some of them may be hallucinated,

[00:47:40] : [00:47:43]

meaning non-factual.

[00:47:43] : [00:47:44]

- Yeah, true.

[00:47:44] : [00:47:45]

I mean they'll probably produce some answer.

[00:47:45] : [00:47:46]

Except they're not gonna be able

[00:47:46] : [00:47:47]

to really kind of produce

[00:47:47] : [00:47:48]

millisecond by millisecond muscle control

[00:47:48] : [00:47:50]

of how you stand up from your chair, right?

[00:47:50] : [00:47:53]

But down to some level of abstraction

[00:47:53] : [00:47:55]

where you can describe things by words,

[00:47:55] : [00:47:57]

they might be able to give you a plan,

[00:47:57] : [00:47:59]

but only under the condition that they've been trained

[00:47:59] : [00:48:01]

to produce those kind of plans, right?

[00:48:01] : [00:48:04]

They're not gonna be able to plan for situations

[00:48:04] : [00:48:06]

they never encountered before.

[00:48:06] : [00:48:09]

They basically are going to have to regurgitate the template

[00:48:09] : [00:48:11]

that they've been trained on.

[00:48:11] : [00:48:12]

- But where, just for the example of New York to Paris,

[00:48:12] : [00:48:15]

is it gonna start getting into trouble?

[00:48:15] : [00:48:18]

Like at which layer of abstraction

[00:48:18] : [00:48:20]

do you think you'll start?

[00:48:20] : [00:48:22]

Because like I can imagine

[00:48:22] : [00:48:23]

almost every single part of that,

[00:48:23] : [00:48:24]

an LLM will be able to answer somewhat accurately,

[00:48:24] : [00:48:27]

especially when you're talking about New York and Paris,

[00:48:27] : [00:48:29]

major cities.

[00:48:29] : [00:48:31]

- So I mean certainly an LLM

[00:48:31] : [00:48:33]

would be able to solve that problem

[00:48:33] : [00:48:34]

if you fine tune it for it.

[00:48:34] : [00:48:36]

- [Lex] Sure.

[00:48:36] : [00:48:37]

- And so I can't say that an LLM cannot do this,

[00:48:37] : [00:48:42]

it can do this if you train it for it,

[00:48:42] : [00:48:44]

there's no question,

[00:48:44] : [00:48:45]

down to a certain level

[00:48:45] : [00:48:47]

where things can be formulated in terms of words.

[00:48:47] : [00:48:51]

But like if you wanna go down

[00:48:51] : [00:48:52]

to like how do you climb down the stairs

[00:48:52] : [00:48:54]

or just stand up from your chair in terms of words,

[00:48:54] : [00:48:57]

like you can't do it.

[00:48:57] : [00:48:59]

That's one of the reasons you need

[00:48:59] : [00:49:04]

experience of the physical world,

[00:49:04] : [00:49:06]

which is much higher bandwidth

[00:49:06] : [00:49:07]

than what you can express in words,

[00:49:07] : [00:49:10]

in human language.

[00:49:10] : [00:49:11]

- So everything we've been talking about

[00:49:11] : [00:49:12]

on the joint embedding space,

[00:49:12] : [00:49:13]

is it possible that that's what we need

[00:49:13] : [00:49:16]

for like the interaction with physical reality

[00:49:16] : [00:49:18]

on the robotics front?

[00:49:18] : [00:49:20]

- And then just the LLMs are the thing that sits on top of it

[00:49:20] : [00:49:24]

for the bigger reasoning

[00:49:24] : [00:49:26]

about like the fact that I need to book a plane ticket

[00:49:26] : [00:49:30]

and I need to know how to go to the websites and so on.

[00:49:30] : [00:49:33]

- Sure.

[00:49:33] : [00:49:34]

And a lot of plans that people know about

[00:49:34] : [00:49:37]

that are relatively high level are actually learned.

[00:49:37] : [00:49:41]

Most people don't invent the plans by themselves.

[00:49:41] : [00:49:46]

We have some ability to do this, of course, obviously,

[00:49:46] : [00:49:54]

but most plans that people use

[00:49:54] : [00:49:57]

are plans that have been trained on.

[00:49:57] : [00:49:59]

Like they've seen other people use those plans

[00:49:59] : [00:50:01]

or they've been told how to do things, right?

[00:50:01] : [00:50:04]

You can't invent it. Like, take a person

[00:50:04] : [00:50:07]

who's never heard of airplanes

[00:50:07] : [00:50:09]

and tell them like, how do you go from New York to Paris?

[00:50:09] : [00:50:12]

They're probably not going to be able

[00:50:12] : [00:50:14]

to kind of deconstruct the whole plan

[00:50:14] : [00:50:16]

unless they've seen examples of that before.

[00:50:16] : [00:50:18]

So certainly LLMs are gonna be able to do this.

[00:50:18] : [00:50:21]

But then how you link this from the low level of actions,

[00:50:21] : [00:50:26]

that needs to be done with things like JEPA,

[00:50:26] : [00:50:30]

that basically lift the abstraction level

[00:50:30] : [00:50:33]

of the representation

[00:50:33] : [00:50:34]

without attempting to reconstruct

[00:50:34] : [00:50:36]

every detail of the situation.

[00:50:36] : [00:50:38]

That's what we need JEPAs for.

[00:50:38] : [00:50:39]

- I would love to sort of linger on your skepticism

[00:50:39] : [00:50:44]

around autoregressive LLMs.

[00:50:44] : [00:50:48]

So one way I would like to test that skepticism is

[00:50:48] : [00:50:53]

everything you say makes a lot of sense,

[00:50:53] : [00:50:55]

but if I apply everything you said today and in general

[00:50:55] : [00:51:02]

to like, I don't know,

[00:51:02] : [00:51:03]

10 years ago, maybe a little bit less.

[00:51:03] : [00:51:05]

No, let's say three years ago.

[00:51:05] : [00:51:07]

I wouldn't be able to predict the success of LLMs.

[00:51:07] : [00:51:12]

So does it make sense to you

[00:51:12] : [00:51:15]

that autoregressive LLMs are able to be so damn good?

[00:51:15] : [00:51:19]

- [Yann] Yes.

[00:51:19] : [00:51:21]

- Can you explain your intuition?

[00:51:21] : [00:51:24]

Because if I were to take your wisdom and intuition

[00:51:24] : [00:51:29]

at face value,

[00:51:29] : [00:51:30]

I would say there's no way autoregressive LLMs

[00:51:30] : [00:51:32]

one token at a time,

[00:51:32] : [00:51:34]

would be able to do the kind of things they're doing.

[00:51:34] : [00:51:36]

- No, there's one thing that autoregressive LLMs

[00:51:36] : [00:51:39]

or that LLMs in general, not just the autoregressive ones,

[00:51:39] : [00:51:42]

but including the BERT-style bidirectional ones,

[00:51:42] : [00:51:45]

are exploiting, and it's self-supervised learning.

[00:51:45] : [00:51:49]

And I've been a very, very strong advocate

[00:51:49] : [00:51:51]

of self-supervised learning for many years.

[00:51:51] : [00:51:53]

So those things are an incredibly impressive demonstration

[00:51:53] : [00:51:58]

that self-supervised learning actually works.

[00:51:58] : [00:52:01]

The idea that started...

[00:52:01] : [00:52:04]

It didn't start with BERT,

[00:52:04] : [00:52:07]

but it was really kind of a good demonstration with this.

[00:52:07] : [00:52:09]

So the idea that you take a piece of text, you corrupt it,

[00:52:09] : [00:52:14]

and then you train some gigantic neural net

[00:52:14] : [00:52:16]

to reconstruct the parts that are missing.

[00:52:16] : [00:52:18]
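The corrupt-and-reconstruct objective being described (BERT-style masked language modeling) boils down to: hide some tokens, train the model to reconstruct exactly the hidden ones. A minimal sketch of the data side of that objective, with a toy tokenizer and mask rate of my own choosing:

```python
import random

def corrupt(tokens, mask_rate=0.3, seed=1):
    """Replace a random subset of tokens with [MASK]; the reconstruction
    loss during training is computed only at the masked positions."""
    rng = random.Random(seed)
    corrupted, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            corrupted.append("[MASK]")
            targets[i] = tok  # the model must recover this token
        else:
            corrupted.append(tok)
    return corrupted, targets

tokens = "the cat sat on the mat".split()
corrupted, targets = corrupt(tokens)
```

The gigantic neural net then takes `corrupted` as input and is trained so its output at each masked position matches `targets`; everything about the net itself is omitted here.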

That has been an enormous...

[00:52:18] : [00:52:21]

Produced an enormous amount of benefits.

[00:52:21] : [00:52:25]

It allowed us to create systems that understand language,

[00:52:25] : [00:52:30]

systems that can translate

[00:52:30] : [00:52:32]

hundreds of languages in any direction,

[00:52:32] : [00:52:36]

systems that are multilingual.

[00:52:36] : [00:52:38]

It's a single system

[00:52:38] : [00:52:40]

that can be trained to understand hundreds of languages

[00:52:40] : [00:52:43]

and translate in any direction

[00:52:43] : [00:52:44]

and produce summaries

[00:52:44] : [00:52:48]

and then answer questions and produce text.

[00:52:48] : [00:52:51]

And then there's a special case of it,

[00:52:51] : [00:52:53]

which is the autoregressive trick

[00:52:53] : [00:52:56]

where you constrain the system

[00:52:56] : [00:52:58]

to not elaborate a representation of the text

[00:52:58] : [00:53:02]

from looking at the entire text,

[00:53:02] : [00:53:03]

but only predicting a word

[00:53:03] : [00:53:06]

from the words that have come before.

[00:53:06] : [00:53:08]

Right?

[00:53:08] : [00:53:09]

And you do this

[00:53:09] : [00:53:09]

by constraining the architecture of the network.

[00:53:09] : [00:53:11]

And that's what you can build an autoregressive LLM from.

[00:53:11] : [00:53:15]
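The "autoregressive trick" can be made concrete with a toy: predict each word only from the words that came before. Here simple bigram counts stand in for the giant constrained neural net; the tiny corpus and greedy decoding are my illustrative assumptions.

```python
from collections import Counter, defaultdict

def train_bigram(corpus):
    """Count, for each word, which words follow it. This plays the role of
    the model's next-token distribution in a real autoregressive LLM."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts

def generate(counts, start, max_len=5):
    """Autoregressive generation: each new word is predicted only from
    what has already been emitted (here, greedily from the last word)."""
    out = [start]
    while len(out) < max_len and counts[out[-1]]:
        out.append(counts[out[-1]].most_common(1)[0][0])
    return out

model = train_bigram(["the cat sat", "the cat ran", "a cat sat"])
sample = generate(model, "the")
```

A real decoder-only LLM conditions on the whole prefix through a causally-masked transformer rather than just the last word, but the one-token-at-a-time structure is the same.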

So there was a surprise many years ago

[00:53:15] : [00:53:17]

with what's called decoder-only LLMs.

[00:53:17] : [00:53:20]

So systems of this type

[00:53:20] : [00:53:22]

that are just trying to produce words from the previous ones.

[00:53:22] : [00:53:27]

And the fact that when you scale them up,

[00:53:27] : [00:53:31]

they tend to really kind of understand more about language.

[00:53:31] : [00:53:36]

When you train them on lots of data,

[00:53:36] : [00:53:38]

you make them really big.

[00:53:38] : [00:53:39]

That was kind of a surprise.

[00:53:39] : [00:53:40]

And that surprise occurred quite a while back.

[00:53:40] : [00:53:42]

Like with work from Google, Meta, OpenAI, et cetera,

[00:53:42] : [00:53:47]

going back to the GPT

[00:53:47] : [00:53:53]

kind of generative pre-trained transformers.

[00:53:53] : [00:53:56]

- You mean like GPT-2?

[00:53:56] : [00:53:58]

Like there's a certain place

[00:53:58] : [00:54:00]

where you start to realize

[00:54:00] : [00:54:01]

scaling might actually keep giving us an emergent benefit.

[00:54:01] : [00:54:06]

- Yeah, I mean there was work from various places,

[00:54:06] : [00:54:09]

but if you want to kind of place it in the GPT timeline,

[00:54:09] : [00:54:14]

that would be around GPT-2, yeah.

[00:54:14] : [00:54:18]

- Well, 'cause you said it,

[00:54:18] : [00:54:20]

you're so charismatic and you said so many words,

[00:54:20] : [00:54:23]

but self-supervised learning, yes.

[00:54:23] : [00:54:25]

But again, the same intuition you're applying

[00:54:25] : [00:54:28]

to saying that autoregressive LLMs

[00:54:28] : [00:54:31]

cannot have a deep understanding of the world,

[00:54:31] : [00:54:35]

if we just apply that same intuition,

[00:54:35] : [00:54:38]

does it make sense to you

[00:54:38] : [00:54:39]

that they're able to form enough

[00:54:39] : [00:54:42]

of a representation in the world

[00:54:42] : [00:54:43]

to be damn convincing,

[00:54:43] : [00:54:45]

essentially passing the original Turing test

[00:54:45] : [00:54:49]

with flying colors?

[00:54:49] : [00:54:50]

- Well, we're fooled by their fluency, right?

[00:54:50] : [00:54:53]

We just assume that if a system is fluent

[00:54:53] : [00:54:56]

in manipulating language,

[00:54:56] : [00:54:57]

then it has all the characteristics of human intelligence.

[00:54:57] : [00:55:00]

But that impression is false.

[00:55:00] : [00:55:04]

We're really fooled by it.

[00:55:04] : [00:55:06]

- Well, what do you think Alan Turing would say?

[00:55:06] : [00:55:08]

Without understanding anything,

[00:55:08] : [00:55:10]

just hanging out with it-

[00:55:10] : [00:55:11]

- Alan Turing would decide

[00:55:11] : [00:55:12]

that a Turing test is a really bad test.

[00:55:12] : [00:55:14]

(Lex chuckles)

[00:55:14] : [00:55:15]

Okay.

[00:55:15] : [00:55:16]

This is what the AI community has decided many years ago

[00:55:16] : [00:55:18]

that the Turing test was a really bad test of intelligence.

[00:55:18] : [00:55:22]

- What would Hans Moravec say

[00:55:22] : [00:55:23]

about the large language models?

[00:55:23] : [00:55:25]

- Hans Moravec would say

[00:55:25] : [00:55:26]

Moravec's paradox still applies.

[00:55:26] : [00:55:30]

- [Lex] Okay.

[00:55:30] : [00:55:31]

- Okay?

[00:55:31] : [00:55:31]

Okay, we can pass-

[00:55:31] : [00:55:32]

- You don't think he would be really impressed.

[00:55:32] : [00:55:34]

- No, of course everybody would be impressed.

[00:55:34] : [00:55:35]

(laughs)

[00:55:35] : [00:55:36]

But it is not a question of being impressed or not,

[00:55:36] : [00:55:39]

it is a question of knowing

[00:55:39] : [00:55:41]

what the limit of those systems can do.

[00:55:41] : [00:55:44]

Again, they are impressive.

[00:55:44] : [00:55:45]

They can do a lot of useful things.

[00:55:45] : [00:55:47]

There's a whole industry that is being built around them.

[00:55:47] : [00:55:49]

They're gonna make progress,

[00:55:49] : [00:55:51]

but there is a lot of things they cannot do.

[00:55:51] : [00:55:53]

And we have to realize what they cannot do

[00:55:53] : [00:55:55]

and then figure out how we get there.

[00:55:55] : [00:55:59]

And I'm not saying this...

[00:55:59] : [00:56:02]

I'm saying this from basically 10 years of research

[00:56:02] : [00:56:07]

on the idea of self-supervised learning,

[00:56:07] : [00:56:11]

actually that's going back more than 10 years,

[00:56:11] : [00:56:13]

but the idea of self-supervised learning.

[00:56:13] : [00:56:15]

So basically capturing the internal structure

[00:56:15] : [00:56:17]

of a set of inputs

[00:56:17] : [00:56:21]

without training the system for any particular task, right?

[00:56:21] : [00:56:23]

Learning representations.

[00:56:23] : [00:56:25]

The conference I co-founded 14 years ago

[00:56:25] : [00:56:28]

is called International Conference

[00:56:28] : [00:56:30]

on Learning Representations,

[00:56:30] : [00:56:31]

that's the entire issue that deep learning is dealing with.

[00:56:31] : [00:56:34]

Right?

[00:56:34] : [00:56:35]

And it's been my obsession for almost 40 years now.

[00:56:35] : [00:56:38]

So learning representations is really the thing.

[00:56:38] : [00:56:42]

For the longest time

[00:56:42] : [00:56:43]

we could only do this with supervised learning.

[00:56:43] : [00:56:45]

And then we started working on

[00:56:45] : [00:56:47]

what we used to call unsupervised learning

[00:56:47] : [00:56:50]

and sort of revived the idea of unsupervised learning

[00:56:50] : [00:56:55]

in the early 2000s with Yoshua Bengio and Geoff Hinton.

[00:56:55] : [00:56:59]

Then discovered that supervised learning

[00:56:59] : [00:57:00]

actually works pretty well

[00:57:00] : [00:57:02]

if you can collect enough data.

[00:57:02] : [00:57:03]

And so the whole idea of unsupervised, self-supervised learning

[00:57:03] : [00:57:07]

took a backseat for a bit

[00:57:07] : [00:57:10]

and then I kind of tried to revive it in a big way,

[00:57:10] : [00:57:14]

starting in 2014 basically when we started FAIR,

[00:57:14] : [00:57:20]

and really pushing for like finding new methods

[00:57:20] : [00:57:24]

to do self-supervised learning,

[00:57:24] : [00:57:26]

both for text and for images and for video and audio.

[00:57:26] : [00:57:29]

And some of that work has been incredibly successful.

[00:57:29] : [00:57:32]

I mean, the reason why we have

[00:57:32] : [00:57:34]

multilingual translation systems,

[00:57:34] : [00:57:37]

things that do

[00:57:37] : [00:57:38]

content moderation on Meta, for example, on Facebook

[00:57:38] : [00:57:41]

that are multilingual,

[00:57:41] : [00:57:42]

that understand whether a piece of text

[00:57:42] : [00:57:44]

is hate speech or not, or something

[00:57:44] : [00:57:46]

is due to that progress

[00:57:46] : [00:57:47]

using self-supervised learning for NLP,

[00:57:47] : [00:57:50]

combining this with transformer architectures

[00:57:50] : [00:57:52]

and blah blah blah.

[00:57:52] : [00:57:53]

But that's the big success of self-supervised learning.

[00:57:53] : [00:57:55]

We had similar success in speech recognition,

[00:57:55] : [00:57:59]

a system called Wav2Vec,

[00:57:59] : [00:58:00]

which is also a joint embedding architecture by the way,

[00:58:00] : [00:58:02]

trained with contrastive learning.

[00:58:02] : [00:58:03]
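Since Wav2Vec is described as a joint embedding architecture trained with contrastive learning, it may help to see what a contrastive objective looks like. This is a generic InfoNCE-style loss over toy 2-D embeddings of my own invention, not Wav2Vec's actual loss: the anchor's similarity to its positive is pushed above its similarity to negatives via a softmax.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def info_nce(anchor, positive, negatives, temperature=0.1):
    """Contrastive loss: -log softmax probability of the positive among
    all candidates. Low when anchor is close to the positive and far
    from the negatives in embedding space."""
    scores = [dot(anchor, positive)] + [dot(anchor, n) for n in negatives]
    scores = [s / temperature for s in scores]
    m = max(scores)  # subtract max for numerical stability
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return -(scores[0] - log_z)

# Anchor aligned with its positive -> small loss;
# anchor aligned with a negative instead -> large loss.
good = info_nce([1.0, 0.0], [0.9, 0.1], [[-1.0, 0.0], [0.0, 1.0]])
bad = info_nce([1.0, 0.0], [-0.9, 0.1], [[1.0, 0.0], [0.0, 1.0]])
```

In a real system the embeddings come from trained encoders (e.g. of two views of the same audio segment), and gradients of this loss shape those encoders.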

And that system also can produce

[00:58:03] : [00:58:07]

speech recognition systems that are multilingual

[00:58:07] : [00:58:10]

with mostly unlabeled data

[00:58:10] : [00:58:13]

and only need a few minutes of labeled data

[00:58:13] : [00:58:15]

to actually do speech recognition.

[00:58:15] : [00:58:16]

That's amazing.

[00:58:16] : [00:58:18]

We have systems now based on those combinations of ideas

[00:58:18] : [00:58:22]

that can do real time translation

[00:58:22] : [00:58:24]

of hundreds of languages into each other,

[00:58:24] : [00:58:26]

speech to speech.

[00:58:26] : [00:58:28]

- Speech to speech,

[00:58:28] : [00:58:29]

even including, which is fascinating,

[00:58:29] : [00:58:31]

languages that don't have written forms-

[00:58:31] : [00:58:33]

- That's right.- They're spoken only.

[00:58:33] : [00:58:35]

- That's right.

[00:58:35] : [00:58:36]

We don't go through text,

[00:58:36] : [00:58:37]

it goes directly from speech to speech

[00:58:37] : [00:58:38]

using an internal representation

[00:58:38] : [00:58:40]

of kinda speech units that are discrete.

[00:58:40] : [00:58:41]

But it's called Textless NLP.

[00:58:41] : [00:58:44]

We used to call it this way.

[00:58:44] : [00:58:45]

But yeah.

[00:58:45] : [00:58:47]

I mean incredible success there.

[00:58:47] : [00:58:49]

And then for 10 years we tried to apply this idea

[00:58:49] : [00:58:53]

to learning representations of images

[00:58:53] : [00:58:55]

by training a system to predict videos,

[00:58:55] : [00:58:57]

learning intuitive physics

[00:58:57] : [00:58:58]

by training a system to predict

[00:58:58] : [00:59:00]

what's gonna happen in the video.

[00:59:00] : [00:59:02]

And tried and tried and failed and failed

[00:59:02] : [00:59:05]

with generative models,

[00:59:05] : [00:59:06]

with models that predict pixels.

[00:59:06] : [00:59:08]

We could not get them to learn

[00:59:08] : [00:59:10]

good representations of images,

[00:59:10] : [00:59:13]

we could not get them to learn good representations of videos.

[00:59:13] : [00:59:16]

And we tried many times,

[00:59:16] : [00:59:17]

we published lots of papers on it.

[00:59:17] : [00:59:19]

They kind of sort of worked, but not really great.

[00:59:19] : [00:59:22]

It started working,

[00:59:22] : [00:59:24]

when we abandoned this idea of predicting every pixel

[00:59:24] : [00:59:27]

and basically just doing the joint embedding and predicting

[00:59:27] : [00:59:30]

in representation space.

[00:59:30] : [00:59:32]

That works.

[00:59:32] : [00:59:33]
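The contrast between predicting pixels and predicting in representation space can be shown with a toy. Everything here is an illustrative construction of mine, not Meta's JEPA: a hand-written "encoder" keeps only coarse features of a frame, and a toy world model predicts the next frame's representation rather than its pixels.

```python
def encode(frame):
    """Toy encoder: keep only abstract features of a frame (its mean level
    and its range), discarding unpredictable pixel-level detail."""
    return (sum(frame) / len(frame), max(frame) - min(frame))

def predictor(rep, action):
    """Toy world model acting in representation space:
    the action simply shifts the mean; the spread is unchanged."""
    mean, spread = rep
    return (mean + action, spread)

def embedding_loss(pred, target):
    # Prediction error measured between representations, not pixels.
    return sum((p - t) ** 2 for p, t in zip(pred, target))

frame_t = [0.0, 1.0, 2.0, 3.0]
frame_t1 = [1.0, 2.0, 3.0, 4.0]  # next frame: everything shifted by the action
pred = predictor(encode(frame_t), action=1.0)
loss = embedding_loss(pred, encode(frame_t1))
```

In this contrived example the prediction in representation space is exact even though no pixel is ever predicted; the point, consistent with the conversation, is that abstract representations can be predictable where raw pixels are not. (A real JEPA learns the encoder and predictor jointly, with mechanisms to prevent representation collapse that this sketch omits.)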

So there's ample evidence

[00:59:33] : [00:59:36]

that we're not gonna be able to learn good representations

[00:59:36] : [00:59:40]

of the real world

[00:59:40] : [00:59:42]

using generative models.

[00:59:42] : [00:59:43]

So I'm telling people,

[00:59:43] : [00:59:44]

everybody's talking about generative AI.

[00:59:44] : [00:59:46]

If you're really interestedin human level AI,

[00:59:46] : [00:59:48]

abandon the idea of generative AI.

[00:59:48] : [00:59:50]

(Lex laughs)

[00:59:50] : [00:59:51]

- Okay.

[00:59:51] : [00:59:52]

But you really think it's possible

[00:59:52] : [00:59:54]

to get far with joint embedding representation?

[00:59:54] : [00:59:57]

So like there's common sense reasoning

[00:59:57] : [01:00:01]

and then there's high level reasoning.

[01:00:01] : [01:00:05]

Like I feel like those are two...

[01:00:05] : [01:00:08]

The kind of reasoning that LLMs are able to do.

[01:00:08] : [01:00:11]

Okay, let me not use the word reasoning,

[01:00:11] : [01:00:13]

but the kind of stuff that LLMs are able to do

[01:00:13] : [01:00:16]

seems fundamentally different

[01:00:16] : [01:00:17]

than the common sense reasoning we use

[01:00:17] : [01:00:19]

to navigate the world.

[01:00:19] : [01:00:20]

- [Yann] Yeah.

[01:00:20] : [01:00:21]

- It seems like we're gonna need both-

[01:00:21] : [01:00:23]

- Sure.- Would you be able to get,

[01:00:23] : [01:00:25]

with the joint embedding which is a JEPA type of approach,

[01:00:25] : [01:00:27]

looking at video, would you be able to learn,

[01:00:27] : [01:00:30]

let's see,

[01:00:30] : [01:00:33]

well, how to get from New York to Paris,

[01:00:33] : [01:00:35]

or how to understand the state of politics in the world?

[01:00:35] : [01:00:40]

(both laugh)

[01:00:40] : [01:00:43]

Right?

[01:00:43] : [01:00:44]

These are things where various humans

[01:00:44] : [01:00:46]

generate a lot of language and opinions on,

[01:00:46] : [01:00:49]

in the space of language,

[01:00:49] : [01:00:50]

but don't visually represent that

[01:00:50] : [01:00:52]

in any clearly compressible way.

[01:00:52] : [01:00:56]

- Right.

[01:00:56] : [01:00:56]

Well, there's a lot of situations

[01:00:56] : [01:00:58]

that might be difficult

[01:00:58] : [01:01:00]

for a purely language-based system to know.

[01:01:00] : [01:01:04]

Like, okay, you can probably learn from reading texts,

[01:01:04] : [01:01:08]

the entirety of the publicly available text in the world

[01:01:08] : [01:01:11]

that I cannot get from New York to Paris

[01:01:11] : [01:01:13]

by snapping my fingers.

[01:01:13] : [01:01:15]

That's not gonna work, right?

[01:01:15] : [01:01:16]

- [Lex] Yes.

[01:01:16] : [01:01:17]

- But there's probably sort of more complex

[01:01:17] : [01:01:20]

scenarios of this type

[01:01:20] : [01:01:22]

which an LLM may never have encountered

[01:01:22] : [01:01:25]

and may not be able to determine

[01:01:25] : [01:01:27]

whether it's possible or not.

[01:01:27] : [01:01:29]

So that link from the low level to the high level...

[01:01:29] : [01:01:34]

The thing is that the high level that language expresses

[01:01:34] : [01:01:38]

is based on the common experience of the low level,

[01:01:38] : [01:01:43]

which LLMs currently do not have.

[01:01:43] : [01:01:45]

When we talk to each other,

[01:01:45] : [01:01:47]

we know we have a common experience of the world.

[01:01:47] : [01:01:50]

Like a lot of it is similar.

[01:01:50] : [01:01:54]

And LLMs don't have that.

[01:01:54] : [01:01:59]

- But see, there it's present.

[01:01:59] : [01:02:01]

You and I have a common experience of the world

[01:02:01] : [01:02:02]

in terms of the physics of how gravity works

[01:02:02] : [01:02:05]

and stuff like this.

[01:02:05] : [01:02:06]

And that common knowledge of the world,

[01:02:06] : [01:02:11]

I feel like is there in the language.

[01:02:11] : [01:02:15]

We don't explicitly express it,

[01:02:15] : [01:02:17]

but if you have a huge amount of text,

[01:02:17] : [01:02:21]

you're going to get this stuff that's between the lines.

[01:02:21] : [01:02:24]

In order to form a consistent world model,

[01:02:24] : [01:02:28]

you're going to have to understand how gravity works,

[01:02:28] : [01:02:31]

even if you don't have an explicit explanation of gravity.

[01:02:31] : [01:02:34]

So even though, in the case of gravity,

[01:02:34] : [01:02:37]

there is an explicit explanation.

[01:02:37] : [01:02:38]

There's gravity in Wikipedia.

[01:02:38] : [01:02:40]

But like the stuff that we think of

[01:02:40] : [01:02:44]

as common sense reasoning,

[01:02:44] : [01:02:46]

I feel like to generate language correctly,

[01:02:46] : [01:02:49]

you're going to have to figure that out.

[01:02:49] : [01:02:51]

Now, you could say as you have,

[01:02:51] : [01:02:53]

there's not enough text-- Well, I agree.

[01:02:53] : [01:02:54]

- Sorry.

[01:02:54] : [01:02:55]

Okay, yeah.

[01:02:55] : [01:02:56]

(laughs)

[01:02:56] : [01:02:57]

You don't think so?

[01:02:57] : [01:02:58]

- No, I agree with what you just said,

[01:02:58] : [01:02:59]

which is that to be able to do high level common sense...

[01:02:59] : [01:03:03]

To have high level common sense,

[01:03:03] : [01:03:04]

you need to have the low level common sense

[01:03:04] : [01:03:06]

to build on top of.

[01:03:06] : [01:03:08]

- [Lex] Yeah.

[01:03:08] : [01:03:09]

But that's not there.

[01:03:09] : [01:03:10]

- That's not there in LLMs.

[01:03:10] : [01:03:11]

LLMs are purely trained from text.

[01:03:11] : [01:03:13]

So then the other statement you made,

[01:03:13] : [01:03:15]

I would not agree

[01:03:15] : [01:03:16]

with the fact that implicit in all languages in the world

[01:03:16] : [01:03:20]

is the underlying reality.

[01:03:20] : [01:03:22]

There's a lot about underlying reality

[01:03:22] : [01:03:24]

which is not expressed in language.

[01:03:24] : [01:03:26]

- Is that obvious to you?

[01:03:26] : [01:03:27]

- Yeah, totally.

[01:03:27] : [01:03:29]

- So like all the conversations we have...

[01:03:29] : [01:03:34]

Okay, there's the dark web,

[01:03:34] : [01:03:36]

meaning whatever,

[01:03:36] : [01:03:37]

the private conversations like DMs and stuff like this,

[01:03:37] : [01:03:41]

which is much, much larger probably than what's available,

[01:03:41] : [01:03:45]

what LLMs are trained on.

[01:03:45] : [01:03:46]

- You don't need to communicate

[01:03:46] : [01:03:48]

the stuff that is common.

[01:03:48] : [01:03:50]

- But the humor, all of it.

[01:03:50] : [01:03:51]

No, you do.

[01:03:51] : [01:03:52]

You don't need to, but it comes through.

[01:03:52] : [01:03:54]

Like if I accidentally knock this over,

[01:03:54] : [01:03:58]

you'll probably make fun of me.

[01:03:58] : [01:03:59]

And in the content of you making fun of me

[01:03:59] : [01:04:02]

will be an explanation of the fact that cups fall

[01:04:02] : [01:04:07]

and then gravity works in this way.

[01:04:07] : [01:04:09]

And then you'll have some very vague information

[01:04:09] : [01:04:12]

about what kind of things explode when they hit the ground.

[01:04:12] : [01:04:16]

And then maybe you'll make a joke about entropy

[01:04:16] : [01:04:19]

or something like this

[01:04:19] : [01:04:20]

and we will never be able to reconstruct this again.

[01:04:20] : [01:04:22]

Like, okay, you'll make a little joke like this

[01:04:22] : [01:04:24]

and there'll be a trillion other jokes.

[01:04:24] : [01:04:27]

And from the jokes,

[01:04:27] : [01:04:28]

you can piece together the fact that gravity works

[01:04:28] : [01:04:30]

and mugs can break and all this kind of stuff,

[01:04:30] : [01:04:32]

you don't need to see...

[01:04:32] : [01:04:34]

It'll be very inefficient.

[01:04:34] : [01:04:36]

It's easier for like

[01:04:36] : [01:04:38]

to not knock the thing over. (laughing)

[01:04:38] : [01:04:41]

- [Yann] Yeah.

[01:04:41] : [01:04:42]

- But I feel like it would be there

[01:04:42] : [01:04:44]

if you have enough of that data.

[01:04:44] : [01:04:46]

- I just think that most of the information of this type

[01:04:46] : [01:04:50]

that we have accumulated when we were babies

[01:04:50] : [01:04:53]

is just not present in text,

[01:04:53] : [01:04:58]

in any description, essentially.

[01:04:58] : [01:04:59]

And the sensory data is a much richer source

[01:04:59] : [01:05:02]

for getting that kind of understanding.

[01:05:02] : [01:05:04]

I mean, that's the 16,000 hours

[01:05:04] : [01:05:06]

of wake time of a 4-year-old.

[01:05:06] : [01:05:09]

And that's 10 to the 15 bytes, going through vision.

[01:05:09] : [01:05:12]

Just vision, right?

[01:05:12] : [01:05:13]

There is a similar bandwidth of touch

[01:05:13] : [01:05:17]

and a little less through audio.

[01:05:17] : [01:05:20]

And then text doesn't...

[01:05:20] : [01:05:21]

Language doesn't come in until like a year in life.

[01:05:21] : [01:05:26]

And by the time you are nine years old,

[01:05:26] : [01:05:28]

you've learned about gravity,

[01:05:28] : [01:05:30]

you know about inertia,

[01:05:30] : [01:05:31]

you know about gravity,

[01:05:31] : [01:05:32]

you know there's stability,

[01:05:32] : [01:05:33]

you know about the distinction

[01:05:33] : [01:05:36]

between animate and inanimate objects.

[01:05:36] : [01:05:38]

By 18 months,

[01:05:38] : [01:05:39]

you know about like why people want to do things

[01:05:39] : [01:05:42]

and you help them if they can't.

[01:05:42] : [01:05:45]

I mean there's a lot of things that you learn

[01:05:45] : [01:05:47]

mostly by observation,

[01:05:47] : [01:05:49]

really not even through interaction.

[01:05:49] : [01:05:52]

In the first few months of life,

[01:05:52] : [01:05:53]

babies don't really have any influence on the world.

[01:05:53] : [01:05:55]

They can only observe, right?

[01:05:55] : [01:05:58]

And you accumulate like a gigantic amount of knowledge

[01:05:58] : [01:06:02]

just from that.

[01:06:02] : [01:06:03]

So that's what we're missing from current AI systems.

[01:06:03] : [01:06:06]

- I think in one of your slides you have this nice plot

[01:06:06] : [01:06:10]

that is one of the ways you show that LLMs are limited.

[01:06:10] : [01:06:13]

I wonder if you could talk about hallucinations

[01:06:13] : [01:06:16]

from your perspectives.

[01:06:16] : [01:06:17]

Why hallucinations happen from large language models,

[01:06:17] : [01:06:23]

and to what degree is that a fundamental flaw

[01:06:23] : [01:06:27]

of large language models.

[01:06:27] : [01:06:29]

- Right.

[01:06:29] : [01:06:30]

So because of the autoregressive prediction,

[01:06:30] : [01:06:34]

every time an LLM produces a token or a word,

[01:06:34] : [01:06:37]

there is some level of probability for that word

[01:06:37] : [01:06:40]

to take you out of the set of reasonable answers.

[01:06:40] : [01:06:44]

And if you assume,

[01:06:44] : [01:06:46]

which is a very strong assumption,

[01:06:46] : [01:06:48]

that the probability of such errors,

[01:06:48] : [01:06:50]

that those errors are independent

[01:06:50] : [01:06:55]

across a sequence of tokens being produced.

[01:06:55] : [01:06:59]

What that means is that every time you produce a token,

[01:06:59] : [01:07:02]

the probability

[01:07:02] : [01:07:03]

that you stay within the set of correct answers decreases

[01:07:03] : [01:07:06]

and it decreases exponentially.

[01:07:06] : [01:07:08]
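The compounding-error argument can be sketched numerically (under the independence assumption LeCun himself flags as strong; the 1% per-token error rate is an illustrative placeholder, not a measured figure):

```python
# Per-token errors compound multiplicatively under the independence
# assumption: an n-token answer stays in the set of reasonable answers
# with probability (1 - e)**n, which decays exponentially in n.

def p_stays_correct(per_token_error: float, n_tokens: int) -> float:
    """Probability the whole sequence stays within the correct set."""
    return (1.0 - per_token_error) ** n_tokens

# Even a 1% per-token error rate (a made-up figure) compounds quickly:
for n in (10, 100, 1000):
    print(n, p_stays_correct(0.01, n))
```

At 100 tokens the sequence stays correct only about a third of the time, and at 1000 tokens essentially never, which is the exponential drift being described.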

- So there's a strong, like you said, assumption there

[01:07:08] : [01:07:11]

that if there's a non-zero probability of making a mistake,

[01:07:11] : [01:07:14]

which there appears to be,

[01:07:14] : [01:07:16]

then there's going to be a kind of drift.

[01:07:16] : [01:07:18]

- Yeah.

[01:07:18] : [01:07:19]

And that drift is exponential.

[01:07:19] : [01:07:21]

It's like errors accumulate, right?

[01:07:21] : [01:07:23]

So the probability that an answer would be nonsensical

[01:07:23] : [01:07:27]

increases exponentially with the number of tokens.

[01:07:27] : [01:07:31]

- Is that obvious to you by the way?

[01:07:31] : [01:07:33]

Well, so mathematically speaking maybe,

[01:07:33] : [01:07:36]

but like isn't there a kind of gravitational pull

[01:07:36] : [01:07:39]

towards the truth?

[01:07:39] : [01:07:41]

Because on average, hopefully,

[01:07:41] : [01:07:44]

the truth is well represented in the training set.

[01:07:44] : [01:07:48]

- No, it's basically a struggle

[01:07:48] : [01:07:50]

against the curse of dimensionality.

[01:07:50] : [01:07:55]

So the way you can correct for this

[01:07:55] : [01:07:57]

is that you fine tune the system

[01:07:57] : [01:07:58]

by having it produce answers

[01:07:58] : [01:08:01]

for all kinds of questions that people might come up with.

[01:08:01] : [01:08:04]

And people are people,

[01:08:04] : [01:08:06]

so a lot of the questions that they have

[01:08:06] : [01:08:08]

are very similar to each other.

[01:08:08] : [01:08:10]

So you can probably cover,

[01:08:10] : [01:08:11]

you know, 80% or whatever of questions that people will ask

[01:08:11] : [01:08:16]

by collecting data.

[01:08:16] : [01:08:20]

And then you fine tune the system

[01:08:20] : [01:08:23]

to produce good answers for all of those things.

[01:08:23] : [01:08:25]

And it's probably gonna be able to learn that

[01:08:25] : [01:08:27]

because it's got a lot of capacity to learn.

[01:08:27] : [01:08:31]

But then there is the enormous set of prompts

[01:08:31] : [01:08:36]

that you have not covered during training.

[01:08:36] : [01:08:39]

And that set is enormous.

[01:08:39] : [01:08:41]

Like within the set of all possible prompts,

[01:08:41] : [01:08:43]

the proportion of prompts that have been used for training

[01:08:43] : [01:08:47]

is absolutely tiny.

[01:08:47] : [01:08:48]

It's a tiny, tiny, tiny subset of all possible prompts.

[01:08:48] : [01:08:53]

And so the system will behave properly

[01:08:53] : [01:08:56]

on the prompts that it's been either trained,

[01:08:56] : [01:08:58]

pre-trained or fine tuned.

[01:08:58] : [01:08:59]

But then there is an entire space of things

[01:08:59] : [01:09:04]

that it cannot possibly have been trained on

[01:09:04] : [01:09:06]

because it's just the number is gigantic.

[01:09:06] : [01:09:09]

So whatever training the system

[01:09:09] : [01:09:13]

has been subjected to, to produce appropriate answers,

[01:09:13] : [01:09:18]

you can break it by finding out a prompt

[01:09:18] : [01:09:20]

that will be outside of the set of prompts

[01:09:20] : [01:09:24]

it's been trained on

[01:09:24] : [01:09:25]

or things that are similar,

[01:09:25] : [01:09:27]

and then it will just spew complete nonsense.

[01:09:27] : [01:09:29]

- When you say prompt,

[01:09:29] : [01:09:31]

do you mean that exact prompt

[01:09:31] : [01:09:33]

or do you mean a prompt that's like,

[01:09:33] : [01:09:36]

in many parts very different than...

[01:09:36] : [01:09:38]

Is it that easy to ask a question

[01:09:38] : [01:09:42]

or to say a thing that hasn't been said before

[01:09:42] : [01:09:45]

on the internet?

[01:09:45] : [01:09:46]

- I mean, people have come up with things

[01:09:46] : [01:09:48]

where like you put essentially

[01:09:48] : [01:09:51]

a random sequence of characters in a prompt

[01:09:51] : [01:09:53]

and that's enough to kind of throw the system into a mode

[01:09:53] : [01:09:57]

where it's gonna answer something completely different

[01:09:57] : [01:10:00]

than it would have answered without this.

[01:10:00] : [01:10:03]

So that's a way to jailbreak the system, basically.

[01:10:03] : [01:10:05]

Go outside of its conditioning, right?

[01:10:05] : [01:10:09]

- So that's a very clear demonstration of it.

[01:10:09] : [01:10:11]

But of course, that goes outside

[01:10:11] : [01:10:16]

of what it's designed to do, right?

[01:10:16] : [01:10:19]

If you actually stitch together

[01:10:19] : [01:10:20]

reasonably grammatical sentences,

[01:10:20] : [01:10:22]

is it that easy to break it?

[01:10:22] : [01:10:26]

- Yeah.

[01:10:26] : [01:10:27]

Some people have done things like

[01:10:27] : [01:10:29]

you write a sentence in English

[01:10:29] : [01:10:31]

or you ask a question in English

[01:10:31] : [01:10:33]

and it produces a perfectly fine answer.

[01:10:33] : [01:10:36]

And then you just substitute a few words

[01:10:36] : [01:10:38]

by the same word in another language,

[01:10:38] : [01:10:42]

and all of a sudden the answer is complete nonsense.

[01:10:42] : [01:10:44]

- Yeah.

[01:10:44] : [01:10:45]

So I guess what I'm saying is like,

[01:10:45] : [01:10:46]

which fraction of prompts that humans are likely to generate

[01:10:46] : [01:10:51]

are going to break the system?

[01:10:51] : [01:10:54]

- So the problem is that there is a long tail.

[01:10:54] : [01:10:57]

- [Lex] Yes.

[01:10:57] : [01:10:58]

- This is an issue that a lot of people have realized

[01:10:58] : [01:11:02]

in social networks and stuff like that,

[01:11:02] : [01:11:04]

which is there's a very, very long tail

[01:11:04] : [01:11:05]

of things that people will ask.

[01:11:05] : [01:11:07]

And you can fine tune the system

[01:11:07] : [01:11:09]

for the 80% or whatever

[01:11:09] : [01:11:12]

of the things that most people will ask.

[01:11:12] : [01:11:16]

And then this long tail is so large

[01:11:16] : [01:11:18]

that you're not gonna be able to fine tune the system

[01:11:18] : [01:11:20]

for all the conditions.

[01:11:20] : [01:11:21]

And in the end,

[01:11:21] : [01:11:22]

the system ends up being

[01:11:22] : [01:11:23]

kind of a giant lookup table, right? (laughing)

[01:11:23] : [01:11:25]

Essentially.

[01:11:25] : [01:11:26]

Which is not really what you want.

[01:11:26] : [01:11:27]

You want systems that can reason,

[01:11:27] : [01:11:29]

certainly that can plan.

[01:11:29] : [01:11:30]

So the type of reasoning that takes place in an LLM

[01:11:30] : [01:11:33]

is very, very primitive.

[01:11:33] : [01:11:35]

And the reason you can tell it's primitive

[01:11:35] : [01:11:37]

is because the amount of computation

[01:11:37] : [01:11:39]

that is spent per token produced is constant.

[01:11:39] : [01:11:43]

So if you ask a question

[01:11:43] : [01:11:45]

and that question has an answer in a given number of tokens,

[01:11:45] : [01:11:50]

the amount of computation devoted to computing that answer

[01:11:50] : [01:11:52]

can be exactly estimated.

[01:11:52] : [01:11:54]

It's the size of the prediction network

[01:11:54] : [01:11:59]

with its 36 layers or 92 layers or whatever it is,

[01:11:59] : [01:12:03]

multiplied by the number of tokens.

[01:12:03] : [01:12:05]

That's it.

[01:12:05] : [01:12:06]

And so essentially,

[01:12:06] : [01:12:08]

it doesn't matter if the question being asked

[01:12:08] : [01:12:10]

is simple to answer, complicated to answer,

[01:12:10] : [01:12:16]

impossible to answer

[01:12:16] : [01:12:17]

because it's undecidable, well, there's something.

[01:12:17] : [01:12:20]

The amount of computation

[01:12:20] : [01:12:22]

the system will be able to devote to the answer is constant

[01:12:22] : [01:12:25]

or is proportional to the number of tokens produced

[01:12:25] : [01:12:27]

in the answer, right?

[01:12:27] : [01:12:29]
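The fixed-compute-per-token point can be made concrete with a bit of arithmetic (the layer count and unit cost below are illustrative placeholders, not any real model's figures):

```python
# An autoregressive LLM spends one fixed forward pass per emitted token, so
# total decode compute scales with answer length, not question difficulty.

def decode_compute(n_layers: int, cost_per_layer: float, n_tokens: int) -> float:
    """Total cost to emit n_tokens, one at a time, through n_layers."""
    return n_layers * cost_per_layer * n_tokens

# Two questions of very different difficulty, same 20-token answer length:
easy_question = decode_compute(n_layers=92, cost_per_layer=1.0, n_tokens=20)
hard_question = decode_compute(n_layers=92, cost_per_layer=1.0, n_tokens=20)
assert easy_question == hard_question  # identical compute regardless of difficulty
```

Nothing in the formula depends on how hard the question is, which is the contrast being drawn with human reasoning.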

This is not the way we work,

[01:12:29] : [01:12:30]

the way we reason is that

[01:12:30] : [01:12:33]

when we are faced with a complex problem

[01:12:33] : [01:12:37]

or a complex question,

[01:12:37] : [01:12:38]

we spend more time trying to solve it and answer it, right?

[01:12:38] : [01:12:42]

Because it's more difficult.

[01:12:42] : [01:12:43]

- There's a prediction element,

[01:12:43] : [01:12:45]

there's an iterative element

[01:12:45] : [01:12:47]

where you're like adjusting your understanding of a thing

[01:12:47] : [01:12:52]

by going over and over and over.

[01:12:52] : [01:12:54]

There's a hierarchical element and so on.

[01:12:54] : [01:12:56]

- Does this mean it's a fundamental flaw of LLMs-

[01:12:56] : [01:12:59]

- [Yann] Yeah.

[01:12:59] : [01:13:00]

- Or does it mean that... (laughs)

[01:13:00] : [01:13:01]

There's more part to that question?

[01:13:01] : [01:13:03]

(laughs)

[01:13:03] : [01:13:04]

Now you're just behaving like an LLM.

[01:13:04] : [01:13:06]

(laughs)

[01:13:06] : [01:13:07]

Immediately answering.

[01:13:07] : [01:13:08]

No, that it's just the low level world model

[01:13:08] : [01:13:13]

on top of which we can then build

[01:13:13] : [01:13:17]

some of these kinds of mechanisms,

[01:13:17] : [01:13:18]

like you said, persistent long-term memory or reasoning,

[01:13:18] : [01:13:23]

so on.

[01:13:23] : [01:13:25]

But we need that world model that comes from language.

[01:13:25] : [01:13:29]

Maybe it is not so difficult

[01:13:29] : [01:13:30]

to build this kind of reasoning system

[01:13:30] : [01:13:33]

on top of a well constructed world model.

[01:13:33] : [01:13:36]

- Okay.

[01:13:36] : [01:13:37]

Whether it's difficult or not,

[01:13:37] : [01:13:38]

the near future will say,

[01:13:38] : [01:13:40]

because a lot of people are working on reasoning

[01:13:40] : [01:13:43]

and planning abilities for dialogue systems.

[01:13:43] : [01:13:46]

- I mean, even if we restrict ourselves to language,

[01:13:46] : [01:13:50]

just having the ability

[01:13:50] : [01:13:53]

to plan your answer before you answer,

[01:13:53] : [01:13:55]

in terms that are not necessarily linked

[01:13:55] : [01:13:59]

with the language you're gonna use to produce the answer.

[01:13:59] : [01:14:02]

Right?

[01:14:02] : [01:14:02]

So this idea of this mental model

[01:14:02] : [01:14:04]

that allows you to plan what you're gonna say

[01:14:04] : [01:14:06]

before you say it.

[01:14:06] : [01:14:06]

That is very important.

[01:14:06] : [01:14:11]

I think there's going to be a lot of systems

[01:14:11] : [01:14:13]

over the next few years

[01:14:13] : [01:14:14]

that are going to have this capability,

[01:14:14] : [01:14:17]

but the blueprint of those systems

[01:14:17] : [01:14:19]

will be extremely different from autoregressive LLMs.

[01:14:19] : [01:14:23]

So it's the same difference

[01:14:23] : [01:14:27]

as the difference between

[01:14:27] : [01:14:29]

what psychology has called system one and system two

[01:14:29] : [01:14:31]

in humans, right?

[01:14:31] : [01:14:32]

So system one is the type of task that you can accomplish

[01:14:32] : [01:14:35]

without like deliberately, consciously thinking about

[01:14:35] : [01:14:39]

how you do them.

[01:14:39] : [01:14:40]

You just do them.

[01:14:40] : [01:14:42]

You've done them enough

[01:14:42] : [01:14:43]

that you can just do it subconsciously, right?

[01:14:43] : [01:14:45]

Without thinking about them.

[01:14:45] : [01:14:46]

If you're an experienced driver,

[01:14:46] : [01:14:48]

you can drive without really thinking about it

[01:14:48] : [01:14:51]

and you can talk to someone at the same time

[01:14:51] : [01:14:52]

or listen to the radio, right?

[01:14:52] : [01:14:54]

If you are a very experienced chess player,

[01:14:54] : [01:14:58]

you can play against a non-experienced chess player

[01:14:58] : [01:15:01]

without really thinking either,

[01:15:01] : [01:15:02]

you just recognize the pattern and you play, right?

[01:15:02] : [01:15:05]

That's system one.

[01:15:05] : [01:15:06]

So all the things that you do instinctively

[01:15:06] : [01:15:09]

without really having to deliberately plan

[01:15:09] : [01:15:12]

and think about it.

[01:15:12] : [01:15:13]

And then there are other tasks where you need to plan.

[01:15:13] : [01:15:15]

So if you are a not too experienced chess player

[01:15:15] : [01:15:19]

or you are experienced

[01:15:19] : [01:15:20]

but you play against another experienced chess player,

[01:15:20] : [01:15:22]

you think about all kinds of options, right?

[01:15:22] : [01:15:24]

You think about it for a while, right?

[01:15:24] : [01:15:27]

And you're much better if you have time to think about it

[01:15:27] : [01:15:30]

than you are if you play blitz with limited time.

[01:15:30] : [01:15:35]

And so this type of deliberate planning,

[01:15:35] : [01:15:39]

which uses your internal world model, that's system two,

[01:15:39] : [01:15:43]

this is what LLMs currently cannot do.

[01:15:43] : [01:15:46]

How do we get them to do this, right?

[01:15:46] : [01:15:48]

How do we build a system

[01:15:48] : [01:15:50]

that can do this kind of planning or reasoning

[01:15:50] : [01:15:55]

that devotes more resources

[01:15:55] : [01:15:57]

to complex problems than to simple problems.

[01:15:57] : [01:16:00]

And it's not going to be

[01:16:00] : [01:16:01]

autoregressive prediction of tokens,

[01:16:01] : [01:16:03]

it's going to be more something akin to inference

[01:16:03] : [01:16:08]

of latent variables

[01:16:08] : [01:16:09]

in what used to be called probabilistic models

[01:16:09] : [01:16:14]

or graphical models and things of that type.

[01:16:14] : [01:16:17]

So basically the principle is like this.

[01:16:17] : [01:16:19]

The prompt is like observed variables.

[01:16:19] : [01:16:24]

And what the model does

[01:16:24] : [01:16:29]

is that it's basically a measure of...

[01:16:29] : [01:16:33]

It can measure to what extent an answer

[01:16:33] : [01:16:36]

is a good answer for a prompt.

[01:16:36] : [01:16:37]

Okay?

[01:16:37] : [01:16:38]

So think of it as some gigantic neural net,

[01:16:38] : [01:16:41]

but it's got only one output.

[01:16:41] : [01:16:42]

And that output is a scalar number,

[01:16:42] : [01:16:45]

which is let's say zero

[01:16:45] : [01:16:47]

if the answer is a good answer for the question,

[01:16:47] : [01:16:49]

and a large number

[01:16:49] : [01:16:51]

if the answer is not a good answer for the question.

[01:16:51] : [01:16:53]

Imagine you had this model.

[01:16:53] : [01:16:55]

If you had such a model,

[01:16:55] : [01:16:56]

you could use it to produce good answers.

[01:16:56] : [01:16:58]

The way you would do it is produce the prompt

[01:16:58] : [01:17:02]

and then search through the space of possible answers

[01:17:02] : [01:17:05]

for one that minimizes that number.

[01:17:05] : [01:17:07]

That's called an energy based model.

[01:17:07] : [01:17:11]
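A toy sketch of that idea (the word-overlap scorer below is an invented stand-in, not a real compatibility model): the energy is near zero for a compatible prompt-answer pair and larger otherwise, and producing an answer means searching candidates for the minimum.

```python
# energy(x, y) ~ 0 when answer y is compatible with prompt x, large otherwise.
# Compatibility here is faked as word overlap with the prompt, purely to
# illustrate the search-for-minimum-energy inference loop.

def energy(x: str, y: str) -> float:
    overlap = len(set(x.split()) & set(y.split()))
    return max(0.0, 3.0 - overlap)

def best_answer(x: str, candidates: list) -> str:
    # Inference = search for the candidate minimizing the scalar energy.
    return min(candidates, key=lambda y: energy(x, y))

prompt = "why do cups break when they fall"
answers = ["gravity pulls cups down and they break",
           "the stock market closed higher today"]
print(best_answer(prompt, answers))  # the compatible answer wins
```

The key property is that the model only scores answers; producing one is a separate search (or, as discussed next, optimization) process.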

- But that energy based model

[01:17:11] : [01:17:14]

would need the model constructed by the LLM.

[01:17:14] : [01:17:18]

- Well, so really what you need to do

[01:17:18] : [01:17:20]

would be to not search over possible strings of text

[01:17:20] : [01:17:24]

that minimize that energy.

[01:17:24] : [01:17:27]

But what you would do

[01:17:27] : [01:17:28]

is do this in abstract representation space.

[01:17:28] : [01:17:31]

So in sort of the space of abstract thoughts,

[01:17:31] : [01:17:34]

you would elaborate a thought, right?

[01:17:34] : [01:17:37]

Using this process of minimizing the output of your model.

[01:17:37] : [01:17:42]

Okay?

[01:17:42] : [01:17:42]

Which is just a scalar.

[01:17:42] : [01:17:44]

It's an optimization process, right?

[01:17:44] : [01:17:46]

So now the way the system produces its answer

[01:17:46] : [01:17:48]

is through optimization

[01:17:48] : [01:17:50]

by minimizing an objective function basically, right?

[01:17:50] : [01:17:56]

And this is, we're talking about inference,

[01:17:56] : [01:17:57]

we're not talking about training, right?

[01:17:57] : [01:17:59]

The system has been trained already.

[01:17:59] : [01:18:01]

So now we have an abstract representation

[01:18:01] : [01:18:03]

of the thought of the answer,

[01:18:03] : [01:18:04]

representation of the answer.

[01:18:04] : [01:18:06]

We feed that to basically an autoregressive decoder,

[01:18:06] : [01:18:10]

which can be very simple,

[01:18:10] : [01:18:11]

that turns this into a text that expresses this thought.

[01:18:11] : [01:18:15]

Okay?

[01:18:15] : [01:18:16]

So that in my opinion

[01:18:16] : [01:18:17]

is the blueprint of future dialogue systems.

[01:18:17] : [01:18:21]

They will think about their answer,

[01:18:21] : [01:18:23]

plan their answer by optimization

[01:18:23] : [01:18:25]

before turning it into text.

[01:18:25] : [01:18:27]

And that is Turing complete.

[01:18:27] : [01:18:31]

- Can you explain exactly

[01:18:31] : [01:18:32]

what the optimization problem there is?

[01:18:32] : [01:18:34]

Like what's the objective function?

[01:18:34] : [01:18:37]

Just linger on it.

[01:18:37] : [01:18:38]

You kind of briefly described it,

[01:18:38] : [01:18:40]

but over what space are you optimizing?

[01:18:40] : [01:18:43]

- The space of representations-

[01:18:43] : [01:18:45]

- This abstract representation.

[01:18:45] : [01:18:47]

- That's right.

[01:18:47] : [01:18:47]

So you have an abstract representation inside the system.

[01:18:47] : [01:18:51]

You have a prompt.

[01:18:51] : [01:18:52]

The prompt goes through an encoder,

[01:18:52] : [01:18:53]

produces a representation,

[01:18:53] : [01:18:55]

perhaps goes through a predictor

[01:18:55] : [01:18:56]

that predicts a representation of the answer,

[01:18:56] : [01:18:58]

of the proper answer.

[01:18:58] : [01:18:59]

But that representation may not be a good answer

[01:18:59] : [01:19:03]

because there might be some complicated reasoning

[01:19:03] : [01:19:06]

you need to do, right?

[01:19:06] : [01:19:07]

So then you have another process

[01:19:07] : [01:19:11]

that takes the representation of the answer and modifies it

[01:19:11] : [01:19:15]

so as to minimize a cost function

[01:19:15] : [01:19:20]

that measures to what extent

[01:19:20] : [01:19:21]

the answer is a goodanswer for the question.

[01:19:21] : [01:19:22]

Now we sort of ignore the fact for...

[01:19:22] : [01:19:27]

I mean, the issue for a moment

[01:19:27] : [01:19:29]

of how you train that system

[01:19:29] : [01:19:30]

to measure whether an answer is a good answer for sure.

[01:19:30] : [01:19:35]

- But suppose such a system could be created,

[01:19:35] : [01:19:38]

what's the process?

[01:19:38] : [01:19:40]

This kind of search-like process.

[01:19:40] : [01:19:42]

- It's an optimization process.

[01:19:42] : [01:19:44]

You can do this if the entire system is differentiable,

[01:19:44] : [01:19:47]

that scalar output

[01:19:47] : [01:19:49]

is the result of running through some neural net,

[01:19:49] : [01:19:52]

running the answer,

[01:19:52] : [01:19:54]

the representation of the answer through some neural net.

[01:19:54] : [01:19:56]

Then by gradient descent,

[01:19:56] : [01:19:58]

by back propagating gradients,

[01:19:58] : [01:20:00]

you can figure out

[01:20:00] : [01:20:01]

like how to modify the representation of the answers

[01:20:01] : [01:20:03]

so as to minimize that.

[01:20:03] : [01:20:05]

- So that's still a gradient based.

[01:20:05] : [01:20:06]

- It's gradient based inference.

[01:20:06] : [01:20:08]

So now you have a representation of the answer

[01:20:08] : [01:20:10]

in abstract space.

[01:20:10] : [01:20:12]

Now you can turn it into text, right?

[01:20:12] : [01:20:14]
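Gradient-based inference can be sketched with a toy quadratic energy over a one-dimensional latent (not a real model; the target value is an invented placeholder for "the representation of a good answer"):

```python
# Inference = optimization: refine the latent z by gradient descent until
# the scalar energy is low. A decoder would then turn z into text.
# No weights are updated; only the latent representation moves.

def energy(z: float, target: float = 2.0) -> float:
    return (z - target) ** 2           # zero when z encodes the "good answer"

def grad_energy(z: float, target: float = 2.0) -> float:
    return 2.0 * (z - target)          # analytic gradient of the toy energy

z = 0.0                                # initial guess from the predictor
for _ in range(100):
    z -= 0.1 * grad_energy(z)          # gradient step in representation space
print(z)                               # converges toward 2.0
```

In a real system the gradient would come from backpropagating through the energy network, but the shape of the loop is the same: descend in continuous representation space, then decode.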

And the cool thing about this

[01:20:14] : [01:20:17]

is that the representation now

[01:20:17] : [01:20:20]

can be optimized through gradient descent,

[01:20:20] : [01:20:22]

but also is independent of the language

[01:20:22] : [01:20:24]

in which you're going to express the answer.

[01:20:24] : [01:20:27]

- Right.

[01:20:27] : [01:20:28]

So you're operating in the space of abstract representation.

[01:20:28] : [01:20:30]

I mean this goes back to the joint embedding.

[01:20:30] : [01:20:32]

- [Yann] Right.

[01:20:32] : [01:20:33]

- That it's better to work in the space of...

[01:20:33] : [01:20:36]

I don't know.

[01:20:36] : [01:20:37]

Or to romanticize the notion

[01:20:37] : [01:20:39]

like space of concepts

[01:20:39] : [01:20:40]

versus the space of concrete sensory information.

[01:20:40] : [01:20:45]

- Right.

[01:20:45] : [01:20:47]

- Okay.

[01:20:47] : [01:20:48]

But can this do something like reasoning,

[01:20:48] : [01:20:50]

which is what we're talking about?

[01:20:50] : [01:20:51]

- Well, not really,

[01:20:51] : [01:20:53]

only in a very simple way.

[01:20:53] : [01:20:54]

I mean basically you can think of those things as doing

[01:20:54] : [01:20:57]

the kind of optimizationI was talking about,

[01:20:57] : [01:20:59]

except they're optimizing in the discrete space

[01:20:59] : [01:21:01]

which is the space of possible sequences of tokens.

[01:21:01] : [01:21:05]

And they do this optimization in a horribly inefficient way,

[01:21:05] : [01:21:09]

which is to generate a lot of hypotheses

[01:21:09] : [01:21:11]

and then select the best ones.

[01:21:11] : [01:21:13]

And that's incredibly wasteful

[01:21:13] : [01:21:16]

in terms of computation,

[01:21:16] : [01:21:18]

'cause you basically have to run your LLM

[01:21:18] : [01:21:20]

for like every possible generated sequence.

[01:21:20] : [01:21:24]

And it's incredibly wasteful.

[01:21:24] : [01:21:27]

So it's much better to do an optimization

[01:21:27] : [01:21:31]

in continuous space

[01:21:31] : [01:21:33]

where you can do gradient descent

[01:21:33] : [01:21:34]

as opposed to like generate tons of things

[01:21:34] : [01:21:36]

and then select the best,

[01:21:36] : [01:21:38]

you just iteratively refine your answer

[01:21:38] : [01:21:41]

to go towards the best, right?

[01:21:41] : [01:21:42]

That's much more efficient.

[01:21:42] : [01:21:44]

But you can only do this in continuous spaces

[01:21:44] : [01:21:46]

with differentiable functions.

[01:21:46] : [01:21:48]

- You're talking about the reasoning,

[01:21:48] : [01:21:50]

like ability to think deeply or to reason deeply.

[01:21:50] : [01:21:54]

How do you know what is an answer

[01:21:54] : [01:21:58]

that's better or worse based on deep reasoning?

[01:21:58] : [01:22:03]

- Right.

[01:22:03] : [01:22:05]

So then we're asking the question,

[01:22:05] : [01:22:06]

of conceptually, how do you train an energy based model?

[01:22:06] : [01:22:09]

Right?

[01:22:09] : [01:22:10]

So energy based model

[01:22:10] : [01:22:11]

is a function with a scalar output, just a number.

[01:22:11] : [01:22:13]

You give it two inputs, X and Y,

[01:22:13] : [01:22:17]

and it tells you whether Y is compatible with X or not.

[01:22:17] : [01:22:20]

X you observe,

[01:22:20] : [01:22:21]

let's say it's a prompt, an image, a video, whatever.

[01:22:21] : [01:22:24]

And Y is a proposal for an answer,

[01:22:24] : [01:22:28]

a continuation of video, whatever.

[01:22:28] : [01:22:30]

And it tells you whether Y is compatible with X.

[01:22:30] : [01:22:32]

And the way it tells you that Y is compatible with X

[01:22:32] : [01:22:37]

is that the output of that function would be zero

[01:22:37] : [01:22:39]

if Y is compatible with X,

[01:22:39] : [01:22:40]

it would be a positive number, non-zero

[01:22:40] : [01:22:44]

if Y is not compatible with X.

[01:22:44] : [01:22:46]

Okay.

[01:22:46] : [01:22:48]

How do you train a system like this?

[01:22:48] : [01:22:49]

At a completely general level,

[01:22:49] : [01:22:51]

is you show it pairs of X and Ys that are compatible,

[01:22:51] : [01:22:56]

a question and the corresponding answer.

[01:22:56] : [01:22:58]

And you train the parameters of the big neural net inside

[01:22:58] : [01:23:02]

to produce zero.

[01:23:02] : [01:23:03]

Okay.

[01:23:03] : [01:23:05]

Now that doesn't completely work

[01:23:05] : [01:23:07]

because the system might decide,

[01:23:07] : [01:23:08]

well, I'm just gonna say zero for everything.

[01:23:08] : [01:23:11]

So now you have to have a process

[01:23:11] : [01:23:12]

to make sure that for a wrong Y,

[01:23:12] : [01:23:16]

the energy will be larger than zero.

[01:23:16] : [01:23:18]

And there you have two options,

[01:23:18] : [01:23:20]

one is contrastive methods.

[01:23:20] : [01:23:21]

So the contrastive method is you show an X and a bad Y,

[01:23:21] : [01:23:25]

and you tell the system,

[01:23:25] : [01:23:27]

well, give a high energy to this.

[01:23:27] : [01:23:29]

Like push up the energy, right?

[01:23:29] : [01:23:30]

Change the weights in the neural net that compute the energy

[01:23:30] : [01:23:33]

so that it goes up.

[01:23:33] : [01:23:34]

So that's contrastive methods.

[01:23:34] : [01:23:37]
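The contrastive recipe can be sketched with a toy one-parameter energy (all numbers invented): push the energy of a good pair down and the energy of a bad pair up, which is the same basic shape as reward-model training.

```python
# Toy energy E(x, y) = w * x * y with a single trainable parameter w.
# A contrastive step lowers E on the compatible pair and raises it on
# the incompatible one.

def score(x: float, y: float) -> float:
    return x * y

def contrastive_step(w: float, x: float, y_good: float, y_bad: float,
                     lr: float = 0.1) -> float:
    # d/dw [E(x, y_good) - E(x, y_bad)] = score(x, y_good) - score(x, y_bad)
    return w - lr * (score(x, y_good) - score(x, y_bad))

w = 0.0
for _ in range(5):
    w = contrastive_step(w, x=1.0, y_good=1.0, y_bad=-1.0)

assert w * score(1.0, 1.0) < w * score(1.0, -1.0)  # good pair now has lower energy
```

The scaling problem described next is visible even here: each step only separates the one bad Y you showed, so a large Y space needs a huge number of such negative samples.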

The problem with this is if the space of Y is large,

[01:23:37] : [01:23:41]

the number of such contrastive samples

[01:23:41] : [01:23:43]

you're gonna have to show is gigantic.

[01:23:43] : [01:23:47]

But people do this.

[01:23:47] : [01:23:49]

They do this when you train a system with RLHF,

[01:23:49] : [01:23:53]

basically what you're training

[01:23:53] : [01:23:55]

is what's called a reward model,

[01:23:55] : [01:23:57]

which is basically an objective function

[01:23:57] : [01:24:00]

that tells you whether an answer is good or bad.

[01:24:00] : [01:24:02]

And that's basically exactly what this is.

[01:24:02] : [01:24:06]

So we already do this to some extent.

[01:24:06] : [01:24:08]

We're just not using it for inference,

[01:24:08] : [01:24:09]

we're just using it for training.

[01:24:09] : [01:24:11]

There is another set of methods

[01:24:11] : [01:24:15]

which are non-contrastive, and I prefer those.

[01:24:15] : [01:24:18]

And those non-contrastive methods basically say,

[01:24:18] : [01:24:22]

okay, the energy function

[01:24:22] : [01:24:26]

needs to have low energy on pairs of XYs that are compatible

[01:24:26] : [01:24:29]

that come from your training set.

[01:24:29] : [01:24:31]

How do you make sure that the energy

[01:24:31] : [01:24:34]

is gonna be higher everywhere else?

[01:24:34] : [01:24:36]

And the way you do this

[01:24:36] : [01:24:38]

is by having a regularizer, a criterion,

[01:24:38] : [01:24:43]

a term in your cost function

[01:24:43] : [01:24:45]

that basically minimizes the volume of space

[01:24:45] : [01:24:49]

that can take low energy.

[01:24:49] : [01:24:50]

And the precise way to do this,

[01:24:50] : [01:24:53]

there's all kinds of different specific ways to do this

[01:24:53] : [01:24:55]

depending on the architecture,

[01:24:55] : [01:24:56]

but that's the basic principle.

[01:24:56] : [01:24:58]

So that if you push down the energy function

[01:24:58] : [01:25:00]

for particular regions in the XY space,

[01:25:00] : [01:25:04]

it will automatically go up in other places

[01:25:04] : [01:25:06]

because there's only a limited volume of space

[01:25:06] : [01:25:09]

that can take low energy.

[01:25:09] : [01:25:11]

Okay?

[01:25:11] : [01:25:11]

By the construction of the system

[01:25:11] : [01:25:13]

or by the regularizing function.

[01:25:13] : [01:25:16]
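A loose sketch of the non-contrastive idea (toy one-dimensional grid, invented update rules): energy is pushed down only on training points, while a regularizer penalizes low-energy regions everywhere, so energy rises elsewhere without ever sampling explicit negatives.

```python
import math

# Toy 1-D "answer space" with a tabulated energy landscape.
grid = [i / 10 for i in range(-20, 21)]
E = {y: 1.0 for y in grid}            # start from a flat energy landscape
data = [0.0, 0.1]                     # observed compatible answers

for _ in range(50):
    for y in data:
        E[y] -= 0.05                  # push energy down on training pairs only
    for y in grid:
        E[y] += 0.001 * math.exp(-E[y])  # regularizer: shrink low-energy volume

assert E[0.0] < E[1.5]                # data points end up with the lowest energy
```

The regularizer here is a crude stand-in for the "limited volume of low energy" criterion; real instantiations depend on the architecture, as noted above.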

- We've been talking very generally,

[01:25:16] : [01:25:18]

but what is a good X and a good Y?

[01:25:18] : [01:25:21]

What is a good representation of X and Y?

[01:25:21] : [01:25:25]

Because we've been talking about language.

[01:25:25] : [01:25:27]

And if you just take language directly,

[01:25:27] : [01:25:30]

that presumably is not good,

[01:25:30] : [01:25:32]

so there has to be

[01:25:32] : [01:25:33]

some kind of abstract representation of ideas.

[01:25:33] : [01:25:35]

- Yeah.

[01:25:35] : [01:25:37]

I mean you can do this with language directly

[01:25:37] : [01:25:39]

by just, you know, X is a text

[01:25:39] : [01:25:42]

and Y is the continuation of that text.

[01:25:42] : [01:25:43]

- [Lex] Yes.

[01:25:43] : [01:25:45]

- Or X is a question, Y is the answer.

[01:25:45] : [01:25:48]

But you're saying that's not gonna take it.

[01:25:48] : [01:25:49]

I mean, that's going to do what LLMs are doing.

[01:25:49] : [01:25:52]

- Well, no.

[01:25:52] : [01:25:53]

It depends on how the internal structure of the system

[01:25:53] : [01:25:56]

is built.

[01:25:56] : [01:25:57]

If the internal structure of the system

[01:25:57] : [01:25:59]

is built in such a way that inside of the system

[01:25:59] : [01:26:02]

there is a latent variable,

[01:26:02] : [01:26:03]

let's call it Z,

[01:26:03] : [01:26:04]

that you can manipulate

[01:26:04] : [01:26:09]

so as to minimize the output energy,

[01:26:09] : [01:26:11]

then that Z can be viewed as a representation of a good answer

[01:26:11] : [01:26:16]

that you can translate into a Y that is a good answer.

[01:26:16] : [01:26:19]
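
The latent-variable idea can be caricatured in a few lines. This is a hypothetical stand-in, not LeCun's architecture: a quadratic energy and a linear decoder, chosen only so that "search over Z to minimize the energy, then decode Z into Y" is visible as inference-by-optimization:

```python
def energy(x, z):
    # toy energy: low when z encodes "double the input" (arbitrary target)
    return (z - 2.0 * x) ** 2

def decode(z):
    # translate the latent variable into an output Y
    return z + 1.0

def infer(x, steps=100, lr=0.1):
    # inference = gradient descent on the energy with respect to z
    z = 0.0
    for _ in range(steps):
        grad = 2.0 * (z - 2.0 * x)   # dE/dz for the quadratic energy
        z -= lr * grad
    return decode(z)

y = infer(3.0)   # z converges near 6.0, so y is approximately 7.0
```

The point of the sketch: the answer is not emitted in one forward pass; it is found by optimizing over the latent space, and only then translated into Y.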

- So this kind of system could be trained

[01:26:19] : [01:26:22]

in a very similar way?

[01:26:22] : [01:26:24]

- Very similar way.

[01:26:24] : [01:26:25]

But you have to have this way of preventing collapse,

[01:26:25] : [01:26:26]

of ensuring that there is high energy

[01:26:26] : [01:26:31]

for things you don't train it on.

[01:26:31] : [01:26:33]

And currently it's very implicit in LLMs.

[01:26:33] : [01:26:38]

It is done in a way

[01:26:38] : [01:26:39]

that people don't realize it's being done,

[01:26:39] : [01:26:40]

but it is being done.

[01:26:40] : [01:26:42]

It's due to the fact

[01:26:42] : [01:26:43]

that when you give a high probability to a word,

[01:26:43] : [01:26:48]

automatically you give low probability to other words

[01:26:48] : [01:26:51]

because you only have

[01:26:51] : [01:26:52]

a finite amount of probability to go around. (laughing)

[01:26:52] : [01:26:55]

Right?

[01:26:55] : [01:26:56]

They have to sum to one.

[01:26:56] : [01:26:57]

So when you minimize the cross-entropy or whatever,

[01:26:57] : [01:27:00]

when you train your LLM to predict the next word,

[01:27:00] : [01:27:05]

you are increasing the probability

[01:27:05] : [01:27:07]

your system will give to the correct word,

[01:27:07] : [01:27:09]

but you're also decreasing the probability

[01:27:09] : [01:27:10]

it will give to the incorrect words.

[01:27:10] : [01:27:12]

Now, indirectly, that gives a low probability to...

[01:27:12] : [01:27:17]

A high probability to sequences of words that are good

[01:27:17] : [01:27:19]

and low probability to sequences of words that are bad,

[01:27:19] : [01:27:21]

but it's very indirect.

[01:27:21] : [01:27:23]

It's not obvious why this actually works at all,

[01:27:23] : [01:27:26]

because you're not doing it on a joint probability

[01:27:26] : [01:27:30]

of all the symbols in a sequence,

[01:27:30] : [01:27:32]

you're just doing it kind of,

[01:27:32] : [01:27:34]

sort of factorized that probability

[01:27:34] : [01:27:37]

in terms of conditional probabilities

[01:27:37] : [01:27:39]

over successive tokens.

[01:27:39] : [01:27:41]
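
The normalization argument is easy to verify directly: a softmax over a finite vocabulary must sum to one, so raising the logit of the "correct" token necessarily lowers the probability of every other token. A minimal check with a four-token toy vocabulary:

```python
import math

def softmax(logits):
    # numerically stable softmax: probabilities that sum to one
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

logits = [1.0, 1.0, 1.0, 1.0]
before = softmax(logits)

logits[0] += 2.0           # a training step boosting the "correct" token
after = softmax(logits)

# correct token goes up, every other token goes down, total stays at one
```

This is the implicit mechanism described above: pushing probability up at training sequences pushes it down everywhere else, because there is only a finite amount of probability to go around.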

- So how do you do this for visual data?

[01:27:41] : [01:27:43]

- So we've been doing this

[01:27:43] : [01:27:44]

with all JEPA architectures, basically the-

[01:27:44] : [01:27:47]

- [Lex] The joint embedding?

[01:27:47] : [01:27:47]

- I-JEPA.

[01:27:47] : [01:27:48]

So there, the compatibility between two things

[01:27:48] : [01:27:52]

is here's an image or a video,

[01:27:52] : [01:27:56]

here is a corrupted, shifted or transformed version

[01:27:56] : [01:27:58]

of that image or video or masked.

[01:27:58] : [01:28:01]

Okay?

[01:28:01] : [01:28:01]

And then the energy of the system

[01:28:01] : [01:28:04]

is the prediction error of the representation.

[01:28:04] : [01:28:09]

The predicted representation of the good thing

[01:28:09] : [01:28:14]

versus the actual representation of the good thing, right?

[01:28:14] : [01:28:17]

So you run the corrupted image through the system,

[01:28:17] : [01:28:20]

predict the representation of the good input uncorrupted,

[01:28:20] : [01:28:24]

and then compute the prediction error.

[01:28:24] : [01:28:26]

That's the energy of the system.

[01:28:26] : [01:28:28]
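
A schematic of the energy just described — a toy sketch, not Meta's actual I-JEPA: encode both the clean input and a corrupted version, predict the clean representation from the corrupted one, and take the prediction error in representation space as the energy. The encoder and predictor here are invented stand-ins:

```python
def encoder(x):
    # stand-in encoder: map an input to a fixed 2-D "representation"
    return [sum(x), sum(v * v for v in x)]

def predictor(rep):
    # stand-in predictor; a real one is trained to undo the corruption
    return rep

def energy(x_clean, x_corrupted):
    # prediction error in representation space = the energy
    pred = predictor(encoder(x_corrupted))
    target = encoder(x_clean)
    return sum((p - t) ** 2 for p, t in zip(pred, target))

clean = [1.0, 2.0, 3.0]
compatible = [1.0, 2.0, 3.0]     # identical here, so zero energy
unrelated = [9.0, -4.0, 0.5]

low = energy(clean, compatible)   # compatible pair -> low energy
high = energy(clean, unrelated)   # incompatible pair -> high energy
```

The design choice worth noticing: the error is computed between representations, not pixels, which is what distinguishes a joint embedding architecture from a generative one.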

So this system will tell you,

[01:28:28] : [01:28:30]

this is a good image and this is a corrupted version.

[01:28:30] : [01:28:35]

It will give you zero energy

[01:28:35] : [01:28:38]

if those two things are effectively,

[01:28:38] : [01:28:41]

one of them is a corrupted version of the other,

[01:28:41] : [01:28:43]

give you a high energy

[01:28:43] : [01:28:44]

if the two images are completely different.

[01:28:44] : [01:28:46]

- And hopefully that whole process

[01:28:46] : [01:28:48]

gives you a really nice compressed representation

[01:28:48] : [01:28:51]

of reality, of visual reality.

[01:28:51] : [01:28:54]

- And we know it does

[01:28:54] : [01:28:55]

because then we use those representations

[01:28:55] : [01:28:57]

as input to a classification system or something,

[01:28:57] : [01:28:59]

and it works-- And then

[01:28:59] : [01:29:00]

that classification system works really nicely.

[01:29:00] : [01:29:01]

Okay.

[01:29:01] : [01:29:03]

Well, so to summarize,

[01:29:03] : [01:29:04]

you recommend in a spicy way that only Yann LeCun can,

[01:29:04] : [01:29:09]

you recommend that we abandon generative models

[01:29:09] : [01:29:12]

in favor of joint embedding architectures?

[01:29:12] : [01:29:14]

- [Yann] Yes.

[01:29:14] : [01:29:16]

- Abandon autoregressive generation.

[01:29:16] : [01:29:17]

- [Yann] Yes.

[01:29:17] : [01:29:18]

- Abandon... (laughs)

[01:29:18] : [01:29:19]

This feels like court testimony.

[01:29:19] : [01:29:21]

Abandon probabilistic models

[01:29:21] : [01:29:23]

in favor of energy-based models, as we talked about.

[01:29:23] : [01:29:26]

Abandon contrastive methods

[01:29:26] : [01:29:27]

in favor of regularized methods.

[01:29:27] : [01:29:30]

And let me ask you about this;

[01:29:30] : [01:29:32]

you've been, for a while, a critic of reinforcement learning.

[01:29:32] : [01:29:36]

- [Yann] Yes.

[01:29:36] : [01:29:37]

- So the last recommendation is that we abandon RL

[01:29:37] : [01:29:41]

in favor of model predictive control,

[01:29:41] : [01:29:43]

as you were talking about.

[01:29:43] : [01:29:45]

And only use RL

[01:29:45] : [01:29:46]

when planning doesn't yield the predicted outcome.

[01:29:46] : [01:29:50]

And we use RL in that case

[01:29:50] : [01:29:52]

to adjust the world model or the critic.

[01:29:52] : [01:29:55]

- [Yann] Yes.

[01:29:55] : [01:29:57]

- So you've mentioned RLHF,

[01:29:57] : [01:30:00]

reinforcement learning with human feedback.

[01:30:00] : [01:30:02]

Why do you still hate reinforcement learning?

[01:30:02] : [01:30:05]

- [Yann] I don't hate reinforcement learning,

[01:30:05] : [01:30:07]

and I think it's-- So it's all love?

[01:30:07] : [01:30:08]

- I think it should not be abandoned completely,

[01:30:08] : [01:30:12]

but I think its use should be minimized

[01:30:12] : [01:30:14]

because it's incredibly inefficient in terms of samples.

[01:30:14] : [01:30:18]

And so the proper way to train a system

[01:30:18] : [01:30:21]

is to first have it learn

[01:30:21] : [01:30:24]

good representations of the world and world models

[01:30:24] : [01:30:27]

from mostly observation,

[01:30:27] : [01:30:29]

maybe a little bit of interactions.

[01:30:29] : [01:30:31]

- And then steer it based on that.

[01:30:31] : [01:30:33]

If the representation is good,

[01:30:33] : [01:30:34]

then the adjustments should be minimal.

[01:30:34] : [01:30:36]

- Yeah.

[01:30:36] : [01:30:37]

Now there's two things.

[01:30:37] : [01:30:39]

If you've learned the world model,

[01:30:39] : [01:30:40]

you can use the world model to plan a sequence of actions

[01:30:40] : [01:30:42]

to arrive at a particular objective.

[01:30:42] : [01:30:44]

You don't need RL,

[01:30:44] : [01:30:47]

unless the way you measure whether you succeed

[01:30:47] : [01:30:50]

might be inexact.

[01:30:50] : [01:30:51]

Your idea of whether you were gonna fall from your bike

[01:30:51] : [01:30:56]

might be wrong,

[01:30:56] : [01:30:59]

or whether the person you're fighting in MMA

[01:30:59] : [01:31:02]

was gonna do something

[01:31:02] : [01:31:03]

and they do something else. (laughing)

[01:31:03] : [01:31:05]

So there's two ways you can be wrong.

[01:31:05] : [01:31:09]

Either your objective function

[01:31:09] : [01:31:12]

does not reflect

[01:31:12] : [01:31:13]

the actual objective function you want to optimize,

[01:31:13] : [01:31:16]

or your world model is inaccurate, right?

[01:31:16] : [01:31:19]

So the prediction you were making

[01:31:19] : [01:31:22]

about what was gonna happenin the world is inaccurate.

[01:31:22] : [01:31:25]

So if you want to adjust your world model

[01:31:25] : [01:31:27]

while you are operating the world

[01:31:27] : [01:31:30]

or your objective function,

[01:31:30] : [01:31:32]

that is basically in the realm of RL.

[01:31:32] : [01:31:35]

This is what RL deals withto some extent, right?

[01:31:35] : [01:31:38]

So adjust your world model.

[01:31:38] : [01:31:40]

And the way to adjust your world model, even in advance,

[01:31:40] : [01:31:44]

is to explore parts of the space with your world model,

[01:31:44] : [01:31:48]

where you know that your world model is inaccurate.

[01:31:48] : [01:31:50]

That's called curiosity, basically, or play, right?

[01:31:50] : [01:31:54]

When you play,

[01:31:54] : [01:31:55]

you kind of explore part of the state space

[01:31:55] : [01:31:58]

that you don't want to do for real

[01:31:58] : [01:32:03]

because it might be dangerous,

[01:32:03] : [01:32:05]

but you can adjust your world model

[01:32:05] : [01:32:07]

without killing yourself, basically. (laughs)

[01:32:07] : [01:32:13]

So that's what you want to use RL for.

[01:32:13] : [01:32:14]

When it comes time to learn a particular task,

[01:32:14] : [01:32:18]

you already have all the good representations,

[01:32:18] : [01:32:20]

you already have your world model,

[01:32:20] : [01:32:21]

but you need to adjust it for the situation at hand.

[01:32:21] : [01:32:25]

That's when you use RL.

[01:32:25] : [01:32:26]

- Why do you think RLHF works so well?

[01:32:26] : [01:32:29]

This reinforcement learning with human feedback,

[01:32:29] : [01:32:32]

why did it have such a transformational effect

[01:32:32] : [01:32:34]

on large language models that came before?

[01:32:34] : [01:32:38]

- So what's had the transformational effect

[01:32:38] : [01:32:39]

is human feedback.

[01:32:39] : [01:32:42]

There are many ways to use it

[01:32:42] : [01:32:43]

and some of it is just purely supervised, actually,

[01:32:43] : [01:32:45]

it's not really reinforcement learning.

[01:32:45] : [01:32:47]

- So it's the HF. (laughing)

[01:32:47] : [01:32:49]

- It's the HF.

[01:32:49] : [01:32:50]

And then there are various ways to use human feedback, right?

[01:32:50] : [01:32:53]

So you can ask humans to rate answers,

[01:32:53] : [01:32:56]

multiple answers that are produced by a world model.

[01:32:56] : [01:33:00]

And then what you do is you train an objective function

[01:33:00] : [01:33:05]

to predict that rating.

[01:33:05] : [01:33:07]

And then you can use that objective function

[01:33:07] : [01:33:11]

to predict whether an answer is good,

[01:33:11] : [01:33:13]

and you can backpropagate through this

[01:33:13] : [01:33:15]

to fine tune your system

[01:33:15] : [01:33:16]

so that it only produces highly rated answers.

[01:33:16] : [01:33:19]
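
The reward-model step just described, in caricature. This is an illustrative sketch, not any lab's actual pipeline: each answer is reduced to a single invented "feature", human ratings are invented too, and a tiny linear model is trained to predict the ratings. The learned function then plays the role of the objective used to score (and, in real RLHF, fine-tune toward) new answers:

```python
# each answer is reduced to one hypothetical feature value
features = [0.0, 1.0, 2.0, 3.0]
ratings  = [0.1, 1.1, 1.9, 3.1]   # invented human feedback, roughly linear

w, b, lr = 0.0, 0.0, 0.05
for _ in range(2000):
    for x, r in zip(features, ratings):
        err = (w * x + b) - r      # prediction error on this rating
        w -= lr * err * x          # gradient step for squared loss
        b -= lr * err

def reward(x):
    # the learned objective: estimates to what extent an answer is good
    return w * x + b

# the fine-tuning stage would now push the model toward high-reward answers
```

As the conversation notes, the same learned objective could instead be used at inference time for planning; here it is only fit to the ratings.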

Okay?

[01:33:19] : [01:33:22]

So that's one way.

[01:33:22] : [01:33:23]

So that's like in RL,

[01:33:23] : [01:33:26]

that means training what's called a reward model, right?

[01:33:26] : [01:33:29]

So something that,

[01:33:29] : [01:33:30]

basically your small neural net

[01:33:30] : [01:33:31]

that estimates to what extent an answer is good, right?

[01:33:31] : [01:33:35]

It's very similar to the objective

[01:33:35] : [01:33:36]

I was talking about earlier for planning,

[01:33:36] : [01:33:39]

except now it's not used for planning,

[01:33:39] : [01:33:41]

it's used for fine tuning your system.

[01:33:41] : [01:33:43]

I think it would be much more efficient

[01:33:43] : [01:33:45]

to use it for planning,

[01:33:45] : [01:33:46]

but currently it's used

[01:33:46] : [01:33:49]

to fine tune the parameters of the system.

[01:33:49] : [01:33:52]

Now, there's several ways to do this.

[01:33:52] : [01:33:54]

Some of them are supervised.

[01:33:54] : [01:33:57]

You just ask a human person,

[01:33:57] : [01:33:59]

like what is a good answer for this, right?

[01:33:59] : [01:34:02]

Then you just type the answer.

[01:34:02] : [01:34:04]

I mean, there's lots of ways

[01:34:04] : [01:34:07]

that those systems are being adjusted.

[01:34:07] : [01:34:09]

- Now, a lot of people have been very critical

[01:34:09] : [01:34:13]

of Google's recently released Gemini 1.5

[01:34:13] : [01:34:17]

for essentially, in my words, I could say super woke.

[01:34:17] : [01:34:23]

Woke in the negative connotation of that word.

[01:34:23] : [01:34:26]

There are some almost hilariously absurd things that it does,

[01:34:26] : [01:34:30]

like it modifies history,

[01:34:30] : [01:34:32]

like generating images of a black George Washington

[01:34:32] : [01:34:37]

or perhaps more seriously

[01:34:37] : [01:34:40]

something that you commented on Twitter,

[01:34:40] : [01:34:43]

which is refusing to comment on or generate images of,

[01:34:43] : [01:34:48]

or even descriptions of Tiananmen Square or the Tank Man,

[01:34:48] : [01:34:54]

one of the most sort of legendary protest images in history.

[01:34:54] : [01:35:00]

And of course, these images are highly censored

[01:35:00] : [01:35:04]

by the Chinese government.

[01:35:04] : [01:35:06]

And therefore everybody started asking questions

[01:35:06] : [01:35:09]

of what is the process of designing these LLMs?

[01:35:09] : [01:35:14]

What is the role of censorship in these,

[01:35:14] : [01:35:16]

and all that kind of stuff.

[01:35:16] : [01:35:19]

So you commented on Twitter

[01:35:19] : [01:35:22]

saying that open source is the answer.

[01:35:22] : [01:35:23]

(laughs)- Yeah.

[01:35:23] : [01:35:25]

- Essentially.

[01:35:25] : [01:35:26]

So can you explain?

[01:35:26] : [01:35:28]

- I actually made that comment

[01:35:28] : [01:35:31]

on just about every social network I can.

[01:35:31] : [01:35:32]

(Lex laughs)

[01:35:32] : [01:35:33]

And I've made that point multiple times in various forums.

[01:35:33] : [01:35:38]

Here's my point of view on this.

[01:35:38] : [01:35:43]

People can complain that AI systems are biased,

[01:35:43] : [01:35:47]

and they generally are biased

[01:35:47] : [01:35:49]

by the distribution of the training data

[01:35:49] : [01:35:51]

that they've been trained on

[01:35:51] : [01:35:53]

that reflects biases in society.

[01:35:53] : [01:35:57]

And that is potentially offensive to some people

[01:35:57] : [01:36:03]

or potentially not.

[01:36:03] : [01:36:06]

And some techniques to de-bias

[01:36:06] : [01:36:10]

then become offensive to some people

[01:36:10] : [01:36:13]

because of historical incorrectness and things like that.

[01:36:13] : [01:36:20]

And so you can ask the question.

[01:36:20] : [01:36:25]

You can ask two questions.

[01:36:25] : [01:36:27]

The first question is,

[01:36:27] : [01:36:28]

is it possible to produce an AI system that is not biased?

[01:36:28] : [01:36:30]

And the answer is absolutely not.

[01:36:30] : [01:36:33]

And it's not because of technological challenges,

[01:36:33] : [01:36:36]

although there are technological challenges to that.

[01:36:36] : [01:36:41]

It's because bias is in the eye of the beholder.

[01:36:41] : [01:36:45]

Different people may have different ideas

[01:36:45] : [01:36:48]

about what constitutes bias for a lot of things.

[01:36:48] : [01:36:52]

I mean there are facts that are indisputable,

[01:36:52] : [01:36:57]

but there are a lot of opinions or things

[01:36:57] : [01:36:59]

that can be expressed in different ways.

[01:36:59] : [01:37:01]

And so you cannot have an unbiased system,

[01:37:01] : [01:37:04]

that's just an impossibility.

[01:37:04] : [01:37:06]

And so what's the answer to this?

[01:37:06] : [01:37:12]

And the answer is the same answer that we found

[01:37:12] : [01:37:16]

in liberal democracy about the press.

[01:37:16] : [01:37:20]

The press needs to be free and diverse.

[01:37:20] : [01:37:24]

We have free speech for a good reason.

[01:37:24] : [01:37:28]

It's because we don't want all of our information

[01:37:28] : [01:37:31]

to come from a unique source,

[01:37:31] : [01:37:35]

'cause that's opposite to the whole idea of democracy

[01:37:35] : [01:37:40]

and progressive ideas and even science, right?

[01:37:40] : [01:37:44]

In science, people have to argue for different opinions.

[01:37:44] : [01:37:48]

And science makes progress when people disagree

[01:37:48] : [01:37:51]

and they come up with an answer

[01:37:51] : [01:37:52]

and a consensus forms, right?

[01:37:52] : [01:37:54]

And it's true in all democracies around the world.

[01:37:54] : [01:37:57]

So there is a future which is already happening

[01:37:57] : [01:38:02]

where every single one of our interactions

[01:38:02] : [01:38:05]

with the digital world

[01:38:05] : [01:38:07]

will be mediated by AI systems,

[01:38:07] : [01:38:10]

AI assistants, right?

[01:38:10] : [01:38:11]

We're gonna have smart glasses.

[01:38:11] : [01:38:14]

You can already buy them from Meta, (laughing)

[01:38:14] : [01:38:16]

the Ray-Ban Meta.

[01:38:16] : [01:38:18]

Where you can talk to them

[01:38:18] : [01:38:20]

and they are connected with an LLM

[01:38:20] : [01:38:21]

and you can get answers on any question you have.

[01:38:21] : [01:38:25]

Or you can be looking at a monument

[01:38:25] : [01:38:28]

and there is a camera in the system, in the glasses,

[01:38:28] : [01:38:31]

you can ask it like what can you tell me

[01:38:31] : [01:38:34]

about this building or this monument?

[01:38:34] : [01:38:36]

You can be looking at a menu in a foreign language

[01:38:36] : [01:38:39]

and the thing will translate it for you.

[01:38:39] : [01:38:40]

We can do real time translation

[01:38:40] : [01:38:43]

if we speak different languages.

[01:38:43] : [01:38:44]

So a lot of our interactions with the digital world

[01:38:44] : [01:38:48]

are going to be mediated by those systems

[01:38:48] : [01:38:49]

in the near future.

[01:38:49] : [01:38:50]

Increasingly, the search engines that we're gonna use

[01:38:50] : [01:38:56]

are not gonna be search engines,

[01:38:56] : [01:38:58]

they're gonna be dialogue systems

[01:38:58] : [01:39:01]

that we just ask a question,

[01:39:01] : [01:39:04]

and it will answer

[01:39:04] : [01:39:05]

and then point you

[01:39:05] : [01:39:05]

to the perhaps appropriate reference for it.

[01:39:05] : [01:39:09]

But here is the thing,

[01:39:09] : [01:39:10]

we cannot afford those systems

[01:39:10] : [01:39:11]

to come from a handful of companies

[01:39:11] : [01:39:13]

on the west coast of the US

[01:39:13] : [01:39:15]

because those systems will constitute

[01:39:15] : [01:39:18]

the repository of all human knowledge.

[01:39:18] : [01:39:21]

And we cannot have that be controlled

[01:39:21] : [01:39:25]

by a small number of people, right?

[01:39:25] : [01:39:27]

It has to be diverse

[01:39:27] : [01:39:29]

for the same reason thepress has to be diverse.

[01:39:29] : [01:39:32]

So how do we get a diverse set of AI assistants?

[01:39:32] : [01:39:35]

It's very expensive and difficult

[01:39:35] : [01:39:38]

to train a base model, right?

[01:39:38] : [01:39:40]

A base LLM at the moment.

[01:39:40] : [01:39:42]

In the future it might be something different,

[01:39:42] : [01:39:43]

but at the moment that's an LLM.

[01:39:43] : [01:39:46]

So only a few companies can do this properly.

[01:39:46] : [01:39:49]

And if some of those subsystems are open source,

[01:39:49] : [01:39:55]

anybody can use them,

[01:39:55] : [01:39:57]

anybody can fine tune them.

[01:39:57] : [01:39:58]

If we put in place some systems

[01:39:58] : [01:40:01]

that allow any group of people,

[01:40:01] : [01:40:05]

whether they are individual citizens,

[01:40:05] : [01:40:10]

groups of citizens,

[01:40:10] : [01:40:11]

government organizations,

[01:40:11] : [01:40:13]

NGOs, companies, whatever,

[01:40:13] : [01:40:18]

to take those open source systems, AI systems,

[01:40:18] : [01:40:23]

and fine tune them for their own purpose on their own data,

[01:40:23] : [01:40:27]

then we're gonna have a very large diversity

[01:40:27] : [01:40:29]

of different AI systems

[01:40:29] : [01:40:31]

that are specialized for all of those things, right?

[01:40:31] : [01:40:34]

So I'll tell you,

[01:40:34] : [01:40:35]

I talked to the French government quite a bit

[01:40:35] : [01:40:38]

and the French government will not accept

[01:40:38] : [01:40:41]

that the digital diet of all their citizens

[01:40:41] : [01:40:44]

be controlled by three companies

[01:40:44] : [01:40:46]

on the west coast of the US.

[01:40:46] : [01:40:48]

That's just not acceptable.

[01:40:48] : [01:40:49]

It's a danger to democracy.

[01:40:49] : [01:40:51]

Regardless of how well intentioned

[01:40:51] : [01:40:52]

those companies are, right?

[01:40:52] : [01:40:54]

And it's also a danger to local culture,

[01:40:54] : [01:41:00]

to values, to language, right?

[01:41:00] : [01:41:05]

I was talking with the founder of Infosys in India.

[01:41:05] : [01:41:10]

He's funding a project to fine tune LLaMA 2,

[01:41:10] : [01:41:16]

the open source model produced by Meta.

[01:41:16] : [01:41:19]

So that LLaMA 2 speaks all 22 official languages in India.

[01:41:19] : [01:41:23]

It's very important for people in India.

[01:41:23] : [01:41:26]

I was talking to a former colleague of mine,

[01:41:26] : [01:41:28]

Moustapha Cisse,

[01:41:28] : [01:41:29]

who used to be a scientist at FAIR,

[01:41:29] : [01:41:31]

and then moved back to Africa

[01:41:31] : [01:41:32]

and created a research lab for Google in Africa

[01:41:32] : [01:41:35]

and now has a new startup Kera.

[01:41:35] : [01:41:37]

And what he's trying to do is basically have an LLM

[01:41:37] : [01:41:40]

that speaks the local languages in Senegal

[01:41:40] : [01:41:42]

so that people can have access to medical information,

[01:41:42] : [01:41:46]

'cause they don't have access to doctors,

[01:41:46] : [01:41:47]

it's a very small number of doctors per capita in Senegal.

[01:41:47] : [01:41:52]

I mean, you can't have any of this

[01:41:52] : [01:41:55]

unless you have open source platforms.

[01:41:55] : [01:41:58]

So with open source platforms,

[01:41:58] : [01:41:59]

you can have AI systems

[01:41:59] : [01:42:00]

that are not only diverse in terms of political opinions

[01:42:00] : [01:42:02]

or things of that type,

[01:42:02] : [01:42:05]

but in terms of language, culture, value systems,

[01:42:05] : [01:42:10]

political opinions, technical abilities in various domains.

[01:42:10] : [01:42:16]

And you can have an industry,

[01:42:16] : [01:42:20]

an ecosystem of companies

[01:42:20] : [01:42:22]

that fine tune those open source systems

[01:42:22] : [01:42:24]

for vertical applicationsin industry, right?

[01:42:24] : [01:42:26]

You have, I don't know, a publisher has thousands of books

[01:42:26] : [01:42:30]

and they want to build a system

[01:42:30] : [01:42:31]

that allows a customer to just ask a question

[01:42:31] : [01:42:33]

about the content of any of their books.

[01:42:33] : [01:42:37]

You need to train on their proprietary data, right?

[01:42:37] : [01:42:40]

You have a company,

[01:42:40] : [01:42:42]

we have one within Meta, it's called Metamate.

[01:42:42] : [01:42:44]

And it's basically an LLM

[01:42:44] : [01:42:46]

that can answer any question

[01:42:46] : [01:42:47]

about internal stuff about the company.

[01:42:47] : [01:42:52]

Very useful.

[01:42:52] : [01:42:53]

A lot of companies want this, right?

[01:42:53] : [01:42:54]

A lot of companies want this not just for their employees,

[01:42:54] : [01:42:57]

but also for their customers,

[01:42:57] : [01:42:59]

to take care of their customers.

[01:42:59] : [01:43:00]

So the only way you're gonna have an AI industry,

[01:43:00] : [01:43:04]

the only way you're gonna have AI systems

[01:43:04] : [01:43:06]

that are not uniquely biased,

[01:43:06] : [01:43:08]

is if you have open source platforms

[01:43:08] : [01:43:10]

on top of which any group can build specialized systems.

[01:43:10] : [01:43:15]

So the inevitable direction of history

[01:43:15] : [01:43:21]

is that the vast majority of AI systems

[01:43:21] : [01:43:25]

will be built on top of open source platforms.

[01:43:25] : [01:43:28]

- So that's a beautiful vision.

[01:43:28] : [01:43:30]

So meaning like a companylike Meta or Google or so on,

[01:43:30] : [01:43:35]

should take only minimal fine tuning steps

[01:43:35] : [01:43:40]

after building the foundation, pre-trained model.

[01:43:40] : [01:43:44]

As few steps as possible.

[01:43:44] : [01:43:47]

- Basically.

[01:43:47] : [01:43:48]

(Lex sighs)

[01:43:48] : [01:43:49]

- Can Meta afford to do that?

[01:43:49] : [01:43:51]

- No.

[01:43:51] : [01:43:52]

- So I don't know if you know this,

[01:43:52] : [01:43:53]

but companies are supposed to make money somehow.

[01:43:53] : [01:43:56]

And open source is like giving away...

[01:43:56] : [01:44:00]

I don't know, Mark made a video,

[01:44:00] : [01:44:02]

Mark Zuckerberg.

[01:44:02] : [01:44:04]

A very sexy video talking about 350,000 Nvidia H100s.

[01:44:04] : [01:44:08]

The math of that is,

[01:44:08] : [01:44:14]

just for the GPUs, that's a hundred billion,

[01:44:14] : [01:44:17]

plus the infrastructure for training everything.

[01:44:17] : [01:44:22]

So I'm no business guy,

[01:44:22] : [01:44:26]

but how do you make money on that?

[01:44:26] : [01:44:27]

So the vision you paintis a really powerful one,

[01:44:27] : [01:44:30]

but how is it possible to make money?

[01:44:30] : [01:44:32]

- Okay.

[01:44:32] : [01:44:33]

So you have severalbusiness models, right?

[01:44:33] : [01:44:36]

The business model that Meta is built around

[01:44:36] : [01:44:39]

is you offer a service,

[01:44:39] : [01:44:44]

and the financing of that service

[01:44:44] : [01:44:48]

is either through ads or through business customers.

[01:44:48] : [01:44:52]

So for example, if you have an LLM

[01:44:52] : [01:44:54]

that can help a mom-and-pop pizza place

[01:44:54] : [01:44:58]

by talking to their customers through WhatsApp,

[01:44:58] : [01:45:03]

and so the customers can just order a pizza

[01:45:03] : [01:45:05]

and the system will just ask them,

[01:45:05] : [01:45:08]

like what topping do you want or what size, blah blah, blah.

[01:45:08] : [01:45:11]

The business will pay for that.

[01:45:11] : [01:45:14]

Okay?

[01:45:14] : [01:45:14]

That's a model.

[01:45:14] : [01:45:15]

And otherwise, if it's a system

[01:45:15] : [01:45:21]

that is on the more kind of classical services,

[01:45:21] : [01:45:24]

it can be ad supported, or there's several models.

[01:45:24] : [01:45:28]

But the point is,

[01:45:28] : [01:45:29]

if you have a big enough potential customer base

[01:45:29] : [01:45:34]

and you need to build that system anyway for them,

[01:45:34] : [01:45:39]

it doesn't hurt you

[01:45:39] : [01:45:41]

to actually distribute it to open source.

[01:45:41] : [01:45:43]

- Again, I'm no business guy,

[01:45:43] : [01:45:45]

but if you release the open source model,

[01:45:45] : [01:45:48]

then other people can do the same kind of task

[01:45:48] : [01:45:51]

and compete on it.

[01:45:51] : [01:45:52]

Basically provide fine-tuned models for businesses,

[01:45:52] : [01:45:57]

is the bet that Meta is making...

[01:45:57] : [01:45:59]

By the way, I'm a huge fan of all this.

[01:45:59] : [01:46:01]

But is the bet that Meta is making

[01:46:01] : [01:46:03]

is like, "we'll do a better job of it?"

[01:46:03] : [01:46:05]

- Well, no.

[01:46:05] : [01:46:06]

The bet is more,

[01:46:06] : [01:46:08]

we already have a huge user base and customer base.

[01:46:08] : [01:46:13]

- [Lex] Ah, right.- Right?

[01:46:13] : [01:46:13]

So it's gonna be useful to them.

[01:46:13] : [01:46:15]

Whatever we offer them is gonna be useful

[01:46:15] : [01:46:17]

and there is a way to derive revenue from this.

[01:46:17] : [01:46:21]

- [Lex] Sure.

[01:46:21] : [01:46:22]

- And it doesn't hurt

[01:46:22] : [01:46:23]

that we provide that system or the base model, right?

[01:46:23] : [01:46:28]

The foundation model in open source

[01:46:28] : [01:46:32]

for others to build applications on top of it too.

[01:46:32] : [01:46:35]

If those applications

[01:46:35] : [01:46:36]

turn out to be useful for our customers,

[01:46:36] : [01:46:38]

we can just buy it for them.

[01:46:38] : [01:46:39]

It could be that they will improve the platform.

[01:46:39] : [01:46:44]

In fact, we see this already.

[01:46:44] : [01:46:46]

I mean there are literally millions of downloads of LLaMA 2

[01:46:46] : [01:46:50]

and thousands of people who have provided ideas

[01:46:50] : [01:46:53]

about how to make it better.

[01:46:53] : [01:46:55]

So this clearly accelerates progress

[01:46:55] : [01:46:58]

to make the system available

[01:46:58] : [01:47:00]

to sort of a wide community of people.

[01:47:00] : [01:47:05]

And there are literally thousands of businesses

[01:47:05] : [01:47:07]

who are building applications with it.

[01:47:07] : [01:47:09]

Meta's ability to derive revenue from this technology

[01:47:09] : [01:47:19]

is not impaired by the distribution

[01:47:19] : [01:47:24]

of base models in open source.

[01:47:24] : [01:47:26]

- The fundamental criticism that Gemini is getting

[01:47:26] : [01:47:28]

is that, as you pointed out, on the west coast...

[01:47:28] : [01:47:31]

Just to clarify,

[01:47:31] : [01:47:32]

we're currently in the east coast,

[01:47:32] : [01:47:34]

where I would suppose Meta AI headquarters would be.

[01:47:34] : [01:47:38]

(laughs)

[01:47:38] : [01:47:39]

So strong words about the west coast.

[01:47:39] : [01:47:42]

But I guess the issue that happens is,

[01:47:42] : [01:47:46]

I think it's fair to say that most tech people

[01:47:46] : [01:47:50]

have a political affiliation with the left wing.

[01:47:50] : [01:47:53]

They lean left.

[01:47:53] : [01:47:55]

And so the problem that people are criticizing Gemini with

[01:47:55] : [01:47:58]

is that in that de-biasing process that you mentioned,

[01:47:58] : [01:48:02]

that their ideological lean becomes obvious.

[01:48:02] : [01:48:07]

Is this something that could be escaped?

[01:48:07] : [01:48:14]

You're saying open source is the only way?

[01:48:14] : [01:48:16]

- [Yann] Yeah.

[01:48:16] : [01:48:17]

- Have you witnessed this kind of ideological lean

[01:48:17] : [01:48:19]

that makes engineering difficult?

[01:48:19] : [01:48:22]

- No, I don't think it has to do...

[01:48:22] : [01:48:24]

I don't think the issue has to do

[01:48:24] : [01:48:25]

with the political leaning

[01:48:25] : [01:48:26]

of the people designing those systems.

[01:48:26] : [01:48:29]

It has to do with the acceptability or political leanings

[01:48:29] : [01:48:34]

of their customer base or audience, right?

[01:48:34] : [01:48:38]

So a big company cannot afford to offend too many people.

[01:48:38] : [01:48:43]

So they're going to make sure

[01:48:43] : [01:48:46]

that whatever product they put out is "safe,"

[01:48:46] : [01:48:49]

whatever that means.

[01:48:49] : [01:48:50]

And it's very possible to overdo it.

[01:48:50] : [01:48:55]

And it's also very possible to...

[01:48:55] : [01:48:58]

It's impossible to do it properly for everyone.

[01:48:58] : [01:49:00]

You're not going to satisfy everyone.

[01:49:00] : [01:49:02]

So that's what I said before,

[01:49:02] : [01:49:03]

you cannot have a system that is unbiased

[01:49:03] : [01:49:05]

and is perceived as unbiased by everyone.

[01:49:05] : [01:49:07]

It's gonna be,

[01:49:07] : [01:49:09]

you push it in one way,

[01:49:09] : [01:49:11]

one set of people are gonna see it as biased.

[01:49:11] : [01:49:14]

And then you push it the other way

[01:49:14] : [01:49:15]

and another set of people is gonna see it as biased.

[01:49:15] : [01:49:18]

And then in addition to this,

[01:49:18] : [01:49:19]

there's the issue of if you push the system

[01:49:19] : [01:49:22]

perhaps a little too far in one direction,

[01:49:22] : [01:49:24]

it's gonna be non-factual, right?

[01:49:24] : [01:49:25]

You're gonna have black Nazi soldiers in-

[01:49:25] : [01:49:30]

- Yeah.

[01:49:30] : [01:49:31]

So we should mention image generation

[01:49:31] : [01:49:34]

of black Nazi soldiers,

[01:49:34] : [01:49:36]

which is not factually accurate.

[01:49:36] : [01:49:38]

- Right.

[01:49:38] : [01:49:39]

And can be offensive forsome people as well, right?

[01:49:39] : [01:49:42]

So it's gonna be impossible

[01:49:42] : [01:49:46]

to kind of produce systems that are unbiased for everyone.

[01:49:46] : [01:49:49]

So the only solution that I see is diversity.

[01:49:49] : [01:49:53]

- And diversity in the full meaning of that word,

[01:49:53] : [01:49:55]

diversity in every possible way.

[01:49:55] : [01:49:57]

- [Yann] Yeah.

[01:49:57] : [01:49:58]

- Marc Andreessen just tweeted today,

[01:49:58] : [01:50:02]

let me do a TL;DR.

[01:50:02] : [01:50:06]

The conclusion is only startups and open source

[01:50:06] : [01:50:08]

can avoid the issue that he's highlighting with big tech.

[01:50:08] : [01:50:12]

He's asking,

[01:50:12] : [01:50:14]

can big tech actually field generative AI products?

[01:50:14] : [01:50:17]

One, ever-escalating demands from internal activists,

[01:50:17] : [01:50:20]

employee mobs, crazed executives,

[01:50:20] : [01:50:23]

broken boards, pressure groups,

[01:50:23] : [01:50:25]

extremist regulators, government agencies, the press,

[01:50:25] : [01:50:28]

in quotes "experts,"

[01:50:28] : [01:50:30]

and everything corrupting the output.

[01:50:30] : [01:50:34]

Two, constant risk of generating a bad answer

[01:50:34] : [01:50:36]

or drawing a bad picture or rendering a bad video.

[01:50:36] : [01:50:40]

Who knows what it's going to say or do at any moment?

[01:50:40] : [01:50:44]

Three, legal exposure,product liability, slander,

[01:50:44] : [01:50:48]

election law, many other things and so on.

[01:50:48] : [01:50:51]

Anything that makes Congress mad.

[01:50:51] : [01:50:53]

Four, continuous attempts

[01:50:53] : [01:50:56]

to tighten grip on acceptable output,

[01:50:56] : [01:50:58]

degrade the model,

[01:50:58] : [01:50:59]

like how good it actually is

[01:50:59] : [01:51:01]

in terms of usable and pleasant to use and effective

[01:51:01] : [01:51:05]

and all that kind of stuff.

[01:51:05] : [01:51:06]

And five, publicity of bad text, images, video,

[01:51:06] : [01:51:10]

actually puts those examples into the training data

[01:51:10] : [01:51:13]

for the next version.

[01:51:13] : [01:51:14]

And so on.

[01:51:14] : [01:51:15]

So he just highlights how difficult this is.

[01:51:15] : [01:51:18]

From all kinds of people being unhappy.

[01:51:18] : [01:51:21]

He just said you can't create a system

[01:51:21] : [01:51:23]

that makes everybody happy.

[01:51:23] : [01:51:24]

- [Yann] Yes.

[01:51:24] : [01:51:25]

- So if you're going to do the fine-tuning yourself

[01:51:25] : [01:51:29]

and keep it closed source,

[01:51:29] : [01:51:30]

essentially the problem there

[01:51:30] : [01:51:33]

is then trying to minimize the number of people

[01:51:33] : [01:51:35]

who are going to be unhappy.

[01:51:35] : [01:51:36]

- [Yann] Yeah.

[01:51:36] : [01:51:38]

- And you're saying like the only...

[01:51:38] : [01:51:39]

That that's almost impossible to do, right?

[01:51:39] : [01:51:42]

And the better way is to do open source.

[01:51:42] : [01:51:44]

- Basically, yeah.

[01:51:44] : [01:51:46]

I mean Marc is right about a number of things that he lists

[01:51:46] : [01:51:51]

that indeed scare large companies.

[01:51:51] : [01:51:55]

Certainly, congressional investigations is one of them.

[01:51:55] : [01:52:00]

Legal liability.

[01:52:00] : [01:52:01]

Making things

[01:52:01] : [01:52:05]

that get people to hurt themselves or hurt others.

[01:52:05] : [01:52:09]

Like big companies are really careful

[01:52:09] : [01:52:12]

about not producing things of this type,

[01:52:12] : [01:52:15]

because they have...

[01:52:15] : [01:52:19]

They don't want to hurt anyone, first of all.

[01:52:19] : [01:52:21]

And then second, they wanna preserve their business.

[01:52:21] : [01:52:23]

So it's essentially impossible for systems like this

[01:52:23] : [01:52:26]

that can inevitably formulate political opinions

[01:52:26] : [01:52:30]

and opinions about various things

[01:52:30] : [01:52:32]

that may be political or not,

[01:52:32] : [01:52:34]

but that people may disagree about.

[01:52:34] : [01:52:36]

About, you know, moral issues

[01:52:36] : [01:52:37]

and things about like questions about religion

[01:52:37] : [01:52:42]

and things like that, right?

[01:52:42] : [01:52:44]

Or cultural issues

[01:52:44] : [01:52:46]

that people from different communities

[01:52:46] : [01:52:48]

would disagree with in the first place.

[01:52:48] : [01:52:50]

So there's only kind of a relatively small number of things

[01:52:50] : [01:52:52]

that people will sort of agree on,

[01:52:52] : [01:52:55]

basic principles.

[01:52:55] : [01:52:57]

But beyond that,

[01:52:57] : [01:52:58]

if you want those systems to be useful,

[01:52:58] : [01:53:01]

they will necessarily have to offend a number of people,

[01:53:01] : [01:53:06]

inevitably.

[01:53:06] : [01:53:08]

- And so open source is just better-

[01:53:08] : [01:53:11]

- [Yann] Diversity is better, right?

[01:53:11] : [01:53:12]

- And open source enables diversity.

[01:53:12] : [01:53:15]

- That's right.

[01:53:15] : [01:53:16]

Open source enables diversity.

[01:53:16] : [01:53:17]

- This can be a fascinating world

[01:53:17] : [01:53:19]

where if it's true that the open source world,

[01:53:19] : [01:53:22]

if Meta leads the way

[01:53:22] : [01:53:24]

and creates this kind of open source foundation model world,

[01:53:24] : [01:53:27]

there's going to be,

[01:53:27] : [01:53:28]

like governments will have a fine-tuned model. (laughing)

[01:53:28] : [01:53:31]

- [Yann] Yeah.

[01:53:31] : [01:53:33]

- And then potentially,

[01:53:33] : [01:53:34]

people that vote left and right

[01:53:34] : [01:53:39]

will have their own model and preference

[01:53:39] : [01:53:40]

to be able to choose.

[01:53:40] : [01:53:42]

And it will potentially divide us even more

[01:53:42] : [01:53:44]

but that's on us humans.

[01:53:44] : [01:53:46]

We get to figure out...

[01:53:46] : [01:53:48]

Basically the technology enables humans

[01:53:48] : [01:53:50]

to human more effectively.

[01:53:50] : [01:53:53]

And all the difficult ethical questions that humans raise

[01:53:53] : [01:53:57]

we'll just leave it up to us to figure that out.

[01:53:57] : [01:54:02]

- Yeah, I mean there are some limits to what...

[01:54:02] : [01:54:04]

The same way there are limits to free speech,

[01:54:04] : [01:54:06]

there has to be some limit to the kind of stuff

[01:54:06] : [01:54:08]

that those systems might be authorized to produce,

[01:54:08] : [01:54:13]

some guardrails.

[01:54:13] : [01:54:16]

So I mean, that's one thing I've been interested in,

[01:54:16] : [01:54:18]

which is in the type of architecture

[01:54:18] : [01:54:20]

that we were discussing before,

[01:54:20] : [01:54:22]

where the output of the system

[01:54:22] : [01:54:26]

is a result of an inference to satisfy an objective.

[01:54:26] : [01:54:29]

That objective can include guardrails.

[01:54:29] : [01:54:32]

And we can put guardrails in open source systems.

[01:54:32] : [01:54:37]

I mean, if we eventually have systems

[01:54:37] : [01:54:39]

that are built with this blueprint,

[01:54:39] : [01:54:41]

we can put guardrails in those systems

[01:54:41] : [01:54:43]

that guarantee

[01:54:43] : [01:54:44]

that there is sort of a minimum set of guardrails

[01:54:44] : [01:54:47]

that make the system non-dangerous and non-toxic, et cetera.

[01:54:47] : [01:54:50]

Basic things that everybody would agree on.

[01:54:50] : [01:54:53]

And then the fine-tuning that people will add

[01:54:53] : [01:54:57]

or the additional guardrails that people will add

[01:54:57] : [01:54:59]

will kind of cater to their community, whatever it is.

[01:54:59] : [01:55:04]

- And yeah, the fine tuning

[01:55:04] : [01:55:06]

would be more about the gray areas of what is hate speech,

[01:55:06] : [01:55:09]

what is dangerous and all that kind of stuff.

[01:55:09] : [01:55:11]

I mean, you've-

[01:55:11] : [01:55:12]

- [Yann] Or different value systems.

[01:55:12] : [01:55:13]

- Different value systems.

[01:55:13] : [01:55:14]

But still even with the objectives

[01:55:14] : [01:55:16]

of how to build a bio weapon, for example,

[01:55:16] : [01:55:18]

I think something you've commented on,

[01:55:18] : [01:55:20]

or at least there's a paper

[01:55:20] : [01:55:23]

where a collection of researchers

[01:55:23] : [01:55:24]

is trying to understand the social impacts of these LLMs.

[01:55:24] : [01:55:28]

And I guess one threshold that's nice

[01:55:28] : [01:55:31]

is like does the LLM make it any easier than a search would,

[01:55:31] : [01:55:36]

like a Google search would?

[01:55:36] : [01:55:39]

- Right.

[01:55:39] : [01:55:40]

So the increasing number of studies on this

[01:55:40] : [01:55:44]

seems to point to the fact that it doesn't help.

[01:55:44] : [01:55:49]

So having an LLM doesn't help you

[01:55:49] : [01:55:52]

design or build a bio weapon or a chemical weapon

[01:55:52] : [01:55:57]

if you already have access toa search engine and a library.

[01:55:57] : [01:56:01]

And so the sort of increased information you get

[01:56:01] : [01:56:04]

or the ease with which you get it doesn't really help you.

[01:56:04] : [01:56:07]

That's the first thing.

[01:56:07] : [01:56:08]

The second thing is,

[01:56:08] : [01:56:10]

it's one thing to have a list of instructions

[01:56:10] : [01:56:12]

of how to make a chemical weapon,for example, a bio weapon.

[01:56:12] : [01:56:17]

It's another thing to actually build it.

[01:56:17] : [01:56:19]

And it's much harder than you might think,

[01:56:19] : [01:56:21]

and an LLM will not help you with that.

[01:56:21] : [01:56:23]

In fact, nobody in the world,

[01:56:23] : [01:56:27]

not even like countries use bio weapons

[01:56:27] : [01:56:29]

because most of the time they have no idea

[01:56:29] : [01:56:31]

how to protect their own populations against it.

[01:56:31] : [01:56:34]

So it's too dangerous actually to kind of ever use.

[01:56:34] : [01:56:39]

And it's in fact banned by international treaties.

[01:56:39] : [01:56:43]

Chemical weapons is different.

[01:56:43] : [01:56:45]

It's also banned by treaties,

[01:56:45] : [01:56:47]

but it's the same problem.

[01:56:47] : [01:56:50]

It's difficult to use

[01:56:50] : [01:56:51]

in situations that don't turn against the perpetrators.

[01:56:51] : [01:56:56]

But we could ask Elon Musk.

[01:56:56] : [01:56:57]

Like I can give you a very precise list of instructions

[01:56:57] : [01:57:01]

of how you build a rocket engine.

[01:57:01] : [01:57:03]

And even if you have a team of 50 engineers

[01:57:03] : [01:57:06]

that are really experienced building it,

[01:57:06] : [01:57:08]

you're still gonna have to blow up a dozen of them

[01:57:08] : [01:57:10]

before you get one that works.

[01:57:10] : [01:57:11]

And it's the same with chemical weapons or bio weapons

[01:57:11] : [01:57:18]

or things like this.

[01:57:18] : [01:57:19]

It requires expertise in the real world

[01:57:19] : [01:57:23]

that the LLM is not gonna help you with.

[01:57:23] : [01:57:25]

- And it requires even the common sense expertise

[01:57:25] : [01:57:28]

that we've been talking about,

[01:57:28] : [01:57:29]

which is how to take language-based instructions

[01:57:29] : [01:57:34]

and materialize them in the physical world

[01:57:34] : [01:57:36]

requires a lot of knowledge that's not in the instructions.

[01:57:36] : [01:57:41]

- Yeah, exactly.

[01:57:41] : [01:57:42]

A lot of biologists have posted on this actually

[01:57:42] : [01:57:44]

in response to those things

[01:57:44] : [01:57:45]

saying like do you realize how hard it is

[01:57:45] : [01:57:47]

to actually do the lab work?

[01:57:47] : [01:57:49]

Like this is not trivial.

[01:57:49] : [01:57:50]

- Yeah.

[01:57:50] : [01:57:52]

And that's where Hans Moravec comes to light once again.

[01:57:52] : [01:57:56]

Just to linger on LLaMA.

[01:57:56] : [01:57:59]

Mark announced that LLaMA 3 is coming out eventually,

[01:57:59] : [01:58:01]

I don't think there's a release date,

[01:58:01] : [01:58:03]

but what are you most excited about?

[01:58:03] : [01:58:06]

First of all, LLaMA 2 that's already out there,

[01:58:06] : [01:58:09]

and maybe the future LLaMA 3, 4, 5, 6, 10,

[01:58:09] : [01:58:12]

just the future of the open source under Meta?

[01:58:12] : [01:58:15]

- Well, a number of things.

[01:58:15] : [01:58:18]

So there's gonna be like various versions of LLaMA

[01:58:18] : [01:58:22]

that are improvements of previous LLaMAs.

[01:58:22] : [01:58:26]

Bigger, better, multimodal,things like that.

[01:58:26] : [01:58:30]

And then in future generations,

[01:58:30] : [01:58:32]

systems that are capable of planning,

[01:58:32] : [01:58:34]

that really understand how the world works,

[01:58:34] : [01:58:36]

maybe are trained from video so they have some world model.

[01:58:36] : [01:58:39]

Maybe capable of the type of reasoning and planning

[01:58:39] : [01:58:42]

I was talking about earlier.

[01:58:42] : [01:58:44]

Like how long is that gonna take?

[01:58:44] : [01:58:45]

Like when is the research that is going in that direction

[01:58:45] : [01:58:48]

going to sort of feed into the product line, if you want,

[01:58:48] : [01:58:52]

of LLaMA?

[01:58:52] : [01:58:53]

I don't know, I can't tell you.

[01:58:53] : [01:58:54]

And there's a few breakthroughs

[01:58:54] : [01:58:56]

that we have to basically go through

[01:58:56] : [01:58:59]

before we can get there.

[01:58:59] : [01:59:01]

But you'll be able to monitor our progress

[01:59:01] : [01:59:03]

because we publish our research, right?

[01:59:03] : [01:59:07]

So last week we published the V-JEPA work,

[01:59:07] : [01:59:11]

which is sort of a first step

[01:59:11] : [01:59:13]

towards training systems from video.

[01:59:13] : [01:59:15]

And then the next stepis gonna be world models

[01:59:15] : [01:59:18]

based on kind of this type of idea,

[01:59:18] : [01:59:21]

training from video.

[01:59:21] : [01:59:23]

There's similar work at DeepMind also taking place,

[01:59:23] : [01:59:28]

and also at UC Berkeley on world models and video.

[01:59:28] : [01:59:33]

A lot of people are working on this.

[01:59:33] : [01:59:35]

I think a lot of good ideas are appearing.

[01:59:35] : [01:59:38]

My bet is that those systems are gonna be JEPA-like,

[01:59:38] : [01:59:41]

they're not gonna be generative models.

[01:59:41] : [01:59:43]

And we'll see what the future will tell.

[01:59:43] : [01:59:49]

There's really good work at...

[01:59:49] : [01:59:52]

A gentleman called Danijar Hafner, who is now at DeepMind,

[01:59:52] : [01:59:56]

who's worked on kind of models of this type

[01:59:56] : [01:59:58]

that learn representations

[01:59:58] : [02:00:00]

and then use them for planning or learning tasks

[02:00:00] : [02:00:02]

by reinforcement training.

[02:00:02] : [02:00:04]

And a lot of work at Berkeley

[02:00:04] : [02:00:07]

by Pieter Abbeel, Sergey Levine,

[02:00:07] : [02:00:11]

a bunch of other people of that type.

[02:00:11] : [02:00:13]

I'm collaborating with actually

[02:00:13] : [02:00:14]

in the context of some grants with my NYU hat.

[02:00:14] : [02:00:18]

And then collaborations also through Meta,

[02:00:18] : [02:00:22]

'cause the lab at Berkeley

[02:00:22] : [02:00:24]

is associated with Meta in some way, with FAIR.

[02:00:24] : [02:00:28]

So I think it's very exciting.

[02:00:28] : [02:00:29]

I think I'm super excited about...

[02:00:29] : [02:00:34]

I haven't been that excited

[02:00:34] : [02:00:35]

about like the direction of machine learning and AI

[02:00:35] : [02:00:38]

since 10 years ago when FAIR was started,

[02:00:38] : [02:00:41]

and before that, 30 years ago,

[02:00:41] : [02:00:44]

when we were working on,

[02:00:44] : [02:00:45]

sorry 35,

[02:00:45] : [02:00:46]

on convolutional nets and the early days of neural nets.

[02:00:46] : [02:00:51]

So I'm super excited

[02:00:51] : [02:00:54]

because I see a path towards

[02:00:54] : [02:00:57]

potentially human level intelligence

[02:00:57] : [02:00:59]

with systems that can understand the world,

[02:00:59] : [02:01:04]

remember, plan, reason.

[02:01:04] : [02:01:06]

There is some set of ideas to make progress there

[02:01:06] : [02:01:09]

that might have a chance of working.

[02:01:09] : [02:01:12]

And I'm really excited about this.

[02:01:12] : [02:01:14]

What I like is that

[02:01:14] : [02:01:15]

somehow we get onto like a good direction

[02:01:15] : [02:01:20]

and perhaps succeed before my brain turns to a white sauce

[02:01:20] : [02:01:24]

or before I need to retire.

[02:01:24] : [02:01:26]

(laughs)

[02:01:26] : [02:01:28]

- Yeah.

[02:01:28] : [02:01:29]

Yeah.

[02:01:29] : [02:01:30]

Are you also excited by...

[02:01:30] : [02:01:32]

Is it beautiful to you just the amount of GPUs involved,

[02:01:32] : [02:01:38]

sort of the whole training process on this much compute?

[02:01:38] : [02:01:42]

Just zooming out,

[02:01:42] : [02:01:43]

just looking at earth and humans together

[02:01:43] : [02:01:47]

have built these computing devices

[02:01:47] : [02:01:49]

and are able to train this one brain,

[02:01:49] : [02:01:52]

we then open source.

[02:01:52] : [02:01:56]

(laughs)

[02:01:56] : [02:01:57]

Like giving birth to this open source brain

[02:01:57] : [02:02:01]

trained on this gigantic compute system.

[02:02:01] : [02:02:04]

There's just the details of how to train on that,

[02:02:04] : [02:02:07]

how to build the infrastructure and the hardware,

[02:02:07] : [02:02:10]

the cooling, all of this kind of stuff.

[02:02:10] : [02:02:12]

Or is most of your excitement still

[02:02:12] : [02:02:14]

in the theory aspect of it?

[02:02:14] : [02:02:16]

Meaning like the software.

[02:02:16] : [02:02:19]

- Well, I used to be a hardware guy many years ago.

[02:02:19] : [02:02:21]

(laughs)- Yes, yes, that's right.

[02:02:21] : [02:02:22]

- Decades ago.

[02:02:22] : [02:02:23]

- Hardware has improved a little bit.

[02:02:23] : [02:02:25]

Changed a little bit, yeah.

[02:02:25] : [02:02:27]

- I mean, certainly scale is necessary but not sufficient.

[02:02:27] : [02:02:32]

- [Lex] Absolutely.

[02:02:32] : [02:02:33]

- So we certainly need computation.

[02:02:33] : [02:02:34]

I mean, we're still far in terms of compute power

[02:02:34] : [02:02:37]

from what we would need

[02:02:37] : [02:02:39]

to match the compute power of the human brain.

[02:02:39] : [02:02:42]

This may occur in the next couple decades,

[02:02:42] : [02:02:45]

but we're still some ways away.

[02:02:45] : [02:02:47]

And certainly in termsof power efficiency,

[02:02:47] : [02:02:49]

we're really far.

[02:02:49] : [02:02:50]

So a lot of progress to make in hardware.

[02:02:50] : [02:02:56]

And right now a lot of the progress is not...

[02:02:56] : [02:03:00]

I mean, there's a bit coming from silicon technology,

[02:03:00] : [02:03:03]

but a lot of it coming from architectural innovation

[02:03:03] : [02:03:06]

and quite a bit coming from like more efficient ways

[02:03:06] : [02:03:10]

of implementing the architectures that have become popular.

[02:03:10] : [02:03:13]

Basically a combination of transformers and ConvNets, right?

[02:03:13] : [02:03:17]

And so there's still some ways to go

[02:03:17] : [02:03:22]

until we are going to saturate.

[02:03:22] : [02:03:27]

We're gonna have to come up

[02:03:27] : [02:03:28]

with like new principles, new fabrication technology,

[02:03:28] : [02:03:31]

new basic components,

[02:03:31] : [02:03:34]

perhaps based on sort of different principles

[02:03:34] : [02:03:38]

than those classical digital CMOS.

[02:03:38] : [02:03:41]

- Interesting.

[02:03:41] : [02:03:42]

So you think in order to build AMI, "ami,"

[02:03:42] : [02:03:46]

we potentially might need some hardware innovation too?

[02:03:46] : [02:03:52]

- Well, if we wanna make it ubiquitous,

[02:03:52] : [02:03:55]

yeah, certainly.

[02:03:55] : [02:03:56]

Because we're gonna have to reduce the power consumption.

[02:03:56] : [02:04:01]

A GPU today, right?

[02:04:01] : [02:04:03]

Is half a kilowatt to a kilowatt.

[02:04:03] : [02:04:05]

Human brain is about 25 watts.

[02:04:05] : [02:04:08]

And the GPU is way below the power of the human brain.

[02:04:08] : [02:04:13]

You need something like a hundred thousand

[02:04:13] : [02:04:14]

or a million to match it.

[02:04:14] : [02:04:16]

So we are off by a huge factor.

[02:04:16] : [02:04:19]

- You often say that AGI is not coming soon.

[02:04:19] : [02:04:26]

Meaning like not this year, not the next few years,

[02:04:26] : [02:04:30]

potentially farther away.

[02:04:30] : [02:04:32]

What's your basic intuition behind that?

[02:04:32] : [02:04:35]

- So first of all, it's not gonna be an event, right?

[02:04:35] : [02:04:39]

The idea somehow

[02:04:39] : [02:04:40]

which is popularized by science fiction in Hollywood

[02:04:40] : [02:04:42]

that somehow somebody is gonna discover the secret,

[02:04:42] : [02:04:47]

the secret to AGI or human-level AI or AMI,

[02:04:47] : [02:04:50]

whatever you wanna call it,

[02:04:50] : [02:04:52]

and then turn on a machine and then we have AGI.

[02:04:52] : [02:04:55]

That's just not going to happen.

[02:04:55] : [02:04:57]

It's not going to be an event.

[02:04:57] : [02:04:58]

It's gonna be gradual progress.

[02:04:58] : [02:05:02]

Are we gonna have systems

[02:05:02] : [02:05:04]

that can learn from video how the world works

[02:05:04] : [02:05:07]

and learn good representations?

[02:05:07] : [02:05:09]

Yeah.

[02:05:09] : [02:05:10]

Before we get them to the scale and performance

[02:05:10] : [02:05:13]

that we observe in humans,

[02:05:13] : [02:05:14]

it's gonna take quite a while.

[02:05:14] : [02:05:15]

It's not gonna happen in one day.

[02:05:15] : [02:05:17]

Are we gonna get systems

[02:05:17] : [02:05:20]

that can have large amounts of associative memories

[02:05:20] : [02:05:24]

so they can remember stuff?

[02:05:24] : [02:05:26]

Yeah.

[02:05:26] : [02:05:27]

But same, it's not gonna happen tomorrow.

[02:05:27] : [02:05:28]

I mean, there is some basic techniques

[02:05:28] : [02:05:30]

that need to be developed.

[02:05:30] : [02:05:31]

We have a lot of them,

[02:05:31] : [02:05:32]

but like to get this to work together with a full system

[02:05:32] : [02:05:36]

is another story.

[02:05:36] : [02:05:37]

Are we gonna have systemsthat can reason and plan,

[02:05:37] : [02:05:39]

perhaps along the lines of objective-driven AI architectures

[02:05:39] : [02:05:43]

that I described before?

[02:05:43] : [02:05:45]

Yeah, but like before we get this to work properly,

[02:05:45] : [02:05:47]

it's gonna take a while.

[02:05:47] : [02:05:48]

And before we get all thosethings to work together.

[02:05:48] : [02:05:51]

And then on top of this,

[02:05:51] : [02:05:52]

have systems that can learn like hierarchical planning,

[02:05:52] : [02:05:55]

hierarchical representations,

[02:05:55] : [02:05:56]

systems that can be configured

[02:05:56] : [02:05:58]

for a lot of different situations at hand

[02:05:58] : [02:06:00]

the way the human brain can.

[02:06:00] : [02:06:02]

All of this is gonna take at least a decade,

[02:06:02] : [02:06:07]

probably much more,

[02:06:07] : [02:06:08]

because there are a lot of problems

[02:06:08] : [02:06:11]

that we're not seeing right now

[02:06:11] : [02:06:12]

that we have not encountered.

[02:06:12] : [02:06:15]

And so we don't know if there is an easy solution

[02:06:15] : [02:06:17]

within this framework.

[02:06:17] : [02:06:18]

It's not just around the corner.

[02:06:18] : [02:06:23]

I mean, I've been hearing people for the last 12, 15 years

[02:06:23] : [02:06:27]

claiming that AGI is just around the corner

[02:06:27] : [02:06:29]

and being systematically wrong.

[02:06:29] : [02:06:32]

And I knew they were wrong when they were saying it.

[02:06:32] : [02:06:34]

I called it bullshit.

[02:06:34] : [02:06:35]

(laughs)

[02:06:35] : [02:06:36]

- Why do you think people have been calling...

[02:06:36] : [02:06:38]

First of all, I mean,from the beginning of,

[02:06:38] : [02:06:39]

from the birth of the term artificial intelligence,

[02:06:39] : [02:06:41]

there has been an eternal optimism

[02:06:41] : [02:06:45]

that's perhaps unlike other technologies.

[02:06:45] : [02:06:49]

Is it Moravec's paradox?

[02:06:49] : [02:06:51]

Is it the explanation

[02:06:51] : [02:06:53]

for why people are so optimistic about AGI?

[02:06:53] : [02:06:56]

- I don't think it's just Moravec's paradox.

[02:06:56] : [02:06:58]

Moravec's paradox is a consequence

[02:06:58] : [02:07:00]

of realizing that the world is not as easy as we think.

[02:07:00] : [02:07:03]

So first of all, intelligenceis not a linear thing

[02:07:03] : [02:07:08]

that you can measure with a scalar,

[02:07:08] : [02:07:10]

with a single number.

[02:07:10] : [02:07:11]

Can you say that humans are smarter than orangutans?

[02:07:11] : [02:07:17]

In some ways, yes,

[02:07:17] : [02:07:20]

but in some ways orangutans are smarter than humans

[02:07:20] : [02:07:22]

in a lot of domains

[02:07:22] : [02:07:23]

that allow them to survive in the forest, (laughing)

[02:07:23] : [02:07:26]

for example.

[02:07:26] : [02:07:26]

- So IQ is a very limited measure of intelligence.

[02:07:26] : [02:07:30]

True intelligence

[02:07:30] : [02:07:31]

is bigger than what IQ, for example, measures.

[02:07:31] : [02:07:33]

- Well, IQ can measure approximately something for humans,

[02:07:33] : [02:07:38]

but because humans kind of come

[02:07:38] : [02:07:43]

in relatively kind of uniform form, right?

[02:07:43] : [02:07:48]

- [Lex] Yeah.

[02:07:48] : [02:07:49]

- But it only measures one type of ability

[02:07:49] : [02:07:53]

that may be relevant for some tasks, but not others.

[02:07:53] : [02:07:56]

But then if you are talking about other intelligent entities

[02:07:56] : [02:08:02]

for which the basic things that are easy to them

[02:08:02] : [02:08:07]

is very different,

[02:08:07] : [02:08:08]

then it doesn't mean anything.

[02:08:08] : [02:08:11]

So intelligence is a collection of skills

[02:08:11] : [02:08:15]

and an ability to acquire new skills efficiently.

[02:08:15] : [02:08:21]

Right?

[02:08:21] : [02:08:23]

And the collection of skills

[02:08:23] : [02:08:25]

that a particular intelligent entity possesses

[02:08:25] : [02:08:29]

or is capable of learning quickly

[02:08:29] : [02:08:31]

is different from the collection of skills of another one.

[02:08:31] : [02:08:35]

And because it's a multidimensional thing,

[02:08:35] : [02:08:37]

the set of skills is a high-dimensional space,

[02:08:37] : [02:08:39]

you can't measure.

[02:08:39] : [02:08:40]

You cannot compare two things

[02:08:40] : [02:08:42]

as to whether one is more intelligent than the other.

[02:08:42] : [02:08:45]

It's multidimensional.

[02:08:45] : [02:08:46]

- So you push back against what are called AI doomers a lot.

[02:08:46] : [02:08:53]

Can you explain their perspective

[02:08:53] : [02:08:57]

and why you think they're wrong?

[02:08:57] : [02:08:59]

- Okay.

[02:08:59] : [02:09:00]

So AI doomers imagine all kinds of catastrophe scenarios

[02:09:00] : [02:09:03]

of how AI could escape our control

[02:09:03] : [02:09:07]

and basically kill us all. (laughs)

[02:09:07] : [02:09:10]

And that relies on a whole bunch of assumptions

[02:09:10] : [02:09:14]

that are mostly false.

[02:09:14] : [02:09:15]

So the first assumption

[02:09:15] : [02:09:18]

is that the emergence of super intelligence

[02:09:18] : [02:09:20]

could be an event.

[02:09:20] : [02:09:21]

That at some point we're going to figure out the secret

[02:09:21] : [02:09:25]

and we'll turn on a machine that is super intelligent.

[02:09:25] : [02:09:28]

And because we'd never done it before,

[02:09:28] : [02:09:30]

it's gonna take over the world and kill us all.

[02:09:30] : [02:09:33]

That is false.

[02:09:33] : [02:09:33]

It's not gonna be an event.

[02:09:33] : [02:09:35]

We're gonna have systems that are like as smart as a cat,

[02:09:35] : [02:09:39]

have all the characteristics of human-level intelligence,

[02:09:39] : [02:09:44]

but their level of intelligence

[02:09:44] : [02:09:46]

would be like a cat or a parrot maybe or something.

[02:09:46] : [02:09:49]

And then we're gonna walk our way up

[02:09:49] : [02:09:53]

to kind of make those things more intelligent.

[02:09:53] : [02:09:55]

And as we make them more intelligent,

[02:09:55] : [02:09:56]

we're also gonna put some guardrails in them

[02:09:56] : [02:09:58]

and learn how to kind of put some guardrails

[02:09:58] : [02:10:00]

so they behave properly.

[02:10:00] : [02:10:01]

And we're not gonna dothis with just one...

[02:10:01] : [02:10:03]

It's not gonna be one effort,

[02:10:03] : [02:10:04]

but it's gonna be lots of different people doing this.

[02:10:04] : [02:10:07]

And some of them are gonna succeed

[02:10:07] : [02:10:09]

at making intelligent systems that are controllable and safe

[02:10:09] : [02:10:12]

and have the right guardrails.

[02:10:12] : [02:10:14]

And if some others go rogue,

[02:10:14] : [02:10:15]

then we can use the good ones to go against the rogue ones.

[02:10:15] : [02:10:19]

(laughs)

[02:10:19] : [02:10:20]

So it's gonna be smart AI police against your rogue AI.

[02:10:20] : [02:10:24]

So it's not gonna be like we're gonna be exposed

[02:10:24] : [02:10:27]

to like a single rogue AI that's gonna kill us all.

[02:10:27] : [02:10:29]

That's just not happening.

[02:10:29] : [02:10:31]

Now, there is another fallacy,

[02:10:31] : [02:10:33]

which is the fact that because the system is intelligent,

[02:10:33] : [02:10:36]

it necessarily wants to take over.

[02:10:36] : [02:10:38]

And there are several arguments

[02:10:38] : [02:10:43]

that make people scared of this,

[02:10:43] : [02:10:44]

which I think are completely false as well.

[02:10:44] : [02:10:48]

So one of them is in nature,

[02:10:48] : [02:10:53]

it seems to be that the more intelligent species

[02:10:53] : [02:10:54]

are the ones that end up dominating the others.

[02:10:54] : [02:10:58]

And even extinguishing the others

[02:10:58] : [02:11:03]

sometimes by design, sometimes just by mistake.

[02:11:03] : [02:11:06]

And so there is sort of a thinking

[02:11:06] : [02:11:12]

by which you say, well, if AI systems

[02:11:12] : [02:11:15]

are more intelligent than us,

[02:11:15] : [02:11:17]

surely they're going to eliminate us,

[02:11:17] : [02:11:19]

if not by design,

[02:11:19] : [02:11:21]

simply because they don't care about us.

[02:11:21] : [02:11:23]

And that's just preposterousfor a number of reasons.

[02:11:23] : [02:11:27]

First reason is they're not going to be a species.

[02:11:27] : [02:11:30]

They're not gonna be a species that competes with us.

[02:11:30] : [02:11:33]

They're not gonna have the desire to dominate

[02:11:33] : [02:11:35]

because the desire to dominate

[02:11:35] : [02:11:36]

is something that has to be hardwired

[02:11:36] : [02:11:38]

into an intelligent system.

[02:11:38] : [02:11:41]

It is hardwired in humans,

[02:11:41] : [02:11:43]

it is hardwired in baboons,

[02:11:43] : [02:11:46]

in chimpanzees, in wolves,

[02:11:46] : [02:11:47]

not in orangutans.

[02:11:47] : [02:11:49]

The species in which this desire to dominate or submit

[02:11:49] : [02:11:56]

or attain status in other ways

[02:11:56] : [02:11:59]

is specific to social species.

[02:11:59] : [02:12:03]

Non-social species like orangutans don't have it.

[02:12:03] : [02:12:06]

Right?

[02:12:06] : [02:12:07]

And they are as smart as we are, almost.

[02:12:07] : [02:12:09]

Right?

[02:12:09] : [02:12:10]

- And to you, there's not significant incentive

[02:12:10] : [02:12:12]

for humans to encode that into the AI systems.

[02:12:12] : [02:12:15]

And to the degree they do,

[02:12:15] : [02:12:17]

there'll be other AIs that sort of punish them for it.

[02:12:17] : [02:12:22]

Out-compete them over-

[02:12:22] : [02:12:22]

- Well, there's all kinds of incentive

[02:12:22] : [02:12:24]

to make AI systems submissive to humans.

[02:12:24] : [02:12:26]

Right?
- [Lex] Right.

[02:12:26] : [02:12:27]

- I mean, this is the way we're gonna build them, right?

[02:12:27] : [02:12:29]

And so then people say, oh, but look at LLMs.

[02:12:29] : [02:12:32]

LLMs are not controllable.

[02:12:32] : [02:12:33]

And they're right,

[02:12:33] : [02:12:35]

LLMs are not controllable.

[02:12:35] : [02:12:36]

But objective driven AI,

[02:12:36] : [02:12:37]

so systems that derive their answers

[02:12:37] : [02:12:41]

by optimization of an objective

[02:12:41] : [02:12:43]

means they have to optimize this objective,

[02:12:43] : [02:12:45]

and that objective can include guardrails.

[02:12:45] : [02:12:48]

One guardrail is obey humans.

[02:12:48] : [02:12:52]

Another guardrail is don't obey humans

[02:12:52] : [02:12:54]

if it's hurting other humans-

[02:12:54] : [02:12:56]

- I've heard that before somewhere, I don't remember-

[02:12:56] : [02:12:59]

- [Yann] Yes. (Lex laughs)

[02:12:59] : [02:13:00]

Maybe in a book. (laughs)

[02:13:00] : [02:13:01]

- Yeah.

[02:13:01] : [02:13:03]

But speaking of that book,

[02:13:03] : [02:13:04]

could there be unintended consequences also

[02:13:04] : [02:13:08]

from all of this?

[02:13:08] : [02:13:09]

- No, of course.

[02:13:09] : [02:13:09]

So this is not a simple problem, right?

[02:13:09] : [02:13:12]

I mean designing those guardrails

[02:13:12] : [02:13:14]

so that the system behaves properly

[02:13:14] : [02:13:16]

is not gonna be a simple issue

[02:13:16] : [02:13:20]

for which there is a silver bullet,

[02:13:20] : [02:13:22]

for which you have a mathematical proof

[02:13:22] : [02:13:23]

that the system can be safe.

[02:13:23] : [02:13:25]

It's gonna be very progressive,

[02:13:25] : [02:13:27]

iterative design system

[02:13:27] : [02:13:28]

where we put those guardrails

[02:13:28] : [02:13:31]

in such a way that the system behaves properly.

[02:13:31] : [02:13:32]

And sometimes they're going to do something

[02:13:32] : [02:13:35]

that was unexpected because the guardrail wasn't right,

[02:13:35] : [02:13:38]

and we're gonna correct them so that they do it right.

[02:13:38] : [02:13:41]

The idea somehow that we can't get it slightly wrong,

[02:13:41] : [02:13:44]

because if we get it slightly wrong we all die,

[02:13:44] : [02:13:46]

is ridiculous.

[02:13:46] : [02:13:47]

We're just gonna go progressively.

[02:13:47] : [02:13:50]

The analogy I've used many times is turbojet design.

[02:13:50] : [02:13:56]

How did we figure out

[02:13:56] : [02:14:02]

how to make turbojets so unbelievably reliable, right?

[02:14:02] : [02:14:06]

I mean, those are like incredibly complex pieces of hardware

[02:14:06] : [02:14:10]

that run at really high temperatures

[02:14:10] : [02:14:12]

for 20 hours at a time sometimes.

[02:14:12] : [02:14:17]

And we can fly halfway around the world

[02:14:17] : [02:14:20]

on a two-engine jet liner at near the speed of sound.

[02:14:20] : [02:14:25]

Like how incredible is this?

[02:14:25] : [02:14:28]

It is just unbelievable.

[02:14:28] : [02:14:30]

And did we do this

[02:14:30] : [02:14:33]

because we invented like a general principle

[02:14:33] : [02:14:35]

of how to make turbojets safe?

[02:14:35] : [02:14:37]

No, it took decades

[02:14:37] : [02:14:39]

to kind of fine-tune the design of those systems

[02:14:39] : [02:14:40]

so that they were safe.

[02:14:40] : [02:14:43]

Is there a separate group

[02:14:43] : [02:14:46]

within General Electric or Snecma or whatever

[02:14:46] : [02:14:50]

that is specialized in turbojet safety?

[02:14:50] : [02:14:54]

No.

[02:14:54] : [02:14:56]

The design is all about safety.

[02:14:56] : [02:14:58]

Because a better turbojet is also a safer turbojet,

[02:14:58] : [02:15:01]

a more reliable one.

[02:15:01] : [02:15:03]

It's the same for AI.

[02:15:03] : [02:15:04]

Like do you need specific provisions to make AI safe?

[02:15:04] : [02:15:08]

No, you need to make better AI systems

[02:15:08] : [02:15:10]

and they will be safe

[02:15:10] : [02:15:11]

because they are designed to be more useful

[02:15:11] : [02:15:14]

and more controllable.

[02:15:14] : [02:15:16]

- So let's imagine a system,

[02:15:16] : [02:15:17]

AI system that's able to be incredibly convincing

[02:15:17] : [02:15:22]

and can convince you of anything.

[02:15:22] : [02:15:24]

I can at least imagine such a system.

[02:15:24] : [02:15:28]

And I can see such a system be weapon-like,

[02:15:28] : [02:15:33]

because it can control people's minds,

[02:15:33] : [02:15:35]

we're pretty gullible.

[02:15:35] : [02:15:37]

We want to believe a thing.

[02:15:37] : [02:15:38]

And you can have an AI system that controls it

[02:15:38] : [02:15:40]

and you could see governments using that as a weapon.

[02:15:40] : [02:15:43]

So do you think if you imagine such a system,

[02:15:43] : [02:15:47]

there's any parallel to something like nuclear weapons?

[02:15:47] : [02:15:53]

- [Yann] No.

[02:15:53] : [02:15:54]

- So why is that technology different?

[02:15:54] : [02:15:58]

So you're saying there's going to be gradual development?

[02:15:58] : [02:16:01]

- [Yann] Yeah.

[02:16:01] : [02:16:02]

- I mean it might be rapid,

[02:16:02] : [02:16:03]

but they'll be iterative.

[02:16:03] : [02:16:05]

And then we'll be able tokind of respond and so on.

[02:16:05] : [02:16:09]

- So that AI system designed by Vladimir Putin or whatever,

[02:16:09] : [02:16:12]

or his minions (laughing)

[02:16:12] : [02:16:16]

is gonna be like trying to talk to every American

[02:16:16] : [02:16:21]

to convince them to vote for-

[02:16:21] : [02:16:24]

- [Lex] Whoever.

[02:16:24] : [02:16:25]

- Whoever pleases Putin or whatever.

[02:16:25] : [02:16:30]

Or rile people up against each other

[02:16:30] : [02:16:36]

as they've been trying to do.

[02:16:36] : [02:16:37]

They're not gonna be talking to you,

[02:16:37] : [02:16:40]

they're gonna be talkingto your AI assistant

[02:16:40] : [02:16:43]

which is going to be as smart as theirs, right?

[02:16:43] : [02:16:47]

Because as I said, in the future,

[02:16:47] : [02:16:51]

every single one of your interactions with the digital world

[02:16:51] : [02:16:53]

will be mediated by your AI assistant.

[02:16:53] : [02:16:55]

So the first thing you're gonna ask is, is this a scam?

[02:16:55] : [02:16:58]

Like is this thing like telling me the truth?

[02:16:58] : [02:17:00]

Like it's not even going to be able to get to you

[02:17:00] : [02:17:03]

because it's only going to talk to your AI assistant,

[02:17:03] : [02:17:05]

and your AI is not even going to...

[02:17:05] : [02:17:07]

It's gonna be like a spam filter, right?

[02:17:07] : [02:17:10]

You're not even seeing the email, the spam email, right?

[02:17:10] : [02:17:13]

It's automatically put in a folder that you never see.

[02:17:13] : [02:17:16]

It's gonna be the same thing.

[02:17:16] : [02:17:18]

That AI system that tries to convince you of something,

[02:17:18] : [02:17:21]

it's gonna be talking to an AI system

[02:17:21] : [02:17:22]

which is gonna be at least as smart as it.

[02:17:22] : [02:17:25]

And is gonna say, this is spam. (laughs)

[02:17:25] : [02:17:29]

It's not even going to bring it to your attention.

[02:17:29] : [02:17:32]

- So to you it's very difficult for any one AI system

[02:17:32] : [02:17:34]

to take such a big leap ahead

[02:17:34] : [02:17:37]

to where it can convince even the other AI systems?

[02:17:37] : [02:17:40]

So like there's always going to be this kind of race

[02:17:40] : [02:17:43]

where nobody's way ahead?

[02:17:43] : [02:17:46]

- That's the history of the world.

[02:17:46] : [02:17:48]

History of the world

[02:17:48] : [02:17:49]

is whenever there is a progress someplace,

[02:17:49] : [02:17:51]

there is a countermeasure.

[02:17:51] : [02:17:54]

And it's a cat and mouse game.

[02:17:54] : [02:17:57]

- Mostly yes,

[02:17:57] : [02:17:58]

but this is why nuclear weapons are so interesting

[02:17:58] : [02:18:01]

because that was such a powerful weapon

[02:18:01] : [02:18:05]

that it mattered who got it first.

[02:18:05] : [02:18:07]

That you could imagine Hitler, Stalin, Mao

[02:18:07] : [02:18:13]

getting the weapon first

[02:18:13] : [02:18:17]

and that having a different kind of impact on the world

[02:18:17] : [02:18:21]

than the United States getting the weapon first.

[02:18:21] : [02:18:24]

To you, nuclear weapons is like...

[02:18:24] : [02:18:27]

You don't imagine a breakthrough discovery

[02:18:27] : [02:18:32]

and then a Manhattan Project-like effort for AI?

[02:18:32] : [02:18:35]

- No.

[02:18:35] : [02:18:36]

As I said, it's not going to be an event.

[02:18:36] : [02:18:39]

It's gonna be continuous progress.

[02:18:39] : [02:18:41]

And whenever one breakthrough occurs,

[02:18:41] : [02:18:45]

it's gonna be widely disseminated really quickly.

[02:18:45] : [02:18:48]

Probably first within industry.

[02:18:48] : [02:18:51]

I mean, this is not a domain

[02:18:51] : [02:18:52]

where government or military organizations

[02:18:52] : [02:18:55]

are particularly innovative,

[02:18:55] : [02:18:57]

and they're in fact way behind.

[02:18:57] : [02:18:59]

And so this is gonna come from industry.

[02:18:59] : [02:19:02]

And this kind of information disseminates extremely quickly.

[02:19:02] : [02:19:04]

We've seen this over the last few years, right?

[02:19:04] : [02:19:08]

Where you have a new...

[02:19:08] : [02:19:10]

Like even take AlphaGo.

[02:19:10] : [02:19:12]

This was reproduced within three months

[02:19:12] : [02:19:13]

even without like particularly detailed information, right?

[02:19:13] : [02:19:18]

- Yeah.

[02:19:18] : [02:19:18]

This is an industry that's not good at secrecy.

[02:19:18] : [02:19:20]

(laughs)

[02:19:20] : [02:19:21]

- But even if there is,

[02:19:21] : [02:19:22]

just the fact that you know that something is possible

[02:19:22] : [02:19:26]

makes you like realize

[02:19:26] : [02:19:28]

that it's worth investing the time to actually do it.

[02:19:28] : [02:19:31]

You may be the second person to do it, but you'll do it.

[02:19:31] : [02:19:35]

Say for all the innovations

[02:19:35] : [02:19:40]

of self-supervised learning, transformers,

[02:19:40] : [02:19:43]

decoder-only architectures, LLMs.

[02:19:43] : [02:19:46]

I mean those things,

[02:19:46] : [02:19:47]

you don't need to know exactly the details of how they work

[02:19:47] : [02:19:49]

to know that it's possible

[02:19:49] : [02:19:52]

because it's deployed and then it's getting reproduced.

[02:19:52] : [02:19:54]

And then people who work for those companies move.

[02:19:54] : [02:19:59]

They go from one company to another.

[02:19:59] : [02:20:02]

And the information disseminates.

[02:20:02] : [02:20:05]

What makes the success of the US tech industry

[02:20:05] : [02:20:09]

and Silicon Valley in particular, is exactly that,

[02:20:09] : [02:20:11]

is because information circulates really, really quickly

[02:20:11] : [02:20:14]

and disseminates very quickly.

[02:20:14] : [02:20:17]

And so the whole region sort of is ahead

[02:20:17] : [02:20:21]

because of that circulation of information.

[02:20:21] : [02:20:24]

- Maybe just to linger on the psychology of AI doomers.

[02:20:24] : [02:20:28]

You give in the classic Yann LeCun way,

[02:20:28] : [02:20:31]

a pretty good example

[02:20:31] : [02:20:33]

of just when a new technology comes to be,

[02:20:33] : [02:20:36]

you say engineer says,

[02:20:36] : [02:20:38]

"I invented this new thing, I call it a ballpen."

[02:20:38] : [02:20:43]

And then the TwitterSphere responds,

[02:20:43] : [02:20:46]

"OMG people could write horrible things with it

[02:20:46] : [02:20:48]

like misinformation, propaganda, hate speech.

[02:20:48] : [02:20:51]

Ban it now!"

[02:20:51] : [02:20:52]

Then writing doomers come in,

[02:20:52] : [02:20:54]

akin to the AI doomers,

[02:20:54] : [02:20:57]

"imagine if everyone can get a ballpen.

[02:20:57] : [02:21:01]

This could destroy society.

[02:21:01] : [02:21:01]

There should be a law

[02:21:01] : [02:21:03]

against using ballpens to write hate speech,

[02:21:03] : [02:21:05]

regulate ballpens now."

[02:21:05] : [02:21:07]

And then the pencil industry mogul says,

[02:21:07] : [02:21:09]

"yeah, ballpens are very dangerous,

[02:21:09] : [02:21:12]

unlike pencil writing which is erasable,

[02:21:12] : [02:21:15]

ballpen writing stays forever.

[02:21:15] : [02:21:18]

Government should require alicense for a pen manufacturer."

[02:21:18] : [02:21:21]

I mean, this does seem to be part of human psychology

[02:21:21] : [02:21:27]

when it comes up against new technology.

[02:21:27] : [02:21:31]

What deep insights can you speak to about this?

[02:21:31] : [02:21:36]

- Well, there is a natural fear of new technology

[02:21:36] : [02:21:42]

and the impact it can have on society.

[02:21:42] : [02:21:45]

And people have a kind of instinctive reaction

[02:21:45] : [02:21:48]

to the world they know

[02:21:48] : [02:21:52]

being threatened by major transformations

[02:21:52] : [02:21:55]

that are either cultural phenomena

[02:21:55] : [02:21:57]

or technological revolutions.

[02:21:57] : [02:22:01]

And they fear for their culture,

[02:22:01] : [02:22:04]

they fear for their job,

[02:22:04] : [02:22:05]

they fear for the future of their children

[02:22:05] : [02:22:10]

and their way of life, right?

[02:22:10] : [02:22:13]

So any change is feared.

[02:22:13] : [02:22:17]

And you see this along history,

[02:22:17] : [02:22:20]

like any technological revolution or cultural phenomenon

[02:22:20] : [02:22:24]

was always accompanied by groups or reactions in the media

[02:22:24] : [02:22:29]

that basically attributed all the problems,

[02:22:29] : [02:22:36]

the current problems of society

[02:22:36] : [02:22:37]

to that particular change, right?

[02:22:37] : [02:22:40]

Electricity was going to kill everyone at some point.

[02:22:40] : [02:22:44]

The train was going to be a horrible thing

[02:22:44] : [02:22:47]

because you can't breathe past 50 kilometers an hour.

[02:22:47] : [02:22:50]

And so there's a wonderful website

[02:22:50] : [02:22:54]

called the Pessimists Archive, right?

[02:22:54] : [02:22:56]

Which has all those newspaper clips (laughing)

[02:22:56] : [02:22:59]

of all the horrible things people imagined would arrive

[02:22:59] : [02:23:02]

because of either technological innovation

[02:23:02] : [02:23:06]

or a cultural phenomenon.

[02:23:06] : [02:23:09]

Wonderful examples of jazz or comic books

[02:23:09] : [02:23:18]

being blamed for unemployment

[02:23:18] : [02:23:23]

or young people not wanting to work anymore

[02:23:23] : [02:23:25]

and things like that, right?

[02:23:25] : [02:23:27]

And that has existed for centuries.

[02:23:27] : [02:23:30]

And it's knee jerk reactions.

[02:23:30] : [02:23:36]

The question is do we embracechange or do we resist it?

[02:23:36] : [02:23:43]

And what are the real dangers

[02:23:43] : [02:23:47]

as opposed to the imagined ones?

[02:23:47] : [02:23:50]

- So people worry about...

[02:23:50] : [02:23:53]

I think one thing they worry about with big tech,

[02:23:53] : [02:23:55]

something we've been talking about over and over

[02:23:55] : [02:23:58]

but I think worth mentioning again,

[02:23:58] : [02:24:02]

they worry about how powerful AI will be

[02:24:02] : [02:24:05]

and they worry about it

[02:24:05] : [02:24:07]

being in the hands of one centralized power

[02:24:07] : [02:24:09]

of just a handful of central control.

[02:24:09] : [02:24:13]

And so that's the skepticism with big tech.

[02:24:13] : [02:24:16]

These companies can make a huge amount of money

[02:24:16] : [02:24:18]

and control this technology.

[02:24:18] : [02:24:21]

And by so doing,

[02:24:21] : [02:24:24]

take advantage, abuse the little guy in society.

[02:24:24] : [02:24:29]

- Well, that's exactly why we need open source platforms.

[02:24:29] : [02:24:31]

- Yeah.

[02:24:31] : [02:24:32]

I just wanted to... (laughs)

[02:24:32] : [02:24:34]

Nail the point home more and more.

[02:24:34] : [02:24:36]

- [Yann] Yes.

[02:24:36] : [02:24:37]

- So let me ask you on your...

[02:24:37] : [02:24:40]

Like I said, you do get a little bit

[02:24:40] : [02:24:42]

flavorful on the internet.

[02:24:42] : [02:24:46]

Joscha Bach tweeted something that you LOL'd at

[02:24:46] : [02:24:50]

in reference to HAL 9000.

[02:24:50] : [02:24:53]

Quote,

[02:24:53] : [02:24:54]

"I appreciate your argument

[02:24:54] : [02:24:55]

and I fully understand your frustration,

[02:24:55] : [02:24:57]

but whether the pod bay doors should be opened or closed

[02:24:57] : [02:25:01]

is a complex and nuanced issue."

[02:25:01] : [02:25:03]

So you're at the head of Meta AI.

[02:25:03] : [02:25:06]

This is something that really worries me,

[02:25:06] : [02:25:12]

that our AI overlords

[02:25:12] : [02:25:15]

will speak down to us with corporate speak of this nature

[02:25:15] : [02:25:20]

and you sort of resist that with your way of being.

[02:25:20] : [02:25:23]

Is this something you can just comment on

[02:25:23] : [02:25:27]

sort of working at a big company,

[02:25:27] : [02:25:29]

how you can avoid the over-fearing, I suppose,

[02:25:29] : [02:25:34]

the harm created through caution?

[02:25:34] : [02:25:41]

- Yeah.

[02:25:41] : [02:25:42]

Again, I think the answer to this is open source platforms

[02:25:42] : [02:25:45]

and then enabling a widely diverse set of people

[02:25:45] : [02:25:49]

to build AI assistants

[02:25:49] : [02:25:53]

that represent the diversity

[02:25:53] : [02:25:55]

of cultures, opinions, languages,

[02:25:55] : [02:25:57]

and value systems across the world.

[02:25:57] : [02:25:59]

So that you're not bound to just be brainwashed

[02:25:59] : [02:26:04]

by a particular way of thinking

[02:26:04] : [02:26:07]

because of a single AI entity.

[02:26:07] : [02:26:10]

So I mean, I think it's a really, really important question

[02:26:10] : [02:26:13]

for society.

[02:26:13] : [02:26:14]

And the problem I'm seeing,

[02:26:14] : [02:26:16]

which is why I've been so vocal

[02:26:16] : [02:26:21]

and sometimes a little sardonic about it-

[02:26:21] : [02:26:25]

- Never stop.

[02:26:25] : [02:26:26]

Never stop, Yann.

[02:26:26] : [02:26:27]

(both laugh)

[02:26:27] : [02:26:28]

We love it.
- Is because I see the danger

[02:26:28] : [02:26:31]

of this concentration of power

[02:26:31] : [02:26:32]

through proprietary AI systems

[02:26:32] : [02:26:36]

as a much bigger dangerthan everything else.

[02:26:36] : [02:26:39]

That if we really want diversity of opinion AI systems,

[02:26:39] : [02:26:44]

that in the future

[02:26:44] : [02:26:48]

that we'll all be interacting through AI systems,

[02:26:48] : [02:26:52]

we need those to be diverse

[02:26:52] : [02:26:54]

for the preservation of a diversity of ideas

[02:26:54] : [02:26:58]

and creeds and political opinions and whatever,

[02:26:58] : [02:27:03]

and the preservation of democracy.

[02:27:03] : [02:27:07]

And what works against this

[02:27:07] : [02:27:12]

is people who think that for reasons of security,

[02:27:12] : [02:27:15]

we should keep AI systems under lock and key

[02:27:15] : [02:27:19]

because it's too dangerous

[02:27:19] : [02:27:20]

to put it in the hands of everybody

[02:27:20] : [02:27:22]

because it could be used by terrorists or something.

[02:27:22] : [02:27:26]

That would lead to potentially a very bad future

[02:27:26] : [02:27:33]

in which all of our information diet

[02:27:33] : [02:27:38]

is controlled by a small number of companies

[02:27:38] : [02:27:41]

through proprietary systems.

[02:27:41] : [02:27:43]

- So you trust humans with this technology

[02:27:43] : [02:27:47]

to build systems that are on the whole good for humanity?

[02:27:47] : [02:27:52]

- Isn't that what democracy and free speech is all about?

[02:27:52] : [02:27:56]

- I think so.

[02:27:56] : [02:27:57]

- Do you trust institutions to do the right thing?

[02:27:57] : [02:28:00]

Do you trust people to do the right thing?

[02:28:00] : [02:28:03]

And yeah, there's bad people who are gonna do bad things,

[02:28:03] : [02:28:05]

but they're not going to have superior technology

[02:28:05] : [02:28:07]

to the good people.

[02:28:07] : [02:28:08]

So then it's gonna be my good AI against your bad AI, right?

[02:28:08] : [02:28:12]

I mean it's the examples that we were just talking about

[02:28:12] : [02:28:15]

of maybe some rogue country will build some AI system

[02:28:15] : [02:28:20]

that's gonna try to convince everybody

[02:28:20] : [02:28:23]

to go into a civil war or something

[02:28:23] : [02:28:27]

or elect a favorable ruler.

[02:28:27] : [02:28:31]

But then they will have to go past our AI systems, right?

[02:28:31] : [02:28:35]

(laughs)

[02:28:35] : [02:28:36]

- An AI system with a strong Russian accent

[02:28:36] : [02:28:38]

will be trying to convince our-

[02:28:38] : [02:28:40]

- And doesn't put any articles in their sentences.

[02:28:40] : [02:28:42]

(both laugh)

[02:28:42] : [02:28:45]

- Well, it'll be at the very least, absurdly comedic.

[02:28:45] : [02:28:48]

Okay.

[02:28:48] : [02:28:50]

So since we talked about sort of the physical reality,

[02:28:50] : [02:28:55]

I'd love to ask your vision of the future with robots

[02:28:55] : [02:28:58]

in this physical reality.

[02:28:58] : [02:29:00]

So many of the kinds of intelligence

[02:29:00] : [02:29:02]

you've been speaking about

[02:29:02] : [02:29:05]

would empower robots

[02:29:05] : [02:29:06]

to be more effectivecollaborators with us humans.

[02:29:06] : [02:29:10]

So since Tesla's Optimus team

[02:29:10] : [02:29:14]

has been showing us some progress in humanoid robots,

[02:29:14] : [02:29:17]

I think it really reinvigorated the whole industry

[02:29:17] : [02:29:20]

that I think Boston Dynamics has been leading

[02:29:20] : [02:29:22]

for a very, very long time.

[02:29:22] : [02:29:24]

So now there's all kinds of companies,

[02:29:24] : [02:29:25]

Figure AI, obviously Boston Dynamics-

[02:29:25] : [02:29:28]

- [Yann] Unitree.

[02:29:28] : [02:29:29]

- Unitree.

[02:29:29] : [02:29:31]

But there's like a lot of them.

[02:29:31] : [02:29:33]

It's great.

[02:29:33] : [02:29:34]

It's great.

[02:29:34] : [02:29:35]

I mean I love it.

[02:29:35] : [02:29:36]

So do you think there'll be millions of humanoid robots

[02:29:36] : [02:29:42]

walking around soon?

[02:29:42] : [02:29:44]

- Not soon, but it's gonna happen.

[02:29:44] : [02:29:46]

Like the next decade

[02:29:46] : [02:29:47]

I think is gonna be really interesting in robots.

[02:29:47] : [02:29:49]

Like the emergence of the robotics industry

[02:29:49] : [02:29:53]

has been in the waiting for 10, 20 years,

[02:29:53] : [02:29:57]

without really emerging

[02:29:57] : [02:29:58]

other than for like kind of pre-programmed behavior

[02:29:58] : [02:30:01]

and stuff like that.

[02:30:01] : [02:30:03]

And the main issue is, again, Moravec's paradox.

[02:30:03] : [02:30:08]

Like how do we get the systems

[02:30:08] : [02:30:09]

to understand how the world works

[02:30:09] : [02:30:11]

and kind of plan actions?

[02:30:11] : [02:30:13]

And so we can do it for really specialized tasks.

[02:30:13] : [02:30:15]

And the way Boston Dynamics goes about it

[02:30:15] : [02:30:21]

is basically with a lot of handcrafted dynamical models

[02:30:21] : [02:30:25]

and careful planning in advance,

[02:30:25] : [02:30:28]

which is very classical roboticswith a lot of innovation,

[02:30:28] : [02:30:32]

a little bit of perception,

[02:30:32] : [02:30:33]

but it's still not...

[02:30:33] : [02:30:35]

Like they can't build a domestic robot, right?

[02:30:35] : [02:30:38]

And we're still some distance away

[02:30:38] : [02:30:43]

from completely autonomous level five driving.

[02:30:43] : [02:30:46]

And we're certainly very far away

[02:30:46] : [02:30:49]

from having level five autonomous driving

[02:30:49] : [02:30:53]

by a system that can train itself

[02:30:53] : [02:30:55]

by driving 20 hours, like any 17-year-old.

[02:30:55] : [02:30:59]

So until we have, again, world models,

[02:30:59] : [02:31:05]

systems that can train themselves

[02:31:05] : [02:31:09]

to understand how the world works,

[02:31:09] : [02:31:11]

we're not gonna have significant progress in robotics.

[02:31:11] : [02:31:16]

So a lot of the people

[02:31:16] : [02:31:18]

working on robotic hardware at the moment

[02:31:18] : [02:31:21]

are betting or banking

[02:31:21] : [02:31:23]

on the fact that AI

[02:31:23] : [02:31:25]

is gonna make sufficientprogress towards that.

[02:31:25] : [02:31:28]

- And they're hoping to discover a product in it too-

[02:31:28] : [02:31:31]

- [Yann] Yeah.

[02:31:31] : [02:31:32]

- Before you have a really strong world model,

[02:31:32] : [02:31:34]

there'll be an almost strong world model.

[02:31:34] : [02:31:38]

And people are trying to find a product

[02:31:38] : [02:31:40]

in a clumsy robot, I suppose.

[02:31:40] : [02:31:43]

Like not a perfectly efficient robot.

[02:31:43] : [02:31:45]

So there's the factory setting

[02:31:45] : [02:31:46]

where humanoid robots

[02:31:46] : [02:31:48]

can help automate someaspects of the factory.

[02:31:48] : [02:31:51]

I think that's a crazy difficult task

[02:31:51] : [02:31:53]

'cause of all the safety required

[02:31:53] : [02:31:54]

and all this kind of stuff,

[02:31:54] : [02:31:56]

I think in the home is more interesting.

[02:31:56] : [02:31:57]

But then you start to think...

[02:31:57] : [02:32:00]

I think you mentioned loading the dishwasher, right?

[02:32:00] : [02:32:03]

- [Yann] Yeah.

[02:32:03] : [02:32:04]

- Like I suppose that's one of the main problems

[02:32:04] : [02:32:06]

you're working on.

[02:32:06] : [02:32:07]

- I mean there's cleaning up. (laughs)

[02:32:07] : [02:32:10]

- [Lex] Yeah.

[02:32:10] : [02:32:11]

- Cleaning the house,

[02:32:11] : [02:32:13]

clearing up the table after a meal,

[02:32:13] : [02:32:17]

washing the dishes, all those tasks, cooking.

[02:32:17] : [02:32:21]

I mean all the tasks that in principle could be automated

[02:32:21] : [02:32:24]

but are actually incredibly sophisticated,

[02:32:24] : [02:32:26]

really complicated.

[02:32:26] : [02:32:28]

- But even just basic navigation

[02:32:28] : [02:32:29]

around a space full of uncertainty.

[02:32:29] : [02:32:32]

- That sort of works.

[02:32:32] : [02:32:33]

Like you can sort of do this now.

[02:32:33] : [02:32:35]

Navigation is fine.

[02:32:35] : [02:32:37]

- Well, navigation in a way that's compelling to us humans

[02:32:37] : [02:32:40]

is a different thing.

[02:32:40] : [02:32:42]

- Yeah.

[02:32:42] : [02:32:43]

It's not gonna be necessarily...

[02:32:43] : [02:32:45]

I mean we have demos actually

[02:32:45] : [02:32:46]

'cause there is a so-called embodied AI group at FAIR

[02:32:46] : [02:32:51]

and they've been not building their own robots

[02:32:51] : [02:32:55]

but using commercial robots.

[02:32:55] : [02:32:57]

And you can tell the robot dog like go to the fridge

[02:32:57] : [02:33:02]

and they can actually open the fridge

[02:33:02] : [02:33:03]

and they can probably pick up a can in the fridge

[02:33:03] : [02:33:05]

and stuff like that and bring it to you.

[02:33:05] : [02:33:08]

So it can navigate,

[02:33:08] : [02:33:10]

it can grab objects

[02:33:10] : [02:33:12]

as long as it's beentrained to recognize them,

[02:33:12] : [02:33:14]

which vision systems work pretty well nowadays.

[02:33:14] : [02:33:17]

But it's not like a completely general robot

[02:33:17] : [02:33:23]

that would be sophisticated enough

[02:33:23] : [02:33:24]

to do things like clearing up the dinner table.

[02:33:24] : [02:33:28]

(laughs)

[02:33:28] : [02:33:30]

- Yeah, to me that's an exciting future

[02:33:30] : [02:33:33]

of getting humanoid robots.

[02:33:33] : [02:33:35]

Robots in general inthe home more and more

[02:33:35] : [02:33:36]

because it gets humans

[02:33:36] : [02:33:38]

to really directly interact with AI systems

[02:33:38] : [02:33:40]

in the physical space.

[02:33:40] : [02:33:42]

And in so doing it allows us

[02:33:42] : [02:33:44]

to philosophically, psychologically explore

[02:33:44] : [02:33:46]

our relationships with robots.

[02:33:46] : [02:33:47]

It can be really, really interesting.

[02:33:47] : [02:33:50]

So I hope you make progress on the whole JEPA thing soon.

[02:33:50] : [02:33:54]

(laughs)

[02:33:54] : [02:33:55]

- Well, I mean, I hope things can work as planned.

[02:33:55] : [02:33:58]

I mean, again, we've been like kinda working on this idea

[02:33:58] : [02:34:03]

of self-supervised learning from video for 10 years.

[02:34:03] : [02:34:07]

And only made significant progress in the last two or three.

[02:34:07] : [02:34:12]

- And actually you've mentioned

[02:34:12] : [02:34:13]

that there's a lot of interesting breakthroughs

[02:34:13] : [02:34:15]

that can happen without having access to a lot of compute.

[02:34:15] : [02:34:18]

So if you're interested in doing a PhD

[02:34:18] : [02:34:20]

in this kind of stuff,

[02:34:20] : [02:34:21]

there's a lot of possibilities still

[02:34:21] : [02:34:23]

to do innovative work.

[02:34:23] : [02:34:25]

So like what advice would you give

[02:34:25] : [02:34:26]

to an undergrad that's looking to go to grad school

[02:34:26] : [02:34:30]

and do a PhD?

[02:34:30] : [02:34:32]

- So basically, I've listed them already.

[02:34:32] : [02:34:35]

This idea of how do you train a world model by observation?

[02:34:35] : [02:34:38]

And you don't have to train necessarily

[02:34:38] : [02:34:41]

on gigantic data sets.

[02:34:41] : [02:34:43]

I mean, it could turn out to be necessary

[02:34:43] : [02:34:47]

to actually train on large data sets

[02:34:47] : [02:34:48]

to have emergent properties like we have with LLMs.

[02:34:48] : [02:34:51]

But I think there are a lot of good ideas that can be done

[02:34:51] : [02:34:53]

without necessarily scaling up.

[02:34:53] : [02:34:56]

Then there is how do you do planning

[02:34:56] : [02:34:58]

with a learned world model?

[02:34:58] : [02:35:00]

If the world the system evolves in

[02:35:00] : [02:35:02]

is not the physical world,

[02:35:02] : [02:35:03]

but is the world of let's say the internet

[02:35:03] : [02:35:06]

or some sort of world

[02:35:06] : [02:35:09]

of where an action consists

[02:35:09] : [02:35:11]

in doing a search in a search engine

[02:35:11] : [02:35:13]

or interrogating a database,

[02:35:13] : [02:35:14]

or running a simulation

[02:35:14] : [02:35:18]

or calling a calculator

[02:35:18] : [02:35:19]

or solving a differential equation,

[02:35:19] : [02:35:21]

how do you get a system

[02:35:21] : [02:35:23]

to actually plan a sequence of actions

[02:35:23] : [02:35:25]

to give the solution to a problem?

[02:35:25] : [02:35:28]
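The kind of planning described here, where an action is a search-engine query, a database lookup, or a calculator call, can be caricatured as classical search over which tool to invoke next. A minimal sketch, with the tool names, facts, and task all invented for illustration (no real APIs involved):

```python
from collections import deque

# Each hypothetical tool is described by (preconditions, facts it adds).
# "search" retrieves two population figures; "calc" can combine them
# only once both have been retrieved.
TOOLS = {
    "search": (set(), {"population_paris", "population_nyc"}),
    "calc":   ({"population_paris", "population_nyc"}, {"ratio"}),
}

def plan(goal, max_depth=4):
    """Breadth-first search over tool sequences until the goal fact is derivable."""
    queue = deque([((), frozenset())])  # (actions so far, known facts)
    seen = {frozenset()}
    while queue:
        actions, facts = queue.popleft()
        if goal in facts:
            return list(actions)
        if len(actions) == max_depth:
            continue
        for name, (pre, post) in TOOLS.items():
            if pre <= facts:                # tool is applicable in this state
                nxt = facts | post
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((actions + (name,), nxt))
    return None                             # no sequence of tools reaches the goal

print(plan("ratio"))  # → ['search', 'calc']
```

In this toy, the preconditions and effects of each tool are listed by hand; getting a system to discover them, and to plan with a learned model rather than a symbolic one, is exactly the open part of the problem.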

And so the question of planning

[02:35:28] : [02:35:32]

is not just a question of planning physical actions,

[02:35:32] : [02:35:35]

it could be planning actions to use tools

[02:35:35] : [02:35:39]

for a dialogue system

[02:35:39] : [02:35:40]

or for any kind of intelligence system.

[02:35:40] : [02:35:42]

And there's some work on this but not a huge amount.

[02:35:42] : [02:35:47]

Some work at FAIR,

[02:35:47] : [02:35:48]

one called Toolformer, which was a couple years ago

[02:35:48] : [02:35:52]

and some more recent work on planning,

[02:35:52] : [02:35:55]

but I don't think we have like a good solution

[02:35:55] : [02:35:59]

for any of that.

[02:35:59] : [02:36:00]

Then there is the question of hierarchical planning.

[02:36:00] : [02:36:03]

So the example I mentioned

[02:36:03] : [02:36:05]

of planning a trip from New York to Paris,

[02:36:05] : [02:36:10]

that's hierarchical,

[02:36:10] : [02:36:11]

but almost every action that we take

[02:36:11] : [02:36:13]

involves hierarchical planning in some sense.

[02:36:13] : [02:36:17]

And we really have absolutely no idea how to do this.

[02:36:17] : [02:36:20]

Like there's zero demonstration

[02:36:20] : [02:36:22]

of hierarchical planning in AI,

[02:36:22] : [02:36:26]

where the various levels of representations

[02:36:26] : [02:36:32]

that are necessary have been learned.

[02:36:32] : [02:36:36]

We can do like two-level hierarchical planning

[02:36:36] : [02:36:39]

when we design the two levels.

[02:36:39] : [02:36:41]

So for example, you have like a dog legged robot, right?

[02:36:41] : [02:36:44]

You want it to go from the living room to the kitchen.

[02:36:44] : [02:36:48]

You can plan a path that avoids the obstacle.

[02:36:48] : [02:36:51]

And then you can send this to a lower-level planner

[02:36:51] : [02:36:54]

that figures out how to move the legs

[02:36:54] : [02:36:56]

to kind of follow that trajectory, right?

[02:36:56] : [02:36:59]

So that works,

[02:36:59] : [02:37:00]

but that two-level planning is designed by hand, right?

[02:37:00] : [02:37:03]

We specify what the proper levels of abstraction,

[02:37:03] : [02:37:09]

the representation at each level of abstraction have to be.

[02:37:09] : [02:37:13]
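The hand-designed two-level setup just described can be made concrete in a toy sketch: a high-level BFS path planner over a coarse grid, and a low level that turns each grid step into stubbed gait commands. The grid, obstacle layout, and command names are all invented for the example; the point is that both abstraction levels are specified by the designer, not learned.

```python
from collections import deque

# Hypothetical floor plan: S = start (living room), G = goal (kitchen),
# '#' = obstacle, '.' = free space.
GRID = [
    "S..#.",
    ".#.#.",
    ".#...",
    "...#G",
]

def high_level_path(grid):
    """Hand-designed top level: BFS for a shortest collision-free path S -> G."""
    rows, cols = len(grid), len(grid[0])
    start = next((r, c) for r in range(rows) for c in range(cols) if grid[r][c] == "S")
    goal = next((r, c) for r in range(rows) for c in range(cols) if grid[r][c] == "G")
    queue, parent = deque([start]), {start: None}
    while queue:
        cell = queue.popleft()
        if cell == goal:
            path = []
            while cell is not None:        # backtrack through parents
                path.append(cell)
                cell = parent[cell]
            return path[::-1]
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols \
                    and grid[nr][nc] != "#" and (nr, nc) not in parent:
                parent[(nr, nc)] = cell
                queue.append((nr, nc))
    return None

def low_level_commands(path):
    """Hand-designed bottom level: each grid step becomes a stub gait command."""
    steps = {(1, 0): "step-south", (-1, 0): "step-north",
             (0, 1): "step-east", (0, -1): "step-west"}
    return [steps[(b[0] - a[0], b[1] - a[1])] for a, b in zip(path, path[1:])]

path = high_level_path(GRID)
commands = low_level_commands(path)
```

Both the grid abstraction and the command vocabulary are fixed in advance; the open question raised in the conversation is how a system could learn these levels of representation by itself.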

How do you learn this?

[02:37:13] : [02:37:14]

How do you learn that hierarchical representation

[02:37:14] : [02:37:16]

of action plans, right?

[02:37:16] : [02:37:19]

With convnets and deep learning,

[02:37:19] : [02:37:22]

we can train the system

[02:37:22] : [02:37:23]

to learn hierarchical representations of percepts.

[02:37:23] : [02:37:26]

What is the equivalent

[02:37:26] : [02:37:28]

when what you're trying to represent are action plans?

[02:37:28] : [02:37:30]

- For action plans.

[02:37:30] : [02:37:31]

Yeah.

[02:37:31] : [02:37:32]

So you want basically a robot dog or humanoid robot

[02:37:32] : [02:37:35]

that turns on and travels from New York to Paris

[02:37:35] : [02:37:38]

all by itself.

[02:37:38] : [02:37:40]

- [Yann] For example.

[02:37:40] : [02:37:41]

- All right.

[02:37:41] : [02:37:43]

It might have some trouble at the TSA but-

[02:37:43] : [02:37:47]

- No, but even doing something fairly simple

[02:37:47] : [02:37:49]

like a household task.

[02:37:49] : [02:37:50]

- [Lex] Sure.

[02:37:50] : [02:37:51]

- Like cooking or something.

[02:37:51] : [02:37:53]

- Yeah.

[02:37:53] : [02:37:54]

There's a lot involved.

[02:37:54] : [02:37:55]

It's a super complex task.

[02:37:55] : [02:37:56]

Once again, we take it for granted.

[02:37:56] : [02:37:59]

What hope do you have for the future of humanity?

[02:37:59] : [02:38:05]

We're talking about so many exciting technologies,

[02:38:05] : [02:38:07]

so many exciting possibilities.

[02:38:07] : [02:38:09]

What gives you hope when you look out

[02:38:09] : [02:38:12]

over the next 10, 20, 50, 100 years?

[02:38:12] : [02:38:15]

If you look at social media,

[02:38:15] : [02:38:16]

there's wars going on, there's division, there's hatred,

[02:38:16] : [02:38:21]

all this kind of stuff that's also part of humanity.

[02:38:21] : [02:38:24]

But amidst all that, what gives you hope?

[02:38:24] : [02:38:27]

- I love that question.

[02:38:27] : [02:38:30]

We can make humanity smarter with AI.

[02:38:30] : [02:38:37]

Okay?

[02:38:37] : [02:38:40]

I mean AI basically will amplify human intelligence.

[02:38:40] : [02:38:44]

It's as if every one of us

[02:38:44] : [02:38:47]

will have a staff of smart AI assistants.

[02:38:47] : [02:38:52]

They might be smarter than us.

[02:38:52] : [02:38:53]

They'll do our bidding,

[02:38:53] : [02:38:55]

perhaps execute a task

[02:38:55] : [02:39:01]

in ways that are much better than we could do ourselves

[02:39:01] : [02:39:05]

because they'd be smarter than us.

[02:39:05] : [02:39:07]

And so it's like everyone would be the boss

[02:39:07] : [02:39:10]

of a staff of super smart virtual people.

[02:39:10] : [02:39:14]

So we shouldn't feel threatened by this

[02:39:14] : [02:39:18]

any more than we should feel threatened

[02:39:18] : [02:39:19]

by being the manager of a group of people,

[02:39:19] : [02:39:22]

some of whom are more intelligent than us.

[02:39:22] : [02:39:24]

I certainly have a lot of experience with this.

[02:39:24] : [02:39:29]

(laughs)

[02:39:29] : [02:39:30]

Of having people working with me who are smarter than me.

[02:39:30] : [02:39:34]

That's actually a wonderful thing.

[02:39:34] : [02:39:36]

So having machines that are smarter than us,

[02:39:36] : [02:39:40]

that assist us in all of our tasks, our daily lives,

[02:39:40] : [02:39:44]

whether it's professional or personal,

[02:39:44] : [02:39:45]

I think would be an absolutely wonderful thing.

[02:39:45] : [02:39:48]

Because intelligence is the commodity

[02:39:48] : [02:39:50]

that is most in demand.

[02:39:50] : [02:39:54]

I mean, all the mistakes that humanity makes

[02:39:54] : [02:39:57]

are because of lack of intelligence, really,

[02:39:57] : [02:39:59]

or lack of knowledge, which is related.

[02:39:59] : [02:40:01]

So making people smarter can only be better.

[02:40:01] : [02:40:07]

I mean, for the same reason

[02:40:07] : [02:40:08]

that public education is a good thing

[02:40:08] : [02:40:10]

and books are a good thing,

[02:40:10] : [02:40:15]

and the internet is also a good thing, intrinsically.

[02:40:15] : [02:40:17]

And even social networks are a good thing

[02:40:17] : [02:40:19]

if you run them properly.

[02:40:19] : [02:40:21]

(laughs)

[02:40:21] : [02:40:21]

It's difficult, but you can.

[02:40:21] : [02:40:23]

Because it helps the communication

[02:40:23] : [02:40:30]

of information and knowledge

[02:40:30] : [02:40:32]

and the transmission of knowledge.

[02:40:32] : [02:40:34]

So AI is gonna make humanity smarter.

[02:40:34] : [02:40:36]

And the analogy I've been using

[02:40:36] : [02:40:39]

is the fact that perhaps an equivalent event

[02:40:39] : [02:40:44]

in the history of humanity

[02:40:44] : [02:40:47]

to what might be provided by the generalization of AI assistants

[02:40:47] : [02:40:52]

is the invention of the printing press.

[02:40:52] : [02:40:55]

It made everybody smarter.

[02:40:55] : [02:40:57]

The fact that people could have access to books.

[02:40:57] : [02:41:02]

Books were a lot cheaper than they were before.

[02:41:02] : [02:41:06]

And so a lot more people had an incentive to learn to read,

[02:41:06] : [02:41:10]

which wasn't the case before.

[02:41:10] : [02:41:12]

And people became smarter.

[02:41:12] : [02:41:17]

It enabled the Enlightenment, right?

[02:41:17] : [02:41:21]

There wouldn't be an Enlightenment

[02:41:21] : [02:41:22]

without the printing press.

[02:41:22] : [02:41:24]

It enabled philosophy, rationalism,

[02:41:24] : [02:41:29]

escape from religious doctrine,

[02:41:29] : [02:41:33]

democracy, science.

[02:41:33] : [02:41:38]

And certainly without this

[02:41:38] : [02:41:43]

there wouldn't have been the American Revolution

[02:41:43] : [02:41:46]

or the French Revolution.

[02:41:46] : [02:41:47]

And so we'd still be under feudal regimes perhaps.

[02:41:47] : [02:41:52]

And so it completely transformed the world

[02:41:52] : [02:41:57]

because people became smarter

[02:41:57] : [02:41:59]

and kinda learned about things.

[02:41:59] : [02:42:01]

Now, it also created 200 years

[02:42:01] : [02:42:05]

of essentially religious conflicts in Europe, right?

[02:42:05] : [02:42:08]

Because the first thing that people read was the Bible

[02:42:08] : [02:42:12]

and realized that

[02:42:12] : [02:42:15]

perhaps there was a different interpretation of the Bible

[02:42:15] : [02:42:17]

than what the priests were telling them.

[02:42:17] : [02:42:19]

And so that created the Protestant movement

[02:42:19] : [02:42:22]

and created a rift.

[02:42:22] : [02:42:23]

And in fact, the Catholic church

[02:42:23] : [02:42:25]

didn't like the idea of the printing press

[02:42:25] : [02:42:28]

but they had no choice.

[02:42:28] : [02:42:29]

And so it had some bad effects and some good effects.

[02:42:29] : [02:42:32]

I don't think anyone today

[02:42:32] : [02:42:33]

would say that the invention of the printing press

[02:42:33] : [02:42:35]

had an overall negative effect

[02:42:35] : [02:42:38]

despite the fact that it created 200 years

[02:42:38] : [02:42:40]

of religious conflicts in Europe.

[02:42:40] : [02:42:44]

Now compare this,

[02:42:44] : [02:42:45]

and I was very proud of myself

[02:42:45] : [02:42:48]

to come up with this analogy,

[02:42:48] : [02:42:51]

but realized someone else came up with the same idea before me.

[02:42:51] : [02:42:54]

Compare this with what happened in the Ottoman Empire.

[02:42:54] : [02:42:58]

The Ottoman Empire banned the printing press for 200 years.

[02:42:58] : [02:43:03]

And it didn't ban it for all languages,

[02:43:03] : [02:43:10]

only for Arabic.

[02:43:10] : [02:43:11]

You could actually print books

[02:43:11] : [02:43:13]

in Latin or Hebrew or whatever in the Ottoman Empire,

[02:43:13] : [02:43:18]

just not in Arabic.

[02:43:18] : [02:43:19]

And I thought it was because

[02:43:19] : [02:43:25]

the rulers just wanted to preserve

[02:43:25] : [02:43:27]

the control over the population and the dogma,

[02:43:27] : [02:43:30]

religious dogma and everything.

[02:43:30] : [02:43:32]

But after talking with the UAE Minister of AI,

[02:43:32] : [02:43:37]

Omar Al Olama,

[02:43:37] : [02:43:40]

he told me no, there was another reason.

[02:43:40] : [02:43:44]

And the other reason was that

[02:43:44] : [02:43:47]

it was to preserve the corporation of calligraphers, right?

[02:43:47] : [02:43:52]

There's like an art form

[02:43:52] : [02:43:56]

which is writing those beautiful Arabic poems

[02:43:56] : [02:44:01]

or whatever religious text in this thing.

[02:44:01] : [02:44:04]

And it was a very powerful corporation of scribes basically

[02:44:04] : [02:44:07]

that kinda ran a big chunk of the empire.

[02:44:07] : [02:44:12]

And they couldn't put them out of business.

[02:44:12] : [02:44:14]

So they banned the printing press

[02:44:14] : [02:44:16]

in part to protect that business.

[02:44:16] : [02:44:18]

Now, what's the analogy for AI today?

[02:44:18] : [02:44:23]

Like who are we protecting by banning AI?

[02:44:23] : [02:44:25]

Like who are the people who are asking that AI be regulated

[02:44:25] : [02:44:28]

to protect their jobs?

[02:44:28] : [02:44:31]

And of course, it's a real question

[02:44:31] : [02:44:35]

of what is gonna be the effect

[02:44:35] : [02:44:37]

of technological transformation like AI

[02:44:37] : [02:44:41]

on the job market and the labor market?

[02:44:41] : [02:44:45]

And there are economists

[02:44:45] : [02:44:46]

who are much more expert at this than I am,

[02:44:46] : [02:44:49]

but when I talk to them,

[02:44:49] : [02:44:50]

they tell us we're not gonna run out of jobs.

[02:44:50] : [02:44:54]

This is not gonna cause mass unemployment.

[02:44:54] : [02:44:56]

This is just gonna be a gradual shift

[02:44:56] : [02:45:01]

of different professions.

[02:45:01] : [02:45:02]

The professions that are gonna be hot

[02:45:02] : [02:45:04]

10 or 15 years from now,

[02:45:04] : [02:45:05]

we have no idea today what they're gonna be.

[02:45:05] : [02:45:09]

The same way if we go back 20 years in the past,

[02:45:09] : [02:45:12]

like who could have thought 20 years ago

[02:45:12] : [02:45:15]

that like the hottest job,

[02:45:15] : [02:45:17]

even like 5, 10 years ago was mobile app developer?

[02:45:17] : [02:45:21]

Like smartphones weren't invented.

[02:45:21] : [02:45:23]

- Most of the jobs of the future

[02:45:23] : [02:45:24]

might be in the Metaverse. (laughs)

[02:45:24] : [02:45:27]

- Well, it could be.

[02:45:27] : [02:45:28]

Yeah.

[02:45:28] : [02:45:29]

- But the point is you can't possibly predict.

[02:45:29] : [02:45:31]

But you're right.

[02:45:31] : [02:45:33]

I mean, you've made a lot of strong points.

[02:45:33] : [02:45:35]

And I believe that people are fundamentally good,

[02:45:35] : [02:45:38]

and so if AI, especially open source AI

[02:45:38] : [02:45:42]

can make them smarter,

[02:45:42] : [02:45:45]

it just empowers the goodness in humans.

[02:45:45] : [02:45:48]

- So I share that feeling.

[02:45:48] : [02:45:49]

Okay?

[02:45:49] : [02:45:50]

I think people are fundamentally good. (laughing)

[02:45:50] : [02:45:54]

And in fact a lot of doomers are doomers

[02:45:54] : [02:45:56]

because they don't think that people are fundamentally good.

[02:45:56] : [02:46:00]

And they either don't trust people

[02:46:00] : [02:46:04]

or they don't trust the institutions to do the right thing

[02:46:04] : [02:46:07]

so that people behave properly.

[02:46:07] : [02:46:09]

- Well, I think both you and I believe in humanity,

[02:46:09] : [02:46:13]

and I think I speak for a lot of people

[02:46:13] : [02:46:16]

in saying thank you for pushing the open source movement,

[02:46:16] : [02:46:20]

pushing to make both research and AI open source,

[02:46:20] : [02:46:24]

making it available to people,

[02:46:24] : [02:46:25]

and also the models themselves,

[02:46:25] : [02:46:27]

making that open source also.

[02:46:27] : [02:46:28]

So thank you for that.

[02:46:28] : [02:46:30]

And thank you for speaking your mind

[02:46:30] : [02:46:32]

in such colorful and beautiful ways on the internet.

[02:46:32] : [02:46:34]

I hope you never stop.

[02:46:34] : [02:46:35]

You're one of the most fun people I know

[02:46:35] : [02:46:37]

and get to be a fan of.

[02:46:37] : [02:46:39]

So Yann, thank you forspeaking to me once again,

[02:46:39] : [02:46:42]

and thank you for being you.

[02:46:42] : [02:46:43]

- Thank you Lex.

[02:46:43] : [02:46:44]

- Thanks for listening to this conversation with Yann LeCun.

[02:46:44] : [02:46:48]

To support this podcast,

[02:46:48] : [02:46:49]

please check out our sponsors in the description.

[02:46:49] : [02:46:52]

And now let me leave you with some words

[02:46:52] : [02:46:54]

from Arthur C. Clarke,

[02:46:54] : [02:46:55]

"The only way to discover the limits of the possible

[02:46:55] : [02:46:59]

is to go beyond them into the impossible."

[02:46:59] : [02:47:03]

Thank you for listening and hope to see you next time.

[02:47:03] : [02:47:07]
