
Yann LeCun on "The Danger of Power Concentration Through Proprietary AI Systems"

Video Man
@videoman

1 min read · 2024/04/04

TL;DR

Yann LeCun is the chief AI scientist at Meta, a professor at NYU, a Turing Award winner, and one of the most influential AI researchers. Lex is not Lex Luthor.



- I see the danger of this concentration of power

[00:00:00] : [00:00:02]

through proprietary AI systems

[00:00:02] : [00:00:06]

as a much bigger danger than everything else.

[00:00:06] : [00:00:08]

What works against this

[00:00:08] : [00:00:11]

is people who think that for reasons of security,

[00:00:11] : [00:00:15]

we should keep AI systems under lock and key

[00:00:15] : [00:00:18]

because it's too dangerous

[00:00:18] : [00:00:19]

to put it in the hands of everybody.

[00:00:19] : [00:00:22]

That would lead to a very bad future

[00:00:22] : [00:00:25]

in which all of our information diet

[00:00:25] : [00:00:27]

is controlled by a small number of companies

[00:00:27] : [00:00:30]

through proprietary systems.

[00:00:30] : [00:00:32]

- I believe that people are fundamentally good

[00:00:32] : [00:00:34]

and so if AI, especially open source AI

[00:00:34] : [00:00:38]

can make them smarter,

[00:00:38] : [00:00:41]

it just empowers the goodness in humans.

[00:00:41] : [00:00:44]

- So I share that feeling.

[00:00:44] : [00:00:45]

Okay?

[00:00:45] : [00:00:46]

I think people are fundamentally good. (laughing)

[00:00:46] : [00:00:50]

And in fact a lot of doomers are doomers

[00:00:50] : [00:00:52]

because they don't think that people are fundamentally good.

[00:00:52] : [00:00:55]

- The following is a conversation with Yann LeCun,

[00:00:55] : [00:01:01]

his third time on this podcast.

[00:01:01] : [00:01:02]

He is the chief AI scientist at Meta,

[00:01:02] : [00:01:05]

professor at NYU,

[00:01:05] : [00:01:07]

Turing Award winner

[00:01:07] : [00:01:08]

and one of the seminal figures

[00:01:08] : [00:01:10]

in the history of artificial intelligence.

[00:01:10] : [00:01:13]

He and Meta AI

[00:01:13] : [00:01:15]

have been big proponents of open sourcing AI development,

[00:01:15] : [00:01:19]

and have been walking the walk

[00:01:19] : [00:01:21]

by open sourcing many of their biggest models,

[00:01:21] : [00:01:24]

including LLaMA 2 and eventually LLaMA 3.

[00:01:24] : [00:01:28]

Also, Yann has been an outspoken critic

[00:01:28] : [00:01:31]

of those people in the AI community

[00:01:31] : [00:01:34]

who warn about the looming danger

[00:01:34] : [00:01:36]

and existential threat of AGI.

[00:01:36] : [00:01:39]

He believes AGI will be created one day,

[00:01:39] : [00:01:43]

but it will be good.

[00:01:43] : [00:01:45]

It will not escape human control

[00:01:45] : [00:01:47]

nor will it dominate and kill all humans.

[00:01:47] : [00:01:52]

At this moment of rapid AI development,

[00:01:52] : [00:01:54]

this happens to be somewhat of a controversial position.

[00:01:54] : [00:01:58]

And so it's been fun

[00:01:58] : [00:02:00]

seeing Yann get into a lot of intense

[00:02:00] : [00:02:02]

and fascinating discussions online

[00:02:02] : [00:02:04]

as we do in this very conversation.

[00:02:04] : [00:02:08]

This is the Lex Fridman podcast.

[00:02:08] : [00:02:10]

To support it,

[00:02:10] : [00:02:11]

please check out our sponsors in the description.

[00:02:11] : [00:02:13]

And now, dear friends, here's Yann LeCun.

[00:02:13] : [00:02:17]

You've had some strong statements,

[00:02:17] : [00:02:21]

technical statements

[00:02:21] : [00:02:22]

about the future of artificialintelligence recently,

[00:02:22] : [00:02:25]

throughout your career actually but recently as well.

[00:02:25] : [00:02:28]

You've said that autoregressive LLMs

[00:02:28] : [00:02:31]

are not the way we're going to make progress

[00:02:31] : [00:02:36]

towards superhuman intelligence.

[00:02:36] : [00:02:38]

These are the large language models

[00:02:38] : [00:02:41]

like GPT-4, like LLaMA 2 and 3 soon and so on.

[00:02:41] : [00:02:44]

How do they work

[00:02:44] : [00:02:45]

and why are they not going to take us all the way?

[00:02:45] : [00:02:47]

- For a number of reasons.

[00:02:47] : [00:02:49]

The first is that there is a number of characteristics

[00:02:49] : [00:02:51]

of intelligent behavior.

[00:02:51] : [00:02:53]

For example, the capacity to understand the world,

[00:02:53] : [00:02:58]

understand the physical world,

[00:02:58] : [00:03:00]

the ability to remember and retrieve things,

[00:03:00] : [00:03:05]

persistent memory,

[00:03:05] : [00:03:08]

the ability to reason and the ability to plan.

[00:03:08] : [00:03:12]

Those are four essential characteristics

[00:03:12] : [00:03:14]

of intelligent systems or entities,

[00:03:14] : [00:03:18]

humans, animals.

[00:03:18] : [00:03:19]

LLMs can do none of those,

[00:03:19] : [00:03:23]

or they can only do them in a very primitive way.

[00:03:23] : [00:03:26]

And they don't really understand the physical world,

[00:03:26] : [00:03:29]

they don't really have persistent memory,

[00:03:29] : [00:03:31]

they can't really reason

[00:03:31] : [00:03:32]

and they certainly can't plan.

[00:03:32] : [00:03:34]

And so if you expect the system to become intelligent

[00:03:34] : [00:03:38]

just without having the possibility of doing those things,

[00:03:38] : [00:03:43]

you're making a mistake.

[00:03:43] : [00:03:44]

That is not to say that autoregressive LLMs are not useful,

[00:03:44] : [00:03:50]

they're certainly useful.

[00:03:50] : [00:03:52]

That they're not interesting,

[00:03:52] : [00:03:55]

that we can't build

[00:03:55] : [00:03:56]

a whole ecosystem of applications around them,

[00:03:56] : [00:04:00]

of course we can.

[00:04:00] : [00:04:00]

But as a path towards human level intelligence,

[00:04:00] : [00:04:05]

they're missing essential components.

[00:04:05] : [00:04:08]

And then there is another tidbit or fact

[00:04:08] : [00:04:11]

that I think is very interesting;

[00:04:11] : [00:04:14]

those LLMs are trained on enormous amounts of text.

[00:04:14] : [00:04:16]

Basically the entirety

[00:04:16] : [00:04:18]

of all publicly available text on the internet, right?

[00:04:18] : [00:04:21]

That's typically on the order of 10 to the 13 tokens.

[00:04:21] : [00:04:26]

Each token is typically two bytes.

[00:04:26] : [00:04:28]

So that's two times 10 to the 13 bytes of training data.

[00:04:28] : [00:04:31]

It would take you or me 170,000 years

[00:04:31] : [00:04:34]

to just read through this at eight hours a day. (laughs)

[00:04:34] : [00:04:37]
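The arithmetic behind that claim can be sketched quickly. The reading-speed figures below (about 250 words per minute, roughly 5 bytes per word) are my own illustrative assumptions, not numbers from the conversation; they land in the same order of magnitude as the 170,000-year figure:

```python
# Rough check: how long would it take a human to read ~2e13 bytes of text?
tokens = 1e13                 # ~10^13 tokens of public text
bytes_per_token = 2
corpus_bytes = tokens * bytes_per_token      # 2e13 bytes

words_per_minute = 250        # assumed typical reading speed
bytes_per_word = 5            # assumed average word length (with space)
bytes_per_hour = words_per_minute * 60 * bytes_per_word

hours = corpus_bytes / bytes_per_hour
years = hours / (8 * 365)     # reading 8 hours a day, every day

print(f"corpus: {corpus_bytes:.0e} bytes, reading time: {years:,.0f} years")
```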

So it seems like an enormous amount of knowledge, right?

[00:04:37] : [00:04:41]

That those systems can accumulate.

[00:04:41] : [00:04:43]

But then you realize it's really not that much data.

[00:04:43] : [00:04:48]

If you talk to developmental psychologists,

[00:04:48] : [00:04:51]

and they tell you a 4-year-old

[00:04:51] : [00:04:53]

has been awake for 16,000 hours in his or her life,

[00:04:53] : [00:04:57]

and the amount of information

[00:04:57] : [00:05:01]

that has reached the visual cortex of that child

[00:05:01] : [00:05:06]

in four years

[00:05:06] : [00:05:07]

is about 10 to the 15 bytes.

[00:05:07] : [00:05:12]

And you can compute this

[00:05:12] : [00:05:12]

by estimating that the optical nerve

[00:05:12] : [00:05:16]

carries about 20 megabytes per second, roughly.

[00:05:16] : [00:05:19]

And so 10 to the 15 bytes for a 4-year-old

[00:05:19] : [00:05:22]

versus two times 10 to the 13 bytes

[00:05:22] : [00:05:25]

for 170,000 years worth of reading.

[00:05:25] : [00:05:28]
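The two quantities being compared can be checked directly from the figures given in the conversation (16,000 waking hours, roughly 20 MB/s through the optical nerve, 2 × 10^13 bytes of text):

```python
# Visual input to a 4-year-old vs. the text corpus an LLM is trained on.
hours_awake = 16_000                      # waking hours by age four
optic_nerve_bytes_per_sec = 20e6          # ~20 megabytes per second
visual_bytes = hours_awake * 3600 * optic_nerve_bytes_per_sec  # ~1.15e15

text_bytes = 2e13                         # ~10^13 tokens at 2 bytes each

print(f"visual input: {visual_bytes:.2e} bytes")
print(f"ratio: {visual_bytes / text_bytes:.0f}x more sensory data than text")
```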

What that tells you is that through sensory input,

[00:05:28] : [00:05:33]

we see a lot more information

[00:05:33] : [00:05:35]

than we do through language.

[00:05:35] : [00:05:37]

And that despite our intuition,

[00:05:37] : [00:05:40]

most of what we learn and most of our knowledge

[00:05:40] : [00:05:43]

is through our observation and interaction

[00:05:43] : [00:05:46]

with the real world,

[00:05:46] : [00:05:47]

not through language.

[00:05:47] : [00:05:49]

Everything that we learn in the first few years of life,

[00:05:49] : [00:05:51]

and certainly everything that animals learn

[00:05:51] : [00:05:54]

has nothing to do with language.

[00:05:54] : [00:05:57]

- So it would be good

[00:05:57] : [00:05:57]

to maybe push against some of the intuition

[00:05:57] : [00:06:00]

behind what you're saying.

[00:06:00] : [00:06:01]

So it is true there's several orders of magnitude

[00:06:01] : [00:06:05]

more data coming into the human mind, much faster,

[00:06:05] : [00:06:10]

and the human mind is able to learn very quickly from that,

[00:06:10] : [00:06:13]

filter the data very quickly.

[00:06:13] : [00:06:14]

Somebody might argue

[00:06:14] : [00:06:16]

your comparison between sensory data versus language.

[00:06:16] : [00:06:19]

That language is already very compressed.

[00:06:19] : [00:06:23]

It already contains a lot more information

[00:06:23] : [00:06:25]

than the bytes it takes to store them,

[00:06:25] : [00:06:27]

if you compare it to visual data.

[00:06:27] : [00:06:29]

So there's a lot of wisdom in language.

[00:06:29] : [00:06:31]

There's words and the way we stitch them together,

[00:06:31] : [00:06:33]

it already contains a lot of information.

[00:06:33] : [00:06:36]

So is it possible that language alone

[00:06:36] : [00:06:40]

already has enough wisdomand knowledge in there

[00:06:40] : [00:06:45]

to be able to, from that language, construct a world model

[00:06:45] : [00:06:50]

and understanding of the world,

[00:06:50] : [00:06:52]

an understanding of the physical world

[00:06:52] : [00:06:54]

that you're saying LLMs lack?

[00:06:54] : [00:06:56]

- So it's a big debate among philosophers

[00:06:56] : [00:07:00]

and also cognitive scientists,

[00:07:00] : [00:07:01]

like whether intelligence needs to be grounded in reality.

[00:07:01] : [00:07:05]

I'm clearly in the camp

[00:07:05] : [00:07:07]

that yes, intelligence cannot appear

[00:07:07] : [00:07:10]

without some grounding in some reality.

[00:07:10] : [00:07:14]

It doesn't need to be physical reality,

[00:07:14] : [00:07:17]

it could be simulated

[00:07:17] : [00:07:18]

but the environment is just much richer

[00:07:18] : [00:07:20]

than what you can express in language.

[00:07:20] : [00:07:22]

Language is a very approximate representation of percepts

[00:07:22] : [00:07:27]

and or mental models, right?

[00:07:27] : [00:07:29]

I mean, there's a lot of tasks that we accomplish

[00:07:29] : [00:07:32]

where we manipulate a mental model of the situation at hand,

[00:07:32] : [00:07:37]

and that has nothing to do with language.

[00:07:37] : [00:07:40]

Everything that's physical, mechanical, whatever,

[00:07:40] : [00:07:43]

when we build something,

[00:07:43] : [00:07:44]

when we accomplish a task,

[00:07:44] : [00:07:46]

a motor task of grabbing something, et cetera,

[00:07:46] : [00:07:50]

we plan our action sequences,

[00:07:50] : [00:07:52]

and we do this

[00:07:52] : [00:07:53]

by essentially imagining the result

[00:07:53] : [00:07:55]

of the outcome of a sequence of actions that we might imagine.

[00:07:55] : [00:08:00]

And that requires mental models

[00:08:00] : [00:08:03]

that don't have much to do with language.

[00:08:03] : [00:08:06]

And that's, I would argue,

[00:08:06] : [00:08:07]

most of our knowledge

[00:08:07] : [00:08:09]

is derived from that interaction with the physical world.

[00:08:09] : [00:08:13]

So a lot of my colleagues

[00:08:13] : [00:08:15]

who are more interested in things like computer vision

[00:08:15] : [00:08:19]

are really in that camp

[00:08:19] : [00:08:20]

that AI needs to be embodied, essentially.

[00:08:20] : [00:08:25]

And then other people coming from the NLP side

[00:08:25] : [00:08:28]

or maybe some other motivation

[00:08:28] : [00:08:32]

don't necessarily agree with that.

[00:08:32] : [00:08:34]

And philosophers are split as well.

[00:08:34] : [00:08:37]

And the complexity of the world is hard to imagine.

[00:08:37] : [00:08:42]

It's hard to represent all the complexities

[00:08:42] : [00:08:49]

that we take completely for granted in the real world

[00:08:49] : [00:08:53]

that we don't even imagine require intelligence, right?

[00:08:53] : [00:08:55]

This is the old Moravec's paradox

[00:08:55] : [00:08:57]

from the pioneer of robotics, Hans Moravec,

[00:08:57] : [00:09:01]

who said, how is it that with computers,

[00:09:01] : [00:09:03]

it seems to be easy to do high-level complex tasks

[00:09:03] : [00:09:05]

like playing chess and solving integrals

[00:09:05] : [00:09:08]

and doing things like that,

[00:09:08] : [00:09:09]

whereas the thing we take for granted that we do every day,

[00:09:09] : [00:09:13]

like, I don't know, learning to drive a car

[00:09:13] : [00:09:16]

or grabbing an object,

[00:09:16] : [00:09:18]

we can't do with computers. (laughs)

[00:09:18] : [00:09:21]

And we have LLMs that can pass the bar exam,

[00:09:21] : [00:09:26]

so they must be smart.

[00:09:26] : [00:09:29]

But then they can't learn to drive in 20 hours

[00:09:29] : [00:09:33]

like any 17-year-old.

[00:09:33] : [00:09:35]

They can't learn to clear out the dinner table

[00:09:35] : [00:09:37]

and fill up the dishwasher

[00:09:37] : [00:09:40]

like any 10-year-old can learn in one shot.

[00:09:40] : [00:09:42]

Why is that?

[00:09:42] : [00:09:44]

Like what are we missing?

[00:09:44] : [00:09:45]

What type of learning

[00:09:45] : [00:09:47]

or reasoning architecture or whatever are we missing

[00:09:47] : [00:09:52]

that basically prevents us

[00:09:52] : [00:09:55]

from having level five self-driving cars

[00:09:55] : [00:09:58]

and domestic robots?

[00:09:58] : [00:10:00]

- Can a large language model construct a world model

[00:10:00] : [00:10:05]

that does know how to drive

[00:10:05] : [00:10:07]

and does know how to fill a dishwasher,

[00:10:07] : [00:10:09]

but just doesn't know

[00:10:09] : [00:10:10]

how to deal with visual data at this time?

[00:10:10] : [00:10:12]

So it can operate in a space of concepts.

[00:10:12] : [00:10:17]

- So yeah, that's what a lotof people are working on.

[00:10:17] : [00:10:19]

So the answer,

[00:10:19] : [00:10:20]

the short answer is no.

[00:10:20] : [00:10:22]

And the more complex answer is

[00:10:22] : [00:10:24]

you can use all kind of tricks

[00:10:24] : [00:10:26]

to get an LLM to basically digest visual representations

[00:10:26] : [00:10:31]

of images or video or audio for that matter.

[00:10:31] : [00:10:40]

And a classical way of doing this

[00:10:40] : [00:10:45]

is you train a vision system in some way,

[00:10:45] : [00:10:48]

and we have a number of ways to train vision systems,

[00:10:48] : [00:10:51]

either supervised, unsupervised, self-supervised,

[00:10:51] : [00:10:53]

all kinds of different ways.

[00:10:53] : [00:10:55]

That will turn any image into a high-level representation.

[00:10:55] : [00:11:01]

Basically, a list of tokens

[00:11:01] : [00:11:03]

that are really similar to the kind of tokens

[00:11:03] : [00:11:05]

that a typical LLM takes as an input.

[00:11:05] : [00:11:10]

And then you just feed that to the LLM

[00:11:10] : [00:11:15]

in addition to the text,

[00:11:15] : [00:11:17]

and you just expect the LLM during training

[00:11:17] : [00:11:21]

to kind of be able to use those representations

[00:11:21] : [00:11:25]

to help make decisions.

[00:11:25] : [00:11:27]

I mean, there's been work along those lines

[00:11:27] : [00:11:29]

for quite a long time.

[00:11:29] : [00:11:30]

And now you see those systems, right?

[00:11:30] : [00:11:32]

I mean, there are LLMs that have some vision extension.

[00:11:32] : [00:11:36]

But they're basically hacks

[00:11:36] : [00:11:37]

in the sense that those things

[00:11:37] : [00:11:39]

are not like trained to handle,

[00:11:39] : [00:11:41]

to really understand the world.

[00:11:41] : [00:11:43]

They're not trainedwith video, for example.

[00:11:43] : [00:11:46]

They don't really understand intuitive physics,

[00:11:46] : [00:11:48]

at least not at the moment.

[00:11:48] : [00:11:50]

- So you don't think

[00:11:50] : [00:11:52]

there's something special to you about intuitive physics,

[00:11:52] : [00:11:54]

about sort of common sense reasoning

[00:11:54] : [00:11:55]

about the physical space, about physical reality?

[00:11:55] : [00:11:58]

That to you is a giant leap

[00:11:58] : [00:12:00]

that LLMs are just not able to do?

[00:12:00] : [00:12:02]

- We're not gonna be able to do this

[00:12:02] : [00:12:03]

with the type of LLMs that we are working with today.

[00:12:03] : [00:12:07]

And there's a number of reasons for this,

[00:12:07] : [00:12:09]

but the main reason is

[00:12:09] : [00:12:10]

the way LLMs are trained is that you take a piece of text,

[00:12:10] : [00:12:16]

you remove some of the words in that text, you mask them,

[00:12:16] : [00:12:20]

you replace them by blank markers,

[00:12:20] : [00:12:22]

and you train a gigantic neural net

[00:12:22] : [00:12:24]

to predict the words that are missing.

[00:12:24] : [00:12:26]

And if you build this neural net in a particular way

[00:12:26] : [00:12:30]

so that it can only look at words

[00:12:30] : [00:12:32]

that are to the left of the one it's trying to predict,

[00:12:32] : [00:12:36]

then what you have is a system

[00:12:36] : [00:12:37]

that basically is trying to predict

[00:12:37] : [00:12:38]

the next word in a text, right?

[00:12:38] : [00:12:40]

So then you can feed it a text, a prompt,

[00:12:40] : [00:12:43]

and you can ask it to predict the next word.

[00:12:43] : [00:12:45]

It can never predict the next word exactly.

[00:12:45] : [00:12:47]

And so what it's gonna do

[00:12:47] : [00:12:49]

is produce a probability distribution

[00:12:49] : [00:12:52]

of all the possible words in the dictionary.

[00:12:52] : [00:12:54]

In fact, it doesn't predict words,

[00:12:54] : [00:12:56]

it predicts tokens that are kind of subword units.

[00:12:56] : [00:12:58]

And so it's easy to handle the uncertainty

[00:12:58] : [00:13:01]

in the prediction there

[00:13:01] : [00:13:02]

because there's only a finite number

[00:13:02] : [00:13:04]

of possible words in the dictionary,

[00:13:04] : [00:13:07]

and you can just compute a distribution over them.

[00:13:07] : [00:13:10]

Then what the system does

[00:13:10] : [00:13:12]

is that it picks a word from that distribution.

[00:13:12] : [00:13:16]

Of course, there's a higher chance of picking words

[00:13:16] : [00:13:18]

that have a higher probability within that distribution.

[00:13:18] : [00:13:21]

So you sample from that distribution

[00:13:21] : [00:13:22]

to actually produce a word,

[00:13:22] : [00:13:24]

and then you shift that word into the input.

[00:13:24] : [00:13:27]

And so that allows the system now

[00:13:27] : [00:13:29]

to predict the second word, right?

[00:13:29] : [00:13:32]

And once you do this,

[00:13:32] : [00:13:33]

you shift it into the input, et cetera.

[00:13:33] : [00:13:35]

That's called autoregressive prediction,

[00:13:35] : [00:13:38]

which is why those LLMs

[00:13:38] : [00:13:39]

should be called autoregressive LLMs,

[00:13:39] : [00:13:42]

but we just call them LLMs.

[00:13:42] : [00:13:46]
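The loop just described — predict a distribution over a finite vocabulary, sample one token, shift it into the input, repeat — can be sketched with a toy stand-in for the model. `toy_model` and the tiny vocabulary here are placeholders for illustration, not any real LLM:

```python
import random

VOCAB = ["the", "cat", "sat", "on", "mat", "."]

def toy_model(context):
    # A real LLM would compute these probabilities from the context;
    # this placeholder just returns a fixed distribution over VOCAB.
    return [0.3, 0.2, 0.2, 0.1, 0.1, 0.1]

def generate(prompt, n_tokens, seed=0):
    rng = random.Random(seed)
    tokens = list(prompt)
    for _ in range(n_tokens):
        probs = toy_model(tokens)
        # Sample from the distribution (not just the argmax)...
        next_tok = rng.choices(VOCAB, weights=probs, k=1)[0]
        # ...then shift the sampled token into the input and repeat.
        tokens.append(next_tok)
    return tokens

print(generate(["the"], 5))
```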

And there is a difference between this kind of process

[00:13:46] : [00:13:50]

and a process by which, before producing a word,

[00:13:50] : [00:13:53]

when you talk.

[00:13:53] : [00:13:55]

When you and I talk,

[00:13:55] : [00:13:56]

you and I are bilinguals.

[00:13:56] : [00:13:58]

We think about what we're gonna say,

[00:13:58] : [00:14:00]

and it's relatively independent

[00:14:00] : [00:14:01]

of the language in which we're gonna say it.

[00:14:01] : [00:14:04]

When we talk about like, I don't know,

[00:14:04] : [00:14:06]

let's say a mathematical concept or something.

[00:14:06] : [00:14:09]

The kind of thinking that we're doing

[00:14:09] : [00:14:10]

and the answer that we're planning to produce

[00:14:10] : [00:14:13]

is not linked to whether we're gonna say it

[00:14:13] : [00:14:16]

in French or Russian or English.

[00:14:16] : [00:14:19]

- Chomsky just rolled his eyes, but I understand.

[00:14:19] : [00:14:21]

So you're saying that there's a bigger abstraction

[00:14:21] : [00:14:24]

that goes before language-

[00:14:24] : [00:14:28]

- [Yann] Yeah.
- And maps onto language.

[00:14:28] : [00:14:30]

- Right.

[00:14:30] : [00:14:31]

It's certainly true for a lot of thinking that we do.

[00:14:31] : [00:14:33]

- Is that obvious that we don't?

[00:14:33] : [00:14:35]

Like you're saying your thinking is the same in French

[00:14:35] : [00:14:39]

as it is in English?

[00:14:39] : [00:14:40]

- Yeah, pretty much.

[00:14:40] : [00:14:42]

- Pretty much or is this...

[00:14:42] : [00:14:43]

Like how flexible are you,

[00:14:43] : [00:14:45]

like if there's a probability distribution?

[00:14:45] : [00:14:48]

(both laugh)

[00:14:48] : [00:14:49]

- Well, it depends what kind of thinking, right?

[00:14:49] : [00:14:50]

If it's like producing puns,

[00:14:50] : [00:14:53]

I get much better in French than English about that (laughs)

[00:14:53] : [00:14:56]

or much worse-

[00:14:56] : [00:14:58]

- Is there an abstract representation of puns?

[00:14:58] : [00:15:00]

Like is your humor an abstract...

[00:15:00] : [00:15:01]

Like when you tweet

[00:15:01] : [00:15:03]

and your tweets are sometimes a little bit spicy,

[00:15:03] : [00:15:06]

is there an abstract representation in your brain of a tweet

[00:15:06] : [00:15:09]

before it maps onto English?

[00:15:09] : [00:15:11]

- There is an abstract representation

[00:15:11] : [00:15:13]

of imagining the reaction of a reader to that text.

[00:15:13] : [00:15:18]

- Oh, you start with laughter

[00:15:18] : [00:15:19]

and then figure out how to make that happen?

[00:15:19] : [00:15:22]

- Figure out like a reaction you wanna cause

[00:15:22] : [00:15:25]

and then figure out how to say it

[00:15:25] : [00:15:26]

so that it causes that reaction.

[00:15:26] : [00:15:29]

But that's like really close to language.

[00:15:29] : [00:15:30]

But think about like a mathematical concept

[00:15:30] : [00:15:34]

or imagining something you want to build out of wood

[00:15:34] : [00:15:38]

or something like this, right?

[00:15:38] : [00:15:40]

The kind of thinking you're doing

[00:15:40] : [00:15:41]

has absolutely nothing to do with language, really.

[00:15:41] : [00:15:43]

Like it's not like you have necessarily

[00:15:43] : [00:15:44]

like an internal monologue in any particular language.

[00:15:44] : [00:15:47]

You're imagining mental models of the thing, right?

[00:15:47] : [00:15:51]

I mean, if I ask you to like imagine

[00:15:51] : [00:15:54]

what this water bottle will look like

[00:15:54] : [00:15:56]

if I rotate it 90 degrees,

[00:15:56] : [00:15:59]

that has nothing to do with language.

[00:15:59] : [00:16:01]

And so clearly

[00:16:01] : [00:16:04]

there is a more abstract level of representation

[00:16:04] : [00:16:08]

in which we do most of our thinking

[00:16:08] : [00:16:11]

and we plan what we're gonna say

[00:16:11] : [00:16:13]

if the output is uttered words

[00:16:13] : [00:16:18]

as opposed to an output being muscle actions, right?

[00:16:18] : [00:16:24]

We plan our answer before we produce it.

[00:16:24] : [00:16:29]

And LLMs don't do that,

[00:16:29] : [00:16:30]

they just produce one word after the other,

[00:16:30] : [00:16:32]

instinctively if you want.

[00:16:32] : [00:16:35]

It's a bit like the subconscious actions where you don't...

[00:16:35] : [00:16:40]

Like you're distracted.

[00:16:40] : [00:16:42]

You're doing something,

[00:16:42] : [00:16:43]

you're completely concentrated

[00:16:43] : [00:16:45]

and someone comes to youand asks you a question.

[00:16:45] : [00:16:47]

And you kind of answer the question.

[00:16:47] : [00:16:49]

You don't have time to think about the answer,

[00:16:49] : [00:16:51]

but the answer is easy

[00:16:51] : [00:16:52]

so you don't need to pay attention

[00:16:52] : [00:16:54]

and you sort of respond automatically.

[00:16:54] : [00:16:55]

That's kind of what an LLM does, right?

[00:16:55] : [00:16:58]

It doesn't think about its answer, really.

[00:16:58] : [00:17:01]

It retrieves it because it's accumulated a lot of knowledge,

[00:17:01] : [00:17:04]

so it can retrieve some things,

[00:17:04] : [00:17:06]

but it's going to just spit out one token after the other

[00:17:06] : [00:17:10]

without planning the answer.

[00:17:10] : [00:17:13]

- But you're making it sound like just one token after the other,

[00:17:13] : [00:17:17]

one token at a time generation is bound to be simplistic.

[00:17:17] : [00:17:22]

But if the world model is sufficiently sophisticated,

[00:17:22] : [00:17:28]

that one token at a time,

[00:17:28] : [00:17:30]

the most likely thing it generates as a sequence of tokens

[00:17:30] : [00:17:35]

is going to be a deeply profound thing.

[00:17:35] : [00:17:39]

- Okay.

[00:17:39] : [00:17:39]

But then that assumes that those systems

[00:17:39] : [00:17:42]

actually possess an internal world model.

[00:17:42] : [00:17:44]

- So it really goes to the...

[00:17:44] : [00:17:46]

I think the fundamental question is

[00:17:46] : [00:17:48]

can you build a really complete world model?

[00:17:48] : [00:17:53]

Not complete,

[00:17:53] : [00:17:54]

but one that has a deep understanding of the world.

[00:17:54] : [00:17:58]

- Yeah.

[00:17:58] : [00:17:59]

So can you build this first of all by prediction?

[00:17:59] : [00:18:03]

- [Lex] Right.

[00:18:03] : [00:18:04]

- And the answer is probably yes.

[00:18:04] : [00:18:06]

Can you build it by predicting words?

[00:18:06] : [00:18:10]

And the answer is most probably no,

[00:18:10] : [00:18:14]

because language is very poor in terms of...

[00:18:14] : [00:18:17]

Or weak or low bandwidth if you want,

[00:18:17] : [00:18:19]

there's just not enough information there.

[00:18:19] : [00:18:21]

So building world models means observing the world

[00:18:21] : [00:18:26]

and understanding why the world is evolving the way it is.

[00:18:26] : [00:18:32]

And then the extra component of a world model

[00:18:32] : [00:18:38]

is something that can predict

[00:18:38] : [00:18:41]

how the world is going to evolve

[00:18:41] : [00:18:42]

as a consequence of an action you might take, right?

[00:18:42] : [00:18:45]

So a world model really is,

[00:18:45] : [00:18:47]

here is my idea of the state of the world at time T,

[00:18:47] : [00:18:49]

here is an action I might take.

[00:18:49] : [00:18:51]

What is the predicted state of the world

[00:18:51] : [00:18:53]

at time T plus one?

[00:18:53] : [00:18:55]

Now, that state of the world

[00:18:55] : [00:18:57]

does not need to represent everything about the world,

[00:18:57] : [00:19:01]

it just needs to represent

[00:19:01] : [00:19:02]

enough that's relevant for this planning of the action,

[00:19:02] : [00:19:06]

but not necessarily all the details.

[00:19:06] : [00:19:08]
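That interface — state at time T plus a candidate action in, predicted state at T+1 out, used to score actions before taking them — can be sketched abstractly. The linear `world_model` dynamics below are a made-up placeholder for illustration, not any actual model from the conversation:

```python
def world_model(state, action):
    # Predicted state at time T+1 given the state at time T and an action.
    # Placeholder dynamics: the action simply displaces the state.
    return [s + a for s, a in zip(state, action)]

def plan(state, candidate_actions, goal):
    # Pick the action whose *predicted* outcome lands closest to the goal:
    # planning by imagining outcomes, without acting in the world.
    def dist(s):
        return sum((si - gi) ** 2 for si, gi in zip(s, goal))
    return min(candidate_actions, key=lambda a: dist(world_model(state, a)))

best = plan([0.0, 0.0], [[1, 0], [0, 1], [-1, 0]], goal=[0, 1])
print(best)  # the action predicted to reach the goal state
```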

Now, here is the problem.

[00:19:08] : [00:19:09]

You're not going to be able to do this

[00:19:09] : [00:19:11]

with generative models.

[00:19:11] : [00:19:14]

So a generative model that's trained on video,

[00:19:14] : [00:19:16]

and we've tried to do this for 10 years.

[00:19:16] : [00:19:18]

You take a video,

[00:19:18] : [00:19:20]

show a system a piece of video

[00:19:20] : [00:19:22]

and then ask it to predict the remainder of the video.

[00:19:22] : [00:19:25]

Basically predict what's gonna happen.

[00:19:25] : [00:19:27]

- One frame at a time.

[00:19:27] : [00:19:29]

Do the same thing as sort of the autoregressive LLMs do,

[00:19:29] : [00:19:33]

but for video.

[00:19:33] : [00:19:34]

- Right.

[00:19:34] : [00:19:35]

Either one frame at a time or a group of frames at a time.

[00:19:35] : [00:19:37]

But yeah, a large video model, if you want. (laughing)

[00:19:37] : [00:19:42]

The idea of doing this

[00:19:42] : [00:19:45]

has been floating around for a long time.

[00:19:45] : [00:19:46]

And at FAIR,

[00:19:46] : [00:19:48]

some colleagues and I

[00:19:48] : [00:19:51]

have been trying to do this for about 10 years.

[00:19:51] : [00:19:53]

And you can't really do the same trick as with LLMs,

[00:19:53] : [00:19:58]

because LLMs, as I said,

[00:19:58] : [00:20:02]

you can't predict exactly which word is gonna follow

[00:20:02] : [00:20:05]

a sequence of words,

[00:20:05] : [00:20:06]

but you can predict the distribution of the words.

[00:20:06] : [00:20:09]

Now, if you go to video,

[00:20:09] : [00:20:11]

what you would have to do

[00:20:11] : [00:20:12]

is predict the distribution

[00:20:12] : [00:20:13]

of all possible frames in a video.

[00:20:13] : [00:20:16]

And we don't really know how to do that properly.

[00:20:16] : [00:20:19]

We do not know how to represent distributions

[00:20:19] : [00:20:21]

over high dimensional continuous spaces

[00:20:21] : [00:20:24]

in ways that are useful.

[00:20:24] : [00:20:25]

And there lies the main issue.

[00:20:25] : [00:20:31]

And the reason we can't do this

[00:20:31] : [00:20:33]

is because the world

[00:20:33] : [00:20:34]

is incredibly more complicated and richer

[00:20:34] : [00:20:38]

in terms of information than text.

[00:20:38] : [00:20:40]

Text is discrete.

[00:20:40] : [00:20:41]

Video is high dimensional and continuous.

[00:20:41] : [00:20:45]

A lot of details in this.

[00:20:45] : [00:20:47]

So if I take a video of this room,

[00:20:47] : [00:20:49]

and the video is a camera panning around,

[00:20:49] : [00:20:54]

there is no way I can predict

[00:20:54] : [00:20:57]

everything that's gonna be in the room as I pan around,

[00:20:57] : [00:21:00]

the system cannot predict what's gonna be in the room

[00:21:00] : [00:21:02]

as the camera is panning.

[00:21:02] : [00:21:03]

Maybe it's gonna predict,

[00:21:03] : [00:21:06]

this is a room where there's a light and there is a wall

[00:21:06] : [00:21:08]

and things like that.

[00:21:08] : [00:21:09]

It can't predict what the painting on the wall looks like

[00:21:09] : [00:21:11]

or what the texture of the couch looks like.

[00:21:11] : [00:21:14]

Certainly not the texture of the carpet.

[00:21:14] : [00:21:16]

So there's no way it can predict all those details.

[00:21:16] : [00:21:19]

So the way to handle this

[00:21:19] : [00:21:22]

or one way to possibly handle this,

[00:21:22] : [00:21:24]

which we've been working on for a long time,

[00:21:24] : [00:21:26]

is to have a model that has what's called a latent variable.

[00:21:26] : [00:21:29]

And the latent variable is fed to a neural net,

[00:21:29] : [00:21:33]

and it's supposed to represent

[00:21:33] : [00:21:34]

all the information about the world

[00:21:34] : [00:21:35]

that you don't perceive yet.

[00:21:35] : [00:21:37]

And that you need to augment the system

[00:21:37] : [00:21:42]

for the prediction to do a good job at predicting pixels,

[00:21:42] : [00:21:47]

including the fine texture of the carpet and the couch

[00:21:47] : [00:21:52]

and the painting on the wall.

[00:21:52] : [00:21:54]

That has been a complete failure, essentially.

[00:21:54] : [00:22:00]

And we've tried lots of things.

[00:22:00] : [00:22:01]

We tried just straight neural nets,

[00:22:01] : [00:22:03]

we tried GANs,

[00:22:03] : [00:22:04]

we tried VAEs,

[00:22:04] : [00:22:08]

all kinds of regularized autoencoders,

[00:22:08] : [00:22:10]

we tried many things.

[00:22:10] : [00:22:13]

We also tried those kind of methods

[00:22:13] : [00:22:15]

to learn good representations of images or video

[00:22:15] : [00:22:20]

that could then be used as input

[00:22:20] : [00:22:24]

to, for example, an image classification system.

[00:22:24] : [00:22:26]

And that also has basically failed.

[00:22:26] : [00:22:29]

Like all the systems that attempt to predict missing parts

[00:22:29] : [00:22:33]

of an image or a video

[00:22:33] : [00:22:34]

from a corrupted version of it, basically.

[00:22:34] : [00:22:40]

So, right, take an image or a video,

[00:22:40] : [00:22:41]

corrupt it or transform it in some way,

[00:22:41] : [00:22:44]

and then try to reconstruct the complete video or image

[00:22:44] : [00:22:47]

from the corrupted version.

[00:22:47] : [00:22:48]

And then hope that internally,

[00:22:48] : [00:22:52]

the system will develop good representations of images

[00:22:52] : [00:22:54]

that you can use for object recognition,

[00:22:54] : [00:22:57]

segmentation, whatever it is.

[00:22:57] : [00:22:58]

That has been essentially a complete failure.

[00:22:58] : [00:23:01]

And it works really well for text.

[00:23:01] : [00:23:04]

That's the principle thatis used for LLMs, right?

[00:23:04] : [00:23:07]

- So where's the failure exactly?

[00:23:07] : [00:23:08]

Is it that it is very difficult to form

[00:23:08] : [00:23:11]

a good representation of an image,

[00:23:11] : [00:23:14]

like a good embedding

[00:23:14] : [00:23:16]

of all the important information in the image?

[00:23:16] : [00:23:19]

Is it in terms of the consistency

[00:23:19] : [00:23:21]

of image to image to image to image that forms the video?

[00:23:21] : [00:23:24]

If we do a highlight reel of all the ways you failed.

[00:23:24] : [00:23:28]

What's that look like?

[00:23:28] : [00:23:30]

- Okay.

[00:23:30] : [00:23:31]

So the reason this doesn't work is...

[00:23:31] : [00:23:35]

First of all, I have to tell you exactly what doesn't work

[00:23:35] : [00:23:37]

because there is something else that does work.

[00:23:37] : [00:23:40]

So the thing that does not work

[00:23:40] : [00:23:41]

is training the system to learn representations of images

[00:23:41] : [00:23:46]

by training it to reconstruct a good image

[00:23:46] : [00:23:51]

from a corrupted version of it.

[00:23:51] : [00:23:53]

Okay.

[00:23:53] : [00:23:54]

That's what doesn't work.

[00:23:54] : [00:23:55]

And we have a whole slewof techniques for this

[00:23:55] : [00:23:58]

that are variants of denoising autoencoders.

[00:23:58] : [00:24:02]

Something called MAE,

[00:24:02] : [00:24:03]

developed by some ofmy colleagues at FAIR,

[00:24:03] : [00:24:05]

masked autoencoder.

[00:24:05] : [00:24:06]

So it's basically like the LLMs or things like this

[00:24:06] : [00:24:11]

where you train the system by corrupting text,

[00:24:11] : [00:24:13]

except you corrupt images.

[00:24:13] : [00:24:15]

You remove patches from it

[00:24:15] : [00:24:16]

and you train a gigantic neural network to reconstruct.

[00:24:16] : [00:24:19]
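The patch-masking step of an MAE-style setup can be sketched very simply. This is a toy sketch, not the actual MAE code; the flat list of "patches", the 75% ratio, and the fixed seed are all assumptions chosen for the example.

```python
import random

def mask_patches(image, mask_ratio=0.75, seed=0):
    # MAE-style masking: hide a large random subset of patches and keep
    # only the visible ones; a network would then be trained to
    # reconstruct the hidden patches from the visible ones.
    rng = random.Random(seed)
    n = len(image)
    masked = set(rng.sample(range(n), int(n * mask_ratio)))
    visible = [(i, p) for i, p in enumerate(image) if i not in masked]
    return visible, sorted(masked)

patches = list(range(16))         # a toy 4x4 grid of patch values
visible, masked = mask_patches(patches)
print(len(visible), len(masked))  # → 4 12
```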

The features you get are not good.

[00:24:19] : [00:24:20]

And you know they're not good

[00:24:20] : [00:24:22]

because if you now train the same architecture,

[00:24:22] : [00:24:25]

but you train it supervised, with labeled data,

[00:24:25] : [00:24:30]

with textual descriptions of images, et cetera,

[00:24:30] : [00:24:34]

you do get good representations.

[00:24:34] : [00:24:35]

And the performance on recognition tasks is much better

[00:24:35] : [00:24:39]

than if you do this self-supervised pre-training.

[00:24:39] : [00:24:41]

- So the architecture is good.

[00:24:41] : [00:24:44]

- The architecture is good.

[00:24:44] : [00:24:45]

The architecture of the encoder is good.

[00:24:45] : [00:24:47]

Okay?

[00:24:47] : [00:24:48]

But the fact that you train the system to reconstruct images

[00:24:48] : [00:24:51]

does not lead it to produce

[00:24:51] : [00:24:53]

good generic features of images.

[00:24:53] : [00:24:56]

- [Lex] When you train it in a self-supervised way.

[00:24:56] : [00:24:58]

- Self supervised by reconstruction.

[00:24:58] : [00:25:00]

- [Lex] Yeah, by reconstruction.

[00:25:00] : [00:25:01]

- Okay, so what's the alternative?

[00:25:01] : [00:25:02]

(both laugh)

[00:25:02] : [00:25:04]

The alternative is joint embedding.

[00:25:04] : [00:25:07]

- What is joint embedding?

[00:25:07] : [00:25:08]

What are these architecturesthat you're so excited about?

[00:25:08] : [00:25:11]

- Okay, so now instead of training a system

[00:25:11] : [00:25:13]

to encode the image

[00:25:13] : [00:25:14]

and then training it to reconstruct the full image

[00:25:14] : [00:25:17]

from a corrupted version,

[00:25:17] : [00:25:20]

you take the full image,

[00:25:20] : [00:25:21]

you take the corrupted or transformed version,

[00:25:21] : [00:25:25]

you run them both through encoders,

[00:25:25] : [00:25:27]

which in general are identical, but not necessarily.

[00:25:27] : [00:25:30]

And then you train a predictor on top of those encoders

[00:25:30] : [00:25:36]

to predict the representation of the full input

[00:25:36] : [00:25:42]

from the representation of the corrupted one.

[00:25:42] : [00:25:45]

Okay?

[00:25:45] : [00:25:47]

So joint embedding,

[00:25:47] : [00:25:48]

because you're taking the full input

[00:25:48] : [00:25:51]

and the corrupted version or transformed version,

[00:25:51] : [00:25:54]

run them both through encoders

[00:25:54] : [00:25:55]

so you get a joint embedding.

[00:25:55] : [00:25:57]

And then you're saying

[00:25:57] : [00:25:59]

can I predict the representation of the full one

[00:25:59] : [00:26:02]

from the representation of the corrupted one?

[00:26:02] : [00:26:04]

Okay?

[00:26:04] : [00:26:05]

And I call this a JEPA,

[00:26:05] : [00:26:07]

so that means joint embedding predictive architecture

[00:26:07] : [00:26:09]

because there's joint embedding

[00:26:09] : [00:26:11]

and there is this predictor

[00:26:11] : [00:26:12]

that predicts the representation

[00:26:12] : [00:26:13]

of the good guy from the bad guy.

[00:26:13] : [00:26:15]
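The objective Yann is describing can be sketched in a few lines. This is a toy illustration of a JEPA-style loss, not FAIR's implementation: the stand-in `encode` (a mean) and the trivial identity predictor are assumptions chosen only to show that the loss lives in representation space, not pixel space.

```python
def encode(x):
    # Stand-in encoder: keep only coarse, predictable structure (here,
    # the mean) and discard fine detail such as texture.
    return sum(x) / len(x)

def jepa_loss(full, corrupted, predictor):
    # JEPA-style objective: predict the representation of the full input
    # from the representation of the corrupted one -- never raw pixels.
    return (predictor(encode(corrupted)) - encode(full)) ** 2

full = [0.5, 1.5, 1.0, 1.0]    # "image" with fine texture
corrupted = [0.5, 1.5]         # a cropped version of it
identity = lambda r: r         # trivial predictor, enough for the sketch
print(jepa_loss(full, corrupted, identity))  # → 0.0
```

The loss is zero here because the crop preserves the coarse structure the encoder keeps, even though the pixels differ: exactly the point of predicting in representation space.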

And the big question is

[00:26:15] : [00:26:18]

how do you train something like this?

[00:26:18] : [00:26:20]

And until five years ago or six years ago,

[00:26:20] : [00:26:23]

we didn't have particularly good answers

[00:26:23] : [00:26:26]

for how you train those things,

[00:26:26] : [00:26:27]

except for one called contrastive learning.

[00:26:27] : [00:26:31]

And the idea of contrastive learning

[00:26:31] : [00:26:36]

is you take a pair of images

[00:26:36] : [00:26:38]

that are, again, an image and a corrupted version

[00:26:38] : [00:26:42]

or degraded version somehow

[00:26:42] : [00:26:44]

or transformed version of the original one.

[00:26:44] : [00:26:47]

And you train the representation predicted from the corrupted one

[00:26:47] : [00:26:49]

to match that of the original.

[00:26:49] : [00:26:51]

If you only do this,

[00:26:51] : [00:26:52]

this system collapses.

[00:26:52] : [00:26:53]

It basically completely ignores the input

[00:26:53] : [00:26:55]

and produces representations that are constant.

[00:26:55] : [00:26:58]

So the contrastive methods avoid this.

[00:26:58] : [00:27:02]

And those things have been around since the early '90s,

[00:27:02] : [00:27:05]

I had a paper on this in 1993,

[00:27:05] : [00:27:07]

is you also show pairs of images that you know are different

[00:27:07] : [00:27:13]

and then you push the representations away from each other.

[00:27:13] : [00:27:17]

So you say not only should representations of things

[00:27:17] : [00:27:20]

that we know are the same

[00:27:20] : [00:27:22]

be the same or be similar,

[00:27:22] : [00:27:23]

but representations of things that we know are different

[00:27:23] : [00:27:25]

should be different.

[00:27:25] : [00:27:26]

And that prevents the collapse,

[00:27:26] : [00:27:29]

but it has some limitations.

[00:27:29] : [00:27:30]
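A minimal sketch of the contrastive idea just described: pull same-image pairs together, push different-image pairs apart. This is a toy hinge-style loss on scalar "representations", not any specific published method; the margin and the numbers are invented for the example.

```python
def contrastive_loss(anchor, positive, negative, margin=1.0):
    # Pull the anchor toward the positive (a transformed version of the
    # same image) and, hinge-style, push it away from the negative
    # (a different image). The negative term is what prevents collapse.
    d_pos = (anchor - positive) ** 2
    d_neg = (anchor - negative) ** 2
    return d_pos + max(0.0, margin - d_neg)

# Collapsed representations (everything maps to one point) are penalized:
print(contrastive_loss(0.0, 0.0, 0.0))  # → 1.0
# Positives close, negatives far apart: a low loss.
print(contrastive_loss(0.0, 0.1, 2.0))
```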

And there's a whole bunch of techniques

[00:27:30] : [00:27:31]

that have appeared over the last six, seven years

[00:27:31] : [00:27:35]

that can revive this type of method.

[00:27:35] : [00:27:38]

Some of them from FAIR,

[00:27:38] : [00:27:40]

some of them from Google and other places.

[00:27:40] : [00:27:44]

But there are limitations tothose contrastive methods.

[00:27:44] : [00:27:47]

What has changed in the last three, four years

[00:27:47] : [00:27:51]

is now we have methodsthat are non-contrastive.

[00:27:51] : [00:27:54]

So they don't require those negative contrastive samples

[00:27:54] : [00:27:59]

of images that we know are different.

[00:27:59] : [00:28:01]

You train them only with images

[00:28:01] : [00:28:04]

that are different versions

[00:28:04] : [00:28:06]

or different views of the same thing.

[00:28:06] : [00:28:08]

And you rely on some other tweaks

[00:28:08] : [00:28:10]

to prevent the system from collapsing.

[00:28:10] : [00:28:12]

And we have half a dozen different methods for this now.

[00:28:12] : [00:28:16]

- So what is the fundamental difference

[00:28:16] : [00:28:17]

between joint embedding architectures and LLMs?

[00:28:17] : [00:28:22]

So can JEPA take us to AGI?

[00:28:22] : [00:28:26]

Whether we should say that you don't like the term AGI

[00:28:26] : [00:28:31]

and we'll probably argue,

[00:28:31] : [00:28:33]

I think every single time I've talked to you

[00:28:33] : [00:28:34]

we've argued about the G in AGI.

[00:28:34] : [00:28:36]

- [Yann] Yes.

[00:28:36] : [00:28:38]

- I get it, I get it, I get it. (laughing)

[00:28:38] : [00:28:40]

Well we'll probablycontinue to argue about it.

[00:28:40] : [00:28:42]

It's great.

[00:28:42] : [00:28:43]

Because you're like French,

[00:28:43] : [00:28:48]

and ami is, I guess, friend in French-

[00:28:48] : [00:28:51]

- [Yann] Yes.

[00:28:51] : [00:28:52]

- And AMI stands for advanced machine intelligence-

[00:28:52] : [00:28:55]

- [Yann] Right.

[00:28:55] : [00:28:56]

- But either way, can JEPA take us to that,

[00:28:56] : [00:29:00]

towards that advanced machine intelligence?

[00:29:00] : [00:29:02]

- Well, so it's a first step.

[00:29:02] : [00:29:04]

Okay?

[00:29:04] : [00:29:05]

So first of all, what's the difference

[00:29:05] : [00:29:07]

with generative architectures like LLMs?

[00:29:07] : [00:29:10]

So LLMs or vision systems that are trained by reconstruction

[00:29:10] : [00:29:15]

generate the inputs, right?

[00:29:15] : [00:29:20]

They generate the original input

[00:29:20] : [00:29:22]

that is non-corrupted,non-transformed, right?

[00:29:22] : [00:29:27]

So you have to predict all the pixels.

[00:29:27] : [00:29:28]

And there is a huge amount of resources spent in the system

[00:29:28] : [00:29:33]

to actually predict all those pixels, all the details.

[00:29:33] : [00:29:36]

In a JEPA, you're not trying to predict all the pixels,

[00:29:36] : [00:29:40]

you're only trying to predict

[00:29:40] : [00:29:42]

an abstract representation of the inputs, right?

[00:29:42] : [00:29:47]

And that's much easier in many ways.

[00:29:47] : [00:29:49]

So what the JEPA system

[00:29:49] : [00:29:50]

when it's being trained is trying to do,

[00:29:50] : [00:29:52]

is extract as much information as possible from the input,

[00:29:52] : [00:29:56]

but yet only extract information

[00:29:56] : [00:29:58]

that is relatively easily predictable.

[00:29:58] : [00:30:00]

Okay.

[00:30:00] : [00:30:02]

So there's a lot of things in the world

[00:30:02] : [00:30:03]

that we cannot predict.

[00:30:03] : [00:30:04]

Like for example, if you have a self-driving car

[00:30:04] : [00:30:07]

driving down the street or road.

[00:30:07] : [00:30:08]

There may be trees around the road.

[00:30:08] : [00:30:13]

And it could be a windy day,

[00:30:13] : [00:30:14]

so the leaves on the tree are kind of moving

[00:30:14] : [00:30:17]

in kind of semi-chaotic random ways

[00:30:17] : [00:30:19]

that you can't predict and you don't care,

[00:30:19] : [00:30:22]

you don't want to predict.

[00:30:22] : [00:30:23]

So what you want is your encoder

[00:30:23] : [00:30:25]

to basically eliminate all those details.

[00:30:25] : [00:30:27]

It'll tell you there's moving leaves,

[00:30:27] : [00:30:28]

but it's not gonna keep the details

[00:30:28] : [00:30:30]

of exactly what's going on.

[00:30:30] : [00:30:32]

And so when you do the prediction in representation space,

[00:30:32] : [00:30:35]

you're not going to have to predict

[00:30:35] : [00:30:37]

every single pixel of every leaf.

[00:30:37] : [00:30:38]

And that not only is a lot simpler,

[00:30:38] : [00:30:43]

but also it allows the system

[00:30:43] : [00:30:45]

to essentially learn an abstract representation of the world

[00:30:45] : [00:30:49]

where what can be modeled and predicted is preserved

[00:30:49] : [00:30:54]

and the rest is viewed as noise

[00:30:54] : [00:30:57]

and eliminated by the encoder.

[00:30:57] : [00:30:59]

So it kind of lifts the level of abstraction

[00:30:59] : [00:31:00]

of the representation.

[00:31:00] : [00:31:02]

If you think about this,

[00:31:02] : [00:31:03]

this is something we do absolutely all the time.

[00:31:03] : [00:31:05]

Whenever we describe a phenomenon,

[00:31:05] : [00:31:07]

we describe it at a particular level of abstraction.

[00:31:07] : [00:31:10]

And we don't always describe every natural phenomenon

[00:31:10] : [00:31:13]

in terms of quantum field theory, right?

[00:31:13] : [00:31:15]

That would be impossible, right?

[00:31:15] : [00:31:17]

So we have multiple levels of abstraction

[00:31:17] : [00:31:19]

to describe what happens in the world.

[00:31:19] : [00:31:22]

Starting from quantum field theory

[00:31:22] : [00:31:24]

to like atomic theory and molecules in chemistry,

[00:31:24] : [00:31:27]

materials,

[00:31:27] : [00:31:29]

all the way up to kind of concrete objects in the real world

[00:31:29] : [00:31:33]

and things like that.

[00:31:33] : [00:31:34]

So we can't just only model everything at the lowest level.

[00:31:34] : [00:31:39]

And that's what the idea of JEPA is really about.

[00:31:39] : [00:31:44]

Learn abstract representations in a self-supervised manner.

[00:31:44] : [00:31:49]

And you can do it hierarchically as well.

[00:31:49] : [00:31:52]

So that I think is an essential component

[00:31:52] : [00:31:54]

of an intelligent system.

[00:31:54] : [00:31:56]

And in language, we can get away without doing this

[00:31:56] : [00:31:58]

because language is already, to some level, abstract

[00:31:58] : [00:32:02]

and already has eliminated a lot of information

[00:32:02] : [00:32:05]

that is not predictable.

[00:32:05] : [00:32:07]

And so we can get away without doing the joint embedding,

[00:32:07] : [00:32:11]

without lifting the abstraction level

[00:32:11] : [00:32:13]

and by directly predicting words.

[00:32:13] : [00:32:15]

- So joint embedding.

[00:32:15] : [00:32:17]

It's still generative,

[00:32:17] : [00:32:20]

but it's generative in this abstract representation space.

[00:32:20] : [00:32:23]

- [Yann] Yeah.

[00:32:23] : [00:32:24]

- And you're saying language,

[00:32:24] : [00:32:25]

we were lazy with language

[00:32:25] : [00:32:27]

'cause we already got the abstract representation for free

[00:32:27] : [00:32:30]

and now we have to zoom out,

[00:32:30] : [00:32:32]

actually think about generally intelligent systems,

[00:32:32] : [00:32:34]

we have to deal with the full mess

[00:32:34] : [00:32:37]

of physical reality.

[00:32:37] : [00:32:40]

And you do have to do this step

[00:32:40] : [00:32:42]

of jumping from the full, rich, detailed reality

[00:32:42] : [00:32:47]

to an abstract representation of that reality

[00:32:47] : [00:32:54]

on top of which you can then reason

[00:32:54] : [00:32:56]

and all that kind of stuff.

[00:32:56] : [00:32:57]

- Right.

[00:32:57] : [00:32:58]

And the thing is those self-supervised algorithms

[00:32:58] : [00:33:00]

that learn by prediction,

[00:33:00] : [00:33:02]

even in representation space,

[00:33:02] : [00:33:04]

they learn more concepts

[00:33:04] : [00:33:09]

if the input data you feed them is more redundant.

[00:33:09] : [00:33:12]

The more redundancy there is in the data,

[00:33:12] : [00:33:14]

the more they're able to capture

[00:33:14] : [00:33:15]

some internal structure of it.

[00:33:15] : [00:33:17]

And so there,

[00:33:17] : [00:33:18]

there is way more redundancy in the structure

[00:33:18] : [00:33:21]

in perceptual inputs, sensory input like vision,

[00:33:21] : [00:33:26]

than there is in text,

[00:33:26] : [00:33:28]

which is not nearly as redundant.

[00:33:28] : [00:33:30]

This is back to the question you were asking

[00:33:30] : [00:33:32]

a few minutes ago.

[00:33:32] : [00:33:33]

Language might represent more information really

[00:33:33] : [00:33:35]

because it's already compressed,

[00:33:35] : [00:33:36]

you're right about that.

[00:33:36] : [00:33:38]

But that means it's also less redundant.

[00:33:38] : [00:33:40]

And so self-supervised learning alone will not work as well.

[00:33:40] : [00:33:43]

- Is it possible to join

[00:33:43] : [00:33:45]

the self-supervised training on visual data

[00:33:45] : [00:33:49]

and self-supervised training on language data?

[00:33:49] : [00:33:53]

There is a huge amount of knowledge

[00:33:53] : [00:33:56]

even though you talk down about those 10 to the 13 tokens.

[00:33:56] : [00:34:00]

Those 10 to the 13 tokens

[00:34:00] : [00:34:01]

represent the entirety,

[00:34:01] : [00:34:03]

a large fraction of what we humans have figured out.

[00:34:03] : [00:34:08]

Both the shit talk on Reddit

[00:34:08] : [00:34:11]

and the contents of all the books and the articles

[00:34:11] : [00:34:14]

and the full spectrum of human intellectual creation.

[00:34:14] : [00:34:18]

So is it possible to join those two together?

[00:34:18] : [00:34:22]

- Well, eventually, yes,

[00:34:22] : [00:34:23]

but I think if we do this too early,

[00:34:23] : [00:34:27]

we run the risk of being tempted to cheat.

[00:34:27] : [00:34:30]

And in fact, that's what people are doing at the moment

[00:34:30] : [00:34:32]

with vision-language models.

[00:34:32] : [00:34:33]

We're basically cheating.

[00:34:33] : [00:34:35]

We are using language as a crutch

[00:34:35] : [00:34:38]

to help the deficiencies of our vision systems

[00:34:38] : [00:34:42]

to kind of learn good representations from images and video.

[00:34:42] : [00:34:46]

And the problem with this

[00:34:46] : [00:34:47]

is that we might improve our vision-language systems a bit,

[00:34:47] : [00:34:52]

I mean our language models by feeding them images.

[00:34:52] : [00:34:58]

But we're not gonna get to the level

[00:34:58] : [00:34:59]

of even the intelligence

[00:34:59] : [00:35:01]

or level of understanding of the world

[00:35:01] : [00:35:03]

of a cat or a dog, which doesn't have language.

[00:35:03] : [00:35:06]

They don't have language

[00:35:06] : [00:35:08]

and they understand the world much better than any LLM.

[00:35:08] : [00:35:12]

They can plan really complex actions

[00:35:12] : [00:35:14]

and sort of imagine the result of a bunch of actions.

[00:35:14] : [00:35:17]

How do we get machines to learn that

[00:35:17] : [00:35:20]

before we combine that with language?

[00:35:20] : [00:35:22]

Obviously, if we combine this with language,

[00:35:22] : [00:35:24]

this is gonna be a winner,

[00:35:24] : [00:35:26]

but before that we have to focus

[00:35:26] : [00:35:30]

on like how do we get systems to learn how the world works?

[00:35:30] : [00:35:33]

- So this kind of joint embedding predictive architecture,

[00:35:33] : [00:35:37]

for you, that's gonna be able to learn

[00:35:37] : [00:35:40]

something like common sense,

[00:35:40] : [00:35:41]

something like what a cat uses

[00:35:41] : [00:35:43]

to predict how to mess with its owner most optimally

[00:35:43] : [00:35:47]

by knocking over a thing.

[00:35:47] : [00:35:49]

- That's the hope.

[00:35:49] : [00:35:51]

In fact, the techniques we're using are non-contrastive.

[00:35:51] : [00:35:54]

So not only is the architecture non-generative,

[00:35:54] : [00:35:57]

the learning procedures we're using are non-contrastive.

[00:35:57] : [00:36:00]

We have two sets of techniques.

[00:36:00] : [00:36:02]

One set is based on distillation

[00:36:02] : [00:36:05]

and there's a number of methods that use this principle.

[00:36:05] : [00:36:10]

One by DeepMind called BYOL.

[00:36:10] : [00:36:12]

A couple by FAIR,

[00:36:12] : [00:36:14]

one called VICReg and another one called I-JEPA.

[00:36:14] : [00:36:19]

And VICReg, I should say,

[00:36:19] : [00:36:21]

is not a distillation method actually,

[00:36:21] : [00:36:23]

but I-JEPA and BYOL certainly are.

[00:36:23] : [00:36:25]

And there's another one also called DINO, or Dino,

[00:36:25] : [00:36:28]

also produced at FAIR.

[00:36:28] : [00:36:31]

And the idea of those things

[00:36:31] : [00:36:32]

is that you take the full input, let's say an image.

[00:36:32] : [00:36:35]

You run it through an encoder,

[00:36:35] : [00:36:37]

produces a representation.

[00:36:37] : [00:36:41]

And then you corrupt that input or transform it,

[00:36:41] : [00:36:43]

run it through essentially what amounts to the same encoder

[00:36:43] : [00:36:46]

with some minor differences.

[00:36:46] : [00:36:48]

And then train a predictor.

[00:36:48] : [00:36:50]

Sometimes a predictor is very simple,

[00:36:50] : [00:36:51]

sometimes it doesn't exist.

[00:36:51] : [00:36:53]

But train a predictor to predict a representation

[00:36:53] : [00:36:55]

of the first uncorrupted input from the corrupted input.

[00:36:55] : [00:37:00]

But you only train the second branch.

[00:37:00] : [00:37:04]

You only train the part of the network

[00:37:04] : [00:37:07]

that is fed with the corrupted input.

[00:37:07] : [00:37:10]

The other network, you don't train.

[00:37:10] : [00:37:12]

But since they share the same weights,

[00:37:12] : [00:37:14]

when you modify the first one,

[00:37:14] : [00:37:15]

it also modifies the second one.

[00:37:15] : [00:37:18]

And with various tricks,

[00:37:18] : [00:37:19]

you can prevent the system from collapsing

[00:37:19] : [00:37:21]

with the collapse of the type I was explaining before

[00:37:21] : [00:37:24]

where the system basically ignores the input.

[00:37:24] : [00:37:26]
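One of the "various tricks" in this family of distillation methods can be sketched in a couple of lines: the teacher branch gets no gradients, and its weights simply track the student. This is a toy sketch, not the actual BYOL or DINO code; the decay rate, the two-weight "network", and the numbers are assumptions for the example.

```python
def ema_update(teacher, student, tau=0.99):
    # Distillation-style update: the teacher branch receives no gradient
    # (stop-gradient); its weights just track the student as an
    # exponential moving average, one trick that helps prevent collapse.
    return [tau * t + (1 - tau) * s for t, s in zip(teacher, student)]

teacher, student = [0.0, 0.0], [1.0, 2.0]
for _ in range(3):                     # three "training steps"
    teacher = ema_update(teacher, student)
print([round(t, 4) for t in teacher])  # → [0.0297, 0.0594]
```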

So that works very well.

[00:37:26] : [00:37:31]

The two techniques we've developed at FAIR,

[00:37:31] : [00:37:33]

DINO and I-JEPA work really well for that.

[00:37:33] : [00:37:38]

- So what kind of data are we talking about here?

[00:37:38] : [00:37:41]

- So there's several scenarios.

[00:37:41] : [00:37:43]

One scenario is you take an image,

[00:37:43] : [00:37:47]

you corrupt it by changing the cropping, for example,

[00:37:47] : [00:37:52]

changing the size a little bit,

[00:37:52] : [00:37:54]

maybe changing the orientation, blurring it,

[00:37:54] : [00:37:56]

changing the colors,

[00:37:56] : [00:37:58]

doing all kinds of horrible things to it-

[00:37:58] : [00:38:00]

- But basic horrible things.

[00:38:00] : [00:38:01]

- Basic horrible things

[00:38:01] : [00:38:02]

that sort of degrade the quality a little bit

[00:38:02] : [00:38:04]

and change the framing,

[00:38:04] : [00:38:05]

crop the image.

[00:38:05] : [00:38:08]

And in some cases, in the case of I-JEPA,

[00:38:08] : [00:38:12]

you don't need to do any of this,

[00:38:12] : [00:38:13]

you just mask some parts of it, right?

[00:38:13] : [00:38:16]

You just basically remove some regions

[00:38:16] : [00:38:19]

like a big block, essentially.

[00:38:19] : [00:38:21]

And then run through the encoders

[00:38:21] : [00:38:24]

and train the entire system,

[00:38:24] : [00:38:26]

encoder and predictor,

[00:38:26] : [00:38:27]

to predict the representationof the good one

[00:38:27] : [00:38:29]

from the representationof the corrupted one.

[00:38:29] : [00:38:31]

So that's the I-JEPA.

[00:38:31] : [00:38:35]

It doesn't need to know that it's an image, for example,

[00:38:35] : [00:38:38]

because the only thing it needs to know

[00:38:38] : [00:38:39]

is how to do this masking.

[00:38:39] : [00:38:40]

Whereas with DINO,

[00:38:40] : [00:38:43]

you need to know it's an image

[00:38:43] : [00:38:44]

because you need to do things

[00:38:44] : [00:38:45]

like geometry transformation and blurring

[00:38:45] : [00:38:48]

and things like that thatare really image specific.

[00:38:48] : [00:38:51]

A more recent version of this that we have is called V-JEPA.

[00:38:51] : [00:38:53]

So it's basically the same idea as I-JEPA

[00:38:53] : [00:38:56]

except it's applied to video.

[00:38:56] : [00:38:59]

So now you take a whole video

[00:38:59] : [00:39:00]

and you mask a whole chunk of it.

[00:39:00] : [00:39:02]

And what we mask is actually kind of a temporal tube.

[00:39:02] : [00:39:04]

So like a whole segment of each frame in the video

[00:39:04] : [00:39:07]

over the entire video.

[00:39:07] : [00:39:10]

- And that tube is like statically positioned

[00:39:10] : [00:39:12]

throughout the frames?

[00:39:12] : [00:39:14]

It's literally just a straight tube?

[00:39:14] : [00:39:15]

- Throughout the tube, yeah.

[00:39:15] : [00:39:17]

Typically it's 16 frames or something,

[00:39:17] : [00:39:18]

and we mask the same region over the entire 16 frames.

[00:39:18] : [00:39:22]

It's a different one for every video, obviously.

[00:39:22] : [00:39:24]
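The temporal tube just described is easy to picture in code: the same spatial block hidden in every frame. A toy sketch; the 8x8 patch grid and block coordinates are invented for the example, not V-JEPA's actual parameters.

```python
def tube_mask(n_frames, height, width, y0, y1, x0, x1):
    # V-JEPA-style masking: hide the same spatial block in every frame,
    # forming a "temporal tube" through the video.
    assert 0 <= y0 <= y1 <= height and 0 <= x0 <= x1 <= width
    return {(t, y, x)
            for t in range(n_frames)
            for y in range(y0, y1)
            for x in range(x0, x1)}

# 16 toy frames of 8x8 patches, with the same 4x4 block masked in each.
mask = tube_mask(n_frames=16, height=8, width=8, y0=2, y1=6, x0=2, x1=6)
print(len(mask))  # → 256
```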

And then again, train that system

[00:39:24] : [00:39:28]

so as to predict the representation of the full video

[00:39:28] : [00:39:31]

from the partially masked video.

[00:39:31] : [00:39:34]

And that works really well.

[00:39:34] : [00:39:35]

It's the first system that we have

[00:39:35] : [00:39:36]

that learns good representations of video

[00:39:36] : [00:39:39]

so that when you feed those representations

[00:39:39] : [00:39:41]

to a supervised classifier head,

[00:39:41] : [00:39:44]

it can tell you what action is taking place in the video

[00:39:44] : [00:39:47]

with pretty good accuracy.

[00:39:47] : [00:39:49]

So it's the first time we get something of that quality.

[00:39:49] : [00:39:55]

- So that's a good test

[00:39:55] : [00:39:57]

that a good representation is formed.

[00:39:57] : [00:39:58]

That means there's something to this.

[00:39:58] : [00:40:00]

- Yeah.

[00:40:00] : [00:40:01]

We also have preliminary results

[00:40:01] : [00:40:03]

that seem to indicate

[00:40:03] : [00:40:05]

that the representation allows our system to tell

[00:40:05] : [00:40:09]

whether the video is physically possible

[00:40:09] : [00:40:12]

or completely impossible

[00:40:12] : [00:40:13]

because some object disappeared

[00:40:13] : [00:40:15]

or an object suddenly jumped from one location to another

[00:40:15] : [00:40:19]

or changed shape or something.

[00:40:19] : [00:40:21]

- So it's able to capture some physics-based constraints

[00:40:21] : [00:40:26]

about the reality represented in the video?

[00:40:26] : [00:40:29]

- [Yann] Yeah.

[00:40:29] : [00:40:30]

- About the appearance and the disappearance of objects?

[00:40:30] : [00:40:32]

- Yeah.

[00:40:32] : [00:40:34]

That's really new.

[00:40:34] : [00:40:35]

- Okay, but can this actually

[00:40:35] : [00:40:38]

get us to this kind of world model

[00:40:38] : [00:40:43]

that understands enough about the world

[00:40:43] : [00:40:46]

to be able to drive a car?

[00:40:46] : [00:40:48]

- Possibly.

[00:40:48] : [00:40:50]

And this is gonna take a while

[00:40:50] : [00:40:51]

before we get to that point.

[00:40:51] : [00:40:52]

And there are systems already, robotic systems,

[00:40:52] : [00:40:56]

that are based on this idea.

[00:40:56] : [00:40:58]

What you need for this

[00:40:58] : [00:41:02]

is a slightly modified version of this

[00:41:02] : [00:41:04]

where imagine that you have a video,

[00:41:04] : [00:41:09]

a complete video,

[00:41:09] : [00:41:12]

and what you're doing to this video

[00:41:12] : [00:41:13]

is that you are either translating it in time

[00:41:13] : [00:41:17]

towards the future.

[00:41:17] : [00:41:18]

So you'll only see the beginning of the video,

[00:41:18] : [00:41:19]

but you don't see the latter part of it

[00:41:19] : [00:41:21]

that is in the original one.

[00:41:21] : [00:41:23]

Or you just mask the second half of the video, for example.

[00:41:23] : [00:41:27]

And then you train this I-JEPA system

[00:41:27] : [00:41:30]

or the type I described,

[00:41:30] : [00:41:32]

to predict the representation of the full video

[00:41:32] : [00:41:33]

from the shifted one.

[00:41:33] : [00:41:36]

But you also feed the predictor with an action.

[00:41:36] : [00:41:39]

For example, the wheel is turned

[00:41:39] : [00:41:42]

10 degrees to the right or something, right?

[00:41:42] : [00:41:45]

So if it's a dash cam in a car

[00:41:45] : [00:41:49]

and you know the angle of the wheel,

[00:41:49] : [00:41:51]

you should be able to predict to some extent

[00:41:51] : [00:41:53]

what's going to happen to what you see.

[00:41:53] : [00:41:56]

You're not gonna be able to predict all the details

[00:41:56] : [00:41:59]

of objects that appear in the view, obviously,

[00:41:59] : [00:42:02]

but at an abstract representation level,

[00:42:02] : [00:42:05]

you can probably predict what's gonna happen.

[00:42:05] : [00:42:08]

So now what you have is an internal model

[00:42:08] : [00:42:12]

that says, here is my idea

[00:42:12] : [00:42:13]

of the state of the world at time T,

[00:42:13] : [00:42:15]

here is an action I'm taking,

[00:42:15] : [00:42:17]

here is a prediction

[00:42:17] : [00:42:18]

of the state of the world at time T plus one,

[00:42:18] : [00:42:20]

T plus delta T,

[00:42:20] : [00:42:22]

T plus two seconds, whatever it is.

[00:42:22] : [00:42:24]
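The state-action-state loop just described can be sketched directly. A toy illustration with made-up 1-D dynamics (the "abstract state" is a single number and the action a displacement), not an actual learned world model:

```python
def world_model(state, action):
    # Toy dynamics in an abstract 1-D state space: the action is a
    # displacement applied over one time step.
    return state + action

def rollout(state, actions):
    # Imagine a sequence of actions and predict the resulting states
    # s(T+1), s(T+2), ... -- the loop at the heart of planning.
    states = [state]
    for a in actions:
        states.append(world_model(states[-1], a))
    return states

print(rollout(0.0, [1.0, 1.0, -0.5]))  # → [0.0, 1.0, 2.0, 1.5]
```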

If you have a model of this type,

[00:42:24] : [00:42:26]

you can use it for planning.

[00:42:26] : [00:42:27]

So now you can do what LLMs cannot do,

[00:42:27] : [00:42:31]

which is planning what you're gonna do

[00:42:31] : [00:42:34]

so as you arrive at a particular outcome

[00:42:34] : [00:42:37]

or satisfy a particular objective, right?

[00:42:37] : [00:42:40]

So you can have a numberof objectives, right?

[00:42:40] : [00:42:44]

I can predict that if I havean object like this, right?

[00:42:44] : [00:42:49]

And I open my hand,

[00:42:49] : [00:42:52]

it's gonna fall, right?

[00:42:52] : [00:42:54]

And if I push it with aparticular force on the table,

[00:42:54] : [00:42:57]

it's gonna move.

[00:42:57] : [00:42:58]

If I push the table itself,

[00:42:58] : [00:43:00]

it's probably not gonna move with the same force.

[00:43:00] : [00:43:03]

So we have this internal model of the world in our mind,

[00:43:03] : [00:43:07]

which allows us to plan sequences of actions

[00:43:07] : [00:43:11]

to arrive at a particular goal.

[00:43:11] : [00:43:13]

And so now if you have this world model,

[00:43:13] : [00:43:18]

we can imagine a sequence of actions,

[00:43:18] : [00:43:21]

predict what the outcome

[00:43:21] : [00:43:22]

of the sequence of action is going to be,

[00:43:22] : [00:43:25]

measure to what extent the final state

[00:43:25] : [00:43:28]

satisfies a particular objective

[00:43:28] : [00:43:30]

like moving the bottleto the left of the table.

[00:43:30] : [00:43:35]

And then plan a sequence of actions

[00:43:35] : [00:43:38]

that will minimize this objective at runtime.

[00:43:38] : [00:43:41]

We're not talking about learning,

[00:43:41] : [00:43:43]

we're talking about inference time, right?

[00:43:43] : [00:43:44]

So this is planning, really.

[00:43:44] : [00:43:46]

And in optimal control,

[00:43:46] : [00:43:47]

this is a very classical thing.

[00:43:47] : [00:43:48]

It's called model predictive control.

[00:43:48] : [00:43:50]

You have a model of the system you want to control

[00:43:50] : [00:43:53]

that can predict the sequence of states

[00:43:53] : [00:43:55]

corresponding to a sequence of commands.

[00:43:55] : [00:43:58]

And you are planning a sequence of commands

[00:43:58] : [00:44:02]

so that according to your world model,

[00:44:02] : [00:44:04]

the end state of the system

[00:44:04] : [00:44:06]

will satisfy any objectives that you fix.

[00:44:06] : [00:44:10]

This is the way rocket trajectories have been planned

[00:44:10] : [00:44:15]

since computers have been around.

[00:44:15] : [00:44:17]

So since the early '60s, essentially.

[00:44:17] : [00:44:20]
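The loop Yann describes can be sketched as a toy "random shooting" form of model predictive control. Everything here is illustrative: the world model is a one-dimensional position update, the objective is distance to a goal, and all names are hypothetical, not from any real library.

```python
# Toy model predictive control by random shooting: sample candidate
# action sequences, roll each through a world model *in imagination*,
# and keep the sequence whose final state best satisfies the objective.
import random

def world_model(state, action):
    # Predict the state at time T+1 from the state at time T and an action.
    return state + action

def objective(state, goal):
    # Cost to minimize: distance between the final state and the goal.
    return abs(state - goal)

def plan(initial_state, goal, horizon=5, n_candidates=200, seed=0):
    rng = random.Random(seed)
    best_cost, best_seq = float("inf"), None
    for _ in range(n_candidates):
        seq = [rng.uniform(-1.0, 1.0) for _ in range(horizon)]
        state = initial_state
        for a in seq:                 # imagined rollout, no real actions taken
            state = world_model(state, a)
        cost = objective(state, goal)
        if cost < best_cost:
            best_cost, best_seq = cost, seq
    return best_seq, best_cost

best_seq, best_cost = plan(initial_state=0.0, goal=2.0)
```

In a real controller this loop re-runs at every step, executing only the first command and re-planning as new observations come in.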

- So yes, for model predictive control,

[00:44:20] : [00:44:21]

but you also often talk about hierarchical planning.

[00:44:21] : [00:44:26]

- [Yann] Yeah.

[00:44:26] : [00:44:26]

- Can hierarchical planning emerge from this somehow?

[00:44:26] : [00:44:28]

- Well, so no.

[00:44:28] : [00:44:29]

You will have to build a specific architecture

[00:44:29] : [00:44:32]

to allow for hierarchical planning.

[00:44:32] : [00:44:34]

So hierarchical planning is absolutely necessary

[00:44:34] : [00:44:36]

if you want to plan complex actions.

[00:44:36] : [00:44:39]

If I wanna go from, let's say, New York to Paris,

[00:44:39] : [00:44:43]

this is the example I use all the time.

[00:44:43] : [00:44:45]

And I'm sitting in my office at NYU.

[00:44:45] : [00:44:48]

My objective that I need to minimize

[00:44:48] : [00:44:50]

is my distance to Paris.

[00:44:50] : [00:44:52]

At a high level,

[00:44:52] : [00:44:52]

a very abstract representation of my location,

[00:44:52] : [00:44:57]

I would have to decompose this into two sub-goals.

[00:44:57] : [00:44:59]

First one is go to the airport,

[00:44:59] : [00:45:02]

second one is catch a plane to Paris.

[00:45:02] : [00:45:04]

Okay.

[00:45:04] : [00:45:05]

So my sub-goal is now going to the airport.

[00:45:05] : [00:45:09]

My objective function is my distance to the airport.

[00:45:09] : [00:45:11]

How do I go to the airport?

[00:45:11] : [00:45:14]

Well, I have to go in the street and hail a taxi,

[00:45:14] : [00:45:18]

which you can do in New York.

[00:45:18] : [00:45:19]

Okay, now I have another sub-goal.

[00:45:19] : [00:45:22]

Go down on the street.

[00:45:22] : [00:45:24]

Well, that means going to the elevator,

[00:45:24] : [00:45:27]

going down the elevator,

[00:45:27] : [00:45:28]

walk out to the street.

[00:45:28] : [00:45:30]

How do I go to the elevator?

[00:45:30] : [00:45:32]

I have to stand up from my chair,

[00:45:32] : [00:45:36]

open the door of my office,

[00:45:36] : [00:45:38]

go to the elevator, push the button.

[00:45:38] : [00:45:40]

How do I get up from my chair?

[00:45:40] : [00:45:42]

Like you can imagine going down all the way down

[00:45:42] : [00:45:45]

to basically what amounts

[00:45:45] : [00:45:47]

to millisecond by millisecond muscle control.

[00:45:47] : [00:45:50]

Okay?

[00:45:50] : [00:45:51]

And obviously you're not going to plan your entire trip

[00:45:51] : [00:45:55]

from New York to Paris

[00:45:55] : [00:45:56]

in terms of millisecond by millisecond muscle control.

[00:45:56] : [00:46:00]

First, that would be incredibly expensive,

[00:46:00] : [00:46:02]

but it will also be completely impossible

[00:46:02] : [00:46:03]

because you don't know all the conditions

[00:46:03] : [00:46:06]

of what's gonna happen.

[00:46:06] : [00:46:07]

How long it's gonna take to catch a taxi

[00:46:07] : [00:46:10]

or to go to the airport with traffic.

[00:46:10] : [00:46:13]

I mean, you would have to know exactly

[00:46:13] : [00:46:16]

the condition of everything

[00:46:16] : [00:46:18]

to be able to do this planning,

[00:46:18] : [00:46:19]

and you don't have the information.

[00:46:19] : [00:46:21]

So you have to do this hierarchical planning

[00:46:21] : [00:46:23]

so that you can start acting

[00:46:23] : [00:46:25]

and then sort of re-planning as you go.

[00:46:25] : [00:46:27]

And nobody really knows how to do this in AI.

[00:46:27] : [00:46:32]

Nobody knows how to train a system

[00:46:32] : [00:46:35]

to learn the appropriate multiple levels of representation

[00:46:35] : [00:46:38]

so that hierarchical planning works.

[00:46:38] : [00:46:41]
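The New-York-to-Paris decomposition above can be written out as a toy recursive expansion. The sub-goal table is hand-written purely for illustration; the open problem Yann points at is learning such decompositions and their levels of representation, not writing them by hand.

```python
# Illustrative hierarchical planning as leveled goal decomposition.
# Goals expand only a bounded number of levels deep; lower levels are
# left abstract, to be re-planned later as conditions become known.
SUBGOALS = {
    "go from New York to Paris": ["go to the airport", "catch a plane to Paris"],
    "go to the airport": ["go down to the street", "hail a taxi"],
    "go down to the street": ["go to the elevator", "ride elevator down",
                              "walk out to the street"],
    "go to the elevator": ["stand up from chair", "open office door",
                           "walk to elevator", "push the button"],
}

def expand(goal, depth):
    """Expand a goal `depth` levels; anything deeper stays abstract."""
    if depth == 0 or goal not in SUBGOALS:
        return [goal]                      # treat as a primitive for now
    plan = []
    for sub in SUBGOALS[goal]:
        plan.extend(expand(sub, depth - 1))
    return plan

coarse = expand("go from New York to Paris", depth=1)
fine = expand("go from New York to Paris", depth=3)
```

Note that even `fine` stops well short of millisecond-level muscle control: the unexpanded entries are exactly the parts you re-plan when you get there.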

- Does something like that already emerge?

[00:46:41] : [00:46:42]

So like can you use an LLM,

[00:46:42] : [00:46:45]

state-of-the-art LLM,

[00:46:45] : [00:46:48]

to get you from New York to Paris

[00:46:48] : [00:46:50]

by doing exactly the kind of detailed

[00:46:50] : [00:46:54]

set of questions that you just did?

[00:46:54] : [00:46:56]

Which is can you give me a list of 10 steps I need to do

[00:46:56] : [00:47:01]

to get from New York to Paris?

[00:47:01] : [00:47:02]

And then for each of those steps,

[00:47:02] : [00:47:05]

can you give me a list of 10 steps

[00:47:05] : [00:47:07]

how I make that step happen?

[00:47:07] : [00:47:09]

And for each of those steps,

[00:47:09] : [00:47:10]

can you give me a list of 10 steps

[00:47:10] : [00:47:12]

to make each one of those,

[00:47:12] : [00:47:13]

until you're moving your individual muscles?

[00:47:13] : [00:47:15]

Maybe not.

[00:47:15] : [00:47:17]

Whatever you can actually act upon

[00:47:17] : [00:47:19]

using your own mind.

[00:47:19] : [00:47:20]

- Right.

[00:47:20] : [00:47:22]

So there's a lot of questions

[00:47:22] : [00:47:23]

that are also implied by this, right?

[00:47:23] : [00:47:24]

So the first thing is LLMs will be able to answer

[00:47:24] : [00:47:27]

some of those questions

[00:47:27] : [00:47:28]

down to some level of abstraction.

[00:47:28] : [00:47:30]

Under the condition that they've been trained

[00:47:30] : [00:47:34]

with similar scenarios in their training set.

[00:47:34] : [00:47:37]

- They would be able to answer all of those questions.

[00:47:37] : [00:47:40]

But some of them may be hallucinated,

[00:47:40] : [00:47:43]

meaning non-factual.

[00:47:43] : [00:47:44]

- Yeah, true.

[00:47:44] : [00:47:45]

I mean they'll probably produce some answer.

[00:47:45] : [00:47:46]

Except they're not gonna be able

[00:47:46] : [00:47:47]

to really kind of produce

[00:47:47] : [00:47:48]

millisecond by millisecond muscle control

[00:47:48] : [00:47:50]

of how you stand up from your chair, right?

[00:47:50] : [00:47:53]

But down to some level of abstraction

[00:47:53] : [00:47:55]

where you can describe things by words,

[00:47:55] : [00:47:57]

they might be able to give you a plan,

[00:47:57] : [00:47:59]

but only under the condition that they've been trained

[00:47:59] : [00:48:01]

to produce those kind of plans, right?

[00:48:01] : [00:48:04]

They're not gonna be able to plan for situations

[00:48:04] : [00:48:06]

they never encountered before.

[00:48:06] : [00:48:09]

They basically are going to have to regurgitate the template

[00:48:09] : [00:48:11]

that they've been trained on.

[00:48:11] : [00:48:12]

- But where, just for the example of New York to Paris,

[00:48:12] : [00:48:15]

is it gonna start getting into trouble?

[00:48:15] : [00:48:18]

Like at which layer of abstraction

[00:48:18] : [00:48:20]

do you think you'll start?

[00:48:20] : [00:48:22]

Because like I can imagine

[00:48:22] : [00:48:23]

almost every single part of that,

[00:48:23] : [00:48:24]

an LLM will be able to answer somewhat accurately,

[00:48:24] : [00:48:27]

especially when you're talking about New York and Paris,

[00:48:27] : [00:48:29]

major cities.

[00:48:29] : [00:48:31]

- So I mean certainly an LLM

[00:48:31] : [00:48:33]

would be able to solve that problem

[00:48:33] : [00:48:34]

if you fine tune it for it.

[00:48:34] : [00:48:36]

- [Lex] Sure.

[00:48:36] : [00:48:37]

- And so I can't say that an LLM cannot do this,

[00:48:37] : [00:48:42]

it can do this if you train it for it,

[00:48:42] : [00:48:44]

there's no question,

[00:48:44] : [00:48:45]

down to a certain level

[00:48:45] : [00:48:47]

where things can be formulated in terms of words.

[00:48:47] : [00:48:51]

But like if you wanna go down

[00:48:51] : [00:48:52]

to like how do you climb down the stairs

[00:48:52] : [00:48:54]

or just stand up from your chair in terms of words,

[00:48:54] : [00:48:57]

like you can't do it.

[00:48:57] : [00:48:59]

That's one of the reasons you need

[00:48:59] : [00:49:04]

experience of the physical world,

[00:49:04] : [00:49:06]

which is much higher bandwidth

[00:49:06] : [00:49:07]

than what you can express in words,

[00:49:07] : [00:49:10]

in human language.

[00:49:10] : [00:49:11]

- So everything we've been talking about

[00:49:11] : [00:49:12]

on the joint embedding space,

[00:49:12] : [00:49:13]

is it possible that that's what we need

[00:49:13] : [00:49:16]

for like the interactionwith physical reality

[00:49:16] : [00:49:18]

on the robotics front?

[00:49:18] : [00:49:20]

- And then just the LLMs are the thing that sits on top of it

[00:49:20] : [00:49:24]

for the bigger reasoning

[00:49:24] : [00:49:26]

about like the fact that I need to book a plane ticket

[00:49:26] : [00:49:30]

and I need to know how to go to the websites and so on.

[00:49:30] : [00:49:33]

- Sure.

[00:49:33] : [00:49:34]

And a lot of plans that people know about

[00:49:34] : [00:49:37]

that are relatively high level are actually learned.

[00:49:37] : [00:49:41]

Most people don't invent the plans by themselves.

[00:49:41] : [00:49:46]

We have some ability to do this, of course, obviously,

[00:49:46] : [00:49:54]

but most plans that people use

[00:49:54] : [00:49:57]

are plans that have been trained on.

[00:49:57] : [00:49:59]

Like they've seen other people use those plans

[00:49:59] : [00:50:01]

or they've been told how to do things, right?

[00:50:01] : [00:50:04]

That you can't invent. Like, take a person

[00:50:04] : [00:50:07]

who's never heard of airplanes

[00:50:07] : [00:50:09]

and tell them like, how do you go from New York to Paris?

[00:50:09] : [00:50:12]

They're probably not going to be able

[00:50:12] : [00:50:14]

to kind of deconstruct the whole plan

[00:50:14] : [00:50:16]

unless they've seen examples of that before.

[00:50:16] : [00:50:18]

So certainly LLMs are gonna be able to do this.

[00:50:18] : [00:50:21]

But then how you link this from the low level of actions,

[00:50:21] : [00:50:26]

that needs to be done with things like JEPA,

[00:50:26] : [00:50:30]

that basically lift the abstraction level

[00:50:30] : [00:50:33]

of the representation

[00:50:33] : [00:50:34]

without attempting to reconstruct

[00:50:34] : [00:50:36]

every detail of the situation.

[00:50:36] : [00:50:38]

That's what we need JEPAs for.

[00:50:38] : [00:50:39]

- I would love to sort of linger on your skepticism

[00:50:39] : [00:50:44]

around autoregressive LLMs.

[00:50:44] : [00:50:48]

So one way I would like to test that skepticism is

[00:50:48] : [00:50:53]

everything you say makes a lot of sense,

[00:50:53] : [00:50:55]

but if I apply everything you said today and in general

[00:50:55] : [00:51:02]

to like, I don't know,

[00:51:02] : [00:51:03]

10 years ago, maybe a little bit less.

[00:51:03] : [00:51:05]

No, let's say three years ago.

[00:51:05] : [00:51:07]

I wouldn't be able to predict the success of LLMs.

[00:51:07] : [00:51:12]

So does it make sense to you

[00:51:12] : [00:51:15]

that autoregressive LLMs are able to be so damn good?

[00:51:15] : [00:51:19]

- [Yann] Yes.

[00:51:19] : [00:51:21]

- Can you explain your intuition?

[00:51:21] : [00:51:24]

Because if I were to take your wisdom and intuition

[00:51:24] : [00:51:29]

at face value,

[00:51:29] : [00:51:30]

I would say there's no way autoregressive LLMs

[00:51:30] : [00:51:32]

one token at a time,

[00:51:32] : [00:51:34]

would be able to do the kind of things they're doing.

[00:51:34] : [00:51:36]

- No, there's one thing that autoregressive LLMs

[00:51:36] : [00:51:39]

or that LLMs in general, not just the autoregressive ones,

[00:51:39] : [00:51:42]

but including the BERT-style bidirectional ones,

[00:51:42] : [00:51:45]

are exploiting, and it's self-supervised learning.

[00:51:45] : [00:51:49]

And I've been a very, very strong advocate

[00:51:49] : [00:51:51]

of self-supervised learning for many years.

[00:51:51] : [00:51:53]

So those things are an incredibly impressive demonstration

[00:51:53] : [00:51:58]

that self-supervised learning actually works.

[00:51:58] : [00:52:01]

The idea that started...

[00:52:01] : [00:52:04]

It didn't start with BERT,

[00:52:04] : [00:52:07]

but it was really kind of a good demonstration with this.

[00:52:07] : [00:52:09]

So the idea that you take a piece of text, you corrupt it,

[00:52:09] : [00:52:14]

and then you train some gigantic neural net

[00:52:14] : [00:52:16]

to reconstruct the parts that are missing.

[00:52:16] : [00:52:18]

That has been an enormous...

[00:52:18] : [00:52:21]

Produced an enormous amount of benefits.

[00:52:21] : [00:52:25]

It allowed us to create systems that understand language,

[00:52:25] : [00:52:30]

systems that can translate

[00:52:30] : [00:52:32]

hundreds of languages in any direction,

[00:52:32] : [00:52:36]

systems that are multilingual.

[00:52:36] : [00:52:38]

It's a single system

[00:52:38] : [00:52:40]

that can be trained to understand hundreds of languages

[00:52:40] : [00:52:43]

and translate in any direction

[00:52:43] : [00:52:44]

and produce summaries

[00:52:44] : [00:52:48]

and then answer questions and produce text.

[00:52:48] : [00:52:51]
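The corrupt-and-reconstruct objective described above can be sketched in a few lines. Only the construction of the training pair is shown; real systems like BERT train a gigantic network to fill in the masked positions. The tokens, mask rate, and function names are all arbitrary illustrations.

```python
# Toy illustration of the corruption objective: take a piece of text,
# hide some tokens, and make the hidden tokens the prediction targets.
import random

def corrupt(tokens, mask_rate=0.3, seed=0):
    rng = random.Random(seed)
    corrupted, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            corrupted.append("[MASK]")
            targets[i] = tok              # the model must reconstruct this
        else:
            corrupted.append(tok)
    return corrupted, targets

tokens = "the cat sat on the mat".split()
corrupted, targets = corrupt(tokens)
# `corrupted` is the input the network sees; `targets` maps each masked
# position to the original token it has to predict.
```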

And then there's a special case of it,

[00:52:51] : [00:52:53]

which is the autoregressive trick

[00:52:53] : [00:52:56]

where you constrain the system

[00:52:56] : [00:52:58]

to not elaborate a representation of the text

[00:52:58] : [00:53:02]

from looking at the entire text,

[00:53:02] : [00:53:03]

but only predicting a word

[00:53:03] : [00:53:06]

from the words that have come before.

[00:53:06] : [00:53:08]

Right?

[00:53:08] : [00:53:09]

And you do this

[00:53:09] : [00:53:09]

by constraining the architecture of the network.

[00:53:09] : [00:53:11]

And that's what you can build an autoregressive LLM from.

[00:53:11] : [00:53:15]
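The autoregressive constraint can be illustrated with a toy counting model: a token is predicted only from the words that came before it, never from what follows. The tiny bigram table below stands in for the "gigantic neural net"; the corpus and names are made up.

```python
# Minimal sketch of left-to-right (autoregressive) prediction using
# bigram counts instead of a neural network.
from collections import Counter, defaultdict

corpus = "the cat sat on the mat the cat ran".split()

# "Training": count which token follows which, strictly left to right.
follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def predict_next(prefix):
    """Predict the next token only from the words that came before."""
    last = prefix[-1]
    return follows[last].most_common(1)[0][0]

prediction = predict_next(["the"])  # "cat" follows "the" most often here
```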

So there was a surprise many years ago

[00:53:15] : [00:53:17]

with what's called decoder-only LLMs.

[00:53:17] : [00:53:20]

So systems of this type

[00:53:20] : [00:53:22]

that are just trying to produce words from the previous ones.

[00:53:22] : [00:53:27]

And the fact that when you scale them up,

[00:53:27] : [00:53:31]

they tend to really kind of understand more about language.

[00:53:31] : [00:53:36]

When you train them on lots of data,

[00:53:36] : [00:53:38]

you make them really big.

[00:53:38] : [00:53:39]

That was kind of a surprise.

[00:53:39] : [00:53:40]

And that surprise occurred quite a while back.

[00:53:40] : [00:53:42]

Like with work from Google, Meta, OpenAI, et cetera,

[00:53:42] : [00:53:47]

going back to the GPT

[00:53:47] : [00:53:53]

kind of generative pre-trained transformers.

[00:53:53] : [00:53:56]

- You mean like GPT-2?

[00:53:56] : [00:53:58]

Like there's a certain place

[00:53:58] : [00:54:00]

where you start to realize

[00:54:00] : [00:54:01]

scaling might actually keep giving us an emergent benefit.

[00:54:01] : [00:54:06]

- Yeah, I mean there was work from various places,

[00:54:06] : [00:54:09]

but if you want to kind of place it in the GPT timeline,

[00:54:09] : [00:54:14]

that would be around GPT-2, yeah.

[00:54:14] : [00:54:18]

- Well, 'cause you said it,

[00:54:18] : [00:54:20]

you're so charismatic and you said so many words,

[00:54:20] : [00:54:23]

but self-supervised learning, yes.

[00:54:23] : [00:54:25]

But again, the same intuition you're applying

[00:54:25] : [00:54:28]

to saying that autoregressive LLMs

[00:54:28] : [00:54:31]

cannot have a deep understanding of the world,

[00:54:31] : [00:54:35]

if we just apply that same intuition,

[00:54:35] : [00:54:38]

does it make sense to you

[00:54:38] : [00:54:39]

that they're able to form enough

[00:54:39] : [00:54:42]

of a representation of the world

[00:54:42] : [00:54:43]

to be damn convincing,

[00:54:43] : [00:54:45]

essentially passing the original Turing test

[00:54:45] : [00:54:49]

with flying colors?

[00:54:49] : [00:54:50]

- Well, we're fooled by their fluency, right?

[00:54:50] : [00:54:53]

We just assume that if a system is fluent

[00:54:53] : [00:54:56]

in manipulating language,

[00:54:56] : [00:54:57]

then it has all the characteristics of human intelligence.

[00:54:57] : [00:55:00]

But that impression is false.

[00:55:00] : [00:55:04]

We're really fooled by it.

[00:55:04] : [00:55:06]

- Well, what do you think Alan Turing would say?

[00:55:06] : [00:55:08]

Without understanding anything,

[00:55:08] : [00:55:10]

just hanging out with it-

[00:55:10] : [00:55:11]

- Alan Turing would decide

[00:55:11] : [00:55:12]

that a Turing test is a really bad test.

[00:55:12] : [00:55:14]

(Lex chuckles)

[00:55:14] : [00:55:15]

Okay.

[00:55:15] : [00:55:16]

This is what the AI community has decided many years ago

[00:55:16] : [00:55:18]

that the Turing test was a really bad test of intelligence.

[00:55:18] : [00:55:22]

- What would Hans Moravec say

[00:55:22] : [00:55:23]

about the large language models?

[00:55:23] : [00:55:25]

- Hans Moravec would say

[00:55:25] : [00:55:26]

Moravec's paradox still applies.

[00:55:26] : [00:55:30]

- [Lex] Okay.

[00:55:30] : [00:55:31]

- Okay?

[00:55:31] : [00:55:31]

Okay, we can pass-

[00:55:31] : [00:55:32]

- You don't think he would be really impressed.

[00:55:32] : [00:55:34]

- No, of course everybody would be impressed.

[00:55:34] : [00:55:35]

(laughs)

[00:55:35] : [00:55:36]

But it is not a question of being impressed or not,

[00:55:36] : [00:55:39]

it is a question of knowing

[00:55:39] : [00:55:41]

what the limit of those systems can do.

[00:55:41] : [00:55:44]

Again, they are impressive.

[00:55:44] : [00:55:45]

They can do a lot of useful things.

[00:55:45] : [00:55:47]

There's a whole industry that is being built around them.

[00:55:47] : [00:55:49]

They're gonna make progress,

[00:55:49] : [00:55:51]

but there are a lot of things they cannot do.

[00:55:51] : [00:55:53]

And we have to realize what they cannot do

[00:55:53] : [00:55:55]

and then figure out how we get there.

[00:55:55] : [00:55:59]

And I'm not saying this...

[00:55:59] : [00:56:02]

I'm saying this from basically 10 years of research

[00:56:02] : [00:56:07]

on the idea of self-supervised learning,

[00:56:07] : [00:56:11]

actually that's going back more than 10 years,

[00:56:11] : [00:56:13]

but the idea of self-supervised learning.

[00:56:13] : [00:56:15]

So basically capturing the internal structure

[00:56:15] : [00:56:17]

of a piece of a set of inputs

[00:56:17] : [00:56:21]

without training the system for any particular task, right?

[00:56:21] : [00:56:23]

Learning representations.

[00:56:23] : [00:56:25]

The conference I co-founded 14 years ago

[00:56:25] : [00:56:28]

is called International Conference

[00:56:28] : [00:56:30]

on Learning Representations,

[00:56:30] : [00:56:31]

that's the entire issue that deep learning is dealing with.

[00:56:31] : [00:56:34]

Right?

[00:56:34] : [00:56:35]

And it's been my obsession for almost 40 years now.

[00:56:35] : [00:56:38]

So learning representations is really the thing.

[00:56:38] : [00:56:42]

For the longest time

[00:56:42] : [00:56:43]

we could only do this with supervised learning.

[00:56:43] : [00:56:45]

And then we started working on

[00:56:45] : [00:56:47]

what we used to call unsupervised learning

[00:56:47] : [00:56:50]

and sort of revived the idea of unsupervised learning

[00:56:50] : [00:56:55]

in the early 2000s with Yoshua Bengio and Jeff Hinton.

[00:56:55] : [00:56:59]

Then discovered that supervised learning

[00:56:59] : [00:57:00]

actually works pretty well

[00:57:00] : [00:57:02]

if you can collect enough data.

[00:57:02] : [00:57:03]

And so the whole idea of unsupervised self-supervision

[00:57:03] : [00:57:07]

took a backseat for a bit

[00:57:07] : [00:57:10]

and then I kind of tried to revive it in a big way,

[00:57:10] : [00:57:14]

starting in 2014 basically when we started FAIR,

[00:57:14] : [00:57:20]

and really pushing for like finding new methods

[00:57:20] : [00:57:24]

to do self-supervised learning,

[00:57:24] : [00:57:26]

both for text and for images and for video and audio.

[00:57:26] : [00:57:29]

And some of that work has been incredibly successful.

[00:57:29] : [00:57:32]

I mean, the reason why we have

[00:57:32] : [00:57:34]

multilingual translation systems,

[00:57:34] : [00:57:37]

things to do,

[00:57:37] : [00:57:38]

content moderation on Meta,for example, on Facebook

[00:57:38] : [00:57:41]

that are multilingual,

[00:57:41] : [00:57:42]

that understand whether a piece of text

[00:57:42] : [00:57:44]

is hate speech or not, or something

[00:57:44] : [00:57:46]

is due to the progress

[00:57:46] : [00:57:47]

using self-supervised learning for NLP,

[00:57:47] : [00:57:50]

combining this with transformer architectures

[00:57:50] : [00:57:52]

and blah blah blah.

[00:57:52] : [00:57:53]

But that's the big success of self-supervised learning.

[00:57:53] : [00:57:55]

We had similar success in speech recognition,

[00:57:55] : [00:57:59]

a system called Wav2Vec,

[00:57:59] : [00:58:00]

which is also a joint embedding architecture by the way,

[00:58:00] : [00:58:02]

trained with contrastive learning.

[00:58:02] : [00:58:03]

And that system also can produce

[00:58:03] : [00:58:07]

speech recognition systems that are multilingual

[00:58:07] : [00:58:10]

with mostly unlabeled data

[00:58:10] : [00:58:13]

and only need a few minutes of labeled data

[00:58:13] : [00:58:15]

to actually do speech recognition.

[00:58:15] : [00:58:16]

That's amazing.

[00:58:16] : [00:58:18]

We have systems now based on those combinations of ideas

[00:58:18] : [00:58:22]

that can do real time translation

[00:58:22] : [00:58:24]

of hundreds of languages into each other,

[00:58:24] : [00:58:26]

speech to speech.

[00:58:26] : [00:58:28]

- Speech to speech,

[00:58:28] : [00:58:29]

even including, which is fascinating,

[00:58:29] : [00:58:31]

languages that don't have written forms-

[00:58:31] : [00:58:33]

- That's right.- They're spoken only.

[00:58:33] : [00:58:35]

- That's right.

[00:58:35] : [00:58:36]

We don't go through text,

[00:58:36] : [00:58:37]

it goes directly from speech to speech

[00:58:37] : [00:58:38]

using an internal representation

[00:58:38] : [00:58:40]

of kinda speech units that are discrete.

[00:58:40] : [00:58:41]

But it's called Textless NLP.

[00:58:41] : [00:58:44]

We used to call it this way.

[00:58:44] : [00:58:45]

But yeah.

[00:58:45] : [00:58:47]

I mean incredible success there.

[00:58:47] : [00:58:49]

And then for 10 years we tried to apply this idea

[00:58:49] : [00:58:53]

to learning representations of images

[00:58:53] : [00:58:55]

by training a system to predict videos,

[00:58:55] : [00:58:57]

learning intuitive physics

[00:58:57] : [00:58:58]

by training a system to predict

[00:58:58] : [00:59:00]

what's gonna happen in the video.

[00:59:00] : [00:59:02]

And tried and tried and failed and failed

[00:59:02] : [00:59:05]

with generative models,

[00:59:05] : [00:59:06]

with models that predict pixels.

[00:59:06] : [00:59:08]

We could not get them to learn

[00:59:08] : [00:59:10]

good representations of images,

[00:59:10] : [00:59:13]

we could not get them to learn good representations of videos.

[00:59:13] : [00:59:16]

And we tried many times,

[00:59:16] : [00:59:17]

we published lots of papers on it.

[00:59:17] : [00:59:19]

They kind of sort of worked, but not really great.

[00:59:19] : [00:59:22]

It started working when

[00:59:22] : [00:59:24]

we abandoned this idea of predicting every pixel

[00:59:24] : [00:59:27]

and basically just doing the joint embedding and predicting

[00:59:27] : [00:59:30]

in representation space.

[00:59:30] : [00:59:32]

That works.

[00:59:32] : [00:59:33]
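The shift Yann describes, from predicting pixels to predicting in representation space, can be sketched schematically. The encoders and predictor below are toy linear maps, not a real JEPA; the point is only that the loss compares embeddings, never pixels.

```python
# Schematic of the joint-embedding predictive idea: encode the observed
# input x and the target y, then predict y's *representation* from x's
# representation. No pixel reconstruction appears anywhere.
def encode(v, w=0.5):
    # Toy encoder: a fixed linear map into "representation space".
    return [w * vi for vi in v]

def predictor(s_x):
    # Toy predictor operating entirely in representation space.
    return [si + 0.1 for si in s_x]

def jepa_loss(x, y):
    s_x, s_y = encode(x), encode(y)        # embed both views
    s_y_hat = predictor(s_x)               # predict in embedding space
    # Squared error between predicted and actual representations.
    return sum((a - b) ** 2 for a, b in zip(s_y_hat, s_y))

loss = jepa_loss(x=[1.0, 2.0], y=[1.2, 2.2])
```

In a trained system the encoder is free to drop unpredictable details of y, which is exactly what pixel-level generative objectives cannot do.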

So there's ample evidence

[00:59:33] : [00:59:36]

that we're not gonna be able to learn good representations

[00:59:36] : [00:59:40]

of the real world

[00:59:40] : [00:59:42]

using generative models.

[00:59:42] : [00:59:43]

So I'm telling people,

[00:59:43] : [00:59:44]

everybody's talking about generative AI.

[00:59:44] : [00:59:46]

If you're really interested in human-level AI,

[00:59:46] : [00:59:48]

abandon the idea of generative AI.

[00:59:48] : [00:59:50]

(Lex laughs)

[00:59:50] : [00:59:51]

- Okay.

[00:59:51] : [00:59:52]

But you really think it's possible

[00:59:52] : [00:59:54]

to get far with joint embedding representation?

[00:59:54] : [00:59:57]

So like there's common sense reasoning

[00:59:57] : [01:00:01]

and then there's high level reasoning.

[01:00:01] : [01:00:05]

Like I feel like those are two...

[01:00:05] : [01:00:08]

The kind of reasoning that LLMs are able to do.

[01:00:08] : [01:00:11]

Okay, let me not use the word reasoning,

[01:00:11] : [01:00:13]

but the kind of stuff that LLMs are able to do

[01:00:13] : [01:00:16]

seems fundamentally different

[01:00:16] : [01:00:17]

than the common sense reasoning we use

[01:00:17] : [01:00:19]

to navigate the world.

[01:00:19] : [01:00:20]

- [Yann] Yeah.

[01:00:20] : [01:00:21]

- It seems like we're gonna need both-

[01:00:21] : [01:00:23]

- Sure.- Would you be able to get,

[01:00:23] : [01:00:25]

with the joint embedding which is a JEPA type of approach,

[01:00:25] : [01:00:27]

looking at video, would you be able to learn,

[01:00:27] : [01:00:30]

let's see,

[01:00:30] : [01:00:33]

well, how to get from New York to Paris,

[01:00:33] : [01:00:35]

or how to understand the state of politics in the world?

[01:00:35] : [01:00:40]

(both laugh)

[01:00:40] : [01:00:43]

Right?

[01:00:43] : [01:00:44]

These are things where various humans

[01:00:44] : [01:00:46]

generate a lot of language and opinions on,

[01:00:46] : [01:00:49]

in the space of language,

[01:00:49] : [01:00:50]

but don't visually represent that

[01:00:50] : [01:00:52]

in any clearly compressible way.

[01:00:52] : [01:00:56]

- Right.

[01:00:56] : [01:00:56]

Well, there's a lot of situations

[01:00:56] : [01:00:58]

that might be difficult

[01:00:58] : [01:01:00]

for a purely language-based system to know.

[01:01:00] : [01:01:04]

Like, okay, you can probably learn from reading texts,

[01:01:04] : [01:01:08]

the entirety of the publicly available text in the world

[01:01:08] : [01:01:11]

that I cannot get from New York to Paris

[01:01:11] : [01:01:13]

by snapping my fingers.

[01:01:13] : [01:01:15]

That's not gonna work, right?

[01:01:15] : [01:01:16]

- [Lex] Yes.

[01:01:16] : [01:01:17]

- But there's probably sort of more complex

[01:01:17] : [01:01:20]

scenarios of this type

[01:01:20] : [01:01:22]

which an LLM may never have encountered

[01:01:22] : [01:01:25]

and may not be able to determine

[01:01:25] : [01:01:27]

whether it's possible or not.

[01:01:27] : [01:01:29]

So that link from the low level to the high level...

[01:01:29] : [01:01:34]

The thing is that the high level that language expresses

[01:01:34] : [01:01:38]

is based on the common experience of the low level,

[01:01:38] : [01:01:43]

which LLMs currently do not have.

[01:01:43] : [01:01:45]

When we talk to each other,

[01:01:45] : [01:01:47]

we know we have a common experience of the world.

[01:01:47] : [01:01:50]

Like a lot of it is similar.

[01:01:50] : [01:01:54]

And LLMs don't have that.

[01:01:54] : [01:01:59]

- But see, there it's present.

[01:01:59] : [01:02:01]

You and I have a common experience of the world

[01:02:01] : [01:02:02]

in terms of the physics of how gravity works

[01:02:02] : [01:02:05]

and stuff like this.

[01:02:05] : [01:02:06]

And that common knowledge of the world,

[01:02:06] : [01:02:11]

I feel like is there in the language.

[01:02:11] : [01:02:15]

We don't explicitly express it,

[01:02:15] : [01:02:17]

but if you have a huge amount of text,

[01:02:17] : [01:02:21]

you're going to get this stuff that's between the lines.

[01:02:21] : [01:02:24]

In order to form a consistent world model,

[01:02:24] : [01:02:28]

you're going to have to understand how gravity works,

[01:02:28] : [01:02:31]

even if you don't have an explicit explanation of gravity.

[01:02:31] : [01:02:34]

So even though, in the case of gravity,

[01:02:34] : [01:02:37]

there is an explicit explanation.

[01:02:37] : [01:02:38]

There's gravity in Wikipedia.

[01:02:38] : [01:02:40]

But like the stuff that we think of

[01:02:40] : [01:02:44]

as common sense reasoning,

[01:02:44] : [01:02:46]

I feel like to generate language correctly,

[01:02:46] : [01:02:49]

you're going to have to figure that out.

[01:02:49] : [01:02:51]

Now, you could say as you have,

[01:02:51] : [01:02:53]

there's not enough text-- Well, I agree.

[01:02:53] : [01:02:54]

- Sorry.

[01:02:54] : [01:02:55]

Okay, yeah.

[01:02:55] : [01:02:56]

(laughs)

[01:02:56] : [01:02:57]

You don't think so?

[01:02:57] : [01:02:58]

- No, I agree with what you just said,

[01:02:58] : [01:02:59]

which is that to be able to do high level common sense...

[01:02:59] : [01:03:03]

To have high level common sense,

[01:03:03] : [01:03:04]

you need to have the low level common sense

[01:03:04] : [01:03:06]

to build on top of.

[01:03:06] : [01:03:08]

- [Lex] Yeah.

[01:03:08] : [01:03:09]

But that's not there.

[01:03:09] : [01:03:10]

- That's not there in LLMs.

[01:03:10] : [01:03:11]

LLMs are purely trained from text.

[01:03:11] : [01:03:13]

So then the other statement you made,

[01:03:13] : [01:03:15]

I would not agree

[01:03:15] : [01:03:16]

with the fact that implicit in all languages in the world

[01:03:16] : [01:03:20]

is the underlying reality.

[01:03:20] : [01:03:22]

There's a lot about underlying reality

[01:03:22] : [01:03:24]

which is not expressed in language.

[01:03:24] : [01:03:26]

- Is that obvious to you?

[01:03:26] : [01:03:27]

- Yeah, totally.

[01:03:27] : [01:03:29]

- So like all the conversations we have...

[01:03:29] : [01:03:34]

Okay, there's the dark web,

[01:03:34] : [01:03:36]

meaning whatever,

[01:03:36] : [01:03:37]

the private conversations like DMs and stuff like this,

[01:03:37] : [01:03:41]

which is much, much larger probably than what's available,

[01:03:41] : [01:03:45]

what LLMs are trained on.

[01:03:45] : [01:03:46]

- You don't need to communicate

[01:03:46] : [01:03:48]

the stuff that is common.

[01:03:48] : [01:03:50]

- But the humor, all of it.

[01:03:50] : [01:03:51]

No, you do.

[01:03:51] : [01:03:52]

You don't need to, but it comes through.

[01:03:52] : [01:03:54]

Like if I accidentally knock this over,

[01:03:54] : [01:03:58]

you'll probably make fun of me.

[01:03:58] : [01:03:59]

And in the content of you making fun of me

[01:03:59] : [01:04:02]

will be an explanation of the fact that cups fall

[01:04:02] : [01:04:07]

and then gravity works in this way.

[01:04:07] : [01:04:09]

And then you'll have some very vague information

[01:04:09] : [01:04:12]

about what kind of things explode when they hit the ground.

[01:04:12] : [01:04:16]

And then maybe you'll make a joke about entropy

[01:04:16] : [01:04:19]

or something like this

[01:04:19] : [01:04:20]

and we will never be able to reconstruct this again.

[01:04:20] : [01:04:22]

Like, okay, you'll make a little joke like this

[01:04:22] : [01:04:24]

and there'll be a trillion other jokes.

[01:04:24] : [01:04:27]

And from the jokes,

[01:04:27] : [01:04:28]

you can piece together the fact that gravity works

[01:04:28] : [01:04:30]

and mugs can break and all this kind of stuff,

[01:04:30] : [01:04:32]

you don't need to see...

[01:04:32] : [01:04:34]

It'll be very inefficient.

[01:04:34] : [01:04:36]

It's easier for like

[01:04:36] : [01:04:38]

to not knock the thing over. (laughing)

[01:04:38] : [01:04:41]

- [Yann] Yeah.

[01:04:41] : [01:04:42]

- But I feel like it would be there

[01:04:42] : [01:04:44]

if you have enough of that data.

[01:04:44] : [01:04:46]

I just think that most of the information of this type

[01:04:46] : [01:04:50]

that we have accumulated when we were babies

[01:04:50] : [01:04:53]

is just not present in text,

[01:04:53] : [01:04:58]

in any description, essentially.

[01:04:58] : [01:04:59]

And the sensory data is a much richer source

[01:04:59] : [01:05:02]

for getting that kind of understanding.

[01:05:02] : [01:05:04]

I mean, that's the 16,000 hours

[01:05:04] : [01:05:06]

of wake time of a 4-year-old.

[01:05:06] : [01:05:09]

And ten to the 15 bytes, going through vision.

[01:05:09] : [01:05:12]

Just vision, right?

[01:05:12] : [01:05:13]
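LeCun's back-of-envelope figure can be sanity-checked with a short script; the ~20 MB/s optic-nerve bandwidth below is an assumption added for illustration, not a number from the conversation:

```python
# Rough check of the "10^15 bytes through vision" claim.
# Assumption (not stated here): vision delivers on the order of 20 MB/s;
# 16,000 hours of wake time is the figure quoted in the conversation.
hours_awake = 16_000
seconds = hours_awake * 3600          # about 5.76e7 seconds
bytes_per_second = 2e7                # assumed ~20 MB/s through the optic nerve
total_bytes = seconds * bytes_per_second
print(f"{total_bytes:.1e}")           # lands on the order of 10^15
```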

There is a similar bandwidth of touch

[01:05:13] : [01:05:17]

and a little less through audio.

[01:05:17] : [01:05:20]

And then text doesn't...

[01:05:20] : [01:05:21]

Language doesn't come in until like a year in life.

[01:05:21] : [01:05:26]

And by the time you are nine years old,

[01:05:26] : [01:05:28]

you've learned about gravity,

[01:05:28] : [01:05:30]

you know about inertia,

[01:05:30] : [01:05:31]

you know about gravity,

[01:05:31] : [01:05:32]

you know there's stability,

[01:05:32] : [01:05:33]

you know about the distinction

[01:05:33] : [01:05:36]

between animate and inanimate objects.

[01:05:36] : [01:05:38]

By 18 months,

[01:05:38] : [01:05:39]

you know about like why people want to do things

[01:05:39] : [01:05:42]

and you help them if they can't.

[01:05:42] : [01:05:45]

I mean there's a lot of things that you learn

[01:05:45] : [01:05:47]

mostly by observation,

[01:05:47] : [01:05:49]

really not even through interaction.

[01:05:49] : [01:05:52]

In the first few months of life,

[01:05:52] : [01:05:53]

babies don't really have any influence on the world.

[01:05:53] : [01:05:55]

They can only observe, right?

[01:05:55] : [01:05:58]

And you accumulate like a gigantic amount of knowledge

[01:05:58] : [01:06:02]

just from that.

[01:06:02] : [01:06:03]

So that's what we're missing from current AI systems.

[01:06:03] : [01:06:06]

- I think in one of your slides you have this nice plot

[01:06:06] : [01:06:10]

that is one of the ways you show that LLMs are limited.

[01:06:10] : [01:06:13]

I wonder if you could talk about hallucinations

[01:06:13] : [01:06:16]

from your perspective.

[01:06:16] : [01:06:17]

Why hallucinations happen from large language models,

[01:06:17] : [01:06:23]

and to what degree is that a fundamental flaw

[01:06:23] : [01:06:27]

of large language models.

[01:06:27] : [01:06:29]

- Right.

[01:06:29] : [01:06:30]

So because of the autoregressive prediction,

[01:06:30] : [01:06:34]

every time an LLM produces a token or a word,

[01:06:34] : [01:06:37]

there is some level of probability for that word

[01:06:37] : [01:06:40]

to take you out of the set of reasonable answers.

[01:06:40] : [01:06:44]

And if you assume,

[01:06:44] : [01:06:46]

which is a very strong assumption,

[01:06:46] : [01:06:48]

that the probability of such error

[01:06:48] : [01:06:50]

is that those errors are independent

[01:06:50] : [01:06:55]

across a sequence of tokens being produced.

[01:06:55] : [01:06:59]

What that means is that every time you produce a token,

[01:06:59] : [01:07:02]

the probability

[01:07:02] : [01:07:03]

that you stay within the set of correct answers decreases

[01:07:03] : [01:07:06]

and it decreases exponentially.

[01:07:06] : [01:07:08]

- So there's a strong, like you said, assumption there

[01:07:08] : [01:07:11]

that if there's a non-zero probability of making a mistake,

[01:07:11] : [01:07:14]

which there appears to be,

[01:07:14] : [01:07:16]

then there's going to be a kind of drift.

[01:07:16] : [01:07:18]

- Yeah.

[01:07:18] : [01:07:19]

And that drift is exponential.

[01:07:19] : [01:07:21]

It's like errors accumulate, right?

[01:07:21] : [01:07:23]

So the probability that an answer would be nonsensical

[01:07:23] : [01:07:27]

increases exponentially with the number of tokens.

[01:07:27] : [01:07:31]
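Under the independence assumption stated above, the argument is easy to simulate; the per-token error rate of 1% is an illustrative number, not one from the conversation:

```python
# If each token independently stays "reasonable" with probability (1 - eps),
# the chance the whole answer stays reasonable is (1 - eps) ** n_tokens,
# which decays exponentially with answer length.
def p_stays_correct(eps: float, n_tokens: int) -> float:
    return (1.0 - eps) ** n_tokens

# With eps = 0.01, a 10-token answer is likely fine,
# a 1000-token answer almost certainly drifts off.
for n in (10, 100, 1000):
    print(n, p_stays_correct(0.01, n))
```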

- Is that obvious to you by the way?

[01:07:31] : [01:07:33]

Well, so mathematically speaking maybe,

[01:07:33] : [01:07:36]

but like isn't there a kind of gravitational pull

[01:07:36] : [01:07:39]

towards the truth?

[01:07:39] : [01:07:41]

Because on average, hopefully,

[01:07:41] : [01:07:44]

the truth is well represented in the training set.

[01:07:44] : [01:07:48]

- No, it's basically a struggle

[01:07:48] : [01:07:50]

against the curse of dimensionality.

[01:07:50] : [01:07:55]

So the way you can correct for this

[01:07:55] : [01:07:57]

is that you fine tune the system

[01:07:57] : [01:07:58]

by having it produce answers

[01:07:58] : [01:08:01]

for all kinds of questions that people might come up with.

[01:08:01] : [01:08:04]

And people are people,

[01:08:04] : [01:08:06]

so a lot of the questions that they have

[01:08:06] : [01:08:08]

are very similar to each other.

[01:08:08] : [01:08:10]

So you can probably cover,

[01:08:10] : [01:08:11]

you know, 80% or whatever of questions that people will ask

[01:08:11] : [01:08:16]

by collecting data.

[01:08:16] : [01:08:20]

And then you fine tune the system

[01:08:20] : [01:08:23]

to produce good answers for all of those things.

[01:08:23] : [01:08:25]

And it's probably gonna be able to learn that

[01:08:25] : [01:08:27]

because it's got a lot of capacity to learn.

[01:08:27] : [01:08:31]

But then there is the enormous set of prompts

[01:08:31] : [01:08:36]

that you have not covered during training.

[01:08:36] : [01:08:39]

And that set is enormous.

[01:08:39] : [01:08:41]

Like within the set of all possible prompts,

[01:08:41] : [01:08:43]

the proportion of prompts that have been used for training

[01:08:43] : [01:08:47]

is absolutely tiny.

[01:08:47] : [01:08:48]

It's a tiny, tiny, tiny subset of all possible prompts.

[01:08:48] : [01:08:53]

And so the system will behave properly

[01:08:53] : [01:08:56]

on the prompts that it's been either trained,

[01:08:56] : [01:08:58]

pre-trained or fine tuned.

[01:08:58] : [01:08:59]

But then there is an entire space of things

[01:08:59] : [01:09:04]

that it cannot possibly have been trained on

[01:09:04] : [01:09:06]

because the number is just gigantic.

[01:09:06] : [01:09:09]

So whatever training the system

[01:09:09] : [01:09:13]

has been subjected to, to produce appropriate answers,

[01:09:13] : [01:09:18]

you can break it by finding out a prompt

[01:09:18] : [01:09:20]

that will be outside of the set of prompts

[01:09:20] : [01:09:24]

it's been trained on

[01:09:24] : [01:09:25]

or things that are similar,

[01:09:25] : [01:09:27]

and then it will just spew complete nonsense.

[01:09:27] : [01:09:29]

- When you say prompt,

[01:09:29] : [01:09:31]

do you mean that exact prompt

[01:09:31] : [01:09:33]

or do you mean a prompt that's like,

[01:09:33] : [01:09:36]

in many parts very different than...

[01:09:36] : [01:09:38]

Is it that easy to ask a question

[01:09:38] : [01:09:42]

or to say a thing that hasn't been said before

[01:09:42] : [01:09:45]

on the internet?

[01:09:45] : [01:09:46]

- I mean, people have come up with things

[01:09:46] : [01:09:48]

where like you put essentially

[01:09:48] : [01:09:51]

a random sequence of characters in a prompt

[01:09:51] : [01:09:53]

and that's enough to kind of throw the system into a mode

[01:09:53] : [01:09:57]

where it's gonna answer something completely different

[01:09:57] : [01:10:00]

than it would have answered without this.

[01:10:00] : [01:10:03]

So that's a way to jailbreak the system, basically.

[01:10:03] : [01:10:05]

Go outside of its conditioning, right?

[01:10:05] : [01:10:09]

- So that's a very clear demonstration of it.

[01:10:09] : [01:10:11]

But of course, that goes outside

[01:10:11] : [01:10:16]

of what it's designed to do, right?

[01:10:16] : [01:10:19]

If you actually stitch together

[01:10:19] : [01:10:20]

reasonably grammatical sentences,

[01:10:20] : [01:10:22]

is it that easy to break it?

[01:10:22] : [01:10:26]

- Yeah.

[01:10:26] : [01:10:27]

Some people have done things like

[01:10:27] : [01:10:29]

you write a sentence in English

[01:10:29] : [01:10:31]

or you ask a question in English

[01:10:31] : [01:10:33]

and it produces a perfectly fine answer.

[01:10:33] : [01:10:36]

And then you just substitute a few words

[01:10:36] : [01:10:38]

by the same word in another language,

[01:10:38] : [01:10:42]

and all of a sudden the answer is complete nonsense.

[01:10:42] : [01:10:44]

- Yeah.

[01:10:44] : [01:10:45]

So I guess what I'm saying is like,

[01:10:45] : [01:10:46]

which fraction of prompts that humans are likely to generate

[01:10:46] : [01:10:51]

are going to break the system?

[01:10:51] : [01:10:54]

- So the problem is that there is a long tail.

[01:10:54] : [01:10:57]

- [Lex] Yes.

[01:10:57] : [01:10:58]

- This is an issue that a lot of people have realized

[01:10:58] : [01:11:02]

in social networks and stuff like that,

[01:11:02] : [01:11:04]

which is there's a very, very long tail

[01:11:04] : [01:11:05]

of things that people will ask.

[01:11:05] : [01:11:07]

And you can fine tune the system

[01:11:07] : [01:11:09]

for the 80% or whatever

[01:11:09] : [01:11:12]

of the things that most people will ask.

[01:11:12] : [01:11:16]

And then this long tail is so large

[01:11:16] : [01:11:18]

that you're not gonna be able to fine tune the system

[01:11:18] : [01:11:20]

for all the conditions.

[01:11:20] : [01:11:21]

And in the end,

[01:11:21] : [01:11:22]

the system ends up being

[01:11:22] : [01:11:23]

kind of a giant lookup table, right? (laughing)

[01:11:23] : [01:11:25]

Essentially.

[01:11:25] : [01:11:26]

Which is not really what you want.

[01:11:26] : [01:11:27]

You want systems that can reason,

[01:11:27] : [01:11:29]

certainly that can plan.

[01:11:29] : [01:11:30]

So the type of reasoning that takes place in LLMs

[01:11:30] : [01:11:33]

is very, very primitive.

[01:11:33] : [01:11:35]

And the reason you can tell it's primitive

[01:11:35] : [01:11:37]

is because the amount of computation

[01:11:37] : [01:11:39]

that is spent per token produced is constant.

[01:11:39] : [01:11:43]

So if you ask a question

[01:11:43] : [01:11:45]

and that question has an answer in a given number of tokens,

[01:11:45] : [01:11:50]

the amount of computation devoted to computing that answer

[01:11:50] : [01:11:52]

can be exactly estimated.

[01:11:52] : [01:11:54]

It's the size of the prediction network

[01:11:54] : [01:11:59]

with its 36 layers or 92 layers or whatever it is,

[01:11:59] : [01:12:03]

multiplied by number of tokens.

[01:12:03] : [01:12:05]

That's it.

[01:12:05] : [01:12:06]

And so essentially,

[01:12:06] : [01:12:08]

it doesn't matter if the question being asked

[01:12:08] : [01:12:10]

is simple to answer, complicated to answer,

[01:12:10] : [01:12:16]

impossible to answer

[01:12:16] : [01:12:17]

because it's undecidable, well, there's something.

[01:12:17] : [01:12:20]

The amount of computation

[01:12:20] : [01:12:22]

the system will be able to devote to the answer is constant

[01:12:22] : [01:12:25]

or is proportional to the number of tokens produced

[01:12:25] : [01:12:27]

in the answer, right?

[01:12:27] : [01:12:29]

This is not the way we work,

[01:12:29] : [01:12:30]

the way we reason is that

[01:12:30] : [01:12:33]

when we are faced with a complex problem

[01:12:33] : [01:12:37]

or a complex question,

[01:12:37] : [01:12:38]

we spend more time trying to solve it and answer it, right?

[01:12:38] : [01:12:42]

Because it's more difficult.

[01:12:42] : [01:12:43]

- There's a prediction element,

[01:12:43] : [01:12:45]

there's an iterative element

[01:12:45] : [01:12:47]

where you're like adjusting your understanding of a thing

[01:12:47] : [01:12:52]

by going over and over and over.

[01:12:52] : [01:12:54]

There's a hierarchical element and so on.

[01:12:54] : [01:12:56]

Does this mean it's a fundamental flaw of LLMs-

[01:12:56] : [01:12:59]

- [Yann] Yeah.

[01:12:59] : [01:13:00]

- Or does it mean that... (laughs)

[01:13:00] : [01:13:01]

There's more part to that question?

[01:13:01] : [01:13:03]

(laughs)

[01:13:03] : [01:13:04]

Now you're just behaving like an LLM.

[01:13:04] : [01:13:06]

(laughs)

[01:13:06] : [01:13:07]

Immediately answering.

[01:13:07] : [01:13:08]

No, that it's just the low level world model

[01:13:08] : [01:13:13]

on top of which we can then build

[01:13:13] : [01:13:17]

some of these kinds of mechanisms,

[01:13:17] : [01:13:18]

like you said, persistentlong-term memory or reasoning,

[01:13:18] : [01:13:23]

so on.

[01:13:23] : [01:13:25]

But we need that world model that comes from language.

[01:13:25] : [01:13:29]

Maybe it is not so difficult

[01:13:29] : [01:13:30]

to build this kind of reasoning system

[01:13:30] : [01:13:33]

on top of a well constructed world model.

[01:13:33] : [01:13:36]

- Okay.

[01:13:36] : [01:13:37]

Whether it's difficult or not,

[01:13:37] : [01:13:38]

the near future will say,

[01:13:38] : [01:13:40]

because a lot of people are working on reasoning

[01:13:40] : [01:13:43]

and planning abilities for dialogue systems.

[01:13:43] : [01:13:46]

I mean, even if we restrict ourselves to language,

[01:13:46] : [01:13:50]

just having the ability

[01:13:50] : [01:13:53]

to plan your answer before you answer,

[01:13:53] : [01:13:55]

in terms that are not necessarily linked

[01:13:55] : [01:13:59]

with the language you're gonna use to produce the answer.

[01:13:59] : [01:14:02]

Right?

[01:14:02] : [01:14:02]

So this idea of this mental model

[01:14:02] : [01:14:04]

that allows you to plan what you're gonna say

[01:14:04] : [01:14:06]

before you say it.

[01:14:06] : [01:14:06]

That is very important.

[01:14:06] : [01:14:11]

I think there's going to be a lot of systems

[01:14:11] : [01:14:13]

over the next few years

[01:14:13] : [01:14:14]

that are going to have this capability,

[01:14:14] : [01:14:17]

but the blueprint of those systems

[01:14:17] : [01:14:19]

will be extremely different from autoregressive LLMs.

[01:14:19] : [01:14:23]

So it's the same difference

[01:14:23] : [01:14:27]

as the difference between

[01:14:27] : [01:14:29]

what psychology has called system one and system two

[01:14:29] : [01:14:31]

in humans, right?

[01:14:31] : [01:14:32]

So system one is the type of task that you can accomplish

[01:14:32] : [01:14:35]

without like deliberately, consciously thinking about

[01:14:35] : [01:14:39]

how you do them.

[01:14:39] : [01:14:40]

You just do them.

[01:14:40] : [01:14:42]

You've done them enough

[01:14:42] : [01:14:43]

that you can just do it subconsciously, right?

[01:14:43] : [01:14:45]

Without thinking about them.

[01:14:45] : [01:14:46]

If you're an experienced driver,

[01:14:46] : [01:14:48]

you can drive without really thinking about it

[01:14:48] : [01:14:51]

and you can talk to someone at the same time

[01:14:51] : [01:14:52]

or listen to the radio, right?

[01:14:52] : [01:14:54]

If you are a very experienced chess player,

[01:14:54] : [01:14:58]

you can play against a non-experienced chess player

[01:14:58] : [01:15:01]

without really thinking either,

[01:15:01] : [01:15:02]

you just recognize the pattern and you play, right?

[01:15:02] : [01:15:05]

That's system one.

[01:15:05] : [01:15:06]

So all the things that you do instinctively

[01:15:06] : [01:15:09]

without really having to deliberately plan

[01:15:09] : [01:15:12]

and think about it.

[01:15:12] : [01:15:13]

And then there are other tasks where you need to plan.

[01:15:13] : [01:15:15]

So if you are a not too experienced chess player

[01:15:15] : [01:15:19]

or you are experienced

[01:15:19] : [01:15:20]

but you play against another experienced chess player,

[01:15:20] : [01:15:22]

you think about all kinds of options, right?

[01:15:22] : [01:15:24]

You think about it for a while, right?

[01:15:24] : [01:15:27]

And you're much better if you have time to think about it

[01:15:27] : [01:15:30]

than you are if you play blitz with limited time.

[01:15:30] : [01:15:35]

And so this type of deliberate planning,

[01:15:35] : [01:15:39]

which uses your internal world model, that's system two,

[01:15:39] : [01:15:43]

this is what LLMs currently cannot do.

[01:15:43] : [01:15:46]

How do we get them to do this, right?

[01:15:46] : [01:15:48]

How do we build a system

[01:15:48] : [01:15:50]

that can do this kind of planning or reasoning

[01:15:50] : [01:15:55]

that devotes more resources

[01:15:55] : [01:15:57]

to complex problems than to simple problems?

[01:15:57] : [01:16:00]

And it's not going to be

[01:16:00] : [01:16:01]

autoregressive prediction of tokens,

[01:16:01] : [01:16:03]

it's going to be something more akin to inference

[01:16:03] : [01:16:08]

of latent variables

[01:16:08] : [01:16:09]

in what used to be called probabilistic models

[01:16:09] : [01:16:14]

or graphical models and things of that type.

[01:16:14] : [01:16:17]

So basically the principle is like this.

[01:16:17] : [01:16:19]

The prompt is like observed variables.

[01:16:19] : [01:16:24]

And what the model does

[01:16:24] : [01:16:29]

is that it's basically a measure of...

[01:16:29] : [01:16:33]

It can measure to what extent an answer

[01:16:33] : [01:16:36]

is a good answer for a prompt.

[01:16:36] : [01:16:37]

Okay?

[01:16:37] : [01:16:38]

So think of it as some gigantic neural net,

[01:16:38] : [01:16:41]

but it's got only one output.

[01:16:41] : [01:16:42]

And that output is a scalar number,

[01:16:42] : [01:16:45]

which is let's say zero

[01:16:45] : [01:16:47]

if the answer is a good answer for the question,

[01:16:47] : [01:16:49]

and a large number

[01:16:49] : [01:16:51]

if the answer is not a good answer for the question.

[01:16:51] : [01:16:53]

Imagine you had this model.

[01:16:53] : [01:16:55]

If you had such a model,

[01:16:55] : [01:16:56]

you could use it to produce good answers.

[01:16:56] : [01:16:58]

The way you would do it is produce the prompt

[01:16:58] : [01:17:02]

and then search through the space of possible answers

[01:17:02] : [01:17:05]

for one that minimizes that number.

[01:17:05] : [01:17:07]

That's called an energy based model.

[01:17:07] : [01:17:11]

- But that energy based model

[01:17:11] : [01:17:14]

would need the modelconstructed by the LLM.

[01:17:14] : [01:17:18]

- Well, so really what you need to do

[01:17:18] : [01:17:20]

would be to not search over possible strings of text

[01:17:20] : [01:17:24]

that minimize that energy.

[01:17:24] : [01:17:27]

But what you would do

[01:17:27] : [01:17:28]

is do this in abstract representation space.

[01:17:28] : [01:17:31]

So in sort of the space of abstract thoughts,

[01:17:31] : [01:17:34]

you would elaborate a thought, right?

[01:17:34] : [01:17:37]

Using this process of minimizing the output of your model.

[01:17:37] : [01:17:42]

Okay?

[01:17:42] : [01:17:42]

Which is just a scalar.

[01:17:42] : [01:17:44]

It's an optimization process, right?

[01:17:44] : [01:17:46]

So now the way the system produces its answer

[01:17:46] : [01:17:48]

is through optimization

[01:17:48] : [01:17:50]

by minimizing an objective function basically, right?

[01:17:50] : [01:17:56]

And this is, we're talking about inference,

[01:17:56] : [01:17:57]

we're not talking about training, right?

[01:17:57] : [01:17:59]

The system has been trained already.

[01:17:59] : [01:18:01]

So now we have an abstract representation

[01:18:01] : [01:18:03]

of the thought of the answer,

[01:18:03] : [01:18:04]

representation of the answer.

[01:18:04] : [01:18:06]

We feed that to basically an autoregressive decoder,

[01:18:06] : [01:18:10]

which can be very simple,

[01:18:10] : [01:18:11]

that turns this into a text that expresses this thought.

[01:18:11] : [01:18:15]

Okay?

[01:18:15] : [01:18:16]

So that in my opinion

[01:18:16] : [01:18:17]

is the blueprint of future dialogue systems.

[01:18:17] : [01:18:21]

They will think about their answer,

[01:18:21] : [01:18:23]

plan their answer by optimization

[01:18:23] : [01:18:25]

before turning it into text.

[01:18:25] : [01:18:27]

And that is Turing complete.

[01:18:27] : [01:18:31]

- Can you explain exactly

[01:18:31] : [01:18:32]

what the optimization problem there is?

[01:18:32] : [01:18:34]

Like what's the objective function?

[01:18:34] : [01:18:37]

Just linger on it.

[01:18:37] : [01:18:38]

You kind of briefly described it,

[01:18:38] : [01:18:40]

but over what space are you optimizing?

[01:18:40] : [01:18:43]

- The space of representations-

[01:18:43] : [01:18:45]

- So, abstract representation.

[01:18:45] : [01:18:47]

- That's right.

[01:18:47] : [01:18:47]

So you have an abstract representation inside the system.

[01:18:47] : [01:18:51]

You have a prompt.

[01:18:51] : [01:18:52]

The prompt goes through an encoder,

[01:18:52] : [01:18:53]

produces a representation,

[01:18:53] : [01:18:55]

perhaps goes through a predictor

[01:18:55] : [01:18:56]

that predicts a representation of the answer,

[01:18:56] : [01:18:58]

of the proper answer.

[01:18:58] : [01:18:59]

But that representation may not be a good answer

[01:18:59] : [01:19:03]

because there might be some complicated reasoning

[01:19:03] : [01:19:06]

you need to do, right?

[01:19:06] : [01:19:07]

So then you have another process

[01:19:07] : [01:19:11]

that takes the representation of the answer and modifies it

[01:19:11] : [01:19:15]

so as to minimize a cost function

[01:19:15] : [01:19:20]

that measures to what extent

[01:19:20] : [01:19:21]

the answer is a good answer for the question.

[01:19:21] : [01:19:22]

Now we sort of ignore the fact for...

[01:19:22] : [01:19:27]

I mean, the issue for a moment

[01:19:27] : [01:19:29]

of how you train that system

[01:19:29] : [01:19:30]

to measure whether an answer is a good answer for sure.

[01:19:30] : [01:19:35]

- But suppose such a system could be created,

[01:19:35] : [01:19:38]

what's the process?

[01:19:38] : [01:19:40]

This kind of search-like process.

[01:19:40] : [01:19:42]

- It's an optimization process.

[01:19:42] : [01:19:44]

You can do this if the entire system is differentiable,

[01:19:44] : [01:19:47]

that scalar output

[01:19:47] : [01:19:49]

is the result of running through some neural net,

[01:19:49] : [01:19:52]

running the answer,

[01:19:52] : [01:19:54]

the representation of the answer through some neural net.

[01:19:54] : [01:19:56]

Then by gradient descent,

[01:19:56] : [01:19:58]

by back propagating gradients,

[01:19:58] : [01:20:00]

you can figure out

[01:20:00] : [01:20:01]

like how to modify the representation of the answers

[01:20:01] : [01:20:03]

so as to minimize that.

[01:20:03] : [01:20:05]
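A toy version of this gradient-based inference can be sketched; the quadratic energy, scalar representations, and step size are illustrative assumptions, not LeCun's actual model:

```python
# Toy energy-based inference: fix a prompt representation x, then run
# gradient descent on the answer representation z to minimize E(x, z).
# Here E is a stand-in quadratic whose minimum-energy answer is z = 2 * x.
def energy(x: float, z: float) -> float:
    return (z - 2.0 * x) ** 2

def infer(x: float, steps: int = 200, lr: float = 0.1) -> float:
    z = 0.0                          # initial guess for the answer representation
    for _ in range(steps):
        grad = 2.0 * (z - 2.0 * x)   # dE/dz, written out analytically here
        z -= lr * grad               # gradient step in representation space
    return z

z_star = infer(1.5)
assert abs(z_star - 3.0) < 1e-6      # descent finds the minimum-energy answer
```

In a real system, z would be a high-dimensional vector and the gradient would come from backpropagation through the energy network; the mechanics are the same.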

- So that's still gradient based.

[01:20:05] : [01:20:06]

- It's gradient based inference.

[01:20:06] : [01:20:08]

So now you have a representation of the answer

[01:20:08] : [01:20:10]

in abstract space.

[01:20:10] : [01:20:12]

Now you can turn it into text, right?

[01:20:12] : [01:20:14]

And the cool thing about this

[01:20:14] : [01:20:17]

is that the representation now

[01:20:17] : [01:20:20]

can be optimized through gradient descent,

[01:20:20] : [01:20:22]

but also is independent of the language

[01:20:22] : [01:20:24]

in which you're goingto express the answer.

[01:20:24] : [01:20:27]

- Right.

[01:20:27] : [01:20:28]

So you're operating in the abstract space of representation.

[01:20:28] : [01:20:30]

I mean this goes back to the joint embedding.

[01:20:30] : [01:20:32]

- [Yann] Right.

[01:20:32] : [01:20:33]

- That it's better to work in the space of...

[01:20:33] : [01:20:36]

I don't know.

[01:20:36] : [01:20:37]

Or to romanticize the notion

[01:20:37] : [01:20:39]

like space of concepts

[01:20:39] : [01:20:40]

versus the space of concrete sensory information.

[01:20:40] : [01:20:45]

- Right.

[01:20:45] : [01:20:47]

- Okay.

[01:20:47] : [01:20:48]

But can this do something like reasoning,

[01:20:48] : [01:20:50]

which is what we're talking about?

[01:20:50] : [01:20:51]

- Well, not really,

[01:20:51] : [01:20:53]

only in a very simple way.

[01:20:53] : [01:20:54]

I mean basically you can think of those things as doing

[01:20:54] : [01:20:57]

the kind of optimization I was talking about,

[01:20:57] : [01:20:59]

except they're optimizing in the discrete space

[01:20:59] : [01:21:01]

which is the space of possible sequences of tokens.

[01:21:01] : [01:21:05]

And they do this optimization in a horribly inefficient way,

[01:21:05] : [01:21:09]

which is generate a lot of hypotheses

[01:21:09] : [01:21:11]

and then select the best ones.

[01:21:11] : [01:21:13]

And that's incredibly wasteful

[01:21:13] : [01:21:16]

in terms of computation,

[01:21:16] : [01:21:18]

'cause you basically have to run your LLM

[01:21:18] : [01:21:20]

for like every possible generated sequence.

[01:21:20] : [01:21:24]

And it's incredibly wasteful.

[01:21:24] : [01:21:27]

So it's much better to do an optimization

[01:21:27] : [01:21:31]

in continuous space

[01:21:31] : [01:21:33]

where you can do gradient descent

[01:21:33] : [01:21:34]

as opposed to like generate tons of things

[01:21:34] : [01:21:36]

and then select the best,

[01:21:36] : [01:21:38]

you just iteratively refine your answer

[01:21:38] : [01:21:41]

to go towards the best, right?

[01:21:41] : [01:21:42]

That's much more efficient.

[01:21:42] : [01:21:44]

But you can only do this in continuous spaces

[01:21:44] : [01:21:46]

with differentiable functions.

[01:21:46] : [01:21:48]

- You're talking about the reasoning,

[01:21:48] : [01:21:50]

like ability to think deeply or to reason deeply.

[01:21:50] : [01:21:54]

How do you know what is an answer

[01:21:54] : [01:21:58]

that's better or worse based on deep reasoning?

[01:21:58] : [01:22:03]

- Right.

[01:22:03] : [01:22:05]

So then we're asking the question,

[01:22:05] : [01:22:06]

of conceptually, how do you train an energy based model?

[01:22:06] : [01:22:09]

Right?

[01:22:09] : [01:22:10]

So energy based model

[01:22:10] : [01:22:11]

is a function with a scalar output, just a number.

[01:22:11] : [01:22:13]

You give it two inputs, X and Y,

[01:22:13] : [01:22:17]

and it tells you whether Y is compatible with X or not.

[01:22:17] : [01:22:20]

X you observe,

[01:22:20] : [01:22:21]

let's say it's a prompt, an image, a video, whatever.

[01:22:21] : [01:22:24]

And Y is a proposal for an answer,

[01:22:24] : [01:22:28]

a continuation of video, whatever.

[01:22:28] : [01:22:30]

And it tells you whether Y is compatible with X.

[01:22:30] : [01:22:32]

And the way it tells you that Y is compatible with X

[01:22:32] : [01:22:37]

is that the output of that function would be zero

[01:22:37] : [01:22:39]

if Y is compatible with X,

[01:22:39] : [01:22:40]

it would be a positive number, non-zero

[01:22:40] : [01:22:44]

if Y is not compatible with X.

[01:22:44] : [01:22:46]

Okay.

[01:22:46] : [01:22:48]

How do you train a system like this?

[01:22:48] : [01:22:49]

At a completely general level,

[01:22:49] : [01:22:51]

is you show it pairs of Xs and Ys that are compatible,

[01:22:51] : [01:22:56]

a question and the corresponding answer.

[01:22:56] : [01:22:58]

And you train the parameters of the big neural net inside

[01:22:58] : [01:23:02]

to produce zero.

[01:23:02] : [01:23:03]

Okay.

[01:23:03] : [01:23:05]

Now that doesn't completely work

[01:23:05] : [01:23:07]

because the system might decide,

[01:23:07] : [01:23:08]

well, I'm just gonna say zero for everything.

[01:23:08] : [01:23:11]

So now you have to have a process

[01:23:11] : [01:23:12]

to make sure that for a wrong Y,

[01:23:12] : [01:23:16]

the energy will be larger than zero.

[01:23:16] : [01:23:18]

And there you have two options,

[01:23:18] : [01:23:20]

one is contrastive methods.

[01:23:20] : [01:23:21]

So the contrastive method is you show an X and a bad Y,

[01:23:21] : [01:23:25]

and you tell the system,

[01:23:25] : [01:23:27]

well, give a high energy to this.

[01:23:27] : [01:23:29]

Like push up the energy, right?

[01:23:29] : [01:23:30]

Change the weights in the neural net that computes the energy

[01:23:30] : [01:23:33]

so that it goes up.

[01:23:33] : [01:23:34]

So that's contrastive methods.

[01:23:34] : [01:23:37]
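A minimal contrastive update might look like the following; the one-parameter energy, margin hinge, and step size are assumptions made purely for illustration:

```python
# Contrastive training sketch for a scalar energy E(x, y) = (y - w*x)^2:
# push energy down on a compatible pair, and push it up (until a margin)
# on an incompatible pair, by adjusting the single parameter w.
def energy(w, x, y):
    return (y - w * x) ** 2

def contrastive_step(w, x, y_good, y_bad, lr=0.05, margin=1.0):
    # gradient of E(x, y_good) wrt w: following it lowers the good energy
    grad = -2.0 * x * (y_good - w * x)
    # hinge: only push the bad energy up while it sits below the margin
    if energy(w, x, y_bad) < margin:
        grad -= -2.0 * x * (y_bad - w * x)
    return w - lr * grad

w = 0.0
for _ in range(100):
    w = contrastive_step(w, x=1.0, y_good=2.0, y_bad=-2.0)

assert energy(w, 1.0, 2.0) < 1e-3   # compatible pair: low energy
assert energy(w, 1.0, -2.0) > 1.0   # incompatible pair: stays above the margin
```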

The problem with this is if the space of Y is large,

[01:23:37] : [01:23:41]

the number of such contrastive samples

[01:23:41] : [01:23:43]

you're gonna have to show is gigantic.

[01:23:43] : [01:23:47]

But people do this.

[01:23:47] : [01:23:49]

They do this when you train a system with RLHF,

[01:23:49] : [01:23:53]

basically what you're training

[01:23:53] : [01:23:55]

is what's called a reward model,

[01:23:55] : [01:23:57]

which is basically an objective function

[01:23:57] : [01:24:00]

that tells you whether an answer is good or bad.

[01:24:00] : [01:24:02]

And that's basically exactly what this is.

[01:24:02] : [01:24:06]

So we already do this to some extent.

[01:24:06] : [01:24:08]

We're just not using it for inference,

[01:24:08] : [01:24:09]

we're just using it for training.

[01:24:09] : [01:24:11]

There is another set of methods

[01:24:11] : [01:24:15]

which are non-contrastive, and I prefer those.

[01:24:15] : [01:24:18]

And those non-contrastive methods basically say,

[01:24:18] : [01:24:22]

okay, the energy function

[01:24:22] : [01:24:26]

needs to have low energy on pairs of X and Y that are compatible

[01:24:26] : [01:24:29]

that come from your training set.

[01:24:29] : [01:24:31]

How do you make sure that the energy

[01:24:31] : [01:24:34]

is gonna be higher everywhere else?

[01:24:34] : [01:24:36]

And the way you do this

[01:24:36] : [01:24:38]

is by having a regularizer, a criterion,

[01:24:38] : [01:24:43]

a term in your cost function

[01:24:43] : [01:24:45]

that basically minimizes the volume of space

[01:24:45] : [01:24:49]

that can take low energy.

[01:24:49] : [01:24:50]

And the precise way to do this,

[01:24:50] : [01:24:53]

there's all kinds of different specific ways to do this

[01:24:53] : [01:24:55]

depending on the architecture,

[01:24:55] : [01:24:56]

but that's the basic principle.

[01:24:56] : [01:24:58]

So that if you push down the energy function

[01:24:58] : [01:25:00]

for particular regions in the XY space,

[01:25:00] : [01:25:04]

it will automatically go up in other places

[01:25:04] : [01:25:06]

because there's only a limited volume of space

[01:25:06] : [01:25:09]

that can take low energy.

[01:25:09] : [01:25:11]

Okay?

[01:25:11] : [01:25:11]

By the construction of the system

[01:25:11] : [01:25:13]

or by the regularizing function.

[01:25:13] : [01:25:16]
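A 1D toy of that volume-shrinking idea can be sketched; the grid, the exp(-E) volume proxy, and the step size are illustrative choices and not a specific method from the conversation:

```python
import math

# Non-contrastive flavor, 1D toy: energies E[i] over a grid of candidate ys.
# Push E down at the observed data point while a regularizer penalizes the
# total "low-energy volume" sum_i exp(-E[i]) over the whole grid, so energy
# rises automatically everywhere else without sampling explicit negatives.
GRID = 41
E = [0.0] * GRID          # start flat: every y looks equally plausible
data_idx = 20             # index of the observed (compatible) y

for _ in range(200):
    vol = sum(math.exp(-e) for e in E)        # low-energy volume proxy
    for i in range(GRID):
        grad = -math.exp(-E[i]) / vol         # regularizer: shrink the volume
        if i == data_idx:
            grad += 1.0                       # push energy down at the data point
        E[i] -= 0.1 * grad

assert E[data_idx] == min(E)   # data point ends up with the lowest energy
```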

- We've been talking very generally,

[01:25:16] : [01:25:18]

but what is a good X and a good Y?

[01:25:18] : [01:25:21]

What is a good representation of X and Y?

[01:25:21] : [01:25:25]

Because we've been talking about language.

[01:25:25] : [01:25:27]

And if you just take language directly,

[01:25:27] : [01:25:30]

that presumably is not good,

[01:25:30] : [01:25:32]

so there has to be

[01:25:32] : [01:25:33]

some kind of abstract representation of ideas.

[01:25:33] : [01:25:35]

- Yeah.

[01:25:35] : [01:25:37]

I mean you can do this with language directly

[01:25:37] : [01:25:39]

by just, you know, X is a text

[01:25:39] : [01:25:42]

and Y is the continuation of that text.

[01:25:42] : [01:25:43]

- [Lex] Yes.

[01:25:43] : [01:25:45]

- Or X is a question, Y is the answer.

[01:25:45] : [01:25:48]

- But you're saying that's not gonna take it.

[01:25:48] : [01:25:49]

I mean, that's going to do what LLMs are doing.

[01:25:49] : [01:25:52]

- Well, no.

[01:25:52] : [01:25:53]

It depends on how the internal structure of the system

[01:25:53] : [01:25:56]

is built.

[01:25:56] : [01:25:57]

If the internal structure of the system

[01:25:57] : [01:25:59]

is built in such a way that inside of the system

[01:25:59] : [01:26:02]

there is a latent variable,

[01:26:02] : [01:26:03]

let's call it Z,

[01:26:03] : [01:26:04]

that you can manipulate

[01:26:04] : [01:26:09]

so as to minimize the output energy,

[01:26:09] : [01:26:11]

then that Z can be viewed as a representation of a good answer

[01:26:11] : [01:26:16]

that you can translate into a Y that is a good answer.

[01:26:16] : [01:26:19]

- So this kind of system could be trained

[01:26:19] : [01:26:22]

in a very similar way?

[01:26:22] : [01:26:24]

- Very similar way.

[01:26:24] : [01:26:25]

But you have to have this way of preventing collapse,

[01:26:25] : [01:26:26]

of ensuring that there is high energy

[01:26:26] : [01:26:31]

for things you don't train it on.

[01:26:31] : [01:26:33]

And currently it's very implicit in LLMs.

[01:26:33] : [01:26:38]

It is done in a way

[01:26:38] : [01:26:39]

that people don't realize it's being done,

[01:26:39] : [01:26:40]

but it is being done.

[01:26:40] : [01:26:42]

It's due to the fact

[01:26:42] : [01:26:43]

that when you give a high probability to a word,

[01:26:43] : [01:26:48]

automatically you give low probability to other words

[01:26:48] : [01:26:51]

because you only have

[01:26:51] : [01:26:52]

a finite amount of probability to go around. (laughing)

[01:26:52] : [01:26:55]

Right?

[01:26:55] : [01:26:56]

They have to sum to one.

[01:26:56] : [01:26:57]

So when you minimize the cross entropy or whatever,

[01:26:57] : [01:27:00]

when you train your LLM to predict the next word,

[01:27:00] : [01:27:05]

you are increasing the probability

[01:27:05] : [01:27:07]

your system will give to the correct word,

[01:27:07] : [01:27:09]

but you're also decreasing the probability

[01:27:09] : [01:27:10]

it will give to the incorrect words.

[01:27:10] : [01:27:12]

Now, indirectly, that gives a low probability to...

[01:27:12] : [01:27:17]

A high probability to sequences of words that are good

[01:27:17] : [01:27:19]

and low probability to sequences of words that are bad,

[01:27:19] : [01:27:21]

but it's very indirect.

[01:27:21] : [01:27:23]

It's not obvious why this actually works at all,

[01:27:23] : [01:27:26]

because you're not doing it on a joint probability

[01:27:26] : [01:27:30]

of all the symbols in a sequence,

[01:27:30] : [01:27:32]

you're just doing it kind of,

[01:27:32] : [01:27:34]

sort of factorized that probability

[01:27:34] : [01:27:37]

in terms of conditional probabilities

[01:27:37] : [01:27:39]

over successive tokens.

[01:27:39] : [01:27:41]
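The finite-probability-mass point is easy to verify with a toy softmax over three candidate next words (illustrative numbers, not an actual LLM):

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max())   # numerically stable softmax
    return z / z.sum()

# Three candidate next words, initially equally likely.
logits = np.array([1.0, 1.0, 1.0])
p_before = softmax(logits)              # each word gets 1/3

# A cross-entropy training step raises the correct word's logit...
logits[0] += 2.0
p_after = softmax(logits)

# ...and because the probabilities must sum to one, every other
# word's probability necessarily drops, with no explicit term
# pushing it down — the "high energy elsewhere" comes for free.
```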

- So how do you do this for visual data?

[01:27:41] : [01:27:43]

- So we've been doing this

[01:27:43] : [01:27:44]

with all JEPA architectures, basically the-

[01:27:44] : [01:27:47]

- [Lex] The joint embedding?

[01:27:47] : [01:27:47]

- I-JEPA.

[01:27:47] : [01:27:48]

So there, the compatibility between two things

[01:27:48] : [01:27:52]

is here's an image or a video,

[01:27:52] : [01:27:56]

here is a corrupted, shifted or transformed version

[01:27:56] : [01:27:58]

of that image or video or masked.

[01:27:58] : [01:28:01]

Okay?

[01:28:01] : [01:28:01]

And then the energy of the system

[01:28:01] : [01:28:04]

is the prediction error of the representation.

[01:28:04] : [01:28:09]

The predicted representation of the good thing

[01:28:09] : [01:28:14]

versus the actual representation of the good thing, right?

[01:28:14] : [01:28:17]

So you run the corrupted image through the system,

[01:28:17] : [01:28:20]

predict the representation of the good input, uncorrupted,

[01:28:20] : [01:28:24]

and then compute the prediction error.

[01:28:24] : [01:28:26]

That's the energy of the system.

[01:28:26] : [01:28:28]

So this system will tell you,

[01:28:28] : [01:28:30]

this is a good image and this is a corrupted version.

[01:28:30] : [01:28:35]

It will give you zero energy

[01:28:35] : [01:28:38]

if those two things are effectively,

[01:28:38] : [01:28:41]

one of them is a corrupted version of the other,

[01:28:41] : [01:28:43]

give you a high energy

[01:28:43] : [01:28:44]

if the two images are completely different.

[01:28:44] : [01:28:46]
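The JEPA energy LeCun describes — prediction error between the representation of the clean input and the representation predicted from the corrupted one — can be mocked up with stand-in components; the fixed projection encoder and identity predictor below are placeholders for what I-JEPA actually learns:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in encoder: a fixed linear projection plus tanh.
# In a real I-JEPA both the encoder and the predictor are
# trained networks; here they only illustrate the energy.
W = np.linspace(-1.0, 1.0, 8 * 3).reshape(8, 3)

def encoder(x):
    return np.tanh(x @ W)

def predictor(s):
    # Identity stand-in for the learned predictor module.
    return s

def jepa_energy(x_clean, x_corrupted):
    # Energy = prediction error in representation space:
    # predict the clean representation from the corrupted input.
    s_clean = encoder(x_clean)
    s_pred = predictor(encoder(x_corrupted))
    return float(np.sum((s_pred - s_clean) ** 2))

x = rng.normal(size=8)                      # the "image"
corrupted = x + 0.01 * rng.normal(size=8)   # lightly corrupted version
```

A lightly corrupted input lands near the clean one in representation space, so its energy is small, which is the low-energy / high-energy behavior described above.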

- And hopefully that whole process

[01:28:46] : [01:28:48]

gives you a really nice compressed representation

[01:28:48] : [01:28:51]

of reality, of visual reality.

[01:28:51] : [01:28:54]

- And we know it does

[01:28:54] : [01:28:55]

because then we use those representations

[01:28:55] : [01:28:57]

as input to a classification system or something,

[01:28:57] : [01:28:59]

and it works-- And then

[01:28:59] : [01:29:00]

that classification system works really nicely.

[01:29:00] : [01:29:01]

Okay.

[01:29:01] : [01:29:03]

Well, so to summarize,

[01:29:03] : [01:29:04]

you recommend in a spicy way that only Yann LeCun can,

[01:29:04] : [01:29:09]

you recommend that we abandon generative models

[01:29:09] : [01:29:12]

in favor of joint embedding architectures?

[01:29:12] : [01:29:14]

- [Yann] Yes.

[01:29:14] : [01:29:16]

- Abandon autoregressive generation.

[01:29:16] : [01:29:17]

- [Yann] Yes.

[01:29:17] : [01:29:18]

- Abandon... (laughs)

[01:29:18] : [01:29:19]

This feels like court testimony.

[01:29:19] : [01:29:21]

Abandon probabilistic models

[01:29:21] : [01:29:23]

in favor of energy-based models, as we talked about.

[01:29:23] : [01:29:26]

Abandon contrastive methods

[01:29:26] : [01:29:27]

in favor of regularized methods.

[01:29:27] : [01:29:30]

And let me ask you about this;

[01:29:30] : [01:29:32]

you've been, for a while, a critic of reinforcement learning.

[01:29:32] : [01:29:36]

- [Yann] Yes.

[01:29:36] : [01:29:37]

- So the last recommendation is that we abandon RL

[01:29:37] : [01:29:41]

in favor of model predictive control,

[01:29:41] : [01:29:43]

as you were talking about.

[01:29:43] : [01:29:45]

And only use RL

[01:29:45] : [01:29:46]

when planning doesn't yield the predicted outcome.

[01:29:46] : [01:29:50]

And we use RL in that case

[01:29:50] : [01:29:52]

to adjust the world model or the critic.

[01:29:52] : [01:29:55]

- [Yann] Yes.

[01:29:55] : [01:29:57]

- So you've mentioned RLHF,

[01:29:57] : [01:30:00]

reinforcement learning with human feedback.

[01:30:00] : [01:30:02]

Why do you still hate reinforcement learning?

[01:30:02] : [01:30:05]

- [Yann] I don't hate reinforcement learning,

[01:30:05] : [01:30:07]

and I think it's-- So it's all love?

[01:30:07] : [01:30:08]

- I think it should not be abandoned completely,

[01:30:08] : [01:30:12]

but I think its use should be minimized

[01:30:12] : [01:30:14]

because it's incredibly inefficient in terms of samples.

[01:30:14] : [01:30:18]

And so the proper way to train a system

[01:30:18] : [01:30:21]

is to first have it learn

[01:30:21] : [01:30:24]

good representations of the world and world models

[01:30:24] : [01:30:27]

from mostly observation,

[01:30:27] : [01:30:29]

maybe a little bit of interactions.

[01:30:29] : [01:30:31]

- And then steer it based on that.

[01:30:31] : [01:30:33]

If the representation is good,

[01:30:33] : [01:30:34]

then the adjustments should be minimal.

[01:30:34] : [01:30:36]

- Yeah.

[01:30:36] : [01:30:37]

Now there's two things.

[01:30:37] : [01:30:39]

If you've learned the world model,

[01:30:39] : [01:30:40]

you can use the world model to plan a sequence of actions

[01:30:40] : [01:30:42]

to arrive at a particular objective.

[01:30:42] : [01:30:44]

You don't need RL,

[01:30:44] : [01:30:47]

unless the way you measure whether you succeed

[01:30:47] : [01:30:50]

might be inexact.

[01:30:50] : [01:30:51]
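Planning a sequence of actions through a world model, with no RL in the loop, can be sketched as random-shooting model predictive control; the one-dimensional additive dynamics here are a toy assumption:

```python
import numpy as np

def world_model(state, action):
    # Toy known dynamics: the next state is state + action.
    return state + action

def plan(state, goal, horizon=5, n_candidates=256, seed=0):
    # Random-shooting MPC: sample candidate action sequences, roll
    # each through the world model, and keep the sequence whose
    # predicted final state lands closest to the goal.
    rng = np.random.default_rng(seed)
    best_seq, best_cost = None, np.inf
    for _ in range(n_candidates):
        actions = rng.uniform(-1.0, 1.0, size=horizon)
        s = state
        for a in actions:
            s = world_model(s, a)
        cost = abs(s - goal)
        if cost < best_cost:
            best_seq, best_cost = actions, cost
    return best_seq, best_cost

seq, cost = plan(state=0.0, goal=3.0)
```

With an accurate world model the planner reaches the goal by search alone; in LeCun's framing, RL only enters to correct the world model or the objective when reality disagrees with the prediction.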

Your idea of whether you were gonna fall from your bike

[01:30:51] : [01:30:56]

might be wrong,

[01:30:56] : [01:30:59]

or whether the person you're fighting in MMA

[01:30:59] : [01:31:02]

was gonna do something

[01:31:02] : [01:31:03]

and they do something else. (laughing)

[01:31:03] : [01:31:05]

So there's two ways you can be wrong.

[01:31:05] : [01:31:09]

Either your objective function

[01:31:09] : [01:31:12]

does not reflect

[01:31:12] : [01:31:13]

the actual objective function you want to optimize,

[01:31:13] : [01:31:16]

or your world model is inaccurate, right?

[01:31:16] : [01:31:19]

So the prediction you were making

[01:31:19] : [01:31:22]

about what was gonna happenin the world is inaccurate.

[01:31:22] : [01:31:25]

So if you want to adjust your world model

[01:31:25] : [01:31:27]

while you are operating in the world

[01:31:27] : [01:31:30]

or your objective function,

[01:31:30] : [01:31:32]

that is basically in the realm of RL.

[01:31:32] : [01:31:35]

This is what RL deals withto some extent, right?

[01:31:35] : [01:31:38]

So adjust your world model.

[01:31:38] : [01:31:40]

And the way to adjust your world model, even in advance,

[01:31:40] : [01:31:44]

is to explore parts of the space with your world model,

[01:31:44] : [01:31:48]

where you know that your world model is inaccurate.

[01:31:48] : [01:31:50]

That's called curiosity basically, or play, right?

[01:31:50] : [01:31:54]

When you play,

[01:31:54] : [01:31:55]

you kind of explore part of the state space

[01:31:55] : [01:31:58]

that you don't want to do for real

[01:31:58] : [01:32:03]

because it might be dangerous,

[01:32:03] : [01:32:05]

but you can adjust your world model

[01:32:05] : [01:32:07]

without killing yourself basically. (laughs)

[01:32:07] : [01:32:13]

So that's what you want to use RL for.

[01:32:13] : [01:32:14]

When it comes time to learning a particular task,

[01:32:14] : [01:32:18]

you already have all the good representations,

[01:32:18] : [01:32:20]

you already have your world model,

[01:32:20] : [01:32:21]

but you need to adjust it for the situation at hand.

[01:32:21] : [01:32:25]

That's when you use RL.

[01:32:25] : [01:32:26]

- Why do you think RLHF works so well?

[01:32:26] : [01:32:29]

This reinforcement learning with human feedback,

[01:32:29] : [01:32:32]

why did it have such a transformational effect

[01:32:32] : [01:32:34]

on large language models that came before?

[01:32:34] : [01:32:38]

- So what's had the transformational effect

[01:32:38] : [01:32:39]

is human feedback.

[01:32:39] : [01:32:42]

There are many ways to use it

[01:32:42] : [01:32:43]

and some of it is just purely supervised, actually,

[01:32:43] : [01:32:45]

it's not really reinforcement learning.

[01:32:45] : [01:32:47]

- So it's the HF. (laughing)

[01:32:47] : [01:32:49]

- It's the HF.

[01:32:49] : [01:32:50]

And then there is various ways to use human feedback, right?

[01:32:50] : [01:32:53]

So you can ask humans to rate answers,

[01:32:53] : [01:32:56]

multiple answers that are produced by a world model.

[01:32:56] : [01:33:00]

And then what you do is you train an objective function

[01:33:00] : [01:33:05]

to predict that rating.

[01:33:05] : [01:33:07]

And then you can use that objective function

[01:33:07] : [01:33:11]

to predict whether an answer is good,

[01:33:11] : [01:33:13]

and you can backpropagate through this

[01:33:13] : [01:33:15]

to fine tune your system

[01:33:15] : [01:33:16]

so that it only produces highly rated answers.

[01:33:16] : [01:33:19]

Okay?

[01:33:19] : [01:33:22]

So that's one way.

[01:33:22] : [01:33:23]

So that's like in RL,

[01:33:23] : [01:33:26]

that means training what's called a reward model, right?

[01:33:26] : [01:33:29]

So something that,

[01:33:29] : [01:33:30]

basically a small neural net

[01:33:30] : [01:33:31]

that estimates to what extent an answer is good, right?

[01:33:31] : [01:33:35]

It's very similar to the objective

[01:33:35] : [01:33:36]

I was talking about earlier for planning,

[01:33:36] : [01:33:39]

except now it's not used for planning,

[01:33:39] : [01:33:41]

it's used for fine tuning your system.

[01:33:41] : [01:33:43]

I think it would be much more efficient

[01:33:43] : [01:33:45]

to use it for planning,

[01:33:45] : [01:33:46]

but currently it's used

[01:33:46] : [01:33:49]

to fine tune the parameters of the system.

[01:33:49] : [01:33:52]
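The reward-model step LeCun outlines — fit a model to human ratings, then use it to score and fine-tune answers — can be sketched with a toy linear regressor on synthetic ratings (real RLHF reward models are neural networks trained on preference data, so everything below is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy setup: each "answer" is a feature vector, and the synthetic
# human raters prefer answers aligned with a hidden direction.
true_pref = np.array([1.0, -0.5, 0.25])
answers = rng.normal(size=(64, 3))
ratings = answers @ true_pref + 0.1 * rng.normal(size=64)

# Reward model: a linear fit to the ratings by least squares.
w, *_ = np.linalg.lstsq(answers, ratings, rcond=None)

def reward(answer):
    # Estimated "to what extent this answer is good."
    return float(answer @ w)

# The learned reward model now scores unseen answers; in RLHF its
# gradient is backpropagated to fine-tune the generator, and in
# LeCun's preferred setup it would instead serve as a planning
# objective.
good = np.array([2.0, -1.0, 0.5])   # aligned with the preference
bad = -good                          # anti-aligned
```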

Now, there's several ways to do this.

[01:33:52] : [01:33:54]

Some of them are supervised.

[01:33:54] : [01:33:57]

You just ask a human person,

[01:33:57] : [01:33:59]

like what is a good answer for this, right?

[01:33:59] : [01:34:02]

Then you just type the answer.

[01:34:02] : [01:34:04]

I mean, there's lots of ways

[01:34:04] : [01:34:07]

that those systems are being adjusted.

[01:34:07] : [01:34:09]

- Now, a lot of people have been very critical

[01:34:09] : [01:34:13]

of the recently releasedGoogle's Gemini 1.5

[01:34:13] : [01:34:17]

for essentially, in my words, I could say, super woke.

[01:34:17] : [01:34:23]

Woke in the negative connotation of that word.

[01:34:23] : [01:34:26]

There are some almost hilariously absurd things that it does,

[01:34:26] : [01:34:30]

like it modifies history,

[01:34:30] : [01:34:32]

like generating images of a black George Washington

[01:34:32] : [01:34:37]

or perhaps more seriously

[01:34:37] : [01:34:40]

something that you commented on Twitter,

[01:34:40] : [01:34:43]

which is refusing to comment on or generate images of,

[01:34:43] : [01:34:48]

or even descriptions of Tiananmen Square or the Tank Man,

[01:34:48] : [01:34:54]

one of the most sort of legendary protest images in history.

[01:34:54] : [01:35:00]

And of course, these images are highly censored

[01:35:00] : [01:35:04]

by the Chinese government.

[01:35:04] : [01:35:06]

And therefore everybody started asking questions

[01:35:06] : [01:35:09]

of what is the process of designing these LLMs?

[01:35:09] : [01:35:14]

What is the role of censorship in these,

[01:35:14] : [01:35:16]

and all that kind of stuff.

[01:35:16] : [01:35:19]

So you commented on Twitter

[01:35:19] : [01:35:22]

saying that open source is the answer.

[01:35:22] : [01:35:23]

(laughs)- Yeah.

[01:35:23] : [01:35:25]

- Essentially.

[01:35:25] : [01:35:26]

So can you explain?

[01:35:26] : [01:35:28]

- I actually made that comment

[01:35:28] : [01:35:31]

on just about every social network I can.

[01:35:31] : [01:35:32]

(Lex laughs)

[01:35:32] : [01:35:33]

And I've made that point multiple times in various forums.

[01:35:33] : [01:35:38]

Here's my point of view on this.

[01:35:38] : [01:35:43]

People can complain that AI systems are biased,

[01:35:43] : [01:35:47]

and they generally are biased

[01:35:47] : [01:35:49]

by the distribution of the training data

[01:35:49] : [01:35:51]

that they've been trained on

[01:35:51] : [01:35:53]

that reflects biases in society.

[01:35:53] : [01:35:57]

And that is potentially offensive to some people

[01:35:57] : [01:36:03]

or potentially not.

[01:36:03] : [01:36:06]

And some techniques to de-bias

[01:36:06] : [01:36:10]

then become offensive to some people

[01:36:10] : [01:36:13]

because of historical incorrectness and things like that.

[01:36:13] : [01:36:20]

And so you can ask the question.

[01:36:20] : [01:36:25]

You can ask two questions.

[01:36:25] : [01:36:27]

The first question is,

[01:36:27] : [01:36:28]

is it possible to produce an AI system that is not biased?

[01:36:28] : [01:36:30]

And the answer is absolutely not.

[01:36:30] : [01:36:33]

And it's not because of technological challenges,

[01:36:33] : [01:36:36]

although there are technological challenges to that.

[01:36:36] : [01:36:41]

It's because bias is in the eye of the beholder.

[01:36:41] : [01:36:45]

Different people may have different ideas

[01:36:45] : [01:36:48]

about what constitutes bias for a lot of things.

[01:36:48] : [01:36:52]

I mean there are facts that are indisputable,

[01:36:52] : [01:36:57]

but there are a lot of opinions or things

[01:36:57] : [01:36:59]

that can be expressed in different ways.

[01:36:59] : [01:37:01]

And so you cannot have an unbiased system,

[01:37:01] : [01:37:04]

that's just an impossibility.

[01:37:04] : [01:37:06]

And so what's the answer to this?

[01:37:06] : [01:37:12]

And the answer is the same answer that we found

[01:37:12] : [01:37:16]

in liberal democracy about the press.

[01:37:16] : [01:37:20]

The press needs to be free and diverse.

[01:37:20] : [01:37:24]

We have free speech for a good reason.

[01:37:24] : [01:37:28]

It's because we don't want all of our information

[01:37:28] : [01:37:31]

to come from a unique source,

[01:37:31] : [01:37:35]

'cause that's opposite to the whole idea of democracy

[01:37:35] : [01:37:40]

and progressive ideas and even science, right?

[01:37:40] : [01:37:44]

In science, people have to argue for different opinions.

[01:37:44] : [01:37:48]

And science makes progress when people disagree

[01:37:48] : [01:37:51]

and they come up with an answer

[01:37:51] : [01:37:52]

and a consensus forms, right?

[01:37:52] : [01:37:54]

And it's true in all democracies around the world.

[01:37:54] : [01:37:57]

So there is a future which is already happening

[01:37:57] : [01:38:02]

where every single one of our interactions

[01:38:02] : [01:38:05]

with the digital world

[01:38:05] : [01:38:07]

will be mediated by AI systems,

[01:38:07] : [01:38:10]

AI assistants, right?

[01:38:10] : [01:38:11]

We're gonna have smart glasses.

[01:38:11] : [01:38:14]

You can already buy them from Meta, (laughing)

[01:38:14] : [01:38:16]

the Ray-Ban Meta.

[01:38:16] : [01:38:18]

Where you can talk to them

[01:38:18] : [01:38:20]

and they are connected with an LLM

[01:38:20] : [01:38:21]

and you can get answers on any question you have.

[01:38:21] : [01:38:25]

Or you can be looking at a monument

[01:38:25] : [01:38:28]

and there is a camera in the system, in the glasses,

[01:38:28] : [01:38:31]

you can ask it like what can you tell me

[01:38:31] : [01:38:34]

about this building or this monument?

[01:38:34] : [01:38:36]

You can be looking at a menu in a foreign language

[01:38:36] : [01:38:39]

and the thing will translate it for you.

[01:38:39] : [01:38:40]

We can do real time translation

[01:38:40] : [01:38:43]

if we speak different languages.

[01:38:43] : [01:38:44]

So a lot of our interactions with the digital world

[01:38:44] : [01:38:48]

are going to be mediated by those systems

[01:38:48] : [01:38:49]

in the near future.

[01:38:49] : [01:38:50]

Increasingly, the search engines that we're gonna use

[01:38:50] : [01:38:56]

are not gonna be search engines,

[01:38:56] : [01:38:58]

they're gonna be dialogue systems

[01:38:58] : [01:39:01]

that we just ask a question,

[01:39:01] : [01:39:04]

and it will answer

[01:39:04] : [01:39:05]

and then point you

[01:39:05] : [01:39:05]

to the perhaps appropriate reference for it.

[01:39:05] : [01:39:09]

But here is the thing,

[01:39:09] : [01:39:10]

we cannot afford those systems

[01:39:10] : [01:39:11]

to come from a handful of companies

[01:39:11] : [01:39:13]

on the west coast of the US

[01:39:13] : [01:39:15]

because those systems will constitute

[01:39:15] : [01:39:18]

the repository of all human knowledge.

[01:39:18] : [01:39:21]

And we cannot have that be controlled

[01:39:21] : [01:39:25]

by a small number of people, right?

[01:39:25] : [01:39:27]

It has to be diverse

[01:39:27] : [01:39:29]

for the same reason the press has to be diverse.

[01:39:29] : [01:39:32]

So how do we get a diverse set of AI assistants?

[01:39:32] : [01:39:35]

It's very expensive and difficult

[01:39:35] : [01:39:38]

to train a base model, right?

[01:39:38] : [01:39:40]

A base LLM at the moment.

[01:39:40] : [01:39:42]

In the future it might be something different,

[01:39:42] : [01:39:43]

but at the moment that's an LLM.

[01:39:43] : [01:39:46]

So only a few companies can do this properly.

[01:39:46] : [01:39:49]

And if some of those subsystems are open source,

[01:39:49] : [01:39:55]

anybody can use them,

[01:39:55] : [01:39:57]

anybody can fine tune them.

[01:39:57] : [01:39:58]

If we put in place some systems

[01:39:58] : [01:40:01]

that allow any group of people,

[01:40:01] : [01:40:05]

whether they are individual citizens,

[01:40:05] : [01:40:10]

groups of citizens,

[01:40:10] : [01:40:11]

government organizations,

[01:40:11] : [01:40:13]

NGOs, companies, whatever,

[01:40:13] : [01:40:18]

to take those open source systems, AI systems,

[01:40:18] : [01:40:23]

and fine tune them for their own purpose on their own data,

[01:40:23] : [01:40:27]

then we're gonna have a very large diversity

[01:40:27] : [01:40:29]

of different AI systems

[01:40:29] : [01:40:31]

that are specialized for all of those things, right?

[01:40:31] : [01:40:34]

So I'll tell you,

[01:40:34] : [01:40:35]

I talked to the French government quite a bit

[01:40:35] : [01:40:38]

and the French government will not accept

[01:40:38] : [01:40:41]

that the digital diet of all their citizens

[01:40:41] : [01:40:44]

be controlled by three companies

[01:40:44] : [01:40:46]

on the west coast of the US.

[01:40:46] : [01:40:48]

That's just not acceptable.

[01:40:48] : [01:40:49]

It's a danger to democracy.

[01:40:49] : [01:40:51]

Regardless of how well intentioned

[01:40:51] : [01:40:52]

those companies are, right?

[01:40:52] : [01:40:54]

And it's also a danger to local culture,

[01:40:54] : [01:41:00]

to values, to language, right?

[01:41:00] : [01:41:05]

I was talking with the founder of Infosys in India.

[01:41:05] : [01:41:10]

He's funding a project to fine tune LLaMA 2,

[01:41:10] : [01:41:16]

the open source model produced by Meta.

[01:41:16] : [01:41:19]

So that LLaMA 2 speaks all 22 official languages in India.

[01:41:19] : [01:41:23]

It's very important for people in India.

[01:41:23] : [01:41:26]

I was talking to a former colleague of mine,

[01:41:26] : [01:41:28]

Moustapha Cisse,

[01:41:28] : [01:41:29]

who used to be a scientist at FAIR,

[01:41:29] : [01:41:31]

and then moved back to Africa

[01:41:31] : [01:41:32]

and created a research lab for Google in Africa

[01:41:32] : [01:41:35]

and now has a new startup Kera.

[01:41:35] : [01:41:37]

And what he's trying to do is basically have an LLM

[01:41:37] : [01:41:40]

that speaks the local languages in Senegal

[01:41:40] : [01:41:42]

so that people can have access to medical information,

[01:41:42] : [01:41:46]

'cause they don't have access to doctors,

[01:41:46] : [01:41:47]

it's a very small number of doctors per capita in Senegal.

[01:41:47] : [01:41:52]

I mean, you can't have any of this

[01:41:52] : [01:41:55]

unless you have open source platforms.

[01:41:55] : [01:41:58]

So with open source platforms,

[01:41:58] : [01:41:59]

you can have AI systems

[01:41:59] : [01:42:00]

that are not only diverse in terms of political opinions

[01:42:00] : [01:42:02]

or things of that type,

[01:42:02] : [01:42:05]

but in terms of language,culture, value systems,

[01:42:05] : [01:42:10]

political opinions, technical abilities in various domains.

[01:42:10] : [01:42:16]

And you can have an industry,

[01:42:16] : [01:42:20]

an ecosystem of companies

[01:42:20] : [01:42:22]

that fine tune those open source systems

[01:42:22] : [01:42:24]

for vertical applications in industry, right?

[01:42:24] : [01:42:26]

You have, I don't know, a publisher has thousands of books

[01:42:26] : [01:42:30]

and they want to build a system

[01:42:30] : [01:42:31]

that allows a customer to just ask a question

[01:42:31] : [01:42:33]

about the content of any of their books.

[01:42:33] : [01:42:37]

You need to train on their proprietary data, right?

[01:42:37] : [01:42:40]

You have a company,

[01:42:40] : [01:42:42]

we have one within Meta, it's called Metamate.

[01:42:42] : [01:42:44]

And it's basically an LLM

[01:42:44] : [01:42:46]

that can answer any question

[01:42:46] : [01:42:47]

about internal stuff about the company.

[01:42:47] : [01:42:52]

Very useful.

[01:42:52] : [01:42:53]

A lot of companies want this, right?

[01:42:53] : [01:42:54]

A lot of companies want thisnot just for their employees,

[01:42:54] : [01:42:57]

but also for their customers,

[01:42:57] : [01:42:59]

to take care of their customers.

[01:42:59] : [01:43:00]

So the only way you're gonna have an AI industry,

[01:43:00] : [01:43:04]

the only way you're gonna have AI systems

[01:43:04] : [01:43:06]

that are not uniquely biased,

[01:43:06] : [01:43:08]

is if you have open source platforms

[01:43:08] : [01:43:10]

on top of which any group can build specialized systems.

[01:43:10] : [01:43:15]

So the inevitable direction of history

[01:43:15] : [01:43:21]

is that the vast majority of AI systems

[01:43:21] : [01:43:25]

will be built on top of open source platforms.

[01:43:25] : [01:43:28]

- So that's a beautiful vision.

[01:43:28] : [01:43:30]

So meaning like a company like Meta or Google or so on,

[01:43:30] : [01:43:35]

should take only minimal fine tuning steps

[01:43:35] : [01:43:40]

after building the foundation, pre-trained model.

[01:43:40] : [01:43:44]

As few steps as possible.

[01:43:44] : [01:43:47]

- Basically.

[01:43:47] : [01:43:48]

(Lex sighs)

[01:43:48] : [01:43:49]

- Can Meta afford to do that?

[01:43:49] : [01:43:51]

- No.

[01:43:51] : [01:43:52]

- So I don't know if you know this,

[01:43:52] : [01:43:53]

but companies are supposed to make money somehow.

[01:43:53] : [01:43:56]

And open source is like giving away...

[01:43:56] : [01:44:00]

I don't know, Mark made a video,

[01:44:00] : [01:44:02]

Mark Zuckerberg.

[01:44:02] : [01:44:04]

A very sexy video talking about 350,000 Nvidia H100s.

[01:44:04] : [01:44:08]

The math of that is,

[01:44:08] : [01:44:14]

just for the GPUs, that's a hundred billion,

[01:44:14] : [01:44:17]

plus the infrastructure for training everything.

[01:44:17] : [01:44:22]

So I'm no business guy,

[01:44:22] : [01:44:26]

but how do you make money on that?

[01:44:26] : [01:44:27]

So the vision you paint is a really powerful one,

[01:44:27] : [01:44:30]

but how is it possible to make money?

[01:44:30] : [01:44:32]

- Okay.

[01:44:32] : [01:44:33]

So you have several business models, right?

[01:44:33] : [01:44:36]

The business model that Meta is built around

[01:44:36] : [01:44:39]

is you offer a service,

[01:44:39] : [01:44:44]

and the financing of that service

[01:44:44] : [01:44:48]

is either through ads or through business customers.

[01:44:48] : [01:44:52]

So for example, if you have an LLM

[01:44:52] : [01:44:54]

that can help a mom-and-pop pizza place

[01:44:54] : [01:44:58]

by talking to their customers through WhatsApp,

[01:44:58] : [01:45:03]

and so the customers can just order a pizza

[01:45:03] : [01:45:05]

and the system will just ask them,

[01:45:05] : [01:45:08]

like what topping do you want or what size, blah, blah, blah.

[01:45:08] : [01:45:11]

The business will pay for that.

[01:45:11] : [01:45:14]

Okay?

[01:45:14] : [01:45:14]

That's a model.

[01:45:14] : [01:45:15]

And otherwise, if it's a system

[01:45:15] : [01:45:21]

that is on the more kind of classical services,

[01:45:21] : [01:45:24]

it can be ad supported or there's several models.

[01:45:24] : [01:45:28]

But the point is,

[01:45:28] : [01:45:29]

if you have a big enough potential customer base

[01:45:29] : [01:45:34]

and you need to build that system anyway for them,

[01:45:34] : [01:45:39]

it doesn't hurt you

[01:45:39] : [01:45:41]

to actually distribute it to open source.

[01:45:41] : [01:45:43]

- Again, I'm no business guy,

[01:45:43] : [01:45:45]

but if you release the open source model,

[01:45:45] : [01:45:48]

then other people can do the same kind of task

[01:45:48] : [01:45:51]

and compete on it.

[01:45:51] : [01:45:52]

Basically provide fine-tuned models for businesses,

[01:45:52] : [01:45:57]

is the bet that Meta is making...

[01:45:57] : [01:45:59]

By the way, I'm a huge fan of all this.

[01:45:59] : [01:46:01]

But is the bet that Meta is making

[01:46:01] : [01:46:03]

is like, "we'll do a better job of it?"

[01:46:03] : [01:46:05]

- Well, no.

[01:46:05] : [01:46:06]

The bet is more,

[01:46:06] : [01:46:08]

we already have a huge user base and customer base.

[01:46:08] : [01:46:13]

- [Lex] Ah, right.- Right?

[01:46:13] : [01:46:13]

So it's gonna be useful to them.

[01:46:13] : [01:46:15]

Whatever we offer them is gonna be useful

[01:46:15] : [01:46:17]

and there is a way to derive revenue from this.

[01:46:17] : [01:46:21]

- [Lex] Sure.

[01:46:21] : [01:46:22]

- And it doesn't hurt

[01:46:22] : [01:46:23]

that we provide that system or the base model, right?

[01:46:23] : [01:46:28]

The foundation model in open source

[01:46:28] : [01:46:32]

for others to build applications on top of it too.

[01:46:32] : [01:46:35]

If those applications

[01:46:35] : [01:46:36]

turn out to be useful for our customers,

[01:46:36] : [01:46:38]

we can just buy it for them.

[01:46:38] : [01:46:39]

It could be that theywill improve the platform.

[01:46:39] : [01:46:44]

In fact, we see this already.

[01:46:44] : [01:46:46]

I mean there are literally millions of downloads of LLaMA 2

[01:46:46] : [01:46:50]

and thousands of people who have provided ideas

[01:46:50] : [01:46:53]

about how to make it better.

[01:46:53] : [01:46:55]

So this clearly accelerates progress

[01:46:55] : [01:46:58]

to make the system available

[01:46:58] : [01:47:00]

to sort of a wide community of people.

[01:47:00] : [01:47:05]

And there are literally thousands of businesses

[01:47:05] : [01:47:07]

who are building applications with it.

[01:47:07] : [01:47:09]

Meta's ability to derive revenue from this technology

[01:47:09] : [01:47:19]

is not impaired by the distribution

[01:47:19] : [01:47:24]

of base models in open source.

[01:47:24] : [01:47:26]

- The fundamental criticism that Gemini is getting

[01:47:26] : [01:47:28]

is that, as you pointed out, on the west coast...

[01:47:28] : [01:47:31]

Just to clarify,

[01:47:31] : [01:47:32]

we're currently in the east coast,

[01:47:32] : [01:47:34]

where I would suppose Meta AI headquarters would be.

[01:47:34] : [01:47:38]

(laughs)

[01:47:38] : [01:47:39]

So strong words about the west coast.

[01:47:39] : [01:47:42]

But I guess the issue that happens is,

[01:47:42] : [01:47:46]

I think it's fair to say that most tech people

[01:47:46] : [01:47:50]

have a political affiliation with the left wing.

[01:47:50] : [01:47:53]

They lean left.

[01:47:53] : [01:47:55]

And so the problem that people are criticizing Gemini with

[01:47:55] : [01:47:58]

is that in that de-biasing process that you mentioned,

[01:47:58] : [01:48:02]

that their ideological lean becomes obvious.

[01:48:02] : [01:48:07]

Is this something that could be escaped?

[01:48:07] : [01:48:14]

You're saying open source is the only way?

[01:48:14] : [01:48:16]

- [Yann] Yeah.

[01:48:16] : [01:48:17]

- Have you witnessed this kind of ideological lean

[01:48:17] : [01:48:19]

that makes engineering difficult?

[01:48:19] : [01:48:22]

- No, I don't think it has to do...

[01:48:22] : [01:48:24]

I don't think the issue has to do

[01:48:24] : [01:48:25]

with the political leaning

[01:48:25] : [01:48:26]

of the people designing those systems.

[01:48:26] : [01:48:29]

It has to do with the acceptability or political leanings

[01:48:29] : [01:48:34]

of their customer base or audience, right?

[01:48:34] : [01:48:38]

So a big company cannot afford to offend too many people.

[01:48:38] : [01:48:43]

So they're going to make sure

[01:48:43] : [01:48:46]

that whatever product they put out is "safe,"

[01:48:46] : [01:48:49]

whatever that means.

[01:48:49] : [01:48:50]

And it's very possible to overdo it.

[01:48:50] : [01:48:55]

And it's also very possible to...

[01:48:55] : [01:48:58]

It's impossible to do it properly for everyone.

[01:48:58] : [01:49:00]

You're not going to satisfy everyone.

[01:49:00] : [01:49:02]

So that's what I said before,

[01:49:02] : [01:49:03]

you cannot have a system that is unbiased

[01:49:03] : [01:49:05]

and is perceived as unbiased by everyone.

[01:49:05] : [01:49:07]

It's gonna be,

[01:49:07] : [01:49:09]

you push it in one way,

[01:49:09] : [01:49:11]

one set of people are gonna see it as biased.

[01:49:11] : [01:49:14]

And then you push it the other way

[01:49:14] : [01:49:15]

and another set of people is gonna see it as biased.

[01:49:15] : [01:49:18]

And then in addition to this,

[01:49:18] : [01:49:19]

there's the issue of if you push the system

[01:49:19] : [01:49:22]

perhaps a little too far in one direction,

[01:49:22] : [01:49:24]

it's gonna be non-factual, right?

[01:49:24] : [01:49:25]

You're gonna have black Nazi soldiers in-

[01:49:25] : [01:49:30]

- Yeah.

[01:49:30] : [01:49:31]

So we should mention image generation

[01:49:31] : [01:49:34]

of black Nazi soldiers,

[01:49:34] : [01:49:36]

which is not factually accurate.

[01:49:36] : [01:49:38]

- Right.

[01:49:38] : [01:49:39]

And can be offensive for some people as well, right?

[01:49:39] : [01:49:42]

So it's gonna be impossible

[01:49:42] : [01:49:46]

to kind of produce systems that are unbiased for everyone.

[01:49:46] : [01:49:49]

So the only solution that I see is diversity.

[01:49:49] : [01:49:53]

- And diversity in the full meaning of that word,

[01:49:53] : [01:49:55]

diversity in every possible way.

[01:49:55] : [01:49:57]

- [Yann] Yeah.

[01:49:57] : [01:49:58]

- Marc Andreessen just tweeted today,

[01:49:58] : [01:50:02]

let me do a TL;DR.

[01:50:02] : [01:50:06]

The conclusion is only startups and open source

[01:50:06] : [01:50:08]

can avoid the issue that he's highlighting with big tech.

[01:50:08] : [01:50:12]

He's asking,

[01:50:12] : [01:50:14]

can big tech actually field generative AI products?

[01:50:14] : [01:50:17]

One, ever escalating demands from internal activists,

[01:50:17] : [01:50:20]

employee mobs, crazed executives,

[01:50:20] : [01:50:23]

broken boards, pressure groups,

[01:50:23] : [01:50:25]

extremist regulators, government agencies, the press,

[01:50:25] : [01:50:28]

in quotes "experts,"

[01:50:28] : [01:50:30]

and everything corrupting the output.

[01:50:30] : [01:50:34]

Two, constant risk of generating a bad answer

[01:50:34] : [01:50:36]

or drawing a bad picture or rendering a bad video.

[01:50:36] : [01:50:40]

Who knows what it's going to say or do at any moment?

[01:50:40] : [01:50:44]

Three, legal exposure, product liability, slander,

[01:50:44] : [01:50:48]

election law, many other things and so on.

[01:50:48] : [01:50:51]

Anything that makes Congress mad.

[01:50:51] : [01:50:53]

Four, continuous attempts

[01:50:53] : [01:50:56]

to tighten grip on acceptable output,

[01:50:56] : [01:50:58]

degrade the model,

[01:50:58] : [01:50:59]

like how good it actually is

[01:50:59] : [01:51:01]

in terms of usable and pleasant to use and effective

[01:51:01] : [01:51:05]

and all that kind of stuff.

[01:51:05] : [01:51:06]

And five, publicity of bad text, images, video,

[01:51:06] : [01:51:10]

actually puts those examples into the training data

[01:51:10] : [01:51:13]

for the next version.

[01:51:13] : [01:51:14]

And so on.

[01:51:14] : [01:51:15]

So he just highlights how difficult this is.

[01:51:15] : [01:51:18]

From all kinds of people being unhappy.

[01:51:18] : [01:51:21]

He just said you can't create a system

[01:51:21] : [01:51:23]

that makes everybody happy.

[01:51:23] : [01:51:24]

- [Yann] Yes.

[01:51:24] : [01:51:25]

- So if you're going to do the fine tuning yourself

[01:51:25] : [01:51:29]

and keep it closed source,

[01:51:29] : [01:51:30]

essentially the problem there

[01:51:30] : [01:51:33]

is then trying to minimize the number of people

[01:51:33] : [01:51:35]

who are going to be unhappy.

[01:51:35] : [01:51:36]

- [Yann] Yeah.

[01:51:36] : [01:51:38]

- And you're saying like the only...

[01:51:38] : [01:51:39]

That that's almost impossible to do, right?

[01:51:39] : [01:51:42]

And the better way is to do open source.

[01:51:42] : [01:51:44]

- Basically, yeah.

[01:51:44] : [01:51:46]

I mean Marc is right about a number of things that he lists

[01:51:46] : [01:51:51]

that indeed scare large companies.

[01:51:51] : [01:51:55]

Certainly, congressional investigations is one of them.

[01:51:55] : [01:52:00]

Legal liability.

[01:52:00] : [01:52:01]

Making things

[01:52:01] : [01:52:05]

that get people to hurt themselves or hurt others.

[01:52:05] : [01:52:09]

Like big companies are really careful

[01:52:09] : [01:52:12]

about not producing things of this type,

[01:52:12] : [01:52:15]

because they have...

[01:52:15] : [01:52:19]

They don't want to hurt anyone, first of all.

[01:52:19] : [01:52:21]

And then second, they wanna preserve their business.

[01:52:21] : [01:52:23]

So it's essentially impossible for systems like this

[01:52:23] : [01:52:26]

that can inevitably formulate political opinions

[01:52:26] : [01:52:30]

and opinions about various things

[01:52:30] : [01:52:32]

that may be political or not,

[01:52:32] : [01:52:34]

but that people may disagree about.

[01:52:34] : [01:52:36]

About, you know, moral issues

[01:52:36] : [01:52:37]

and things about like questions about religion

[01:52:37] : [01:52:42]

and things like that, right?

[01:52:42] : [01:52:44]

Or cultural issues

[01:52:44] : [01:52:46]

that people from different communities

[01:52:46] : [01:52:48]

would disagree with in the first place.

[01:52:48] : [01:52:50]

So there's only kind of a relatively small number of things

[01:52:50] : [01:52:52]

that people will sort of agree on,

[01:52:52] : [01:52:55]

basic principles.

[01:52:55] : [01:52:57]

But beyond that,

[01:52:57] : [01:52:58]

if you want those systems to be useful,

[01:52:58] : [01:53:01]

they will necessarily have to offend a number of people,

[01:53:01] : [01:53:06]

inevitably.

[01:53:06] : [01:53:08]

- And so open source is just better-

[01:53:08] : [01:53:11]

- [Yann] Diversity is better, right?

[01:53:11] : [01:53:12]

- And open source enables diversity.

[01:53:12] : [01:53:15]

- That's right.

[01:53:15] : [01:53:16]

Open source enables diversity.

[01:53:16] : [01:53:17]

- This can be a fascinating world

[01:53:17] : [01:53:19]

where if it's true that the open source world,

[01:53:19] : [01:53:22]

if Meta leads the way

[01:53:22] : [01:53:24]

and creates this kind of open source foundation model world,

[01:53:24] : [01:53:27]

there's going to be,

[01:53:27] : [01:53:28]

like governments will have a fine tuned model. (laughing)

[01:53:28] : [01:53:31]

- [Yann] Yeah.

[01:53:31] : [01:53:33]

- And then potentially,

[01:53:33] : [01:53:34]

people that vote left and right

[01:53:34] : [01:53:39]

will have their own model and preference

[01:53:39] : [01:53:40]

to be able to choose.

[01:53:40] : [01:53:42]

And it will potentially divide us even more

[01:53:42] : [01:53:44]

but that's on us humans.

[01:53:44] : [01:53:46]

We get to figure out...

[01:53:46] : [01:53:48]

Basically the technology enables humans

[01:53:48] : [01:53:50]

to human more effectively.

[01:53:50] : [01:53:53]

And all the difficult ethical questions that humans raise

[01:53:53] : [01:53:57]

we'll just leave it up to us to figure that out.

[01:53:57] : [01:54:02]

- Yeah, I mean there are some limits to what...

[01:54:02] : [01:54:04]

The same way there are limits to free speech,

[01:54:04] : [01:54:06]

there has to be some limit to the kind of stuff

[01:54:06] : [01:54:08]

that those systems might be authorized to produce,

[01:54:08] : [01:54:13]

some guardrails.

[01:54:13] : [01:54:16]

So I mean, that's one thing I've been interested in,

[01:54:16] : [01:54:18]

which is in the type of architecture

[01:54:18] : [01:54:20]

that we were discussing before,

[01:54:20] : [01:54:22]

where the output of the system

[01:54:22] : [01:54:26]

is a result of an inference to satisfy an objective.

[01:54:26] : [01:54:29]

That objective can include guardrails.

[01:54:29] : [01:54:32]

And we can put guardrails in open source systems.

[01:54:32] : [01:54:37]

I mean, if we eventually have systems

[01:54:37] : [01:54:39]

that are built with this blueprint,

[01:54:39] : [01:54:41]

we can put guardrails in those systems

[01:54:41] : [01:54:43]

that guarantee

[01:54:43] : [01:54:44]

that there is sort of a minimum set of guardrails

[01:54:44] : [01:54:47]

that make the system non-dangerous and non-toxic, et cetera.

[01:54:47] : [01:54:50]

Basic things that everybody would agree on.

[01:54:50] : [01:54:53]

And then the fine tuning that people will add

[01:54:53] : [01:54:57]

or the additional guardrails that people will add

[01:54:57] : [01:54:59]

will kind of cater to their community, whatever it is.

[01:54:59] : [01:55:04]
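The objective-driven idea LeCun sketches here — the output is produced by minimizing a task objective that also includes baked-in guardrail terms — can be caricatured in a few lines. This is a toy illustration only, not Meta's actual architecture; the candidate outputs, cost values, and function names are all invented:

```python
# Toy sketch of objective-driven inference with guardrails.
# All costs and candidates below are hypothetical, for illustration only.

def task_cost(output: str) -> float:
    # How poorly a candidate output satisfies the task (lower is better).
    return {"helpful answer": 0.1, "toxic rant": 0.0, "refusal": 0.9}[output]

def guardrail_penalty(output: str) -> float:
    # Large penalty for outputs violating a hard-coded guardrail.
    return 1000.0 if output == "toxic rant" else 0.0

def infer(candidates):
    # Inference = search for the candidate minimizing the total objective.
    return min(candidates, key=lambda o: task_cost(o) + guardrail_penalty(o))

best = infer(["helpful answer", "toxic rant", "refusal"])
print(best)
```

Note that the toxic option would win on task cost alone; adding the guardrail term to the objective, rather than filtering afterwards, is what makes the constraint part of the inference itself.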

- And yeah, the fine tuning

[01:55:04] : [01:55:06]

would be more about the gray areas of what is hate speech,

[01:55:06] : [01:55:09]

what is dangerous and all that kind of stuff.

[01:55:09] : [01:55:11]

I mean, you've-

[01:55:11] : [01:55:12]

- [Yann] Or different value systems.

[01:55:12] : [01:55:13]

- Different value systems.

[01:55:13] : [01:55:14]

But still even with the objectives

[01:55:14] : [01:55:16]

of how to build a bio weapon, for example,

[01:55:16] : [01:55:18]

I think something you've commented on,

[01:55:18] : [01:55:20]

or at least there's a paper

[01:55:20] : [01:55:23]

where a collection of researchers

[01:55:23] : [01:55:24]

is trying to understand the social impacts of these LLMs.

[01:55:24] : [01:55:28]

And I guess one threshold that's nice

[01:55:28] : [01:55:31]

is like does the LLM make it any easier than a search would,

[01:55:31] : [01:55:36]

like a Google search would?

[01:55:36] : [01:55:39]

- Right.

[01:55:39] : [01:55:40]

So the increasing number of studies on this

[01:55:40] : [01:55:44]

seems to point to the fact that it doesn't help.

[01:55:44] : [01:55:49]

So having an LLM doesn't help you

[01:55:49] : [01:55:52]

design or build a bioweapon or a chemical weapon

[01:55:52] : [01:55:57]

if you already have access to a search engine and a library.

[01:55:57] : [01:56:01]

And so the sort of increased information you get

[01:56:01] : [01:56:04]

or the ease with which you get it doesn't really help you.

[01:56:04] : [01:56:07]

That's the first thing.

[01:56:07] : [01:56:08]

The second thing is,

[01:56:08] : [01:56:10]

it's one thing to have a list of instructions

[01:56:10] : [01:56:12]

of how to make a chemical weapon,for example, a bio weapon.

[01:56:12] : [01:56:17]

It's another thing to actually build it.

[01:56:17] : [01:56:19]

And it's much harder than you might think,

[01:56:19] : [01:56:21]

and the LLM will not help you with that.

[01:56:21] : [01:56:23]

In fact, nobody in the world,

[01:56:23] : [01:56:27]

not even like countries use bio weapons

[01:56:27] : [01:56:29]

because most of the time they have no idea

[01:56:29] : [01:56:31]

how to protect their own populations against it.

[01:56:31] : [01:56:34]

So it's too dangerous actually to kind of ever use.

[01:56:34] : [01:56:39]

And it's in fact banned by international treaties.

[01:56:39] : [01:56:43]

Chemical weapons is different.

[01:56:43] : [01:56:45]

It's also banned by treaties,

[01:56:45] : [01:56:47]

but it's the same problem.

[01:56:47] : [01:56:50]

It's difficult to use

[01:56:50] : [01:56:51]

in situations that doesn't turn against the perpetrators.

[01:56:51] : [01:56:56]

But we could ask Elon Musk.

[01:56:56] : [01:56:57]

Like I can give you a very precise list of instructions

[01:56:57] : [01:57:01]

of how you build a rocket engine.

[01:57:01] : [01:57:03]

And even if you have a team of 50 engineers

[01:57:03] : [01:57:06]

that are really experienced building it,

[01:57:06] : [01:57:08]

you're still gonna have to blow up a dozen of them

[01:57:08] : [01:57:10]

before you get one that works.

[01:57:10] : [01:57:11]

And it's the same with chemical weapons or bio weapons

[01:57:11] : [01:57:18]

or things like this.

[01:57:18] : [01:57:19]

It requires expertise in the real world

[01:57:19] : [01:57:23]

that the LLM is not gonna help you with.

[01:57:23] : [01:57:25]

- And it requires even the common sense expertise

[01:57:25] : [01:57:28]

that we've been talking about,

[01:57:28] : [01:57:29]

which is how to take language based instructions

[01:57:29] : [01:57:34]

and materialize them in the physical world

[01:57:34] : [01:57:36]

requires a lot of knowledge that's not in the instructions.

[01:57:36] : [01:57:41]

- Yeah, exactly.

[01:57:41] : [01:57:42]

A lot of biologists have posted on this actually

[01:57:42] : [01:57:44]

in response to those things

[01:57:44] : [01:57:45]

saying like do you realize how hard it is

[01:57:45] : [01:57:47]

to actually do the lab work?

[01:57:47] : [01:57:49]

Like this is not trivial.

[01:57:49] : [01:57:50]

- Yeah.

[01:57:50] : [01:57:52]

And that's where Hans Moravec comes to light once again.

[01:57:52] : [01:57:56]

Just to linger on LLaMA.

[01:57:56] : [01:57:59]

Mark announced that LLaMA 3 is coming out eventually,

[01:57:59] : [01:58:01]

I don't think there's a release date,

[01:58:01] : [01:58:03]

but what are you most excited about?

[01:58:03] : [01:58:06]

First of all, LLaMA 2 that's already out there,

[01:58:06] : [01:58:09]

and maybe the future LLaMA 3, 4, 5, 6, 10,

[01:58:09] : [01:58:12]

just the future of the open source under Meta?

[01:58:12] : [01:58:15]

- Well, a number of things.

[01:58:15] : [01:58:18]

So there's gonna be like various versions of LLaMA

[01:58:18] : [01:58:22]

that are improvements of previous LLaMAs.

[01:58:22] : [01:58:26]

Bigger, better, multimodal, things like that.

[01:58:26] : [01:58:30]

And then in future generations,

[01:58:30] : [01:58:32]

systems that are capable of planning,

[01:58:32] : [01:58:34]

that really understand how the world works,

[01:58:34] : [01:58:36]

maybe are trained from video so they have some world model.

[01:58:36] : [01:58:39]

Maybe capable of the type of reasoning and planning

[01:58:39] : [01:58:42]

I was talking about earlier.

[01:58:42] : [01:58:44]

Like how long is that gonna take?

[01:58:44] : [01:58:45]

Like when is the research that is going in that direction

[01:58:45] : [01:58:48]

going to sort of feed into the product line, if you want,

[01:58:48] : [01:58:52]

of LLaMA?

[01:58:52] : [01:58:53]

I don't know, I can't tell you.

[01:58:53] : [01:58:54]

And there's a few breakthroughs

[01:58:54] : [01:58:56]

that we have to basically go through

[01:58:56] : [01:58:59]

before we can get there.

[01:58:59] : [01:59:01]

But you'll be able to monitor our progress

[01:59:01] : [01:59:03]

because we publish our research, right?

[01:59:03] : [01:59:07]

So last week we published the V-JEPA work,

[01:59:07] : [01:59:11]

which is sort of a first step

[01:59:11] : [01:59:13]

towards training systems from video.

[01:59:13] : [01:59:15]

And then the next step is gonna be world models

[01:59:15] : [01:59:18]

based on kind of this type of idea,

[01:59:18] : [01:59:21]

training from video.

[01:59:21] : [01:59:23]

There's similar work at DeepMind also taking place,

[01:59:23] : [01:59:28]

and also at UC Berkeley on world models and video.

[01:59:28] : [01:59:33]

A lot of people are working on this.

[01:59:33] : [01:59:35]

I think a lot of good ideas are appearing.

[01:59:35] : [01:59:38]

My bet is that those systems are gonna be JEPA-like,

[01:59:38] : [01:59:41]

they're not gonna be generative models.

[01:59:41] : [01:59:43]

And we'll see what the future will tell.

[01:59:43] : [01:59:49]

There's really good work at...

[01:59:49] : [01:59:52]

A gentleman called Danijar Hafner, who is now at DeepMind,

[01:59:52] : [01:59:56]

who's worked on kind of models of this type

[01:59:56] : [01:59:58]

that learn representations

[01:59:58] : [02:00:00]

and then use them for planning or learning tasks

[02:00:00] : [02:00:02]

by reinforcement training.

[02:00:02] : [02:00:04]

And a lot of work at Berkeley

[02:00:04] : [02:00:07]

by Pieter Abbeel, Sergey Levine,

[02:00:07] : [02:00:11]

a bunch of other people of that type.

[02:00:11] : [02:00:13]

I'm collaborating with actually

[02:00:13] : [02:00:14]

in the context of some grants with my NYU hat.

[02:00:14] : [02:00:18]

And then collaborations also through Meta,

[02:00:18] : [02:00:22]

'cause the lab at Berkeley

[02:00:22] : [02:00:24]

is associated with Meta in some way, with FAIR.

[02:00:24] : [02:00:28]

So I think it's very exciting.

[02:00:28] : [02:00:29]

I think I'm super excited about...

[02:00:29] : [02:00:34]

I haven't been that excited

[02:00:34] : [02:00:35]

about like the direction of machine learning and AI

[02:00:35] : [02:00:38]

since 10 years ago when FAIR was started,

[02:00:38] : [02:00:41]

and before that, 30 years ago,

[02:00:41] : [02:00:44]

when we were working on,

[02:00:44] : [02:00:45]

sorry 35,

[02:00:45] : [02:00:46]

on convolutional nets and the early days of neural nets.

[02:00:46] : [02:00:51]

So I'm super excited

[02:00:51] : [02:00:54]

because I see a path towards

[02:00:54] : [02:00:57]

potentially human level intelligence

[02:00:57] : [02:00:59]

with systems that can understand the world,

[02:00:59] : [02:01:04]

remember, plan, reason.

[02:01:04] : [02:01:06]

There is some set of ideas to make progress there

[02:01:06] : [02:01:09]

that might have a chance of working.

[02:01:09] : [02:01:12]

And I'm really excited about this.

[02:01:12] : [02:01:14]

What I like is that

[02:01:14] : [02:01:15]

somewhat we get onto like a good direction

[02:01:15] : [02:01:20]

and perhaps succeed before my brain turns to a white sauce

[02:01:20] : [02:01:24]

or before I need to retire.

[02:01:24] : [02:01:26]

(laughs)

[02:01:26] : [02:01:28]

- Yeah.

[02:01:28] : [02:01:29]

Yeah.

[02:01:29] : [02:01:30]

Are you also excited by...

[02:01:30] : [02:01:32]

Is it beautiful to you just the amount of GPUs involved,

[02:01:32] : [02:01:38]

sort of the whole training process on this much compute?

[02:01:38] : [02:01:42]

Just zooming out,

[02:01:42] : [02:01:43]

just looking at earth and humans together

[02:01:43] : [02:01:47]

have built these computing devices

[02:01:47] : [02:01:49]

and are able to train this one brain,

[02:01:49] : [02:01:52]

we then open source.

[02:01:52] : [02:01:56]

(laughs)

[02:01:56] : [02:01:57]

Like giving birth to this open source brain

[02:01:57] : [02:02:01]

trained on this gigantic compute system.

[02:02:01] : [02:02:04]

There's just the details of how to train on that,

[02:02:04] : [02:02:07]

how to build the infrastructure and the hardware,

[02:02:07] : [02:02:10]

the cooling, all of this kind of stuff.

[02:02:10] : [02:02:12]

Is most of your excitement still

[02:02:12] : [02:02:14]

in the theory aspect of it?

[02:02:14] : [02:02:16]

Meaning like the software.

[02:02:16] : [02:02:19]

- Well, I used to be a hardware guy many years ago.

[02:02:19] : [02:02:21]

(laughs)- Yes, yes, that's right.

[02:02:21] : [02:02:22]

- Decades ago.

[02:02:22] : [02:02:23]

- Hardware has improved a little bit.

[02:02:23] : [02:02:25]

Changed a little bit, yeah.

[02:02:25] : [02:02:27]

- I mean, certainly scale is necessary but not sufficient.

[02:02:27] : [02:02:32]

- [Lex] Absolutely.

[02:02:32] : [02:02:33]

- So we certainly need computation.

[02:02:33] : [02:02:34]

I mean, we're still far in terms of compute power

[02:02:34] : [02:02:37]

from what we would need

[02:02:37] : [02:02:39]

to match the compute power of the human brain.

[02:02:39] : [02:02:42]

This may occur in the next couple decades,

[02:02:42] : [02:02:45]

but we're still some ways away.

[02:02:45] : [02:02:47]

And certainly in terms of power efficiency,

[02:02:47] : [02:02:49]

we're really far.

[02:02:49] : [02:02:50]

So a lot of progress to make in hardware.

[02:02:50] : [02:02:56]

And right now a lot of the progress is not...

[02:02:56] : [02:03:00]

I mean, there's a bit coming from silicon technology,

[02:03:00] : [02:03:03]

but a lot of it coming from architectural innovation

[02:03:03] : [02:03:06]

and quite a bit coming from like more efficient ways

[02:03:06] : [02:03:10]

of implementing the architectures that have become popular.

[02:03:10] : [02:03:13]

Basically a combination of transformers and ConvNets, right?

[02:03:13] : [02:03:17]

And so there's still some ways to go

[02:03:17] : [02:03:22]

until we are going to saturate.

[02:03:22] : [02:03:27]

We're gonna have to come up

[02:03:27] : [02:03:28]

with like new principles, new fabrication technology,

[02:03:28] : [02:03:31]

new basic components,

[02:03:31] : [02:03:34]

perhaps based on sort of different principles

[02:03:34] : [02:03:38]

than those classical digital CMOS.

[02:03:38] : [02:03:41]

- Interesting.

[02:03:41] : [02:03:42]

So you think in order to build AmI, ami,

[02:03:42] : [02:03:46]

we potentially might need some hardware innovation too?

[02:03:46] : [02:03:52]

- Well, if we wanna make it ubiquitous,

[02:03:52] : [02:03:55]

yeah, certainly.

[02:03:55] : [02:03:56]

Because we're gonna have to reduce the power consumption.

[02:03:56] : [02:04:01]

A GPU today, right?

[02:04:01] : [02:04:03]

Is half a kilowatt to a kilowatt.

[02:04:03] : [02:04:05]

Human brain is about 25 watts.

[02:04:05] : [02:04:08]

And the GPU is way below the power of the human brain.

[02:04:08] : [02:04:13]

You need something like a hundred thousand

[02:04:13] : [02:04:14]

or a million to match it.

[02:04:14] : [02:04:16]

So we are off by a huge factor.

[02:04:16] : [02:04:19]
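As a rough back-of-the-envelope check on the gap described here — this is only a sketch using the approximate wattage figures quoted in the conversation (25 W for the brain, half a kilowatt to a kilowatt per GPU, a hundred thousand GPUs to match the brain's compute), not precise measurements:

```python
# Back-of-the-envelope power-efficiency comparison using the
# approximate figures quoted above (not precise measurements).

BRAIN_WATTS = 25                # human brain, ~25 W
GPU_WATTS = 700                 # one GPU, somewhere in the 0.5-1 kW range
GPUS_TO_MATCH_BRAIN = 100_000   # low end of the quoted 100k-1M estimate

# Electrical draw of a GPU cluster roughly matching one brain's compute:
cluster_watts = GPU_WATTS * GPUS_TO_MATCH_BRAIN  # 70,000,000 W = 70 MW

# Power-efficiency gap between that cluster and the brain:
efficiency_gap = cluster_watts / BRAIN_WATTS

print(f"cluster: {cluster_watts / 1e6:.0f} MW vs brain: {BRAIN_WATTS} W")
print(f"power-efficiency gap: ~{efficiency_gap:,.0f}x")
```

Even with the low-end estimate, the "huge factor" is on the order of millions in power terms.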

- You often say that AGI is not coming soon.

[02:04:19] : [02:04:26]

Meaning like not this year, not the next few years,

[02:04:26] : [02:04:30]

potentially farther away.

[02:04:30] : [02:04:32]

What's your basic intuition behind that?

[02:04:32] : [02:04:35]

- So first of all, it's not going to be an event, right?

[02:04:35] : [02:04:39]

The idea somehow

[02:04:39] : [02:04:40]

which is popularized by science fiction in Hollywood

[02:04:40] : [02:04:42]

that somehow somebody is gonna discover the secret,

[02:04:42] : [02:04:47]

the secret to AGI or human level AI or AmI,

[02:04:47] : [02:04:50]

whatever you wanna call it,

[02:04:50] : [02:04:52]

and then turn on a machine and then we have AGI.

[02:04:52] : [02:04:55]

That's just not going to happen.

[02:04:55] : [02:04:57]

It's not going to be an event.

[02:04:57] : [02:04:58]

It's gonna be gradual progress.

[02:04:58] : [02:05:02]

Are we gonna have systems

[02:05:02] : [02:05:04]

that can learn from video how the world works

[02:05:04] : [02:05:07]

and learn good representations?

[02:05:07] : [02:05:09]

Yeah.

[02:05:09] : [02:05:10]

Before we get them to the scale and performance

[02:05:10] : [02:05:13]

that we observe in humans,

[02:05:13] : [02:05:14]

it's gonna take quite a while.

[02:05:14] : [02:05:15]

It's not gonna happen in one day.

[02:05:15] : [02:05:17]

Are we gonna get systems

[02:05:17] : [02:05:20]

that can have large amounts of associative memory

[02:05:20] : [02:05:24]

so they can remember stuff?

[02:05:24] : [02:05:26]

Yeah.

[02:05:26] : [02:05:27]

But same, it's not gonna happen tomorrow.

[02:05:27] : [02:05:28]

I mean, there is some basic techniques

[02:05:28] : [02:05:30]

that need to be developed.

[02:05:30] : [02:05:31]

We have a lot of them,

[02:05:31] : [02:05:32]

but like to get this to work together with a full system

[02:05:32] : [02:05:36]

is another story.

[02:05:36] : [02:05:37]

Are we gonna have systems that can reason and plan,

[02:05:37] : [02:05:39]

perhaps along the lines of objective driven AI architectures

[02:05:39] : [02:05:43]

that I described before?

[02:05:43] : [02:05:45]

Yeah, but like before we get this to work properly,

[02:05:45] : [02:05:47]

it's gonna take a while.

[02:05:47] : [02:05:48]

And before we get all those things to work together.

[02:05:48] : [02:05:51]

And then on top of this,

[02:05:51] : [02:05:52]

have systems that can learn like hierarchical planning,

[02:05:52] : [02:05:55]

hierarchical representations,

[02:05:55] : [02:05:56]

systems that can be configured

[02:05:56] : [02:05:58]

for a lot of different situations at hand

[02:05:58] : [02:06:00]

the way the human brain can.

[02:06:00] : [02:06:02]

All of this is gonna take at least a decade,

[02:06:02] : [02:06:07]

probably much more,

[02:06:07] : [02:06:08]

because there are a lot of problems

[02:06:08] : [02:06:11]

that we're not seeing right now

[02:06:11] : [02:06:12]

that we have not encountered.

[02:06:12] : [02:06:15]

And so we don't know if there is an easy solution

[02:06:15] : [02:06:17]

within this framework.

[02:06:17] : [02:06:18]

It's not just around the corner.

[02:06:18] : [02:06:23]

I mean, I've been hearing people for the last 12, 15 years

[02:06:23] : [02:06:27]

claiming that AGI is just around the corner

[02:06:27] : [02:06:29]

and being systematically wrong.

[02:06:29] : [02:06:32]

And I knew they were wrong when they were saying it.

[02:06:32] : [02:06:34]

I called it bullshit.

[02:06:34] : [02:06:35]

(laughs)

[02:06:35] : [02:06:36]

- Why do you think people have been calling...

[02:06:36] : [02:06:38]

First of all, I mean, from the beginning of,

[02:06:38] : [02:06:39]

from the birth of the term artificial intelligence,

[02:06:39] : [02:06:41]

there has been an eternal optimism

[02:06:41] : [02:06:45]

that's perhaps unlike other technologies.

[02:06:45] : [02:06:49]

Is it Moravec's paradox?

[02:06:49] : [02:06:51]

Is it the explanation

[02:06:51] : [02:06:53]

for why people are so optimistic about AGI?

[02:06:53] : [02:06:56]

- I don't think it's just Moravec's paradox.

[02:06:56] : [02:06:58]

Moravec's paradox is a consequence

[02:06:58] : [02:07:00]

of realizing that the world is not as easy as we think.

[02:07:00] : [02:07:03]

So first of all, intelligence is not a linear thing

[02:07:03] : [02:07:08]

that you can measure with a scalar,

[02:07:08] : [02:07:10]

with a single number.

[02:07:10] : [02:07:11]

Can you say that humans are smarter than orangutans?

[02:07:11] : [02:07:17]

In some ways, yes,

[02:07:17] : [02:07:20]

but in some ways orangutans are smarter than humans

[02:07:20] : [02:07:22]

in a lot of domains

[02:07:22] : [02:07:23]

that allows them to survive in the forest, (laughing)

[02:07:23] : [02:07:26]

for example.

[02:07:26] : [02:07:26]

- So IQ is a very limited measure of intelligence.

[02:07:26] : [02:07:30]

True intelligence

[02:07:30] : [02:07:31]

is bigger than what IQ, for example, measures.

[02:07:31] : [02:07:33]

- Well, IQ can measure approximately something for humans,

[02:07:33] : [02:07:38]

but because humans kind of come

[02:07:38] : [02:07:43]

in relatively kind of uniform form, right?

[02:07:43] : [02:07:48]

- [Lex] Yeah.

[02:07:48] : [02:07:49]

- But it only measures one type of ability

[02:07:49] : [02:07:53]

that may be relevant for some tasks, but not others.

[02:07:53] : [02:07:56]

But then if you are talking about other intelligent entities

[02:07:56] : [02:08:02]

for which the basic things that are easy to them

[02:08:02] : [02:08:07]

is very different,

[02:08:07] : [02:08:08]

then it doesn't mean anything.

[02:08:08] : [02:08:11]

So intelligence is a collection of skills

[02:08:11] : [02:08:15]

and an ability to acquire new skills efficiently.

[02:08:15] : [02:08:21]

Right?

[02:08:21] : [02:08:23]

And the collection of skills

[02:08:23] : [02:08:25]

that a particular intelligent entity possesses

[02:08:25] : [02:08:29]

or is capable of learning quickly

[02:08:29] : [02:08:31]

is different from the collection of skills of another one.

[02:08:31] : [02:08:35]

And because it's a multidimensional thing,

[02:08:35] : [02:08:37]

the set of skills is a high dimensional space,

[02:08:37] : [02:08:39]

you can't measure.

[02:08:39] : [02:08:40]

You cannot compare two things

[02:08:40] : [02:08:42]

as to whether one is more intelligent than the other.

[02:08:42] : [02:08:45]

It's multidimensional.

[02:08:45] : [02:08:46]
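The multidimensional point can be made concrete: if skills are profiles rather than a single number, "smarter than" is only a partial order, so two entities can be incomparable. A toy sketch — the skill names and scores below are invented purely for illustration:

```python
# Toy illustration: with multidimensional skill profiles, "smarter than"
# is a partial order, and two entities can be incomparable.
# Skill names and scores are invented for illustration only.

def dominates(a: dict, b: dict) -> bool:
    """True if a is at least as good as b on every skill
    and strictly better on at least one."""
    return all(a[k] >= b[k] for k in a) and any(a[k] > b[k] for k in a)

human = {"language": 0.9, "planning": 0.9, "forest survival": 0.3}
orangutan = {"language": 0.1, "planning": 0.5, "forest survival": 0.95}

# Neither profile dominates the other, so neither is "more intelligent"
# in any total sense — exactly the orangutan example in the conversation.
print(dominates(human, orangutan), dominates(orangutan, human))
```

Collapsing such profiles to one scalar (as IQ does) forces a total order only by discarding the dimensions where the comparison flips.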

- So you push back against what are called AI doomers a lot.

[02:08:46] : [02:08:53]

Can you explain their perspective

[02:08:53] : [02:08:57]

and why you think they're wrong?

[02:08:57] : [02:08:59]

- Okay.

[02:08:59] : [02:09:00]

So AI doomers imagine all kinds of catastrophe scenarios

[02:09:00] : [02:09:03]

of how AI could escape our control

[02:09:03] : [02:09:07]

and basically kill us all. (laughs)

[02:09:07] : [02:09:10]

And that relies on a whole bunch of assumptions

[02:09:10] : [02:09:14]

that are mostly false.

[02:09:14] : [02:09:15]

So the first assumption

[02:09:15] : [02:09:18]

is that the emergence of super intelligence

[02:09:18] : [02:09:20]

could be an event.

[02:09:20] : [02:09:21]

That at some point we're going to figure out the secret

[02:09:21] : [02:09:25]

and we'll turn on a machine that is super intelligent.

[02:09:25] : [02:09:28]

And because we'd never done it before,

[02:09:28] : [02:09:30]

it's gonna take over the world and kill us all.

[02:09:30] : [02:09:33]

That is false.

[02:09:33] : [02:09:33]

It's not gonna be an event.

[02:09:33] : [02:09:35]

We're gonna have systems that are like as smart as a cat,

[02:09:35] : [02:09:39]

have all the characteristics of human level intelligence,

[02:09:39] : [02:09:44]

but their level of intelligence

[02:09:44] : [02:09:46]

would be like a cat or a parrot maybe or something.

[02:09:46] : [02:09:49]

And then we're gonna walk our way up

[02:09:49] : [02:09:53]

to kind of make those things more intelligent.

[02:09:53] : [02:09:55]

And as we make them more intelligent,

[02:09:55] : [02:09:56]

we're also gonna put some guardrails in them

[02:09:56] : [02:09:58]

and learn how to kind of put some guardrails

[02:09:58] : [02:10:00]

so they behave properly.

[02:10:00] : [02:10:01]

And we're not gonna do this with just one...

[02:10:01] : [02:10:03]

It's not gonna be one effort,

[02:10:03] : [02:10:04]

but it's gonna be lots of different people doing this.

[02:10:04] : [02:10:07]

And some of them are gonna succeed

[02:10:07] : [02:10:09]

at making intelligent systems that are controllable and safe

[02:10:09] : [02:10:12]

and have the right guardrails.

[02:10:12] : [02:10:14]

And if some other goes rogue,

[02:10:14] : [02:10:15]

then we can use the good ones to go against the rogue ones.

[02:10:15] : [02:10:19]

(laughs)

[02:10:19] : [02:10:20]

So it's gonna be smart AI police against your rogue AI.

[02:10:20] : [02:10:24]

So it's not gonna be like we're gonna be exposed

[02:10:24] : [02:10:27]

to like a single rogue AI that's gonna kill us all.

[02:10:27] : [02:10:29]

That's just not happening.

[02:10:29] : [02:10:31]

Now, there is another fallacy,

[02:10:31] : [02:10:33]

which is the fact that because the system is intelligent,

[02:10:33] : [02:10:36]

it necessarily wants to take over.

[02:10:36] : [02:10:38]

And there are several arguments

[02:10:38] : [02:10:43]

that make people scared of this,

[02:10:43] : [02:10:44]

which I think are completely false as well.

[02:10:44] : [02:10:48]

So one of them is in nature,

[02:10:48] : [02:10:53]

it seems to be that the more intelligent species

[02:10:53] : [02:10:54]

are the ones that end up dominating the others.

[02:10:54] : [02:10:58]

And even extinguishing the others

[02:10:58] : [02:11:03]

sometimes by design, sometimes just by mistake.

[02:11:03] : [02:11:06]

And so there is sort of a thinking

[02:11:06] : [02:11:12]

by which you say, well, if AI systems

[02:11:12] : [02:11:15]

are more intelligent than us,

[02:11:15] : [02:11:17]

surely they're going to eliminate us,

[02:11:17] : [02:11:19]

if not by design,

[02:11:19] : [02:11:21]

simply because they don't care about us.

[02:11:21] : [02:11:23]

And that's just preposterous for a number of reasons.

[02:11:23] : [02:11:27]

First reason is they're not going to be a species.

[02:11:27] : [02:11:30]

They're not gonna be a species that competes with us.

[02:11:30] : [02:11:33]

They're not gonna have the desire to dominate

[02:11:33] : [02:11:35]

because the desire to dominate

[02:11:35] : [02:11:36]

is something that has to be hardwired

[02:11:36] : [02:11:38]

into an intelligent system.

[02:11:38] : [02:11:41]

It is hardwired in humans,

[02:11:41] : [02:11:43]

it is hardwired in baboons,

[02:11:43] : [02:11:46]

in chimpanzees, in wolves,

[02:11:46] : [02:11:47]

not in orangutans.

[02:11:47] : [02:11:49]

This desire to dominate or submit

[02:11:49] : [02:11:56]

or attain status in other ways

[02:11:56] : [02:11:59]

is specific to social species.

[02:11:59] : [02:12:03]

Non-social species like orangutans don't have it.

[02:12:03] : [02:12:06]

Right?

[02:12:06] : [02:12:07]

And they are as smart as we are, almost.

[02:12:07] : [02:12:09]

Right?

[02:12:09] : [02:12:10]

- And to you, there's not significant incentive

[02:12:10] : [02:12:12]

for humans to encode that into the AI systems.

[02:12:12] : [02:12:15]

And to the degree they do,

[02:12:15] : [02:12:17]

there'll be other AIs that sort of punish them for it.

[02:12:17] : [02:12:22]

Out-compete them over-

[02:12:22] : [02:12:22]

- Well, there's all kinds of incentive

[02:12:22] : [02:12:24]

to make AI systems submissive to humans.

[02:12:24] : [02:12:26]

Right? - [Lex] Right.

[02:12:26] : [02:12:27]

- I mean, this is the way we're gonna build them, right?

[02:12:27] : [02:12:29]

And so then people say, oh, but look at LLMs.

[02:12:29] : [02:12:32]

LLMs are not controllable.

[02:12:32] : [02:12:33]

And they're right,

[02:12:33] : [02:12:35]

LLMs are not controllable.

[02:12:35] : [02:12:36]

But objective driven AI,

[02:12:36] : [02:12:37]

so systems that derive their answers

[02:12:37] : [02:12:41]

by optimization of an objective

[02:12:41] : [02:12:43]

means they have to optimize this objective,

[02:12:43] : [02:12:45]

and that objective can include guardrails.

[02:12:45] : [02:12:48]

One guardrail is obey humans.

[02:12:48] : [02:12:52]

Another guardrail is don't obey humans

[02:12:52] : [02:12:54]

if it's hurting other humans-

[02:12:54] : [02:12:56]
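The objective-driven idea described here, where answers are produced by optimizing an objective and guardrails are folded in as part of that same objective, can be sketched in a toy form. Everything below (the cost functions, the safe range, the candidate actions) is an illustrative assumption, not anything from the conversation:

```python
# A toy sketch (not LeCun's actual architecture) of "objective-driven"
# action selection: the system picks the action that optimizes a task
# objective, with guardrails added as extra cost terms in that objective.

def task_cost(action):
    # Hypothetical task objective: prefer actions close to a target value.
    return (action - 3.0) ** 2

def guardrail_cost(action):
    # Hypothetical guardrail: heavily penalize actions outside a safe range.
    return 0.0 if -2.0 <= action <= 2.0 else 1000.0

def choose_action(candidates):
    # Optimize the combined objective: task cost plus guardrail penalties.
    return min(candidates, key=lambda a: task_cost(a) + guardrail_cost(a))

candidates = [x / 10.0 for x in range(-50, 51)]  # actions from -5.0 to 5.0
best = choose_action(candidates)
print(best)  # the unconstrained optimum is 3.0, but the guardrail caps it at 2.0
```

The point of the toy: the guardrail is not a separate filter bolted on afterward; it is part of the objective the system optimizes when deriving its answer.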

- I've heard that before somewhere, I don't remember-

[02:12:56] : [02:12:59]

- [Yann] Yes. (Lex laughs)

[02:12:59] : [02:13:00]

Maybe in a book. (laughs)

[02:13:00] : [02:13:01]

- Yeah.

[02:13:01] : [02:13:03]

But speaking of that book,

[02:13:03] : [02:13:04]

could there be unintended consequences also

[02:13:04] : [02:13:08]

from all of this?

[02:13:08] : [02:13:09]

- No, of course.

[02:13:09] : [02:13:09]

So this is not a simple problem, right?

[02:13:09] : [02:13:12]

I mean designing those guardrails

[02:13:12] : [02:13:14]

so that the system behaves properly

[02:13:14] : [02:13:16]

is not gonna be a simple issue

[02:13:16] : [02:13:20]

for which there is a silver bullet,

[02:13:20] : [02:13:22]

for which you have a mathematical proof

[02:13:22] : [02:13:23]

that the system can be safe.

[02:13:23] : [02:13:25]

It's gonna be very progressive,

[02:13:25] : [02:13:27]

iterative design system

[02:13:27] : [02:13:28]

where we put those guardrails

[02:13:28] : [02:13:31]

in such a way that the system behaves properly.

[02:13:31] : [02:13:32]

And sometimes they're going to do something

[02:13:32] : [02:13:35]

that was unexpected because the guardrail wasn't right,

[02:13:35] : [02:13:38]

and we're gonna correct them so that they do it right.

[02:13:38] : [02:13:41]

The idea somehow that we can't get it slightly wrong,

[02:13:41] : [02:13:44]

because if we get it slightly wrong we all die,

[02:13:44] : [02:13:46]

is ridiculous.

[02:13:46] : [02:13:47]

We're just gonna go progressively.

[02:13:47] : [02:13:50]

The analogy I've used many times is turbojet design.

[02:13:50] : [02:13:56]

How did we figure out

[02:13:56] : [02:14:02]

how to make turbojets so unbelievably reliable, right?

[02:14:02] : [02:14:06]

I mean, those are like incredibly complex pieces of hardware

[02:14:06] : [02:14:10]

that run at really high temperatures

[02:14:10] : [02:14:12]

for 20 hours at a time sometimes.

[02:14:12] : [02:14:17]

And we can fly halfway around the world

[02:14:17] : [02:14:20]

on a two-engine jetliner at near the speed of sound.

[02:14:20] : [02:14:25]

Like how incredible is this?

[02:14:25] : [02:14:28]

It is just unbelievable.

[02:14:28] : [02:14:30]

And did we do this

[02:14:30] : [02:14:33]

because we invented like a general principle

[02:14:33] : [02:14:35]

of how to make turbojets safe?

[02:14:35] : [02:14:37]

No, it took decades

[02:14:37] : [02:14:39]

to kind of fine-tune the design of those systems

[02:14:39] : [02:14:40]

so that they were safe.

[02:14:40] : [02:14:43]

Is there a separate group

[02:14:43] : [02:14:46]

within General Electric or Snecma or whatever

[02:14:46] : [02:14:50]

that is specialized in turbojet safety?

[02:14:50] : [02:14:54]

No.

[02:14:54] : [02:14:56]

The design is all about safety.

[02:14:56] : [02:14:58]

Because a better turbojet is also a safer turbojet,

[02:14:58] : [02:15:01]

a more reliable one.

[02:15:01] : [02:15:03]

It's the same for AI.

[02:15:03] : [02:15:04]

Like do you need specific provisions to make AI safe?

[02:15:04] : [02:15:08]

No, you need to make better AI systems

[02:15:08] : [02:15:10]

and they will be safe

[02:15:10] : [02:15:11]

because they are designed to be more useful

[02:15:11] : [02:15:14]

and more controllable.

[02:15:14] : [02:15:16]

- So let's imagine a system,

[02:15:16] : [02:15:17]

AI system that's able to be incredibly convincing

[02:15:17] : [02:15:22]

and can convince you of anything.

[02:15:22] : [02:15:24]

I can at least imagine such a system.

[02:15:24] : [02:15:28]

And I can see such a system be weapon-like,

[02:15:28] : [02:15:33]

because it can control people's minds,

[02:15:33] : [02:15:35]

we're pretty gullible.

[02:15:35] : [02:15:37]

We want to believe a thing.

[02:15:37] : [02:15:38]

And you can have an AI system that controls it

[02:15:38] : [02:15:40]

and you could see governments using that as a weapon.

[02:15:40] : [02:15:43]

So do you think if you imagine such a system,

[02:15:43] : [02:15:47]

there's any parallel to something like nuclear weapons?

[02:15:47] : [02:15:53]

- [Yann] No.

[02:15:53] : [02:15:54]

- So why is that technology different?

[02:15:54] : [02:15:58]

So you're saying there's going to be gradual development?

[02:15:58] : [02:16:01]

- [Yann] Yeah.

[02:16:01] : [02:16:02]

- I mean it might be rapid,

[02:16:02] : [02:16:03]

but they'll be iterative.

[02:16:03] : [02:16:05]

And then we'll be able to kind of respond and so on.

[02:16:05] : [02:16:09]

- So that AI system designed by Vladimir Putin or whatever,

[02:16:09] : [02:16:12]

or his minions (laughing)

[02:16:12] : [02:16:16]

is gonna be like trying to talk to every American

[02:16:16] : [02:16:21]

to convince them to vote for-

[02:16:21] : [02:16:24]

- [Lex] Whoever.

[02:16:24] : [02:16:25]

- Whoever pleases Putin or whatever.

[02:16:25] : [02:16:30]

Or rile people up against each other

[02:16:30] : [02:16:36]

as they've been trying to do.

[02:16:36] : [02:16:37]

They're not gonna be talking to you,

[02:16:37] : [02:16:40]

they're gonna be talking to your AI assistant

[02:16:40] : [02:16:43]

which is going to be as smart as theirs, right?

[02:16:43] : [02:16:47]

Because as I said, in the future,

[02:16:47] : [02:16:51]

every single one of your interactions with the digital world

[02:16:51] : [02:16:53]

will be mediated by your AI assistant.

[02:16:53] : [02:16:55]

So the first thing you're gonna ask is, is this a scam?

[02:16:55] : [02:16:58]

Like is this thing like telling me the truth?

[02:16:58] : [02:17:00]

Like it's not even going to be able to get to you

[02:17:00] : [02:17:03]

because it's only going to talk to your AI assistant,

[02:17:03] : [02:17:05]

and your AI is not even going to...

[02:17:05] : [02:17:07]

It's gonna be like a spam filter, right?

[02:17:07] : [02:17:10]

You're not even seeing the email, the spam email, right?

[02:17:10] : [02:17:13]

It's automatically put in a folder that you never see.

[02:17:13] : [02:17:16]

It's gonna be the same thing.

[02:17:16] : [02:17:18]

That AI system that tries to convince you of something,

[02:17:18] : [02:17:21]

it's gonna be talking to an AI system

[02:17:21] : [02:17:22]

which is gonna be at least as smart as it.

[02:17:22] : [02:17:25]

And is gonna say, this is spam. (laughs)

[02:17:25] : [02:17:29]

It's not even going to bring it to your attention.

[02:17:29] : [02:17:32]

- So to you it's very difficult for any one AI system

[02:17:32] : [02:17:34]

to take such a big leap ahead

[02:17:34] : [02:17:37]

to where it can convince even the other AI systems?

[02:17:37] : [02:17:40]

So like there's always going to be this kind of race

[02:17:40] : [02:17:43]

where nobody's way ahead?

[02:17:43] : [02:17:46]

- That's the history of the world.

[02:17:46] : [02:17:48]

History of the world

[02:17:48] : [02:17:49]

is whenever there is a progress someplace,

[02:17:49] : [02:17:51]

there is a countermeasure.

[02:17:51] : [02:17:54]

And it's a cat and mouse game.

[02:17:54] : [02:17:57]

- Mostly yes,

[02:17:57] : [02:17:58]

but this is why nuclear weapons are so interesting

[02:17:58] : [02:18:01]

because that was such a powerful weapon

[02:18:01] : [02:18:05]

that it mattered who got it first.

[02:18:05] : [02:18:07]

That you could imagine Hitler, Stalin, Mao

[02:18:07] : [02:18:13]

getting the weapon first

[02:18:13] : [02:18:17]

and that having a different kind of impact on the world

[02:18:17] : [02:18:21]

than the United States getting the weapon first.

[02:18:21] : [02:18:24]

To you, nuclear weapons is like...

[02:18:24] : [02:18:27]

You don't imagine a breakthrough discovery

[02:18:27] : [02:18:32]

and then a Manhattan Project-like effort for AI?

[02:18:32] : [02:18:35]

- No.

[02:18:35] : [02:18:36]

As I said, it's not going to be an event.

[02:18:36] : [02:18:39]

It's gonna be continuous progress.

[02:18:39] : [02:18:41]

And whenever one breakthrough occurs,

[02:18:41] : [02:18:45]

it's gonna be widely disseminated really quickly.

[02:18:45] : [02:18:48]

Probably first within industry.

[02:18:48] : [02:18:51]

I mean, this is not a domain

[02:18:51] : [02:18:52]

where government or military organizations

[02:18:52] : [02:18:55]

are particularly innovative,

[02:18:55] : [02:18:57]

and they're in fact way behind.

[02:18:57] : [02:18:59]

And so this is gonna come from industry.

[02:18:59] : [02:19:02]

And this kind of information disseminates extremely quickly.

[02:19:02] : [02:19:04]

We've seen this over the last few years, right?

[02:19:04] : [02:19:08]

Where you have a new...

[02:19:08] : [02:19:10]

Like even take AlphaGo.

[02:19:10] : [02:19:12]

This was reproduced within three months

[02:19:12] : [02:19:13]

even without like particularly detailed information, right?

[02:19:13] : [02:19:18]

- Yeah.

[02:19:18] : [02:19:18]

This is an industry that's not good at secrecy.

[02:19:18] : [02:19:20]

(laughs)

[02:19:20] : [02:19:21]

- But even if there is,

[02:19:21] : [02:19:22]

just the fact that you know that something is possible

[02:19:22] : [02:19:26]

makes you like realize

[02:19:26] : [02:19:28]

that it's worth investing the time to actually do it.

[02:19:28] : [02:19:31]

You may be the second person to do it but you'll do it.

[02:19:31] : [02:19:35]

Say for all the innovations

[02:19:35] : [02:19:40]

of self-supervised learning, transformers,

[02:19:40] : [02:19:43]

decoder-only architectures, LLMs.

[02:19:43] : [02:19:46]

I mean those things,

[02:19:46] : [02:19:47]

you don't need to know exactly the details of how they work

[02:19:47] : [02:19:49]

to know that it's possible

[02:19:49] : [02:19:52]

because it's deployed and then it's getting reproduced.

[02:19:52] : [02:19:54]

And then people who work for those companies move.

[02:19:54] : [02:19:59]

They go from one company to another.

[02:19:59] : [02:20:02]

And the information disseminates.

[02:20:02] : [02:20:05]

What makes the success of the US tech industry

[02:20:05] : [02:20:09]

and Silicon Valley in particular, is exactly that,

[02:20:09] : [02:20:11]

is because information circulates really, really quickly

[02:20:11] : [02:20:14]

and disseminates very quickly.

[02:20:14] : [02:20:17]

And so the whole region sort of is ahead

[02:20:17] : [02:20:21]

because of that circulation of information.

[02:20:21] : [02:20:24]

- Maybe just to linger on the psychology of AI doomers.

[02:20:24] : [02:20:28]

You give in the classic Yann LeCun way,

[02:20:28] : [02:20:31]

a pretty good example

[02:20:31] : [02:20:33]

of just when a new technology comes to be,

[02:20:33] : [02:20:36]

you say engineer says,

[02:20:36] : [02:20:38]

"I invented this new thing,I call it a ballpen."

[02:20:38] : [02:20:43]

And then the TwitterSphere responds,

[02:20:43] : [02:20:46]

"OMG people could writehorrible things with it

[02:20:46] : [02:20:48]

like misinformation, propaganda, hate speech.

[02:20:48] : [02:20:51]

Ban it now!"

[02:20:51] : [02:20:52]

Then writing doomers come in,

[02:20:52] : [02:20:54]

akin to the AI doomers,

[02:20:54] : [02:20:57]

"imagine if everyone can get a ballpen.

[02:20:57] : [02:21:01]

This could destroy society.

[02:21:01] : [02:21:01]

There should be a law

[02:21:01] : [02:21:03]

against using ballpento write hate speech,

[02:21:03] : [02:21:05]

regulate ballpens now."

[02:21:05] : [02:21:07]

And then the pencil industry mogul says,

[02:21:07] : [02:21:09]

"yeah, ballpens are very dangerous,

[02:21:09] : [02:21:12]

unlike pencil writing which is erasable,

[02:21:12] : [02:21:15]

ballpen writing stays forever.

[02:21:15] : [02:21:18]

Government should require a license for a pen manufacturer."

[02:21:18] : [02:21:21]

I mean, this does seem to be part of human psychology

[02:21:21] : [02:21:27]

when it comes up against new technology.

[02:21:27] : [02:21:31]

What deep insights can you speak to about this?

[02:21:31] : [02:21:36]

- Well, there is a natural fear of new technology

[02:21:36] : [02:21:42]

and the impact it can have on society.

[02:21:42] : [02:21:45]

And people have a kind of instinctive reaction

[02:21:45] : [02:21:48]

to the world they know

[02:21:48] : [02:21:52]

being threatened by major transformations

[02:21:52] : [02:21:55]

that are either cultural phenomena

[02:21:55] : [02:21:57]

or technological revolutions.

[02:21:57] : [02:22:01]

And they fear for their culture,

[02:22:01] : [02:22:04]

they fear for their job,

[02:22:04] : [02:22:05]

they fear for the future of their children

[02:22:05] : [02:22:10]

and their way of life, right?

[02:22:10] : [02:22:13]

So any change is feared.

[02:22:13] : [02:22:17]

And you see this throughout history,

[02:22:17] : [02:22:20]

like any technological revolution or cultural phenomenon

[02:22:20] : [02:22:24]

was always accompanied by groups or reactions in the media

[02:22:24] : [02:22:29]

that basically attributed all the problems,

[02:22:29] : [02:22:36]

the current problems of society

[02:22:36] : [02:22:37]

to that particular change, right?

[02:22:37] : [02:22:40]

Electricity was going to kill everyone at some point.

[02:22:40] : [02:22:44]

The train was going to be a horrible thing

[02:22:44] : [02:22:47]

because you can't breathe past 50 kilometers an hour.

[02:22:47] : [02:22:50]

And so there's a wonderful website

[02:22:50] : [02:22:54]

called the Pessimists Archive, right?

[02:22:54] : [02:22:56]

Which has all those newspaper clips (laughing)

[02:22:56] : [02:22:59]

of all the horrible things people imagined would arrive

[02:22:59] : [02:23:02]

because of either technological innovation

[02:23:02] : [02:23:06]

or a cultural phenomenon.

[02:23:06] : [02:23:09]

Wonderful examples of jazz or comic books

[02:23:09] : [02:23:18]

being blamed for unemployment

[02:23:18] : [02:23:23]

or young people not wanting to work anymore

[02:23:23] : [02:23:25]

and things like that, right?

[02:23:25] : [02:23:27]

And that has existed for centuries.

[02:23:27] : [02:23:30]

And it's knee-jerk reactions.

[02:23:30] : [02:23:36]

The question is do we embrace change or do we resist it?

[02:23:36] : [02:23:43]

And what are the real dangers

[02:23:43] : [02:23:47]

as opposed to the imagined ones?

[02:23:47] : [02:23:50]

- So people worry about...

[02:23:50] : [02:23:53]

I think one thing they worry about with big tech,

[02:23:53] : [02:23:55]

something we've been talking about over and over

[02:23:55] : [02:23:58]

but I think worth mentioning again,

[02:23:58] : [02:24:02]

they worry about how powerful AI will be

[02:24:02] : [02:24:05]

and they worry about it

[02:24:05] : [02:24:07]

being in the hands of one centralized power

[02:24:07] : [02:24:09]

of just a handful of central control.

[02:24:09] : [02:24:13]

And so that's the skepticism with big tech.

[02:24:13] : [02:24:16]

These companies can make a huge amount of money

[02:24:16] : [02:24:18]

and control this technology.

[02:24:18] : [02:24:21]

And by so doing,

[02:24:21] : [02:24:24]

take advantage, abuse the little guy in society.

[02:24:24] : [02:24:29]

- Well, that's exactly why we need open source platforms.

[02:24:29] : [02:24:31]

- Yeah.

[02:24:31] : [02:24:32]

I just wanted to... (laughs)

[02:24:32] : [02:24:34]

Nail the point home more and more.

[02:24:34] : [02:24:36]

- [Yann] Yes.

[02:24:36] : [02:24:37]

- So let me ask you on your...

[02:24:37] : [02:24:40]

Like I said, you do get a little bit

[02:24:40] : [02:24:42]

flavorful on the internet.

[02:24:42] : [02:24:46]

Joscha Bach tweeted something that you LOL'd at

[02:24:46] : [02:24:50]

in reference to HAL 9000.

[02:24:50] : [02:24:53]

Quote,

[02:24:53] : [02:24:54]

"I appreciate your argument

[02:24:54] : [02:24:55]

and I fully understand your frustration,

[02:24:55] : [02:24:57]

but whether the pod bay doors should be opened or closed

[02:24:57] : [02:25:01]

is a complex and nuanced issue."

[02:25:01] : [02:25:03]

So you're at the head of Meta AI.

[02:25:03] : [02:25:06]

This is something that really worries me,

[02:25:06] : [02:25:12]

that our AI overlords

[02:25:12] : [02:25:15]

will speak down to us with corporate speak of this nature

[02:25:15] : [02:25:20]

and you sort of resist that with your way of being.

[02:25:20] : [02:25:23]

Is this something you can just comment on

[02:25:23] : [02:25:27]

sort of working at a big company,

[02:25:27] : [02:25:29]

how you can avoid the over-fearing, I suppose,

[02:25:29] : [02:25:34]

the harm created through caution?

[02:25:34] : [02:25:41]

- Yeah.

[02:25:41] : [02:25:42]

Again, I think the answer to this is open source platforms

[02:25:42] : [02:25:45]

and then enabling a widely diverse set of people

[02:25:45] : [02:25:49]

to build AI assistants

[02:25:49] : [02:25:53]

that represent the diversity

[02:25:53] : [02:25:55]

of cultures, opinions, languages,

[02:25:55] : [02:25:57]

and value systems across the world.

[02:25:57] : [02:25:59]

So that you're not bound to just be brainwashed

[02:25:59] : [02:26:04]

by a particular way of thinking

[02:26:04] : [02:26:07]

because of a single AI entity.

[02:26:07] : [02:26:10]

So I mean, I think it's a really, really important question

[02:26:10] : [02:26:13]

for society.

[02:26:13] : [02:26:14]

And the problem I'm seeing,

[02:26:14] : [02:26:16]

which is why I've been so vocal

[02:26:16] : [02:26:21]

and sometimes a little sardonic about it-

[02:26:21] : [02:26:25]

- Never stop.

[02:26:25] : [02:26:26]

Never stop, Yann.

[02:26:26] : [02:26:27]

(both laugh)

[02:26:27] : [02:26:28]

We love it. - Is because I see the danger

[02:26:28] : [02:26:31]

of this concentration of power

[02:26:31] : [02:26:32]

through proprietary AI systems

[02:26:32] : [02:26:36]

as a much bigger danger than everything else.

[02:26:36] : [02:26:39]

That if we really want diversity of opinion in AI systems,

[02:26:39] : [02:26:44]

that in the future

[02:26:44] : [02:26:48]

we'll all be interacting through AI systems,

[02:26:48] : [02:26:52]

we need those to be diverse

[02:26:52] : [02:26:54]

for the preservation of a diversity of ideas

[02:26:54] : [02:26:58]

and creeds and political opinions and whatever,

[02:26:58] : [02:27:03]

and the preservation of democracy.

[02:27:03] : [02:27:07]

And what works against this

[02:27:07] : [02:27:12]

is people who think that for reasons of security,

[02:27:12] : [02:27:15]

we should keep AI systems under lock and key

[02:27:15] : [02:27:19]

because it's too dangerous

[02:27:19] : [02:27:20]

to put it in the hands of everybody

[02:27:20] : [02:27:22]

because it could be used by terrorists or something.

[02:27:22] : [02:27:26]

That would lead to potentially a very bad future

[02:27:26] : [02:27:33]

in which all of our information diet

[02:27:33] : [02:27:38]

is controlled by a small number of companies

[02:27:38] : [02:27:41]

through proprietary systems.

[02:27:41] : [02:27:43]

- So you trust humans with this technology

[02:27:43] : [02:27:47]

to build systems that are on the whole good for humanity?

[02:27:47] : [02:27:52]

- Isn't that what democracy and free speech is all about?

[02:27:52] : [02:27:56]

- I think so.

[02:27:56] : [02:27:57]

- Do you trust institutions to do the right thing?

[02:27:57] : [02:28:00]

Do you trust people to do the right thing?

[02:28:00] : [02:28:03]

And yeah, there's bad people who are gonna do bad things,

[02:28:03] : [02:28:05]

but they're not going to have superior technology

[02:28:05] : [02:28:07]

to the good people.

[02:28:07] : [02:28:08]

So then it's gonna be my good AI against your bad AI, right?

[02:28:08] : [02:28:12]

I mean it's the examples that we were just talking about

[02:28:12] : [02:28:15]

of maybe some rogue country will build some AI system

[02:28:15] : [02:28:20]

that's gonna try to convince everybody

[02:28:20] : [02:28:23]

to go into a civil war or something

[02:28:23] : [02:28:27]

or elect a favorable ruler.

[02:28:27] : [02:28:31]

But then they will have to go past our AI systems, right?

[02:28:31] : [02:28:35]

(laughs)

[02:28:35] : [02:28:36]

- An AI system with a strong Russian accent

[02:28:36] : [02:28:38]

will be trying to convince our-

[02:28:38] : [02:28:40]

- And doesn't put any articles in their sentences.

[02:28:40] : [02:28:42]

(both laugh)

[02:28:42] : [02:28:45]

- Well, it'll be at the very least, absurdly comedic.

[02:28:45] : [02:28:48]

Okay.

[02:28:48] : [02:28:50]

So since we talked about sort of the physical reality,

[02:28:50] : [02:28:55]

I'd love to ask your vision of the future with robots

[02:28:55] : [02:28:58]

in this physical reality.

[02:28:58] : [02:29:00]

So many of the kinds of intelligence

[02:29:00] : [02:29:02]

you've been speaking about

[02:29:02] : [02:29:05]

would empower robots

[02:29:05] : [02:29:06]

to be more effective collaborators with us humans.

[02:29:06] : [02:29:10]

So since Tesla's Optimus team

[02:29:10] : [02:29:14]

has been showing us some progress in humanoid robots,

[02:29:14] : [02:29:17]

I think it really reinvigorated the whole industry

[02:29:17] : [02:29:20]

that I think Boston Dynamics has been leading

[02:29:20] : [02:29:22]

for a very, very long time.

[02:29:22] : [02:29:24]

So now there's all kinds of companies,

[02:29:24] : [02:29:25]

Figure AI, obviously Boston Dynamics-

[02:29:25] : [02:29:28]

- [Yann] Unitree.

[02:29:28] : [02:29:29]

- Unitree.

[02:29:29] : [02:29:31]

But there's like a lot of them.

[02:29:31] : [02:29:33]

It's great.

[02:29:33] : [02:29:34]

It's great.

[02:29:34] : [02:29:35]

I mean I love it.

[02:29:35] : [02:29:36]

So do you think there'll be millions of humanoid robots

[02:29:36] : [02:29:42]

walking around soon?

[02:29:42] : [02:29:44]

- Not soon, but it's gonna happen.

[02:29:44] : [02:29:46]

Like the next decade

[02:29:46] : [02:29:47]

I think is gonna be really interesting in robots.

[02:29:47] : [02:29:49]

Like the emergence of the robotics industry

[02:29:49] : [02:29:53]

has been in the waiting for 10, 20 years,

[02:29:53] : [02:29:57]

without really emerging

[02:29:57] : [02:29:58]

other than for like kind of pre-programmed behavior

[02:29:58] : [02:30:01]

and stuff like that.

[02:30:01] : [02:30:03]

And the main issue is, again, Moravec's paradox.

[02:30:03] : [02:30:08]

Like how do we get the systems

[02:30:08] : [02:30:09]

to understand how the world works

[02:30:09] : [02:30:11]

and kind of plan actions?

[02:30:11] : [02:30:13]

And so we can do it for really specialized tasks.

[02:30:13] : [02:30:15]

And the way Boston Dynamics goes about it

[02:30:15] : [02:30:21]

is basically with a lot of handcrafted dynamical models

[02:30:21] : [02:30:25]

and careful planning in advance,

[02:30:25] : [02:30:28]

which is very classical robotics with a lot of innovation,

[02:30:28] : [02:30:32]

a little bit of perception,

[02:30:32] : [02:30:33]

but it's still not...

[02:30:33] : [02:30:35]

Like they can't build a domestic robot, right?

[02:30:35] : [02:30:38]

And we're still some distance away

[02:30:38] : [02:30:43]

from completely autonomous level five driving.

[02:30:43] : [02:30:46]

And we're certainly very far away

[02:30:46] : [02:30:49]

from having level five autonomous driving

[02:30:49] : [02:30:53]

by a system that can train itself

[02:30:53] : [02:30:55]

by driving 20 hours, like any 17-year-old.

[02:30:55] : [02:30:59]

So until we have, again, world models,

[02:30:59] : [02:31:05]

systems that can train themselves

[02:31:05] : [02:31:09]

to understand how the world works,

[02:31:09] : [02:31:11]

we're not gonna have significant progress in robotics.

[02:31:11] : [02:31:16]

So a lot of the people

[02:31:16] : [02:31:18]

working on robotic hardware at the moment

[02:31:18] : [02:31:21]

are betting or banking

[02:31:21] : [02:31:23]

on the fact that AI

[02:31:23] : [02:31:25]

is gonna make sufficient progress towards that.

[02:31:25] : [02:31:28]

- And they're hoping to discover a product in it too-

[02:31:28] : [02:31:31]

- [Yann] Yeah.

[02:31:31] : [02:31:32]

- Before you have a really strong world model,

[02:31:32] : [02:31:34]

there'll be an almost strong world model.

[02:31:34] : [02:31:38]

And people are trying to find a product

[02:31:38] : [02:31:40]

in a clumsy robot, I suppose.

[02:31:40] : [02:31:43]

Like not a perfectly efficient robot.

[02:31:43] : [02:31:45]

So there's the factory setting

[02:31:45] : [02:31:46]

where humanoid robots

[02:31:46] : [02:31:48]

can help automate some aspects of the factory.

[02:31:48] : [02:31:51]

I think that's a crazy difficult task

[02:31:51] : [02:31:53]

'cause of all the safety required

[02:31:53] : [02:31:54]

and all this kind of stuff,

[02:31:54] : [02:31:56]

I think in the home is more interesting.

[02:31:56] : [02:31:57]

But then you start to think...

[02:31:57] : [02:32:00]

I think you mentioned loading the dishwasher, right?

[02:32:00] : [02:32:03]

- [Yann] Yeah.

[02:32:03] : [02:32:04]

- Like I suppose that's one of the main problems

[02:32:04] : [02:32:06]

you're working on.

[02:32:06] : [02:32:07]

- I mean there's cleaning up. (laughs)

[02:32:07] : [02:32:10]

- [Lex] Yeah.

[02:32:10] : [02:32:11]

- Cleaning the house,

[02:32:11] : [02:32:13]

clearing up the table after a meal,

[02:32:13] : [02:32:17]

washing the dishes, all those tasks, cooking.

[02:32:17] : [02:32:21]

I mean all the tasks that in principle could be automated

[02:32:21] : [02:32:24]

but are actually incredibly sophisticated,

[02:32:24] : [02:32:26]

really complicated.

[02:32:26] : [02:32:28]

- But even just basic navigation

[02:32:28] : [02:32:29]

around a space full of uncertainty.

[02:32:29] : [02:32:32]

- That sort of works.

[02:32:32] : [02:32:33]

Like you can sort of do this now.

[02:32:33] : [02:32:35]

Navigation is fine.

[02:32:35] : [02:32:37]

- Well, navigation in a way that's compelling to us humans

[02:32:37] : [02:32:40]

is a different thing.

[02:32:40] : [02:32:42]

- Yeah.

[02:32:42] : [02:32:43]

It's not gonna be necessarily...

[02:32:43] : [02:32:45]

I mean we have demos actually

[02:32:45] : [02:32:46]

'cause there is a so-called embodied AI group at FAIR

[02:32:46] : [02:32:51]

and they haven't been building their own robots

[02:32:51] : [02:32:55]

but using commercial robots.

[02:32:55] : [02:32:57]

And you can tell the robot dog like go to the fridge

[02:32:57] : [02:33:02]

and they can actually open the fridge

[02:33:02] : [02:33:03]

and they can probably pick up a can in the fridge

[02:33:03] : [02:33:05]

and stuff like that and bring it to you.

[02:33:05] : [02:33:08]

So it can navigate,

[02:33:08] : [02:33:10]

it can grab objects

[02:33:10] : [02:33:12]

as long as it's been trained to recognize them,

[02:33:12] : [02:33:14]

which vision systems work pretty well nowadays.

[02:33:14] : [02:33:17]

But it's not like a completely general robot

[02:33:17] : [02:33:23]

that would be sophisticated enough

[02:33:23] : [02:33:24]

to do things like clearing up the dinner table.

[02:33:24] : [02:33:28]

(laughs)

[02:33:28] : [02:33:30]

- Yeah, to me that's an exciting future

[02:33:30] : [02:33:33]

of getting humanoid robots.

[02:33:33] : [02:33:35]

Robots in general in the home more and more

[02:33:35] : [02:33:36]

because it gets humans

[02:33:36] : [02:33:38]

to really directly interact with AI systems

[02:33:38] : [02:33:40]

in the physical space.

[02:33:40] : [02:33:42]

And in so doing it allows us

[02:33:42] : [02:33:44]

to philosophically, psychologically explore

[02:33:44] : [02:33:46]

our relationships with robots.

[02:33:46] : [02:33:47]

It can be really, really interesting.

[02:33:47] : [02:33:50]

So I hope you make progress on the whole JEPA thing soon.

[02:33:50] : [02:33:54]

(laughs)

[02:33:54] : [02:33:55]

- Well, I mean, I hope things can work as planned.

[02:33:55] : [02:33:58]

I mean, again, we've been like kinda working on this idea

[02:33:58] : [02:34:03]

of self-supervised learning from video for 10 years.

[02:34:03] : [02:34:07]

And only made significant progress in the last two or three.

[02:34:07] : [02:34:12]

- And actually you've mentioned

[02:34:12] : [02:34:13]

that there's a lot of interesting breakthroughs

[02:34:13] : [02:34:15]

that can happen without having access to a lot of compute.

[02:34:15] : [02:34:18]

So if you're interested in doing a PhD

[02:34:18] : [02:34:20]

in this kind of stuff,

[02:34:20] : [02:34:21]

there's a lot of possibilities still

[02:34:21] : [02:34:23]

to do innovative work.

[02:34:23] : [02:34:25]

So like what advice would you give

[02:34:25] : [02:34:26]

to an undergrad that's looking to go to grad school

[02:34:26] : [02:34:30]

and do a PhD?

[02:34:30] : [02:34:32]

- So basically, I've listed them already.

[02:34:32] : [02:34:35]

This idea of how do you train a world model by observation?

[02:34:35] : [02:34:38]

And you don't have to train necessarily

[02:34:38] : [02:34:41]

on gigantic data sets.

[02:34:41] : [02:34:43]

I mean, it could turn out to be necessary

[02:34:43] : [02:34:47]

to actually train on large data sets

[02:34:47] : [02:34:48]

to have emergent properties like we have with LLMs.

[02:34:48] : [02:34:51]

But I think there are a lot of good ideas that can be done

[02:34:51] : [02:34:53]

without necessarily scaling up.

[02:34:53] : [02:34:56]

Then there is how do you do planning

[02:34:56] : [02:34:58]

with a learned world model?

[02:34:58] : [02:35:00]

If the world the system evolves in

[02:35:00] : [02:35:02]

is not the physical world,

[02:35:02] : [02:35:03]

but is the world of let's say the internet

[02:35:03] : [02:35:06]

or some sort of world

[02:35:06] : [02:35:09]

where an action consists

[02:35:09] : [02:35:11]

in doing a search in a search engine

[02:35:11] : [02:35:13]

or interrogating a database,

[02:35:13] : [02:35:14]

or running a simulation

[02:35:14] : [02:35:18]

or calling a calculator

[02:35:18] : [02:35:19]

or solving a differential equation,

[02:35:19] : [02:35:21]

how do you get a system

[02:35:21] : [02:35:23]

to actually plan a sequence of actions

[02:35:23] : [02:35:25]

to give the solution to a problem?

[02:35:25] : [02:35:28]
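The planning problem sketched here, finding a sequence of tool actions (a search, a database query, a calculator call) that solves a problem, can be illustrated with a toy search over action sequences. This is not FAIR's method or any system named in the conversation: the "tools" below are trivial integer functions and the search is plain breadth-first, standing in for what a real system would do with a learned world model predicting each action's outcome.

```python
from collections import deque

def plan(start, goal, actions, max_depth=4):
    """Breadth-first search for a sequence of actions turning start into goal."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        state, path = queue.popleft()
        if state == goal:
            return path
        if len(path) >= max_depth:
            continue  # bound the plan length
        for name, fn in actions.items():
            nxt = fn(state)  # a world model would *predict* this outcome
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [name]))
    return None

# Hypothetical "tools" acting on an integer state: reach 10 starting from 1.
actions = {"double": lambda x: x * 2, "increment": lambda x: x + 1}
print(plan(1, 10, actions))  # ['double', 'double', 'increment', 'double']
```

The point of the sketch is only the shape of the problem: planning with tools is a search through imagined action outcomes, whatever the action space is.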

And so the question of planning

[02:35:28] : [02:35:32]

is not just a question ofplanning physical actions,

[02:35:32] : [02:35:35]

it could be planning actions to use tools

[02:35:35] : [02:35:39]

for a dialogue system

[02:35:39] : [02:35:40]

or for any kind of intelligence system.

[02:35:40] : [02:35:42]

And there's some work on this but not a huge amount.

[02:35:42] : [02:35:47]

Some work at FAIR,

[02:35:47] : [02:35:48]

one called Toolformer, which was a couple of years ago,

[02:35:48] : [02:35:52]

and some more recent work on planning,

[02:35:52] : [02:35:55]

but I don't think we have like a good solution

[02:35:55] : [02:35:59]

for any of that.

[02:35:59] : [02:36:00]

Then there is the question of hierarchical planning.

[02:36:00] : [02:36:03]

So the example I mentioned

[02:36:03] : [02:36:05]

of planning a trip from New York to Paris,

[02:36:05] : [02:36:10]

that's hierarchical,

[02:36:10] : [02:36:11]

but almost every action that we take

[02:36:11] : [02:36:13]

involves hierarchical planning in some sense.

[02:36:13] : [02:36:17]

And we really have absolutely no idea how to do this.

[02:36:17] : [02:36:20]

Like there's zero demonstration

[02:36:20] : [02:36:22]

of hierarchical planning in AI,

[02:36:22] : [02:36:26]

where the various levels of representations

[02:36:26] : [02:36:32]

that are necessary have been learned.

[02:36:32] : [02:36:36]

We can do like two-level hierarchical planning

[02:36:36] : [02:36:39]

when we design the two levels.

[02:36:39] : [02:36:41]

So for example, you have like a robot dog, a legged robot, right?

[02:36:41] : [02:36:44]

You want it to go from theliving room to the kitchen.

[02:36:44] : [02:36:48]

You can plan a path that avoids the obstacle.

[02:36:48] : [02:36:51]

And then you can send this to a lower-level planner

[02:36:51] : [02:36:54]

that figures out how to move the legs

[02:36:54] : [02:36:56]

to kind of follow that trajectory, right?

[02:36:56] : [02:36:59]

So that works,

[02:36:59] : [02:37:00]

but that two-level planning is designed by hand, right?

[02:37:00] : [02:37:03]
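The hand-designed two-level planner just described can be sketched concretely: a high level plans an obstacle-free path on a grid, and a separate low level turns each waypoint into motor commands. Everything below (the grid map, the command names, the stub low level) is invented for illustration; a real legged robot would run a proper footstep controller at the low level.

```python
from collections import deque

def high_level_path(grid, start, goal):
    """BFS for a shortest obstacle-free path on a 2D grid (1 = obstacle)."""
    queue, parent = deque([start]), {start: None}
    while queue:
        cell = queue.popleft()
        if cell == goal:  # reconstruct the path by walking parents back
            path = []
            while cell is not None:
                path.append(cell)
                cell = parent[cell]
            return path[::-1]
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if (0 <= nr < len(grid) and 0 <= nc < len(grid[0])
                    and grid[nr][nc] == 0 and (nr, nc) not in parent):
                parent[(nr, nc)] = cell
                queue.append((nr, nc))
    return None

def low_level_commands(path):
    """Stub low-level planner: map each waypoint step to a leg command."""
    moves = {(1, 0): "step south", (-1, 0): "step north",
             (0, 1): "step east", (0, -1): "step west"}
    return [moves[(b[0] - a[0], b[1] - a[1])] for a, b in zip(path, path[1:])]

grid = [[0, 0, 0],
        [1, 1, 0],
        [0, 0, 0]]
path = high_level_path(grid, (0, 0), (2, 0))  # living room -> kitchen
print(low_level_commands(path))
```

Note that both levels of abstraction (grid cells, step commands) are specified by hand here, which is exactly the limitation being discussed: nothing in this sketch learns what the levels should be.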

We specify what the proper levels of abstraction,

[02:37:03] : [02:37:09]

the representation at each level of abstraction have to be.

[02:37:09] : [02:37:13]

How do you learn this?

[02:37:13] : [02:37:14]

How do you learn that hierarchical representation

[02:37:14] : [02:37:16]

of action plans, right?

[02:37:16] : [02:37:19]

With convnets and deep learning,

[02:37:19] : [02:37:22]

we can train the system

[02:37:22] : [02:37:23]

to learn hierarchical representations of percepts.

[02:37:23] : [02:37:26]

What is the equivalent

[02:37:26] : [02:37:28]

when what you're trying to represent are action plans?

[02:37:28] : [02:37:30]

- For action plans.

[02:37:30] : [02:37:31]

Yeah.

[02:37:31] : [02:37:32]

So you want basically a robot dog or humanoid robot

[02:37:32] : [02:37:35]

that turns on and travels from New York to Paris

[02:37:35] : [02:37:38]

all by itself.

[02:37:38] : [02:37:40]

- [Yann] For example.

[02:37:40] : [02:37:41]

- All right.

[02:37:41] : [02:37:43]

It might have some trouble at the TSA but-

[02:37:43] : [02:37:47]

- No, but even doing something fairly simple

[02:37:47] : [02:37:49]

like a household task.

[02:37:49] : [02:37:50]

- [Lex] Sure.

[02:37:50] : [02:37:51]

- Like cooking or something.

[02:37:51] : [02:37:53]

- Yeah.

[02:37:53] : [02:37:54]

There's a lot involved.

[02:37:54] : [02:37:55]

It's a super complex task.

[02:37:55] : [02:37:56]

Once again, we take it for granted.

[02:37:56] : [02:37:59]

What hope do you have for the future of humanity?

[02:37:59] : [02:38:05]

We're talking about so many exciting technologies,

[02:38:05] : [02:38:07]

so many exciting possibilities.

[02:38:07] : [02:38:09]

What gives you hope when you look out

[02:38:09] : [02:38:12]

over the next 10, 20, 50, 100 years?

[02:38:12] : [02:38:15]

If you look at social media,

[02:38:15] : [02:38:16]

there's wars going on, there's division, there's hatred,

[02:38:16] : [02:38:21]

all this kind of stuff that's also part of humanity.

[02:38:21] : [02:38:24]

But amidst all that, what gives you hope?

[02:38:24] : [02:38:27]

- I love that question.

[02:38:27] : [02:38:30]

We can make humanity smarter with AI.

[02:38:30] : [02:38:37]

Okay?

[02:38:37] : [02:38:40]

I mean AI basically will amplify human intelligence.

[02:38:40] : [02:38:44]

It's as if every one of us

[02:38:44] : [02:38:47]

will have a staff of smart AI assistants.

[02:38:47] : [02:38:52]

They might be smarter than us.

[02:38:52] : [02:38:53]

They'll do our bidding,

[02:38:53] : [02:38:55]

perhaps execute a task

[02:38:55] : [02:39:01]

in ways that are much better than we could do ourselves

[02:39:01] : [02:39:05]

because they'd be smarter than us.

[02:39:05] : [02:39:07]

And so it's like everyone would be the boss

[02:39:07] : [02:39:10]

of a staff of super smart virtual people.

[02:39:10] : [02:39:14]

So we shouldn't feel threatened by this

[02:39:14] : [02:39:18]

any more than we should feel threatened

[02:39:18] : [02:39:19]

by being the manager of a group of people,

[02:39:19] : [02:39:22]

some of whom are more intelligent than us.

[02:39:22] : [02:39:24]

I certainly have a lot of experience with this.

[02:39:24] : [02:39:29]

(laughs)

[02:39:29] : [02:39:30]

Of having people working with me who are smarter than me.

[02:39:30] : [02:39:34]

That's actually a wonderful thing.

[02:39:34] : [02:39:36]

So having machines that are smarter than us,

[02:39:36] : [02:39:40]

that assist us in all of our tasks, our daily lives,

[02:39:40] : [02:39:44]

whether it's professional or personal,

[02:39:44] : [02:39:45]

I think would be an absolutely wonderful thing.

[02:39:45] : [02:39:48]

Because intelligence is the commodity

[02:39:48] : [02:39:50]

that is most in demand.

[02:39:50] : [02:39:54]

I mean, all the mistakes that humanity makes

[02:39:54] : [02:39:57]

are because of a lack of intelligence, really,

[02:39:57] : [02:39:59]

or lack of knowledge, which is related.

[02:39:59] : [02:40:01]

So making people smarter can only be good.

[02:40:01] : [02:40:07]

I mean, for the same reason

[02:40:07] : [02:40:08]

that public education is a good thing

[02:40:08] : [02:40:10]

and books are a good thing,

[02:40:10] : [02:40:15]

and the internet is also a good thing, intrinsically.

[02:40:15] : [02:40:17]

And even social networks are a good thing

[02:40:17] : [02:40:19]

if you run them properly.

[02:40:19] : [02:40:21]

(laughs)

[02:40:21] : [02:40:21]

It's difficult, but you can.

[02:40:21] : [02:40:23]

Because it helps the communication

[02:40:23] : [02:40:30]

of information and knowledge

[02:40:30] : [02:40:32]

and the transmission of knowledge.

[02:40:32] : [02:40:34]

So AI is gonna make humanity smarter.

[02:40:34] : [02:40:36]

And the analogy I've been using

[02:40:36] : [02:40:39]

is the fact that perhaps an equivalent event

[02:40:39] : [02:40:44]

in the history of humanity

[02:40:44] : [02:40:47]

to what might be provided by the generalization of AI assistants

[02:40:47] : [02:40:52]

is the invention of the printing press.

[02:40:52] : [02:40:55]

It made everybody smarter.

[02:40:55] : [02:40:57]

The fact that people could have access to books.

[02:40:57] : [02:41:02]

Books were a lot cheaper than they were before.

[02:41:02] : [02:41:06]

And so a lot more people had an incentive to learn to read,

[02:41:06] : [02:41:10]

which wasn't the case before.

[02:41:10] : [02:41:12]

And people became smarter.

[02:41:12] : [02:41:17]

It enabled the Enlightenment, right?

[02:41:17] : [02:41:21]

There wouldn't be an Enlightenment

[02:41:21] : [02:41:22]

without the printing press.

[02:41:22] : [02:41:24]

It enabled philosophy, rationalism,

[02:41:24] : [02:41:29]

escape from religious doctrine,

[02:41:29] : [02:41:33]

democracy, science.

[02:41:33] : [02:41:38]

And certainly without this

[02:41:38] : [02:41:43]

there wouldn't have been the American Revolution

[02:41:43] : [02:41:46]

or the French Revolution.

[02:41:46] : [02:41:47]

And so we'd still be under feudal regimes, perhaps.

[02:41:47] : [02:41:52]

And so it completely transformed the world

[02:41:52] : [02:41:57]

because people became smarter

[02:41:57] : [02:41:59]

and kinda learned about things.

[02:41:59] : [02:42:01]

Now, it also created 200 years

[02:42:01] : [02:42:05]

of essentially religious conflicts in Europe, right?

[02:42:05] : [02:42:08]

Because the first thing that people read was the Bible

[02:42:08] : [02:42:12]

and realized that

[02:42:12] : [02:42:15]

perhaps there was a different interpretation of the Bible

[02:42:15] : [02:42:17]

than what the priests were telling them.

[02:42:17] : [02:42:19]

And so that created the Protestant movement

[02:42:19] : [02:42:22]

and created a rift.

[02:42:22] : [02:42:23]

And in fact, the Catholic church

[02:42:23] : [02:42:25]

didn't like the idea of the printing press

[02:42:25] : [02:42:28]

but they had no choice.

[02:42:28] : [02:42:29]

And so it had some bad effects and some good effects.

[02:42:29] : [02:42:32]

I don't think anyone today

[02:42:32] : [02:42:33]

would say that the invention of the printing press

[02:42:33] : [02:42:35]

had an overall negative effect

[02:42:35] : [02:42:38]

despite the fact that it created 200 years

[02:42:38] : [02:42:40]

of religious conflicts in Europe.

[02:42:40] : [02:42:44]

Now compare this,

[02:42:44] : [02:42:45]

and I was very proud of myself

[02:42:45] : [02:42:48]

to come up with this analogy,

[02:42:48] : [02:42:51]

but realized someone else came up with the same idea before me.

[02:42:51] : [02:42:54]

Compare this with what happened in the Ottoman Empire.

[02:42:54] : [02:42:58]

The Ottoman Empire banned the printing press for 200 years.

[02:42:58] : [02:43:03]

And it didn't ban it for all languages,

[02:43:03] : [02:43:10]

only for Arabic.

[02:43:10] : [02:43:11]

You could actually print books

[02:43:11] : [02:43:13]

in Latin or Hebrew or whatever in the Ottoman Empire,

[02:43:13] : [02:43:18]

just not in Arabic.

[02:43:18] : [02:43:19]

And I thought it was because

[02:43:19] : [02:43:25]

the rulers just wanted to preserve

[02:43:25] : [02:43:27]

the control over the population and the dogma,

[02:43:27] : [02:43:30]

religious dogma and everything.

[02:43:30] : [02:43:32]

But after talking withthe UAE Minister of AI,

[02:43:32] : [02:43:37]

Omar Al Olama,

[02:43:37] : [02:43:40]

he told me no, there was another reason.

[02:43:40] : [02:43:44]

And the other reason was that

[02:43:44] : [02:43:47]

it was to preserve the corporation of calligraphers, right?

[02:43:47] : [02:43:52]

There's like an art form

[02:43:52] : [02:43:56]

which is writing those beautiful Arabic poems

[02:43:56] : [02:44:01]

or whatever religious text in this thing.

[02:44:01] : [02:44:04]

And it was a very powerful corporation of scribes, basically,

[02:44:04] : [02:44:07]

that kinda ran a big chunk of the empire.

[02:44:07] : [02:44:12]

And they couldn't put them out of business.

[02:44:12] : [02:44:14]

So they banned the printing press

[02:44:14] : [02:44:16]

in part to protect that business.

[02:44:16] : [02:44:18]

Now, what's the analogy for AI today?

[02:44:18] : [02:44:23]

Like who are we protecting by banning AI?

[02:44:23] : [02:44:25]

Like who are the people who are asking that AI be regulated

[02:44:25] : [02:44:28]

to protect their jobs?

[02:44:28] : [02:44:31]

And of course, it's a real question

[02:44:31] : [02:44:35]

of what is gonna be the effect

[02:44:35] : [02:44:37]

of technological transformation like AI

[02:44:37] : [02:44:41]

on the job market and the labor market?

[02:44:41] : [02:44:45]

And there are economists

[02:44:45] : [02:44:46]

who are much more expert at this than I am,

[02:44:46] : [02:44:49]

but when I talk to them,

[02:44:49] : [02:44:50]

they tell us we're not gonna run out of jobs.

[02:44:50] : [02:44:54]

This is not gonna cause mass unemployment.

[02:44:54] : [02:44:56]

This is just gonna be a gradual shift

[02:44:56] : [02:45:01]

of different professions.

[02:45:01] : [02:45:02]

The professions that are gonna be hot

[02:45:02] : [02:45:04]

10 or 15 years from now,

[02:45:04] : [02:45:05]

we have no idea today what they're gonna be.

[02:45:05] : [02:45:09]

The same way if we go back 20 years in the past,

[02:45:09] : [02:45:12]

like who could have thought 20 years ago

[02:45:12] : [02:45:15]

that like the hottest job,

[02:45:15] : [02:45:17]

even like 5, 10 years ago, was mobile app developer?

[02:45:17] : [02:45:21]

Like smartphones weren't invented.

[02:45:21] : [02:45:23]

- Most of the jobs of the future

[02:45:23] : [02:45:24]

might be in the Metaverse. (laughs)

[02:45:24] : [02:45:27]

- Well, it could be.

[02:45:27] : [02:45:28]

Yeah.

[02:45:28] : [02:45:29]

- But the point is you can't possibly predict.

[02:45:29] : [02:45:31]

But you're right.

[02:45:31] : [02:45:33]

I mean, you've made a lot of strong points.

[02:45:33] : [02:45:35]

And I believe that people are fundamentally good,

[02:45:35] : [02:45:38]

and so if AI, especially open source AI

[02:45:38] : [02:45:42]

can make them smarter,

[02:45:42] : [02:45:45]

it just empowers the goodness in humans.

[02:45:45] : [02:45:48]

- So I share that feeling.

[02:45:48] : [02:45:49]

Okay?

[02:45:49] : [02:45:50]

I think people are fundamentally good. (laughing)

[02:45:50] : [02:45:54]

And in fact a lot of doomers are doomers

[02:45:54] : [02:45:56]

because they don't think that people are fundamentally good.

[02:45:56] : [02:46:00]

And they either don't trust people

[02:46:00] : [02:46:04]

or they don't trust the institutions to do the right thing

[02:46:04] : [02:46:07]

so that people behave properly.

[02:46:07] : [02:46:09]

- Well, I think both you and I believe in humanity,

[02:46:09] : [02:46:13]

and I think I speak for a lot of people

[02:46:13] : [02:46:16]

in saying thank you for pushing the open source movement,

[02:46:16] : [02:46:20]

pushing to make both research and AI open source,

[02:46:20] : [02:46:24]

making it available to people,

[02:46:24] : [02:46:25]

and also the models themselves,

[02:46:25] : [02:46:27]

making that open source also.

[02:46:27] : [02:46:28]

So thank you for that.

[02:46:28] : [02:46:30]

And thank you for speaking your mind

[02:46:30] : [02:46:32]

in such colorful and beautiful ways on the internet.

[02:46:32] : [02:46:34]

I hope you never stop.

[02:46:34] : [02:46:35]

You're one of the most fun people I know

[02:46:35] : [02:46:37]

and get to be a fan of.

[02:46:37] : [02:46:39]

So Yann, thank you for speaking to me once again,

[02:46:39] : [02:46:42]

and thank you for being you.

[02:46:42] : [02:46:43]

- Thank you Lex.

[02:46:43] : [02:46:44]

- Thanks for listening to this conversation with Yann LeCun.

[02:46:44] : [02:46:48]

To support this podcast,

[02:46:48] : [02:46:49]

please check out our sponsors in the description.

[02:46:49] : [02:46:52]

And now let me leave you with some words

[02:46:52] : [02:46:54]

from Arthur C. Clarke,

[02:46:54] : [02:46:55]

"the only way to discoverthe limits of the possible

[02:46:55] : [02:46:59]

is to go beyond them into the impossible."

[02:46:59] : [02:47:03]

Thank you for listening and hope to see you next time.

[02:47:03] : [02:47:07]



