DreamFusion: An AI that Generates 3D Models from Text

Written by whatsai | Published 2022/10/16
Tech Story Tags: ai | artificial-intelligence | art | 3d | machine-learning | data-science | hackernoon-top-story | computer-vision | web-monetization | hackernoon-es | hackernoon-hi | hackernoon-zh | hackernoon-vi | hackernoon-fr | hackernoon-pt | hackernoon-ja

TL;DR: DreamFusion is a new Google Research model that can understand a sentence enough to generate a 3D model of it. The results aren’t perfect yet, but the progress we’ve made in the field over the past year is just incredible. We can't really make it much cooler, but what’s even more fascinating is how it works. Let's dive into it...

We’ve seen models before that were able to take a sentence and generate images.
We've also seen other approaches to manipulate the generated images by learning specific concepts like an object or particular style.
Last week, Meta published the Make-A-Video model that I covered, which allows you to generate a short video, also from a text sentence. The results aren’t perfect yet, but the progress we’ve made in the field over the past year is just incredible.
This week we take another step forward.
Here’s DreamFusion, a new Google Research model that can understand a sentence enough to generate a 3D model of it.
You can see it as DALL·E or Stable Diffusion, but in 3D.
How cool is that?! We can’t really make it much cooler.
But what’s even more fascinating is how it works. Let’s dive into it...

References

►Read the full article: https://www.louisbouchard.ai/dreamfusion/
►Poole, B., Jain, A., Barron, J.T. and Mildenhall, B., 2022. DreamFusion: Text-to-3D using 2D Diffusion. arXiv preprint arXiv:2209.14988.
►Project website: https://dreamfusion3d.github.io/
►My Newsletter (A new AI application explained weekly to your emails!): https://www.louisbouchard.ai/newsletter/

Video Transcript

We’ve seen models able to take a sentence and generate images, then other approaches to manipulate the generated images by learning specific concepts like an object or a particular style. Last week, Meta published the Make-A-Video model that I covered, which allows you to generate a short video, also from a text sentence. The results aren’t perfect yet, but the progress we’ve made in the field since last year is just incredible. This week we take another step forward. Here’s DreamFusion, a new Google Research model that can understand a sentence enough to generate a 3D model out of it. You can see this as DALL·E or Stable Diffusion, but in 3D. How cool is that? We can’t make it much cooler, but what’s even more fascinating is how it works. Let’s dive into it!
But first, give me a few seconds to talk about a related subject: computer vision. You’ll want to hear this if you are in this field as well. For this video, I’m partnering with Encord, the online learning platform for computer vision. Data is one of the most important parts of creating innovative computer vision models. That’s why the Encord platform has been built from the ground up to make the creation of training data and the testing of machine learning models quicker than it’s ever been. Encord does this in two ways. First, it makes it easier to manage, annotate, and evaluate training data through a range of collaborative annotation tools and automation features. Second, Encord offers access to its QA workflows, APIs, and SDK so you can create your own active learning pipelines, speeding up model development. And by using Encord, you don’t need to waste time building your own annotation tools, letting you focus on getting the right data into your models. If that sounds interesting, please click the first link below to get a free 28-day trial of Encord, exclusive to our community.
If you’ve been following my work, DreamFusion is quite simple: it basically uses two models I already covered, NeRFs and one of the text-to-image models. In their case it’s the Imagen model, but any would do, like Stable Diffusion or DALL·E. As you know if you’ve been a good student and watched the previous videos, NeRFs are a kind of model used to render 3D scenes by generating a neural radiance field out of one or more images of an object. But then, how can you generate a 3D render from text if the NeRF model only works with images? Well, we use Imagen, the other AI, to generate image variations from the ones it takes. And why do we do that instead of directly generating 3D models from text? Because that would require huge datasets of 3D data, along with their associated captions, for our model to be trained on, which would be very difficult to have. Instead, we use a pre-trained text-to-image model, with much less complex data to gather, and we adapt it to 3D. So it doesn’t require any 3D data to be trained on, only a pre-existing AI for generating images. It’s really cool how we can reuse powerful technologies for new tasks like this when interpreting the problem differently.
So if we start from the beginning, we have a NeRF model. As I explained in previous videos, this type of model takes images to predict the pixels in each novel view, creating a 3D model by learning from image pairs of the same object with different viewpoints. In our case, we do not start with images directly; we start with the text and sample a random view orientation we want to generate an image for. Basically, we are trying to create a 3D model by generating images of all possible angles a camera could cover looking around the object, guessing the pixel colors, densities, light reflections, etc., everything needed to make it look realistic.
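To make that concrete, here is a minimal NumPy sketch of the volume-rendering step a NeRF-style model uses to turn densities and colors along one camera ray into one pixel. The `field` function is a hypothetical stand-in for the learned network, so this only illustrates the compositing math, not DreamFusion’s actual renderer.

```python
import numpy as np

def field(points):
    """Toy stand-in for a learned radiance field: maps 3D points to
    (density, RGB). A real NeRF would be a neural network here."""
    density = np.exp(-np.linalg.norm(points, axis=-1))  # denser near the origin
    rgb = 0.5 + 0.5 * np.tanh(points)                   # arbitrary colors
    return density, rgb

def render_ray(origin, direction, near=0.5, far=4.0, n_samples=64):
    """Classic volume rendering: alpha-composite samples along one ray."""
    t = np.linspace(near, far, n_samples)
    points = origin + t[:, None] * direction            # sample points on the ray
    density, rgb = field(points)
    delta = np.diff(t, append=far)                      # distance between samples
    alpha = 1.0 - np.exp(-density * delta)              # opacity of each segment
    # Transmittance: how much light survives up to each sample.
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1] + 1e-10]))
    weights = alpha * trans
    return (weights[:, None] * rgb).sum(axis=0)         # composited pixel color

pixel = render_ray(np.array([0.0, 0.0, -3.0]), np.array([0.0, 0.0, 1.0]))
print(pixel)  # RGB of one pixel seen from this viewpoint
```

Repeating this for every pixel of a sampled viewpoint gives the rendering that the next step will critique.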
Thus, we start with a caption and add a small tweak to it depending on the random camera viewpoint we want to generate. For example, we may want to generate a front view, so we would append “front view” to the caption. On the other side, we use the same angle and camera parameters with our initial, untrained NeRF model to predict the first rendering. Then, we generate an image version guided by our caption and initial rendering with added noise, using Imagen, our pre-trained text-to-image model, which I further explain in my Imagen video if you are curious to see how it does that.
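A sketch of that caption tweak could look like the following. The paper’s view-dependent prompting appends phrases such as “front view” or “overhead view” based on the sampled camera angles; the exact thresholds below are illustrative assumptions, not the authors’ code.

```python
def augment_prompt(prompt, azimuth_deg, elevation_deg):
    """Append a view phrase to the caption based on the sampled camera pose.
    The angle thresholds here are illustrative, not the paper's exact values."""
    if elevation_deg > 60:                 # camera looking down from above
        view = "overhead view"
    elif abs(azimuth_deg) < 45:            # roughly facing the object
        view = "front view"
    elif abs(azimuth_deg) > 135:           # behind the object
        view = "back view"
    else:
        view = "side view"
    return f"{prompt}, {view}"

print(augment_prompt("a DSLR photo of a corgi", azimuth_deg=10, elevation_deg=20))
# -> "a DSLR photo of a corgi, front view"
```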
So our Imagen model will be guided by the text input as well as the current rendering of the object with added noise. Here, we add noise because this is what the Imagen model can take as input: it needs to be part of a noise distribution it understands. We use the model to generate a higher-quality image, subtract the image used to generate it, and remove the noise we manually added, using this result to guide and improve our NeRF model for the next step. We do all that to better understand where in the image the NeRF model should focus its attention to produce better results at the next step. And we repeat that until the 3D model is satisfying enough.
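In code, this “add noise, denoise, remove the noise we added” loop is the paper’s score distillation sampling. Below is a heavily simplified PyTorch sketch: `render_nerf` and `denoiser` are toy stand-ins for the real NeRF renderer and the frozen Imagen model, and the noise schedule is made up; only the shape of the update mirrors the method.

```python
import torch

# Stand-ins: a real system renders an image with a NeRF and scores it with a
# frozen text-conditioned diffusion model (Imagen). These toys only preserve
# the shapes so that the update rule below runs end to end.
nerf_params = torch.randn(3, 8, 8, requires_grad=True)  # pretend NeRF weights

def render_nerf(params, camera):
    """Pretend differentiable render of the scene from `camera`."""
    return torch.sigmoid(params)

def denoiser(noisy_image, t, text_embedding):
    """Pretend frozen diffusion model predicting the noise in its input."""
    return noisy_image * 0.1

optimizer = torch.optim.Adam([nerf_params], lr=1e-2)
text_embedding = torch.randn(16)  # pretend embedding of the caption

for step in range(100):
    camera = torch.rand(3)                        # sample a random viewpoint
    image = render_nerf(nerf_params, camera)      # current rendering

    t = torch.randint(1, 1000, ())                # random diffusion timestep
    alpha = 1.0 - t.float() / 1000.0              # toy noise schedule
    noise = torch.randn_like(image)
    noisy = alpha.sqrt() * image + (1.0 - alpha).sqrt() * noise  # add noise

    with torch.no_grad():                         # frozen critic: no gradients
        pred_noise = denoiser(noisy, t, text_embedding)

    # Score distillation: the gap between the noise the critic predicts and
    # the noise we actually added acts as an image-space edit, pushed straight
    # back into the NeRF while skipping gradients through the denoiser.
    grad = pred_noise - noise
    image.backward(gradient=grad)
    optimizer.step()
    optimizer.zero_grad()
```

Note that no gradient ever flows through the denoiser itself, which is exactly the “frozen critic” role mentioned below.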
You can then export this model to a mesh and use it in a scene of your choice.
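One common way to do that export (the paper mentions converting the learned NeRF to a mesh with marching cubes) is to sample the density on a 3D grid and extract a surface from it. Here is a small sketch with scikit-image, where `density_fn` is a hypothetical stand-in for querying the trained NeRF:

```python
import numpy as np
from skimage import measure  # pip install scikit-image

def density_fn(points):
    """Hypothetical stand-in for the trained NeRF's density; here a simple
    sphere of radius 0.5 so the script runs on its own."""
    return np.maximum(0.0, 0.5 - np.linalg.norm(points, axis=-1))

# Sample the density on a regular 3D grid around the object.
n = 64
xs = np.linspace(-1.0, 1.0, n)
grid = np.stack(np.meshgrid(xs, xs, xs, indexing="ij"), axis=-1)
volume = density_fn(grid.reshape(-1, 3)).reshape(n, n, n)

# Marching cubes extracts the surface at a chosen density threshold.
verts, faces, normals, values = measure.marching_cubes(volume, level=0.1)
print(f"mesh with {len(verts)} vertices and {len(faces)} faces")
```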
And before some of you ask: no, you don’t have to retrain the image generator model. As they say so well in the paper, it just acts as a frozen critic that predicts image-space edits. Et voilà! This is how DreamFusion generates 3D renderings from text inputs. If you’d like a deeper understanding of the approach, have a look at my videos covering NeRFs and Imagen. I also invite you to read their paper for more details on this specific method. Thank you for watching the whole video, and I will see you next week with another amazing paper!

Written by whatsai | I explain Artificial Intelligence terms and news to non-experts.
Published by HackerNoon on 2022/10/16