# ShaRF: Create a 3D Model of an Object Using Just a Single Image

by @whatsai · February 20th, 2021

ShaRF stands for Shape-conditioned Radiance Fields from a Single View. The goal is to take a picture of a real-life object, and translate this into a 3D scene.

Neural scene representation from a single image is a really complex problem. The "end goal" is to be able to take a picture of a real-life object, and translate this picture into a 3D scene. It implies that the model understands a whole 3-dimensional scene, or real-life scene, using information from a single picture.

## References:

[1] Rematas, K., Martin-Brualla, R., and Ferrari, V., "ShaRF: Shape-conditioned Radiance Fields from a Single View", (2021), https://arxiv.org/abs/2102.08860

[2] Project website and link to code for ShaRF: http://www.krematas.com/sharf/index.html

[3] Mildenhall, B., et al., "NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis", (2020), https://www.matthewtancik.com/nerf

## Follow me for more AI content:

►Instagram: https://www.instagram.com/whats_ai/

►Discord: https://discord.gg/learnaitogether

The best courses in AI & Guide+Repository on how to start:
https://www.omologapps.com/whats-ai
https://github.com/louisfb01/start-ma...

Become a member of the YouTube community and support my work.

## Chapters:

0:00​ - Hey! Tap the Thumbs Up button and Subscribe. You'll learn a lot of cool stuff, I promise.

0:28​ - Paper explanation & examples

4:42​ - Conclusion

## Video Transcript:

(this has been auto-generated by YouTube and may have inaccuracies)

00:00

Just imagine how cool it would be to take a picture of an object and have it in 3D to insert into the movie or video game you are creating. This is what Google is working on.

00:17

This is What's AI, and I share artificial intelligence news every week. If you are new to the channel and want to stay up to date, please consider subscribing so you don't miss any further news.

00:28

Neural scene representation from a single image is a really complex problem. The end goal is to be able to take a picture of a real-life object and translate this picture into a 3D scene. It implies that the model understands a whole three-dimensional, real-life scene using information from a single picture, and this is sometimes hard even for humans, where colors or shadows trick our eyes. And not only that: the model needs to understand the depth in the image, which is already a challenging task, but it also needs to reconstruct the objects with the right materials and textures so they can look real. You can just imagine how cool it would be to take a picture of an object and have it in 3D to insert into the movie or video game you are creating, or into a 3D scene for an illustration.

01:19

Well, I am not the only one thinking about all the possibilities this type of model could create. Google researchers dug into this in their new paper, ShaRF: Shape-conditioned Radiance Fields from a Single View, and you have been seeing the results they could produce since the start of the video. Note that for each of these results, they only used one picture taken from any angle. It was then sent to the model to produce these results, which are incredible to me when you think of the complexity of the task and all the possible parameters to take into consideration just regarding the initial picture, such as the lighting, the resolution, the size, the angle or viewpoint, the location of the object in the image, etc.

02:01

If you're like me, you may be wondering: how are they doing that? Okay, so I lied a little. They do not only take the image as input to the network; they also take the camera parameters to help the process. The algorithm learns the function that converts 3D points and viewing directions into an RGB color, as well as a density value for each of these points, providing enough information to render the scene from any viewpoint later on. This is called a radiance field: it takes positions and viewing directions as inputs and outputs a color and a volume density value for each of these points. It's very similar to what NeRF does, which is a paper I already covered on my channel.
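To make the idea concrete, here is a minimal sketch of what a radiance field function looks like: a function mapping a 3D position and a viewing direction to an RGB color and a volume density. The toy two-layer MLP below uses random weights and made-up sizes purely for illustration; the real networks in NeRF and ShaRF are much larger and are trained, not random.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy weights for a 2-layer MLP; a real model learns these from images.
W1 = rng.normal(size=(6, 32))   # input: 3D position + 3D view direction
W2 = rng.normal(size=(32, 4))   # output: RGB (3 values) + density (1 value)

def radiance_field(position, direction):
    """Map a 3D point and a viewing direction to (rgb, density)."""
    x = np.concatenate([position, direction])
    h = np.maximum(0.0, x @ W1)              # ReLU hidden layer
    out = h @ W2
    rgb = 1.0 / (1.0 + np.exp(-out[:3]))     # sigmoid keeps colors in [0, 1]
    density = np.logaddexp(0.0, out[3])      # softplus keeps density >= 0
    return rgb, density

rgb, sigma = radiance_field(np.array([0.1, -0.2, 0.5]),
                            np.array([0.0, 0.0, 1.0]))
```

Querying this function at many points along each camera ray is what later lets us render the scene from any viewpoint.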

02:46

Basically, in the NeRF case, the radiance field function is approximated by a neural network trained on the images of each scene. This implies that they need a large number of images for each scene, as well as training a different network for each of these scenes, making the process very costly and inefficient. So the goal is to find a better way to obtain this needed radiance field, composed of RGB and density values, to then render the object in 3D from novel views.

03:16

In order to have the information needed to create such a radiance field, they used what they call a shape network, which maps a latent code of the image into a 3D shape made of voxels. Voxels are just the same as pixels, but in three-dimensional space, and the latent code in question is basically all the useful information about the shape of the object in the image. This condensed shape information is found using a neural network composed of fully connected layers followed by convolutions, which are powerful architectures for computer vision applications, since convolutions have two main properties: they are invariant to translations, and they use the local properties of images. The network then takes this latent code to produce a first 3D shape estimation. You might think that we are done here, but that's not the case; this is just the first step.
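The shape-network idea (latent code in, voxel grid out) can be sketched in a few lines. This is only a stand-in with one fully connected layer and random weights; the paper's actual decoder, its layer sizes, and the latent dimension are all different, and `LATENT_DIM`/`VOXELS` here are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(1)

LATENT_DIM = 16   # hypothetical size of the shape latent code
VOXELS = 8        # hypothetical resolution of the output voxel grid

# One fully connected layer standing in for the FC + convolution stack.
W = rng.normal(size=(LATENT_DIM, VOXELS ** 3)) * 0.1

def shape_network(latent_code):
    """Decode a shape latent code into a VOXELS^3 occupancy grid in (0, 1)."""
    logits = latent_code @ W
    occupancy = 1.0 / (1.0 + np.exp(-logits))   # sigmoid per voxel
    return occupancy.reshape(VOXELS, VOXELS, VOXELS)

z = rng.normal(size=LATENT_DIM)   # in ShaRF, z is inferred from the input image
grid = shape_network(z)
```

Each cell of `grid` says how likely that small cube of space is to be occupied by the object, which is exactly the coarse 3D shape the later stages build on.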

04:09

Then, as we discussed, we need the appearance of this representation, using here an appearance network. Here again, it uses a similar latent code, but for the appearance, as well as the 3D shape we just found, as inputs to produce the radiance field using another network, referred to here as f. Then this radiance field can finally be used with the camera parameters to produce the final render of the object from novel views.

04:42

This was just an overview of this new paper. I strongly recommend reading the paper linked in the description below. The code is unfortunately not available right now, but I contacted one of the authors, and he said that it will be available in a couple of weeks, so stay tuned for that. Please leave a like if you made it this far into the video, and since over 80 percent of you are not subscribed yet, please consider subscribing to the channel so you don't miss any further news. Thank you for watching!
