Neural scene representation from a single image is a very complex problem. The end goal is to be able to take a picture of a real-life object and translate this picture into a 3D scene. It implies that the model understands a whole three-dimensional, real-life scene using information from a single picture.
[1] Rematas, K., Martin-Brualla, R., and Ferrari, V., "ShaRF: Shape-conditioned Radiance Fields from a Single View", (2021), https://arxiv.org/abs/2102.08860
[2] Project website and link to code for ShaRF: http://www.krematas.com/sharf/index.html
[3] Mildenhall, B., et al., "NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis", (2020), https://www.matthewtancik.com/nerf
►Instagram: https://www.instagram.com/whats_ai/
►LinkedIn: https://www.linkedin.com/in/whats-ai/
►Twitter: https://twitter.com/Whats_AI
►Facebook: https://www.facebook.com/whats.artifi...
Join Our Discord channel, Learn AI Together:
►https://discord.gg/learnaitogether
The best courses in AI & Guide+Repository on how to start:
►https://www.omologapps.com/whats-ai
►https://github.com/louisfb01/start-ma...
Become a member of the YouTube community and support my work:
https://www.youtube.com/channel/UCUzG...
0:00 - Hey! Tap the Thumbs Up button and Subscribe. You'll learn a lot of cool stuff, I promise.
0:28 - Paper explanation & examples
4:42 - Conclusion
(this has been auto-generated by YouTube and may have inaccuracies)
Just imagine how cool it would be to take a picture of an object and have it in 3D, ready to insert into the movie or video game you are creating. This is what Google is working on.

This is What's AI, and I share artificial intelligence news every week. If you are new to the channel and want to stay up to date, please consider subscribing so you don't miss any further news.
Neural scene representation from a single image is a really complex problem. The end goal is to be able to take a picture of a real-life object and translate this picture into a 3D scene. It implies that the model understands a whole three-dimensional, real-life scene using information from a single picture, and this is sometimes hard even for humans, where colors or shadows can trick our eyes. And not only that: the model needs to understand the depth in the image, which is already a challenging task, but it also needs to reconstruct the objects with the right materials and textures so they can look real. You can just imagine how cool it would be to take a picture of an object and have it in 3D, to insert into the movie or video game you are creating, or into a 3D scene for an illustration. Well, I am not the only one thinking about all the possibilities this type of model could create.
Google researchers are looking into this in their new paper, "ShaRF: Shape-conditioned Radiance Fields from a Single View", and you have been seeing the results they can produce since the start of the video. Note that for each of these results, they only used one picture, taken from any angle, which was then sent to the model to produce these results. That is incredible to me when you think of the complexity of the task and all the possible parameters to take into consideration just regarding the initial picture, such as the lighting, the resolution, the size, the angle or viewpoint, the location of the object in the image, etc. If you're like me, you may be wondering: how are they doing that? Okay, so I lied a little.
They do not only take the image as input to the network; they also take the camera parameters to help the process. The algorithm learns the function that converts 3D points and 3D viewpoints into an RGB color, as well as a density value for each of these points, providing enough information to render the scene from any viewpoint later on. This is called a radiance field: it takes a position and a viewing direction as inputs and outputs a color and a volume density value for each of these points. It's very similar to what NeRF does, which is a paper I already covered on my channel.
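To make the idea of a radiance field concrete, here is a minimal sketch of such a function as a small PyTorch MLP. The layer sizes and the missing positional encoding are my own simplifications for illustration, not the exact architecture from the paper:

```python
import torch
import torch.nn as nn

class RadianceField(nn.Module):
    """Toy radiance field: (3D point, viewing direction) -> (RGB color, density)."""

    def __init__(self, hidden=256):
        super().__init__()
        # 3 coordinates + 3 direction components as input; NeRF-style models
        # usually apply a positional encoding first (omitted here for brevity).
        self.mlp = nn.Sequential(
            nn.Linear(6, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),  # 3 RGB channels + 1 density value
        )

    def forward(self, points, directions):
        x = torch.cat([points, directions], dim=-1)
        out = self.mlp(x)
        rgb = torch.sigmoid(out[..., :3])   # colors in [0, 1]
        sigma = torch.relu(out[..., 3:])    # non-negative volume density
        return rgb, sigma

# Query the field at a batch of 1024 random points and viewing directions.
field = RadianceField()
rgb, sigma = field(torch.rand(1024, 3), torch.rand(1024, 3))
```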
Basically, in the NeRF case, the radiance field function is learned by a neural network trained on many images of each scene, and the output is then rendered. This implies that they need a large number of images for each scene, as well as training a different network for each of these scenes, making the process very costly and inefficient. So the goal is to find a better way to obtain this needed radiance field, composed of RGB and density values, to then render the object in 3D from novel views.
In order to have the needed information to create such a radiance field, they use what they call a shape network, which maps a latent code of the image into a 3D shape made of voxels. Voxels are just the same as pixels, but in three-dimensional space, and the latent code in question is basically all the useful information about the shape of the object in the image. This condensed shape information is found using a neural network composed of fully connected layers followed by convolutions, which are powerful architectures for computer vision applications, since convolutions have two main properties: they are invariant to translations and they exploit the local structure of images. The network then takes this latent code to produce a first 3D shape estimation. You might think that we are done here, but that's not the case; this is just the first step.
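To give an idea of what such a shape network can look like, here is a rough sketch that expands a latent shape code into a voxel occupancy grid using fully connected layers followed by 3D transposed convolutions. The latent size, grid resolution, and layer widths are illustrative assumptions, not the exact configuration used in ShaRF:

```python
import torch
import torch.nn as nn

class ShapeNetwork(nn.Module):
    """Toy shape network: latent shape code -> coarse voxel occupancy grid."""

    def __init__(self, latent_dim=128):
        super().__init__()
        # Fully connected layers expand the latent code into a small 3D feature volume.
        self.fc = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, 64 * 4 * 4 * 4), nn.ReLU(),
        )
        # 3D transposed convolutions upsample the 4^3 volume to a 32^3 voxel grid.
        self.deconv = nn.Sequential(
            nn.ConvTranspose3d(64, 32, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose3d(32, 16, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose3d(16, 1, kernel_size=4, stride=2, padding=1),
        )

    def forward(self, z):
        x = self.fc(z).view(-1, 64, 4, 4, 4)
        occupancy = torch.sigmoid(self.deconv(x))  # value in [0, 1] for each voxel
        return occupancy

voxels = ShapeNetwork()(torch.randn(1, 128))
print(voxels.shape)  # torch.Size([1, 1, 32, 32, 32])
```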
Then, as we discussed, we need the radiance field of this representation, obtained here with an appearance network. Here again, it uses a similar latent code, but for the appearance, together with the 3D shape we just found, as inputs to produce this radiance field using another network, referred to here as F.
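Conceptually, this second stage conditions each radiance field query on the appearance code and on the estimated shape. Here is a rough sketch of that idea; passing the shape information as a per-point feature, and all the dimensions, are my own simplifications, not the exact formulation of the network F in the paper:

```python
import torch
import torch.nn as nn

class ConditionedRadianceField(nn.Module):
    """Toy version of F: (point, direction, appearance code, shape feature) -> (RGB, density)."""

    def __init__(self, appearance_dim=128, shape_feat_dim=1, hidden=256):
        super().__init__()
        in_dim = 3 + 3 + appearance_dim + shape_feat_dim
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),  # 3 RGB channels + 1 density value
        )

    def forward(self, points, dirs, appearance_code, shape_feat):
        # Broadcast the per-object appearance code to every queried point.
        code = appearance_code.expand(points.shape[0], -1)
        x = torch.cat([points, dirs, code, shape_feat], dim=-1)
        out = self.mlp(x)
        return torch.sigmoid(out[..., :3]), torch.relu(out[..., 3:])

# Example: 1024 points, each with a scalar occupancy value looked up from the voxel grid.
f = ConditionedRadianceField()
rgb, sigma = f(torch.rand(1024, 3), torch.rand(1024, 3),
               torch.randn(1, 128), torch.rand(1024, 1))
```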
This radiance field can finally be used, together with the camera parameters, to produce the final render of the object from novel views.
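Producing that final render from a radiance field is typically done with volume rendering: rays are cast through each pixel using the camera parameters, the field is queried at samples along each ray, and the sampled colors are alpha-composited using the densities. Below is a minimal sketch of that compositing step for a single ray, assuming the same (point, direction) -> (RGB, density) interface as above; the sampling scheme and names are illustrative, not the paper's exact renderer:

```python
import torch

def render_ray(field, origin, direction, near=0.5, far=3.0, n_samples=64):
    """Alpha-composite colors along one camera ray through a radiance field."""
    t = torch.linspace(near, far, n_samples)              # sample depths along the ray
    points = origin + t[:, None] * direction              # 3D positions of the samples
    dirs = direction.expand(n_samples, 3)                  # same viewing direction for every sample
    rgb, sigma = field(points, dirs)                       # query the field: (N, 3) colors, (N, 1) densities
    delta = t[1] - t[0]                                    # spacing between consecutive samples
    alpha = 1.0 - torch.exp(-sigma.squeeze(-1) * delta)    # opacity contributed by each segment
    trans = torch.cumprod(1.0 - alpha + 1e-10, dim=0)      # light surviving up to each sample
    trans = torch.cat([torch.ones(1), trans[:-1]])         # the first sample is fully visible
    weights = alpha * trans
    return (weights[:, None] * rgb).sum(dim=0)             # composited pixel color

# Example with a dummy field returning a constant color and density for every point.
dummy_field = lambda pts, dirs: (torch.full((pts.shape[0], 3), 0.5),
                                 torch.ones(pts.shape[0], 1))
pixel = render_ray(dummy_field, torch.zeros(3), torch.tensor([0.0, 0.0, 1.0]))
```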
This was just an overview of this new paper. I strongly recommend reading the paper, linked in the description below. The code is unfortunately not available right now, but I contacted one of the authors, and he said that it will be available in a couple of weeks, so stay tuned for that. Please leave a like if you made it this far into the video, and since over 80 percent of you are not subscribed yet, please consider subscribing to the channel so you don't miss any further news. Thank you for watching!