# ShaRF: Create a 3D Model of an Object Using Just a Single Image

by @whatsai · February 20th, 2021

ShaRF stands for Shape-conditioned Radiance Fields from a Single View. The goal is to take a picture of a real-life object, and translate this into a 3D scene.

Neural scene representation from a single image is a really complex problem. The "end goal" is to be able to take a picture of a real-life object, and translate this picture into a 3D scene. It implies that the model understands a whole 3-dimensional scene, or real-life scene, using information from a single picture.

## References:

[1] Rematas, K., Martin-Brualla, R., and Ferrari, V., "ShaRF: Shape-conditioned Radiance Fields from a Single View", (2021), https://arxiv.org/abs/2102.08860

[2] Project website and link to code for ShaRF: http://www.krematas.com/sharf/index.html

[3] Mildenhall, B., et al., "NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis", (2020), https://www.matthewtancik.com/nerf

## Follow me for more AI content:

►Instagram: https://www.instagram.com/whats_ai/

►Discord: https://discord.gg/learnaitogether

The best courses in AI & Guide+Repository on how to start:
https://www.omologapps.com/whats-ai
https://github.com/louisfb01/start-ma...

Become a member of the YouTube community and support my work.

## Chapters:

0:00​ - Hey! Tap the Thumbs Up button and Subscribe. You'll learn a lot of cool stuff, I promise.

0:28​ - Paper explanation & examples

4:42​ - Conclusion

## Video Transcript:

(this has been auto-generated by YouTube and may have inaccuracies)

00:00

Just imagine how cool it would be to take a picture of an object and have it in 3D to insert into the movie or video game you are creating. This is what Google is working on.

00:17

This is What's AI, and I share artificial intelligence news every week. If you are new to the channel and want to stay up to date, please consider subscribing so you don't miss any further news.

00:28

Neural scene representation from a single image is a really complex problem. The end goal is to be able to take a picture of a real-life object and translate this picture into a 3D scene. It implies that the model understands a whole three-dimensional, real-life scene using information from a single picture, and this is sometimes hard even for humans, where colors or shadows trick our eyes. And not only that: the model needs to understand the depth in the image, which is already a challenging task, but it also needs to reconstruct the objects with the right materials and textures so they can look real. You can just imagine how cool it would be to take a picture of an object and have it in 3D to insert into the movie or video game you are creating, or into a 3D scene for an illustration.

01:19

Well, I am not the only one thinking about all the possibilities this type of model could create. Google researchers dug into this in their new paper, ShaRF: Shape-conditioned Radiance Fields from a Single View, and you have been seeing the results they could produce since the start of the video. Note that for each of these results, they only used one picture taken from any angle. It was then sent to the model to produce these results, which are incredible to me when you think of the complexity of the task and all the possible parameters to take into consideration just regarding the initial picture, such as the lighting, the resolution, the size, the angle or viewpoint, the location of the object in the image, etc.

02:01

If you're like me, you may be wondering: how are they doing that? Okay, so I lied a little. They do not only take the image as input to the network; they also take the camera parameters to help the process. The algorithm learns the function that converts 3D points and viewing directions into an RGB color, as well as a density value for each of these points, providing enough information to render the scene from any viewpoint later on. This is called a radiance field: it takes positions and viewing directions as inputs and outputs a color and a volume density value for each of these points. It's very similar to what NeRF does, which is a paper I already covered on my channel.
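To make the idea concrete, here is a minimal sketch of what a radiance field function looks like: a function mapping a 3D position and a viewing direction to an RGB color and a volume density. The toy two-layer MLP below uses random weights and made-up sizes purely for illustration; the real networks in NeRF and ShaRF are much larger and are trained, not random.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy weights for a 2-layer MLP; a real model learns these from images.
W1 = rng.normal(size=(6, 32))   # input: 3D position + 3D view direction
W2 = rng.normal(size=(32, 4))   # output: RGB (3 values) + density (1 value)

def radiance_field(position, direction):
    """Map a 3D point and a viewing direction to (rgb, density)."""
    x = np.concatenate([position, direction])
    h = np.maximum(0.0, x @ W1)              # ReLU hidden layer
    out = h @ W2
    rgb = 1.0 / (1.0 + np.exp(-out[:3]))     # sigmoid keeps colors in [0, 1]
    density = np.logaddexp(0.0, out[3])      # softplus keeps density >= 0
    return rgb, density

rgb, sigma = radiance_field(np.array([0.1, -0.2, 0.5]),
                            np.array([0.0, 0.0, 1.0]))
```

Querying this function at many points along each camera ray is what later lets us render the scene from any viewpoint.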

02:46

Basically, in the NeRF case, the radiance field function is approximated by a neural network trained on the images of each scene. This implies that they need a large number of images for each scene, as well as training a different network for each of these scenes, making the process very costly and inefficient. So the goal is to find a better way to obtain this needed radiance field, composed of RGB and density values, to then render the object in 3D from novel views.

03:16

In order to have the information needed to create such a radiance field, they used what they call a shape network, which maps a latent code of the image into a 3D shape made of voxels. Voxels are just the same as pixels, but in three-dimensional space, and the latent code in question is basically all the useful information about the shape of the object in the image. This condensed shape information is found using a neural network composed of fully connected layers followed by convolutions, which are powerful architectures for computer vision applications, since convolutions have two main properties: they are invariant to translations, and they use the local properties of images. The network then takes this latent code to produce a first 3D shape estimation. You might think that we are done here, but that's not the case; this is just the first step.
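The shape-network idea (latent code in, voxel grid out) can be sketched in a few lines. This is only a stand-in with one fully connected layer and random weights; the paper's actual decoder, its layer sizes, and the latent dimension are all different, and `LATENT_DIM`/`VOXELS` here are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(1)

LATENT_DIM = 16   # hypothetical size of the shape latent code
VOXELS = 8        # hypothetical resolution of the output voxel grid

# One fully connected layer standing in for the FC + convolution stack.
W = rng.normal(size=(LATENT_DIM, VOXELS ** 3)) * 0.1

def shape_network(latent_code):
    """Decode a shape latent code into a VOXELS^3 occupancy grid in (0, 1)."""
    logits = latent_code @ W
    occupancy = 1.0 / (1.0 + np.exp(-logits))   # sigmoid per voxel
    return occupancy.reshape(VOXELS, VOXELS, VOXELS)

z = rng.normal(size=LATENT_DIM)   # in ShaRF, z is inferred from the input image
grid = shape_network(z)
```

Each cell of `grid` says how likely that small cube of space is to be occupied by the object, which is exactly the coarse 3D shape the later stages build on.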

04:09

Then, as we discussed, we need the appearance of this representation, using here an appearance network. Here again, it uses a similar latent code, but for the appearance, as well as the 3D shape we just found, as inputs to produce the radiance field using another network, referred to here as f. Then this radiance field can finally be used with the camera parameters to produce the final render of the object from novel views.

04:42

This was just an overview of this new paper. I strongly recommend reading the paper linked in the description below. The code is unfortunately not available right now, but I contacted one of the authors, and he said that it will be available in a couple of weeks, so stay tuned for that. Please leave a like if you made it this far into the video, and since over 80 percent of you are not subscribed yet, please consider subscribing to the channel so you don't miss any further news. Thank you for watching!
