I explain Artificial Intelligence terms and news to non-experts.
The next step for view synthesis: Perpetual View Generation, where the goal is to take a single image, fly into it, and explore the landscape!
Read the full article: https://www.louisbouchard.me/infinite-nature/
Paper: Liu, A., Tucker, R., Jampani, V., Makadia, A., Snavely, N. and Kanazawa, A., 2020. Infinite Nature: Perpetual View Generation of Natural
Scenes from a Single Image, https://arxiv.org/pdf/2012.09855.pdf
Project link: https://infinite-nature.github.io/
Colab demo: https://colab.research.google.com/github/google-research/google-research/blob/master/infinite_nature/infinite_nature_demo.ipynb#scrollTo=sCuRX1liUEVM
This week's paper is about a new task called "Perpetual View Generation," where the goal
is to take a single image, fly into it, and explore the landscape.
This is the first solution to this problem, and it is extremely impressive considering
we feed only one image into the network, which then generates what it would look like
to fly into the scene like a bird.
Of course, this task is extremely complex and will improve over time.
As Two Minute Papers would say, imagine just a couple of papers down the line how
useful this technology could be for video games or flight simulators!
I'm amazed to see how well it already works, even though this is the very paper
introducing the task, especially considering how complex that task is.
And not only because it has to generate new viewpoints, as GANverse3D does, which
I covered in a previous video, but also because it has to generate a new image at
each frame, and once you pass a couple of dozen frames, there is close to nothing
left of the original image to use.
And yes, this can be done over hundreds of frames while still looking a lot better than
current view synthesis approaches.
Let's see how they can generate an entire bird's-eye-view video in any chosen direction
from a single picture, and how you can try it yourself right now without having to set up anything!
To do that, they have to use the geometry of the image, so they first need to produce
a disparity map of the image.
This is done using a state-of-the-art network called MiDaS, which I won't go into
detail about here, but this is the output it gives.
This disparity map is basically an inverse depth map, informing the network of the depths
inside the scene.
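To make "inverse depth" concrete, here is a tiny NumPy sketch of my own (not the authors' code): disparity is simply the reciprocal of depth, so nearby objects get large disparity values and distant ones get small values. The depth values are made up for illustration.

```python
import numpy as np

# Hypothetical depth map in meters (in the paper, MiDaS predicts this
# kind of information directly as disparity from a single image).
depth = np.array([[1.0, 2.0],
                  [4.0, 10.0]])

# Disparity is inverse depth: near pixels -> large values, far -> small.
eps = 1e-6  # guard against division by zero on invalid pixels
disparity = 1.0 / (depth + eps)

print(disparity)
```

This is why a disparity map "informs the network of the depths inside the scene": it carries exactly the same geometric information as a depth map, just inverted.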
Then, we enter the real first step of their technique, which is the renderer.
The goal of this renderer is to generate a new view based on the old view.
This new view will be the next frame, and as you've understood, the old view is the
input image, at least for the first iteration.
This is done using a differentiable renderer.
"Differentiable" simply means we can train it with backpropagation, just as we
traditionally do with conventional deep networks.
This renderer takes the image and disparity map to produce a three-dimensional mesh
representing the scene.
Then, we simply use this 3D mesh to generate an image from a novel viewpoint, P1 in
this case.
This gives us this amazing new picture that looks just a bit zoomed in, but it is not simply a zoom.
There are some pink marks on the rendered image and black marks on the disparity map,
as you can see.
They correspond to the occluded regions and the regions outside the field of view of
the previous image used as input to the renderer, since the renderer only re-projects
the view it was given and is unable to invent unseen details.
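To illustrate where those pink and black marks come from, here is a toy sketch of my own, not the paper's renderer: after warping an image to a new viewpoint, some target pixels receive no source pixel at all, and a validity mask flags them as holes for the refinement step.

```python
import numpy as np

h, w = 4, 6
src = np.random.rand(h, w, 3)         # "old view" (RGB)
warped = np.zeros((h, w, 3))          # rendered new view
valid = np.zeros((h, w), dtype=bool)  # True where a source pixel landed

# Toy "warp": shift the image one pixel left, simulating camera motion.
warped[:, :-1] = src[:, 1:]
valid[:, :-1] = True

# The last column was never written: these are the missing regions
# (the pink/black marks) that the refinement network must invent.
holes = ~valid
print(holes.sum(), "pixels to inpaint out of", h * w)
```

In the real system the warp follows the 3D mesh geometry rather than a simple shift, but the principle is the same: rendering alone leaves holes.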
This leads us to quite a problem: how can we have a complete and realistic image if
we do not know what goes there?
Well, we can use another network that will also take this new disparity map and image
as input to 'refine' it.
This other network, called SPADE, is also a state-of-the-art network, but for
conditional image synthesis.
It is "conditional" because we need to give our network some conditions, which in
this case are the pink and black missing parts.
We basically send this faulty image to the second network to fill in the holes and add the missing details.
You can see this SPADE network as a GAN architecture where the image is first encoded into a latent
code that will give us the style of the image.
Then, this code is decoded to generate a new version of the initial image, simply filling
the missing parts with new information following the same style present in the encoded information.
You now have your new frame and its inverse depth map.
You can now simply repeat the process over and over to get all future frames, which now
looks like this.
Using each output as the input for the next iteration, you can produce an infinite
number of frames, always following the desired trajectory and the previous frame's context!
As you know, such powerful algorithms frequently need a lot of data and annotations
to be trained on, and this one is no exception.
They needed aerial drone footage of nature, which they collected from YouTube,
manually curated, and pre-processed to create their own dataset.
Fortunately for other researchers wanting to attack this challenge, you don't have to
do the same thing since they released this dataset of aerial footage of natural coastal
scenes used to train their algorithm.
It is available for download on their project page, which is linked in the description below.
As I mentioned, you can even try it yourself as they made the code publicly available,
but they also created a demo you can try right now on Google Colab.
The link is in the description below.
You just have to run the first few cells like this, which will install the code and dependencies,
load their model, and there you go.
You can now free-fly around the images they have and even upload your own!
Of course, all the steps I just mentioned are already implemented in the demo.
Simply run the code and enjoy!
You can find the article covering this paper on my newly created website, along with
our Discord community, my guide to learning machine learning, and more exciting stuff
I will be sharing there.
Feel free to become a free member and get notified of new articles I share!
Congratulations to the winners of the NVIDIA GTC giveaway, all appearing on screen!
You should have received an email from me with the DLI code!
Thank you for watching.