The next step for view synthesis: Perpetual View Generation, where the goal is to take a single image, fly into it, and explore the landscape!
Read the full article: https://www.louisbouchard.me/infinite-nature/
Paper: Liu, A., Tucker, R., Jampani, V., Makadia, A., Snavely, N. and Kanazawa, A., 2020. Infinite Nature: Perpetual View Generation of Natural Scenes from a Single Image, https://arxiv.org/pdf/2012.09855.pdf
Project link: https://infinite-nature.github.io/
Code: https://github.com/google-research/google-research/tree/master/infinite_nature
Colab demo: https://colab.research.google.com/github/google-research/google-research/blob/master/infinite_nature/infinite_nature_demo.ipynb#scrollTo=sCuRX1liUEVM
This week's paper is about a new task called "Perpetual View Generation," where the goal is to take a single image, fly into it, and explore the landscape. This is the first solution for this problem, and it is extremely impressive considering that we only feed one image into the network, and it can generate what it would look like to fly into the scene like a bird. Of course, this task is extremely complex, and results will improve over time. As Two Minute Papers would say, imagine just a couple of papers down the line how useful this technology could be for video games or flight simulators! I'm amazed to see how well it already works, even though this is the paper introducing this new task, and especially considering how complex the task is.
It is complex not only because it has to generate new viewpoints, as GANverse3D does, which I covered in a previous video, but also because it has to generate a new image at each frame, and once you pass a couple of dozen frames, there is close to nothing left of the original image to use. And yes, this can be done over hundreds of frames while still looking a lot better than current view synthesis approaches. Let's see how they can generate an entire fly-through video in the direction you want from a single picture, and how you can try it yourself right now without having to set anything up!
To do that, they have to use the geometry of the image, so they first need to produce a disparity map of it. This is done using a state-of-the-art depth estimation network called MiDaS, which I won't go into here, but this is the output it gives. This disparity map is basically an inverse depth map, informing the network of the depths inside the scene.
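To make that relationship concrete, here is a minimal sketch, assuming a NumPy depth map, of how depth and disparity relate; in the actual pipeline, MiDaS predicts the disparity map directly from the RGB image, so this helper is only illustrative.

```python
import numpy as np

def depth_to_disparity(depth, eps=1e-6):
    """Turn a depth map into a normalized disparity (inverse depth) map.

    Only an illustration of the depth/disparity relationship; in the
    paper's pipeline MiDaS predicts such a map directly from the image.
    """
    disparity = 1.0 / np.maximum(depth, eps)  # nearby pixels -> large values
    disparity -= disparity.min()              # normalize to [0, 1]
    disparity /= max(disparity.max(), eps)
    return disparity
```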
Then, we enter the real first step of their technique: the renderer. The goal of this renderer is to generate a new view based on the old view. This new view will be the next frame, and as you understood, the old view is the input image. This is done using a differentiable renderer. "Differentiable" just means we can use backpropagation to train it, just like we traditionally do with conventional deep networks. The renderer takes the image and the disparity map and produces a three-dimensional mesh representing the scene. Then, we simply use this 3D mesh to generate an image from a novel viewpoint, P1 in this case.
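To give a feel for what rendering from a novel viewpoint involves, here is a heavily simplified forward-warping sketch in NumPy. This is not the paper's differentiable mesh renderer: it only unprojects each pixel using its disparity, moves the camera, and splats the pixels into the new view, returning a validity mask for everything the old view could not see. The intrinsics matrix `K` and the 4x4 relative camera transform `pose` are assumed inputs for the illustration.

```python
import numpy as np

def warp_to_new_view(image, disparity, K, pose, eps=1e-6):
    """Naive forward warp of `image` into a new camera pose.

    A rough stand-in for the paper's differentiable mesh renderer: pixels
    are back-projected with their disparity (inverse depth), transformed
    by `pose` (old-camera-to-new-camera, 4x4), and re-projected with the
    intrinsics `K`. Returns the warped image and a mask of valid pixels;
    the holes (~mask) are exactly the regions, discussed next, that the
    refinement network has to fill in.
    """
    h, w = disparity.shape
    ys, xs = np.mgrid[0:h, 0:w]
    depth = 1.0 / np.maximum(disparity, eps)
    # Back-project every pixel into 3D camera coordinates.
    pix = np.stack([xs, ys, np.ones_like(xs)], -1).reshape(-1, 3).T.astype(float)
    pts = (np.linalg.inv(K) @ pix) * depth.reshape(1, -1)
    pts = np.vstack([pts, np.ones((1, pts.shape[1]))])
    # Move the points into the new camera's frame and project them.
    cam = (pose @ pts)[:3]
    proj = K @ cam
    u = np.round(proj[0] / np.maximum(proj[2], eps)).astype(int)
    v = np.round(proj[1] / np.maximum(proj[2], eps)).astype(int)
    inside = (u >= 0) & (u < w) & (v >= 0) & (v < h) & (cam[2] > 0)
    warped = np.zeros_like(image)
    mask = np.zeros((h, w), dtype=bool)
    colors = image.reshape(-1, image.shape[-1])
    warped[v[inside], u[inside]] = colors[inside]  # no z-buffering: sketch only
    mask[v[inside], u[inside]] = True
    return warped, mask
```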
This gives us this amazing new picture that looks just a bit zoomed in, but it is not simply zoomed in. There are some pink marks on the rendered image and black marks on the disparity map, as you can see. They correspond to regions that were occluded or outside the field of view in the previous image used as input to the renderer, since the renderer can only re-project what it has already seen and is unable to invent unseen details. This leads us to quite a problem: how can we have a complete and realistic image if we do not know what goes there?
Well, we can use another network that will also take this new disparity map and image as input to 'refine' it. This other network, called SPADE, is also a state-of-the-art network, but for conditional image synthesis. It is a conditional image synthesis network because we need to give our network some conditions, which in this case are the pink and black missing parts. We basically send this faulty image to the second network to fill in the holes and add the necessary details. You can see this SPADE network as a GAN architecture where the image is first encoded into a latent code that gives us the style of the image. Then, this code is decoded to generate a new version of the initial image, filling in the missing parts with new information that follows the style present in the encoded information.
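If you are curious about what makes SPADE different from a plain encoder-decoder, here is a minimal PyTorch sketch of a SPADE-style normalization block. It is a simplified illustration, not the authors' implementation: the features are normalized, then re-modulated per pixel by a scale and shift predicted from the conditioning input, here the warped image and disparity with their missing regions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SPADENorm(nn.Module):
    """Minimal SPADE-style normalization block (simplified sketch).

    Instead of one global learned scale/shift, the modulation parameters
    gamma and beta are predicted per pixel from a conditioning map, so the
    generator can adapt how it fills each hole in the warped frame.
    """

    def __init__(self, feat_channels, cond_channels, hidden=128):
        super().__init__()
        self.norm = nn.InstanceNorm2d(feat_channels, affine=False)
        self.shared = nn.Sequential(
            nn.Conv2d(cond_channels, hidden, 3, padding=1),
            nn.ReLU(inplace=True),
        )
        self.gamma = nn.Conv2d(hidden, feat_channels, 3, padding=1)
        self.beta = nn.Conv2d(hidden, feat_channels, 3, padding=1)

    def forward(self, feat, cond):
        # Resize the conditioning map (e.g. warped RGB + disparity + mask)
        # to the current feature resolution.
        cond = F.interpolate(cond, size=feat.shape[-2:], mode="nearest")
        h = self.shared(cond)
        return self.norm(feat) * (1 + self.gamma(h)) + self.beta(h)
```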
And voilà! You have your new frame and its inverse depth map. You can now simply repeat the process over and over to get all future frames, which now look like this. Using each output as the input for the next iteration, you can produce as many frames as you want, always following the desired camera direction and the context of the preceding frame!

[On screen: Figure 2 from the paper, the render-refine-repeat loop, followed by video examples.]
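Put together, the whole method is this render-refine-repeat loop. Here is a short, hedged Python sketch of the outer loop; `render` and `refine` are placeholders standing in for the differentiable renderer and the SPADE-based refinement network, and `poses` is the camera trajectory you want to fly along.

```python
def perpetual_view_generation(image, disparity, poses, render, refine):
    """Render-refine-repeat loop, sketched from the description above.

    `render(image, disparity, pose)` -> warped RGB, warped disparity, hole mask
    `refine(rgb, disparity, mask)`   -> completed RGB, completed disparity
    Each refined frame becomes the input of the next step, so the fly-through
    can continue for as many camera poses as you like.
    """
    frames = []
    for pose in poses:
        warped_rgb, warped_disp, mask = render(image, disparity, pose)
        image, disparity = refine(warped_rgb, warped_disp, mask)
        frames.append(image)
    return frames
```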
As you know, such powerful algorithms frequently need a lot of data and annotations to be trained on, and this one is no exception. They needed aerial footage of nature taken from drones, so they collected videos from YouTube, manually curated them, and pre-processed them to create their own dataset. Fortunately for other researchers wanting to attack this challenge, you don't have to do the same thing, since they released this dataset of aerial footage of natural coastal scenes used to train their algorithm. It is available for download on their project page, which is linked in the description below.
As I mentioned, you can even try it yourself, as they made the code publicly available, and they also created a demo you can run right now on Google Colab. The link is in the description below. You just have to run the first few cells like this, which will install the code and dependencies and load their model, and there you go: you can now free-fly around the images they provide and even upload your own! Of course, all the steps I just described are already implemented. Simply run the code and enjoy!
You can find the article covering this paper on my newly created website, along with our Discord community, my guide to learning machine learning, and more exciting stuff I will be sharing there. Feel free to become a free member and get notified of new articles I share! Congratulations to the winners of the NVIDIA GTC giveaway, all appearing on the screen right now. You should have received an email from me with the DLI code! Thank you for watching.