I explain Artificial Intelligence terms and news to non-experts.
The next step for view synthesis: Perpetual View Generation, where the goal is to take a single image, fly into it, and explore the landscape!
Read the full article: https://www.louisbouchard.me/infinite-nature/
Paper: Liu, A., Tucker, R., Jampani, V., Makadia, A., Snavely, N. and Kanazawa, A., 2020. Infinite Nature: Perpetual View Generation of Natural
Scenes from a Single Image, https://arxiv.org/pdf/2012.09855.pdf
Project link: https://infinite-nature.github.io/
Colab demo: https://colab.research.google.com/github/google-research/google-research/blob/master/infinite_nature/infinite_nature_demo.ipynb#scrollTo=sCuRX1liUEVM
This week's paper is about a new task called "Perpetual View Generation," where the goal
is to take a single image, fly into it, and explore the landscape.
This is the first solution to this problem, and it is extremely impressive considering
we feed only one image into the network, which then generates what it would look like
to fly into the scene like a bird.
Of course, this task is extremely complex and will improve over time.
As Two Minute Papers would say, imagine just a couple of papers down the line how
useful this technology could be for video games or flight simulators!
I'm amazed to see how well it already works, even though this is the very paper
introducing the task, especially considering how complex that task is.
And not only because it has to generate new viewpoints, as GANverse3D does, which
I covered in a previous video, but also because it has to generate a new image at
each frame, and once you pass a couple of dozen frames, there is close to nothing
left of the original image to use.
And yes, this can be done over hundreds of frames while still looking a lot better than
current view synthesis approaches.
Let's see how they can generate an entire bird's-eye-view video in any chosen direction
from a single picture, and how you can try it yourself right now without having to set up anything!
To do that, they have to use the geometry of the image, so they first need to produce
a disparity map of the image.
This is done using a state-of-the-art network called MiDaS, which I won't go into
detail about here, but this is the output it gives.
This disparity map is basically an inverse depth map, informing the network of the depths
inside the scene.
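To make "inverse depth" concrete, here is a tiny NumPy sketch of my own (not the authors' code): disparity is simply the reciprocal of depth, so nearby objects get large disparity values and distant ones get small values. The depth values are made up for illustration.

```python
import numpy as np

# Hypothetical depth map in meters (in the paper, MiDaS predicts this
# kind of information directly as disparity from a single image).
depth = np.array([[1.0, 2.0],
                  [4.0, 10.0]])

# Disparity is inverse depth: near pixels -> large values, far -> small.
eps = 1e-6  # guard against division by zero on invalid pixels
disparity = 1.0 / (depth + eps)

print(disparity)
```

This is why a disparity map "informs the network of the depths inside the scene": it carries exactly the same geometric information as a depth map, just inverted.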
Then, we enter the real first step of their technique, which is the renderer.
The goal of this renderer is to generate a new view based on the old view.
This new view will be the next frame, and as you've understood, the old view is the
input image, at least for the first iteration.
This is done using a differentiable renderer.
"Differentiable" simply means we can train it with backpropagation, just as we
traditionally do with conventional deep networks.
This renderer takes the image and disparity map to produce a three-dimensional mesh
representing the scene.
Then, we simply use this 3D mesh to generate an image from a novel viewpoint, P1 in
this case.
This gives us this amazing new picture that looks just a bit zoomed in, but it is not simply a zoom.
There are some pink marks on the rendered image and black marks on the disparity map,
as you can see.
They correspond to the occluded regions and the regions outside the field of view of
the previous image used as input to the renderer, since the renderer only re-projects
the view it was given and is unable to invent unseen details.
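To illustrate where those pink and black marks come from, here is a toy sketch of my own, not the paper's renderer: after warping an image to a new viewpoint, some target pixels receive no source pixel at all, and a validity mask flags them as holes for the refinement step.

```python
import numpy as np

h, w = 4, 6
src = np.random.rand(h, w, 3)         # "old view" (RGB)
warped = np.zeros((h, w, 3))          # rendered new view
valid = np.zeros((h, w), dtype=bool)  # True where a source pixel landed

# Toy "warp": shift the image one pixel left, simulating camera motion.
warped[:, :-1] = src[:, 1:]
valid[:, :-1] = True

# The last column was never written: these are the missing regions
# (the pink/black marks) that the refinement network must invent.
holes = ~valid
print(holes.sum(), "pixels to inpaint out of", h * w)
```

In the real system the warp follows the 3D mesh geometry rather than a simple shift, but the principle is the same: rendering alone leaves holes.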
This leads us to quite a problem: how can we have a complete and realistic image if
we do not know what goes there?
Well, we can use another network that will also take this new disparity map and image
as input to 'refine' it.
This other network, called SPADE, is also a state-of-the-art network, but for
conditional image synthesis.
It is "conditional" because we need to give our network some conditions, which in
this case are the pink and black missing parts.
We basically send this faulty image to the second network to fill in the holes and add the missing details.
You can see this SPADE network as a GAN architecture where the image is first encoded into a latent
code that will give us the style of the image.
Then, this code is decoded to generate a new version of the initial image, simply filling
the missing parts with new information following the same style present in the encoded information.
You now have your new frame and its inverse depth map.
You can now simply repeat the process over and over to get all future frames, which now
looks like this.
Using each output as the input for the next iteration, you can produce an infinite
number of frames, always following the desired trajectory and the previous frame's context!
As you know, such powerful algorithms frequently need a lot of data and annotations
to be trained on, and this one is no exception.
They needed aerial drone footage of nature, which they collected from YouTube,
manually curated, and pre-processed to create their own dataset.
Fortunately for other researchers wanting to attack this challenge, you don't have to
do the same thing since they released this dataset of aerial footage of natural coastal
scenes used to train their algorithm.
It is available for download on their project page, which is linked in the description below.
As I mentioned, you can even try it yourself as they made the code publicly available,
but they also created a demo you can try right now on Google Colab.
The link is in the description below.
You just have to run the first few cells like this, which will install the code and dependencies,
load their model, and there you go.
You can now free-fly around the images they have and even upload your own!
Of course, all the steps I just mentioned are already implemented in the demo.
Simply run the code and enjoy!
You can find the article covering this paper on my newly created website, along with
our Discord community, my guide to learning machine learning, and more exciting stuff
I will be sharing there.
Feel free to become a free member and get notified of new articles I share!
Congratulations to the winners of the NVIDIA GTC giveaway, all appearing on screen!
You should have received an email from me with the DLI code!
Thank you for watching.