This model takes a picture, understands which particles are supposed to be moving, and realistically animates them in an infinite loop while keeping the rest of the picture entirely still.
The end result is amazingly realistic videos like this one, generated from still pictures alone.
►Read the full article: https://www.louisbouchard.ai/animate-pictures/
►Paper: Holynski, Aleksander, et al. "Animating Pictures with Eulerian Motion Fields." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021. https://arxiv.org/abs/2011.15128
►Project link (code coming soon): https://eulerian.cs.washington.edu/
00:00
Have you ever taken a beautiful landscape picture and later on you noticed that it didn't
00:05
look quite as good as when you were there?
00:07
It may be because you just cannot freeze such a real-life landscape and expect it to look
00:12
as good.
00:13
In that case, what about having this picture animated where the normally-moving particles
00:18
would be in constant movement, just like the moment you took the photo?
00:22
Observing the water flow or seeing the smoke disperse in the air.
00:25
Well, this is what a new algorithm from Facebook and the University of Washington does.
00:30
It takes a picture, understands which particles are supposed to be moving, and realistically
00:35
animates them in an infinite loop while keeping the rest of the picture entirely still, creating
00:41
amazing-looking videos like this one.
00:44
Sincerely, I don't know why but I LOVE how it looks and wanted to share their work.
00:49
What do you think about these results, and how would you use them?
00:53
Personally, once the code is released, I will be using these as desktop backgrounds.
00:57
Now that we've seen what it can achieve, I hope you are as excited as I was when discovering
01:02
this paper.
01:03
Let's get into the even more interesting part,
01:06
namely: how can they take a single picture and create a realistic animated looping video
01:11
out of it?
01:13
This is done in three important steps.
01:15
The first step is to separate what needs to be animated from what needs to stay still.
01:20
In other words, find the water, smoke, or clouds to animate.
01:23
Of course, detecting these moving particles is extremely easy for humans as we can imagine
01:29
the real scene and how it actually was, but how can a computer that sees only a picture
01:35
and doesn't know the world do this?
01:37
Well, the answer lies within the question: we need to teach it a bit more about the world
01:43
and how it works, or in this case, how it moves.
01:46
This is done by training an artificial intelligence model on videos of real landscape scenes instead
01:52
of pictures.
01:53
This way, it can learn how water, smoke, and clouds typically behave in the form of a flow
01:59
field.
02:00
This flow field is a version of the input image where each pixel value is an approximation
02:04
of that pixel's direction and speed at a frozen moment in time.
02:07
It is called an Eulerian flow field.
02:10
Eulerian flow fields look at how fluid moves focusing on a fixed location instead of following
02:15
the particles of the fluid.
02:17
You can see this as sitting in front of a waterfall and watching the same exact positions
02:22
observing how the water changes there, instead of following the water down the waterfall.
02:27
And this is exactly what we need in this case as the image is precisely representing that:
02:32
flowing water in a still position.
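To make the Eulerian idea concrete, here is a minimal Python sketch, not the authors' code. It assumes a tiny hand-made velocity grid (`FIELD`) that never changes over time, and a hypothetical helper `displacement_after` that follows one pixel by repeatedly looking up the velocity stored at its current (rounded) grid position.

```python
# Toy static (Eulerian) motion field on a 1 x 6 grid: each cell stores the
# (dx, dy) velocity observed at that fixed location. All values are made up.
FIELD = [[(1, 0), (1, 0), (2, 0), (2, 0), (1, 0), (0, 0)]]


def velocity_at(x, y):
    """Look up the velocity stored at a fixed grid cell (clamped to the grid)."""
    row = min(max(int(round(y)), 0), len(FIELD) - 1)
    col = min(max(int(round(x)), 0), len(FIELD[0]) - 1)
    return FIELD[row][col]


def displacement_after(x, y, steps):
    """Accumulate the displacement of the pixel that starts at (x, y)."""
    dx_total, dy_total = 0, 0
    for _ in range(steps):
        # Query the fixed field at the pixel's current position.
        dx, dy = velocity_at(x + dx_total, y + dy_total)
        dx_total += dx
        dy_total += dy
    return dx_total, dy_total
```

The field itself is never updated; only the position being queried changes. That is what lets a single still image, plus one field, describe a whole animation.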
02:35
So using many landscape videos, they started by identifying these fields for each video.
02:41
This is done quite easily as the fluid actually moves during the videos, and we can use widely known
02:46
techniques to identify the moving particles in each frame.
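As an illustration only: suppose some off-the-shelf optical flow method has already produced one flow field per pair of consecutive frames. One simple way to collapse them into a single static field is to average them pixel by pixel. The averaging step is an assumption made for this sketch, and `average_flow` is a hypothetical helper, not part of any library.

```python
def average_flow(flows):
    """Average a list of H x W flow fields of (dx, dy) tuples, pixel by pixel."""
    count = len(flows)
    height, width = len(flows[0]), len(flows[0][0])
    return [
        [
            (
                sum(f[r][c][0] for f in flows) / count,
                sum(f[r][c][1] for f in flows) / count,
            )
            for c in range(width)
        ]
        for r in range(height)
    ]


# Two made-up 1 x 2 flow fields, e.g. from two consecutive frame pairs.
flow_a = [[(1.0, 0.0), (0.0, 0.0)]]
flow_b = [[(3.0, 2.0), (0.0, 0.0)]]
static_field = average_flow([flow_a, flow_b])
```

Here the first pixel moves in both fields and ends up with an averaged velocity, while the second pixel never moves and stays at zero, which is how still regions get separated from moving ones.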
02:50
Then they use this identified flow for each frame as a reference to train their algorithm.
02:55
The training starts with an image-to-image translation network using video frames as
03:00
inputs.
03:01
The network's outputs are compared against these identified flow fields to teach it in a supervised
03:06
way what we want to achieve.
03:08
This is done by iteratively correcting and improving the network based on the difference
03:12
between the generated flow field and our known flow fields.
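A sketch of what "the difference" could look like in code, assuming a mean squared error between predicted and reference flow vectors. The loss the authors actually use is not stated here, so treat `flow_loss` as a hypothetical stand-in.

```python
def flow_loss(predicted, target):
    """Mean squared distance between predicted and reference (dx, dy) vectors."""
    total, count = 0.0, 0
    for pred_row, target_row in zip(predicted, target):
        for (px, py), (tx, ty) in zip(pred_row, target_row):
            total += (px - tx) ** 2 + (py - ty) ** 2
            count += 1
    return total / count


# Made-up 1 x 2 reference field and two candidate predictions.
target = [[(1.0, 0.0), (0.0, 0.0)]]
good_guess = [[(1.0, 0.0), (0.0, 0.0)]]
bad_guess = [[(0.0, 0.0), (0.0, 2.0)]]
```

A perfect prediction gives a loss of zero, and training nudges the network's weights step by step so that this number shrinks.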
03:16
After such training, the network can generate this flow field without any external help
03:20
for any landscape image it receives.
03:23
This works just like any other image-to-image translation architecture, more precisely an encoder coupled with a
03:29
decoder.
03:30
It first encodes the input frame, the landscape image, and then decodes it to generate a new
03:36
version of the same image, conserving the spatial features and changing the image's
03:40
style.
03:41
In this case, the change in style is that the pixel values now encode a motion field instead
03:46
of the actual colors of the images.
03:49
The second step is to animate these sections of the image and do it realistically.
03:53
For this, we only need two things: the input image and the Eulerian or static flow estimation
04:00
we just found for the image.
04:02
Using this information, we know where the pixels are supposed to go next based on their
04:06
speed and direction, but directly applying this will cause some
04:10
issues as some pixels may not have any values after the translation, resulting in black
04:15
holes starting where the motion begins in the picture.
04:18
This is because (1)
04:19
the predicted motion field isn't perfect and (2)
04:22
some pixels will go to the same resulting pixel after their displacement,
04:26
which means that the holes will get worse over time and produce something like this.
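The hole problem is easy to reproduce on a toy example. The sketch below assumes a single row of pixels and made-up integer shifts; `forward_warp` is a hypothetical helper that pushes each pixel to its new position, which is the naive approach described above and not the authors' final method.

```python
def forward_warp(pixels, shifts):
    """Move each pixel of a 1D row by its integer shift; None marks a hole."""
    warped = [None] * len(pixels)
    for index, (value, shift) in enumerate(zip(pixels, shifts)):
        destination = index + shift
        if 0 <= destination < len(warped):
            # If two pixels land on the same spot, the later one overwrites.
            warped[destination] = value
    return warped


row = ["a", "b", "c", "d", "e"]
shifts = [0, 2, 1, 1, 1]  # "a" is static, everything to its right flows away
warped = forward_warp(row, shifts)
```

With these values, `warped` is `["a", None, None, "c", "d"]`: positions 1 and 2, right where the motion begins, are left empty, and "b" is lost because "c" landed on the same spot.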
04:30
So how can we make this more intelligent?
04:33
Again, it is done using an encoder and a decoder and doing one more step in-between the two.
04:39
So they encode the input frame a second time using a different encoder trained on this
04:44
specific task, producing what they call here their deep features.
04:48
These deep features are the encodings of the input image, meaning that they are a concentration
04:52
of the important information for this task about the picture.
04:56
What is judged "important information" here is what they optimized their model to do during
05:01
training.
05:02
Using these deep features, controlled by the displacement fields indicating how the next
05:06
frame should look, they use a decoder trained to generate the
05:10
next frame from this condensed information about the frame and the flow field we give
05:15
it.
05:16
Note that during training, they used two different frames, the first and last frames, to learn
05:20
the real-looking flow of the fluids and try to prevent such black holes from happening.
05:25
Now comes the third and last step: the looping part.
05:29
Using the same frame as the starting frame, they generate the animation in two directions, a forward
05:34
movement and a backward movement, until they reach the second frame.
05:38
This enables them to produce the looping effect by merging the two videos since one starts
05:44
when the other ends and they meet in the center.
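A minimal sketch of the merge, assuming a plain linear crossfade over `n_frames` frames. The names `loop_weights` and `blend` are hypothetical, and single numbers stand in for whole frames (or their features).

```python
def loop_weights(t, n_frames):
    """Return (forward_weight, backward_weight) for frame t of the loop."""
    alpha = t / n_frames
    return 1.0 - alpha, alpha


def blend(forward_value, backward_value, t, n_frames):
    """Crossfade the forward-animated and backward-animated versions of frame t."""
    w_forward, w_backward = loop_weights(t, n_frames)
    return w_forward * forward_value + w_backward * backward_value
```

At `t = 0` only the forward animation contributes, and at `t = n_frames` only the backward one does. Since both animations are anchored on the same image, the last frame of the loop lines up with the first.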
05:46
Then, at inference time, or in other words, when you actually use the model, it does the
05:52
same thing with only a starting frame, which is the image you give the model.
05:56
And voilà, you have your animated image!
05:59
I hope you enjoyed this video as much as I enjoyed discovering this technique.
06:03
If so, I invite you to read their paper too for more technical details about this super
06:08
cool model.
06:09
It is extremely well done!
06:14
Thank you for watching!