Meta AI’s new model Make-A-Video is out, and in a single sentence: it generates videos from text. It’s not only able to generate videos, but it’s also the new state-of-the-art method, producing higher quality and more coherent videos than ever before!
You can see this model as a Stable Diffusion model for videos, surely the next step after being able to generate images. This is all information you must’ve seen already on a news website or just by reading the title of the article, but what you don’t know yet is what it is exactly and how it works.
Here's how...
►Read the full article: https://www.louisbouchard.ai/make-a-video/
► Meta's blog post: https://ai.facebook.com/blog/generative-ai-text-to-video/
►Singer et al. (Meta AI), 2022, "MAKE-A-VIDEO: TEXT-TO-VIDEO GENERATION WITHOUT TEXT-VIDEO DATA", https://makeavideo.studio/Make-A-Video.pdf
►Make-a-video (official page): https://makeavideo.studio/?fbclid=IwAR0tuL9Uc6kjZaMoJHCngAMUNp9bZbyhLmdOUveJ9leyyfL9awRy4seQGW4
► Pytorch implementation: https://github.com/lucidrains/make-a-video-pytorch
►My Newsletter (A new AI application explained weekly to your emails!): https://www.louisbouchard.ai/newsletter/
0:00
Meta AI's new model Make-A-Video is out, and in a single sentence: it generates videos from text. It's not only able to generate videos, but it's also the new state-of-the-art method, producing higher quality and more coherent videos than ever. You can see this model as a Stable Diffusion model for videos, surely the next step after being able to generate images. This is all information you must have seen already on a news website or just by reading the title of the video, but what you don't know yet is what it is exactly and how it works.
0:33
Make-A-Video is the most recent publication by Meta AI, and it allows you to generate a short video out of textual inputs, just like this. So you are adding complexity to the image generation task by not only having to generate multiple frames of the same subject and scene, but they also have to be coherent in time. You cannot simply generate 60 images using DALL·E to make a video; it will just look bad and nowhere near realistic. You need a model that understands the world in a better way and leverages this level of understanding to generate a coherent series of images that blend well together. You basically want to simulate a world and then simulate recordings of it.
1:11
But how can you do that? Typically, you would need tons of text-video pairs to train your model to generate such videos from textual input, but not in this case. Since this kind of data is really difficult to get and the training costs are super expensive, they approached the problem differently. Another way is to take the best text-to-image model and adapt it to videos, and that's what Meta AI did in a research paper they just released. In their case, the text-to-image model is another model by Meta called Make-A-Scene, which I covered in a previous video if you'd like to learn more about it.
1:47
But how do you adapt such a model to take time into consideration? You add a spatial-temporal pipeline so your model is able to process videos. This means that the model will not only generate one image but, in this case, 16 of them in low resolution, to create a short, coherent video in a similar manner as a text-to-image model, but adding a one-dimensional convolution along with the regular two-dimensional one. This simple addition allows them to keep the pre-trained two-dimensional convolutions the same and add a temporal dimension that they will train from scratch, reusing most of the code and the model's parameters from the image model they started from.
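Here is what that factorized convolution could look like. This is only a minimal PyTorch sketch of the idea, not Meta's code: the 2D convolution plays the role of the pretrained spatial layer, the 1D convolution over the frame axis is the new temporal part, and the class name PseudoConv3d is just illustrative. Initializing the temporal layer as the identity means the video model starts out behaving exactly like the image model it came from.

import torch
import torch.nn as nn

class PseudoConv3d(nn.Module):
    """Factorized space-time convolution (illustrative sketch): a 2D conv over
    (height, width), reused from the pretrained image model, followed by a new
    1D conv over the frame axis, trained from scratch."""
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        # Spatial conv: weights can be loaded from the text-to-image model.
        self.spatial = nn.Conv2d(channels, channels, kernel_size, padding=pad)
        # Temporal conv: new layer, initialized as the identity so the video
        # model initially reproduces the image model frame by frame.
        self.temporal = nn.Conv1d(channels, channels, kernel_size, padding=pad)
        nn.init.dirac_(self.temporal.weight)
        nn.init.zeros_(self.temporal.bias)

    def forward(self, x):  # x: (batch, channels, frames, height, width)
        b, c, f, h, w = x.shape
        # Apply the 2D conv to every frame independently.
        x = x.permute(0, 2, 1, 3, 4).reshape(b * f, c, h, w)
        x = self.spatial(x)
        x = x.reshape(b, f, c, h, w)
        # Apply the 1D conv across frames at every spatial location.
        x = x.permute(0, 3, 4, 2, 1).reshape(b * h * w, c, f)
        x = self.temporal(x)
        x = x.reshape(b, h, w, c, f).permute(0, 3, 4, 1, 2)
        return x  # back to (batch, channels, frames, height, width)

frames = torch.randn(1, 64, 16, 32, 32)   # 16 low-resolution frames
print(PseudoConv3d(64)(frames).shape)     # torch.Size([1, 64, 16, 32, 32])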
2:30
We also want to guide our generations with text input, which will be very similar to image models using CLIP embeddings, a process I go into in detail in my Stable Diffusion video if you are not familiar with it. But they will also be adding the temporal dimension when blending the text features with the image features, doing the same thing: keeping the attention module I described in my Make-A-Scene video and adding a one-dimensional attention module for temporal considerations, copy-pasting the image generator model and duplicating the generation modules for one more dimension to have all our 16 initial frames.
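The same factorization trick can be sketched for attention. Again, this is only an illustrative PyTorch sketch under my own assumptions (the class name PseudoAttention3d and the exact text-conditioning interface are made up, not Meta's API): the per-frame spatial attention attends to the text features as in the image model, and a new one-dimensional attention layer, trained from scratch, attends across the 16 frames at each spatial location.

import torch
import torch.nn as nn

class PseudoAttention3d(nn.Module):
    """Factorized attention sketch: spatial attention over each frame's tokens
    (conditioned on the text embedding), plus temporal attention over frames."""
    def __init__(self, dim, heads=8):
        super().__init__()
        # Spatial attention, kept from the pretrained image model.
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Temporal attention across the frame axis, trained from scratch.
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x, text_emb):
        # x: (batch, frames, tokens, dim); text_emb: (batch, text_len, dim)
        b, f, n, d = x.shape
        # Per-frame attention, cross-attending to the text features.
        s = x.reshape(b * f, n, d)
        ctx = text_emb.repeat_interleave(f, dim=0)
        s, _ = self.spatial(s, ctx, ctx)
        x = x + s.reshape(b, f, n, d)
        # Per-location attention across the 16 frames.
        t = x.permute(0, 2, 1, 3).reshape(b * n, f, d)
        t, _ = self.temporal(t, t, t)
        x = x + t.reshape(b, n, f, d).permute(0, 2, 1, 3)
        return x

x = torch.randn(1, 16, 64, 128)       # 16 frames of 64 tokens each
text = torch.randn(1, 8, 128)         # text embedding tokens
print(PseudoAttention3d(128)(x, text).shape)  # torch.Size([1, 16, 64, 128])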
3:07
But what can you do with 16 frames? Well, nothing really interesting. We need to make a high-definition video out of those frames. The model will do that by having access to previous and future frames and iteratively interpolating from them, both in terms of the temporal and spatial dimensions at the same time, so basically generating new and larger frames in between those initial 16 frames, based on the frames before and after them, which will help make the movement coherent and the overall video fluid. This is done using a frame interpolation network, which I also described in other videos, that will basically take the images we have and fill in the gaps, generating the in-between information. It will do the same thing for the spatial component, enlarging the images and filling in the pixel gaps to make them higher definition.
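To make the temporal part concrete, here is a tiny sketch of how such an interpolation network could be used to densify the 16 frames. Everything here is hypothetical: interp_net stands in for the learned model, and the linear blend in the usage example only replaces it so the shapes can be checked.

import torch

def interpolate_frames(frames, interp_net, factor=4):
    """Generate new frames between each pair of neighbours by conditioning on
    the frame before and the frame after (illustrative sketch)."""
    out = [frames[:, 0]]
    for i in range(frames.shape[1] - 1):
        prev_f, next_f = frames[:, i], frames[:, i + 1]
        for k in range(1, factor):
            # The network fills in the frame at fractional position k/factor
            # between its two neighbours.
            out.append(interp_net(prev_f, next_f, t=k / factor))
        out.append(next_f)
    # Output length grows from F frames to 1 + (F - 1) * factor frames.
    return torch.stack(out, dim=1)

frames = torch.randn(1, 16, 3, 64, 64)          # 16 low-resolution frames
blend = lambda a, b, t: (1 - t) * a + t * b     # stand-in for the learned network
print(interpolate_frames(frames, blend).shape)  # torch.Size([1, 61, 3, 64, 64])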
4:04
So, to summarize: they fine-tune a text-to-image model for video generation. This means they take a powerful model that is already trained and adapt and train it a little bit more to get used to videos. This retraining will be done with unlabeled videos, just to teach the model to understand videos and video frame consistency, which makes the dataset-building process much simpler. Then they once again use an image-optimized model to improve the spatial resolution, and a last frame interpolation component to add more frames and make the video fluid.
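Putting the pieces together, the whole cascade could be summarized like this. Every name here (prior, decoder, temporal_interp, spatial_sr) is a placeholder for one of the stages described above, not Meta's actual API.

def make_a_video(prompt, prior, decoder, temporal_interp, spatial_sr):
    """High-level sketch of the cascade; all components are placeholders."""
    text_emb = prior(prompt)                   # text -> CLIP-style embedding
    frames = decoder(text_emb, num_frames=16)  # 16 low-resolution frames
    frames = temporal_interp(frames)           # add in-between frames for smooth motion
    return spatial_sr(frames)                  # upscale each frame to high definition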
4:38
Of course, the results aren't perfect yet, just like text-to-image models, but we know how fast progress goes. This was just an overview of how Meta AI successfully tackled the text-to-video task in this great paper. All the links are in the description below if you'd like to learn more about their approach. A PyTorch implementation is also already being developed by the community, so stay tuned for that if you'd like to implement it yourself. Thank you for watching the whole video, and I will see you next time with another amazing paper!