Meta AI’s new model Make-A-Video is out and, in a single sentence: it generates videos from text. It’s not only able to generate videos, but it’s also the new state-of-the-art method, producing higher quality and more coherent videos than ever before! You can see this model as a Stable Diffusion model for videos, surely the next step after being able to generate images. This is all information you must’ve seen already on a news website or just by reading the title of the article, but what you don’t know yet is what it is exactly and how it works. Here's how...

References
►Read the full article: https://www.louisbouchard.ai/make-a-video/
►Meta's blog post: https://ai.facebook.com/blog/generative-ai-text-to-video/
►Singer et al. (Meta AI), 2022, "Make-A-Video: Text-to-Video Generation without Text-Video Data", https://makeavideo.studio/Make-A-Video.pdf
►Make-A-Video (official page): https://makeavideo.studio/?fbclid=IwAR0tuL9Uc6kjZaMoJHCngAMUNp9bZbyhLmdOUveJ9leyyfL9awRy4seQGW4
►PyTorch implementation: https://github.com/lucidrains/make-a-video-pytorch
►My Newsletter (A new AI application explained weekly to your emails!): https://www.louisbouchard.ai/newsletter/

Video Transcript
Meta AI's new model, Make-A-Video, is out and, in a single sentence: it generates videos from text. It's not only able to generate videos, but it's also the new state-of-the-art method, producing higher quality and more coherent videos than ever. You can see this model as a Stable Diffusion model for videos, surely the next step after being able to generate images. This is all information you must have seen already on a news website or just by reading the title of the video, but what you don't know yet is what it is exactly and how it works.

Make-A-Video is the most recent publication by Meta AI, and it allows you to generate a short video out of textual inputs, just like this. So you are adding complexity to the image generation task by not only having to generate multiple frames of the same subject and scene, but they also have to be coherent in time. You cannot simply generate 60 images using DALL·E and stitch them into a video; it will just look bad and nothing near realistic. You need a model that understands the world in a better way and leverages this level of understanding to generate a coherent series of images that blend well together. You basically want to simulate a world and then simulate recordings of it.

But how can you do that? Typically, you would need tons of text-video pairs to train your model to generate such videos from textual inputs, but not in this case. Since this kind of data is really difficult to get and the training costs are super expensive, they approached the problem differently. Another way is to take the best text-to-image model and adapt it to videos, and that's what Meta AI did in a research paper they just released. In their case, the text-to-image model is another model by Meta called Make-A-Scene, which I covered in a previous video if you'd like to learn more about it.

But how do you adapt such a model to take time into consideration? You add a spatiotemporal pipeline for your model to be able to process videos. This means that the model will not only generate one image but, in this case, 16 of them in low resolution, to create a short, coherent video, in a similar manner as a text-to-image model but adding a one-dimensional convolution along with the regular two-dimensional ones. This simple addition allows them to keep the pre-trained two-dimensional convolutions the same and add a temporal dimension that they will train from scratch, reusing most of the code and model parameters from the image model they started from.
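To make that "add a 1D convolution next to the pre-trained 2D one" idea concrete, here is a minimal PyTorch sketch of a factorized space-time convolution block. This is not Meta's actual code: the class name, the tensor layout, and the identity initialization of the temporal convolution are illustrative assumptions.

```python
import torch
import torch.nn as nn

class FactorizedSpaceTimeConv(nn.Module):
    """Sketch: a 2D spatial convolution (as reused from a pre-trained image
    model) followed by a new 1D convolution along the time axis.
    Input shape: (batch, channels, frames, height, width)."""

    def __init__(self, channels, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        # Pre-trained 2D convolution, kept as-is.
        self.spatial = nn.Conv2d(channels, channels, kernel_size, padding=pad)
        # New 1D convolution over the frame axis, trained from scratch.
        # Initialized as identity so the block initially behaves exactly
        # like the image model (an illustrative choice).
        self.temporal = nn.Conv1d(channels, channels, kernel_size, padding=pad)
        nn.init.dirac_(self.temporal.weight)
        nn.init.zeros_(self.temporal.bias)

    def forward(self, x):
        b, c, t, h, w = x.shape
        # Apply the 2D convolution to every frame independently.
        x = x.permute(0, 2, 1, 3, 4).reshape(b * t, c, h, w)
        x = self.spatial(x)
        x = x.reshape(b, t, c, h, w)
        # Apply the 1D convolution along time at every spatial location.
        x = x.permute(0, 3, 4, 2, 1).reshape(b * h * w, c, t)
        x = self.temporal(x)
        x = x.reshape(b, h, w, c, t).permute(0, 3, 4, 1, 2)
        return x  # back to (batch, channels, frames, height, width)

clip = torch.randn(2, 64, 16, 32, 32)          # 2 clips of 16 small frames
out = FactorizedSpaceTimeConv(64)(clip)        # same shape as the input
```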
We also want to guide our generations with text input, which is very similar to image models using CLIP embeddings, a process I go into detail about in my Stable Diffusion video if you are not familiar with it. But they will also be adding the temporal dimension when blending the text features with the image features, doing the same thing: keeping the attention module I described in my Make-A-Scene video and adding a one-dimensional attention module for temporal considerations, copy-pasting the image generator model and duplicating the generation modules for one more dimension to have all our 16 initial frames.

But what can you do with 16 frames? Well, nothing really interesting. We need to make a high-definition video out of those frames. The model will do that by having access to previous and future frames and iteratively interpolating from them, both in the temporal and spatial dimensions at the same time. So it is basically generating new and larger frames in between those initial 16 frames, based on the frames before and after them, which helps make the movement coherent and the overall video fluid. This is done using a frame interpolation network, which I also described in other videos, that will basically take the images we have and fill in the gaps, generating the in-between information. It will do the same thing for the spatial component, enlarging the images and filling in the pixel gaps to make them higher definition (a toy sketch of these two upsampling steps follows at the end of this transcript).

So, to summarize, they fine-tune a text-to-image model for video generation. This means they take a powerful model that is already trained and adapt and train it a little bit more to get used to videos. This retraining is done with unlabeled videos, just to teach the model to understand videos and video frame consistency, which makes the dataset-building process much simpler. Then, they once again use an image-optimized model to improve the spatial resolution, and a last frame interpolation component to add more frames and make the video fluid.

Of course, the results aren't perfect yet, just like text-to-image models, but we know how fast progress goes. This was just an overview of how Meta AI successfully tackled the text-to-video task in this great paper. All the links are in the description below if you'd like to learn more about their approach. A PyTorch implementation is also already being developed by the community, so stay tuned for that if you'd like to implement it yourself. Thank you for watching the whole video, and I will see you next time with another amazing paper!
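As promised above, here is a toy stand-in for the two upsampling stages (more frames in time, more pixels in space). The real model uses learned interpolation and super-resolution networks conditioned on the neighbouring frames; everything below, including the function name and the frame and resolution numbers, is illustrative only.

```python
import torch
import torch.nn.functional as F

def naive_expand_video(frames: torch.Tensor, out_frames: int = 64, scale: int = 4) -> torch.Tensor:
    """Toy stand-in for the temporal interpolation and spatial super-resolution
    stages. `frames` has shape (batch, channels, time, height, width), e.g. the
    16 low-resolution frames produced by the generator."""
    b, c, t, h, w = frames.shape
    # Temporal step: blend neighbouring frames to reach out_frames.
    # Make-A-Video uses a learned frame interpolation network instead.
    video = F.interpolate(frames, size=(out_frames, h, w),
                          mode="trilinear", align_corners=False)
    # Spatial step: enlarge every frame.
    # The real model uses learned super-resolution networks here as well.
    return F.interpolate(video, size=(out_frames, h * scale, w * scale),
                         mode="trilinear", align_corners=False)

low_res = torch.randn(1, 3, 16, 64, 64)   # 16 small generated frames
video = naive_expand_video(low_res)       # -> shape (1, 3, 64, 256, 256)
```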