Make-A-Scene is not "just another DALL·E". The goal of this new model isn't to allow users to generate random images following a text prompt as DALL·E does (which is really cool, but it restricts the user's control over the generations). Instead, Meta wanted to push creative expression forward, merging this text-to-image trend with previous sketch-to-image models, leading to "Make-A-Scene": a fantastic blend of text- and sketch-conditioned image generation. Learn more in the video...

References
►Read the full article: https://www.louisbouchard.ai/make-a-scene/
►Meta's blog post: https://ai.facebook.com/blog/greater-creative-control-for-ai-image-generation
►Paper: Gafni, O., Polyak, A., Ashual, O., Sheynin, S., Parikh, D. and Taigman, Y., 2022. Make-A-Scene: Scene-Based Text-to-Image Generation with Human Priors.
►My Newsletter (a new AI application explained weekly to your emails!): https://www.louisbouchard.ai/newsletter/

Video Transcript

0:06 This is Make-A-Scene. It's not just another DALL·E. The goal of this new model isn't to allow users to generate random images following a text prompt as DALL·E does, which is really cool but restricts the user's control over the generations. Instead, Meta wanted to push creative expression forward, merging this text-to-image trend with previous sketch-to-image models, leading to Make-A-Scene: a fantastic blend of text- and sketch-conditioned image generation.

0:32 This simply means that, using this new approach, you can quickly sketch out a cat, write what kind of image you would like, and the image generation process will follow both the sketch and the guidance of your text. It gets us even closer to being able to generate the perfect illustration we want in a few seconds.

0:52 You can see this multimodal generative AI method as a DALL·E model with a bit more control over the generations, since it can also take a quick sketch as input. This is why we call it multimodal: it can take multiple modalities as inputs, like text and an image (a sketch, in this case), compared to DALL·E, which only takes text to generate an image.

1:14 Multimodal models are super promising, especially if we match the quality of the results we see online, since we have more control over the results, getting closer to a very interesting end goal: generating the perfect image we have in mind without any design skills. Of course, this is still at the research stage and is an exploratory AI research concept. It doesn't mean what we see isn't achievable; it just means it will take a bit more time to reach the public. Progress is extremely fast in the field, and I wouldn't be surprised to see it live very shortly, or a similar model from other people to play with.

1:49 I believe such sketch- and text-based models are even more interesting, especially for the industry, which is why I wanted to cover it on my channel even though the results are a bit behind those of DALL·E 2 we see online.

2:03 And it's not only interesting for the industry, but for artists too. Some use the sketch feature to generate even more unexpected results than what DALL·E could do. We can ask it to generate something and draw a form that doesn't represent that specific thing, like drawing a jellyfish in a flower shape.
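To make that "text plus sketch" conditioning idea a bit more concrete, here is a tiny, self-contained illustration (my own toy code, not Meta's): the prompt stays free-form text describing what to draw, while the rough sketch is reduced to a coarse grid of region labels that fixes the layout the image should follow, such as the flower shape a jellyfish should fill.

```python
# A minimal illustration of the "text + sketch" conditioning idea (my own toy code,
# not Meta's): the prompt stays free-form text, while the rough sketch is reduced to
# a coarse grid of region labels that fixes the layout the image should follow.
import numpy as np
from PIL import Image, ImageDraw

def sketch_to_label_map(sketch: Image.Image, grid: int = 32, n_labels: int = 8) -> np.ndarray:
    """Downsample a greyscale sketch into a small grid of integer region ids."""
    small = sketch.convert("L").resize((grid, grid), Image.NEAREST)
    return (np.asarray(small) // (256 // n_labels)).astype(np.int64)

# Stand-in for a hand-drawn flower shape: a few bright "petals" on a dark background.
canvas = Image.new("L", (256, 256), 0)
draw = ImageDraw.Draw(canvas)
for cx, cy in [(128, 70), (128, 186), (70, 128), (186, 128), (128, 128)]:
    draw.ellipse([cx - 40, cy - 40, cx + 40, cy + 40], fill=220)

conditioning = {
    "text": "a translucent jellyfish, digital art",  # what to draw
    "layout": sketch_to_label_map(canvas),           # the shape it should fill
}
print(conditioning["layout"].shape, np.unique(conditioning["layout"]))  # (32, 32) [0 6]
```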
2:21 Getting an image like that out of DALL·E may not be impossible, but it is much more complicated without sketch guidance, as the model will only reproduce what it learned from, which comes from real-world images and illustrations.

2:32 So the main question is: how can they guide the generations with both a text input, like DALL·E, and a sketch simultaneously, and have the model follow both guidelines? Well, it's very, very similar to how DALL·E works, so I won't go too deep into the details of a generative model, as I covered at least five different approaches in the past two months, which you should definitely watch if you haven't yet, as these models like DALL·E 2 or Imagen are quite fantastic.

3:00 Typically, these models take millions of training examples to learn how to generate images from text, with data in the form of images and their captions scraped from the internet. Here, during training, instead of only relying on the caption, generating a first version of the image, comparing it to the actual image, and repeating this process numerous times with all our images, we will also feed it a sketch. What's cool is that the sketches are quite easy to produce for training: simply take a pre-trained network you can download online and perform instance segmentation. For those who want the details, they use a freely available VGG model pre-trained on ImageNet, so a quite small network compared to today's, yet super accurate and fast, producing results like this, called a segmentation map. They simply process all their images once and get these maps for training. The model then uses this map, as well as the caption, to orient the generation of the initial image. At inference time, or when one of us uses it, our sketch will replace those maps.

4:05 As I said, they used a model called VGG to create fake sketches for training. They use a transformer architecture for the image generation process, which is different from DALL·E 2, and I invite you to watch the video I made introducing transformers for vision applications if you'd like more details on how it can process and generate images. This sketch-guided transformer is the main difference with Make-A-Scene, along with not using an image-text ranker like CLIP to measure text and image pairs, which you can also learn about in my DALL·E video.

4:39 Instead, all the encoded text and segmentation maps are sent to the transformer model. The model then generates the relevant image tokens, encoded and decoded by the corresponding networks, mainly to produce the image. The encoder is used during training to calculate the difference between the produced and initial image, but only the decoder is needed to take this transformer output and transform it into an image.

5:05 And voilà! This is how Meta's new model is able to take a sketch and text as inputs and generate a high-definition image out of it, allowing more control over the results with great quality.

5:18 And as they say, it's just the beginning of this new kind of AI model. The approaches will just keep improving, both in terms of quality and availability for the public, which is super exciting. Many artists are already using the model for their own work, as described in Meta's blog post, and I'm excited about when we will be able to use it too.
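To give a rough idea of what that token pipeline looks like, here is a heavily simplified sketch (my own toy code, not the paper's implementation, with made-up vocabulary sizes and token counts): encoded text tokens and segmentation-map tokens form a conditioning prefix, a small transformer autoregressively predicts image tokens, and in the real system those image tokens would then be passed to the trained decoder to produce the final pixels.

```python
# Heavily simplified sketch of the token pipeline described above (my own toy code,
# not the paper's implementation): text tokens and segmentation-map tokens form a
# conditioning prefix, and a transformer autoregressively predicts the image tokens
# that a separate, trained decoder would turn into pixels. All sizes are made up.
import torch
import torch.nn as nn

VOCAB = 1024       # toy shared vocabulary for text / segmentation / image tokens
IMG_TOKENS = 256   # e.g. a 16x16 grid of image tokens to generate

class TinyConditionalTransformer(nn.Module):
    def __init__(self, d_model: int = 256, n_layers: int = 4, n_heads: int = 4):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True
        )
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, VOCAB)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # Causal mask: each position may only attend to earlier tokens.
        n = tokens.size(1)
        mask = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)
        h = self.backbone(self.embed(tokens), mask=mask)
        return self.head(h)  # next-token logits for every position

@torch.no_grad()
def generate_image_tokens(model, text_tokens, seg_tokens):
    """Sample image tokens one by one, conditioned on text + segmentation tokens."""
    seq = torch.cat([text_tokens, seg_tokens], dim=1)  # conditioning prefix
    for _ in range(IMG_TOKENS):
        logits = model(seq)[:, -1]                      # logits for the next token
        nxt = torch.multinomial(logits.softmax(-1), 1)  # sample it
        seq = torch.cat([seq, nxt], dim=1)
    return seq[:, -IMG_TOKENS:]  # these would go to the image decoder

model = TinyConditionalTransformer()
text_tokens = torch.randint(0, VOCAB, (1, 32))  # stand-in for the encoded caption
seg_tokens = torch.randint(0, VOCAB, (1, 64))   # stand-in for the encoded sketch / map
print(generate_image_tokens(model, text_tokens, seg_tokens).shape)  # torch.Size([1, 256])
```

The only point of this sketch is the ordering of the sequence (conditioning tokens first, image tokens generated afterwards); the real system's tokenizers, encoder/decoder and architecture choices are where the paper's actual contributions lie.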
5:39 Their approach doesn't require any coding knowledge, only a good sketching hand and some prompt engineering, which means trial and error with the text inputs, tweaking the formulations and words used to produce different and better results.

5:53 Of course, this was just an overview of the new Make-A-Scene approach, and I invite you to read the full paper linked below for a complete overview of how it works. I hope you've enjoyed this video, and I will see you next week with another amazing paper!