eDiffi, NVIDIA's most recent model, generates better-looking and more accurate images than previous approaches like DALL·E 2 or Stable Diffusion. eDiffi better understands the text you send and is more customizable, adding a feature we saw in a previous paper from NVIDIA: the painter tool. Learn more in the video...

References
►Read the full article: https://www.louisbouchard.ai/ediffi/
►Balaji, Y. et al., 2022, eDiffi: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers, https://arxiv.org/abs/2211.01324
►Project page: https://deepimagination.cc/eDiffi/
►My Newsletter (A new AI application explained weekly to your emails!): https://www.louisbouchard.ai/newsletter/

Video Transcript

eDiffi is the new state-of-the-art approach for image synthesis. It generates better-looking and more accurate images than previous approaches like DALL·E 2 or Stable Diffusion. eDiffi better understands the text you send and is more customizable, adding a new feature we saw in a previous paper from NVIDIA: the painter tool. As they say, you can paint with words. In short, this means you can enter a few subjects and paint in the image what should appear here and there, allowing you to create much more customized images compared to a random generation following a prompt. This is the next level, enabling you to pretty much get the exact image you have in mind by simply drawing a horrible quick sketch, something even I can do.

As I mentioned, the results are not only state of the art and better looking than Stable Diffusion, but they are also way more controllable. Of course, it's a different use case, as it needs a bit more work and a clearer idea in mind to create such a draft, but it's definitely super exciting and interesting. It's also why I wanted to cover it on my channel: it's not merely a better model but also a different approach with much more control over the output. The tool isn't available yet, unfortunately, but I sure hope it will be soon. By the way, you should definitely subscribe to the channel and follow me on Twitter, What's AI, if you like this kind of video and would like to have access to easily digestible news on this heavily complicated field.

Another way they allow you to have more control in this new model is by using the same feature we saw, but differently. Indeed, the model generates images guided by a sentence, but it can also be influenced using a quick sketch, so it basically takes an image and a text as inputs. This means you can do other things with it, since it understands images. Here, they leverage this capability by developing a style transfer approach, where you can influence the style of the image generation process by giving an image with a particular style along with your text input. This is super cool, and just look at the results: they speak for themselves. It's incredible, beating both state-of-the-art style transfer models and image synthesis models with a single approach.
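Since eDiffi itself isn't released, here is a minimal, purely illustrative Python sketch of that idea of taking both a text prompt and a reference style image as inputs, with the style image turned into a CLIP image embedding (as discussed further below). The `denoise_step` function, the sampling loop, and the file name are assumptions for illustration only, not NVIDIA's actual code or API.

```python
# Purely illustrative: condition generation on a text prompt AND a reference
# style image by encoding the image with CLIP, similar in spirit to eDiffi's
# style transfer. denoise_step and the loop are placeholders, not NVIDIA's code.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

prompt = "a small cabin in a snowy forest at dusk"
style_image = Image.open("style_reference.png")  # any image whose style you want to borrow

inputs = processor(text=[prompt], images=style_image,
                   return_tensors="pt", padding=True)

with torch.no_grad():
    # One global CLIP vector for the prompt and one for the style image.
    text_emb = clip.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])
    style_emb = clip.get_image_features(pixel_values=inputs["pixel_values"])

def denoise_step(x_t, t, text_emb, style_emb):
    """Hypothetical denoiser taking both conditioning vectors (placeholder)."""
    return x_t  # a real model would predict and remove noise here

x = torch.randn(1, 3, 64, 64)        # start from pure noise
for t in reversed(range(1000)):      # simplistic fixed-step schedule
    x = denoise_step(x, t, text_emb, style_emb)
```

The point is simply that the same conditioning pathway that reads a sentence can also read an image, which is what enables the style transfer behaviour shown in the paper.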
Now, the question is: how could NVIDIA develop a model that creates better-looking images, enables more control over both the style and the image structure, and better understands and represents what you actually want in your text? Well, they changed the typical diffusion architecture in two ways.

First, they encode the text using two different approaches that I already covered on the channel, which we refer to as the CLIP and T5 encoders. This means they use pre-trained models to take text and create various embeddings focusing on different features, as they were trained and behave differently, and embeddings are just representations maximizing what the sentence actually means for the algorithm, or the machine, to understand it. Regarding the input image, they just use CLIP embeddings as well, basically encoding the image so that the model can understand it, which you can learn more about in my other videos covering generative models, as they are pretty much all built on CLIP. This is what allows them to have more control over the output, as well as to process text and images rather than only text.

The second modification is using a cascade of diffusion models instead of reusing the same one iteratively, as we usually do with diffusion-based models. Here, they use models trained for a specific part of the generative process, meaning that each model does not have to be as general as the regular diffusion denoiser. Since each model has to focus on a specific part of the process, it can be much better at it. They use this approach because they observed that the denoising model seemed to use the text embeddings a lot more to orient its generation towards the beginning of the process, and then used them less and less to focus on output quality and fidelity. This naturally brings the hypothesis that reusing the same denoising model throughout the whole process might not be the best idea, since it automatically focuses on different tasks, and we know that a generalist is far from expert level at all tasks. Why not use a few experts instead of one generalist to get much better results? So this is what they did, and why they call them denoising experts, and it's the main reason for the improved performance in quality and faithfulness (a quick code sketch of this expert routing idea is included at the end of this post). The rest of the architecture is pretty similar to other approaches: upscaling the final results with other models to get a high-definition final image.

The image and video synthesis fields are just getting crazy nowadays, and we are seeing impressive results coming out every week. I am super excited for the next releases, and I love to see different approaches with both innovative ways of tackling the problem and also going for different use cases. As a great person once said: what a time to be alive! I hope you liked this quick overview of the approach, a bit more high level than what I usually do, as it takes most parts I already covered in numerous videos and changes how they act. I invite you to watch my Stable Diffusion video to learn a bit more about the diffusion approach itself, and to read NVIDIA's paper to learn more about this specific approach and its implementation. I will see you next week with another amazing paper!
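To make the "ensemble of expert denoisers" idea from the transcript a bit more concrete, here is a minimal PyTorch sketch of routing each denoising step to an expert assigned to a range of noise levels. The `Expert` module, the interval boundaries, and the toy update rule are made up for illustration; they are not the experts or the sampler NVIDIA trained.

```python
# Minimal sketch of denoising experts: instead of one denoiser reused at every
# step, different experts handle different noise levels. Boundaries and modules
# below are illustrative assumptions, not eDiffi's actual configuration.
import torch
import torch.nn as nn

class Expert(nn.Module):
    """Stand-in denoiser specialized for one range of noise levels."""
    def __init__(self):
        super().__init__()
        self.net = nn.Conv2d(3, 3, kernel_size=3, padding=1)

    def forward(self, x_t, t, text_emb):
        # A real expert would use t and text_emb; this placeholder ignores them.
        return self.net(x_t)

T = 1000
experts = [
    (range(667, 1000), Expert()),  # high noise: leans heavily on the text embedding
    (range(334, 667),  Expert()),  # middle of the trajectory
    (range(0,   334),  Expert()),  # low noise: focuses on fidelity and fine details
]

def pick_expert(t):
    """Route a denoising step to the expert covering this noise level."""
    for steps, expert in experts:
        if t in steps:
            return expert
    raise ValueError(f"no expert covers step {t}")

x = torch.randn(1, 3, 64, 64)         # start from pure noise
text_emb = torch.randn(1, 768)        # placeholder text conditioning
for t in reversed(range(T)):
    noise_pred = pick_expert(t)(x, t, text_emb)
    x = x - noise_pred / T            # toy update, not a real DDPM/DDIM step
```

The design choice the paper motivates shows up in the comments: the high-noise expert is the one that leans on the text conditioning, while the low-noise expert polishes details, which a single shared denoiser would otherwise have to handle all at once.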