Authors:
(1) Anton Razzhigaev, AIRI and Skoltech;
(2) Arseniy Shakhmatov, Sber AI;
(3) Anastasia Maltseva, Sber AI;
(4) Vladimir Arkhipkin, Sber AI;
(5) Igor Pavlov, Sber AI;
(6) Ilya Ryabov, Sber AI;
(7) Angelina Kuts, Sber AI;
(8) Alexander Panchenko, AIRI and Skoltech;
(9) Andrey Kuznetsov, AIRI and Sber AI;
(10) Denis Dimitrov, AIRI and Sber AI.
Editor's Note: This is Part 2 of 8 of a study detailing the development of Kandinsky, the first text-to-image architecture designed using a combination of image prior and latent diffusion. Read the rest below.
Early text-to-image generative models, such as DALL-E (Ramesh et al., 2021) and CogView (Ding et al., 2021), and later Parti (Yu et al., 2022), employed autoregressive approaches but often suffered from significant content-level artifacts. This led to the development of a new breed of models that utilize the diffusion process to enhance image quality. Diffusion-based models, such as DALL-E 2 (Ramesh et al., 2022), Imagen (Saharia et al., 2022b), and Stable Diffusion[6], have since become cornerstones in this domain. These models are typically divided into pixel-level (Ramesh et al., 2022; Saharia et al., 2022b) and latent-level (Rombach et al., 2022) approaches.
This surge of interest has led to the design of innovative approaches and architectures, paving the way for numerous applications built on open-source generative models, such as DreamBooth (Ruiz et al., 2023) and DreamPose (Karras et al., 2023). These applications leverage image generation techniques to offer remarkable capabilities, further fueling the popularity and rapid development of diffusion-based image generation approaches.
This enabled a wide array of applications, such as 3D object synthesis (Poole et al., 2023; Tang et al., 2023; Lin et al., 2022; Chen et al., 2023), video generation (Ho et al., 2022b; Luo et al., 2023; Ho et al., 2022a; Singer et al., 2023; Blattmann et al., 2023; Esser et al., 2023), and controllable image editing (Hertz et al., 2023; Parmar et al., 2023; Liew et al., 2022; Mou et al., 2023; Lu et al., 2023), among others, which are now at the forefront of this domain.
Diffusion models achieve state-of-the-art results in image generation, both unconditional (Ho et al., 2020; Nichol and Dhariwal, 2021) and conditional (Peebles and Xie, 2022). They outperform GANs (Goodfellow et al., 2014) by generating images with better fidelity and diversity scores without adversarial training (Dhariwal and Nichol, 2021). Diffusion models also show the best performance in various image processing tasks such as inpainting, outpainting, and super-resolution (Batzolis et al., 2021; Saharia et al., 2022a).
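For context, the training objective shared by these models follows the denoising formulation of Ho et al. (2020), cited above; the sketch below uses our own notation and is a standard formulation rather than a detail specific to any one of the surveyed systems. A forward process gradually adds Gaussian noise to an image $x_0$, and a network $\epsilon_\theta$ is trained to predict that noise:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\, \sqrt{1-\beta_t}\, x_{t-1},\, \beta_t \mathbf{I}\big), \qquad \mathcal{L}_{\text{simple}} = \mathbb{E}_{x_0,\, \epsilon \sim \mathcal{N}(0, \mathbf{I}),\, t}\big[\, \| \epsilon - \epsilon_\theta(x_t, t) \|^2 \,\big],$$

where $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon$ and $\bar{\alpha}_t = \prod_{s=1}^{t} (1-\beta_s)$. No discriminator is involved, which is why the adversarial training mentioned above is unnecessary.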
Text-to-image diffusion models have become a popular research direction due to the high performance of diffusion models and the ability to integrate text conditions straightforwardly via the classifier-free guidance algorithm (Ho and Salimans, 2022). Early models such as GLIDE (Nichol et al., 2022), Imagen (Saharia et al., 2022b), DALL-E 2 (Ramesh et al., 2022), and eDiff-I (Balaji et al., 2022) generate a low-resolution image in pixel space and then upsample it with additional super-resolution diffusion models. They also use different text encoders: the large language model T5 (Raffel et al., 2020) in Imagen, and CLIP (Radford et al., 2021) in GLIDE and DALL-E 2.
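To make the conditioning mechanism concrete: classifier-free guidance (Ho and Salimans, 2022) trains a single network with the text condition randomly dropped, and at sampling time mixes the conditional and unconditional noise predictions. In one common formulation (notation ours),

$$\tilde{\epsilon}_\theta(x_t, c) = \epsilon_\theta(x_t, \varnothing) + w \big( \epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \varnothing) \big),$$

where $c$ is the text condition, $\varnothing$ is the null (dropped) condition, and $w$ is the guidance scale; $w > 1$ amplifies the influence of the text prompt at some cost to sample diversity.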
This paper is available on arxiv under CC BY 4.0 DEED license.
[6] https://github.com/CompVis/stable-diffusion