Authors:
(1) Anton Razzhigaev, AIRI and Skoltech;
(2) Arseniy Shakhmatov, Sber AI;
(3) Anastasia Maltseva, Sber AI;
(4) Vladimir Arkhipkin, Sber AI;
(5) Igor Pavlov, Sber AI;
(6) Ilya Ryabov, Sber AI;
(7) Angelina Kuts, Sber AI;
(8) Alexander Panchenko, AIRI and Skoltech;
(9) Andrey Kuznetsov, AIRI and Sber AI;
(10) Denis Dimitrov, AIRI and Sber AI.
Editor's Note: This is Part 1 of 8 of a study detailing the development of Kandinsky, the first text-to-image architecture designed using a combination of image prior and latent diffusion. Read the rest below.
Text-to-image generation is a significant domain in modern computer vision and has achieved substantial improvements through the evolution of generative architectures. Among these, diffusion-based models have demonstrated essential quality enhancements. Such models are generally split into two categories: pixel-level and latent-level approaches. We present Kandinsky[1], a novel exploration of latent diffusion architecture that combines the principles of image prior models with latent diffusion techniques. The image prior model is trained separately to map text embeddings to CLIP image embeddings. Another distinct feature of the proposed model is the modified MoVQ implementation, which serves as the image autoencoder component. Overall, the designed model contains 3.3B parameters. We also deployed a user-friendly demo system that supports diverse generative modes such as text-to-image generation, image fusion, text and image fusion, image variations generation, and text-guided inpainting/outpainting. Additionally, we released the source code and checkpoints for the Kandinsky models. Experimental evaluations demonstrate an FID score of 8.03 on the COCO-30K dataset, marking our model as the top open-source performer in terms of measurable image generation quality.
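To make the two-stage design concrete, the sketch below shows how the image prior and the latent diffusion decoder are typically chained at inference time: the prior maps a text prompt to CLIP image embeddings, and the latent diffusion model generates MoVQ latents conditioned on those embeddings before decoding them into pixels. This is a minimal sketch assuming the Hugging Face diffusers integration of the released checkpoints; the specific class names, model identifiers, and call arguments are assumptions not stated in this section and may differ across library versions.

```python
# Minimal sketch of the two-stage Kandinsky inference flow (assumed diffusers API):
#   (1) image prior: text prompt -> CLIP image embeddings,
#   (2) latent diffusion + MoVQ decoder: embeddings -> latents -> image.
import torch
from diffusers import KandinskyPriorPipeline, KandinskyPipeline

# Stage 1: diffusion image prior (maps text embeddings to CLIP image embeddings).
prior = KandinskyPriorPipeline.from_pretrained(
    "kandinsky-community/kandinsky-2-1-prior", torch_dtype=torch.float16
).to("cuda")

# Stage 2: latent diffusion UNet with the MoVQ image autoencoder as decoder.
decoder = KandinskyPipeline.from_pretrained(
    "kandinsky-community/kandinsky-2-1", torch_dtype=torch.float16
).to("cuda")

prompt = "a red cat sitting on a windowsill, oil painting"
negative_prompt = "low quality, blurry"

# The prior returns CLIP image embeddings for the prompt and the negative prompt.
image_embeds, negative_image_embeds = prior(prompt, negative_prompt).to_tuple()

# The decoder denoises in latent space conditioned on the image embeddings,
# then the MoVQ autoencoder decodes the latents into the final image.
image = decoder(
    prompt,
    image_embeds=image_embeds,
    negative_image_embeds=negative_image_embeds,
    height=768,
    width=768,
).images[0]
image.save("kandinsky_sample.png")
```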
In a short period of time, the generative abilities of text-to-image models have improved substantially, providing users with photorealistic quality, near-real-time inference speed, and a wide range of applications and features, from simple, easy-to-use web-based platforms to sophisticated AI graphics editors.
This paper presents our investigation of latent diffusion architecture design, offering a fresh perspective on this dynamic field of study. First, we describe the new Kandinsky architecture and its details, along with the demo system that implements the model's features. Second, we report experiments on image generation quality and achieve the best FID score among existing open-source models. Additionally, we present a rigorous ablation study of image prior setups, which allowed us to carefully analyze and evaluate various configurations and arrive at the most effective and refined model design.
Our contributions are as follows:
• We present the first text-to-image architecture designed using a combination of image prior and latent diffusion.
• We demonstrate experimental results comparable to state-of-the-art (SotA) models such as Stable Diffusion, IF, and DALL-E 2 in terms of the FID metric, and achieve the SotA score among all existing open-source models.
• We provide a software implementation of the proposed state-of-the-art method for text-to-image generation and release pre-trained models, which is unique among the top-performing methods. The Apache 2.0 license makes it possible to use the model for both non-commercial and commercial purposes.[2][3]
• We create a web image editor application, based on the proposed method, that can be used for interactive generation of images from text prompts (English and Russian are supported) and provides inpainting/outpainting functionality.[4] A video demonstration is available on YouTube.[5]
This paper is available on arxiv under CC BY 4.0 DEED license.
[1] The system is named after Wassily Kandinsky, the famous painter and art theorist.
[2] https://github.com/ai-forever/Kandinsky-2
[3] https://huggingface.co/kandinsky-community
[4] https://fusionbrain.ai/en/editor
[5] https://www.youtube.com/watch?v=c7zHPc59cWU