Authors:
(1) Anton Razzhigaev, AIRI and Skoltech;
(2) Arseniy Shakhmatov, Sber AI;
(3) Anastasia Maltseva, Sber AI;
(4) Vladimir Arkhipkin, Sber AI;
(5) Igor Pavlov, Sber AI;
(6) Ilya Ryabov, Sber AI;
(7) Angelina Kuts, Sber AI;
(8) Alexander Panchenko, AIRI and Skoltech;
(9) Andrey Kuznetsov, AIRI and Sber AI;
(10) Denis Dimitrov, AIRI and Sber AI.
Editor's Note: This is Part 1 of 8 of a study detailing the development of Kandinsky, the first text-to-image architecture designed using a combination of image prior and latent diffusion. Read the rest below.
Text-to-image generation is a significant domain in modern computer vision and has achieved substantial improvements through the evolution of generative architectures. Among these, diffusion-based models have demonstrated essential quality enhancements. Such models are generally split into two categories: pixel-level and latent-level approaches. We present Kandinsky[1], a novel exploration of latent diffusion architecture that combines the principles of image prior models with latent diffusion techniques. The image prior model is trained separately to map text embeddings to CLIP image embeddings. Another distinct feature of the proposed model is the modified MoVQ implementation, which serves as the image autoencoder component. Overall, the designed model contains 3.3B parameters. We also deployed a user-friendly demo system that supports diverse generative modes such as text-to-image generation, image fusion, text and image fusion, image variations generation, and text-guided inpainting/outpainting. Additionally, we released the source code and checkpoints for the Kandinsky models. Experimental evaluations demonstrate an FID score of 8.03 on the COCO-30K dataset, marking our model as the top open-source performer in terms of measurable image generation quality.
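To make the two-stage design concrete, the sketch below shows how the image prior and the latent diffusion decoder are typically chained at inference time: the prior maps a text prompt to CLIP image embeddings, and the latent diffusion model generates MoVQ latents conditioned on those embeddings before decoding them into pixels. This is a minimal sketch assuming the Hugging Face diffusers integration of the released checkpoints; the specific class names, model identifiers, and call arguments are assumptions not stated in this section and may differ across library versions.

```python
# Minimal sketch of the two-stage Kandinsky inference flow (assumed diffusers API):
#   (1) image prior: text prompt -> CLIP image embeddings,
#   (2) latent diffusion + MoVQ decoder: embeddings -> latents -> image.
import torch
from diffusers import KandinskyPriorPipeline, KandinskyPipeline

# Stage 1: diffusion image prior (maps text embeddings to CLIP image embeddings).
prior = KandinskyPriorPipeline.from_pretrained(
    "kandinsky-community/kandinsky-2-1-prior", torch_dtype=torch.float16
).to("cuda")

# Stage 2: latent diffusion UNet with the MoVQ image autoencoder as decoder.
decoder = KandinskyPipeline.from_pretrained(
    "kandinsky-community/kandinsky-2-1", torch_dtype=torch.float16
).to("cuda")

prompt = "a red cat sitting on a windowsill, oil painting"
negative_prompt = "low quality, blurry"

# The prior returns CLIP image embeddings for the prompt and the negative prompt.
image_embeds, negative_image_embeds = prior(prompt, negative_prompt).to_tuple()

# The decoder denoises in latent space conditioned on the image embeddings,
# then the MoVQ autoencoder decodes the latents into the final image.
image = decoder(
    prompt,
    image_embeds=image_embeds,
    negative_image_embeds=negative_image_embeds,
    height=768,
    width=768,
).images[0]
image.save("kandinsky_sample.png")
```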
In a short period of time, the generative abilities of text-to-image models have improved substantially, providing users with photorealistic quality, near-real-time inference speed, and a wide range of applications and features, from simple, easy-to-use web-based platforms to sophisticated AI graphics editors.
This paper presents our investigation of latent diffusion architecture design, offering a fresh perspective on this dynamic field of study. First, we describe the new Kandinsky architecture and its details, along with the demo system that implements the model's features. Second, we report experiments on image generation quality and achieve the best FID score among existing open-source models. Additionally, we present a rigorous ablation study of image prior setups, which allowed us to carefully analyze and evaluate various configurations and arrive at the most effective and refined model design.
Our contributions are as follows:
• We present the first text-to-image architecture designed using a combination of image prior and latent diffusion.
• We demonstrate experimental results comparable to state-of-the-art (SotA) models such as Stable Diffusion, IF, and DALL-E 2 in terms of the FID metric, and achieve the SotA score among all existing open-source models.
• We provide a software implementation of the proposed state-of-the-art method for text-to-image generation and release pre-trained models, which is unique among the top-performing methods. The Apache 2.0 license makes it possible to use the model for both non-commercial and commercial purposes.[2][3]
• We create a web image editor application, based on the proposed method, that can be used for interactive generation of images from text prompts (English and Russian are supported) and provides inpainting/outpainting functionality.[4] A video demonstration is available on YouTube.[5]
This paper is available on arxiv under CC BY 4.0 DEED license.
[1] The system is named after Wassily Kandinsky, the famous painter and art theorist.
[2] https://github.com/ai-forever/Kandinsky-2
[3] https://huggingface.co/kandinsky-community
[4] https://fusionbrain.ai/en/editor
[5] https://www.youtube.com/watch?v=c7zHPc59cWU