Authors:
(1) Anton Razzhigaev, AIRI and Skoltech;
(2) Arseniy Shakhmatov, Sber AI;
(3) Anastasia Maltseva, Sber AI;
(4) Vladimir Arkhipkin, Sber AI;
(5) Igor Pavlov, Sber AI;
(6) Ilya Ryabov, Sber AI;
(7) Angelina Kuts, Sber AI;
(8) Alexander Panchenko, AIRI and Skoltech;
(9) Andrey Kuznetsov, AIRI and Sber AI;
(10) Denis Dimitrov, AIRI and Sber AI.
Editor's Note: This is Part 7 of 8 of a study detailing the development of Kandinsky, the first text-to-image architecture designed using a combination of image prior and latent diffusion. Read the rest below.
We presented Kandinsky, a system for various image generation and processing tasks based on a novel latent diffusion model. Our model yielded the SotA results among open-sourced systems. Additionally, we provided an extensive ablation study of an image prior to design choices. Our system is equipped with free-to-use interfaces in the form of Web application and Telegram messenger bot. The pre-trained models are available on Hugging Face, and the source code is released under a permissive license enabling various, including commercial, applications of the developed technology.
In future research, our goal is to investigate the potential of the latest image encoders. We plan to explore the development of more efficient UNet architectures for text-to-image tasks and focus on improving the understanding of textual prompts. Additionally, we aim to experiment with generating images at higher resolutions and to investigate new features extending the model: local image editing by a text prompt, attention reweighting, physics-based generation control, etc. The robustness against generating abusive content remains a crucial concern, warranting the exploration of real-time moderation layers or robust classifiers to mitigate undesirable, e.g. toxic or abusive, outputs.
The current system produces images that appear natural, however, additional research can be conducted to (1) enhance the semantic coherence between the input text and the generated image, and (2) to improve the absolute values of FID and image quality based on human evaluations.
This paper is available on arxiv under CC BY 4.0 DEED license.