Authors:

(1) Anton Razzhigaev, AIRI and Skoltech; (2) Arseniy Shakhmatov, Sber AI; (3) Anastasia Maltseva, Sber AI; (4) Vladimir Arkhipkin, Sber AI; (5) Igor Pavlov, Sber AI; (6) Ilya Ryabov, Sber AI; (7) Angelina Kuts, Sber AI; (8) Alexander Panchenko, AIRI and Skoltech; (9) Andrey Kuznetsov, AIRI and Sber AI; (10) Denis Dimitrov, AIRI and Sber AI.

Editor's Note: This is Part 6 of 8 of a study detailing the development of Kandinsky, the first text-to-image architecture designed using a combination of image prior and latent diffusion. Read the rest below.

Table of Links

Abstract and Introduction
Related Work
Demo System
Kandinsky Architecture
Experiments
Results
Conclusion & Limitations
Ethical Considerations, Acknowledgements and References

6 Results

Our experiments and evaluations showcase the capabilities of the Kandinsky architecture in text-to-image synthesis. Kandinsky achieves an FID score of 8.03 on the COCO-30K validation set at a resolution of 256×256, which puts it in close competition with state-of-the-art models and among the top performers within open-source systems. Our ablation studies further dissect the performance of different configurations: quantization of latent codes in MoVQ slightly improves image quality (FID 9.86 vs. 9.87). The diffusion prior obtains the best CLIP score and human evaluation score, while the linear prior achieves the best FID score of 8.03. This is an intriguing outcome: the simplest linear mapping yields the best FID, suggesting that a linear relationship may exist between the visual and textual embedding vector spaces. To scrutinize this hypothesis further, we trained a linear mapping on a subset of 500 cat images and termed it the "cat prior". Remarkably, this mapping displayed high proficiency (cf. Figure 5).

This paper is available on arxiv under CC BY 4.0 DEED license.
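The linear prior discussed above amounts to a single matrix mapping text embeddings to image embeddings. A minimal sketch of how such a map can be fit by least squares on paired embeddings; the dimensions and synthetic data below are illustrative placeholders, not the paper's actual CLIP embeddings or training procedure:

```python
import numpy as np

rng = np.random.default_rng(0)
dim_txt, dim_img, n = 64, 64, 500  # e.g. 500 paired (text, image) embeddings

# Synthetic paired embeddings standing in for CLIP text/image vectors:
# image embeddings are generated as a noisy linear function of text embeddings.
T = rng.standard_normal((n, dim_txt))             # text embeddings
W_true = rng.standard_normal((dim_txt, dim_img))  # hidden ground-truth map
I = T @ W_true + 0.01 * rng.standard_normal((n, dim_img))  # image embeddings

# Closed-form least-squares fit: W = argmin_W ||T @ W - I||_F^2
W, *_ = np.linalg.lstsq(T, I, rcond=None)

# Predict an image embedding for a new text embedding.
t_new = rng.standard_normal(dim_txt)
i_pred = t_new @ W
print(i_pred.shape)  # (64,)
```

If a genuinely linear relationship holds between the two embedding spaces, a fit like this recovers it almost exactly, which is consistent with the strong FID the linear prior achieves.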