Russian Scientists Create AI That Generates Images People Actually Love by @autoencoder


Too Long; Didn't Read

Researchers have developed Kandinsky, a text-to-image generation model that combines an image prior with latent diffusion to produce natural-looking images.

Authors:

(1) Anton Razzhigaev, AIRI and Skoltech;

(2) Arseniy Shakhmatov, Sber AI;

(3) Anastasia Maltseva, Sber AI;

(4) Vladimir Arkhipkin, Sber AI;

(5) Igor Pavlov, Sber AI;

(6) Ilya Ryabov, Sber AI;

(7) Angelina Kuts, Sber AI;

(8) Alexander Panchenko, AIRI and Skoltech;

(9) Andrey Kuznetsov, AIRI and Sber AI;

(10) Denis Dimitrov, AIRI and Sber AI.

Editor's Note: This is Part 5 of 8 of a study detailing the development of Kandinsky, the first text-to-image architecture designed using a combination of image prior and latent diffusion. Read the rest below.

5 Experiments

In our experimental analysis, we sought to evaluate and refine the performance of the proposed latent diffusion architecture. To this end, we employed automatic metrics, specifically FID-CLIP curves on the COCO-30K dataset, to select the optimal guidance-scale value and to compare Kandinsky with competing models (cf. Figure 4). Furthermore, we investigated the impact of different image prior setups on performance: no prior (text embeddings used directly); a linear prior (a single linear layer); a ResNet prior (18 residual MLP blocks); and a transformer diffusion prior.
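The paper's sampling code is not reproduced in this article, but the guidance scale tuned via the FID-CLIP curves typically enters diffusion sampling through classifier-free guidance. A minimal numpy sketch of that combination step (all names here are hypothetical, not from the Kandinsky codebase):

```python
import numpy as np

def classifier_free_guidance(eps_uncond, eps_cond, guidance_scale):
    """Combine conditional and unconditional noise predictions.

    guidance_scale = 1.0 reproduces the purely conditional prediction;
    larger values push samples toward the text condition (raising the
    CLIP score) at the cost of diversity (raising FID). This trade-off
    is exactly what an FID-CLIP curve visualizes across scale values.
    """
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

# Toy example with two fake noise predictions for a 4-dim latent.
eps_u = np.zeros(4)
eps_c = np.ones(4)
print(classifier_free_guidance(eps_u, eps_c, 4.0))  # [4. 4. 4. 4.]
```

Sweeping `guidance_scale` and plotting (FID, CLIP score) pairs at each value produces the curves in Figure 4.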


An essential aspect of our experiments was the exploration of the effect of latent quantization within the MoVQ autoencoder. We examined outputs with latent quantization enabled and disabled to better understand its influence on image generation quality.
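For readers unfamiliar with the operation being toggled here: vector quantization maps each latent vector to its nearest entry in a learned codebook, while disabling it passes the continuous latents through unchanged. A minimal sketch of the nearest-neighbor lookup, assuming a small toy codebook (this is an illustration of the general technique, not the MoVQ implementation):

```python
import numpy as np

def quantize_latents(latents, codebook):
    """Snap each latent vector to its nearest codebook entry (L2 distance).

    latents:  (N, D) array of continuous latent vectors
    codebook: (K, D) array of learned codebook entries
    Returns the quantized latents and the chosen codebook indices.
    """
    # Pairwise squared distances between latents and codebook entries.
    d2 = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = d2.argmin(axis=1)
    return codebook[idx], idx

# Toy 2-entry codebook; each latent snaps to its closer entry.
codebook = np.array([[0.0, 0.0], [1.0, 1.0]])
latents = np.array([[0.1, -0.1], [0.9, 1.2]])
quantized, idx = quantize_latents(latents, codebook)
print(idx)  # [0 1]
```

With quantization disabled, the decoder would instead receive `latents` directly.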


To ensure a comprehensive evaluation, we also included an assessment of the IF model [12], the closest open-source competitor to our proposed model. For this purpose, we computed FID scores for the IF model using the pytorch-fid implementation [13] (Table 1).
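FID compares the Gaussian statistics of Inception features extracted from real and generated images. A self-contained sketch of the underlying formula, FID = ||μ₁ − μ₂||² + Tr(Σ₁ + Σ₂ − 2(Σ₁Σ₂)^½), assuming the feature means and covariances have already been computed (the pytorch-fid package [13] handles feature extraction in practice):

```python
import numpy as np

def fid(mu1, sigma1, mu2, sigma2):
    """Frechet Inception Distance between two Gaussians.

    FID = ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 (S1 S2)^{1/2}).
    Tr((S1 S2)^{1/2}) equals the sum of square roots of the eigenvalues
    of S1 @ S2, which are real and non-negative for PSD covariances.
    """
    diff = mu1 - mu2
    eigvals = np.linalg.eigvals(sigma1 @ sigma2)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0, None)).sum()
    return diff @ diff + np.trace(sigma1) + np.trace(sigma2) - 2 * tr_sqrt

# Identical statistics give a distance of zero, as expected.
mu = np.zeros(3)
sigma = np.eye(3)
print(round(fid(mu, sigma, mu, sigma), 6))  # 0.0
```

Lower FID indicates generated-image statistics closer to the real-image distribution.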


However, we acknowledge the limitations of automatic metrics, which become obvious when it comes to capturing the nuances of user experience. Hence, in addition to the FID-CLIP curves, we conducted a blind human evaluation on the DrawBench dataset (Saharia et al., 2022b) to obtain insightful feedback and validate the quality of the generated images from the perspective of human perception.
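Blind side-by-side evaluations of this kind are usually summarized as a win rate over per-prompt votes. A tiny illustrative sketch of that aggregation, with hypothetical vote labels not taken from the paper:

```python
from collections import Counter

def side_by_side_winrate(votes):
    """Win rate for model A from blind side-by-side trials.

    votes: list of 'A', 'B', or 'tie', one per evaluated prompt.
    Ties are split evenly between the two models.
    """
    counts = Counter(votes)
    total = sum(counts.values())
    return (counts["A"] + 0.5 * counts["tie"]) / total

# Toy example: A wins twice, B once, one tie.
print(side_by_side_winrate(["A", "A", "B", "tie"]))  # 0.625
```

A win rate above 0.5 indicates raters preferred model A's images more often than not.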


Figure 4: CLIP-FID curves for different setups.


Figure 5: Image generation results with prompt "astronaut riding a horse" for original image prior and linear prior trained on 500 pairs of images with cats.


The combination of automatic metrics and human evaluation provides a comprehensive assessment of Kandinsky's performance, enabling us to make informed decisions about the effectiveness and usability of our proposed image prior design.


This paper is available on arxiv under CC BY 4.0 DEED license.


[12] https://github.com/deep-floyd/IF


[13] https://github.com/mseitzer/pytorch-fid