
Russian Scientists Say New AI Architecture Produces State-of-the-Art Text-to-Image Synthesis


Too Long; Didn't Read

Researchers have developed a text-to-image generation model called Kandinsky that combines an image prior with latent diffusion to produce high-quality, natural-looking images.

Authors:

(1) Anton Razzhigaev, AIRI and Skoltech;

(2) Arseniy Shakhmatov, Sber AI;

(3) Anastasia Maltseva, Sber AI;

(4) Vladimir Arkhipkin, Sber AI;

(5) Igor Pavlov, Sber AI;

(6) Ilya Ryabov, Sber AI;

(7) Angelina Kuts, Sber AI;

(8) Alexander Panchenko, AIRI and Skoltech;

(9) Andrey Kuznetsov, AIRI and Sber AI;

(10) Denis Dimitrov, AIRI and Sber AI.

Editor's Note: This is Part 4 of 8 of a study detailing the development of Kandinsky, the first text-to-image architecture designed using a combination of image prior and latent diffusion. Read the rest below.

4 Kandinsky Architecture

In this work, we aimed to deliver state-of-the-art text-to-image synthesis. In the initial stages of our research, we experimented with multilingual text encoders, such as mT5 (Xue et al., 2021), XLMR (Conneau et al., 2020), and XLMR-CLIP[7], to facilitate robust multilingual text-to-image generation. However, we discovered that using CLIP-image embeddings instead of standalone text encoders resulted in improved image quality. As a result, we adopted an image prior approach, utilizing diffusion and linear mappings between the text and image embedding spaces of CLIP, while retaining additional conditioning on XLMR text embeddings. That is why Kandinsky uses two text encoders: CLIP-text with image prior mapping and XLMR. Both encoders are kept frozen during training.
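The following PyTorch sketch illustrates the frozen-encoder setup described above; the checkpoint names and the `freeze` helper are our own illustrative choices, not the authors' training code.

```python
import torch
from transformers import CLIPTextModel, XLMRobertaModel

# Illustrative checkpoints: the paper uses CLIP (ViT-L/14) and an XLMR-based
# text encoder; the exact Hugging Face checkpoint names here are assumptions.
clip_text = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")
xlmr_text = XLMRobertaModel.from_pretrained("xlm-roberta-large")

def freeze(module: torch.nn.Module) -> None:
    """Disable gradients and switch to eval mode so the encoder stays fixed."""
    for p in module.parameters():
        p.requires_grad = False
    module.eval()

# Both text encoders remain frozen throughout training.
freeze(clip_text)
freeze(xlmr_text)
```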


A significant factor that influenced our design choice was the training efficiency of latent diffusion models compared to pixel-level diffusion models (Rombach et al., 2022). This led us to focus our efforts on the latent diffusion architecture. Our model essentially comprises three stages: text encoding, embedding mapping (image prior), and latent diffusion.


Figure 3: Kandinsky web interface for “a corgi gliding on the wave”: generation (left) and in/outpainting (right).


Table 1: Comparison of the proposed architecture by FID on the COCO-30K validation set at 256×256 resolution. *For the IF model, we report reproduced results on COCO-30K; the authors report an FID of 7.19.


At the embedding mapping step, which we also refer to as the image prior, we use a transformer-encoder model. This model was trained from scratch with a diffusion process on text and image embeddings provided by the CLIP-ViT-L14 model. A noteworthy feature of our training process is the element-wise normalization of visual embeddings. This normalization is based on full-dataset statistics and leads to faster convergence of the diffusion process. At the inference stage, we apply the inverse normalization to return to the original CLIP-image embedding space.
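A minimal sketch of the element-wise embedding normalization mentioned above, assuming per-dimension mean/std statistics precomputed over the full training set of CLIP-image embeddings (function and variable names are ours):

```python
import torch

def embedding_stats(all_image_embs: torch.Tensor, eps: float = 1e-6):
    """Per-dimension statistics over the full dataset of CLIP-image
    embeddings (all_image_embs: [num_images, embed_dim])."""
    mean = all_image_embs.mean(dim=0)
    std = all_image_embs.std(dim=0).clamp_min(eps)
    return mean, std

def normalize(image_emb, mean, std):
    # Element-wise normalization: the image prior diffusion is trained
    # on embeddings in this normalized space (faster convergence).
    return (image_emb - mean) / std

def denormalize(pred_emb, mean, std):
    # Inverse normalization applied at inference to return predictions
    # to the original CLIP-image embedding space.
    return pred_emb * std + mean
```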


The image prior model is trained on text and image embeddings provided by the CLIP models. We conducted a series of experiments and ablation studies on the specific architecture design of the image prior model (Table 3, Figure 6). The model with the best human evaluation score is based on 1D diffusion and a standard transformer encoder with the following parameters: num_layers=20, num_heads=32, and hidden_size=2048.
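As a rough illustration of such a prior, here is a sketch of a transformer encoder with the stated size (num_layers=20, num_heads=32, hidden_size=2048) that denoises a CLIP-image embedding conditioned on a CLIP-text embedding and a diffusion timestep; the way the conditioning sequence is assembled is our assumption, not the released architecture.

```python
import torch
import torch.nn as nn

class ImagePriorSketch(nn.Module):
    """Sketch of a diffusion image prior: a standard transformer encoder
    that maps a noised CLIP-image embedding, conditioned on a CLIP-text
    embedding and a timestep, to a denoised image embedding."""

    def __init__(self, clip_dim: int = 768, hidden: int = 2048,
                 layers: int = 20, heads: int = 32):
        super().__init__()
        self.in_proj = nn.Linear(clip_dim, hidden)
        self.time_emb = nn.Sequential(
            nn.Linear(1, hidden), nn.SiLU(), nn.Linear(hidden, hidden)
        )
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)
        self.out_proj = nn.Linear(hidden, clip_dim)

    def forward(self, noised_img_emb, text_emb, t):
        # noised_img_emb, text_emb: [B, clip_dim]; t: [B] diffusion timesteps.
        # Build a short token sequence: [time, text, noised image embedding].
        tokens = torch.stack([
            self.time_emb(t.float().view(-1, 1)),
            self.in_proj(text_emb),
            self.in_proj(noised_img_emb),
        ], dim=1)
        hidden = self.encoder(tokens)
        # Read the prediction off the image-token position.
        return self.out_proj(hidden[:, -1])
```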


The latent diffusion part employs a UNet model along with a custom pre-trained autoencoder. Our diffusion model uses a combination of multiple condition signals: CLIP-image embeddings, CLIP-text embeddings, and XLMR-CLIP text embeddings. CLIP-image and XLMR-CLIP embeddings are merged and utilized as an input to the latent diffusion process. In addition, we condition the diffusion process on these embeddings by adding all of them to the time embedding. Notably, we did not skip the quantization step of the autoencoder during diffusion inference, as it leads to an increase in the diversity and the quality of generated images (cf. Figure 4). In total, our model comprises 3.3B parameters (Table 2).
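A hedged sketch of how the condition signals described above could be combined: CLIP-image and XLMR-CLIP embeddings are merged into one context sequence for the UNet, and projections of all condition signals are additionally added to the time embedding. Dimensions, module names, and the pooling choice are our assumptions.

```python
import torch
import torch.nn as nn

class ConditionFusion(nn.Module):
    """Illustrative sketch (not the released code) of the conditioning scheme
    described in the text."""

    def __init__(self, clip_dim=768, xlmr_dim=1024, context_dim=768, time_dim=1536):
        super().__init__()
        # Dimensions are assumptions for illustration.
        self.img_proj = nn.Linear(clip_dim, context_dim)
        self.xlmr_proj = nn.Linear(xlmr_dim, context_dim)
        self.txt_proj = nn.Linear(clip_dim, context_dim)
        self.to_time = nn.Linear(context_dim, time_dim)

    def forward(self, clip_image_emb, clip_text_emb, xlmr_text_emb, time_emb):
        # clip_image_emb: [B, N_img, clip_dim]; xlmr_text_emb: [B, N_txt, xlmr_dim];
        # clip_text_emb: [B, clip_dim]; time_emb: [B, time_dim].
        img = self.img_proj(clip_image_emb)
        xlmr = self.xlmr_proj(xlmr_text_emb)
        # Merge CLIP-image and XLMR-CLIP embeddings into the context sequence
        # that the latent diffusion UNet consumes.
        context = torch.cat([img, xlmr], dim=1)
        # Additionally condition the process by adding pooled projections of
        # all embeddings to the time embedding.
        pooled = img.mean(dim=1) + xlmr.mean(dim=1) + self.txt_proj(clip_text_emb)
        time_emb = time_emb + self.to_time(pooled)
        return context, time_emb
```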


Table 2: Kandinsky model parameters.


We observed that image decoding was our main bottleneck in terms of generated image quality; hence, we developed Sber-MoVQGAN, our custom implementation of MoVQGAN (Zheng et al., 2022) with minor modifications. We trained this autoencoder on the LAION HighRes dataset (Schuhmann et al., 2022), obtaining SotA results in image reconstruction. We released the weights and code for these models under an open-source license[11]. The comparison of our autoencoder with competitors can be found in Table 4.
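For context, reconstruction quality of such autoencoders is typically compared with simple pixel-space metrics; the helper below (our own, not part of the released Sber-MoVQGAN code) computes L1 and PSNR between original and reconstructed image batches.

```python
import torch

def reconstruction_metrics(original: torch.Tensor, reconstructed: torch.Tensor) -> dict:
    """L1 and PSNR between original and reconstructed images (values in [0, 1],
    shape [B, C, H, W]); a simple proxy for the kind of autoencoder
    comparison reported in Table 4."""
    l1 = (original - reconstructed).abs().mean().item()
    mse = ((original - reconstructed) ** 2).mean().item()
    psnr = 10.0 * torch.log10(torch.tensor(1.0 / max(mse, 1e-12))).item()
    return {"L1": l1, "PSNR": psnr}
```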


Table 3: Ablation study: FID on the COCO-30K validation set at 256×256 resolution.


This paper is available on arxiv under CC BY 4.0 DEED license.


[7] https://github.com/FreddeFrallan/Multilingual-CLIP


[8] https://github.com/Stability-AI/stablediffusion


[9] https://github.com/ai-forever/ru-dalle


[10] https://github.com/gligen/GLIGEN


[11] https://github.com/ai-forever/MoVQGAN