Russian Scientists Create AI That Generates Images People Actually Love by @autoencoder


Too Long; Didn't Read

Researchers have developed Kandinsky, a text-to-image generation model that combines an image prior with latent diffusion to produce natural-looking images.

Authors:

(1) Anton Razzhigaev, AIRI and Skoltech;

(2) Arseniy Shakhmatov, Sber AI;

(3) Anastasia Maltseva, Sber AI;

(4) Vladimir Arkhipkin, Sber AI;

(5) Igor Pavlov, Sber AI;

(6) Ilya Ryabov, Sber AI;

(7) Angelina Kuts, Sber AI;

(8) Alexander Panchenko, AIRI and Skoltech;

(9) Andrey Kuznetsov, AIRI and Sber AI;

(10) Denis Dimitrov, AIRI and Sber AI.

Editor's Note: This is Part 5 of 8 of a study detailing the development of Kandinsky, the first text-to-image architecture designed using a combination of image prior and latent diffusion. Read the rest below.

5 Experiments

In our experimental analysis, we sought to evaluate and refine the performance of the proposed latent diffusion architecture. To this end, we employed automatic metrics, specifically FID-CLIP curves on the COCO-30K dataset, to select the optimal guidance-scale value and to compare Kandinsky with competing models (cf. Figure 4). Furthermore, we investigated the impact of different image prior setups on performance: no prior (text embeddings used directly); a linear prior (a single linear layer); a ResNet prior (18 residual MLP blocks); and a transformer diffusion prior.
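The paper's sampling code is not reproduced in this article, but the guidance scale tuned via the FID-CLIP curves typically enters diffusion sampling through classifier-free guidance. A minimal numpy sketch of that combination step (all names here are hypothetical, not from the Kandinsky codebase):

```python
import numpy as np

def classifier_free_guidance(eps_uncond, eps_cond, guidance_scale):
    """Combine conditional and unconditional noise predictions.

    guidance_scale = 1.0 reproduces the purely conditional prediction;
    larger values push samples toward the text condition (raising the
    CLIP score) at the cost of diversity (raising FID). This trade-off
    is exactly what an FID-CLIP curve visualizes across scale values.
    """
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

# Toy example with two fake noise predictions for a 4-dim latent.
eps_u = np.zeros(4)
eps_c = np.ones(4)
print(classifier_free_guidance(eps_u, eps_c, 4.0))  # [4. 4. 4. 4.]
```

Sweeping `guidance_scale` and plotting (FID, CLIP score) pairs at each value produces the curves in Figure 4.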


An essential aspect of our experiments was the exploration of the effect of latent quantization within the MoVQ autoencoder. We examined outputs with latent quantization enabled and disabled to better understand its influence on image generation quality.
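For readers unfamiliar with the operation being toggled here: vector quantization maps each latent vector to its nearest entry in a learned codebook, while disabling it passes the continuous latents through unchanged. A minimal sketch of the nearest-neighbor lookup, assuming a small toy codebook (this is an illustration of the general technique, not the MoVQ implementation):

```python
import numpy as np

def quantize_latents(latents, codebook):
    """Snap each latent vector to its nearest codebook entry (L2 distance).

    latents:  (N, D) array of continuous latent vectors
    codebook: (K, D) array of learned codebook entries
    Returns the quantized latents and the chosen codebook indices.
    """
    # Pairwise squared distances between latents and codebook entries.
    d2 = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = d2.argmin(axis=1)
    return codebook[idx], idx

# Toy 2-entry codebook; each latent snaps to its closer entry.
codebook = np.array([[0.0, 0.0], [1.0, 1.0]])
latents = np.array([[0.1, -0.1], [0.9, 1.2]])
quantized, idx = quantize_latents(latents, codebook)
print(idx)  # [0 1]
```

With quantization disabled, the decoder would instead receive `latents` directly.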


To ensure a comprehensive evaluation, we also included an assessment of the IF model [12], the closest open-source competitor to our proposed model. For this purpose, we computed FID scores for the IF model using the pytorch-fid implementation [13] (Table 1).
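FID compares the Gaussian statistics of Inception features extracted from real and generated images. A self-contained sketch of the underlying formula, FID = ||μ₁ − μ₂||² + Tr(Σ₁ + Σ₂ − 2(Σ₁Σ₂)^½), assuming the feature means and covariances have already been computed (the pytorch-fid package [13] handles feature extraction in practice):

```python
import numpy as np

def fid(mu1, sigma1, mu2, sigma2):
    """Frechet Inception Distance between two Gaussians.

    FID = ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 (S1 S2)^{1/2}).
    Tr((S1 S2)^{1/2}) equals the sum of square roots of the eigenvalues
    of S1 @ S2, which are real and non-negative for PSD covariances.
    """
    diff = mu1 - mu2
    eigvals = np.linalg.eigvals(sigma1 @ sigma2)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0, None)).sum()
    return diff @ diff + np.trace(sigma1) + np.trace(sigma2) - 2 * tr_sqrt

# Identical statistics give a distance of zero, as expected.
mu = np.zeros(3)
sigma = np.eye(3)
print(round(fid(mu, sigma, mu, sigma), 6))  # 0.0
```

Lower FID indicates generated-image statistics closer to the real-image distribution.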


However, we acknowledge the limitations of automatic metrics, which become obvious when it comes to capturing the nuances of user experience. Hence, in addition to the FID-CLIP curves, we conducted a blind human evaluation on the DrawBench dataset (Saharia et al., 2022b) to obtain insightful feedback and validate the quality of the generated images from the perspective of human perception.
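Blind side-by-side evaluations of this kind are usually summarized as a win rate over per-prompt votes. A tiny illustrative sketch of that aggregation, with hypothetical vote labels not taken from the paper:

```python
from collections import Counter

def side_by_side_winrate(votes):
    """Win rate for model A from blind side-by-side trials.

    votes: list of 'A', 'B', or 'tie', one per evaluated prompt.
    Ties are split evenly between the two models.
    """
    counts = Counter(votes)
    total = sum(counts.values())
    return (counts["A"] + 0.5 * counts["tie"]) / total

# Toy example: A wins twice, B once, one tie.
print(side_by_side_winrate(["A", "A", "B", "tie"]))  # 0.625
```

A win rate above 0.5 indicates raters preferred model A's images more often than not.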


Figure 4: CLIP-FID curves for different setups.


Figure 5: Image generation results with prompt "astronaut riding a horse" for original image prior and linear prior trained on 500 pairs of images with cats.


The combination of automatic metrics and human evaluation provides a comprehensive assessment of Kandinsky's performance, enabling us to make informed decisions about the effectiveness and usability of our proposed image prior design.


This paper is available on arxiv under CC BY 4.0 DEED license.


[12] https://github.com/deep-floyd/IF


[13] https://github.com/mseitzer/pytorch-fid