Authors:
(1) Anton Razzhigaev, AIRI and Skoltech;
(2) Arseniy Shakhmatov, Sber AI;
(3) Anastasia Maltseva, Sber AI;
(4) Vladimir Arkhipkin, Sber AI;
(5) Igor Pavlov, Sber AI;
(6) Ilya Ryabov, Sber AI;
(7) Angelina Kuts, Sber AI;
(8) Alexander Panchenko, AIRI and Skoltech;
(9) Andrey Kuznetsov, AIRI and Sber AI;
(10) Denis Dimitrov, AIRI and Sber AI.
Editor's Note: This is Part 5 of 8 of a study detailing the development of Kandinsky, the first text-to-image architecture designed using a combination of image prior and latent diffusion. Read the rest below.
In our experimental analysis, we sought to evaluate and refine the performance of the proposed latent diffusion architecture. To this end, we employed automatic metrics, specifically FID-CLIP curves on the COCO-30K dataset, to determine the optimal guidance-scale value and to compare Kandinsky with its competitors (cf. Figure 4). Furthermore, we investigated various image prior setups, exploring how different configurations affect performance: no prior, where text embeddings are used directly; a linear prior, implemented as a single linear layer; a ResNet prior, consisting of 18 residual MLP blocks; and a transformer diffusion prior.
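As a rough illustration of how one point on such a FID-CLIP curve can be produced, the sketch below generates images at a fixed guidance scale and scores them with pytorch-fid and an open_clip ViT-B/32 model. The `generate_images` helper, the directory layout, and the choice of CLIP backbone are assumptions for illustration, not details taken from the paper.

```python
# Minimal sketch of producing one point on a FID-CLIP curve for a given
# guidance scale. `generate_images` is a hypothetical stand-in for the
# actual Kandinsky sampling call; paths and model names are assumptions.
import torch
from PIL import Image
from pytorch_fid.fid_score import calculate_fid_given_paths
import open_clip

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
clip_model = clip_model.to(device).eval()
tokenizer = open_clip.get_tokenizer("ViT-B-32")


def clip_score(image_paths, captions):
    """Average cosine similarity between image and caption embeddings."""
    scores = []
    with torch.no_grad():
        for path, caption in zip(image_paths, captions):
            image = preprocess(Image.open(path)).unsqueeze(0).to(device)
            text = tokenizer([caption]).to(device)
            img_emb = clip_model.encode_image(image)
            txt_emb = clip_model.encode_text(text)
            img_emb /= img_emb.norm(dim=-1, keepdim=True)
            txt_emb /= txt_emb.norm(dim=-1, keepdim=True)
            scores.append((img_emb * txt_emb).sum().item())
    return sum(scores) / len(scores)


def curve_point(prompts, real_dir, gen_dir, guidance_scale):
    # generate_images(...) is a placeholder for the model's sampling loop.
    image_paths = generate_images(prompts, guidance_scale, out_dir=gen_dir)
    fid = calculate_fid_given_paths(
        [real_dir, gen_dir], batch_size=50, device=device, dims=2048
    )
    return fid, clip_score(image_paths, prompts)
```

Sweeping `guidance_scale` over a grid and plotting the resulting (FID, CLIP score) pairs yields the curves used to pick the operating point.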
Another essential aspect of our experiments was the effect of latent quantization within the MoVQ autoencoder. We examined outputs with quantization enabled and disabled to better understand its influence on image generation quality.
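For readers unfamiliar with vector quantization, the following minimal sketch (not the MoVQ implementation) shows what toggling latent quantization amounts to: with quantization on, each latent vector is snapped to its nearest codebook entry before decoding; with it off, the continuous latents are passed through unchanged. Tensor shapes and the codebook are assumptions.

```python
# Illustrative sketch of enabling/disabling latent quantization in a
# VQ-style autoencoder. Not the MoVQ code; shapes and codebook size are
# assumptions for demonstration only.
import torch


def maybe_quantize(latents, codebook, enabled=True):
    """latents: (B, C, H, W); codebook: (K, C)."""
    if not enabled:
        return latents  # decode the continuous latents as-is
    b, c, h, w = latents.shape
    flat = latents.permute(0, 2, 3, 1).reshape(-1, c)   # (B*H*W, C)
    dists = torch.cdist(flat, codebook)                 # (B*H*W, K)
    indices = dists.argmin(dim=1)                       # nearest code index
    quantized = codebook[indices].reshape(b, h, w, c)
    return quantized.permute(0, 3, 1, 2).contiguous()
```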
To ensure a comprehensive evaluation, we also assessed the IF model [12], the closest open-source competitor to our proposed model, and computed its FID scores with pytorch-fid [13] (Table 1).
However, we acknowledge the limitations of automatic metrics, which become apparent when it comes to capturing nuances of user experience. Hence, in addition to the FID-CLIP curves, we conducted a blind human evaluation on the DrawBench dataset (Saharia et al., 2022b) to obtain direct feedback and to validate the quality of the generated images from the perspective of human perception.
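As an illustration of how such blind judgments might be aggregated, the sketch below computes per-model win rates from side-by-side votes; the vote format is an assumption made for this example rather than the exact protocol used in the study.

```python
# Minimal sketch of aggregating blind side-by-side votes into win rates.
# The record format (one dict per prompt with a "winner" field) is an
# assumption for illustration, not the protocol from the paper.
from collections import Counter


def win_rates(votes):
    """votes: iterable of dicts like {"prompt": ..., "winner": "model_a"}."""
    counts = Counter(v["winner"] for v in votes)
    total = sum(counts.values())
    return {model: count / total for model, count in counts.items()}


# Example: three DrawBench-style comparisons between two systems.
example = [
    {"prompt": "a red cube on a blue sphere", "winner": "kandinsky"},
    {"prompt": "a horse reading a newspaper", "winner": "kandinsky"},
    {"prompt": "a sign that says 'diffusion'", "winner": "baseline"},
]
print(win_rates(example))  # {'kandinsky': 0.67, 'baseline': 0.33} approx.
```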
The combination of automatic metrics and human evaluation provides a comprehensive assessment of Kandinsky's performance, enabling us to make informed decisions about the effectiveness and usability of our proposed image prior designs.
This paper is available on arxiv under CC BY 4.0 DEED license.
[12] https://github.com/deep-floyd/IF
[13] https://github.com/mseitzer/pytorch-fid