
Russian Scientists Develop AI That Outsmarts Top Image Models With a Surprisingly Simple Trick


Too Long; Didn't Read

Researchers have developed Kandinsky, a text-to-image generation model that combines an image prior with latent diffusion to produce natural-looking images.

Authors:

(1) Anton Razzhigaev, AIRI and Skoltech;

(2) Arseniy Shakhmatov, Sber AI;

(3) Anastasia Maltseva, Sber AI;

(4) Vladimir Arkhipkin, Sber AI;

(5) Igor Pavlov, Sber AI;

(6) Ilya Ryabov, Sber AI;

(7) Angelina Kuts, Sber AI;

(8) Alexander Panchenko, AIRI and Skoltech;

(9) Andrey Kuznetsov, AIRI and Sber AI;

(10) Denis Dimitrov, AIRI and Sber AI.

Editor's Note: This is Part 6 of 8 of a study detailing the development of Kandinsky, the first text-to-image architecture designed using a combination of an image prior and latent diffusion. Read the rest below.

6 Results

Our experiments and evaluations have showcased the capabilities of the Kandinsky architecture in text-to-image synthesis. Kandinsky achieved an FID score of 8.03 on the COCO-30K validation set at a resolution of 256×256, placing it in close competition with state-of-the-art models and among the top performers within open-source systems. Our methodical ablation studies further dissected the performance of different configurations: quantization of latent codes in MoVQ slightly improves image quality (FID 9.86 vs 9.87), while the best CLIP score and human-evaluation score are obtained with the diffusion prior.


Figure 6: Human evaluation: competitors vs Kandinsky with diffusion prior on DrawBench. The total count of votes is 5000.


Table 4: Sber-MoVQGAN comparison with competitors on the ImageNet dataset.
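For readers who want to run this kind of evaluation themselves, the sketch below shows how an FID score can be computed with the torchmetrics library (which relies on torch-fidelity). The paper does not describe its exact evaluation pipeline, so the batch sizes and the random tensors standing in for real COCO-30K images and generated samples are illustrative assumptions only.

```python
# Minimal FID sketch, assuming torchmetrics and torch-fidelity are installed.
# The random uint8 tensors are placeholders for real COCO-30K images and
# model-generated samples at 256x256 resolution.
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

fid = FrechetInceptionDistance(feature=2048)  # InceptionV3 pooling features

real_images = torch.randint(0, 256, (32, 3, 256, 256), dtype=torch.uint8)
fake_images = torch.randint(0, 256, (32, 3, 256, 256), dtype=torch.uint8)

fid.update(real_images, real=True)    # reference distribution
fid.update(fake_images, real=False)   # generated distribution

print(f"FID: {fid.compute().item():.2f}")  # lower is better
```

In practice the two distributions would be fed in batches from the COCO-30K validation set and from the model's outputs for the corresponding captions.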


The best FID score, 8.03, is achieved with the linear prior. This is an intriguing outcome: the simplest linear mapping showcased the best FID, suggesting that a linear relationship might exist between the visual and textual embedding vector spaces. To scrutinize this hypothesis further, we trained a linear mapping on a subset of 500 cat images and termed it the "cat prior". Astonishingly, this mapping displayed high proficiency (cf. Figure 5).
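As a rough illustration of this idea (not the authors' actual training code), the sketch below fits a linear map from text embeddings to image embeddings with an ordinary least-squares solve. The embedding dimensions and the randomly generated paired embeddings are assumptions made purely for the example.

```python
# Sketch: fitting a "linear prior" that maps text embeddings to image
# embeddings via least squares. Random vectors stand in for paired
# text/image embeddings; dimensions here are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n_pairs, text_dim, image_dim = 500, 768, 768   # e.g. 500 images for a "cat prior"

text_emb = rng.standard_normal((n_pairs, text_dim))    # placeholder text embeddings
image_emb = rng.standard_normal((n_pairs, image_dim))  # placeholder image embeddings

# Solve min_W ||text_emb @ W - image_emb||^2 in closed form.
W, _, _, _ = np.linalg.lstsq(text_emb, image_emb, rcond=None)

# At inference time, a new text embedding is mapped to a predicted image embedding.
new_text = rng.standard_normal((1, text_dim))
predicted_image_emb = new_text @ W
print(predicted_image_emb.shape)  # (1, 768)
```

If a single matrix W trained this way already produces usable image embeddings, that supports the hypothesis that the two embedding spaces are related approximately linearly.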


This paper is available on arXiv under a CC BY 4.0 DEED license.