A Deep Dive Into Stable Diffusion and Other Leading Text-to-Image Models

by Auto Encoder: How to Ignore the Signal Noise, October 13th, 2024

Too Long; Didn't Read

This article covers a range of advanced text-to-image models, focusing on Stable Diffusion versions, DALL-E, and other models fine-tuned for specific purposes, such as Dreamlike and Openjourney, as well as new models like GigaGAN and DeepFloyd-IF.

Authors:

(1) Tony Lee, Stanford, with equal contribution;

(2) Michihiro Yasunaga, Stanford, with equal contribution;

(3) Chenlin Meng, Stanford, with equal contribution;

(4) Yifan Mai, Stanford;

(5) Joon Sung Park, Stanford;

(6) Agrim Gupta, Stanford;

(7) Yunzhi Zhang, Stanford;

(8) Deepak Narayanan, Microsoft;

(9) Hannah Benita Teufel, Aleph Alpha;

(10) Marco Bellagente, Aleph Alpha;

(11) Minguk Kang, POSTECH;

(12) Taesung Park, Adobe;

(13) Jure Leskovec, Stanford;

(14) Jun-Yan Zhu, CMU;

(15) Li Fei-Fei, Stanford;

(16) Jiajun Wu, Stanford;

(17) Stefano Ermon, Stanford;

(18) Percy Liang, Stanford.

Table of Links

Abstract and 1 Introduction

7 Experiments and results

8 Related work

9 Conclusion

10 Limitations

Author contributions, Acknowledgments and References

E Human evaluation procedure

D Model details

Stable Diffusion {v1-4, v1-5, v2-base, v2-1}. Stable Diffusion (v1-4, v1-5, v2-base, v2-1) is a family of 1B-parameter text-to-image models based on latent diffusion [4] trained on LAION [40], a large-scale paired text-image dataset.

Specifically, Stable Diffusion v1-1 was trained for 237k steps at resolution 256x256 on laion2B-en and for 194k steps at resolution 512x512 on laion-high-resolution (170M examples from LAION-5B with resolution >= 1024x1024). Stable Diffusion v1-2 was initialized from v1-1 and trained for 515k steps at resolution 512x512 on laion-aesthetics v2 5+. Stable Diffusion v1-4 was initialized from v1-2 and trained for 225k steps at resolution 512x512 on laion-aesthetics v2 5+, with 10% dropping of the text-conditioning to improve classifier-free guidance sampling. Similarly, Stable Diffusion v1-5 was initialized from v1-2 and trained for 595k steps at resolution 512x512 on laion-aesthetics v2 5+, with 10% dropping of the text-conditioning.
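To make the role of the 10% text-conditioning dropout more concrete, here is a minimal Python sketch. It is not the Stable Diffusion training code. The helper names (`maybe_drop_caption`, `cfg_combine`), the use of an empty string as the "dropped" caption, and the guidance scale of 7.5 are all illustrative assumptions. Only the 10% drop rate comes from the text above. The sketch shows the two halves of the idea: captions are randomly blanked during training so the model also learns an unconditional prediction, and at sampling time the two predictions are combined with the commonly used classifier-free guidance rule.

```python
import random

DROP_PROB = 0.10  # 10% dropping of the text-conditioning (from the text)


def maybe_drop_caption(caption, rng, drop_prob=DROP_PROB):
    """Return an empty caption with probability drop_prob, else the caption.

    Using "" as the unconditional caption is an assumption of this sketch.
    """
    return "" if rng.random() < drop_prob else caption


def cfg_combine(eps_uncond, eps_cond, guidance_scale):
    """Classifier-free guidance: start at the unconditional prediction and
    move toward the conditional one, scaled by guidance_scale."""
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Training side: roughly 10% of captions end up blanked.
rng = random.Random(0)
captions = ["a photo of a cat"] * 10_000
dropped = sum(maybe_drop_caption(c, rng) == "" for c in captions)
drop_fraction = dropped / len(captions)
print(f"fraction of captions dropped: {drop_fraction:.3f}")

# Sampling side: toy two-element "noise predictions" (illustrative values).
guided = cfg_combine([0.0, 1.0], [1.0, 1.0], guidance_scale=7.5)
print(guided)
```

With a guidance scale of 1 the rule returns the conditional prediction unchanged, and with a scale of 0 it returns the unconditional one; larger scales push the sample further toward the text condition.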

Stable Diffusion v2-base was trained from scratch for 550k steps at resolution 256x256 on a subset of LAION-5B filtered to remove explicit pornographic material, using the LAION-NSFW classifier with punsafe = 0.1 and an aesthetic score >= 4.5. It was then trained for a further 850k steps at resolution 512x512 on the same dataset, restricted to images with resolution >= 512x512. Stable Diffusion v2-1 was resumed from Stable Diffusion v2-base and fine-tuned using a v-objective [82] on a filtered subset of the LAION dataset.
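The dataset filtering described above can be sketched as a pair of predicates over per-image metadata. This is only an illustration under stated assumptions: the `Sample` record layout and function names are hypothetical, and reading "punsafe = 0.1" as "keep images whose NSFW score is below 0.1" is an interpretation, not something the text spells out. The thresholds themselves (0.1, 4.5, and 512) are taken from the paragraph above.

```python
from dataclasses import dataclass

PUNSAFE_THRESHOLD = 0.1  # from the text
MIN_AESTHETIC = 4.5      # from the text
MIN_SIDE = 512           # second-stage resolution requirement (from the text)


@dataclass
class Sample:
    """Hypothetical metadata record for one image-text pair."""
    url: str
    punsafe: float    # NSFW classifier score in [0, 1]
    aesthetic: float  # aesthetic predictor score
    width: int
    height: int


def keep_for_256_stage(s):
    # Assumption: samples scoring below the punsafe threshold are kept.
    return s.punsafe < PUNSAFE_THRESHOLD and s.aesthetic >= MIN_AESTHETIC


def keep_for_512_stage(s):
    # Same filter, plus both sides at least 512 pixels.
    return keep_for_256_stage(s) and min(s.width, s.height) >= MIN_SIDE


# Made-up records purely for demonstration.
samples = [
    Sample("a", punsafe=0.02, aesthetic=5.1, width=1024, height=768),
    Sample("b", punsafe=0.40, aesthetic=6.0, width=1024, height=1024),
    Sample("c", punsafe=0.05, aesthetic=4.0, width=800, height=800),
    Sample("d", punsafe=0.01, aesthetic=4.5, width=400, height=600),
]

stage_256 = [s.url for s in samples if keep_for_256_stage(s)]
stage_512 = [s.url for s in samples if keep_for_512_stage(s)]
print(stage_256, stage_512)
```

In this toy data, record "d" passes the safety and aesthetic filter but is too small for the 512x512 stage, which mirrors how the second training phase uses a stricter subset of the same dataset.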

Lexica Search (Stable Diffusion v1-5). Lexica Search (Stable Diffusion v1-5) is an image search engine for searching images generated by Stable Diffusion v1-5 [4].

DALL-E 2. DALL-E 2 [3] is a 3.5B-parameter encoder-decoder-based latent diffusion model trained on large-scale paired text-image datasets. The model is available via the OpenAI API.

Dreamlike Diffusion 1.0. Dreamlike Diffusion 1.0 [45] is a Stable Diffusion v1-5 model fine-tuned on high-quality art images.

Dreamlike Photoreal 2.0. Dreamlike Photoreal 2.0 [46] is a photorealistic model fine-tuned from Stable Diffusion 1.5. While the original Stable Diffusion generates resolutions of 512×512 by default, Dreamlike Photoreal 2.0 generates 768×768 by default.

Openjourney {v1, v4}. Openjourney [47] is a Stable Diffusion model fine-tuned on Midjourney images. Openjourney v4 [48] was further fine-tuned on more than 124,000 images for 12,400 steps (4 epochs), taking more than 32 training hours. Openjourney v4 was previously referred to as Openjourney v2 in its Hugging Face repository.

Redshift Diffusion. Redshift Diffusion [49] is a Stable Diffusion model fine-tuned on high-resolution 3D artworks.

Vintedois (22h) Diffusion. Vintedois (22h) Diffusion [50] is a Stable Diffusion v1-5 model fine-tuned on a large number of high-quality images with simple prompts, so that it can generate beautiful images without much prompt engineering.

SafeStableDiffusion-{Weak, Medium, Strong, Max}. Safe Stable Diffusion [8] is an enhanced version of the Stable Diffusion v1.5 model. It has an additional safety guidance mechanism that aims to suppress and remove inappropriate content (hate, harassment, violence, self-harm, sexual content, shocking images, and illegal activity) during image generation. The strength levels for inappropriate content removal are categorized as: {Weak, Medium, Strong, Max}.

Promptist + Stable Diffusion v1-4. Promptist [28] is a prompt engineering model, initialized by a 1.5 billion parameter GPT-2 model [83], specifically designed to refine user input into prompts that are favored by image generation models. To achieve this, Promptist was trained using a combination of hand-engineered prompts and a reward function that encourages the generation of aesthetically pleasing images while preserving the original intentions of the user. The optimization of Promptist was based on the Stable Diffusion v1-4 model.

DALL-E {mini, mega}. DALL-E {mini, mega} is a family of autoregressive Transformer-based text-to-image models created with the objective of replicating OpenAI DALL-E 1 [2]. The mini and mega variants have 0.4B and 2.6B parameters, respectively.

minDALL-E. minDALL-E [53], named after minGPT, is a 1.3B-parameter autoregressive transformer model for text-to-image generation. It was trained using 14 million image-text pairs.

CogView2. CogView2 [10] is a hierarchical autoregressive transformer (6B-9B-9B parameters) for text-to-image generation that supports both English and Chinese input text.

MultiFusion. MultiFusion (13B) [54] is a multimodal, multilingual diffusion model that extends the capabilities of Stable Diffusion v1.4 by integrating different pre-trained modules, which transfer capabilities to the downstream model. This combination results in novel decoder embeddings, which enable prompting of the image generation model with interleaved multimodal, multilingual inputs, despite being trained solely on monomodal data in a single language.

DeepFloyd-IF { M, L, XL } v1.0. DeepFloyd-IF [55] is a pixel-based text-to-image triple-cascaded diffusion model with state-of-the-art photorealism and language understanding. Each cascaded diffusion module is designed to generate images of increasing resolution: 64×64, 256×256, and 1024×1024. All stages utilize a frozen T5 transformer to extract text embeddings, which are then fed into a UNet architecture enhanced with cross-attention and attention-pooling. The model is available in three different sizes: M, L, and XL. M has 0.4B parameters, L has 0.9B parameters, and XL has 4.3B parameters.
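As a rough illustration of the cascade described above, the following Python sketch works through the arithmetic implied by the three stage resolutions. It does not call DeepFloyd-IF or any model. The helper names and the assumption of 3-channel RGB images are illustrative, while the resolutions and parameter counts are copied from the text.

```python
STAGE_RESOLUTIONS = [64, 256, 1024]                # from the text
PARAMS_BILLIONS = {"M": 0.4, "L": 0.9, "XL": 4.3}  # from the text


def upscale_factors(resolutions):
    """Per-side upscaling factor between consecutive cascade stages."""
    return [b // a for a, b in zip(resolutions, resolutions[1:])]


def values_per_image(resolutions, channels=3):
    """Number of pixel values each stage outputs (RGB assumed)."""
    return [r * r * channels for r in resolutions]


factors = upscale_factors(STAGE_RESOLUTIONS)
sizes = values_per_image(STAGE_RESOLUTIONS)

for res, n in zip(STAGE_RESOLUTIONS, sizes):
    print(f"{res}x{res}: {n:,} values")
print("upscale factors:", factors)
print("XL / M parameter ratio:", PARAMS_BILLIONS["XL"] / PARAMS_BILLIONS["M"])
```

Each stage enlarges the image by 4x per side, so the first stage works on 256 times fewer pixel values than the final 1024×1024 output. That gap suggests one reason for working in pixel space through a cascade rather than generating the full resolution in a single step.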

GigaGAN. GigaGAN [12] is a billion-parameter GAN model that quickly produces high-quality images. The model was trained on text and image pairs from LAION2B-en [40] and COYO-700M [84].