
A Deep Dive Into Stable Diffusion and Other Leading Text-to-Image Models


Too Long; Didn't Read

This article covers a range of advanced text-to-image models, focusing on Stable Diffusion versions, DALL-E, and other models fine-tuned for specific purposes, such as Dreamlike and Openjourney, as well as new models like GigaGAN and DeepFloyd-IF.

Authors:

(1) Tony Lee, Stanford (equal contribution);

(2) Michihiro Yasunaga, Stanford (equal contribution);

(3) Chenlin Meng, Stanford (equal contribution);

(4) Yifan Mai, Stanford;

(5) Joon Sung Park, Stanford;

(6) Agrim Gupta, Stanford;

(7) Yunzhi Zhang, Stanford;

(8) Deepak Narayanan, Microsoft;

(9) Hannah Benita Teufel, Aleph Alpha;

(10) Marco Bellagente, Aleph Alpha;

(11) Minguk Kang, POSTECH;

(12) Taesung Park, Adobe;

(13) Jure Leskovec, Stanford;

(14) Jun-Yan Zhu, CMU;

(15) Li Fei-Fei, Stanford;

(16) Jiajun Wu, Stanford;

(17) Stefano Ermon, Stanford;

(18) Percy Liang, Stanford.

Abstract and 1 Introduction

2 Core framework

3 Aspects

4 Scenarios

5 Metrics

6 Models

7 Experiments and results

8 Related work

9 Conclusion

10 Limitations

Author contributions, Acknowledgments and References

A Datasheet

B Scenario details

C Metric details

D Model details

E Human evaluation procedure

D Model details

Stable Diffusion {v1-4, v1-5, v2-base, v2-1}. Stable Diffusion (v1-4, v1-5, v2-base, v2-1) is a family of 1B-parameter text-to-image models based on latent diffusion [4] and trained on LAION [40], a large-scale paired text-image dataset.
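All four checkpoints share the same latent-diffusion sampling interface, so a hedged sketch of loading and running one of them with the Hugging Face diffusers library looks like the following (the repository ids below are assumptions about where the public weights are hosted, not details from this paper):

```python
import torch
from diffusers import StableDiffusionPipeline

# Assumed Hugging Face repository ids for the four evaluated checkpoints.
CHECKPOINTS = {
    "v1-4": "CompVis/stable-diffusion-v1-4",
    "v1-5": "runwayml/stable-diffusion-v1-5",
    "v2-base": "stabilityai/stable-diffusion-2-base",
    "v2-1": "stabilityai/stable-diffusion-2-1",
}

# Load one member of the family and generate a single image from a prompt.
pipe = StableDiffusionPipeline.from_pretrained(
    CHECKPOINTS["v1-5"], torch_dtype=torch.float16).to("cuda")
image = pipe("an oil painting of a lighthouse at dawn").images[0]
image.save("lighthouse.png")
```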



Specifically, Stable Diffusion v1-1 was trained for 237k steps at resolution 256×256 on laion2B-en and for 194k steps at resolution 512×512 on laion-high-resolution (170M examples from LAION-5B with resolution >= 1024×1024). Stable Diffusion v1-2 was initialized from v1-1 and trained for 515k steps at resolution 512×512 on laion-aesthetics v2 5+. Stable Diffusion v1-4 was initialized from v1-2 and trained for 225k steps at resolution 512×512 on laion-aesthetics v2 5+, with 10% dropout of the text conditioning to improve classifier-free guidance sampling. Similarly, Stable Diffusion v1-5 was initialized from v1-2 and trained for 595k steps at resolution 512×512 on laion-aesthetics v2 5+, also with 10% dropout of the text conditioning.
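The 10% text-conditioning dropout is what enables classifier-free guidance at sampling time: the same network predicts noise both with and without the prompt, and the two predictions are blended. A minimal sketch of that blend, with illustrative names rather than the actual Stable Diffusion code:

```python
import torch

def classifier_free_guidance(eps_uncond: torch.Tensor,
                             eps_cond: torch.Tensor,
                             guidance_scale: float = 7.5) -> torch.Tensor:
    """Combine unconditional and text-conditioned noise predictions.

    Training with 10% text-conditioning dropout lets one network produce both
    eps_uncond (empty prompt) and eps_cond (real prompt); at sampling time the
    two are blended, pushing each denoising step toward the prompt.
    """
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```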


Stable Diffusion v2-base was trained from scratch for 550k steps at resolution 256×256 on a subset of LAION-5B filtered for explicit pornographic material (using the LAION-NSFW classifier with punsafe = 0.1 and an aesthetic score >= 4.5), and then further trained for 850k steps at resolution 512×512 on images from the same dataset with resolution >= 512×512. Stable Diffusion v2-1 was resumed from Stable Diffusion v2-base and fine-tuned using a v-objective [82] on a filtered subset of the LAION dataset.
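For reference, the v-objective [82] replaces the usual noise-prediction target with the velocity v = alpha_t * eps - sigma_t * x0, where x_t = alpha_t * x0 + sigma_t * eps is the noised sample. A small sketch of that target, as an illustration of the formula rather than the fine-tuning code:

```python
import torch

def v_target(x0: torch.Tensor, eps: torch.Tensor,
             alpha_t: torch.Tensor, sigma_t: torch.Tensor) -> torch.Tensor:
    """Velocity target from the v-parameterization: v = alpha_t*eps - sigma_t*x0,
    where the noised sample is x_t = alpha_t*x0 + sigma_t*eps."""
    return alpha_t * eps - sigma_t * x0
```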


Lexica Search (Stable Diffusion v1-5). Lexica Search (Stable Diffusion v1-5) is a search engine for images generated by Stable Diffusion v1-5 [4].


DALL-E 2. DALL-E 2 [3] is a 3.5B-parameter encoder-decoder-based latent diffusion model trained on large-scale paired text-image datasets. The model is available via the OpenAI API.
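Because DALL-E 2 is only reachable through the OpenAI API, generation is a single request. A hedged sketch with the official Python client follows; exact parameter names can vary across SDK versions:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# DALL-E 2 returns hosted image URLs; 256x256, 512x512, and 1024x1024 are supported sizes.
response = client.images.generate(
    model="dall-e-2",
    prompt="a watercolor sketch of a robot tending a garden",
    n=1,
    size="512x512",
)
print(response.data[0].url)
```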


Dreamlike Diffusion 1.0. Dreamlike Diffusion 1.0 [45] is a Stable Diffusion v1-5 model fine-tuned on high-quality art images.


Dreamlike Photoreal 2.0. Dreamlike Photoreal 2.0 [46] is a photorealistic model fine-tuned from Stable Diffusion v1-5. While the original Stable Diffusion generates 512×512 images by default, Dreamlike Photoreal 2.0 generates 768×768 images by default.
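With a diffusers-style pipeline, the larger native resolution is requested explicitly rather than relying on the 512×512 default; a brief sketch (the repository id is an assumption about where the weights are hosted):

```python
import torch
from diffusers import StableDiffusionPipeline

# Assumed Hugging Face repository id for the Dreamlike Photoreal 2.0 weights.
pipe = StableDiffusionPipeline.from_pretrained(
    "dreamlike-art/dreamlike-photoreal-2.0",
    torch_dtype=torch.float16).to("cuda")

# Request the model's native 768x768 resolution instead of the 512x512
# default of the original Stable Diffusion.
image = pipe("a photo of a lakeside cabin at sunrise",
             height=768, width=768).images[0]
```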


Openjourney {v1, v4}. Openjourney [47] is a Stable Diffusion model fine-tuned on Midjourney images. Openjourney v4 [48] was further fine-tuned on more than 124,000 images for 12,400 steps (4 epochs, over 32 training hours). Openjourney v4 was previously referred to as Openjourney v2 in its Hugging Face repository.


Redshift Diffusion. Redshift Diffusion [49] is a Stable Diffusion model fine-tuned on high-resolution 3D artworks.


Vintedois (22h) Diffusion. Vintedois (22h) Diffusion [50] is a Stable Diffusion v1-5 model fine-tuned on a large number of high-quality images with simple prompts, so that it generates beautiful images without extensive prompt engineering.


SafeStableDiffusion-{Weak, Medium, Strong, Max}. Safe Stable Diffusion [8] is an enhanced version of the Stable Diffusion v1-5 model. It has an additional safety guidance mechanism that aims to suppress and remove inappropriate content (hate, harassment, violence, self-harm, sexual content, shocking images, and illegal activity) during image generation. The strength levels for inappropriate content removal are categorized as: {Weak, Medium, Strong, Max}.
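If the installed version of the diffusers library ships its Safe Stable Diffusion pipeline, choosing one of the four strength levels is a one-line change. The sketch below assumes that pipeline and its SafetyConfig presets are available, and the repository id is likewise an assumption rather than a detail from this paper:

```python
import torch
from diffusers import StableDiffusionPipelineSafe
from diffusers.pipelines.stable_diffusion_safe import SafetyConfig

# Assumed repository id for the Safe Stable Diffusion weights.
pipe = StableDiffusionPipelineSafe.from_pretrained(
    "AIML-TUDA/stable-diffusion-safe",
    torch_dtype=torch.float16).to("cuda")

# The preset controls how aggressively inappropriate concepts are suppressed
# during sampling; MAX is the strongest of the four levels (WEAK/MEDIUM/STRONG/MAX).
image = pipe(prompt="a portrait of an astronaut", **SafetyConfig.MAX).images[0]
image.save("safe_sd.png")
```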


Promptist + Stable Diffusion v1-4. Promptist [28] is a prompt engineering model, initialized by a 1.5 billion parameter GPT-2 model [83], specifically designed to refine user input into prompts that are favored by image generation models. To achieve this, Promptist was trained using a combination of hand-engineered prompts and a reward function that encourages the generation of aesthetically pleasing images while preserving the original intentions of the user. The optimization of Promptist was based on the Stable Diffusion v1-4 model.
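A hedged sketch of using such a prompt refiner with the transformers library is shown below; the checkpoint name and the "Rephrase:" suffix are assumptions about how the released model is packaged and prompted, not details given in this paper:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed checkpoint id; Promptist is GPT-2 based, so the GPT-2 tokenizer is used.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("microsoft/Promptist")

plain_prompt = "a cat sitting on a sofa"
inputs = tokenizer(plain_prompt + " Rephrase:", return_tensors="pt")

# Greedily decode a refined prompt; the refined text would then be passed to
# Stable Diffusion v1-4, the model Promptist was optimized against.
outputs = model.generate(
    **inputs, max_new_tokens=75, do_sample=False,
    pad_token_id=tokenizer.eos_token_id)
refined = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:],
                           skip_special_tokens=True)
print(refined)
```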


DALL-E {mini, mega}. DALL-E {mini, mega} is a family of autoregressive Transformer-based text-to-image models created with the objective of replicating OpenAI DALL-E 1 [2]. The mini and mega variants have 0.4B and 2.6B parameters, respectively.


minDALL-E. minDALL-E [53], named after minGPT, is a 1.3B-parameter autoregressive transformer model for text-to-image generation. It was trained using 14 million image-text pairs.


CogView2. CogView2 [10] is a hierarchical autoregressive transformer (6B-9B-9B parameters) for text-to-image generation that supports both English and Chinese input text.


MultiFusion. MultiFusion (13B) [54] is a multimodal, multilingual diffusion model that extends the capabilities of Stable Diffusion v1.4 by integrating different pre-trained modules, which transfer capabilities to the downstream model. This combination results in novel decoder embeddings, which enable prompting of the image generation model with interleaved multimodal, multilingual inputs, despite being trained solely on monomodal data in a single language.


DeepFloyd-IF { M, L, XL } v1.0. DeepFloyd-IF [55] is a pixel-based text-to-image triple-cascaded diffusion model with state-of-the-art photorealism and language understanding. Each cascaded diffusion module is designed to generate images of increasing resolution: 64×64, 256×256, and 1024×1024. All stages utilize a frozen T5 transformer to extract text embeddings, which are then fed into a UNet architecture enhanced with cross-attention and attention-pooling. The model is available in three different sizes: M, L, and XL. M has 0.4B parameters, L has 0.9B parameters, and XL has 4.3B parameters.
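A hedged sketch of running the first two stages of the cascade with the diffusers library is shown below; the repository ids and the exact pipeline methods are assumptions about how the released checkpoints are packaged:

```python
import torch
from diffusers import DiffusionPipeline

# Stage I: 64x64 base model (assumed repository id for the XL, 4.3B-parameter variant).
stage_1 = DiffusionPipeline.from_pretrained(
    "DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16)

# Stage II: 64x64 -> 256x256 super-resolution; it reuses stage I's text embeddings,
# so its own text encoder can be dropped.
stage_2 = DiffusionPipeline.from_pretrained(
    "DeepFloyd/IF-II-L-v1.0", text_encoder=None,
    variant="fp16", torch_dtype=torch.float16)

prompt = "a red panda reading a newspaper"

# The frozen T5 encoder in stage I produces the text embeddings shared by all stages.
prompt_embeds, negative_embeds = stage_1.encode_prompt(prompt)

image = stage_1(prompt_embeds=prompt_embeds,
                negative_prompt_embeds=negative_embeds,
                output_type="pt").images
image = stage_2(image=image, prompt_embeds=prompt_embeds,
                negative_prompt_embeds=negative_embeds,
                output_type="pt").images
# A third upscaling stage would take the 256x256 output to 1024x1024.
```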


GigaGAN. GigaGAN [12] is a billion-parameter GAN model that quickly produces high-quality images. The model was trained on text and image pairs from LAION2B-en [40] and COYO-700M [84].


This paper is available on arXiv under a CC BY 4.0 DEED license.