Authors:
(1) Dustin Podell, Stability AI, Applied Research;
(2) Zion English, Stability AI, Applied Research;
(3) Kyle Lacey, Stability AI, Applied Research;
(4) Andreas Blattmann, Stability AI, Applied Research;
(5) Tim Dockhorn, Stability AI, Applied Research;
(6) Jonas Müller, Stability AI, Applied Research;
(7) Joe Penna, Stability AI, Applied Research;
(8) Robin Rombach, Stability AI, Applied Research.
Starting with the seminal works of Ho et al. [14] and Song et al. [47], which demonstrated that diffusion models (DMs) are powerful generative models for image synthesis, the convolutional UNet [39] has been the dominant architecture for diffusion-based image synthesis. However, with the development of foundational DMs [40, 37, 38], the underlying architecture has constantly evolved: from the addition of self-attention and improved upscaling layers [5], through cross-attention for text-to-image synthesis [38], to pure transformer-based architectures [33].
We follow this trend and, as in Hoogeboom et al. [16], shift the bulk of the transformer computation to lower-level features in the UNet. In particular, and in contrast to the original Stable Diffusion architecture, we use a heterogeneous distribution of transformer blocks within the UNet: for efficiency reasons, we omit the transformer block at the highest feature level, use 2 and 10 blocks at the lower levels, and remove the lowest level (8× downsampling) from the UNet altogether; see Tab. 1 for a comparison between the architectures of Stable Diffusion 1.x & 2.x and SDXL.

We also opt for a more powerful pre-trained text encoder for text conditioning. Specifically, we use OpenCLIP ViT-bigG [19] in combination with CLIP ViT-L [34], where we concatenate the penultimate text encoder outputs along the channel axis [1]. Besides using cross-attention layers to condition the model on the text input, we follow [30] and additionally condition the model on the pooled text embedding from the OpenCLIP model. These changes result in a UNet size of 2.6B parameters (see Tab. 1); the text encoders have a total size of 817M parameters.
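To make the text-conditioning path concrete, below is a minimal sketch (not the official SDXL implementation) of how the two text encoders can be combined. The class name, encoder interfaces, and hidden sizes are illustrative assumptions; only the wiring follows the description above: the penultimate hidden states of CLIP ViT-L and OpenCLIP ViT-bigG are concatenated along the channel axis to form the cross-attention context, and the pooled OpenCLIP embedding is used as an additional conditioning signal.

```python
# Illustrative sketch of SDXL-style dual text-encoder conditioning.
# Encoder modules are assumed to be provided and to return
# (penultimate_hidden_states, pooled_embedding); names and dims are assumptions.
import torch
import torch.nn as nn


class DualTextConditioner(nn.Module):
    def __init__(self, clip_l: nn.Module, open_clip_bigG: nn.Module):
        super().__init__()
        self.clip_l = clip_l                # CLIP ViT-L text encoder (assumed hidden size 768)
        self.open_clip_bigG = open_clip_bigG  # OpenCLIP ViT-bigG text encoder (assumed hidden size 1280)

    def forward(self, tokens_l: torch.Tensor, tokens_bigG: torch.Tensor):
        # Assumed encoder outputs: hidden states [B, seq_len, dim], pooled [B, dim]
        h_l, _ = self.clip_l(tokens_l)                    # [B, 77, 768]
        h_g, pooled_g = self.open_clip_bigG(tokens_bigG)  # [B, 77, 1280], [B, 1280]

        # Concatenate the penultimate outputs along the channel axis
        # -> [B, 77, 2048]; this serves as the cross-attention context.
        context = torch.cat([h_l, h_g], dim=-1)

        # The pooled OpenCLIP embedding is passed to the UNet as an
        # additional (non-cross-attention) conditioning signal.
        return context, pooled_g
```

The resulting context tensor feeds the cross-attention layers of the UNet, which, per Tab. 1, places its transformer blocks only at the two lower feature levels (0, 2, and 10 blocks per level in SDXL, versus a uniform distribution in Stable Diffusion 1.x & 2.x).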
This paper is available on arXiv under the CC BY 4.0 DEED license.