
Improving Stable Diffusion Architecture

by Synthesizing, October 3rd, 2024

Too Long; Didn't Read

This section details the architectural improvements in SDXL, emphasizing a heterogeneous distribution of transformer blocks in the UNet and the use of advanced pre-trained text encoders. The resulting model features a total of 2.6 billion parameters in the UNet and 817 million in the text encoders.

Authors:

(1) Dustin Podell, Stability AI, Applied Research;

(2) Zion English, Stability AI, Applied Research;

(3) Kyle Lacey, Stability AI, Applied Research;

(4) Andreas Blattmann, Stability AI, Applied Research;

(5) Tim Dockhorn, Stability AI, Applied Research;

(6) Jonas Müller, Stability AI, Applied Research;

(7) Joe Penna, Stability AI, Applied Research;

(8) Robin Rombach, Stability AI, Applied Research.

Abstract and 1 Introduction

2 Improving Stable Diffusion

2.1 Architecture & Scale

2.2 Micro-Conditioning

2.3 Multi-Aspect Training

2.4 Improved Autoencoder and 2.5 Putting Everything Together

3 Future Work


Appendix

A Acknowledgements

B Limitations

C Diffusion Models

D Comparison to the State of the Art

E Comparison to Midjourney v5.1

F On FID Assessment of Generative Text-Image Foundation Models

G Additional Comparison between Single- and Two-Stage SDXL pipeline

References

2.1 Architecture & Scale

Starting with the seminal works of Ho et al. [14] and Song et al. [47], which demonstrated that DMs are powerful generative models for image synthesis, the convolutional UNet [39] architecture has been the dominant choice for diffusion-based image synthesis. However, with the development of foundational DMs [40, 37, 38], the underlying architecture has constantly evolved: from adding self-attention and improved upscaling layers [5], through cross-attention for text-to-image synthesis [38], to pure transformer-based architectures [33].


Table 1: Comparison of SDXL and older Stable Diffusion models.


We follow this trend and, as in Hoogeboom et al. [16], shift the bulk of the transformer computation to lower-level features in the UNet. In particular, and in contrast to the original Stable Diffusion architecture, we use a heterogeneous distribution of transformer blocks within the UNet: for efficiency reasons, we omit the transformer block at the highest feature level, use 2 and 10 blocks at the lower levels, and remove the lowest level (8× downsampling) in the UNet altogether; see Tab. 1 for a comparison between the architectures of Stable Diffusion 1.x & 2.x and SDXL.


We also opt for a more powerful pre-trained text encoder for text conditioning. Specifically, we use OpenCLIP ViT-bigG [19] in combination with CLIP ViT-L [34], concatenating the penultimate text encoder outputs along the channel axis [1]. Besides using cross-attention layers to condition the model on the text input, we follow [30] and additionally condition the model on the pooled text embedding from the OpenCLIP model. These changes result in a UNet with 2.6B parameters (see Tab. 1); the text encoders have a total size of 817M parameters.
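To make the block layout concrete, here is a minimal sketch of a UNet with the heterogeneous transformer distribution described above, written against the diffusers UNet2DConditionModel interface. The channel widths, latent resolution, and use of this particular class are illustrative assumptions; the sketch does not reproduce the exact released SDXL configuration, so the printed parameter count will not match Tab. 1 precisely.

```python
from diffusers import UNet2DConditionModel

# Illustrative sketch of the heterogeneous block layout described above:
# three feature levels (the 8x-downsampling level of SD 1.x/2.x is dropped),
# no transformer block at the highest level (plain DownBlock2D), and
# 2 / 10 transformer blocks at the two lower levels.
unet = UNet2DConditionModel(
    sample_size=128,   # latent resolution (assuming 1024 px images and an 8x autoencoder)
    in_channels=4,
    out_channels=4,
    down_block_types=("DownBlock2D", "CrossAttnDownBlock2D", "CrossAttnDownBlock2D"),
    up_block_types=("CrossAttnUpBlock2D", "CrossAttnUpBlock2D", "UpBlock2D"),
    block_out_channels=(320, 640, 1280),      # assumed channel widths
    transformer_layers_per_block=(1, 2, 10),  # first entry unused: that level has no attention block
    cross_attention_dim=2048,                 # 768 (CLIP ViT-L) + 1280 (OpenCLIP ViT-bigG)
)
print(f"UNet parameters: {sum(p.numel() for p in unet.parameters()) / 1e9:.2f}B")
```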


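The text-conditioning path can be sketched in a similar spirit. The snippet below assumes the Hugging Face transformers CLIP classes and the subfolder layout of the publicly released SDXL checkpoint (text_encoder, text_encoder_2); the repository name and the example prompt are assumptions for illustration, not details given in this section.

```python
import torch
from transformers import CLIPTextModel, CLIPTextModelWithProjection, CLIPTokenizer

repo = "stabilityai/stable-diffusion-xl-base-1.0"  # assumed public checkpoint layout

# CLIP ViT-L and OpenCLIP ViT-bigG text encoders with their tokenizers.
tok_l = CLIPTokenizer.from_pretrained(repo, subfolder="tokenizer")
tok_g = CLIPTokenizer.from_pretrained(repo, subfolder="tokenizer_2")
enc_l = CLIPTextModel.from_pretrained(repo, subfolder="text_encoder")
enc_g = CLIPTextModelWithProjection.from_pretrained(repo, subfolder="text_encoder_2")

prompt = "a photograph of an astronaut riding a horse"  # example prompt

with torch.no_grad():
    ids_l = tok_l(prompt, padding="max_length", max_length=77,
                  truncation=True, return_tensors="pt").input_ids
    ids_g = tok_g(prompt, padding="max_length", max_length=77,
                  truncation=True, return_tensors="pt").input_ids

    out_l = enc_l(ids_l, output_hidden_states=True)
    out_g = enc_g(ids_g, output_hidden_states=True)

    # Penultimate hidden states of both encoders, concatenated along the channel axis:
    # 768 (ViT-L) + 1280 (ViT-bigG) = 2048 channels for the UNet's cross-attention.
    context = torch.cat([out_l.hidden_states[-2], out_g.hidden_states[-2]], dim=-1)

    # Pooled embedding from the OpenCLIP model, used as additional conditioning.
    pooled = out_g.text_embeds

print(context.shape, pooled.shape)  # torch.Size([1, 77, 2048]) torch.Size([1, 1280])
```

Concatenating the two penultimate hidden states yields the 2048-channel context consumed by the cross-attention layers, while the pooled OpenCLIP embedding supplies the additional global conditioning signal mentioned above.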
This paper is available on arXiv under the CC BY 4.0 DEED license.