Authors:

(1) Dustin Podell, Stability AI, Applied Research;
(2) Zion English, Stability AI, Applied Research;
(3) Kyle Lacey, Stability AI, Applied Research;
(4) Andreas Blattmann, Stability AI, Applied Research;
(5) Tim Dockhorn, Stability AI, Applied Research;
(6) Jonas Müller, Stability AI, Applied Research;
(7) Joe Penna, Stability AI, Applied Research;
(8) Robin Rombach, Stability AI, Applied Research.

Table of Links

Abstract and 1 Introduction
2 Improving Stable Diffusion
2.1 Architecture & Scale
2.2 Micro-Conditioning
2.3 Multi-Aspect Training
2.4 Improved Autoencoder and 2.5 Putting Everything Together
3 Future Work
Appendix
A Acknowledgements
B Limitations
C Diffusion Models
D Comparison to the State of the Art
E Comparison to Midjourney v5.1
F On FID Assessment of Generative Text-Image Foundation Models
G Additional Comparison between Single- and Two-Stage SDXL pipeline
References

3 Future Work

This report presents a preliminary analysis of improvements to the foundation model Stable Diffusion for text-to-image synthesis. While we achieve significant improvements in synthesized image quality, prompt adherence, and composition, in the following we discuss a few aspects in which we believe the model may be improved further:

• Single stage: Currently, we generate the best samples from SDXL using a two-stage approach with an additional refinement model (a usage sketch of this two-stage pipeline is given after this list). This requires loading two large models into memory, hampering accessibility and sampling speed. Future work should investigate ways to provide a single stage of equal or better quality.

• Text synthesis: While the scale and the larger text encoder (OpenCLIP ViT-bigG [19]) help to improve the text rendering capabilities over previous versions of Stable Diffusion, incorporating byte-level tokenizers [52, 27] or simply scaling the model to larger sizes [53, 40] may further improve text synthesis.

• Architecture: During the exploration stage of this work, we briefly experimented with transformer-based architectures such as UViT [16] and DiT [33], but found no immediate benefit. We remain, however, optimistic that a careful hyperparameter study will eventually enable scaling to much larger transformer-dominated architectures.

• Distillation: While our improvements over the original Stable Diffusion model are significant, they come at the price of increased inference cost (both in VRAM and sampling speed). Future work will thus focus on decreasing the compute needed for inference and increasing sampling speed, for example through guidance distillation [29], knowledge distillation [6, 22, 24], and progressive distillation [41, 2, 29].

• Our model is trained in the discrete-time formulation of [14] and requires offset-noise [11, 25] for aesthetically pleasing results (a brief sketch of offset-noise follows this list). The EDM framework of Karras et al. [21] is a promising candidate for future model training, as its formulation in continuous time allows for increased sampling flexibility and does not require noise-schedule corrections.
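As context for the accessibility concern raised in the first item, the following is a minimal sketch of the two-stage (base + refiner) sampling setup, written against the Hugging Face diffusers library. The pipeline classes, model identifiers, and arguments are assumptions about that library's released SDXL support, not code from this report.

```python
# Illustrative two-stage SDXL sampling with diffusers (assumed API and model IDs).
import torch
from diffusers import StableDiffusionXLPipeline, StableDiffusionXLImg2ImgPipeline

device = "cuda"
prompt = "a photo of an astronaut riding a horse on mars"

# Stage 1: the base model generates latents (kept in latent space for the refiner).
base = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to(device)
latents = base(prompt=prompt, output_type="latent").images

# Stage 2: the refinement model denoises the base latents further (img2img-style).
refiner = StableDiffusionXLImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-refiner-1.0", torch_dtype=torch.float16
).to(device)
image = refiner(prompt=prompt, image=latents).images[0]
image.save("sdxl_two_stage.png")
```

Note that both pipelines must be resident in memory (or swapped in and out), which is exactly the cost a single-stage model of comparable quality would remove.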
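For reference on the offset-noise correction mentioned in the last item, here is a minimal sketch of how such noise is commonly formed during training, following the general idea of [11, 25]; the function name, tensor shapes, and default scale are illustrative assumptions, not the training code used for SDXL.

```python
# Minimal offset-noise sketch (hypothetical helper; `latents` is a placeholder).
import torch

def offset_noise(latents: torch.Tensor, offset_scale: float = 0.1) -> torch.Tensor:
    """Standard Gaussian noise plus a per-(sample, channel) constant offset.

    The added low-frequency component lets the model shift the overall
    brightness of a sample more freely during denoising.
    """
    noise = torch.randn_like(latents)
    offset = torch.randn(
        latents.shape[0], latents.shape[1], 1, 1,
        device=latents.device, dtype=latents.dtype,
    )
    return noise + offset_scale * offset
```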
This paper is available on arxiv under CC BY 4.0 DEED license.