
Accelerating Diffusion Models with TheStage AI: A Case Study of Recraft's 20B and Red Panda models

by TheStage AI, November 26th, 2024

Too Long; Didn't Read

This article explores the acceleration of Recraft's text-to-image diffusion models using TheStage AI's Python tools. We examine the challenges of optimizing proprietary models and developing efficient pipelines for PyTorch inference optimization.


Recraft AI is a design tool powered by proprietary diffusion models. Their new model, Red Panda, outperforms all existing text-to-image models, including Midjourney, DALL-E 3, and FLUX. Recraft combines a unique user experience for designers with cutting-edge AI tools. To support millions of users, diffusion models require robust inference infrastructure that merges powerful hardware with efficient software. In this article, we'll explore how TheStage AI acceleration tools helped Recraft's AI engineers and researchers achieve up to 2x performance on Nvidia GPUs through an intuitive Python interface!

Introduction

Diffusion models have shown extraordinary results in recent years for content generation, including images, music, videos, and 3D meshes. These models spend extra compute at inference time to iteratively improve generation results, slightly updating the output at each step. We can now see an analogy in LLMs, which use multi-step reasoning at inference time to provide higher-quality answers.


At TheStage AI we are building a general mathematical framework for arbitrary PyTorch models that handles the complicated model acceleration flow fully automatically. Our system detects the optimizations available on your hardware (quantization, sparsification) and selects the proper algorithm for each layer, either achieving the best quality under desired model size and latency constraints or finding the best acceleration under quality constraints. It's a hard mathematical problem, and we can solve it in a highly efficient way! This article explores how we apply these tools through our partnership with Recraft AI.


When designing our tools, we decided to respect the following principles:


  • Hardware customization. High-quality AI products already have their preferred infrastructure
  • Quality preservation. High-quality AI products cannot accept quality degradation
  • Privacy. High-quality AI products want to keep their technologies confidential and work with tools on their own infrastructure
  • Arbitrary DNNs. High-quality AI products may use in-house architectures that public acceleration tools, built around open-source models, cannot handle correctly.
  • PyTorch. The most popular and convenient framework for many AI engineers.


Given these initial conditions, we aimed to create tools with the following features (a hypothetical usage sketch follows the list):


  • Controllable acceleration. We frame inference optimization as a business optimization problem, allowing customers to specify their desired model size, latency, or quality for their data.
  • Simple compilation. Compiling the produced models for efficient hardware execution requires just a single line of code. We also provide a simple interface to handle graph breaks.
  • Fast cold start. To achieve the fastest cold start possible, we enable saving of compiled models. This is why we don't use JIT compilers.
  • Simple deployment. Deploying the optimized model should be as straightforward as deploying the original one.
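
To make this concrete, here is a hypothetical sketch of what such a workflow could look like in Python. The module and function names (analyze, optimize, compile_model) are illustrative placeholders, not TheStage AI's actual API; the point is the shape of the flow: profile once, state a business constraint, compile in a single line, and save the compiled artifact for a fast cold start.

```python
import torch
import torch.nn as nn

# Hypothetical SDK -- module and function names are placeholders, not TheStage AI's real API.
from hypothetical_anna_sdk import analyze, optimize, compile_model

model = nn.Sequential(nn.Conv2d(3, 64, 3), nn.SiLU(), nn.Conv2d(64, 3, 3)).eval().cuda()
example_input = torch.randn(1, 3, 512, 512, device="cuda")

profile = analyze(model, example_input)              # collect per-layer statistics
fast_model = optimize(profile, max_latency_ms=15.0)  # constraint-driven, not flag-driven
compiled = compile_model(fast_model, device="cuda")  # "single line of code" compilation
compiled.save("model.accelerated")                   # saved artifact => fast cold start
```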


Text-to-Image Diffusion Models

A simple visualization of the denoising diffusion process.


In each iteration of the diffusion process, a neural network denoises the image in the latent space of a Variational AutoEncoder. The newly obtained image is then mixed with noise again, but with progressively less weight. During the initial iterations, the diffusion model sketches the main scene, leveraging the significant noise weight to make substantial updates. In later iterations, it refines high-frequency details. This observation allows us to design specific acceleration pipelines by strategically allocating network capacity across layers from iteration to iteration while preserving quality. However, such allocation requires specialized tools that combine mathematical insights with sound engineering, and this is where TheStage AI can significantly help!
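
As a rough illustration of this iterative structure, here is a minimal sketch of a generic DDPM-style sampling loop in PyTorch. It assumes a trained noise-prediction network eps_model and a precomputed 1-D noise schedule betas; it is not Recraft's pipeline, just the standard denoising recursion that any acceleration scheme must preserve.

```python
import torch

@torch.no_grad()
def ddpm_sample(eps_model, shape, betas, device="cuda"):
    """Generic DDPM sampling: start from pure noise and denoise step by step."""
    alphas = 1.0 - betas
    alphas_cumprod = torch.cumprod(alphas, dim=0)

    x = torch.randn(shape, device=device)                    # pure noise
    for t in reversed(range(len(betas))):
        t_batch = torch.full((shape[0],), t, device=device, dtype=torch.long)
        eps = eps_model(x, t_batch)                           # network predicts the noise
        # Posterior mean: early steps sketch the scene, later steps refine details.
        coef = betas[t] / torch.sqrt(1.0 - alphas_cumprod[t])
        mean = (x - coef * eps) / torch.sqrt(alphas[t])
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise               # mix noise back with decreasing weight
    return x
```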


When we gaze at clouds, focusing on specific shapes, our minds can discern random forms that resemble our thoughts. This phenomenon illustrates how our brains identify patterns in noise, finding elements that align with our mental context. Similarly, diffusion models employ this concept during their initial iteration, selecting patterns from noise to create a preliminary sketch of the desired image.


Diffusion Models Acceleration and Compression

Accelerating diffusion models can be viewed as accelerating arbitrary DNNs, but we need to account for specific challenges. For instance, static quantization, which typically provides significant acceleration, introduces a challenge in diffusion models because activation distributions change from iteration to iteration. To address this, we either need to properly estimate quantization parameters that work across all iterations or use a different quantization setup for each iteration.
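
A minimal sketch of the second option, assuming simple symmetric int8 quantization: calibrate a separate activation scale for every diffusion step instead of a single global one. This is generic calibration logic, not TheStage AI's actual quantization code.

```python
import torch

def calibrate_per_step_scales(latents_per_step):
    """Estimate one symmetric int8 activation scale per diffusion step.

    latents_per_step: list where element t holds calibration activations
    observed at diffusion step t (their statistics shift from step to step).
    """
    scales = []
    for step_activations in latents_per_step:
        max_abs = max(float(x.detach().abs().max()) for x in step_activations)
        scales.append(max(max_abs / 127.0, 1e-8))   # per-step int8 scale
    return scales

def fake_quantize(x, scale):
    """Simulate int8 quantization of an activation tensor with a given scale."""
    return (x / scale).round().clamp(-127, 127) * scale
```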


Diffusion models are challenging to train to high performance. Nevertheless, the results demonstrated by the Recraft team outperform all modern text-to-image models. Validating the quality degradation of such models can be difficult, making it crucial to use acceleration techniques that preserve the original feature semantics. Quantization algorithms can be a good choice if they can handle the challenge of varying activation distributions. Let's take a look at our automatic pipelines, which we describe in the following sections.

Profiling

Profiling a given model on specific data allows us to (a minimal profiling sketch follows the list):


  • Determine the size of each parameter
  • Identify applicable quantization, sparsification, pruning algorithms for each basic block
  • Estimate latency for individual blocks with different memory layouts
  • Compile all collected information for ANNA (Automated NNs Accelerator)
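
Below is a minimal sketch of the kind of information such a profiling pass collects, using only standard PyTorch: per-layer parameter sizes and a rough per-module forward latency measured with hooks. TheStage AI's actual profiler gathers much richer data (applicable algorithms per block, memory layouts); this only illustrates the idea.

```python
import time
import torch
import torch.nn as nn

def profile_model(model: nn.Module, example_input: torch.Tensor, warmup: int = 3):
    """Collect per-leaf-module parameter size (MB) and rough forward latency (ms)."""
    stats, handles = {}, []

    def pre_hook(name):
        def hook(module, inputs):
            if torch.cuda.is_available():
                torch.cuda.synchronize()
            stats[name]["start"] = time.perf_counter()
        return hook

    def post_hook(name):
        def hook(module, inputs, output):
            if torch.cuda.is_available():
                torch.cuda.synchronize()
            stats[name]["latency_ms"] = (time.perf_counter() - stats[name]["start"]) * 1e3
        return hook

    for name, module in model.named_modules():
        if name == "" or list(module.children()):
            continue  # only profile leaf modules
        size_bytes = sum(p.numel() * p.element_size() for p in module.parameters())
        stats[name] = {"param_mb": size_bytes / 2**20}
        handles.append(module.register_forward_pre_hook(pre_hook(name)))
        handles.append(module.register_forward_hook(post_hook(name)))

    with torch.no_grad():
        for _ in range(warmup + 1):
            model(example_input)   # timings from the last run stay in `stats`

    for h in handles:
        h.remove()
    return stats
```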

Automatic Compression & Acceleration

After the profiler has collected all necessary data, we can start our ANNA board and move the slider to produce different optimized model versions. Our users can then select the best candidates based on the quality vs. inference cost trade-off. Our tools handle these subjective quality decisions in a simple way.


TheStage AI ANNA. Move the slider to adjust model size or latency with minimum quality degradation!

Operations Fusion and Compilation

As mentioned earlier, we don't use JIT compilation because it increases the cold start time of a new node. We also don't use off-the-shelf compilers. Instead, we compile our own complex accelerated configuration that can mix different algorithms. To achieve this, we've developed our own internal protocol to define DNN-accelerated layers in a hardware-agnostic way. One of the key benefits of TheStage AI acceleration framework is that a single checkpoint can be compiled for a wide range of hardware, solving cross-platform compatibility for AI software. This feature will be particularly important for edge device deployment in application development.


The goals of DNN compilers are to:


  • Graph Simplification. Simplify the execution graph through mathematical operation fusion, reducing inference time
  • Memory Management. Calculate the memory required for each operation and manage allocation scheduling with efficient memory reuse
  • Optimal Implementation. Profile the optimal implementation for each basic operation—a challenging task, as the best implementation may require specific memory layouts, leading to analysis of interlayer connections
  • Operations Scheduling. Create an operations schedule for the optimized execution graph
  • Serialization. Save all this information to avoid recompiling the model in subsequent runs


Sequential operations can be combined into a fused kernel. Instead of copying the first operation's output to global memory, the fused kernel evaluates the second operation directly in registers or local memory. This significantly speeds up inference because memory transfers often take longer than the actual computations. However, not all operation sequences can be fused; some are incompatible with fusion entirely. For element-wise operations, fused kernels can be generated automatically. Nvidia's NVFuser tool, for example, can generate kernels for any sequence of element-wise operations.
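
As a small illustration of element-wise fusion with off-the-shelf tooling (torch.compile here, rather than our own compiler), the three pointwise operations below can be lowered into a single generated kernel, so intermediate results never round-trip through global memory. Assumes a CUDA-capable machine.

```python
import torch
import torch.nn.functional as F

def bias_gelu_scale(x, bias, scale):
    # Three element-wise operations: add, GELU, multiply.
    # In eager mode each reads/writes global memory; fused, intermediates stay in registers.
    return F.gelu(x + bias) * scale

fused = torch.compile(bias_gelu_scale)

x = torch.randn(4096, 4096, device="cuda")
bias = torch.randn(4096, device="cuda")
out = fused(x, bias, 0.5)   # first call compiles; later calls reuse the fused kernel
```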

Deployment and Serving

Inference servers and auto-scaling pipelines play an important role in the cost-effective and efficient processing of incoming requests. These pipelines can also include specific request grouping and statistics collection to set up predictive scaling for auto-scalers. In our future articles, we will discuss efficient inference servers in more detail!

Results

Applying the full pipeline, we can achieve performance that is better than the PyTorch compiler (torch.compile) and, of course, significantly better than float16 eager PyTorch execution. Moreover, because the PyTorch compiler uses a JIT compilation approach on each model initialization, it requires recompilation for many input sizes, which makes the cold start too long for practical applications where latency is highly important.
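
This cold-start cost is easy to observe directly. The sketch below (a stand-in model, not Recraft's) times a torch.compile'd module: the first call pays the compilation penalty, repeated calls with the same shape are fast, and a new input shape may trigger recompilation. It assumes a CUDA GPU is available.

```python
import time
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024)).cuda()
compiled = torch.compile(model)

def timed_call(batch_size):
    x = torch.randn(batch_size, 1024, device="cuda")
    torch.cuda.synchronize()
    start = time.perf_counter()
    compiled(x)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) * 1e3

print(f"first call (compiles):     {timed_call(8):8.1f} ms")
print(f"same shape (cached):       {timed_call(8):8.1f} ms")
print(f"new shape (may recompile): {timed_call(16):8.1f} ms")
```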


Business Benefits

Here are the key business benefits of the resulting acceleration for Recraft's product:


  • Lower infrastructure costs by serving twice as many users with the same hardware
  • Improved user experience with faster image generation response times
  • Ability to serve more concurrent users during peak loads
  • Competitive advantage through faster service delivery


TheStage AI optimisation tools allow us to speed up our text-to-image models without quality degradation, creating a better user experience for our customers.


Anna Veronika Dorogush, CEO of Recraft

Acknowledgements

These results provide excellent validation of our tools and research on high-scale workload products. TheStage AI team continues to work toward delivering even greater performance. To achieve this, we're collaborating with outstanding partners! We are deeply grateful to:


  • Recraft CEO Anna Veronika for the fruitful cooperation. We're thrilled to be even a small part of their great journey in delivering the best design tools.
  • Recraft Head of AI Pavel Ostyakov for his expertise in DNNs, strong feedback on tools, and for setting challenging goals for our cooperation project.
  • The Recraft AI team for building this great product. Images in this article were generated with Recraft!
  • The Nebius team for their consistent support with excellent GPU infrastructure for our research.

Contacts / Resources

Feel free to connect with us with any questions! We can help you reduce your inference infrastructure costs!

Our email: [email protected]

TheStage AI main page: thestage.ai

TheStage AI inference optimization platform: app.thestage.ai