Recraft AI is a design tool powered by proprietary diffusion models. Their new model, Red Panda, outperforms all existing text-to-image models, including Midjourney, DALL-E 3, and FLUX. Recraft combines a unique user experience for designers with cutting-edge AI tools. To support millions of users, diffusion models require robust inference infrastructure that merges powerful hardware with efficient software. In this article, we'll explore how TheStage AI acceleration tools helped Recraft's AI engineers and researchers achieve up to 2x faster inference on Nvidia GPUs through an intuitive Python interface!
Diffusion models have shown extraordinary results in recent years for content generation, including images, music, videos, and 3D meshes. These models spend inference-time compute to iteratively improve generation results, slightly updating the output at each inference step. A similar pattern has now appeared in LLMs, which use reasoning through multi-step inference to provide higher-quality answers.
At TheStage AI, we are building a general mathematical framework that handles the complicated acceleration flow of arbitrary PyTorch models fully automatically. Our system detects the optimizations available on your hardware (quantization, sparsification) and selects the proper algorithm for each layer, either achieving the best quality under desired model size and latency constraints or finding the best acceleration under restricted quality constraints. It's a hard mathematical problem, which we can solve in a highly efficient way! This article explores how we apply these tools through our partnership with Recraft AI.
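To make the selection problem concrete, here is one possible formulation in our own notation (a sketch for this article, not necessarily the exact objective used internally): a binary assignment of an acceleration algorithm to every layer, under latency and size budgets.

```latex
% x_{l,a} = 1 if layer l uses algorithm a (quantization, sparsification,
% or "keep original"); q, t, s are per-choice quality degradation, latency,
% and size estimates gathered by profiling.
\begin{aligned}
\min_{x} \quad & \sum_{l}\sum_{a} q_{l,a}\, x_{l,a}
  && \text{(total quality degradation)} \\
\text{s.t.} \quad & \sum_{l}\sum_{a} t_{l,a}\, x_{l,a} \le T
  && \text{(latency budget)} \\
& \sum_{l}\sum_{a} s_{l,a}\, x_{l,a} \le S
  && \text{(model size budget)} \\
& \sum_{a} x_{l,a} = 1 \quad \forall l, \qquad x_{l,a} \in \{0, 1\}.
\end{aligned}
```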
When designing our tools, we decided to respect the following principles:
Given these initial conditions, we aimed to create tools with the following features:
In each iteration of the diffusion process, a neural network denoises the image in the latent space of a Variational AutoEncoder. The newly obtained image is then mixed with noise again, but with progressively less weight. During the initial iterations, the diffusion model sketches the main scene, leveraging the significant noise weight to make substantial updates. In later iterations, it refines high-frequency details. This observation allows us to design specific acceleration pipelines by strategically allocating network capacity across layers from iteration to iteration while preserving quality. However, such allocation requires specialized tools that combine mathematical insights with sound engineering, and this is where TheStage AI can significantly help!
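To illustrate the mechanics, here is a deliberately simplified toy sampling loop in PyTorch (not Recraft's sampler and not any specific published scheduler): the noise weight shrinks from step to step, so early steps shape the scene and late steps refine details.

```python
import torch

def sample(denoiser, scheduler_alphas, latent_shape, device="cpu"):
    """Minimal diffusion-style sampling loop: at each step the network
    predicts a denoised latent, which is re-mixed with noise using a
    progressively smaller noise weight."""
    x = torch.randn(latent_shape, device=device)  # start from pure noise
    for t, alpha in enumerate(scheduler_alphas):  # alpha grows toward 1.0
        pred = denoiser(x, t)                     # network's denoised estimate
        noise = torch.randn_like(x)
        # early steps: small alpha -> large noise weight, coarse scene layout
        # late steps:  alpha near 1 -> small noise weight, fine details
        x = alpha * pred + (1.0 - alpha) * noise
    return x  # final latent, decoded by the VAE decoder afterwards

# toy usage: identity-like "denoiser" and a linear alpha schedule
alphas = torch.linspace(0.1, 1.0, steps=20)
latent = sample(lambda x, t: x * 0.9, alphas, (1, 4, 64, 64))
```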
Accelerating diffusion models can be viewed as accelerating arbitrary DNNs, but we need to account for specific challenges. For instance, static quantization, which typically provides significant acceleration, introduces a challenge in diffusion models as activation distributions change from iteration to iteration. To address this, we either need to properly estimate optimal values for all iterations or use different quantization setups for each iteration.
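As a sketch of the second option, the hypothetical helper below (our illustration, not TheStage AI's actual API) collects activation ranges separately for each diffusion step, so that every step can receive its own static int8 scale:

```python
import torch

@torch.no_grad()
def calibrate_per_step(denoiser, calib_latents, num_steps):
    """Collect activation ranges separately for every diffusion step,
    so each step gets its own static quantization scale instead of one
    scale shared across all steps."""
    amax = [0.0] * num_steps
    for x in calib_latents:                # a few calibration latents
        for t in range(num_steps):
            act = denoiser(x, t)           # activations at step t
            amax[t] = max(amax[t], act.abs().max().item())
            x = act                        # feed forward to the next step
    # symmetric int8 scales: one per diffusion step
    return [a / 127.0 for a in amax]

# toy usage with a stand-in denoiser and random calibration latents
scales = calibrate_per_step(lambda x, t: x * 0.9,
                            [torch.randn(1, 4, 64, 64)], num_steps=20)
```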
Diffusion models are challenging to train to high performance. Nevertheless, the results demonstrated by the Recraft team outperform all modern text-to-image models. Validating the degradation of such models can be difficult, making it crucial to use acceleration techniques that preserve the original feature semantics. Quantization algorithms can be a good choice if they can handle the challenge of varying activation distributions. Let's take a look at our automatic pipelines, which we will describe in the following sections.
Profiling a given model on specific data allows us to:
After the profiler has collected all necessary data, we can start our ANNA board and move the slider to produce different optimized model versions. Our users can then select the best candidates based on the quality vs. inference cost trade-off. Our tools handle these subjective quality decisions in a simple way.
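For illustration, candidate selection on such a trade-off can be as simple as keeping the Pareto-optimal configurations; the helper below is our own toy example, not part of ANNA:

```python
def pareto_front(candidates):
    """Keep only configurations not dominated on (latency, quality loss):
    a candidate is dropped if some other candidate is at least as fast
    AND loses at least as little quality."""
    front = []
    for c in candidates:
        if not any(o["latency_ms"] <= c["latency_ms"]
                   and o["quality_loss"] <= c["quality_loss"]
                   and o != c
                   for o in candidates):
            front.append(c)
    return sorted(front, key=lambda c: c["latency_ms"])

configs = [
    {"name": "fp16",        "latency_ms": 90, "quality_loss": 0.00},
    {"name": "int8-static", "latency_ms": 55, "quality_loss": 0.01},
    {"name": "int8+sparse", "latency_ms": 48, "quality_loss": 0.05},
    {"name": "bad-mix",     "latency_ms": 60, "quality_loss": 0.04},  # dominated
]
print(pareto_front(configs))  # fp16, int8-static, int8+sparse survive
```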
As mentioned earlier, we don't use JIT compilation because it increases the cold start time of a new node. We also don't use off-the-shelf compilers. Instead, we compile our own complex accelerated configuration that can mix different algorithms. To achieve this, we've developed our own internal protocol to define DNN-accelerated layers in a hardware-agnostic way. One of the key benefits of TheStage AI's acceleration framework is that a single checkpoint can be compiled for a wide range of hardware, solving cross-platform compatibility for AI software. This feature will be particularly important for edge device deployment in application development.
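To make the idea of a hardware-agnostic layer protocol concrete, here is a purely hypothetical descriptor invented for this article (the real internal format is not public):

```python
from dataclasses import dataclass, field

@dataclass
class AcceleratedLayerSpec:
    """Hypothetical hardware-agnostic layer record: it names WHAT to
    compute and with which numeric scheme, while the per-device kernel
    choice is deferred to compilation time."""
    op: str                     # e.g. "conv2d", "attention"
    algorithm: str              # e.g. "int8-static", "fp16", "2:4-sparse"
    params: dict = field(default_factory=dict)  # scales, masks, ...

spec = [
    AcceleratedLayerSpec("conv2d", "int8-static", {"scale": 0.042}),
    AcceleratedLayerSpec("attention", "fp16"),
]
# the same spec list can then be compiled into kernels for different GPUs
```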
The goals of DNN compilers are to:
Inference servers and auto-scaling pipelines play an important role in the cost-effective and efficient processing of incoming requests. These systems can also group requests and collect statistics to set up predictive scaling for auto-scalers. In our future articles, we will discuss efficient inference servers in more detail!
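As a minimal taste of the statistics collection involved, the sketch below tracks a sliding-window request rate, the kind of signal a predictive auto-scaler could consume; it is an illustration, not a production component:

```python
import time
from collections import deque

class RequestRateTracker:
    """Sliding-window request counter: the per-second rate it reports
    can feed an auto-scaler's scale-up/down decisions."""
    def __init__(self, window_s: float = 60.0):
        self.window_s = window_s
        self.times = deque()

    def record(self) -> None:
        self.times.append(time.monotonic())

    def rate(self) -> float:
        now = time.monotonic()
        while self.times and now - self.times[0] > self.window_s:
            self.times.popleft()  # drop requests outside the window
        return len(self.times) / self.window_s

tracker = RequestRateTracker(window_s=10.0)
tracker.record()
print(f"{tracker.rate():.2f} req/s")
```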
Applying the full pipeline, we achieve performance better than the PyTorch compiler (torch.compile) and, of course, significantly better than float16 eager PyTorch execution. Moreover, because the PyTorch compiler uses a JIT compilation approach, each model initialization triggers recompilation for many input sizes, which makes the cold start too long for practical applications where latency is highly important.
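This cold-start behavior is easy to observe with stock PyTorch; the toy benchmark below times the first (compiling) call, a cached call, and a call with a new input shape that can trigger recompilation:

```python
import time
import torch

def timed(fn, *args):
    """Wall-clock one call; on GPU you would also call
    torch.cuda.synchronize() around the timer."""
    start = time.perf_counter()
    fn(*args)
    return time.perf_counter() - start

model = torch.nn.Sequential(torch.nn.Linear(512, 512), torch.nn.GELU())
compiled = torch.compile(model)

x = torch.randn(8, 512)
print("first call (JIT compile):", timed(compiled, x))  # pays compile cost
print("second call (cached):    ", timed(compiled, x))

# a new input shape can trigger recompilation and another slow call
y = torch.randn(16, 512)
print("new shape (recompile):   ", timed(compiled, y))
```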
Here are the key business benefits of the resulting acceleration for Recraft's product:
TheStage AI optimisation tools allow us to speed up our text-to-image models without quality degradation, creating a better user experience for our customers.
Anna Veronika Dorogush, CEO of Recraft
These results provide excellent validation of our tools and research on products with high-scale workloads. TheStage AI team continues to work toward delivering even greater performance. To achieve this, we're collaborating with outstanding partners! We are deeply grateful to:
Feel free to connect with us regarding any questions! We can help you reduce inference infrastructure costs!
Our email: [email protected]
TheStage AI main page: thestage.ai
TheStage AI inference optimization platform: app.thestage.ai