How I Cut AI Video Generation Costs by 8x

Written by igorshadurin | Published 2026/01/29
Tech Story Tags: ai-image-generation | ai-video-generation | qwen-image-edit | ai-video-generation-costs | cost-efficient-ai | open-source-ai-models | gen-ai-optimization | self-hosted-ai-models

TL;DR: By combining multi-scene image generation, open-source image models, and low-cost LLM prompt optimization, this project cut AI video generation costs by over 7x without sacrificing quality.

A few months ago, I was working on a pet project called YumCut, an end-to-end service for creating short vertical videos: writing the script, generating images, editing, and adding subtitles.


A critical problem surfaced quickly: cost. One minute of video required around twenty generated images at roughly $0.04 each, or about $0.80 per minute. Beyond visuals, audio generation added another $0.20 per minute, along with minor additional costs for editing and subtitle generation.


I started looking for a way out. This article explores the unconventional techniques that helped reduce those costs several times over, as well as an open-source solution that makes it possible to generate images up to eight times cheaper than commercial APIs.

Full code and instructions are available on GitHub.

First approach: multiple scenes in one frame

The logical solution seemed obvious: generate several images in a single request by placing scenes next to each other. In theory, this should reduce costs proportionally to the number of images.


The first attempt was to put all eight scenes into the prompt at once. The result was disastrous: the model simply mixed all elements into one blurry composition, unusable for video editing.


By reducing the number of scenes to two per image, I got an acceptable result. That already cut the cost in half, but it was still far from the target.
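
To make the idea concrete, here is a minimal sketch of that packing step, assuming Pillow for cropping; the prompt wording and function names are illustrative, not the actual YumCut code.

```python
# Minimal sketch (illustrative, not the YumCut code): pack two scene
# descriptions into one prompt, then crop the generated image back into
# two separate frames. Assumes Pillow; the prompt wording is an assumption.
from PIL import Image

def build_two_scene_prompt(scene_a: str, scene_b: str) -> str:
    return (
        "One image split into a left half and a right half. "
        f"Left half: {scene_a}. Right half: {scene_b}. "
        "Keep the two halves visually independent."
    )

def split_halves(combined: Image.Image) -> tuple[Image.Image, Image.Image]:
    width, height = combined.size
    left = combined.crop((0, 0, width // 2, height))
    right = combined.crop((width // 2, 0, width, height))
    return left, right
```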

Key insight: borders must be literally visible

It turned out that AI models struggle to determine logical boundaries between separate areas. The solution was simple: use colored zones (red and blue).

Instead of an abstract description, I started sending a PNG template with clear borders and a matching instruction: β€œThe first idea is in the red area, the second idea is in the blue area. Fill each area completely.”


This technique did not require additional costs, but it dramatically improved the quality of scene separation. The model now understood the structure and rarely mixed elements.
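
Below is a minimal sketch of how such a template can be produced with Pillow; the canvas size and exact layout are assumptions, and only the instruction text comes from the approach described above.

```python
# Minimal sketch: draw a red/blue two-zone PNG template to send along with
# the instruction. Canvas size and layout are assumptions.
from PIL import Image, ImageDraw

def make_zone_template(width: int = 2048, height: int = 1024) -> Image.Image:
    template = Image.new("RGB", (width, height))
    draw = ImageDraw.Draw(template)
    draw.rectangle((0, 0, width // 2, height), fill=(255, 0, 0))      # red zone
    draw.rectangle((width // 2, 0, width, height), fill=(0, 0, 255))  # blue zone
    return template

INSTRUCTION = (
    "The first idea is in the red area, the second idea is in the blue area. "
    "Fill each area completely."
)

make_zone_template().save("zone_template.png")
```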


However, this was only a partial solution; I still needed to cut the price by roughly another order of magnitude.

Second approach: migrating to open-source alternatives

The idea was to find open-source/open-weight image-generation models, run them in the cloud, and reduce costs that way.

First, I had to identify which models were freely available. There are many: Qwen-Image, FLUX, HunyuanImage, Stable Diffusion, and others. For my use case, I had one extra requirement: the ability to reuse characters across many generated images. That is why I chose Qwen-Image-Edit.


I audited the market of commercial generators:

  • Major APIs (OpenAI, Google Gemini, Stability AI): similar prices or higher
  • Alibaba cloud services: about $0.04 per image, roughly what I was already paying
  • Self-hosted GPU rentals such as RunPod: you need a large batch of images per run before the rented GPU time pays off


The picture was disappointing: even Alibaba, the creators of Qwen-Image, were offering the model at inflated prices. But then I found runware.ai and together.ai, where generating images with Qwen-Image-Edit and Qwen-Image was almost eight times cheaper than Nano Banana: roughly $0.005 vs. $0.04 per image.
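
For reference, here is a hedged sketch of what a call to one of these hosted endpoints can look like. The URL, model id, parameters, and response shape are assumptions modeled on an OpenAI-style images API; check the runware.ai or together.ai docs for the exact contract.

```python
# Sketch of calling a hosted Qwen-Image endpoint. Endpoint, model id, and
# response format are assumptions; consult the provider docs before use.
import os
import requests

API_URL = "https://api.together.xyz/v1/images/generations"  # assumed endpoint

payload = {
    "model": "Qwen/Qwen-Image",  # hypothetical model id
    "prompt": "A happy cat playing on a sunny beach, cinematic lighting",
    "width": 1024,
    "height": 1024,
    "n": 1,
}

resp = requests.post(
    API_URL,
    headers={"Authorization": f"Bearer {os.environ['TOGETHER_API_KEY']}"},
    json=payload,
    timeout=120,
)
resp.raise_for_status()
print(resp.json())  # typically a URL or base64 payload per generated image
```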

Third approach: improving image detail

As it turned out, the cheaper model came with a trade-off: it produced more uniform images, and all scenes started looking too similar to each other.


Here are examples of images generated from different prompts about a happy cat on the beach:


Even though the images look good, they lack variety because the raw story prompts are too plain. The obvious solution is to add a layer between the story prompt and the image generator, but it has to be an LLM cheap enough not to eat into the savings. So I knew I would need to test many LLMs before settling on the one that worked best for me.


To do so, I used openrouter.ai. Once you write the wrapper code, you can switch to any available model. After testing a dozen models, I settled on openai/gpt-oss-120b with low reasoning effort. Each improved image-generation prompt costs about $0.0003, and the images above turn into results like these.
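
As an illustration, here is a minimal sketch of that prompt-improvement layer through OpenRouter's OpenAI-compatible endpoint. The system prompt is a stand-in for the real one, and the `reasoning` field is OpenRouter's request extension, so verify it against their current docs.

```python
# Sketch of the prompt-improvement layer via OpenRouter. The system prompt is
# illustrative; the `reasoning` extension should be checked against the docs.
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="YOUR_OPENROUTER_KEY",
)

def improve_image_prompt(scene: str) -> str:
    response = client.chat.completions.create(
        model="openai/gpt-oss-120b",
        messages=[
            {
                "role": "system",
                "content": (
                    "Rewrite the scene as a detailed image-generation prompt: "
                    "composition, lighting, colors, mood. One paragraph, no preamble."
                ),
            },
            {"role": "user", "content": scene},
        ],
        extra_body={"reasoning": {"effort": "low"}},  # OpenRouter extension
    )
    return response.choices[0].message.content

print(improve_image_prompt("A happy cat on the beach"))
```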


The images became more diverse, even with an almost identical input prompt. GPT OSS improved the description passed to Qwen-Image, and it also gave me a lever for controlling the style and mood of the images.

Results: numbers that speak for themselves



Metric                                        | Before | After   | Savings
----------------------------------------------|--------|---------|--------
Cost per image                                | $0.04  | $0.0053 | 7.54x
Images for a 1-minute video (20 images)       | $0.80  | $0.106  | 7.54x
Full 1-minute video (images + audio)          | $1.00  | $0.306  | 3.26x

By combining the two approaches (two colored zones per generated image on the cheap model), the price drops by another factor of two, but the split images occasionally show artifacts because each frame has to be cropped out of a shared composition.
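
For a rough sense of the combined savings, here is the back-of-the-envelope arithmetic using the figures from the table above; these are illustrative averages, not measured billing data.

```python
# Illustrative arithmetic only, based on the averages quoted in this article.
cost_per_generation = 0.0053       # one image call via the cheap API
scenes_per_generation = 2          # red/blue zone trick packs two scenes
cost_per_scene = cost_per_generation / scenes_per_generation   # ~$0.0027
images_per_minute = 20

print(f"image cost per minute of video: ~${cost_per_scene * images_per_minute:.3f}")  # ~$0.053
```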


This price was good enough for me, so I stopped at this result.



Written by igorshadurin | πŸš€ Web3 Enthusiast πŸ”§ Web3 Engineer πŸŽ₯ Youtuber
Published by HackerNoon on 2026/01/29