The Creator’s Shortcut to AI Video with Sound: LTX-2, Distilled and Deployable

Written by aimodels44 | Published 2026/02/03
Tech Story Tags: artificial-intelligence | technology | lightricks | ltx-2-model | lightricks-ltx-2 | text-to-video-with-audio | dit-video-model | run-locally-ai-video

TL;DR: A practical LTX-2 guide: inputs/outputs, text-to-video and image-to-video, the 19B and 8-step variants, BF16 training, LoRA tips, and how to get cleaner results with proper dimensions.

This is a simplified guide to LTX-2, an AI model maintained by Lightricks. If you like this kind of analysis, join AIModels.fyi or follow us on Twitter.

Model overview

LTX-2 is a DiT-based audio-video foundation model developed by Lightricks that generates synchronized video and audio within a single model. It is a notable step forward for video generation, combining the core building blocks of modern video synthesis with open weights and support for local execution. The model comes in multiple variants to suit different use cases, from the full 19-billion-parameter model in various quantization formats to a distilled 8-step version for faster inference. Earlier work from the same team, LTX-Video and LTX-Video 0.9.7 Distilled, demonstrated real-time video generation; LTX-2 extends that foundation with synchronized audio generation.

Model inputs and outputs

LTX-2 accepts text prompts and images as inputs and generates high-quality video content with accompanying audio. The model has specific resolution and frame-count requirements that must be met for clean output.

Inputs

  • Text prompts describing the desired video content
  • Images for image-to-video generation workflows
  • Resolution settings (width and height must be divisible by 32)
  • Frame count (must be one more than a multiple of 8, i.e., 8k + 1 frames; see the padding sketch after the Outputs list)

Outputs

  • Video files with synchronized audio and visual content
  • Adjustable frame rates through temporal upscaling options
  • Variable resolutions through spatial upscaling capabilities
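The divisibility rules above are easy to satisfy by rounding up before you call the model. Below is a minimal sketch of that padding step; the helper name is illustrative and not part of the LTX-2 API.

```python
# Illustrative helper (not part of the LTX-2 API): snap a requested
# resolution and frame count to values LTX-2 accepts.
def snap_to_valid(width: int, height: int, num_frames: int) -> tuple[int, int, int]:
    # Width and height must be divisible by 32; round up to the next multiple.
    width = ((width + 31) // 32) * 32
    height = ((height + 31) // 32) * 32
    # Frame count must be one more than a multiple of 8 (frames = 8k + 1).
    num_frames = ((max(num_frames - 1, 0) + 7) // 8) * 8 + 1
    return width, height, num_frames


print(snap_to_valid(1280, 720, 120))  # -> (1280, 736, 121)
```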

Capabilities

The model generates videos with realistic motion and synchronized audio in a single unified process. It supports both text-to-video and image-to-video workflows, meaning you can either describe what you want to create or provide a starting image and let the model continue the narrative. The base model is trainable in bf16 precision, allowing fine-tuning for specific styles, motion patterns, or audio characteristics. Training for custom motion, style, or likeness can take less than an hour in many settings. Multiple upscaler modules are available to increase spatial resolution by 2x or temporal resolution (frame rate) by 2x, enabling multi-stage generation pipelines for higher quality outputs.
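Since the base model is trainable in bf16 and accepts LoRA adapters, a fine-tuning setup might start like the sketch below. This is only a sketch: the checkpoint id "Lightricks/LTX-2" is a placeholder, and the attention projection names (to_q, to_k, to_v, to_out.0) follow the common diffusers convention and may differ in the actual LTX-2 transformer, so check Lightricks' released training scripts for the authoritative setup.

```python
import torch
from diffusers import DiffusionPipeline
from peft import LoraConfig, get_peft_model

# Placeholder checkpoint id; substitute the actual LTX-2 repository name.
pipe = DiffusionPipeline.from_pretrained("Lightricks/LTX-2", torch_dtype=torch.bfloat16)

# LoRA over the attention projections of the DiT backbone.
# Module names are assumed from the common diffusers convention.
lora_config = LoraConfig(r=16, lora_alpha=16,
                         target_modules=["to_q", "to_k", "to_v", "to_out.0"])
transformer = get_peft_model(pipe.transformer, lora_config)
transformer.print_trainable_parameters()  # only the LoRA weights require gradients
```

From here, a standard training loop over your clips optimizes only the adapter weights, which is why style or likeness customization can finish quickly.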

What can I use it for?

LTX-2 is suitable for creators needing synchronized video and audio generation across various domains. Content creators can use it for social media videos, promotional materials, or short-form creative content. The LTX-Studio interface provides immediate access via web browsers for text-to-video and image-to-video workflows. Developers can integrate the model into applications through the PyTorch codebase or the Diffusers Python library. The trainable base model enables businesses to create custom versions for branded content, maintaining specific visual styles or audio signatures. ComfyUI integration through built-in LTXVideo nodes makes it accessible for creators using node-based workflows.
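For the Diffusers route, a text-to-video call might look like the following sketch. The checkpoint id is again a placeholder, and the call signature mirrors the earlier LTX-Video pipelines; retrieving the audio track is not shown because its output interface may differ, so consult the official examples for that part.

```python
import torch
from diffusers import DiffusionPipeline
from diffusers.utils import export_to_video

# Placeholder checkpoint id; DiffusionPipeline resolves the concrete class from the repo.
pipe = DiffusionPipeline.from_pretrained("Lightricks/LTX-2", torch_dtype=torch.bfloat16)
pipe.to("cuda")

result = pipe(
    prompt="A slow dolly shot through a rain-soaked neon alley, ambient city sound",
    width=1216,        # divisible by 32
    height=704,        # divisible by 32
    num_frames=121,    # 8 * 15 + 1
)
export_to_video(result.frames[0], "alley.mp4", fps=24)
```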

Things to try

Experiment with detailed prompts that specify camera angles, lighting conditions, and character appearances to improve video quality and consistency. The distilled 8-step model offers significant speed improvements when a CFG scale of 1.0 is acceptable for your use case, though it trades some quality for generation speed. Test LoRA training on the distilled model to customize audio generation for speech scenarios, since the model produces lower-quality audio when speech is absent. Multi-stage generation using the spatial and temporal upscalers can produce higher-resolution outputs than a single pass. Finally, padding input dimensions to valid values (width and height divisible by 32, frame counts of 8k + 1) rather than working with non-standard resolutions yields cleaner results without artifacts from dimension mismatches.
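To try the distilled variant, the sketch below assumes an 8-step schedule with guidance_scale=1.0 (effectively disabling CFG); the distilled checkpoint id is a placeholder, not a confirmed repository name.

```python
import torch
from diffusers import DiffusionPipeline

# Placeholder checkpoint id for the distilled variant.
pipe = DiffusionPipeline.from_pretrained(
    "Lightricks/LTX-2-distilled", torch_dtype=torch.bfloat16
).to("cuda")

result = pipe(
    prompt="A barista pours latte art in a sunlit cafe, gentle background chatter",
    width=960, height=544, num_frames=97,  # divisible by 32; 8 * 12 + 1 frames
    num_inference_steps=8,                 # distilled schedule
    guidance_scale=1.0,                    # CFG effectively off, as noted above
)
```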

