Model overview
daVinci-MagiHuman is a unified audio-video generation foundation model from GAIR that generates synchronized video and speech from text prompts. Unlike competing models that route modalities through separate processing streams, it uses a single 15-billion-parameter Transformer that handles text, video, and audio jointly through self-attention. In human evaluations spanning more than 2,000 comparisons, this architecture achieved state-of-the-art results: an 80% win rate against Ovi 1.1 and a 60.9% win rate against LTX 2.3.
Model inputs and outputs
daVinci-MagiHuman takes text descriptions and optional reference images as input, then generates synchronized video and audio outputs in a single unified process. The model produces high-quality video across multiple resolutions while maintaining natural speech-to-facial expression synchronization and realistic body movement throughout the generated content.
Inputs
- Text prompts describing the desired video content and speech
- Reference image (optional) to condition facial appearance and identity
- Language specification supporting Chinese (Mandarin and Cantonese), English, Japanese, Korean, German, and French
Outputs
- Video at resolutions ranging from 256p to 1080p with realistic human performance capture
- Audio with synchronized speech and natural prosody matching the video content
- Combined output with precisely aligned audio-video synchronization
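The model card does not document a public API, so the sketch below is purely illustrative: a hypothetical `GenerationRequest` that bundles the inputs listed above (text prompt, optional reference image, language) and validates them against the documented language list and resolution range. The language codes and the discrete resolution tiers are assumptions, not part of any official interface.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed language codes for the six documented languages
# (Chinese covers both Mandarin and Cantonese on the model card).
SUPPORTED_LANGUAGES = {"zh", "en", "ja", "ko", "de", "fr"}
# Assumed resolution tiers within the documented 256p-1080p range.
SUPPORTED_RESOLUTIONS = {256, 480, 720, 1080}

@dataclass
class GenerationRequest:
    """Hypothetical container for one text-to-audio-video generation call."""
    prompt: str
    language: str = "en"
    resolution: int = 720
    reference_image: Optional[str] = None  # optional path for identity conditioning

    def __post_init__(self) -> None:
        if self.language not in SUPPORTED_LANGUAGES:
            raise ValueError(f"unsupported language: {self.language!r}")
        if self.resolution not in SUPPORTED_RESOLUTIONS:
            raise ValueError(f"unsupported resolution: {self.resolution}p")

req = GenerationRequest(
    prompt="A newscaster greets viewers and introduces the weather.",
    language="en",
    resolution=1080,
    reference_image="anchor.png",
)
print(req.resolution)  # 1080
```

A real client would pass such a request to the model's inference entry point; here the class only demonstrates how the documented inputs fit together.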
Capabilities
The model excels at human-centric video generation, with particular strengths in facial expressiveness, natural speech-expression coordination, and realistic body motion. On a single H100 GPU it generates a 5-second 256p video in 2 seconds and a 5-second 1080p video in 38 seconds, significantly faster than competing solutions. The single-stream architecture eliminates cross-attention complexity while improving quality through unified denoising of video and audio within a shared token sequence. In speech generation the model achieves a 14.60% word error rate, lower than that of comparable systems.
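The two latency figures above can be turned into a rough planning estimate. The helper below assumes generation time scales linearly with clip length at a fixed resolution, which the source does not state, so treat the results as ballpark numbers only.

```python
# Per-second-of-output cost derived from the two quoted data points:
# 5 s of 256p video in 2 s, and 5 s of 1080p video in 38 s, on one H100.
SECONDS_PER_OUTPUT_SECOND = {256: 2 / 5, 1080: 38 / 5}

def estimate_latency(clip_seconds: float, resolution: int) -> float:
    """Estimated wall-clock seconds to generate a clip at a quoted resolution tier."""
    return SECONDS_PER_OUTPUT_SECOND[resolution] * clip_seconds

print(estimate_latency(10, 256))   # 4.0
print(estimate_latency(10, 1080))  # 76.0
```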
What can I use it for?
Content creators can generate synchronized talking head videos from simple text descriptions without complex multi-step workflows. Marketing teams can produce localized video content in six different languages and cultural contexts. Entertainment and education professionals can create character-driven narratives with natural facial performance and synchronized speech. The model's speed enables rapid iteration on video concepts, while the open-source architecture allows developers to fine-tune the system for specific use cases like virtual presenters, educational videos, or narrative content generation.
Things to try
- Experiment with detailed character descriptions to see how the model expresses specific personalities through facial expression and speech patterns.
- Test multilingual prompts to observe how the model maintains coherent expression across different language structures.
- Generate videos of the same character under varying emotional contexts to explore the relationship between text content and facial performance.
- Use reference images of specific individuals to see how identity conditioning affects the generated output while preserving natural movement.
- Compare the base model with the distilled variant to understand the tradeoff between faster generation and maximum visual fidelity.
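One way to set up the emotional-context experiment above is to sweep a fixed character description across a list of emotions. The sketch below only builds the prompt variants; the character text and emotion labels are illustrative, not from the model card.

```python
# Fixed character and speech line, held constant across all variants.
BASE_CHARACTER = (
    "A middle-aged teacher with round glasses says: 'Class begins now.'"
)
# Emotional contexts to compare; purely illustrative choices.
EMOTIONS = ["cheerful", "stern", "weary", "excited"]

# One prompt per emotion, differing only in the leading tone cue.
prompts = [f"{emotion.capitalize()} tone. {BASE_CHARACTER}" for emotion in EMOTIONS]
for p in prompts:
    print(p)
```

Feeding these prompts to the model one at a time and comparing the resulting facial performances isolates the effect of the emotional cue from everything else in the description.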
This is a simplified guide to an AI model called daVinci-MagiHuman maintained by GAIR. If you enjoy this kind of analysis, join AIModels.fyi or follow us on Twitter.
