The problem with words in video-to-audio generation
When you watch a video of a dog barking, you know immediately whether it's a tiny Chihuahua or a massive German Shepherd, even with your eyes closed. The acoustic information is unmistakable. Yet current AI systems that generate audio from videos struggle with these distinctions, trapped by a fundamental limitation: they rely on text descriptions to specify what sounds they should create.
The bottleneck is deceptively simple: text is an imprecise tool for describing acoustic properties. Try describing a Chihuahua's bark to someone over the phone. Two people might interpret the same words completely differently: one imagines a high-pitched yelp, another a deeper, raspier bark. Now scale this to machine learning: existing video-to-audio systems use text prompts as the primary control mechanism, but training data introduces semantic granularity gaps. A dataset might label dozens of acoustically distinct sounds under a single coarse category like "dog bark," losing the subtle differences that distinguish a Chihuahua from a St. Bernard.
The ceiling on audio quality and fine-grained control that text alone can reach is remarkably low. How do you put timbre into words? How do you describe the specific texture of a footstep on gravel? What does "crispy" mean for a sound? Text collapses acoustic space into discrete categories, but acoustic reality is continuous and subtle. This gap between what text can express and what audio needs to capture creates a hard limit on synthesis precision that no amount of prompt engineering can overcome.
Why reference audio solves what text cannot
Here's where the insight clicks into place: if someone handed you a 2-second audio clip of a Chihuahua bark and asked you to make it match a video of a dog, you could do it perfectly without ever using the word "Chihuahua." The audio clip is the specification. It contains all the acoustic information needed. The system doesn't need to interpret "high-pitched" or "small dog." It has the ground truth.
This reframes the problem entirely. Instead of trying to make text descriptions more precise, AC-Foley asks a different question: what if we condition directly on reference audio? The answer is that reference audio contains exactly what you want to transfer: timbre, pitch characteristics, texture, temporal dynamics. There's no semantic gap between specification and reality, because the reference is what you want.
This shift enables something powerful. With audio conditioning, the model can learn acoustic transfer rather than just sound classification. You can use any reference audio without needing it to exist in the training data or even relate to the video semantically. Pair a dog video with reference audio of wind chimes and the model generates a wind-chime-like dog bark. The acoustic character transfers while the semantic content from the video remains intact. This is zero-shot generation: use any reference audio and the model synthesizes in that acoustic style, even for sounds never seen during training.
AC-Foley for conditional Foley generation with audio controls. The top row shows fine-grained sound synthesis where the same dog video produces different sounds based on reference audio: a Chihuahua's bark versus a large dog's bark. The bottom section demonstrates timbre transfer and other audio-guided capabilities.
AC-Foley generates precise audio from silent video based on reference sounds, enabling fine-grained synthesis and timbre transfer
How AC-Foley actually works
The architecture is elegant in its logic. Think of the model as a multimodal translator watching three information streams simultaneously. The video tells the system when and what kind of activity is happening. The reference audio tells it how it should sound. Text descriptions provide a general semantic anchor. These three streams flow into a transformer network that learns to blend them, extracting the acoustic character from the reference while respecting the visual timing and content from the video.
The video provides temporal structure and semantic context. Reference audio provides acoustic properties. The key insight is that these can be embedded into the same latent space, giving the model a concrete target to match. The transformer learns to extract and transfer acoustic characteristics from the reference while aligning to video semantics.
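To make the "same latent space" idea concrete, here is a minimal sketch of that conditioning step. All feature dimensions, token counts, and the projection-then-concatenate design are illustrative assumptions, not details from the paper: each modality's features are projected into one shared dimensionality and joined into a single token sequence that a transformer could then process.

```python
import numpy as np

rng = np.random.default_rng(0)

def embed(features, projection):
    """Project one modality's features into the shared latent space."""
    return features @ projection

d_model = 64  # hypothetical shared latent dimensionality

# Hypothetical per-modality features (dims are illustrative, not from the paper).
video_feats = rng.normal(size=(32, 512))   # 32 video frames
audio_feats = rng.normal(size=(16, 128))   # reference-audio tokens
text_feats  = rng.normal(size=(8, 256))    # text-prompt tokens

# One learned projection per modality maps everything to d_model.
proj_video = rng.normal(size=(512, d_model)) * 0.02
proj_audio = rng.normal(size=(128, d_model)) * 0.02
proj_text  = rng.normal(size=(256, d_model)) * 0.02

# All three streams land in the same latent space and are concatenated
# into one token sequence for a transformer to blend.
tokens = np.concatenate([
    embed(video_feats, proj_video),
    embed(audio_feats, proj_audio),
    embed(text_feats,  proj_text),
], axis=0)

print(tokens.shape)  # (56, 64)
```

Once the three streams share one space, attention layers can relate any reference-audio token to any video frame, which is what lets acoustic character transfer while visual timing is preserved.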
Overview of the AC-Foley method showing how different modalities interact. Video, text, and audio inputs flow through a multimodal transformer network that jointly processes them for more precise conditioning.
Different modalities jointly interact in the multimodal transformer network for precise control
Learning to transfer, not memorize
A natural worry emerges: how does the model avoid simply memorizing "when you see this video, output this reference audio"? The training process prevents this through clever design.
Instead of always giving the model the full target audio as reference, the researchers randomly provide only 2 seconds of the target audio during training. This forces the model to learn the principle of acoustic transfer. It's shown a snippet of the target audio's character and must extrapolate that character across the entire video-audio pair. The model learns to generalize from partial information rather than memorize specific input-output pairs.
The two-stage training structure makes this concrete. In stage one, overlapping conditioning uses random 2-second windows of the target audio as reference. The model must infer acoustic principles from incomplete information. This design is specifically engineered to learn transferable acoustic properties rather than rote memorization, which is what enables zero-shot capability. A model that has learned to transfer acoustic character can apply it to any reference audio, not just those it memorized during training.
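The window-sampling trick described above is simple to express in code. This is a sketch of the data-preparation step only; the sample rate and function name are assumptions for illustration.

```python
import numpy as np

def sample_reference_window(target_audio, sr=16000, win_seconds=2.0, rng=None):
    """Pick a random 2-second window of the target audio to serve as the
    conditioning reference, as in the stage-one training described above.
    Showing only a snippet forces the model to extrapolate acoustic
    character rather than copy the full target."""
    rng = rng or np.random.default_rng()
    win = int(win_seconds * sr)
    if len(target_audio) <= win:
        return target_audio  # clip shorter than the window: use it all
    start = rng.integers(0, len(target_audio) - win)
    return target_audio[start:start + win]

# Example: a 10-second target clip yields a fresh 2-second reference
# on every training step.
audio = np.zeros(10 * 16000)
ref = sample_reference_window(audio, rng=np.random.default_rng(0))
print(len(ref) / 16000)  # 2.0
```

Because the window position changes each step, the model never sees a stable video-to-audio lookup pair to memorize; it only ever sees partial evidence of the target's acoustic character.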
Illustration of the two-stage training process showing overlapping conditioning windows. Random 2-second segments of target audio are used during training, forcing the model to learn acoustic transfer principles.
Overlapping conditioning windows force the model to learn generalization rather than memorization
What becomes possible with audio conditioning
Text-guided video-to-audio systems are constrained to sounds that match the video content reasonably well. Audio conditioning opens a completely different possibility space.
Fine-grained sound synthesis becomes straightforward: the same dog video can generate a Chihuahua's high-pitched bark or a German Shepherd's low growl depending on the reference audio provided. Timbre transfer works by applying the acoustic character of one sound to a different action. A footstep video conditioned on wind chime reference audio generates a wind-chime-like footstep. Zero-shot generation means you can hand the model any reference audio and it synthesizes in that acoustic style, even for reference sounds never encountered during training.
These capabilities emerge naturally from audio conditioning without requiring special training procedures. The model simply applies what it learned about acoustic transfer to new reference audio.
Qualitative examples showing the same videos paired with three distinct conditional audio inputs, demonstrating the model's ability to generate different sounds from identical visual input.
The same video produces different sounds when paired with different reference audio
Measuring performance
Claims about breakthrough performance need empirical grounding. The researchers evaluated AC-Foley using both automatic metrics and human judgment.
Automatic metrics measured semantic alignment, asking whether the generated audio matched the video content. Human studies had listeners rate whether generated audio matched the video and assess overall quality. The results are clear: AC-Foley achieves state-of-the-art performance when conditioned on reference audio. It also remains competitive with existing video-to-audio methods even when audio conditioning is removed and the system falls back to text-only guidance. This dual performance demonstrates that audio conditioning genuinely helps, while the architecture remains robust without it.
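Semantic-alignment metrics of this kind typically embed the generated audio and the video (or its label) into a shared space and compare the embeddings with cosine similarity. The summary does not name the paper's exact metric, so the sketch below shows only that generic comparison step:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors: the standard
    comparison step in embedding-based semantic-alignment metrics."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy embeddings for a generated-audio clip and a video (illustrative values).
emb_audio = np.array([1.0, 0.0, 1.0])
emb_video = np.array([1.0, 0.0, 0.0])

print(round(cosine_similarity(emb_audio, emb_video), 3))  # 0.707
```

A score near 1.0 means the generated audio and the video occupy nearby points in the embedding space, i.e. they are judged semantically aligned.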
Ablation studies isolated the contribution of different design choices, confirming that the multimodal architecture and training strategy both contribute meaningfully to the results.
Screenshot of the user study survey interface used to evaluate audio quality and semantic alignment.
User studies systematically evaluated the quality and alignment of generated audio
What this means for audio generation
AC-Foley represents a conceptual shift in how we think about controlling generated audio. Rather than trying to make text descriptions more precise, the research shows that directly conditioning on reference audio is more natural and more powerful. The core principle, "use audio to specify audio," is simple but opens new possibilities.
This approach connects to existing work on audio-guided synthesis. Related systems such as DreamFoley and Foley Flow have explored other routes to video-to-audio generation, but AC-Foley's direct audio conditioning marks a distinct paradigm shift. The principle could extend beyond video-to-audio to music generation with reference styles, voice cloning with specific acoustic characteristics, and sound design workflows where precise control matters.
The implications for practical applications are significant: film post-production, game audio design, and immersive audio for VR/AR all depend on fine-grained acoustic control. AC-Foley provides a more direct path to achieving it than text-based systems. The approach suggests that whenever you need precise control over generated audio, conditioning on reference examples outperforms describing properties in language.
Open questions remain. Can this scale to longer sequences and higher-fidelity audio? How does it perform across more diverse sound domains? But the fundamental insight holds: when you need to specify how something should sound, showing the model is more effective than telling it.
This is a Plain English Papers summary of a research paper called AC-Foley: Reference-Audio-Guided Video-to-Audio Synthesis with Acoustic Transfer. If you like this kind of analysis, join AIModels.fyi or follow us on Twitter.
