Fish Audio’s S2-Pro Brings Emotion Tags to Text-to-Speech

Written by aimodels44 | Published 2026/03/26
Tech Story Tags: ai | fish-audio-s2-pro | expressive-text-to-speech | controllable-tts-model | multilingual-speech-ai | emotional-voice-synthesis | low-latency-tts | streaming-speech-model

TL;DR: Fish Audio’s S2-Pro is a multilingual text-to-speech model with inline emotion tags, low-latency streaming, and production-ready voice control.

Model overview

s2-pro is a state-of-the-art text-to-speech model from Fish Audio that combines fine-grained prosody control with production-ready performance. Trained on over 10 million hours of audio across 80+ languages, it uses a dual-autoregressive architecture with reinforcement learning alignment to deliver natural-sounding speech. Unlike earlier fish-speech-1.4 and s1-mini variants, this model introduces inline emotional and stylistic control through free-form natural language tags embedded directly in text.

Model inputs and outputs

The model accepts text with optional inline control tags and outputs high-fidelity audio. It handles multi-speaker scenarios and supports long-context inference with low-latency streaming capabilities. The system reconstructs audio using a 10-codebook RVQ-based audio codec operating at approximately 21 Hz frame rate.
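Given the codec numbers above (10 codebooks at roughly 21 frames per second), a back-of-envelope token budget per second of audio is easy to estimate. This is a rough calculation from the figures in this article, not an official specification:

```python
# Rough token-rate estimate for the RVQ-based codec described above:
# 10 codebooks at ~21 Hz means each second of audio is represented
# by roughly 10 * 21 = 210 acoustic tokens.

CODEBOOKS = 10
FRAME_RATE_HZ = 21  # approximate frames per second

def tokens_per_second(codebooks: int, frame_rate: int) -> int:
    """Acoustic tokens emitted per second of synthesized audio."""
    return codebooks * frame_rate

def tokens_for_clip(seconds: int) -> int:
    """Total acoustic tokens for a clip of the given duration."""
    return tokens_per_second(CODEBOOKS, FRAME_RATE_HZ) * seconds

print(tokens_per_second(CODEBOOKS, FRAME_RATE_HZ))  # 210
print(tokens_for_clip(60))  # 12600 tokens for a one-minute clip
```

At ~210 tokens per second, a ten-minute audiobook chapter works out to roughly 126,000 acoustic tokens, which is why the long-context and streaming capabilities matter for extended passages.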

Inputs

  • Text with optional control tags: Natural language instructions like [whisper in small voice], [professional broadcast tone], or [pitch up] embedded using bracket syntax
  • Speaker identity: Support for multiple speakers in generation
  • Language specification: One of 80+ supported languages
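The three inputs above can be sketched as a single request payload. Note that the field names (`text`, `speaker`, `language`) and the speaker identifier are illustrative assumptions, not the official Fish Audio API schema; only the bracket-tag syntax in the text comes from the article:

```python
# Illustrative only: the payload field names ("text", "speaker",
# "language") are assumptions, not Fish Audio's documented schema.
# The [bracket] control-tag syntax inside the text is the part
# described in this article.

def build_request(text: str, speaker: str, language: str) -> dict:
    """Bundle the three inputs described above into one payload."""
    return {"text": text, "speaker": speaker, "language": language}

request = build_request(
    text="[whisper in small voice] The door creaked open. [pause] Hello?",
    speaker="narrator-1",  # hypothetical speaker identifier
    language="en",
)
print(request["text"])
```

The key point is that emotion and style control travels inline with the text itself, so no separate side-channel of prosody parameters is needed.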

Outputs

  • Audio tokens: Acoustic tokens representing the synthesized speech at high frame rate
  • Full audio waveform: Reconstructed speech in standard audio format with fine-grained acoustic detail

Capabilities

The model excels at generating expressive speech with word-level control over emotion and tone. Over 15,000 unique control tags are supported, including markers for prosody variations like [pause], [emphasis], [laughing], [whisper], [shouting], [angry], and [sad]. The dual-autoregressive design pairs a 4-billion-parameter slow component, which handles semantic content, with a 400-million-parameter fast component that reconstructs acoustic detail, enabling efficient inference without sacrificing audio quality. Multi-turn generation and streaming capabilities allow for interactive speech synthesis applications.
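To make the inline-tag mechanic concrete, here is a minimal sketch of how bracket tags interleave with the text they modify. This is an illustration of the syntax only, not Fish Audio's actual parser; the assumption that a tag applies to the text following it until the next tag is mine:

```python
import re

# Minimal sketch: split tagged text into (active_tag, segment) pairs.
# Assumes each [bracket] tag applies to the text that follows it
# until the next tag appears -- an illustrative reading, not
# Fish Audio's documented semantics.

TAG_RE = re.compile(r"\[([^\[\]]+)\]")

def split_tags(text: str) -> list:
    """Return a list of (tag_or_None, text_segment) pairs."""
    segments = []
    tag = None
    pos = 0
    for m in TAG_RE.finditer(text):
        chunk = text[pos:m.start()].strip()
        if chunk:
            segments.append((tag, chunk))
        tag = m.group(1)
        pos = m.end()
    tail = text[pos:].strip()
    if tail:
        segments.append((tag, tail))
    return segments

print(split_tags("[surprised] Oh! [excited tone] You made it!"))
# [('surprised', 'Oh!'), ('excited tone', 'You made it!')]
```

Viewed this way, a sentence with several tags becomes a sequence of differently styled spans, which is what enables the word-level emotional control described above.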

What can I use it for?

This model suits applications requiring expressive, controllable speech synthesis at scale. Content creators can produce audiobooks and podcasts with natural emotional inflection. Customer service platforms can generate personalized voice responses with appropriate tone and emotion. Game developers and entertainment studios can create character dialogue with fine-grained control over delivery style. Educational platforms can generate narration with dynamic emphasis and pacing. The production-ready streaming performance and support for 80+ languages make it practical for international applications where maintaining speaking style matters.

Things to try

Experiment with layering multiple control tags in a single sentence to create complex emotional arcs, such as combining [surprised] with [excited tone] at different points. Test the streaming capabilities by generating long-form content to explore how real-time audio output maintains coherence across extended passages. Try generating the same text with different speaker identities to understand how speaker characteristics interact with the emotional control tags. Explore the language tier system by comparing Tier 1 languages (Japanese, English, Chinese) with other supported languages to understand quality variations. Use multi-turn generation for dialogue scenarios where emotional consistency must persist across speaker transitions.
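For the multi-turn dialogue suggestion above, one simple approach is to carry a per-speaker mood tag across turns so emotional delivery stays consistent through speaker transitions. The helper below is a hedged sketch: the tag names, the idea of prefixing every turn, and the function itself are all illustrative assumptions, not part of the model's API:

```python
# Hedged sketch: compose a multi-turn dialogue script where each
# speaker keeps a consistent emotion tag across turns. Tag names
# and the prefixing strategy are illustrative assumptions.

def dialogue_script(turns, speaker_moods):
    """Prefix each turn's text with that speaker's mood tag."""
    lines = []
    for speaker, text in turns:
        mood = speaker_moods.get(speaker)
        prefix = f"[{mood}] " if mood else ""
        lines.append((speaker, prefix + text))
    return lines

turns = [
    ("alice", "Did you hear that?"),
    ("bob", "Stay calm."),
    ("alice", "It's getting closer!"),
]
moods = {"alice": "scared whisper", "bob": "calm reassuring tone"}

for speaker, line in dialogue_script(turns, moods):
    print(speaker, "->", line)
```

Because tags are just text, this kind of scripting layer can live entirely on the client side, with each prepared line sent to the model per speaker turn.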


This is a simplified guide to an AI model called s2-pro maintained by fishaudio. If you like these kinds of analyses, join AIModels.fyi or follow us on Twitter.



Published by HackerNoon on 2026/03/26