This tutorial shows you how to build a complete voice agent that can have natural conversations with users. You'll create an application that listens to speech, processes it with AI, and responds back with voice—handling the full conversation loop in real-time.
You'll use AssemblyAI's streaming speech-to-text API for real-time transcription, OpenAI's GPT-4 for intelligent response generation, and Python's text-to-speech libraries for voice output. The implementation uses WebSocket connections to handle streaming audio and asynchronous processing to manage conversation turns without delays. By the end, you'll have a working voice agent that can maintain context across multiple exchanges and respond naturally to spoken questions.
What is a voice agent?
A voice agent is software that talks to users through speech instead of text. This means you speak to it, and it speaks back—just like talking to a human assistant.
Voice agents work differently from chatbots or traditional interactive voice response (IVR) phone systems. They understand what you're saying, reason about it, and answer intelligently, all through voice.
| Feature | Voice Agents | Chatbots | Traditional IVR |
|---|---|---|---|
| Input Method | Voice/Audio | Text only | Touch-tone or limited voice |
| Processing Speed | Real-time streaming | Instant | Menu-based delays |
| Interaction Style | Natural conversation | Text conversation | Rigid menu trees |
| Context Awareness | Full conversation memory | Message history | No context |
| Response Generation | Dynamic AI responses | AI or scripted | Pre-recorded messages |
Every voice agent needs three parts working together: speech-to-text (turns your voice into words), a language model (decides what to say back), and text-to-speech (turns the response into voice).
Voice agent architecture with three core components
Your voice goes through three steps to become an AI response. First, speech-to-text converts your audio into text that computers can read. Next, a language model reads that text and writes a response. Finally, text-to-speech turns that response back into audio you can hear.
Streaming makes it feel natural: Instead of waiting for you to finish talking completely, the system processes your voice in small chunks every few milliseconds.
WebSocket connections handle the flow: These persistent connections let audio flow back and forth without delays from reconnecting.
Real-time processing reduces waiting: The three stages overlap in a pipeline rather than running strictly one after another, so you get responses in under a second.
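In code, the conversation loop looks like this. The three stage functions here are placeholders for the real streaming speech-to-text, LLM, and text-to-speech calls you'll build later in this tutorial:

```python
# Minimal sketch of the voice agent loop. The stage functions are
# placeholders: in the real agent they become streaming STT, an LLM
# call, and TTS playback.

def speech_to_text(audio_chunk: bytes) -> str:
    """Placeholder: the real version streams audio to AssemblyAI."""
    return audio_chunk.decode("utf-8")  # pretend the audio is already text

def generate_response(transcript: str) -> str:
    """Placeholder: the real version calls an LLM like GPT-4."""
    return f"You said: {transcript}"

def text_to_speech(response: str) -> str:
    """Placeholder: the real version speaks the text aloud."""
    return response  # pretend we played the audio

def conversation_turn(audio_chunk: bytes) -> str:
    # One full turn: audio in -> text -> response -> audio out
    transcript = speech_to_text(audio_chunk)
    response = generate_response(transcript)
    return text_to_speech(response)

print(conversation_turn(b"hello agent"))  # -> prints "You said: hello agent"
```

Each placeholder gets replaced with a real component in the steps below; the shape of the loop stays the same.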
Prerequisites for building a voice agent
You'll need API keys and software before you start coding. Each service handles one part of the voice agent pipeline.
| Service | Purpose | Where to Get |
|---|---|---|
| AssemblyAI API | Real-time speech-to-text | Sign up at AssemblyAI dashboard for free credits |
| OpenAI API | LLM for response generation | Create account at OpenAI platform |
| Python 3.8+ | Runtime environment | Download from python.org |
| pyaudio | Audio capture from microphone | Install via pip |
| websockets | Real-time communication | Install via pip |
Start by creating your project folder:
```bash
mkdir voice-agent
cd voice-agent
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
```
Install the required packages:
```bash
pip install websockets pyaudio openai python-dotenv aiohttp
```
Create a .env file for your API keys:
```
ASSEMBLYAI_API_KEY=your_assemblyai_key_here
OPENAI_API_KEY=your_openai_key_here
```
AssemblyAI gives you free credits when you sign up. You'll find your API key in the dashboard after creating an account.
Build a voice agent with Python and AssemblyAI
Building your voice agent happens in four steps. Each step adds a new capability until you have a complete talking AI.
Set up real-time speech-to-text with AssemblyAI
Speech recognition is the foundation of your voice agent. AssemblyAI's streaming API turns your voice into text as you speak through WebSocket connections.
Create voice_agent.py and add the basic structure, including the WebSocket connection code for AssemblyAI's v3 streaming endpoint.
This code creates a WebSocket connection that sends your microphone audio to AssemblyAI and receives transcriptions back.
Partial transcripts: Show the words recognized so far while you're still speaking.
Final transcripts: Complete sentences with proper punctuation and capitalization.
AssemblyAI automatically formats final transcripts with punctuation and proper capitalization. This clean text works better with language models than raw, unformatted speech.
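A minimal sketch of that connection, using the websockets package. The endpoint URL and message fields follow AssemblyAI's v3 streaming docs at the time of writing, so verify them against the current documentation before relying on them:

```python
import asyncio
import json
import os

# v3 streaming endpoint; sample_rate must match your microphone capture.
ASSEMBLYAI_WS = "wss://streaming.assemblyai.com/v3/ws?sample_rate=16000"

def parse_turn(message):
    """Return (transcript, end_of_turn) for Turn events, or None otherwise.

    The v3 API also sends Begin and Termination events, which carry no
    transcript text.
    """
    data = json.loads(message)
    if data.get("type") != "Turn":
        return None
    return data.get("transcript", ""), bool(data.get("end_of_turn"))

async def transcribe(audio_chunks):
    """Stream raw PCM16 audio chunks and print transcripts as they arrive."""
    import websockets  # pip install websockets

    headers = {"Authorization": os.environ["ASSEMBLYAI_API_KEY"]}
    # On websockets < 14, pass extra_headers= instead of additional_headers=.
    async with websockets.connect(ASSEMBLYAI_WS, additional_headers=headers) as ws:

        async def sender():
            for chunk in audio_chunks:  # e.g. 50 ms buffers from pyaudio
                await ws.send(chunk)    # binary frames carry the audio
                await asyncio.sleep(0.05)
            await ws.send(json.dumps({"type": "Terminate"}))

        async def receiver():
            async for message in ws:
                parsed = parse_turn(message)
                if parsed is not None:
                    transcript, is_final = parsed
                    label = "FINAL" if is_final else "partial"
                    print(f"[{label}] {transcript}")

        await asyncio.gather(sender(), receiver())
```

In the full agent, the receiver hands final transcripts to the language model instead of printing them.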
Integrate an LLM for response generation
Now you'll add OpenAI's GPT-4 to create intelligent responses. The language model reads the transcribed text and generates appropriate answers.
The conversation history keeps track of what you've talked about. This lets the AI remember context from earlier in your conversation.
Short response limits: Voice conversations work better with brief answers than long explanations.
Streaming responses: The model streams its reply token by token as it's generated, so the agent can start speaking before the full answer is written.
Error handling: If something goes wrong, the agent gives you a helpful error message instead of crashing.
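A sketch of this step, assuming the official openai Python package and a GPT-4 chat model. The Conversation class, its system prompt, and its limits are illustrative choices, not fixed requirements:

```python
SYSTEM_PROMPT = (
    "You are a helpful voice assistant. "
    "Keep every answer under two sentences."
)
MAX_TURNS = 10  # keep only the last 10 exchanges to bound token usage

class Conversation:
    """Rolling chat history sent to the model on every turn."""

    def __init__(self):
        self.messages = [{"role": "system", "content": SYSTEM_PROMPT}]

    def add(self, role, content):
        self.messages.append({"role": role, "content": content})
        # Trim the oldest exchanges, always keeping the system prompt first.
        if len(self.messages) > 1 + MAX_TURNS * 2:
            self.messages = [self.messages[0]] + self.messages[-MAX_TURNS * 2:]

def respond(conversation, transcript):
    """Add the transcript to history, ask the model, and store the reply."""
    from openai import OpenAI  # pip install openai; needs OPENAI_API_KEY set

    conversation.add("user", transcript)
    try:
        completion = OpenAI().chat.completions.create(
            model="gpt-4",
            messages=conversation.messages,
            max_tokens=150,  # short replies suit spoken conversation
        )
        reply = completion.choices[0].message.content
    except Exception:
        # Fail gracefully so a bad API call doesn't crash the agent.
        reply = "Sorry, I ran into a problem generating a response."
    conversation.add("assistant", reply)
    return reply
```

The max_tokens cap enforces the short-response guideline above, and the trimmed history keeps cost and latency bounded on long conversations.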
Add text-to-speech output
Text-to-speech turns the AI's written response into spoken words. You'll use pyttsx3 because it works locally without additional API keys.
The speaking flag prevents the agent from listening to itself talk. Threading keeps the voice output from blocking other processes.
Speaking rate adjustment: 175 words per minute sounds natural for most users.
Volume control: Set to 90% to avoid audio distortion on most systems.
Background processing: The TTS runs separately so the agent can keep listening for your next question.
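One way to implement the speaking flag and background thread described above. This Speaker class is a sketch, with the engine injectable so the flag logic can be exercised without audio hardware:

```python
import threading

class Speaker:
    """Speaks text on a background thread and exposes a speaking flag.

    Pass engine=None to lazily create a real pyttsx3 engine, or inject a
    stand-in for testing.
    """

    def __init__(self, engine=None):
        self.speaking = threading.Event()
        self._engine = engine

    def _get_engine(self):
        if self._engine is None:
            import pyttsx3  # pip install pyttsx3
            self._engine = pyttsx3.init()
            self._engine.setProperty("rate", 175)    # words per minute
            self._engine.setProperty("volume", 0.9)  # avoid clipping
        return self._engine

    def say(self, text):
        def run():
            self.speaking.set()  # mute the microphone while talking
            try:
                engine = self._get_engine()
                engine.say(text)
                engine.runAndWait()  # blocks this thread, not the agent
            finally:
                self.speaking.clear()  # resume listening
        thread = threading.Thread(target=run, daemon=True)
        thread.start()
        return thread
```

While speaker.speaking is set, the transcription loop should discard microphone audio so the agent never hears itself.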
Handle conversation turns with voice activity detection
The trickiest part is knowing when you've finished talking and expect an answer. AssemblyAI's v3 streaming API handles this with built-in turn detection.
This prevents feedback loops where the agent transcribes its own voice. Built-in turn detection triggers a response once you stop speaking, and you can tune it with the min_end_of_turn_silence_when_confident and max_turn_silence parameters.
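The turn-detection settings are passed as query parameters when opening the WebSocket. A small helper sketch, with parameter names as documented for AssemblyAI's v3 streaming API at the time of writing (double-check the current docs, and treat the default values here as illustrative):

```python
from urllib.parse import urlencode

def streaming_url(
    sample_rate=16000,
    min_silence_ms=160,   # end the turn quickly when the model is confident
    max_silence_ms=2400,  # hard cutoff even at low confidence
):
    """Build the v3 streaming URL with turn-detection tuning."""
    params = urlencode({
        "sample_rate": sample_rate,
        "min_end_of_turn_silence_when_confident": min_silence_ms,
        "max_turn_silence": max_silence_ms,
    })
    return f"wss://streaming.assemblyai.com/v3/ws?{params}"
```

Lower silence thresholds make the agent feel snappier but risk cutting off slow speakers; raise them if users complain about being interrupted.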
Test your voice agent
Run your voice agent with:
```bash
python voice_agent.py
```
You should see "Voice agent ready. Start speaking..." and your microphone will activate.
Try these test phrases:
- "Hello, can you hear me?" - Tests basic transcription and response
- "What's the weather like?" - Tests LLM integration
- "Count to five slowly" - Tests TTS timing
Common problems and fixes:
Microphone not working: Check your system permissions. macOS requires microphone access in System Preferences. Windows may show a permission dialog.
Audio feedback or echo: The agent is hearing itself talk. Use headphones or increase the silence detection sensitivity.
Connection errors: Usually means wrong API keys. Double-check your .env file and make sure your AssemblyAI account is active.
AssemblyAI's real-time dashboard shows your live transcription sessions. You can see accuracy metrics, latency measurements, and debug any connection issues there.
Aim for under 2 seconds from when you stop talking to when the agent starts responding. This timing feels natural in conversation.
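To check that number, you can timestamp the two events yourself. This TurnTimer is a small hypothetical helper, not part of any SDK:

```python
import time

class TurnTimer:
    """Measures the gap between end-of-user-speech and first agent audio."""

    def __init__(self):
        self._turn_end = None

    def user_stopped_speaking(self):
        # Call this when the final transcript for a turn arrives.
        self._turn_end = time.monotonic()

    def agent_started_speaking(self):
        # Call this right before TTS playback begins; returns seconds elapsed.
        latency = time.monotonic() - self._turn_end
        print(f"Turn latency: {latency:.2f}s")
        return latency
```

If latency creeps above two seconds, the usual culprits are LLM response time and TTS startup, so measure each stage separately before optimizing.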
Next steps with voice agents
Once your basic agent works, you can add more advanced features to improve the user experience.
Speaker diarization: For streaming use cases, enable diarization with speaker_labels=true, which adds a speaker_label field such as A or B to each Turn event. This helps in meetings or customer service scenarios where you need to track different speakers.
Sentiment analysis: Detects if someone sounds happy, frustrated, or neutral. Your agent can adjust its tone to match—speaking more empathetically to upset users.
Keyterms prompting: Improves recognition of specialized terms. If your agent handles medical topics or technical topics, you can teach AssemblyAI industry-specific words.
Deployment options:
- Cloud hosting on AWS Lambda or Google Cloud Functions for automatic scaling
- Edge devices like Raspberry Pi for local processing without internet
- Web browsers using WebRTC for voice agents that run in any browser
Final words
Building voice agents involves connecting streaming speech-to-text from AssemblyAI with language models and text-to-speech synthesis. The Python implementation covered here provides a working foundation you can expand with additional features and deploy to production environments.
AssemblyAI's streaming transcription delivers the accuracy and low latency that voice agents need to feel natural. The service handles diverse accents and noisy environments while providing the real-time processing essential for conversational AI. AssemblyAI also offers dedicated engineering support and additional Voice AI models for features like sentiment analysis and speaker identification that can enhance your voice agent's capabilities.
Frequently asked questions
What programming languages can I use to build voice agents?
Python works best for voice agent development because of its extensive AI libraries and WebSocket support, though you can also use JavaScript with Node.js, Java, or any language that supports WebSocket connections and HTTP APIs.
How does AssemblyAI's streaming API differ from batch transcription for voice agents?
Streaming transcription processes audio in real-time as you speak, sending partial results immediately, while batch processing waits for complete audio files before starting transcription, making streaming essential for conversational voice agents.
What causes audio feedback in voice agents and how do I prevent it?
Audio feedback happens when the agent's microphone picks up its own speech output, causing it to transcribe its own voice, which you can prevent by pausing audio capture during text-to-speech playback or using headphones during development.
Can I run a voice agent locally without cloud APIs?
You can run speech-to-text and text-to-speech entirely locally with open-source tools like Whisper and pyttsx3, though cloud APIs like AssemblyAI typically deliver higher streaming accuracy and lower latency than local models running on everyday hardware.
