Voice Agents
Voice agents are AI systems that conduct real-time spoken conversations. Unlike text agents that communicate through typed interfaces, a voice agent listens, reasons, and speaks, operating in a modality with fundamentally different constraints. Latency matters in milliseconds, not seconds. Interruption is a natural and expected behaviour, not an error. The user has no ability to scroll back or re-read.
The pipeline is not simply a text agent with microphone input grafted on. Speech-to-text, language model inference, and text-to-speech each introduce latency that compounds. End-to-end response times must stay inside a perceptual budget of roughly 600ms for conversation to feel natural. Every architectural decision (framework choice, tool call design, filler strategy) is ultimately a latency trade-off.
This page covers the modality taxonomy, the V2V pipeline, the major platforms, how tool calling works in a voice context, and the constraints that differentiate voice from text agent design.
Modality Types
Not every voice-adjacent system is a full V2V agent. These are the distinct pipeline types, what each one covers, and the use cases each serves.
Voice-to-Voice
Full conversational pipeline. The user speaks, the agent understands and reasons, the agent speaks back. End-to-end latency must stay inside ~600ms for natural feel. The most complex modality; every stage of the pipeline must be optimised.
Uses: Conversational assistants, interactive agents, real-time support
Text-to-Speech
Text input, audio output. The system generates content as text and the TTS engine renders it as speech. No listening component. Simpler pipeline with fewer latency constraints; audio can often be pre-generated or cached.
Uses: Notifications, read-aloud, narration, accessibility features
Speech-to-Text
Audio input, text output. Captures spoken audio and transcribes it to text for downstream processing. No generative component. Output can feed analytics, search, text agents, or structured extraction pipelines.
Uses: Transcription, dictation, compliance recording, call analytics
Voice-Triggered
User speaks to trigger; agent responds in text or executes an action. The voice interface is the input only; output is text, UI action, or side effect. Common in wake-word systems and voice command interfaces.
Uses: Wake-word systems, voice commands, smart device control
Hybrid / Multimodal
Agent handles both voice and text on the same underlying session. Routes behaviour based on channel type while maintaining shared context. More complex state management; the agent must adapt response style per modality.
Uses: Omnichannel assistants, escalation paths, accessibility-aware systems
Streaming Transcription
Real-time partial transcripts produced as audio arrives, rather than waiting for a complete utterance. Enables lower-latency STT by beginning LLM processing before the user has finished speaking. Requires handling transcript corrections mid-inference.
Uses: Low-latency V2V, live captioning, real-time analytics
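Handling mid-inference corrections usually means treating each partial as a replacement, not an append. A minimal sketch in Python, assuming a hypothetical `TranscriptEvent` shape where the STT engine re-emits the full utterance so far with each partial:

```python
from dataclasses import dataclass

@dataclass
class TranscriptEvent:
    text: str       # full transcript so far for the current utterance
    is_final: bool  # True once the STT engine commits the utterance

class PartialTranscriptBuffer:
    """Tracks the latest partial transcript, replacing earlier revisions.

    Streaming STT engines typically re-emit the whole utterance with
    corrections, so we keep only the newest hypothesis rather than
    appending fragments.
    """
    def __init__(self):
        self.current = ""
        self.finalized = []

    def on_event(self, event: TranscriptEvent) -> None:
        self.current = event.text  # later partials overwrite earlier ones
        if event.is_final:
            self.finalized.append(event.text)
            self.current = ""

buf = PartialTranscriptBuffer()
buf.on_event(TranscriptEvent("what's my", is_final=False))
buf.on_event(TranscriptEvent("what's my account", is_final=False))
buf.on_event(TranscriptEvent("what's my account balance", is_final=True))
```

Any downstream LLM pre-processing started against `current` must be prepared to restart when a later partial revises the hypothesis.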
V2V Pipeline Architecture
A full voice-to-voice pipeline chains four core stages (voice activity detection, speech-to-text, LLM inference, text-to-speech), with interruption handling and context threading layered across them. Each stage introduces latency that stacks. The target end-to-end time from end-of-speech to first audio output is under 600ms; above that, conversational flow degrades noticeably.
Target end-to-end: <600ms
Voice Activity Detection
Determines when the user has finished speaking. Triggers the STT stage. Too aggressive and it cuts off the user mid-sentence; too lenient and it adds unnecessary latency. Most platforms expose a configurable silence threshold.
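The silence-threshold trade-off can be sketched as a simple energy-based detector. This is an illustrative toy, not any platform's actual VAD; frame size and thresholds are assumptions (20ms frames, so 30 quiet frames is a 600ms silence threshold):

```python
def detect_end_of_speech(frames, energy_threshold=0.01, silence_frames=30):
    """Return the frame index where end-of-speech fires, or None.

    frames: per-frame RMS energy values (one frame per 20 ms, assumed).
    End-of-speech fires after `silence_frames` consecutive frames below
    `energy_threshold`, but only once at least one speech frame has
    been heard, so leading silence never triggers it.
    """
    heard_speech = False
    quiet_run = 0
    for i, energy in enumerate(frames):
        if energy >= energy_threshold:
            heard_speech = True
            quiet_run = 0
        elif heard_speech:
            quiet_run += 1
            if quiet_run >= silence_frames:
                return i
    return None

# Speech for 5 frames, then silence: end-of-speech fires 30 quiet frames later.
frames = [0.2] * 5 + [0.0] * 40
```

Lowering `silence_frames` cuts latency but clips users who pause mid-sentence; raising it does the opposite, which is exactly the configurable trade-off most platforms expose.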
Speech-to-Text
Transcribes the captured utterance. Streaming STT begins generating partial transcripts before VAD fires, allowing LLM pre-processing to start earlier. Accuracy and latency trade off against each other; faster models are less accurate.
LLM Inference
Receives the transcript and generates a response, potentially calling tools mid-generation. Streaming output: first token latency matters more than total generation time, because TTS can begin on the first sentence while the rest is still being generated.
Text-to-Speech
Converts LLM text output to audio chunks. Streaming TTS begins playing the first sentence while subsequent sentences are still being synthesised. The first audio chunk latency (not total synthesis time) is the critical metric.
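The hand-off between streaming LLM output and streaming TTS is typically a sentence chunker: emit each sentence to TTS the moment its boundary appears in the token stream. A minimal sketch, assuming naive punctuation-based boundaries:

```python
import re

def sentence_chunks(token_stream):
    """Yield complete sentences from an LLM token stream as soon as each
    sentence boundary appears, so TTS can start synthesising the first
    sentence while later ones are still being generated."""
    buffer = ""
    for token in token_stream:
        buffer += token
        # Split on sentence-ending punctuation followed by whitespace.
        while True:
            match = re.search(r"[.!?]\s", buffer)
            if not match:
                break
            yield buffer[:match.end()].strip()
            buffer = buffer[match.end():]
    if buffer.strip():
        yield buffer.strip()  # flush whatever remains at end of stream

tokens = ["Your balance ", "is fine. ", "Anything ", "else? "]
chunks = list(sentence_chunks(tokens))
```

Real chunkers need to avoid splitting on abbreviations and decimals, but the principle is the same: first-chunk latency is bounded by the first sentence, not the whole response.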
Interruption Handling
VAD runs continuously during agent speech. When the user speaks over the agent, the barge-in signal cancels the current TTS stream, discards any buffered LLM output, and resets the pipeline for the new utterance. Handling this cleanly is one of the harder parts of V2V.
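The cancel-and-discard mechanics map naturally onto task cancellation. A simplified asyncio sketch, with `play_tts` standing in for real streaming audio playback:

```python
import asyncio

async def play_tts(sentences, played, chunk_delay=0.05):
    # Stand-in for streaming TTS: plays one sentence chunk at a time,
    # so cancellation can land between chunks.
    for s in sentences:
        await asyncio.sleep(chunk_delay)
        played.append(s)

async def turn_with_barge_in():
    played = []
    speech = asyncio.create_task(play_tts(["One.", "Two.", "Three."], played))
    await asyncio.sleep(0.075)   # VAD fires: the user speaks over the agent
    speech.cancel()              # barge-in cancels the TTS stream
    try:
        await speech
    except asyncio.CancelledError:
        pass                     # buffered output is discarded
    # The pipeline now resets to handle the new utterance.
    return played

played = asyncio.run(turn_with_barge_in())
```

Only the first chunk plays; the rest of the buffered response is thrown away, which is the desired behaviour, since it was generated against a conversation state the user has just invalidated.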
Context Threading
Each turn appends the transcript and response to the conversation history passed to the LLM on the next turn. Tool results, prior questions, and user preferences must all be carried forward. Long sessions need truncation or summarisation strategies.
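A rolling-window version of this threading can be sketched in a few lines, assuming OpenAI-style message dicts with the system prompt always pinned first:

```python
def thread_context(history, user_text, agent_text, max_turns=20):
    """Append the latest exchange and keep a rolling window of recent
    messages so long sessions don't grow the prompt without bound.
    `history` is a list of {"role", "content"} messages; the first
    entry is assumed to be the system prompt and is always retained."""
    history = history + [
        {"role": "user", "content": user_text},
        {"role": "assistant", "content": agent_text},
    ]
    system, turns = history[:1], history[1:]
    # Keep only the most recent `max_turns` messages after the system prompt.
    return system + turns[-max_turns:]

history = [{"role": "system", "content": "You are a voice agent."}]
for i in range(30):
    history = thread_context(history, f"question {i}", f"answer {i}", max_turns=6)
```

Summarisation-based strategies replace the truncated turns with a generated summary message instead of dropping them outright; the windowing skeleton is the same.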
Frameworks & Platforms
These platforms abstract the V2V pipeline to varying degrees. Some manage the entire stack from telephony to TTS; others give you infrastructure and expect you to compose the layers yourself.
ElevenLabs Conversational AI
Full-stack V2V platform with proprietary high-quality TTS at its core. Provides STT, VAD, turn detection, tool calling, conversation transcripts, and Twilio telephony integration. Well-suited when voice quality and branded persona are the primary requirements.
Vapi
Developer-focused V2V platform that lets you bring your own LLM and TTS. Strong telephony (inbound and outbound), structured tool calling with webhooks, and fine-grained control over the pipeline. Good choice when you need a custom model or TTS stack rather than a vertically integrated one.
Retell AI
V2V platform with a visual conversation workflow builder for defining conversation paths without code. Supports multiple TTS and STT providers, real-time latency optimisation, and both WebRTC and PSTN. Faster to deploy than fully code-driven approaches when conversation structure is well-defined upfront.
Bland AI
Outbound-focused V2V platform specialising in high-volume automated outbound calling. Provides conversation script A/B testing, campaign management, and enterprise telephony integrations. Best suited for scenarios where you are initiating large numbers of outbound calls rather than handling inbound.
OpenAI Realtime API
Direct WebSocket-based V2V API built natively into GPT-4o. Native VAD, barge-in, and function calling with sub-second response times. Tightly coupled to the OpenAI model stack: you cannot swap the underlying model, but the latency and integration simplicity are hard to match within that constraint.
LiveKit
Open-source real-time audio and video infrastructure with an agent SDK on top. Provides WebRTC transport and media handling while leaving STT, LLM, and TTS choices entirely to you. Used when full control over every layer is required or when self-hosting is a constraint.
Tool Calling in Voice
Voice agents call tools the same way text agents do, but latency budgets are unforgiving. A tool call that takes 2 seconds in a text interface is fine. The same call in a voice conversation means two seconds of dead air, which users interpret as a dropped connection. Tool design in voice is fundamentally a latency problem.
Synchronous Tool
Fast lookups where the result is available within the latency budget (~300ms). The agent waits for the result before generating a response. Suitable for account queries, knowledge base lookups, and simple database reads.
Filler Injection
The agent speaks a bridging phrase immediately while the tool call executes in the background. Buys 200–500ms without dead air. The filler must be natural and not presuppose the tool result: "let me check that" not "I can see that...".
Async Tool
Long-running operations that cannot complete within the response budget. The agent acknowledges, continues the conversation, and delivers the result when ready, potentially interrupting its own speech. Requires careful state management to re-enter the conversation naturally.
Streaming Result
Tools that yield partial results incrementally. The agent begins composing a response as soon as enough data arrives, rather than waiting for the complete result. Effective for search results, database cursors, and streaming API responses.
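This pattern maps naturally onto an async generator: consume results as they arrive and start composing the spoken answer once enough have landed. A sketch with a hypothetical `search_tool`; the delays are stand-ins for network latency:

```python
import asyncio

async def search_tool(query):
    """Hypothetical streaming tool: yields results incrementally
    instead of returning one complete payload."""
    for i in range(3):
        await asyncio.sleep(0.01)   # stand-in for an incremental fetch
        yield f"result {i} for {query!r}"

async def respond_with_streaming_tool(query, enough=2):
    """Start composing a spoken answer once `enough` partial results
    have arrived, rather than waiting for the tool to finish."""
    partials = []
    async for item in search_tool(query):
        partials.append(item)
        if len(partials) >= enough:
            break  # enough data to start speaking
    return f"I found {len(partials)} results so far."

reply = asyncio.run(respond_with_streaming_tool("voice sdks"))
```

The response latency is bounded by the first few results rather than the slowest one, which is the whole point of streaming the tool.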
Barge-in Cancellation
When the user interrupts while a tool is executing, the platform must cancel the pending tool response, discard buffered LLM output, and re-enter the pipeline with the new utterance. Tools that are already in-flight must be handled gracefully, idempotent where possible.
Silent Tools
Tools that execute without acknowledgement: logging, CRM updates, intent tagging. The agent neither informs the user nor waits for the result before responding. Suitable for fire-and-forget side effects that do not affect the response.
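In asyncio terms, a silent tool is a task that is launched but never awaited on the response path. A sketch, with `log_interaction` as a hypothetical side effect:

```python
import asyncio

async def log_interaction(event, sink):
    # Hypothetical fire-and-forget side effect (CRM update, intent tag).
    await asyncio.sleep(0.01)
    sink.append(event)

async def respond(user_text, sink, background):
    # Launch the silent tool without awaiting it...
    task = asyncio.create_task(log_interaction({"intent": "balance_query"}, sink))
    background.add(task)                # keep a reference so it isn't GC'd
    task.add_done_callback(background.discard)
    # ...and respond immediately, unblocked by the side effect.
    return "Sure, one moment."

async def demo():
    sink, background = [], set()
    reply = await respond("what's my balance?", sink, background)
    assert sink == []                   # reply produced before the log landed
    await asyncio.gather(*background)   # demo only: let the side effect finish
    return reply, sink

reply, sink = asyncio.run(demo())
```

The `background` set matters: a bare `create_task` with no retained reference can be garbage-collected before the side effect completes.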
Worked Example: Tool Call with Filler
User utterance
"What's my account balance?"
Agent: filler (immediate)
"Let me pull that up for you."
Tool call (background)
get_account_balance(user_id: "u_1234")
Tool result (~400ms)
{ "balance": 2847.50, "currency": "GBP" }
Agent: completes response
"Your current balance is two thousand, eight hundred and forty-seven pounds fifty."
The filler phrase buys ~400ms. The user hears no silence.
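The worked example above can be sketched in asyncio: launch the tool in the background, speak the filler while it runs, then await the result. `get_account_balance` and `say` are stand-ins for a real backend call and streaming TTS playback:

```python
import asyncio

async def get_account_balance(user_id):
    # Hypothetical backend call; ~400 ms in the worked example.
    await asyncio.sleep(0.04)
    return {"balance": 2847.50, "currency": "GBP"}

async def say(text, spoken):
    spoken.append(text)  # stand-in for streaming TTS playback

async def answer_balance_query(user_id):
    spoken = []
    # Kick off the tool call in the background...
    task = asyncio.create_task(get_account_balance(user_id))
    # ...and speak the filler immediately, so the user hears no silence.
    # Note the filler does not presuppose the tool result.
    await say("Let me pull that up for you.", spoken)
    result = await task  # filler playback has absorbed the tool latency
    await say(f"Your current balance is {result['balance']:.2f} "
              f"{result['currency']}.", spoken)
    return spoken

spoken = asyncio.run(answer_balance_query("u_1234"))
```

In a real pipeline the final line would be rendered in spoken-word style ("two thousand, eight hundred and forty-seven pounds fifty") rather than as digits; the ordering of filler, background task, and await is the structural point.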
Latency & Design Constraints
Voice agents operate under constraints that text agents do not face. Each one has architectural implications.
| Constraint | Implication | Common mitigation |
|---|---|---|
| Response latency | Silence above ~600ms feels like a dropped call. Users hang up or repeat themselves. | Streaming STT + streaming TTS + filler injection. Optimise each stage independently. |
| Interruption | Users naturally speak over agents. An agent that cannot be interrupted feels robotic and frustrating. | Continuous VAD during agent speech. Barge-in must cancel TTS and reset pipeline within 50ms. |
| No visual affordances | Users cannot see lists, tables, or formatted output. Agents that generate text-style responses sound unnatural. | Constrain LLM output style to spoken-word vocabulary. Short sentences. No lists, markdown, or structured data. |
| Ambient noise | Background noise degrades STT accuracy. Low-confidence transcripts can cause agents to misunderstand or hallucinate intent. | Confidence-threshold gating on STT output. Graceful clarification prompts on low-confidence input. |
| Network jitter | Audio packet loss or buffering introduces gaps in both input and output. VAD end-of-speech detection can misfire on gaps. | Jitter buffers, adaptive silence thresholds, and tolerant VAD configuration. Handle partial transcripts gracefully. |
| Session length | Long conversations accumulate context that exceeds LLM token limits, or produce latency growth as the prompt grows. | Rolling conversation window, session summarisation, or retrieval-augmented context rather than full history injection. |
Design Principles
Latency is UX
Every millisecond of silence is perceived as failure. Design the pipeline around latency targets first, feature completeness second. A slow agent with more capabilities is worse than a fast agent with fewer.
Stream Everything
Streaming STT, streaming LLM output, and streaming TTS compose into a pipeline where audio starts playing before transcription is complete. Non-streaming at any stage introduces a hard latency floor that cannot be optimised away.
Filler is Structural
Bridging phrases are not just conversational politeness; they are a latency management mechanism. Design them into the tool calling pattern from the start, not as an afterthought.
Speak, Don't Format
LLM output that works in text does not work in speech. Constrain the system prompt to produce naturally spoken responses: no bullet points, no headers, no list markers like "1:" that TTS would read aloud as "one colon". Short sentences that land well when heard, not read.
Design for Interruption
Barge-in is a feature, not an edge case. Test interruption handling as a primary flow. The agent's ability to be cut off mid-sentence and re-engage naturally is what separates a voice agent from an IVR.
Tools Must Be Fast
Any synchronous tool call adds directly to response latency. Budget each tool call explicitly. If a tool cannot return within ~300ms reliably, design it as an async pattern with filler, not a blocking call.
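Budgeting a tool call explicitly can be expressed as a timeout with an async fallback: if the tool returns inside the budget, respond synchronously; otherwise keep it running and switch to the async-delivery pattern. A sketch, with the budget values and tool names as assumptions:

```python
import asyncio

async def call_tool_within_budget(tool, budget_s=0.3):
    """Run a tool call with an explicit latency budget. If it returns
    in time, respond synchronously; otherwise fall back to the async
    pattern: acknowledge now, deliver the result later."""
    task = asyncio.create_task(tool())
    try:
        # shield() keeps the task alive when wait_for times out.
        result = await asyncio.wait_for(asyncio.shield(task), timeout=budget_s)
        return ("sync", result)
    except asyncio.TimeoutError:
        # Tool blew the budget: keep it running, switch to async delivery.
        return ("async_pending", task)

async def fast_tool():
    await asyncio.sleep(0.01)
    return "ok"

async def slow_tool():
    await asyncio.sleep(10)
    return "eventually"

async def demo():
    mode_fast, _ = await call_tool_within_budget(fast_tool, budget_s=0.3)
    mode_slow, pending = await call_tool_within_budget(slow_tool, budget_s=0.05)
    pending.cancel()  # demo only: tidy up the still-running slow tool
    return mode_fast, mode_slow

mode_fast, mode_slow = asyncio.run(demo())
```

The `async_pending` branch is where filler injection or an explicit acknowledgement ("I'll have that for you in a moment") takes over.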
