Voice Agents
Voice agents are AI systems that conduct real-time spoken conversations. Unlike text agents that communicate through typed interfaces, a voice agent listens, reasons, and speaks, operating in a modality with fundamentally different constraints. Latency matters in milliseconds, not seconds. Interruption is a natural and expected behaviour, not an error. The user has no ability to scroll back or re-read.
The pipeline is not simply a text agent with microphone input grafted on. Speech-to-text, language model inference, and text-to-speech each introduce latency that compounds. End-to-end response times must stay inside a perceptual budget of roughly 600ms for conversation to feel natural. Every architectural decision (framework choice, tool call design, filler strategy) is ultimately a latency trade-off.
This page covers the modality taxonomy, the V2V pipeline, the major platforms, how tool calling works in a voice context, and the constraints that differentiate voice from text agent design.
Modality Types
Not every voice-adjacent system is a full V2V agent. These are the distinct pipeline types, what each one covers, and the use cases each serves.
Voice-to-Voice
Full conversational pipeline. The user speaks, the agent understands and reasons, the agent speaks back. End-to-end latency must stay inside ~600ms for natural feel. The most complex modality; every stage of the pipeline must be optimised.
Uses: Conversational assistants, interactive agents, real-time support
Text-to-Speech
Text input, audio output. The system generates content as text and the TTS engine renders it as speech. No listening component. Simpler pipeline with fewer latency constraints; audio can often be pre-generated or cached.
Uses: Notifications, read-aloud, narration, accessibility features
Speech-to-Text
Audio input, text output. Captures spoken audio and transcribes it to text for downstream processing. No generative component. Output can feed analytics, search, text agents, or structured extraction pipelines.
Uses: Transcription, dictation, compliance recording, call analytics
Voice-Triggered
User speaks to trigger; agent responds in text or executes an action. The voice interface is the input only; output is text, UI action, or side effect. Common in wake-word systems and voice command interfaces.
Uses: Wake-word systems, voice commands, smart device control
Hybrid / Multimodal
Agent handles both voice and text on the same underlying session. Routes behaviour based on channel type while maintaining shared context. More complex state management; the agent must adapt response style per modality.
Uses: Omnichannel assistants, escalation paths, accessibility-aware systems
Streaming Transcription
Real-time partial transcripts produced as audio arrives, rather than waiting for a complete utterance. Enables lower-latency STT by beginning LLM processing before the user has finished speaking. Requires handling transcript corrections mid-inference.
Uses: Low-latency V2V, live captioning, real-time analytics
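Handling mid-inference corrections usually means treating each partial as a replacement, not an append. A minimal sketch in Python, assuming a hypothetical `TranscriptEvent` shape where the STT engine re-emits the full utterance so far with each partial:

```python
from dataclasses import dataclass

@dataclass
class TranscriptEvent:
    text: str       # full transcript so far for the current utterance
    is_final: bool  # True once the STT engine commits the utterance

class PartialTranscriptBuffer:
    """Tracks the latest partial transcript, replacing earlier revisions.

    Streaming STT engines typically re-emit the whole utterance with
    corrections, so we keep only the newest hypothesis rather than
    appending fragments.
    """
    def __init__(self):
        self.current = ""
        self.finalized = []

    def on_event(self, event: TranscriptEvent) -> None:
        self.current = event.text  # later partials overwrite earlier ones
        if event.is_final:
            self.finalized.append(event.text)
            self.current = ""

buf = PartialTranscriptBuffer()
buf.on_event(TranscriptEvent("what's my", is_final=False))
buf.on_event(TranscriptEvent("what's my account", is_final=False))
buf.on_event(TranscriptEvent("what's my account balance", is_final=True))
```

Any downstream LLM pre-processing started against `current` must be prepared to restart when a later partial revises the hypothesis.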
V2V Pipeline Architecture
A full voice-to-voice pipeline chains four core stages (voice activity detection, speech-to-text, LLM inference, text-to-speech), with interruption handling and context threading layered across them. Each stage introduces latency that stacks. The target end-to-end time from end-of-speech to first audio output is under 600ms; above that, conversational flow degrades noticeably.
Target end-to-end: <600ms
Voice Activity Detection
Determines when the user has finished speaking. Triggers the STT stage. Too aggressive and it cuts off the user mid-sentence; too lenient and it adds unnecessary latency. Most platforms expose a configurable silence threshold.
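The silence-threshold trade-off can be sketched as a simple energy-based detector. This is an illustrative toy, not any platform's actual VAD; frame size and thresholds are assumptions (20ms frames, so 30 quiet frames is a 600ms silence threshold):

```python
def detect_end_of_speech(frames, energy_threshold=0.01, silence_frames=30):
    """Return the frame index where end-of-speech fires, or None.

    frames: per-frame RMS energy values (one frame per 20 ms, assumed).
    End-of-speech fires after `silence_frames` consecutive frames below
    `energy_threshold`, but only once at least one speech frame has
    been heard, so leading silence never triggers it.
    """
    heard_speech = False
    quiet_run = 0
    for i, energy in enumerate(frames):
        if energy >= energy_threshold:
            heard_speech = True
            quiet_run = 0
        elif heard_speech:
            quiet_run += 1
            if quiet_run >= silence_frames:
                return i
    return None

# Speech for 5 frames, then silence: end-of-speech fires 30 quiet frames later.
frames = [0.2] * 5 + [0.0] * 40
```

Lowering `silence_frames` cuts latency but clips users who pause mid-sentence; raising it does the opposite, which is exactly the configurable trade-off most platforms expose.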
Speech-to-Text
Transcribes the captured utterance. Streaming STT begins generating partial transcripts before VAD fires, allowing LLM pre-processing to start earlier. Accuracy and latency trade off against each other; faster models are less accurate.
LLM Inference
Receives the transcript and generates a response, potentially calling tools mid-generation. Streaming output: first token latency matters more than total generation time, because TTS can begin on the first sentence while the rest is still being generated.
Text-to-Speech
Converts LLM text output to audio chunks. Streaming TTS begins playing the first sentence while subsequent sentences are still being synthesised. The first audio chunk latency (not total synthesis time) is the critical metric.
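The hand-off between streaming LLM output and streaming TTS is typically a sentence chunker: emit each sentence to TTS the moment its boundary appears in the token stream. A minimal sketch, assuming naive punctuation-based boundaries:

```python
import re

def sentence_chunks(token_stream):
    """Yield complete sentences from an LLM token stream as soon as each
    sentence boundary appears, so TTS can start synthesising the first
    sentence while later ones are still being generated."""
    buffer = ""
    for token in token_stream:
        buffer += token
        # Split on sentence-ending punctuation followed by whitespace.
        while True:
            match = re.search(r"[.!?]\s", buffer)
            if not match:
                break
            yield buffer[:match.end()].strip()
            buffer = buffer[match.end():]
    if buffer.strip():
        yield buffer.strip()  # flush whatever remains at end of stream

tokens = ["Your balance ", "is fine. ", "Anything ", "else? "]
chunks = list(sentence_chunks(tokens))
```

Real chunkers need to avoid splitting on abbreviations and decimals, but the principle is the same: first-chunk latency is bounded by the first sentence, not the whole response.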
Interruption Handling
VAD runs continuously during agent speech. When the user speaks over the agent, the barge-in signal cancels the current TTS stream, discards any buffered LLM output, and resets the pipeline for the new utterance. Handling this cleanly is one of the harder parts of V2V.
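The cancel-and-discard mechanics map naturally onto task cancellation. A simplified asyncio sketch, with `play_tts` standing in for real streaming audio playback:

```python
import asyncio

async def play_tts(sentences, played, chunk_delay=0.05):
    # Stand-in for streaming TTS: plays one sentence chunk at a time,
    # so cancellation can land between chunks.
    for s in sentences:
        await asyncio.sleep(chunk_delay)
        played.append(s)

async def turn_with_barge_in():
    played = []
    speech = asyncio.create_task(play_tts(["One.", "Two.", "Three."], played))
    await asyncio.sleep(0.075)   # VAD fires: the user speaks over the agent
    speech.cancel()              # barge-in cancels the TTS stream
    try:
        await speech
    except asyncio.CancelledError:
        pass                     # buffered output is discarded
    # The pipeline now resets to handle the new utterance.
    return played

played = asyncio.run(turn_with_barge_in())
```

Only the first chunk plays; the rest of the buffered response is thrown away, which is the desired behaviour, since it was generated against a conversation state the user has just invalidated.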
Context Threading
Each turn appends the transcript and response to the conversation history passed to the LLM on the next turn. Tool results, prior questions, and user preferences must all be carried forward. Long sessions need truncation or summarisation strategies.
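A rolling-window version of this threading can be sketched in a few lines, assuming OpenAI-style message dicts with the system prompt always pinned first:

```python
def thread_context(history, user_text, agent_text, max_turns=20):
    """Append the latest exchange and keep a rolling window of recent
    messages so long sessions don't grow the prompt without bound.
    `history` is a list of {"role", "content"} messages; the first
    entry is assumed to be the system prompt and is always retained."""
    history = history + [
        {"role": "user", "content": user_text},
        {"role": "assistant", "content": agent_text},
    ]
    system, turns = history[:1], history[1:]
    # Keep only the most recent `max_turns` messages after the system prompt.
    return system + turns[-max_turns:]

history = [{"role": "system", "content": "You are a voice agent."}]
for i in range(30):
    history = thread_context(history, f"question {i}", f"answer {i}", max_turns=6)
```

Summarisation-based strategies replace the truncated turns with a generated summary message instead of dropping them outright; the windowing skeleton is the same.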
Frameworks & Platforms
These platforms abstract the V2V pipeline to varying degrees. Some manage the entire stack from telephony to TTS; others give you infrastructure and expect you to compose the layers yourself.
ElevenLabs Conversational AI
Full-stack V2V platform with proprietary high-quality TTS at its core. Provides STT, VAD, turn detection, tool calling, conversation transcripts, and Twilio telephony integration. Well-suited when voice quality and branded persona are the primary requirements.
Vapi
Developer-focused V2V platform that lets you bring your own LLM and TTS. Strong telephony (inbound and outbound), structured tool calling with webhooks, and fine-grained control over the pipeline. Good choice when you need a custom model or TTS stack rather than a vertically integrated one.
Retell AI
V2V platform with a visual conversation workflow builder for defining conversation paths without code. Supports multiple TTS and STT providers, real-time latency optimisation, and both WebRTC and PSTN. Faster to deploy than fully code-driven approaches when conversation structure is well-defined upfront.
Bland AI
Outbound-focused V2V platform specialising in high-volume automated outbound calling. Provides conversation script A/B testing, campaign management, and enterprise telephony integrations. Best suited for scenarios where you are initiating large numbers of outbound calls rather than handling inbound.
OpenAI Realtime API
Direct WebSocket-based V2V API built natively into GPT-4o. Native VAD, barge-in, and function calling with sub-second response times. Tightly coupled to the OpenAI model stack: you cannot swap the underlying model, but the latency and integration simplicity are hard to match within that constraint.
LiveKit
Open-source real-time audio and video infrastructure with an agent SDK on top. Provides WebRTC transport and media handling while leaving STT, LLM, and TTS choices entirely to you. Used when full control over every layer is required or when self-hosting is a constraint.
Tool Calling in Voice
Voice agents call tools the same way text agents do, but latency budgets are unforgiving. A tool call that takes 2 seconds in a text interface is fine. The same call in a voice conversation means two seconds of dead air, which users interpret as a dropped connection. Tool design in voice is fundamentally a latency problem.
Synchronous Tool
Fast lookups where the result is available within the latency budget (~300ms). The agent waits for the result before generating a response. Suitable for account queries, knowledge base lookups, and simple database reads.
Filler Injection
The agent speaks a bridging phrase immediately while the tool call executes in the background. Buys 200–500ms without dead air. The filler must be natural and not presuppose the tool result: "let me check that" not "I can see that...".
Async Tool
Long-running operations that cannot complete within the response budget. The agent acknowledges, continues the conversation, and delivers the result when ready, potentially interrupting its own speech. Requires careful state management to re-enter the conversation naturally.
Streaming Result
Tools that yield partial results incrementally. The agent begins composing a response as soon as enough data arrives, rather than waiting for the complete result. Effective for search results, database cursors, and streaming API responses.
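This pattern maps naturally onto an async generator: consume results as they arrive and start composing the spoken answer once enough have landed. A sketch with a hypothetical `search_tool`; the delays are stand-ins for network latency:

```python
import asyncio

async def search_tool(query):
    """Hypothetical streaming tool: yields results incrementally
    instead of returning one complete payload."""
    for i in range(3):
        await asyncio.sleep(0.01)   # stand-in for an incremental fetch
        yield f"result {i} for {query!r}"

async def respond_with_streaming_tool(query, enough=2):
    """Start composing a spoken answer once `enough` partial results
    have arrived, rather than waiting for the tool to finish."""
    partials = []
    async for item in search_tool(query):
        partials.append(item)
        if len(partials) >= enough:
            break  # enough data to start speaking
    return f"I found {len(partials)} results so far."

reply = asyncio.run(respond_with_streaming_tool("voice sdks"))
```

The response latency is bounded by the first few results rather than the slowest one, which is the whole point of streaming the tool.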
Barge-in Cancellation
When the user interrupts while a tool is executing, the platform must cancel the pending tool response, discard buffered LLM output, and re-enter the pipeline with the new utterance. Tools that are already in-flight must be handled gracefully, idempotent where possible.
Silent Tools
Tools that execute without acknowledgement: logging, CRM updates, intent tagging. The agent neither informs the user nor waits for the result before responding. Suitable for fire-and-forget side effects that do not affect the response.
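In asyncio terms, a silent tool is a task that is launched but never awaited on the response path. A sketch, with `log_interaction` as a hypothetical side effect:

```python
import asyncio

async def log_interaction(event, sink):
    # Hypothetical fire-and-forget side effect (CRM update, intent tag).
    await asyncio.sleep(0.01)
    sink.append(event)

async def respond(user_text, sink, background):
    # Launch the silent tool without awaiting it...
    task = asyncio.create_task(log_interaction({"intent": "balance_query"}, sink))
    background.add(task)                # keep a reference so it isn't GC'd
    task.add_done_callback(background.discard)
    # ...and respond immediately, unblocked by the side effect.
    return "Sure, one moment."

async def demo():
    sink, background = [], set()
    reply = await respond("what's my balance?", sink, background)
    assert sink == []                   # reply produced before the log landed
    await asyncio.gather(*background)   # demo only: let the side effect finish
    return reply, sink

reply, sink = asyncio.run(demo())
```

The `background` set matters: a bare `create_task` with no retained reference can be garbage-collected before the side effect completes.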
Worked Example: Tool Call with Filler
User utterance
"What's my account balance?"
Agent: filler (immediate)
"Let me pull that up for you."
Tool call (background)
get_account_balance(user_id: "u_1234")
Tool result (~400ms)
{ "balance": 2847.50, "currency": "GBP" }
Agent: completes response
"Your current balance is two thousand, eight hundred and forty-seven pounds fifty."
The filler phrase buys ~400ms. The user hears no silence.
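The worked example above can be sketched in asyncio: launch the tool in the background, speak the filler while it runs, then await the result. `get_account_balance` and `say` are stand-ins for a real backend call and streaming TTS playback:

```python
import asyncio

async def get_account_balance(user_id):
    # Hypothetical backend call; ~400 ms in the worked example.
    await asyncio.sleep(0.04)
    return {"balance": 2847.50, "currency": "GBP"}

async def say(text, spoken):
    spoken.append(text)  # stand-in for streaming TTS playback

async def answer_balance_query(user_id):
    spoken = []
    # Kick off the tool call in the background...
    task = asyncio.create_task(get_account_balance(user_id))
    # ...and speak the filler immediately, so the user hears no silence.
    # Note the filler does not presuppose the tool result.
    await say("Let me pull that up for you.", spoken)
    result = await task  # filler playback has absorbed the tool latency
    await say(f"Your current balance is {result['balance']:.2f} "
              f"{result['currency']}.", spoken)
    return spoken

spoken = asyncio.run(answer_balance_query("u_1234"))
```

In a real pipeline the final line would be rendered in spoken-word style ("two thousand, eight hundred and forty-seven pounds fifty") rather than as digits; the ordering of filler, background task, and await is the structural point.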
Latency & Design Constraints
Voice agents operate under constraints that text agents do not face. Each one has architectural implications.
| Constraint | Implication | Common mitigation |
|---|---|---|
| Response latency | Silence above ~600ms feels like a dropped call. Users hang up or repeat themselves. | Streaming STT + streaming TTS + filler injection. Optimise each stage independently. |
| Interruption | Users naturally speak over agents. An agent that cannot be interrupted feels robotic and frustrating. | Continuous VAD during agent speech. Barge-in must cancel TTS and reset pipeline within 50ms. |
| No visual affordances | Users cannot see lists, tables, or formatted output. Agents that generate text-style responses sound unnatural. | Constrain LLM output style to spoken-word vocabulary. Short sentences. No lists, markdown, or structured data. |
| Ambient noise | Background noise degrades STT accuracy. Low-confidence transcripts can cause agents to misunderstand or hallucinate intent. | Confidence-threshold gating on STT output. Graceful clarification prompts on low-confidence input. |
| Network jitter | Audio packet loss or buffering introduces gaps in both input and output. VAD end-of-speech detection can misfire on gaps. | Jitter buffers, adaptive silence thresholds, and tolerant VAD configuration. Handle partial transcripts gracefully. |
| Session length | Long conversations accumulate context that exceeds LLM token limits, or produce latency growth as the prompt grows. | Rolling conversation window, session summarisation, or retrieval-augmented context rather than full history injection. |
Design Principles
Latency is UX
Every millisecond of silence is perceived as failure. Design the pipeline around latency targets first, feature completeness second. A slow agent with more capabilities is worse than a fast agent with fewer.
Stream Everything
Streaming STT, streaming LLM output, and streaming TTS compose into a pipeline where audio starts playing before transcription is complete. Non-streaming at any stage introduces a hard latency floor that cannot be optimised away.
Filler is Structural
Bridging phrases are not just conversational politeness; they are a latency management mechanism. Design them into the tool calling pattern from the start, not as an afterthought.
Speak, Don't Format
LLM output that works in text does not work in speech. Constrain the system prompt to produce naturally spoken responses: no bullet points, no headers, no list markers like "1:" that TTS would read aloud as "one colon". Short sentences that land well when heard, not read.
Design for Interruption
Barge-in is a feature, not an edge case. Test interruption handling as a primary flow. The agent's ability to be cut off mid-sentence and re-engage naturally is what separates a voice agent from an IVR.
Tools Must Be Fast
Any synchronous tool call adds directly to response latency. Budget each tool call explicitly. If a tool cannot return within ~300ms reliably, design it as an async pattern with filler, not a blocking call.
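Budgeting a tool call explicitly can be expressed as a timeout with an async fallback: if the tool returns inside the budget, respond synchronously; otherwise keep it running and switch to the async-delivery pattern. A sketch, with the budget values and tool names as assumptions:

```python
import asyncio

async def call_tool_within_budget(tool, budget_s=0.3):
    """Run a tool call with an explicit latency budget. If it returns
    in time, respond synchronously; otherwise fall back to the async
    pattern: acknowledge now, deliver the result later."""
    task = asyncio.create_task(tool())
    try:
        # shield() keeps the task alive when wait_for times out.
        result = await asyncio.wait_for(asyncio.shield(task), timeout=budget_s)
        return ("sync", result)
    except asyncio.TimeoutError:
        # Tool blew the budget: keep it running, switch to async delivery.
        return ("async_pending", task)

async def fast_tool():
    await asyncio.sleep(0.01)
    return "ok"

async def slow_tool():
    await asyncio.sleep(10)
    return "eventually"

async def demo():
    mode_fast, _ = await call_tool_within_budget(fast_tool, budget_s=0.3)
    mode_slow, pending = await call_tool_within_budget(slow_tool, budget_s=0.05)
    pending.cancel()  # demo only: tidy up the still-running slow tool
    return mode_fast, mode_slow

mode_fast, mode_slow = asyncio.run(demo())
```

The `async_pending` branch is where filler injection or an explicit acknowledgement ("I'll have that for you in a moment") takes over.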
