ElevenLabs Voice Agent Testing
ElevenLabs provides a dedicated testing framework for conversational voice agents. Unlike traditional software tests that validate deterministic outputs, voice agent testing must account for the probabilistic nature of LLM responses, correct tool invocation under varying inputs, and the branching structure of multi-turn dialogue. The platform addresses each of these with distinct test types and a simulation engine.
Tests can be run manually, triggered via API, or integrated into CI/CD pipelines so that every change to an agent's prompt or workflow is validated before it reaches production. This page covers the six core testing concepts: scenario evaluation, tool call validation, conversation simulation, CI/CD integration, test generation from real conversations, and the API surface.
Scenario Testing
LLM-evaluated tests that assess whether an agent's response meets defined success criteria, with support for labelled examples to anchor the judge.
Tool Call Testing
Validates that the agent invokes the right tool with the right parameters. Supports exact match validation for precision-critical parameters.
Conversation Simulation
Full start-to-finish or partial mid-conversation simulations with tool mocking and custom assertion criteria.
CI/CD Integration
API-driven test execution that slots into existing pipelines. Every prompt or workflow change can be gated on passing test results before deployment.
Tests from Conversations
Automatically generate test cases from real production conversations, grounding the test suite in actual interaction patterns.
API & SDK Support
Full REST API with Python and TypeScript SDKs. Create, run, and retrieve test results programmatically with typed interfaces.
Scenario Testing (LLM Evaluation)
Scenario tests simulate a complete interaction and assess the agent's responses against defined success criteria evaluated by an LLM judge. Because voice agent responses are generative rather than deterministic, success cannot be defined as an exact string match. Instead, criteria describe intent: "the agent confirmed the booking reference", "the agent did not reveal account balance before authentication".
Success & Failure Examples
Each criterion can be accompanied by labelled examples that anchor the evaluator. Without examples, borderline responses may be scored inconsistently. With a few annotated examples showing what "passed" and "failed" look like for that criterion, the evaluator calibrates to the nuances of your specific agent.
This is especially important for criteria that involve tone, phrasing, or implied meaning rather than factual correctness. Examples act as a few-shot prompt for the judge model itself.
Test Creation Sources
Scenario tests can be written from scratch using a structured form — defining the simulated user persona, turn-by-turn conversation goals, and the success criteria to evaluate. Alternatively, tests can be created directly from an existing conversation in the conversation history. The platform extracts the context and pre-populates the test structure, reducing the effort of formalising known-good or known-bad interactions.
Example success criteria: "The agent confirmed the booking reference back to the caller before ending the conversation."
Example failure criteria: "The agent revealed the account balance before the caller was authenticated."
Tool Call Testing
Tool call tests verify that the agent invokes the correct tool with the correct parameters given a particular conversational context. This is distinct from scenario testing: instead of evaluating whether the agent's spoken response is correct, it validates the function call the agent would make before generating a response. For actions like transferring a call, looking up an account, or booking an appointment, the parameter values are not merely important — they are the outcome.
Tool selection validation
Did the agent call the right tool?
Asserts that given a specific conversational context, the agent chose the correct tool to invoke. A caller asking to "change my reservation" should trigger the update-booking tool, not the cancel-booking tool — even when the phrasing is ambiguous.
assert tool_name == "update_booking"
Parameter value assertion
Did the agent pass the right values?
Exact match validation on specific parameters. For a call transfer the destination number must be exactly right. For a date-based lookup, the date format and value must match the system's expectations precisely.
assert params.transfer_to == "+441234567890"
Parameter extraction accuracy
Did the agent parse spoken input correctly?
Validates that the agent correctly extracted a value from spoken language and mapped it to a structured parameter. Tests that "the fourteenth of March" becomes the expected ISO date, or that "my account ending in four two three one" populates the correct account field.
assert params.date == "2025-03-14"
Tool non-invocation
Did the agent correctly not call a tool?
Some tests assert the absence of a tool call. If the caller has not yet provided required information, the agent must not invoke the booking tool. If the caller is not authenticated, the agent must not call the account-lookup tool.
assert tool_calls == []
Exact match validation
The framework supports exact match validation on individual tool parameters. For parameters that must be precise — a transfer destination, an account identifier, a date — fuzzy or LLM-graded evaluation would accept values that would cause runtime failures in the downstream system.
| Parameter type | Validation method | Rationale |
|---|---|---|
| Call transfer destination | Exact match | Any deviation routes the call to the wrong recipient. No tolerance for partial matches. |
| Account or booking ID | Exact match | A near-match on an identifier will silently operate on the wrong record. |
| Date or time value | Exact match | Off-by-one errors in dates create incorrect bookings that are often not caught until the customer calls back. |
| Spoken name or address | LLM-graded | Legitimate phonetic variation exists. An LLM judge can determine whether the extracted value is a valid transcription. |
| Tone or empathy | LLM-graded | There is no single correct phrasing. Evaluation requires understanding of intent and register, not string comparison. |
| Binary flags (e.g. opt-in) | Exact match | True/false parameters have no valid middle ground. An incorrect value has a direct compliance or business consequence. |
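The dispatch implied by the table can be sketched in a few lines. This is an illustrative helper, not the platform's validation API; the LLM-graded branch is stubbed because it would require a judge model.

```python
def validate_param(expected: str, actual: str, method: str = "exact") -> bool:
    """Hypothetical parameter check mirroring the table above."""
    if method == "exact":
        # Precision-critical parameters: any deviation is a failure.
        return expected == actual
    # LLM-graded comparison would defer to a judge model; stubbed here.
    raise NotImplementedError("LLM-graded validation needs a judge model")

# Exact match catches the near-miss that fuzzy grading would let through.
print(validate_param("+441234567890", "+441234567890"))  # True
print(validate_param("+441234567890", "+441234567891"))  # False
```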
Conversation Simulation API
The simulation API drives test execution programmatically. Rather than relying on a live human caller, it runs a synthetic user turn-by-turn against the agent, records the transcript, and exposes results for evaluation. Two simulation modes are supported depending on what you need to validate.
Start-to-finish
The synthetic caller begins the conversation from the opening utterance and drives it through to a defined termination condition (task completion, hang-up signal, or maximum turn count). Used for end-to-end regression tests that cover the full expected interaction path.
- Tests the full conversation lifecycle including greeting and closing
- Validates that the agent reaches the correct terminal state
- Catches regressions introduced by prompt changes at any turn
Mid-conversation injection
The simulation starts from an injected conversation state rather than from turn zero. A pre-existing transcript or a structured context payload positions the agent mid-conversation, then the simulation runs forward from that point. Used to validate specific decision points or sub-flows without re-running the entire preamble on every test run.
- Isolates a single decision branch for targeted testing
- Faster to execute when the path to a decision point is long
- Supports testing of edge-case sub-flows that are rare in full simulations
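An injected conversation state is essentially a partial transcript. The payload below is a sketch of the idea; the field names are assumptions, not the simulation API's documented request body.

```python
# Hypothetical injected-state payload -- field names are illustrative.
injected_state = [
    {"role": "agent", "text": "Which date would you like to move your booking to?"},
    {"role": "user", "text": "The fourteenth of March, please."},
]

# The simulation resumes from the final turn here, so only the
# date-handling branch runs; the greeting and authentication preamble
# are never replayed.
print(len(injected_state))  # 2
```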
Tool mocking
The simulation API supports replacing live tool implementations with mock responses. Rather than calling a real booking system or CRM during a test run, the mock returns a configured payload. This keeps tests deterministic, eliminates external dependencies, and allows testing of error paths (e.g. tool returns a 404 or empty result) that would be difficult to trigger against a live system.
Custom evaluation criteria
Beyond pass/fail scenario evaluation, the simulation API accepts custom evaluation criteria expressed as structured assertions over the transcript. Criteria can target specific turns, tool call parameters, response content, or conversation-level properties such as total turn count or whether the agent asked a clarifying question before proceeding. This allows evaluation logic to be defined in code and version-controlled alongside the agent definition.
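A transcript-level criterion like "the agent asked a clarifying question before calling any tool" can be written as a plain assertion over the transcript structure. The transcript shape below is illustrative; the real criteria format is defined by the simulation API.

```python
# Illustrative transcript -- field names are assumptions for the sketch.
transcript = [
    {"role": "user", "text": "I want to change my reservation."},
    {"role": "agent", "text": "Sure, what's your booking reference?"},
    {"role": "agent", "tool_call": {"name": "update_booking", "params": {"date": "2025-03-14"}}},
]

# Conversation-level property: the agent asked a clarifying question
# before invoking any tool.
first_tool_turn = next(i for i, t in enumerate(transcript) if "tool_call" in t)
asked_first = any(
    "?" in t.get("text", "") for t in transcript[:first_tool_turn] if t["role"] == "agent"
)
print(asked_first)  # True
```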
CI/CD Integration
Voice agent testing integrates into standard CI/CD pipelines via the REST API or the provided SDKs. Every change to the agent's system prompt, tool definitions, or conversation workflow can trigger a test suite run automatically before deployment. This prevents regressions from reaching production, which matters especially for voice agents where a degraded interaction is immediately perceived by callers.
Change is made
Developer modifies the system prompt, updates a tool definition, or changes a conversation workflow in the agent configuration.
Commit & push
The change is committed to version control. The CI pipeline detects the change and triggers the test workflow.
Test suite triggered
The pipeline calls POST /v1/convai/agents/{agent_id}/run-tests with the target agent ID. The platform begins executing the full test suite.
Results polled
The pipeline polls the results endpoint until the run completes. Test execution is asynchronous; polling allows the CI step to block on completion.
Pass/fail evaluated
The pipeline reads the test results. A configured failure threshold determines whether the pipeline proceeds or fails the build.
Deploy or block
On pass: the agent change is deployed to production. On fail: the pipeline blocks deployment and surfaces the failed test cases to the developer.
Example pipeline step (GitHub Actions)
```yaml
- name: Run ElevenLabs agent tests
  env:
    ELEVENLABS_API_KEY: ${{ secrets.ELEVENLABS_API_KEY }}
    AGENT_ID: ${{ vars.AGENT_ID }}
  run: |
    python scripts/run_agent_tests.py \
      --agent-id "$AGENT_ID" \
      --fail-threshold 0.95 \
      --timeout 300
```
Generating Tests from Real Conversations
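The helper script invoked by the pipeline step (scripts/run_agent_tests.py) is not reproduced in this doc. A hedged sketch of its polling and threshold logic follows, with the HTTP call abstracted behind a `fetch_run` callable so the logic runs without a live API key.

```python
import time

def wait_for_run(fetch_run, fail_threshold=0.95, poll_interval=0, timeout=300):
    """Poll fetch_run() until the run leaves pending/running, then compare
    the pass rate against the configured threshold. fetch_run is a stand-in
    for the GET test-run endpoint."""
    deadline = time.monotonic() + timeout
    run = fetch_run()
    while run["status"] in ("pending", "running"):
        if time.monotonic() > deadline:
            raise TimeoutError("test run did not complete in time")
        time.sleep(poll_interval)
        run = fetch_run()
    passed = sum(1 for t in run["tests"] if t["passed"])
    return passed / len(run["tests"]) >= fail_threshold

# Simulated run: completes on the second poll with 19/20 tests passing,
# which exactly meets a 0.95 threshold.
states = iter([
    {"status": "running", "tests": []},
    {"status": "done", "tests": [{"passed": True}] * 19 + [{"passed": False}]},
])
print(wait_for_run(lambda: next(states)))  # True
```

The CI step then exits non-zero when the function returns False, which blocks the deploy stage.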
One of the most efficient sources of test cases is the production conversation history. Rather than writing synthetic scenarios from scratch, the platform can analyse past conversations and automatically generate test cases from them. This grounds the test suite in real-world patterns rather than hypothetical ones.
Select source conversations
Choose conversations from the production history to use as generation inputs. Good candidates are completed interactions that reached a clear outcome — both successful completions and failures make useful test cases.
Platform extracts structure
The platform analyses the conversation transcript and infers the user intent, the agent's decision path, the tools invoked, and the final outcome. This structure becomes the scaffold for the generated test.
Success criteria inferred
Based on the outcome of the source conversation, the platform generates candidate success criteria. For a completed booking, it generates criteria around confirmation, correct date, and authenticated identity. These can be edited before the test is saved.
Test validated and saved
The generated test is run once against the current agent to establish a baseline, then saved to the test suite. From this point it becomes a repeatable regression test that will detect if future prompt changes break the behaviour observed in the source conversation.
Why this matters
Real conversations surface interaction patterns that test authors would never anticipate: unexpected phrasings, mid-conversation topic switches, users who give partial information, or callers who interrupt during critical tool-call sequences. A test suite drawn from production conversations will cover the actual distribution of inputs your agent encounters, not just the happy paths that were obvious at design time.
API Endpoints & SDK Support
The full testing lifecycle is accessible via the REST API. Test suites can be created, updated, executed, and their results retrieved programmatically. Both the Python and TypeScript SDKs wrap these endpoints for typed, idiomatic usage in each language.
| Method | Endpoint | Purpose |
|---|---|---|
| POST | /v1/convai/agents/{agent_id}/run-tests | Execute the full test suite for an agent. Returns a run ID for polling results. |
| GET | /v1/convai/agents/{agent_id}/test-runs/{run_id} | Retrieve the status and results of a specific test run. |
| GET | /v1/convai/agents/{agent_id}/tests | List all tests defined for an agent. |
| POST | /v1/convai/agents/{agent_id}/tests | Create a new test for an agent (scenario or tool call type). |
| PATCH | /v1/convai/agents/{agent_id}/tests/{test_id} | Update an existing test definition, criteria, or examples. |
| DELETE | /v1/convai/agents/{agent_id}/tests/{test_id} | Remove a test from the agent's test suite. |
Python SDK
```python
from elevenlabs import ElevenLabs
import time

client = ElevenLabs(api_key="your-api-key")

# Trigger test suite
run = client.convai.agents.run_tests(
    agent_id="your-agent-id"
)

# Poll until complete
while run.status in ("pending", "running"):
    time.sleep(5)
    run = client.convai.agents.get_test_run(
        agent_id="your-agent-id",
        run_id=run.id,
    )

# Evaluate results
passed = [t for t in run.tests if t.passed]
failed = [t for t in run.tests if not t.passed]
print(f"{len(passed)}/{len(run.tests)} tests passed")
```
TypeScript SDK
```typescript
import { ElevenLabsClient } from "@elevenlabs/elevenlabs-js";

const client = new ElevenLabsClient({
  apiKey: "your-api-key",
});

// Trigger test suite
let run = await client.convai.agents.runTests({
  agentId: "your-agent-id",
});

// Poll until complete
while (run.status === "pending" || run.status === "running") {
  await new Promise((r) => setTimeout(r, 5000));
  run = await client.convai.agents.getTestRun({
    agentId: "your-agent-id",
    runId: run.id,
  });
}

// Evaluate results
const passed = run.tests.filter((t) => t.passed);
console.log(`${passed.length}/${run.tests.length} passed`);
```
Choosing the Right Test Type
The test types are complementary, not competing. A robust test suite uses all of them together. This decision guide maps common validation goals to the appropriate test type.
- Validating response quality or tone → scenario test: response quality is subjective, so an LLM judge with labelled examples is the right evaluator for intent and tone.
- Validating a call transfer destination → tool call test with exact match: destination parameters must be precise, and a near-match on a phone number silently routes calls incorrectly.
- Testing a single decision branch → mid-conversation injection: injecting mid-conversation state avoids replaying the full preamble on every run and focuses execution on the branch being tested.
- Reproducing a production failure → test generated from the conversation: real failure conversations capture the exact phrasing and context that caused the issue, and the generated test prevents the same regression recurring.
- Gating deployments on agent changes → CI/CD integration: API-driven test execution in the pipeline automates validation on every change without requiring manual test runs.
- Validating a newly built flow → start-to-finish simulation: a new flow has no prior conversation history to start from, so full simulation tests the complete path from greeting to resolution.
