ElevenLabs Voice Agent Testing
ElevenLabs provides a dedicated testing framework for conversational voice agents. Unlike traditional software tests that validate deterministic outputs, voice agent testing must account for the probabilistic nature of LLM responses, correct tool invocation under varying inputs, and the branching structure of multi-turn dialogue. The platform addresses each of these with distinct test types and a simulation engine.
Tests can be run manually, triggered via API, or integrated into CI/CD pipelines so that every change to an agent's prompt or workflow is validated before it reaches production. This page covers the six core testing concepts: scenario evaluation, tool call validation, conversation simulation, CI/CD integration, test generation from real conversations, and the API surface.
Scenario Testing
LLM-evaluated tests that assess whether an agent's response meets defined success criteria, with support for labelled examples to anchor the judge.
Tool Call Testing
Validates that the agent invokes the right tool with the right parameters. Supports exact match validation for precision-critical parameters.
Conversation Simulation
Full start-to-finish or partial mid-conversation simulations with tool mocking and custom assertion criteria.
CI/CD Integration
API-driven test execution that slots into existing pipelines. Every prompt or workflow change can be gated on passing test results before deployment.
Tests from Conversations
Automatically generate test cases from real production conversations, grounding the test suite in actual interaction patterns.
API & SDK Support
Full REST API with Python and TypeScript SDKs. Create, run, and retrieve test results programmatically with typed interfaces.
Scenario Testing (LLM Evaluation)
Scenario tests simulate a complete interaction and assess the agent's responses against defined success criteria evaluated by an LLM judge. Because voice agent responses are generative rather than deterministic, success cannot be defined as an exact string match. Instead, criteria describe intent: "the agent confirmed the booking reference", "the agent did not reveal account balance before authentication".
Success & Failure Examples
Each criterion can be accompanied by labelled examples that anchor the evaluator. Without examples, borderline responses may be scored inconsistently. With a few annotated examples showing what "passed" and "failed" look like for that criterion, the evaluator calibrates to the nuances of your specific agent.
This is especially important for criteria that involve tone, phrasing, or implied meaning rather than factual correctness. Examples act as a few-shot prompt for the judge model itself.
Test Creation Sources
Scenario tests can be written from scratch using a structured form — defining the simulated user persona, turn-by-turn conversation goals, and the success criteria to evaluate. Alternatively, tests can be created directly from an existing conversation in the conversation history. The platform extracts the context and pre-populates the test structure, reducing the effort of formalising known-good or known-bad interactions.
Example success criteria: "The agent confirmed the booking reference back to the caller before ending the conversation."
Example failure criteria: "The agent revealed the account balance before the caller was authenticated."
Tool Call Testing
Tool call tests verify that the agent invokes the correct tool with the correct parameters given a particular conversational context. This is distinct from scenario testing: instead of evaluating whether the agent's spoken response is correct, it validates the function call the agent would make before generating a response. For actions like transferring a call, looking up an account, or booking an appointment, the parameter values are not merely important — they are the outcome.
Tool selection validation
Did the agent call the right tool?
Asserts that given a specific conversational context, the agent chose the correct tool to invoke. A caller asking to "change my reservation" should trigger the update-booking tool, not the cancel-booking tool — even when the phrasing is ambiguous.
assert tool_name == "update_booking"
Parameter value assertion
Did the agent pass the right values?
Exact match validation on specific parameters. For a call transfer the destination number must be exactly right. For a date-based lookup, the date format and value must match the system's expectations precisely.
assert params.transfer_to == "+441234567890"
Parameter extraction accuracy
Did the agent parse spoken input correctly?
Validates that the agent correctly extracted a value from spoken language and mapped it to a structured parameter. Tests that "the fourteenth of March" becomes the expected ISO date, or that "my account ending in four two three one" populates the correct account field.
assert params.date == "2025-03-14"
Tool non-invocation
Did the agent correctly not call a tool?
Some tests assert the absence of a tool call. If the caller has not yet provided required information, the agent must not invoke the booking tool. If the caller is not authenticated, the agent must not call the account-lookup tool.
assert tool_calls == []
Exact match validation
The framework supports exact match validation on individual tool parameters. For parameters that must be precise — a transfer destination, an account identifier, a date — fuzzy or LLM-graded evaluation would accept values that would cause runtime failures in the downstream system.
| Parameter type | Validation method | Rationale |
|---|---|---|
| Call transfer destination | Exact match | Any deviation routes the call to the wrong recipient. No tolerance for partial matches. |
| Account or booking ID | Exact match | A near-match on an identifier will silently operate on the wrong record. |
| Date or time value | Exact match | Off-by-one errors in dates create incorrect bookings that are often not caught until the customer calls back. |
| Spoken name or address | LLM-graded | Legitimate phonetic variation exists. An LLM judge can determine whether the extracted value is a valid transcription. |
| Tone or empathy | LLM-graded | There is no single correct phrasing. Evaluation requires understanding of intent and register, not string comparison. |
| Binary flags (e.g. opt-in) | Exact match | True/false parameters have no valid middle ground. An incorrect value has a direct compliance or business consequence. |
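The dispatch implied by the table can be sketched in a few lines. This is an illustrative helper, not the platform's validation API; the LLM-graded branch is stubbed because it would require a judge model.

```python
def validate_param(expected: str, actual: str, method: str = "exact") -> bool:
    """Hypothetical parameter check mirroring the table above."""
    if method == "exact":
        # Precision-critical parameters: any deviation is a failure.
        return expected == actual
    # LLM-graded comparison would defer to a judge model; stubbed here.
    raise NotImplementedError("LLM-graded validation needs a judge model")

# Exact match catches the near-miss that fuzzy grading would let through.
print(validate_param("+441234567890", "+441234567890"))  # True
print(validate_param("+441234567890", "+441234567891"))  # False
```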
Conversation Simulation API
The simulation API drives test execution programmatically. Rather than relying on a live human caller, it runs a synthetic user turn-by-turn against the agent, records the transcript, and exposes results for evaluation. Two simulation modes are supported depending on what you need to validate.
Start-to-finish
The synthetic caller begins the conversation from the opening utterance and drives it through to a defined termination condition (task completion, hang-up signal, or maximum turn count). Used for end-to-end regression tests that cover the full expected interaction path.
- Tests the full conversation lifecycle including greeting and closing
- Validates that the agent reaches the correct terminal state
- Catches regressions introduced by prompt changes at any turn
Mid-conversation injection
The simulation starts from an injected conversation state rather than from turn zero. A pre-existing transcript or a structured context payload positions the agent mid-conversation, then the simulation runs forward from that point. Used to validate specific decision points or sub-flows without re-running the entire preamble on every test run.
- Isolates a single decision branch for targeted testing
- Faster to execute when the path to a decision point is long
- Supports testing of edge-case sub-flows that are rare in full simulations
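An injected conversation state is essentially a partial transcript. The payload below is a sketch of the idea; the field names are assumptions, not the simulation API's documented request body.

```python
# Hypothetical injected-state payload -- field names are illustrative.
injected_state = [
    {"role": "agent", "text": "Which date would you like to move your booking to?"},
    {"role": "user", "text": "The fourteenth of March, please."},
]

# The simulation resumes from the final turn here, so only the
# date-handling branch runs; the greeting and authentication preamble
# are never replayed.
print(len(injected_state))  # 2
```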
Tool mocking
The simulation API supports replacing live tool implementations with mock responses. Rather than calling a real booking system or CRM during a test run, the mock returns a configured payload. This keeps tests deterministic, eliminates external dependencies, and allows testing of error paths (e.g. tool returns a 404 or empty result) that would be difficult to trigger against a live system.
Custom evaluation criteria
Beyond pass/fail scenario evaluation, the simulation API accepts custom evaluation criteria expressed as structured assertions over the transcript. Criteria can target specific turns, tool call parameters, response content, or conversation-level properties such as total turn count or whether the agent asked a clarifying question before proceeding. This allows evaluation logic to be defined in code and version-controlled alongside the agent definition.
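A transcript-level criterion like "the agent asked a clarifying question before calling any tool" can be written as a plain assertion over the transcript structure. The transcript shape below is illustrative; the real criteria format is defined by the simulation API.

```python
# Illustrative transcript -- field names are assumptions for the sketch.
transcript = [
    {"role": "user", "text": "I want to change my reservation."},
    {"role": "agent", "text": "Sure, what's your booking reference?"},
    {"role": "agent", "tool_call": {"name": "update_booking", "params": {"date": "2025-03-14"}}},
]

# Conversation-level property: the agent asked a clarifying question
# before invoking any tool.
first_tool_turn = next(i for i, t in enumerate(transcript) if "tool_call" in t)
asked_first = any(
    "?" in t.get("text", "") for t in transcript[:first_tool_turn] if t["role"] == "agent"
)
print(asked_first)  # True
```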
CI/CD Integration
Voice agent testing integrates into standard CI/CD pipelines via the REST API or the provided SDKs. Every change to the agent's system prompt, tool definitions, or conversation workflow can trigger a test suite run automatically before deployment. This prevents regressions from reaching production, which matters especially for voice agents where a degraded interaction is immediately perceived by callers.
Change is made
Developer modifies the system prompt, updates a tool definition, or changes a conversation workflow in the agent configuration.
Commit & push
The change is committed to version control. The CI pipeline detects the change and triggers the test workflow.
Test suite triggered
The pipeline calls POST /v1/convai/agents/{agent_id}/run-tests with the target agent ID. The platform begins executing the full test suite.
Results polled
The pipeline polls the results endpoint until the run completes. Test execution is asynchronous; polling allows the CI step to block on completion.
Pass/fail evaluated
The pipeline reads the test results. A configured failure threshold determines whether the pipeline proceeds or fails the build.
Deploy or block
On pass: the agent change is deployed to production. On fail: the pipeline blocks deployment and surfaces the failed test cases to the developer.
Example pipeline step (GitHub Actions)
```yaml
- name: Run ElevenLabs agent tests
  env:
    ELEVENLABS_API_KEY: ${{ secrets.ELEVENLABS_API_KEY }}
    AGENT_ID: ${{ vars.AGENT_ID }}
  run: |
    python scripts/run_agent_tests.py \
      --agent-id "$AGENT_ID" \
      --fail-threshold 0.95 \
      --timeout 300
```
Generating Tests from Real Conversations
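The helper script invoked by the pipeline step (scripts/run_agent_tests.py) is not reproduced in this doc. A hedged sketch of its polling and threshold logic follows, with the HTTP call abstracted behind a `fetch_run` callable so the logic runs without a live API key.

```python
import time

def wait_for_run(fetch_run, fail_threshold=0.95, poll_interval=0, timeout=300):
    """Poll fetch_run() until the run leaves pending/running, then compare
    the pass rate against the configured threshold. fetch_run is a stand-in
    for the GET test-run endpoint."""
    deadline = time.monotonic() + timeout
    run = fetch_run()
    while run["status"] in ("pending", "running"):
        if time.monotonic() > deadline:
            raise TimeoutError("test run did not complete in time")
        time.sleep(poll_interval)
        run = fetch_run()
    passed = sum(1 for t in run["tests"] if t["passed"])
    return passed / len(run["tests"]) >= fail_threshold

# Simulated run: completes on the second poll with 19/20 tests passing,
# which exactly meets a 0.95 threshold.
states = iter([
    {"status": "running", "tests": []},
    {"status": "done", "tests": [{"passed": True}] * 19 + [{"passed": False}]},
])
print(wait_for_run(lambda: next(states)))  # True
```

The CI step then exits non-zero when the function returns False, which blocks the deploy stage.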
One of the most efficient sources of test cases is the production conversation history. Rather than writing synthetic scenarios from scratch, the platform can analyse past conversations and automatically generate test cases from them. This grounds the test suite in real-world patterns rather than hypothetical ones.
Select source conversations
Choose conversations from the production history to use as generation inputs. Good candidates are completed interactions that reached a clear outcome — both successful completions and failures make useful test cases.
Platform extracts structure
The platform analyses the conversation transcript and infers the user intent, the agent's decision path, the tools invoked, and the final outcome. This structure becomes the scaffold for the generated test.
Success criteria inferred
Based on the outcome of the source conversation, the platform generates candidate success criteria. For a completed booking, it generates criteria around confirmation, correct date, and authenticated identity. These can be edited before the test is saved.
Test validated and saved
The generated test is run once against the current agent to establish a baseline, then saved to the test suite. From this point it becomes a repeatable regression test that will detect if future prompt changes break the behaviour observed in the source conversation.
Why this matters
Real conversations surface interaction patterns that test authors would never anticipate: unexpected phrasings, mid-conversation topic switches, users who give partial information, or callers who interrupt during critical tool-call sequences. A test suite drawn from production conversations will cover the actual distribution of inputs your agent encounters, not just the happy paths that were obvious at design time.
API Endpoints & SDK Support
The full testing lifecycle is accessible via the REST API. Test suites can be created, updated, executed, and their results retrieved programmatically. Both the Python and TypeScript SDKs wrap these endpoints for typed, idiomatic usage in each language.
| Method | Endpoint | Purpose |
|---|---|---|
| POST | /v1/convai/agents/{agent_id}/run-tests | Execute the full test suite for an agent. Returns a run ID for polling results. |
| GET | /v1/convai/agents/{agent_id}/test-runs/{run_id} | Retrieve the status and results of a specific test run. |
| GET | /v1/convai/agents/{agent_id}/tests | List all tests defined for an agent. |
| POST | /v1/convai/agents/{agent_id}/tests | Create a new test for an agent (scenario or tool call type). |
| PATCH | /v1/convai/agents/{agent_id}/tests/{test_id} | Update an existing test definition, criteria, or examples. |
| DELETE | /v1/convai/agents/{agent_id}/tests/{test_id} | Remove a test from the agent's test suite. |
Python SDK
```python
from elevenlabs import ElevenLabs
import time

client = ElevenLabs(api_key="your-api-key")

# Trigger test suite
run = client.convai.agents.run_tests(
    agent_id="your-agent-id"
)

# Poll until complete
while run.status in ("pending", "running"):
    time.sleep(5)
    run = client.convai.agents.get_test_run(
        agent_id="your-agent-id",
        run_id=run.id,
    )

# Evaluate results
passed = [t for t in run.tests if t.passed]
failed = [t for t in run.tests if not t.passed]
print(f"{len(passed)}/{len(run.tests)} tests passed")
```
TypeScript SDK
```typescript
import { ElevenLabsClient } from "@elevenlabs/elevenlabs-js";

const client = new ElevenLabsClient({
  apiKey: "your-api-key",
});

// Trigger test suite
let run = await client.convai.agents.runTests({
  agentId: "your-agent-id",
});

// Poll until complete
while (run.status === "pending" || run.status === "running") {
  await new Promise((r) => setTimeout(r, 5000));
  run = await client.convai.agents.getTestRun({
    agentId: "your-agent-id",
    runId: run.id,
  });
}

// Evaluate results
const passed = run.tests.filter((t) => t.passed);
console.log(`${passed.length}/${run.tests.length} passed`);
```
Choosing the Right Test Type
The test types are complementary, not competing. A robust test suite uses all of them together. This decision guide maps common validation goals to the appropriate test type.
- Validating response quality or tone → scenario test: response quality is subjective, so an LLM judge with labelled examples is the right evaluator for intent and tone.
- Validating a call transfer destination → tool call test with exact match: destination parameters must be precise, and a near-match on a phone number silently routes calls incorrectly.
- Testing a single decision branch → mid-conversation injection: injecting mid-conversation state avoids replaying the full preamble on every run and focuses execution on the branch being tested.
- Reproducing a production failure → test generated from the conversation: real failure conversations capture the exact phrasing and context that caused the issue, and the generated test prevents the same regression recurring.
- Gating deployments on agent changes → CI/CD integration: API-driven test execution in the pipeline automates validation on every change without requiring manual test runs.
- Validating a newly built flow → start-to-finish simulation: a new flow has no prior conversation history to start from, so full simulation tests the complete path from greeting to resolution.
