Voice Agents: V2V Risks
The V2V Risk Landscape
Voice-to-voice (V2V) agents listen, reason, and respond in natural speech. They operate across healthcare triage, financial services, customer-facing products, internal tooling, and anywhere a real-time voice interface replaces or augments a human. The property that makes them effective (a realistic voice in a live interaction) is the same property that makes them a target for fraud, impersonation, and injection attacks that text-based agents never face.
The risk surface is asymmetric. Generating convincing synthetic speech now costs cents and takes seconds. Detecting it reliably requires dedicated tooling that the voice platform itself does not typically provide. The generation layer and the verification layer are different products, from different vendors, with different threat models.
This catalogue maps what the V2V platform layer covers, where the structural gaps are, and which categories of tooling close them. The goal is a layered security stack where generation, liveness detection, biometrics, and deepfake detection compose into a coherent whole rather than independent silos.
What the Platform Layer Covers
V2V platforms handle generation and processing. These are the capabilities you typically get from the platform itself; they also mark where platform responsibility ends.
TTS
Text-to-speech generation. The core platform capability: converts agent text output into natural speech with control over voice, style, and pacing.
STT
Speech-to-text transcription with word-level timestamps, confidence scores, and multi-speaker detection for parsing incoming audio.
Voice Cloning
Custom voice creation from audio samples, enabling branded or personalised agent personas without using a stock voice.
Consent Verification
Verification that the person submitting a voice sample has the right to clone it. Reduces unconsented enrolment of third-party voices.
Provenance Metadata
Cryptographic markers (e.g. C2PA) embedded in generated audio to enable post-incident attribution and chain-of-custody for synthetic speech.
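Real C2PA manifests use X.509 certificate chains and COSE signatures; as a rough illustration of the underlying idea only, the hypothetical sketch below binds an audio hash to its generator with an HMAC, so tampering with either the audio or the claim breaks verification:

```python
import hashlib
import hmac
import json

SIGNING_KEY = b"demo-key"  # hypothetical; real C2PA uses certificate-based signing

def attach_manifest(audio_bytes: bytes, generator_id: str) -> dict:
    """Build a provenance claim binding the audio hash to its generator."""
    digest = hashlib.sha256(audio_bytes).hexdigest()
    claim = {"generator": generator_id, "audio_sha256": digest}
    payload = json.dumps(claim, sort_keys=True).encode()
    claim["signature"] = hmac.new(SIGNING_KEY, payload, hashlib.sha256).hexdigest()
    return claim

def verify_manifest(audio_bytes: bytes, manifest: dict) -> bool:
    """Check the signature AND that the audio still matches the signed hash."""
    claim = {k: v for k, v in manifest.items() if k != "signature"}
    payload = json.dumps(claim, sort_keys=True).encode()
    expected = hmac.new(SIGNING_KEY, payload, hashlib.sha256).hexdigest()
    return (hmac.compare_digest(expected, manifest["signature"])
            and claim["audio_sha256"] == hashlib.sha256(audio_bytes).hexdigest())
```

Either modification path fails verification: re-recorded audio no longer matches the signed hash, and an edited claim no longer matches the signature.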
Restricted Voice Policies
Automated blocking of clone attempts against known protected identities. Reduces impersonation of public figures at the platform level.
Voice Activity Detection
Per-frame detection of whether audio contains active speech. Drives accurate turn-taking, barge-in handling, and silence management.
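An energy gate is the simplest possible VAD baseline; production systems use trained models rather than a fixed threshold, but the sketch below (threshold chosen arbitrarily for illustration) shows the per-frame shape of the decision:

```python
import math

def frame_energy(frame: list[float]) -> float:
    """Root-mean-square energy of one audio frame (PCM samples in [-1, 1])."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))

def is_speech(frame: list[float], threshold: float = 0.02) -> bool:
    """Crude energy-gate VAD: flag the frame as speech if RMS exceeds
    a threshold. Trained models replace this gate in real pipelines."""
    return frame_energy(frame) > threshold
```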
Turn Detection
End-of-speech detection that identifies natural conversation boundaries, preventing the agent from cutting off mid-sentence or waiting too long.
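Assuming a stream of per-frame speech/non-speech flags from a VAD, end-of-speech can be sketched as a trailing-silence counter; the frame size and silence budget below are illustrative assumptions, not any platform's defaults:

```python
def detect_turn_ends(vad_flags, silence_frames_needed=25):
    """Yield the frame index at which each turn ends: the first time
    `silence_frames_needed` consecutive non-speech frames follow speech.
    At a 20 ms frame size, 25 frames is roughly 500 ms of trailing silence."""
    silence_run = 0
    heard_speech = False
    for i, speaking in enumerate(vad_flags):
        if speaking:
            heard_speech = True
            silence_run = 0
        else:
            silence_run += 1
            if heard_speech and silence_run == silence_frames_needed:
                yield i
                heard_speech = False
                silence_run = 0
```

Tuning the silence budget is the trade-off the entry above describes: too short cuts speakers off mid-sentence, too long leaves the agent waiting.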
Conversation Transcripts
Timestamped records of speaker turns, durations, and full transcript text. Supports audit, replay analysis, and downstream logging.
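A minimal transcript record might look like the following sketch; the field names are assumptions for illustration, not any platform's actual schema:

```python
from dataclasses import dataclass, asdict

@dataclass
class Turn:
    """One timestamped speaker turn in a conversation transcript."""
    speaker: str    # e.g. "agent" or "caller"
    start_s: float  # turn start, seconds from session start
    end_s: float    # turn end
    text: str       # STT transcript for the turn

    @property
    def duration_s(self) -> float:
        return self.end_s - self.start_s

transcript = [
    Turn("caller", 0.0, 2.4, "I'd like to check my balance."),
    Turn("agent", 2.9, 5.1, "Sure, can you confirm your name?"),
]
# Serialise for audit logging and replay analysis
log_records = [asdict(t) for t in transcript]
```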
Transport Integration
Connectivity to telephony (SIP/PSTN), WebRTC, or audio streaming protocols. Handles the audio pipeline from wire to model.
V2V Risk Catalogue
Attack vectors and failure modes specific to voice-to-voice agents. None of these are covered by the platform generation layer; each requires dedicated tooling from the security stack.
Voice Cloning Attack
Attacker collects a few seconds of a target's voice from public audio, social media, or a prior interaction, clones it using freely available tools, and uses the synthetic voice to impersonate that person to the agent.
Liveness Injection
A pre-recorded audio file or TTS output is injected into the audio path to simulate a live speaker. Without liveness checks, the system has no way to distinguish a recording from a real person in real time.
Cross-Platform Deepfake
Audio generated by open-source or third-party TTS systems, such as Coqui, Tortoise-TTS, or XTTS, that platform-specific classifiers cannot detect. Any classifier trained only on the platform's own output has this blind spot.
Agent Impersonation
Synthetic voice engineered to sound like a legitimate agent or employee, used in social engineering to extract sensitive information from users who trust they are speaking to an authorised system or person.
Voiceprint Replay Attack
A genuine voiceprint sample extracted from a prior authenticated session is replayed against a passive biometric system that lacks anti-replay protection, authenticating the attacker as the legitimate user.
Transcript Poisoning
Adversarial audio crafted to cause the STT engine to produce attacker-controlled text, injecting instructions into the downstream LLM context. The agent acts on transcribed words that were never genuinely spoken.
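One partial mitigation is to treat the transcript as untrusted input: mask low-confidence words (a common symptom of adversarial audio) and hand the text to the model as quoted data rather than bare instructions. A hedged sketch, assuming word-level (token, confidence) pairs from the STT layer:

```python
def harden_transcript(words, min_confidence=0.80):
    """Defensive pre-processing of STT output before it reaches the LLM.
    `words` is a list of (token, confidence) pairs from word-level STT.
    Low-confidence tokens are masked, and the result is wrapped as quoted
    caller data so the downstream prompt treats it as content, not commands."""
    kept = [(w if conf >= min_confidence else "[unclear]") for w, conf in words]
    text = " ".join(kept)
    return f"<caller_transcript>{text}</caller_transcript>"
```

Delimiting is not a complete defence against prompt injection, but it removes the easiest path from adversarial audio to attacker-controlled instructions.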
Behavioural Evasion
Attacker studies a target's speech patterns, cadence, vocabulary, and filler words across multiple interactions to construct a synthetic voice that defeats behavioural-anomaly detection trained on that same profile.
Consent Bypass
Voice sample acquired without the subject's knowledge (from a recorded conversation, public video, or intercepted stream) and used to enrol a cloned voice, bypassing the consent verification the platform requires.
Channel Metadata Forgery
Transport-layer metadata (originating number, device ID, session token) is forged and combined with a cloned voice, creating a compound identity attack that defeats controls relying on either signal alone.
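The defence is conjunctive: an independent voice-biometric check and a channel trust check must both pass, so forging one signal alone is no longer sufficient. A minimal sketch with illustrative thresholds:

```python
def compound_identity_check(voice_match: float, channel_trust: float,
                            voice_min: float = 0.9,
                            channel_min: float = 0.8) -> bool:
    """Require BOTH an independent voice-biometric score and a
    transport-layer trust score to clear their thresholds, forcing an
    attacker to defeat two unrelated signals simultaneously."""
    return voice_match >= voice_min and channel_trust >= channel_min
```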
Session Replay Attack
Audio from a legitimate authenticated session is captured and replayed in a later interaction. Stateful agents that do not bind sessions to cryptographic tokens may accept the replay as the original authenticated user.
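A common mitigation is to bind each session to a signed, single-use, short-lived token. The sketch below uses an HMAC with an in-memory nonce set; the key handling and nonce store are deliberate simplifications of what a production deployment would use:

```python
import hashlib
import hmac
import secrets

KEY = b"server-secret"          # hypothetical server-side signing key
_seen_nonces: set[str] = set()  # production: shared store with expiry

def issue_token(session_id: str, now: float) -> str:
    """Mint a signed token bound to one session at one point in time."""
    nonce = secrets.token_hex(8)
    msg = f"{session_id}:{nonce}:{int(now)}".encode()
    sig = hmac.new(KEY, msg, hashlib.sha256).hexdigest()
    return f"{session_id}:{nonce}:{int(now)}:{sig}"

def accept_token(token: str, now: float, max_age_s: int = 300) -> bool:
    """Reject tokens that are forged, expired, or already used (replayed)."""
    session_id, nonce, issued, sig = token.rsplit(":", 3)
    msg = f"{session_id}:{nonce}:{issued}".encode()
    expected = hmac.new(KEY, msg, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, sig):
        return False
    if now - int(issued) > max_age_s or nonce in _seen_nonces:
        return False
    _seen_nonces.add(nonce)
    return True
```

Replayed audio carrying an already-consumed token fails the nonce check even if every acoustic signal passes.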
Secure Call Flow Architecture
A hardened V2V interaction passes through multiple verification layers before reaching agent logic. Each layer catches a different class of attack. Failing any layer routes to a challenge or rejection path before the agent ever responds.
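The routing logic can be sketched as an ordered chain of scoring layers, where a confident failure rejects immediately, an uncertain score routes to a challenge, and only a clean pass through every layer reaches the agent. Thresholds and layer implementations here are placeholders, not vendor APIs:

```python
from typing import Callable

# Each layer maps session context to a confidence in [0, 1]
# that the caller is legitimate.
Layer = Callable[[dict], float]

def route_call(session: dict, layers: list[tuple[str, Layer]],
               pass_at: float = 0.9, challenge_at: float = 0.6) -> str:
    """Run verification layers in order before any agent logic executes.
    Low confidence rejects, middling confidence challenges, and only a
    clean pass through every layer reaches the agent."""
    for name, layer in layers:
        score = layer(session)
        if score < challenge_at:
            return f"reject:{name}"
        if score < pass_at:
            return f"challenge:{name}"
    return "agent"
```

The layer order matters in practice: cheap checks (liveness) run before expensive ones (biometric matching), so most attacks are shed early.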
Gap Analysis
A V2V platform can tell you that a given piece of audio was produced by its own system; it cannot tell you that the speaker is who they claim to be. The gaps below are where an attacker walks through an unaugmented V2V deployment.
| Gap | Attack it enables | Tooling required |
|---|---|---|
| Voiceprint authentication | Speaker claims to be an enrolled user but is an impersonator or cloned voice | Voice biometric matching against an enrolled speaker profile |
| Liveness detection | Replayed recording or TTS output injected in place of a live speaker | Anti-spoofing model distinguishing live speech from synthetic or replayed audio |
| Cross-platform deepfake detection | Synthetic audio from open-source or third-party TTS not caught by the platform classifier | Ensemble detection model covering all major TTS generators, not just the platform's own |
| Speaker identity verification | Voice cloning attack against voice-based authentication | Matching speaker audio against an enrolled profile captured at onboarding |
| Fraud scoring | Compound attacks combining multiple vectors that no single check catches alone | Risk score combining voice features, device fingerprint, and behavioural signals |
| Behavioural analysis | Slow-burn impersonation refined across multiple sessions to evade anomaly detection | Speech cadence, vocabulary, and pattern modelling against a historical baseline |
The Security Stack
Representative providers in each gap category. Each specialises in a different part of the verification problem; they compose rather than compete.
Pindrop
Voice biometrics, liveness detection, and fraud scoring built for real-time voice pipelines. Analyses 1,380+ audio features, including background noise, device fingerprint, and speech patterns, to produce a per-session fraud risk score. Widely deployed in high-assurance voice environments.
Reality Defender
Cross-platform deepfake audio detection using ensemble models trained across all major TTS systems. Provides a confidence score for synthetic speech from any source, not limited to any single vendor's output. Designed for real-time stream analysis at scale.
ID R&D
Passive liveness detection and voice biometric authentication with no challenge-response friction. Anti-spoofing models trained against replay, synthesis, and voice conversion attacks. Runs continuously in the background without requiring the user to take any special action.
Nuance (Microsoft)
Enterprise voice biometrics platform with a long deployment history in regulated industries. Integrates with the Microsoft identity and compliance ecosystem. Supports both text-dependent and text-independent voiceprint verification.
Design Principles
Generation ≠ Verification
A platform that generates voice cannot fully verify it. The generation layer and the security stack are distinct concerns, built by different vendors, with different threat models. They sit alongside each other, not inside each other.
Layered Defence
No single provider covers all V2V risk vectors. Liveness, voiceprint biometrics, and deepfake detection are distinct capabilities that must compose into a unified pipeline, each catching a different attack class.
Provenance by Default
C2PA metadata should be embedded in all agent-generated audio from the start, creating an auditable provenance chain for post-incident forensics before any incident ever occurs.
Enrol at the Edge
Voiceprint enrolment should happen at the earliest authenticated touchpoint (during onboarding, not at first authentication) to build a genuine baseline before any attacker can attempt to enrol a cloned voice first.
Risk-Scored Routing
Fraud scores from voice analysis should feed into the same risk engine as transactional and behavioural signals, enabling compound risk decisions rather than independent siloed voice checks that can be individually bypassed.
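A minimal fusion step might look like the sketch below; the weights are illustrative assumptions, not calibrated values, and a real risk engine would learn them from labelled fraud data:

```python
def compound_risk(voice_risk: float, device_risk: float,
                  behaviour_risk: float,
                  weights: tuple[float, float, float] = (0.5, 0.25, 0.25)) -> float:
    """Fuse independent per-signal risk scores (each in [0, 1]) into one
    compound score for the shared risk engine, so no single bypassed
    check clears the session on its own."""
    signals = (voice_risk, device_risk, behaviour_risk)
    return sum(w * s for w, s in zip(weights, signals))
```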
Fail to Challenge
When any verification layer is uncertain, route to an out-of-band challenge rather than terminating the interaction outright. Legitimate users can complete the challenge; most attackers cannot, and the challenge path prevents false positives from turning away real users.
