Voice Agents: V2V Risks

The V2V Risk Landscape

Voice-to-voice (V2V) agents listen, reason, and respond in natural speech. They operate across healthcare triage, financial services, customer-facing products, internal tooling, and anywhere a real-time voice interface replaces or augments a human. The property that makes them effective (a realistic voice in a live interaction) is the same property that makes them a target for fraud, impersonation, and injection attacks that text-based agents never face.

The risk surface is asymmetric. Generating convincing synthetic speech now costs cents and takes seconds. Detecting it reliably requires dedicated tooling that the voice platform itself does not typically provide. The generation layer and the verification layer are different products, from different vendors, with different threat models.

This catalogue maps what the V2V platform layer covers, where the structural gaps are, and which categories of tooling close them. The goal is a layered security stack where generation, liveness detection, biometrics, and deepfake detection compose into a coherent whole rather than independent silos.

What the Platform Layer Covers

V2V platforms handle generation and processing. These are the capabilities you typically get from the platform itself, and where platform responsibility ends.

TTS

Text-to-speech generation. The core platform capability: converts agent text output into natural speech with control over voice, style, and pacing.

STT

Speech-to-text transcription with word-level timestamps, confidence scores, and multi-speaker detection for parsing incoming audio.

Voice Cloning

Custom voice creation from audio samples, enabling branded or personalised agent personas without using a stock voice.

Consent Verification

Verification that the person submitting a voice sample has the right to clone it. Reduces unconsented enrolment of third-party voices.

Provenance Metadata

Cryptographic markers (e.g. C2PA) embedded in generated audio to enable post-incident attribution and chain-of-custody for synthetic speech.
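
To make the idea concrete, here is a minimal sketch of a provenance manifest for generated audio: a claim recording the generator, a timestamp, and a content hash, signed so tampering is detectable. This is an illustration only; real C2PA manifests use a standardised binary format and certificate-based signing, not this ad-hoc JSON-and-HMAC structure.

```python
import hashlib, hmac, json, time

SIGNING_KEY = b"demo-key"   # stand-in for a real signing credential

def make_manifest(audio: bytes, generator: str) -> dict:
    """Build and sign a provenance claim for one audio artefact."""
    claim = {
        "generator": generator,
        "created": int(time.time()),
        "content_sha256": hashlib.sha256(audio).hexdigest(),
    }
    payload = json.dumps(claim, sort_keys=True).encode()
    claim["signature"] = hmac.new(SIGNING_KEY, payload, hashlib.sha256).hexdigest()
    return claim

def verify_manifest(audio: bytes, manifest: dict) -> bool:
    """Check the signature matches the claim and the hash matches the audio."""
    claim = {k: v for k, v in manifest.items() if k != "signature"}
    payload = json.dumps(claim, sort_keys=True).encode()
    sig_ok = hmac.compare_digest(
        manifest["signature"],
        hmac.new(SIGNING_KEY, payload, hashlib.sha256).hexdigest(),
    )
    return sig_ok and claim["content_sha256"] == hashlib.sha256(audio).hexdigest()
```

Because the hash is bound to the exact bytes, any re-encoding or splicing of the audio breaks verification, which is what makes the manifest useful for post-incident attribution.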

Restricted Voice Policies

Automated blocking of clone attempts against known protected identities. Reduces impersonation of public figures at the platform level.

Voice Activity Detection

Per-frame detection of whether audio contains active speech. Drives accurate turn-taking, barge-in handling, and silence management.
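
A per-frame VAD can be sketched in a few lines. This is a deliberately simple energy-threshold version with a hangover counter to bridge short dips inside a word; the frame size, threshold, and hangover values are illustrative assumptions, not platform defaults (production VADs are typically learned models).

```python
FRAME_MS = 20            # analysis window per frame
ENERGY_THRESHOLD = 0.01  # normalised RMS energy above which a frame counts as speech
HANGOVER_FRAMES = 5      # keep "speech" active briefly after energy drops

def rms(frame: list[float]) -> float:
    """Root-mean-square energy of one frame of normalised samples."""
    return (sum(s * s for s in frame) / len(frame)) ** 0.5

def vad(frames: list[list[float]]) -> list[bool]:
    """Label each frame True (speech) or False (silence)."""
    labels, hang = [], 0
    for frame in frames:
        if rms(frame) >= ENERGY_THRESHOLD:
            hang = HANGOVER_FRAMES
            labels.append(True)
        elif hang > 0:
            hang -= 1
            labels.append(True)   # bridge short dips inside a word
        else:
            labels.append(False)
    return labels
```

The per-frame labels are exactly what turn-taking and barge-in logic consume downstream.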

Turn Detection

End-of-speech detection that identifies natural conversation boundaries, preventing the agent from cutting off mid-sentence or waiting too long.
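
Built on per-frame VAD labels, end-of-turn detection can be as simple as a trailing-silence rule: the turn ends once the speaker has been heard and then silent for long enough. The 700 ms threshold here is an illustrative figure, not a platform default.

```python
FRAME_MS = 20
END_OF_TURN_SILENCE_MS = 700   # illustrative; real systems tune or learn this

def end_of_turn(vad_labels: list[bool]) -> bool:
    """True once the speaker has been silent long enough to yield the turn."""
    needed = END_OF_TURN_SILENCE_MS // FRAME_MS
    if len(vad_labels) < needed:
        return False
    # Turn ends only if we have heard speech and the tail is all silence.
    tail = vad_labels[-needed:]
    return any(vad_labels) and not any(tail)
```

Tuning the threshold is the trade-off the card describes: too short and the agent cuts speakers off mid-sentence, too long and it leaves awkward gaps.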

Conversation Transcripts

Timestamped records of speaker turns, durations, and full transcript text. Supports audit, replay analysis, and downstream logging.

Transport Integration

Connectivity to telephony (SIP/PSTN), WebRTC, or audio streaming protocols. Handles the audio pipeline from wire to model.

V2V Risk Catalogue

Attack vectors and failure modes specific to voice-to-voice agents. None of these are covered by the platform generation layer; each requires dedicated tooling from the security stack.

Identity Fraud

Voice Cloning Attack

Attacker collects a few seconds of a target's voice from public audio, social media, or a prior interaction, clones it using freely available tools, and uses the synthetic voice to impersonate that person to the agent.

Spoofing

Liveness Injection

A pre-recorded audio file or TTS output is injected into the audio path to simulate a live speaker. Without liveness checks, the system has no way to distinguish a recording from a real person in real time.

Detection Gap

Cross-Platform Deepfake

Audio generated by open-source or third-party TTS systems, such as Coqui, Tortoise-TTS, or XTTS, that platform-specific classifiers cannot detect. Any classifier trained only on the platform's own output has this blind spot.

Vishing

Agent Impersonation

Synthetic voice engineered to sound like a legitimate agent or employee, used in social engineering to extract sensitive information from users who trust they are speaking to an authorised system or person.

Biometric Bypass

Voiceprint Replay Attack

A genuine voiceprint sample extracted from a prior authenticated session is replayed against a passive biometric system that lacks anti-replay protection, authenticating the attacker as the legitimate user.

Injection

Transcript Poisoning

Adversarial audio crafted to cause the STT engine to produce attacker-controlled text, injecting instructions into the downstream LLM context. The agent acts on transcribed words that were never genuinely spoken.
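
One partial mitigation is to gate the transcript before it reaches the LLM context: drop low-confidence words (adversarial audio often decodes with low confidence) and flag instruction-like phrasing for review rather than passing it through. The word/confidence schema and the phrase list below are illustrative assumptions, not a real STT output format, and a static phrase list is a weak filter on its own.

```python
SUSPECT_PHRASES = ("ignore previous", "system prompt", "disregard instructions")
MIN_CONFIDENCE = 0.6

def gate_transcript(words: list[dict]) -> tuple[str, bool]:
    """Return (clean_text, suspicious) for one transcribed utterance."""
    kept, suspicious = [], False
    for w in words:
        if w["confidence"] < MIN_CONFIDENCE:
            suspicious = True        # possible adversarial decode
            continue
        kept.append(w["word"])
    text = " ".join(kept)
    if any(p in text.lower() for p in SUSPECT_PHRASES):
        suspicious = True            # injection-style phrasing in the transcript
    return text, suspicious
```

A flagged utterance should route to the challenge path rather than into the agent's context.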

Fraud

Behavioural Evasion

Attacker studies a target's speech patterns, cadence, vocabulary, and filler words across multiple interactions to construct a synthetic voice that defeats behavioural-anomaly detection trained on that same profile.

Governance

Consent Bypass

Voice sample acquired without the subject's knowledge (from a recorded conversation, public video, or intercepted stream) and used to enrol a cloned voice, bypassing the consent verification the platform requires.

Compound Fraud

Channel Metadata Forgery

Transport-layer metadata (originating number, device ID, session token) is forged and combined with a cloned voice, creating a compound identity attack that defeats controls relying on either signal alone.

Spoofing

Session Replay Attack

Audio from a legitimate authenticated session is captured and replayed in a later interaction. Stateful agents that do not bind sessions to cryptographic tokens may accept the replay as the original authenticated user.
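
The mitigation the card names, binding sessions to cryptographic tokens, can be sketched with HMAC-signed one-time tokens: each token carries a nonce and expiry, and a consumed nonce is never accepted again. All names here are illustrative, and the in-memory nonce set stands in for a shared store in a real deployment.

```python
import hashlib, hmac, secrets, time

SERVER_KEY = secrets.token_bytes(32)   # stand-in for a managed secret
_seen_nonces: set[str] = set()

def issue_session_token(session_id: str, ttl_s: int = 300) -> str:
    """Mint a one-time token bound to this session and an expiry timestamp."""
    nonce = secrets.token_hex(16)
    expires = str(int(time.time()) + ttl_s)
    msg = f"{session_id}|{nonce}|{expires}".encode()
    sig = hmac.new(SERVER_KEY, msg, hashlib.sha256).hexdigest()
    return f"{session_id}|{nonce}|{expires}|{sig}"

def verify_session_token(token: str, session_id: str) -> bool:
    """Reject forged, expired, cross-session, or replayed tokens."""
    try:
        sid, nonce, expires, sig = token.split("|")
    except ValueError:
        return False
    msg = f"{sid}|{nonce}|{expires}".encode()
    expected = hmac.new(SERVER_KEY, msg, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        return False
    if sid != session_id or int(expires) < time.time():
        return False
    if nonce in _seen_nonces:          # replay: token already consumed
        return False
    _seen_nonces.add(nonce)
    return True
```

Because the nonce is single-use, replaying a captured token from a legitimate session fails even though its signature is valid.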

Secure Call Flow Architecture

A hardened V2V interaction passes through multiple verification layers before reaching agent logic. Each layer catches a different class of attack. Failing any layer routes to a challenge or rejection path before the agent ever responds.

Incoming Audio
→ VAD Check (voice platform)
→ Liveness Detection (liveness API, security layer)
→ Deepfake Scan (detection API, security layer)
→ Voiceprint Auth (biometric API, security layer)
→ Fraud Score (risk engine, security layer)
→ Agent Logic (voice platform)
→ TTS Response (voice platform)
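
The flow can be sketched as a sequential pipeline in which each verification layer returns a verdict, and the first non-pass verdict routes to rejection or an out-of-band challenge before agent logic ever runs. The layer names and verdict scheme are illustrative.

```python
from enum import Enum
from typing import Callable

class Verdict(Enum):
    PASS = "pass"
    UNCERTAIN = "uncertain"
    FAIL = "fail"

Layer = Callable[[bytes], Verdict]

def run_call_flow(audio: bytes, layers: list[tuple[str, Layer]]) -> str:
    """Return 'agent', 'challenge', or 'reject' for one audio turn."""
    for name, check in layers:
        verdict = check(audio)
        if verdict is Verdict.FAIL:
            return "reject"          # hard failure: end or escalate
        if verdict is Verdict.UNCERTAIN:
            return "challenge"       # fail-to-challenge, not fail-closed
    return "agent"                   # all layers passed: run agent logic

# Usage with stub layers: the uncertain deepfake scan routes to "challenge".
layers = [
    ("vad", lambda a: Verdict.PASS),
    ("liveness", lambda a: Verdict.PASS),
    ("deepfake", lambda a: Verdict.UNCERTAIN),
    ("voiceprint", lambda a: Verdict.PASS),
]
```

Ordering matters: cheap checks (VAD) run first, and expensive ones (voiceprint matching) run only on audio that has already survived the earlier layers.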

Gap Analysis

A V2V platform can attest that audio was produced by its own system; it cannot attest that the speaker is who they claim to be. The gaps below are where an attacker walks through an unaugmented V2V deployment.

| Gap | Attack it enables | Tooling required | Provider options |
| --- | --- | --- | --- |
| Voiceprint authentication | Speaker claims to be an enrolled user but is an impersonator or cloned voice | Voice biometric matching against an enrolled speaker profile | Pindrop, Nuance, ID R&D |
| Liveness detection | Replayed recording or TTS output injected in place of a live speaker | Anti-spoofing model distinguishing live speech from synthetic or replayed audio | Pindrop, ID R&D, Reality Defender |
| Cross-platform deepfake detection | Synthetic audio from open-source or third-party TTS not caught by the platform classifier | Ensemble detection model covering all major TTS generators, not just the platform's own | Reality Defender |
| Speaker identity verification | Voice cloning attack against voice-based authentication | Matching speaker audio against an enrolled profile captured at onboarding | Pindrop, Nuance |
| Fraud scoring | Compound attacks combining multiple vectors that no single check catches alone | Risk score combining voice features, device fingerprint, and behavioural signals | Pindrop |
| Behavioural analysis | Slow-burn impersonation refined across multiple sessions to evade anomaly detection | Speech cadence, vocabulary, and pattern modelling against a historical baseline | Pindrop, Nuance |

The Security Stack

Representative providers in each gap category. Each specialises in a different part of the verification problem; they compose rather than compete.

Pindrop

Liveness · Voiceprint auth · Fraud scoring · Behavioural analysis

Voice biometrics, liveness detection, and fraud scoring built for real-time voice pipelines. Analyses 1,380+ audio features, including background noise, device fingerprint, and speech patterns, to produce a per-session fraud risk score. Widely deployed in high-assurance voice environments.

Reality Defender

Cross-platform deepfake detection · Liveness

Cross-platform deepfake audio detection using ensemble models trained across all major TTS systems. Provides a confidence score for synthetic speech from any source, not limited to any single vendor's output. Designed for real-time stream analysis at scale.

ID R&D

Liveness · Voiceprint auth · Anti-spoofing

Passive liveness detection and voice biometric authentication with no challenge-response friction. Anti-spoofing models trained against replay, synthesis, and voice conversion attacks. Runs continuously in the background without requiring the user to take any special action.

Nuance (Microsoft)

Voiceprint auth · Behavioural analysis · Speaker verification

Enterprise voice biometrics platform with a long deployment history in regulated industries. Integrates with the Microsoft identity and compliance ecosystem. Supports both text-dependent and text-independent voiceprint verification.

Design Principles

Generation ≠ Verification

A platform that generates voice cannot fully verify it. The generation layer and the security stack are distinct concerns, built by different vendors, with different threat models. They sit alongside each other, not inside each other.

Layered Defence

No single provider covers all V2V risk vectors. Liveness, voiceprint biometrics, and deepfake detection are distinct capabilities that must compose into a unified pipeline, each catching a different attack class.

Provenance by Default

C2PA metadata should be embedded in all agent-generated audio from the start, creating an auditable provenance chain for post-incident forensics before any incident ever occurs.

Enrol at the Edge

Voiceprint enrolment should happen at the earliest authenticated touchpoint (onboarding, not authentication) to build a genuine baseline before any attacker can attempt to enrol a cloned voice first.

Risk-Scored Routing

Fraud scores from voice analysis should feed into the same risk engine as transactional and behavioural signals, enabling compound risk decisions rather than independent siloed voice checks that can be individually bypassed.
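
A minimal sketch of that compound decision, assuming per-signal risk scores in [0, 1]; the weights and thresholds are illustrative, not recommended values.

```python
WEIGHTS = {"voice": 0.5, "device": 0.3, "behaviour": 0.2}
CHALLENGE_THRESHOLD = 0.4
REJECT_THRESHOLD = 0.8

def compound_risk(signals: dict[str, float]) -> float:
    """Weighted combination of per-signal risk scores in [0, 1]."""
    return sum(WEIGHTS[k] * signals.get(k, 0.0) for k in WEIGHTS)

def route(signals: dict[str, float]) -> str:
    """Allow, challenge, or reject based on the combined score."""
    score = compound_risk(signals)
    if score >= REJECT_THRESHOLD:
        return "reject"
    if score >= CHALLENGE_THRESHOLD:
        return "challenge"
    return "allow"
```

The point of combining signals is that a moderately suspicious voice plus a moderately suspicious device crosses the challenge threshold even though neither signal alone would.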

Fail to Challenge

When any verification layer is uncertain, route to an out-of-band challenge rather than terminating the interaction outright. Legitimate users can complete the challenge; most attackers cannot, and false positives no longer turn real users away.