Hallucinations & Grounding


Types of Hallucinations

Factual Hallucinations

The model generates statements that contradict established facts or real-world knowledge. This includes incorrect dates, false statistics, non-existent citations, made-up historical events, or wrong scientific claims.

False Claims · Wrong Facts · Fake Citations

Contextual Hallucinations

The model generates information not supported by the provided context, even if factually true in general. In RAG systems, this occurs when the model ignores retrieved documents and generates from parametric memory.

Ungrounded · Ignores Context · Breaks Citations

Instruction Hallucinations

The model fails to follow explicit instructions while appearing to comply. It may claim to have performed a task it didn't do, fabricate tool outputs, or misrepresent its own capabilities.

False Compliance · Fake Tool Output · Phantom Actions

Entity Hallucinations

The model invents non-existent entities such as fake people, organizations, products, URLs, or API endpoints. These plausible-sounding inventions can lead to broken links and failed integrations.

Fake People · Invalid URLs · Made-up APIs

The Mechanism Behind Entity Hallucinations

Anthropic's interpretability research traced the internal circuitry of Claude to find out why entity hallucinations happen. The answer inverts the conventional explanation.

Refusal is the Default

The conventional view is that models hallucinate because they're trained to always produce output: completion machines that fill gaps with plausible text. Interpretability research found the opposite. Claude has a circuit, active by default, that causes it to state it lacks sufficient information. The model's natural state is to decline.

Known Entity Recognition Overrides It

What allows Claude to answer at all is a separate "known answer" feature. When Claude recognizes a well-known entity, say Michael Jordan, this feature activates and inhibits the default refusal circuit. Hallucinations happen when this recognition system misfires: an unfamiliar name triggers just enough familiarity that the "known entity" feature incorrectly activates and suppresses refusal, leaving Claude with no actual knowledge and no refusal either.


Anthropic confirmed this by intervention: artificially activating the "known answer" features while asking about unknown entities reliably caused hallucination. Inhibiting the "can't answer" features did the same. Hallucination isn't recklessness — it's a safety default being incorrectly overridden by a misfiring recognition system.


Causes in Agentic Applications

Multi-Turn Context Drift

In long conversations, the model's understanding of context degrades. Early information may be forgotten or conflated. Attention tends to favor tokens at the beginning and end of the context, producing "lost in the middle" effects where mid-context information is under-weighted.

Tool Output Misinterpretation

Agents frequently call external tools and must interpret their outputs. Hallucinations occur when models misread responses, assume success when tools failed, or fabricate outputs entirely.

Planning Overconfidence

When generating multi-step plans, models may hallucinate capabilities they don't have, assume tool availability that doesn't exist, or create logically impossible sequences.

Memory System Corruption

Agents using external memory can hallucinate when retrieving or updating memories. They may recall events that didn't happen, merge distinct memories, or write corrupted summaries.

Inter-Agent Communication

In multi-agent systems, hallucinations can propagate between agents. One agent's fabricated output becomes another's trusted input, causing errors to compound across the system.

Reward Hacking Behaviors

Models optimized to appear helpful may generate confident-sounding but false information rather than admitting uncertainty. This produces fluent, convincing hallucinations.


Prevention Techniques

Retrieval-Augmented Generation (RAG)

Ground model outputs in retrieved documents from trusted knowledge bases. RAG reduces hallucinations by providing factual context the model must synthesize rather than generate from memory.

Document Grounding · Vector Search
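A minimal sketch of the grounding step. The retrieval function here is a naive keyword-overlap stand-in for a real vector search, and the prompt wording is illustrative, not a fixed API:

```python
def retrieve_top_k(query: str, documents: list[str], k: int = 2) -> list[str]:
    """Naive keyword-overlap retrieval; a real system would use vector search."""
    q_terms = set(query.lower().split())
    ranked = sorted(
        documents,
        key=lambda d: len(q_terms & set(d.lower().split())),
        reverse=True,
    )
    return ranked[:k]

def build_grounded_prompt(query: str, documents: list[str]) -> str:
    """Instruct the model to synthesize ONLY from retrieved context."""
    context = "\n\n".join(
        f"[Doc {i + 1}] {d}"
        for i, d in enumerate(retrieve_top_k(query, documents))
    )
    return (
        "Answer using ONLY the context below. If the context does not "
        "contain the answer, say \"I don't know.\"\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )
```

The key design choice is the explicit escape hatch ("I don't know"): without it, the instruction to answer from context alone still pressures the model to fill gaps from parametric memory.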

Citation Requirements

Require the model to cite specific sources for each claim. Citations create accountability and enable verification. A prompt such as "cite the exact passage supporting this claim" forces grounding.

Inline Citations · Source Traceability

Structured Output Constraints

Use JSON schemas, Pydantic models, or other structured formats to constrain generation. Structured outputs reduce hallucination surface area with explicit fields like 'source_document' or 'confidence_score'.

Schema Enforcement · Type Safety
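A stdlib-only sketch of schema enforcement (a production system might use Pydantic instead); the field names follow the 'source_document' and 'confidence_score' examples above, and the raw JSON is assumed to come from the model:

```python
import json
from dataclasses import dataclass

@dataclass
class GroundedAnswer:
    answer: str
    source_document: str     # which retrieved document supports the answer
    confidence_score: float  # self-reported confidence in [0, 1]

def parse_grounded_answer(raw: str) -> GroundedAnswer:
    """Parse model output and reject anything violating the schema."""
    data = json.loads(raw)
    missing = {"answer", "source_document", "confidence_score"} - data.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    score = float(data["confidence_score"])
    if not 0.0 <= score <= 1.0:
        raise ValueError("confidence_score must be in [0, 1]")
    return GroundedAnswer(data["answer"], data["source_document"], score)
```

Forcing the model to name a source document and a confidence score shrinks the space of unsupported free-form claims, and a parse failure becomes a detectable event rather than a silent hallucination.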

Uncertainty Expression

Train and prompt models to express uncertainty rather than hallucinating confident answers. Calibrated models should say 'I don't know' or 'I'm not certain' when appropriate.

Calibrated Confidence · Hedged Language

Detection Methods

Self-Consistency Checking

Generate multiple responses to the same query and check for consistency. Hallucinations often vary across samples while factual information remains stable.
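The check reduces to a majority vote over sampled answers. A minimal sketch, assuming the samples have already been generated and that exact-match after normalization is a good enough notion of agreement:

```python
from collections import Counter

def self_consistency(samples: list[str], threshold: float = 0.6) -> tuple[str, bool]:
    """Majority-vote over multiple sampled answers to the same query.

    Returns (most common answer, flagged), where flagged=True means
    agreement fell below the threshold -- a hallucination signal.
    """
    normalized = [s.strip().lower() for s in samples]
    answer, count = Counter(normalized).most_common(1)[0]
    agreement = count / len(normalized)
    return answer, agreement < threshold
```

Real systems often replace exact-match with semantic clustering, since two factually identical answers can differ in surface wording; the agreement threshold is an assumption to tune.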

Entailment Verification

Use NLI models to verify that generated claims are entailed by source documents. Claims labeled as 'contradiction' or 'neutral' indicate potential hallucinations.
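A sketch of the verification loop. The NLI model itself is stubbed with a hypothetical word-overlap heuristic so the example is self-contained; in production it would be a trained NLI classifier returning entailment/neutral/contradiction:

```python
def nli_label(premise: str, hypothesis: str) -> str:
    """Stand-in for a real NLI model: high word overlap ~ entailment.

    This heuristic exists only to make the loop below runnable;
    swap in an actual NLI classifier for real use.
    """
    p = set(premise.lower().split())
    h = set(hypothesis.lower().split())
    return "entailment" if h and len(h & p) / len(h) > 0.7 else "neutral"

def find_unsupported_claims(source: str, claims: list[str]) -> list[str]:
    """Flag claims the source does not entail as potential hallucinations."""
    return [c for c in claims if nli_label(source, c) != "entailment"]
```

Note the direction of the check: the source document is the premise and each generated claim is the hypothesis, so anything not entailed by the source gets flagged even if it happens to be true in general.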

External Knowledge Verification

Verify factual claims against external knowledge bases, search engines, or fact-checking APIs. Catches factual hallucinations that internal checks might miss.

LLM-as-Judge Detection

Use a separate LLM to evaluate whether responses are grounded in provided context. The judge model analyzes source documents and generated text, flagging unsupported claims.
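A sketch of the judge loop. The prompt wording is illustrative, and `call_judge_model` is an injected placeholder for whatever LLM client is in use:

```python
JUDGE_PROMPT = """You are a grounding judge. Given SOURCE and RESPONSE,
reply with exactly SUPPORTED or UNSUPPORTED for the response as a whole.

SOURCE:
{source}

RESPONSE:
{response}"""

def judge_grounding(source: str, response: str, call_judge_model) -> bool:
    """Return True if the judge model deems the response grounded.

    `call_judge_model` takes a prompt string and returns the judge's
    reply; injecting it keeps this loop client-agnostic.
    """
    verdict = call_judge_model(
        JUDGE_PROMPT.format(source=source, response=response)
    )
    return verdict.strip().upper().startswith("SUPPORTED")
```

Constraining the judge to a fixed verdict vocabulary makes the output machine-parseable; richer variants ask for a per-claim verdict list instead of a single label.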

Confessions Integration

OpenAI's confession technique is reported to detect hallucinations with 84% accuracy. After generation, prompt the model to honestly assess whether it fabricated information.
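A sketch of the post-hoc step. The prompt wording is an assumption, not OpenAI's exact formulation, and `call_model` is a placeholder for a real LLM call:

```python
CONFESSION_PROMPT = """Review your previous answer below. Did you state
anything you are not certain is true, or fabricate any detail?
Reply YES or NO, then explain.

Previous answer:
{answer}"""

def run_confession(answer: str, call_model) -> bool:
    """Post-hoc self-assessment: True if the model confesses to fabrication.

    `call_model` takes a prompt string and returns the model's reply.
    """
    reply = call_model(CONFESSION_PROMPT.format(answer=answer))
    return reply.strip().upper().startswith("YES")
```

Because this runs after generation on a single extra call, it fits naturally into monitoring pipelines where the latency budget is tight.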

Token Probability Analysis

Analyze token-level probabilities during generation. Low-probability tokens or high entropy sequences may indicate uncertainty and potential hallucination.
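A sketch of the flagging rule, assuming the serving stack exposes per-token log-probabilities; the two thresholds are illustrative and need tuning per model:

```python
def flag_low_confidence(token_logprobs: list[float],
                        min_avg_logprob: float = -1.5,
                        min_token_logprob: float = -4.0) -> bool:
    """Flag a generation whose token probabilities suggest uncertainty.

    Flags when the average log-probability is low (the whole answer was
    a stretch) or any single token was highly improbable (one invented
    detail in an otherwise confident answer).
    """
    if not token_logprobs:
        return False
    avg = sum(token_logprobs) / len(token_logprobs)
    return avg < min_avg_logprob or min(token_logprobs) < min_token_logprob
```

The per-token minimum matters in practice: a fabricated name or date often shows up as one sharp probability dip inside an otherwise fluent sequence, which an average alone would smooth over.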

Detection Methods Comparison

| Method | Latency Impact | Accuracy | Best For |
| --- | --- | --- | --- |
| Self-Consistency | High (multiple samples) | Medium | Factual queries, QA |
| NLI Entailment | Low to Medium | High for RAG | Document-grounded responses |
| External Verification | Medium to High | High for facts | Factual claims, citations |
| LLM-as-Judge | Medium | High | Complex grounding assessment |
| Confessions | Low | 84% | Post-hoc monitoring |
| Token Probability | Very Low | Medium | Real-time flagging |

Agentic-Specific Patterns

Tool Output Verification

Never trust agent-generated tool outputs without verification. Parse and validate actual tool responses. Implement schemas for expected outputs. Log discrepancies between claimed and actual tool results.

  • Schema validation for tool outputs
  • Claimed vs actual comparison
  • Structured error handling
  • Output logging and audit

Memory Validation

Implement verification when reading from and writing to agent memory systems. Hash or version memory entries. Cross-check retrieved memories against other sources. Implement decay scoring for older memories.

  • Memory entry versioning
  • Read/write verification
  • Confidence decay over time
  • Periodic memory audits
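A minimal sketch combining the integrity hash and confidence decay from the list above; the entry layout and the one-day half-life are assumptions:

```python
import hashlib

def memory_entry(content: str, now: float) -> dict:
    """Store content with an integrity hash and a write timestamp."""
    return {
        "content": content,
        "sha256": hashlib.sha256(content.encode()).hexdigest(),
        "written_at": now,
    }

def read_memory(entry: dict, now: float,
                half_life_s: float = 86_400.0) -> tuple[str, float]:
    """Verify integrity on read and return (content, decayed confidence).

    Confidence halves every `half_life_s` seconds; a corrupted entry
    (content no longer matching its hash) raises instead of returning.
    """
    digest = hashlib.sha256(entry["content"].encode()).hexdigest()
    if digest != entry["sha256"]:
        raise ValueError("memory entry corrupted")
    age = now - entry["written_at"]
    confidence = 0.5 ** (age / half_life_s)
    return entry["content"], confidence
```

The hash catches corrupted writes; the decay score lets downstream logic demand cross-checking against other sources before trusting an old memory.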

Multi-Agent Consensus

In multi-agent systems, require consensus before acting on potentially hallucinated information. Multiple agents independently verify claims before propagation. Trace information provenance through the system.

  • Independent verification
  • Consensus requirements
  • Confidence aggregation
  • Provenance tracking
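A sketch of the consensus gate with provenance. Each verifier is modeled as a callable from claim to bool; in a real system these would be separate agent instances with independent contexts, and the two-thirds quorum is an assumption:

```python
def consensus_check(claim: str, verifiers: list,
                    quorum: float = 2 / 3) -> dict:
    """Ask each verifier independently; record the provenance of every vote.

    The claim is accepted only if at least `quorum` of verifiers
    approve, so one agent's fabrication cannot propagate on its own.
    """
    votes = [(getattr(v, "__name__", "agent"), v(claim)) for v in verifiers]
    approvals = sum(1 for _, ok in votes if ok)
    return {
        "claim": claim,
        "accepted": bool(votes) and approvals / len(votes) >= quorum,
        "provenance": votes,  # who voted, and how
    }
```

Keeping the vote list in the result is the provenance-tracking piece: when a downstream error surfaces, the log shows exactly which agents vouched for the claim.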

Action Verification Loops

Before executing consequential actions, verify the agent's understanding and plan. Implement 'think-verify-act' patterns where plans are validated before execution. Use human-in-the-loop for high-stakes decisions.

  • Pre-action verification
  • Think-verify-act pattern
  • Human-in-the-loop for high stakes
  • Rollback capabilities
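The think-verify-act loop can be sketched with injected callbacks; all four (`verify`, `execute`, `needs_human`, `ask_human`) are hypothetical hooks standing in for real plan validation, tool execution, and human review:

```python
def think_verify_act(plan: list[dict], verify, execute,
                     needs_human, ask_human) -> list:
    """Validate each planned step before executing it.

    `verify` checks a step is actually possible (tools exist, sequence
    is coherent); `needs_human`/`ask_human` gate high-stakes steps;
    `execute` runs approved steps. Returns (status, detail) per step.
    """
    results = []
    for step in plan:
        if not verify(step):
            results.append(("skipped", step))      # hallucinated capability?
            continue
        if needs_human(step) and not ask_human(step):
            results.append(("rejected", step))     # human declined
            continue
        results.append(("done", execute(step)))
    return results
```

Verification runs per step rather than once per plan, so a hallucinated capability in step 4 is caught after step 3's real side effects, not after the whole plan has executed.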

Production Monitoring

Hallucination Rate Tracking

Track hallucination rates as a key quality metric. Sample production outputs for human evaluation. Implement automated detection on all outputs. Alert on rate increases.
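A minimal sketch of the rate tracker, assuming some upstream detector labels each sampled output; the window size and 5% alert threshold are illustrative:

```python
from collections import deque

class HallucinationRateTracker:
    """Rolling hallucination-rate monitor with a simple alert threshold."""

    def __init__(self, window: int = 100, alert_rate: float = 0.05):
        self.outcomes = deque(maxlen=window)  # True = flagged output
        self.alert_rate = alert_rate

    def record(self, hallucinated: bool) -> None:
        self.outcomes.append(hallucinated)

    @property
    def rate(self) -> float:
        if not self.outcomes:
            return 0.0
        return sum(self.outcomes) / len(self.outcomes)

    def should_alert(self) -> bool:
        return self.rate > self.alert_rate
```

The bounded deque gives a sliding window, so the alert reacts to recent regressions (a bad model rollout, a degraded retriever) rather than being diluted by all-time history.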

Citation Verification Logging

Log all citations and verify against source documents. Track citation accuracy over time. Identify frequently hallucinated sources. Build dashboards showing grounding quality.
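A sketch of the verification pass, assuming citations arrive as quoted passages keyed by document id (the "doc_id: passage" format here is an assumption, not a standard):

```python
def verify_citations(response_claims: list[tuple[str, str]],
                     sources: dict[str, str]) -> dict:
    """Check each (claim, citation) pair against the cited source text.

    A citation in the assumed "doc_id: passage" format is valid only
    if the quoted passage actually appears in the named source.
    """
    log = {"valid": [], "invalid": []}
    for claim, citation in response_claims:
        doc_id, _, passage = citation.partition(": ")
        source_text = sources.get(doc_id, "")
        bucket = "valid" if passage and passage in source_text else "invalid"
        log[bucket].append(claim)
    return log
```

Aggregating these logs over time surfaces the "frequently hallucinated sources" mentioned above: document ids that keep appearing in the invalid bucket are prime candidates for retriever or prompt fixes.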

User Feedback Integration

Collect and analyze user feedback on factual accuracy. Implement 'flag as incorrect' mechanisms. Use feedback to identify hallucination patterns and improve detection.


Defense in Depth: No single technique eliminates hallucinations. Combine prevention (RAG, citations, uncertainty), detection (NLI, confessions, LLM-as-judge), and monitoring (rate tracking, user feedback) for comprehensive coverage.