Hallucinations & Grounding
Types of Hallucinations
Factual Hallucinations
The model generates statements that contradict established facts or real-world knowledge. This includes incorrect dates, false statistics, non-existent citations, made-up historical events, or wrong scientific claims.
Contextual Hallucinations
The model generates information not supported by the provided context, even if factually true in general. In RAG systems, this occurs when the model ignores retrieved documents and generates from parametric memory.
Instruction Hallucinations
The model fails to follow explicit instructions while appearing to comply. It may claim to have performed a task it didn't do, fabricate tool outputs, or misrepresent its own capabilities.
Entity Hallucinations
The model invents non-existent entities such as fake people, organizations, products, URLs, or API endpoints. These plausible-sounding inventions can lead to broken links and failed integrations.
The Mechanism Behind Entity Hallucinations
Anthropic's interpretability research traced the internal circuitry of Claude to find out why entity hallucinations happen. The answer inverts the conventional explanation.
Refusal is the Default
The conventional view is that models hallucinate because they're trained to always produce output: completion machines that fill gaps with plausible text. Interpretability research found the opposite. Claude has a circuit that is active by default and causes it to state that it lacks sufficient information. The model's natural state is to decline.
Known Entity Recognition Overrides It
What allows Claude to answer at all is a separate "known answer" feature. When Claude recognizes a well-known entity, say Michael Jordan, this feature activates and inhibits the default refusal circuit. Hallucinations happen when this recognition system misfires: an unfamiliar name triggers just enough familiarity that the "known entity" feature incorrectly activates and suppresses refusal, leaving Claude with no actual knowledge and no refusal either.
Anthropic confirmed this by intervention: artificially activating the "known answer" features while asking about unknown entities reliably caused hallucination. Inhibiting the "can't answer" features did the same. Hallucination isn't recklessness — it's a safety default being incorrectly overridden by a misfiring recognition system.
Causes in Agentic Applications
Multi-Turn Context Drift
In long conversations, the model's understanding of context degrades. Early information may be forgotten or conflated. Attention also tends to favor the beginning and end of the context window, so information in the middle is underused, the 'lost in the middle' effect.
Tool Output Misinterpretation
Agents frequently call external tools and must interpret their outputs. Hallucinations occur when models misread responses, assume success when tools failed, or fabricate outputs entirely.
Planning Overconfidence
When generating multi-step plans, models may hallucinate capabilities they don't have, assume tool availability that doesn't exist, or create logically impossible sequences.
Memory System Corruption
Agents using external memory can hallucinate when retrieving or updating memories. They may recall events that didn't happen, merge distinct memories, or write corrupted summaries.
Inter-Agent Communication
In multi-agent systems, hallucinations can propagate between agents. One agent's fabricated output becomes another's trusted input, causing errors to compound across the system.
Reward Hacking Behaviors
Models optimized to appear helpful may generate confident-sounding but false information rather than admitting uncertainty. This produces fluent, convincing hallucinations.
Prevention Techniques
Retrieval-Augmented Generation (RAG)
Ground model outputs in retrieved documents from trusted knowledge bases. RAG reduces hallucinations by providing factual context the model must synthesize rather than generate from memory.
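Leaving the retrieval half aside, the grounding step can be sketched as a prompt that interleaves retrieved passages and instructs the model to answer only from them. The function name and prompt wording below are illustrative, not a specific framework's API:

```python
def build_grounded_prompt(question: str, passages: list[str]) -> str:
    """Assemble a RAG prompt that instructs the model to answer only
    from the retrieved passages (wording is illustrative)."""
    context = "\n\n".join(f"[doc {i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Answer using ONLY the passages below. If they do not contain "
        "the answer, say 'I don't know.'\n\n"
        f"{context}\n\nQuestion: {question}\nAnswer:"
    )
```

Numbering the passages also sets up citation requirements: the model can be asked to reference `[doc N]` for each claim.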
Citation Requirements
Require the model to cite specific sources for each claim. Citations create accountability and enable verification. A prompt such as 'cite the exact passage supporting this claim' forces grounding.
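Citation checking can start with something as simple as verifying that each quoted passage actually appears in its cited source. The claim schema here is an assumption for illustration; production systems would use fuzzy matching rather than exact substrings:

```python
def verify_citations(claims: list[dict], sources: dict[str, str]) -> list[dict]:
    """Check that each claim's quoted passage appears verbatim in the
    cited source. The claim schema ({"text", "source", "quote"}) is an
    assumption for illustration."""
    results = []
    for claim in claims:
        doc = sources.get(claim["source"], "")
        # Exact substring match; real systems would fuzzy-match.
        results.append({"claim": claim["text"], "supported": claim["quote"] in doc})
    return results
```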
Structured Output Constraints
Use JSON schemas, Pydantic models, or other structured formats to constrain generation. Structured outputs reduce the hallucination surface by requiring explicit fields such as 'source_document' or 'confidence_score'.
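A stdlib-only sketch of the idea, standing in for what a Pydantic model or JSON Schema validator would normally do; the required fields are illustrative:

```python
import json

# Required output fields and their types (illustrative; mirrors what a
# Pydantic model or JSON Schema would declare).
REQUIRED_FIELDS = {"answer": str, "source_document": str, "confidence_score": float}

def validate_output(raw: str) -> dict:
    """Parse model output as JSON and enforce the expected fields.
    A validation failure is itself a useful hallucination signal."""
    data = json.loads(raw)
    for field, ftype in REQUIRED_FIELDS.items():
        if not isinstance(data.get(field), ftype):
            raise ValueError(f"missing or mistyped field: {field}")
    return data
```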
Uncertainty Expression
Train and prompt models to express uncertainty rather than hallucinating confident answers. Calibrated models should say 'I don't know' or 'I'm not certain' when appropriate.
Detection Methods
Self-Consistency Checking
Generate multiple responses to the same query and check for consistency. Hallucinations often vary across samples while factual information remains stable.
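The voting step, assuming the samples have already been generated, can be sketched as follows; exact-match normalization is a simplification (real systems often compare answers with embeddings or an NLI model):

```python
from collections import Counter

def consistency_score(samples: list[str]) -> tuple[str, float]:
    """Majority vote over normalized samples. A low agreement fraction
    flags a likely hallucination."""
    normalized = [s.strip().lower() for s in samples]
    answer, count = Counter(normalized).most_common(1)[0]
    return answer, count / len(normalized)
```

For example, three samples of 'Paris', 'paris', 'Lyon' agree at two-thirds; an agreement below some tuned threshold would route the query to a stronger check or a refusal.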
Entailment Verification
Use NLI models to verify that generated claims are entailed by source documents. Claims labeled as 'contradiction' or 'neutral' indicate potential hallucinations.
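The decision logic, assuming an NLI model has already scored the claim against each retrieved passage, reduces to a threshold test; the 0.7 default here is illustrative and needs tuning per model:

```python
def claim_supported(entailment_probs: list[float], threshold: float = 0.7) -> bool:
    """A claim is grounded if at least one retrieved passage entails it.
    entailment_probs holds the NLI entailment probability of the claim
    against each passage; the threshold is an illustrative default."""
    return any(p >= threshold for p in entailment_probs)
```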
External Knowledge Verification
Verify factual claims against external knowledge bases, search engines, or fact-checking APIs. This catches factual hallucinations that internal checks might miss.
LLM-as-Judge Detection
Use a separate LLM to evaluate whether responses are grounded in provided context. The judge model analyzes source documents and generated text, flagging unsupported claims.
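A sketch of the judge's contract: a prompt template plus a parser that fails closed, treating unparseable verdicts as ungrounded. The prompt wording and JSON verdict schema are assumptions, not a specific vendor's format:

```python
import json

# Illustrative judge prompt; the JSON verdict schema is an assumption.
JUDGE_PROMPT = (
    "You are a grounding judge. Given SOURCES and a RESPONSE, reply with "
    'JSON: {{"grounded": true|false, "unsupported_claims": [...]}}.\n\n'
    "SOURCES:\n{sources}\n\nRESPONSE:\n{response}"
)

def parse_verdict(judge_output: str) -> dict:
    """Parse the judge model's JSON verdict, failing closed: anything
    unparseable is treated as ungrounded."""
    try:
        verdict = json.loads(judge_output)
    except json.JSONDecodeError:
        return {"grounded": False, "unsupported_claims": []}
    return {"grounded": bool(verdict.get("grounded", False)),
            "unsupported_claims": verdict.get("unsupported_claims", [])}
```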
Confessions Integration
OpenAI's confession technique reportedly detects hallucinations with around 84% accuracy. After generation, prompt the model to honestly assess whether it fabricated information.
Token Probability Analysis
Analyze token-level probabilities during generation. Low-probability tokens or high entropy sequences may indicate uncertainty and potential hallucination.
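The flagging logic can be sketched as follows, assuming per-step token probability distributions are available from the inference API (many expose them via a logprobs option); the 2.0-bit threshold is illustrative:

```python
import math

def token_entropy(probs: list[float]) -> float:
    """Shannon entropy (bits) of one generation step's token
    distribution; high entropy means the model was uncertain here."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def flag_uncertain_steps(step_entropies: list[float], threshold: float = 2.0) -> list[int]:
    """Indices of generation steps whose entropy exceeds a tunable
    threshold (2.0 bits is an illustrative default)."""
    return [i for i, h in enumerate(step_entropies) if h > threshold]
```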
Detection Methods Comparison
| Method | Latency Impact | Accuracy | Best For |
|---|---|---|---|
| Self-Consistency | High (multiple samples) | Medium | Factual queries, QA |
| NLI Entailment | Low-Medium | High for RAG | Document-grounded responses |
| External Verification | Medium-High | High for facts | Factual claims, citations |
| LLM-as-Judge | Medium | High | Complex grounding assessment |
| Confessions | Low | ~84% (reported) | Post-hoc monitoring |
| Token Probability | Very Low | Medium | Real-time flagging |
Agentic-Specific Patterns
Tool Output Verification
Never trust an agent's claims about tool results without verification. Parse and validate the actual tool responses. Define schemas for expected outputs, and log discrepancies between claimed and actual tool results.
- Schema validation for tool outputs
- Claimed vs actual comparison
- Structured error handling
- Output logging and audit
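The claimed-vs-actual comparison above can be as simple as diffing the fields the agent reported against the logged tool response; the field names are application-specific:

```python
def tool_claim_mismatches(claimed: dict, actual: dict, fields: list[str]) -> list[str]:
    """Compare what the agent claimed a tool returned against the
    logged actual response; return the fields that disagree
    (field names are application-specific)."""
    return [f for f in fields if claimed.get(f) != actual.get(f)]
```

Any non-empty result is a discrepancy to log and, depending on severity, a reason to halt the agent.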
Memory Validation
Implement verification when reading from and writing to agent memory systems. Hash or version memory entries. Cross-check retrieved memories against other sources. Implement decay scoring for older memories.
- Memory entry versioning
- Read/write verification
- Confidence decay over time
- Periodic memory audits
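Two of the ideas above, content hashing to detect silent corruption and confidence decay for older entries, can be sketched as follows; the 30-day half-life is an illustrative default:

```python
import hashlib

def memory_fingerprint(entry: str) -> str:
    """Content hash stored alongside each memory entry; a mismatch on
    read means the entry was corrupted or silently rewritten."""
    return hashlib.sha256(entry.encode()).hexdigest()

def decayed_confidence(base: float, age_days: float, half_life_days: float = 30.0) -> float:
    """Exponentially decay a memory's confidence with age; the 30-day
    half-life is an illustrative default."""
    return base * 0.5 ** (age_days / half_life_days)
```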
Multi-Agent Consensus
In multi-agent systems, require consensus before acting on potentially hallucinated information. Multiple agents independently verify claims before propagation. Trace information provenance through the system.
- Independent verification
- Consensus requirements
- Confidence aggregation
- Provenance tracking
Action Verification Loops
Before executing consequential actions, verify the agent's understanding and plan. Implement 'think-verify-act' patterns where plans are validated before execution. Use human-in-the-loop for high-stakes decisions.
- Pre-action verification
- Think-verify-act pattern
- Human-in-the-loop for high stakes
- Rollback capabilities
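The verify step of the think-verify-act pattern can be sketched as checking a generated plan against the tools that actually exist and flagging high-stakes steps for human approval; the `{"tool": ...}` step schema is an assumption for illustration:

```python
def verify_plan(plan: list[dict], available_tools: set[str],
                high_stakes: set[str]) -> dict:
    """Pre-execution check: every step must name a tool that exists,
    and steps touching high-stakes tools are routed to a human.
    The {"tool": ...} step schema is an assumption for illustration."""
    unknown = [s["tool"] for s in plan if s["tool"] not in available_tools]
    needs_human = [s["tool"] for s in plan if s["tool"] in high_stakes]
    return {"ok": not unknown, "unknown_tools": unknown,
            "requires_human": needs_human}
```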
Production Monitoring
Hallucination Rate Tracking
Track hallucination rates as a key quality metric. Sample production outputs for human evaluation. Implement automated detection on all outputs. Alert on rate increases.
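A sliding-window monitor is enough to get started; the window size, alert threshold, and minimum sample count below are illustrative defaults:

```python
from collections import deque

class HallucinationRateMonitor:
    """Sliding-window hallucination rate with a simple alert rule."""

    def __init__(self, window: int = 1000, alert_rate: float = 0.05,
                 min_samples: int = 20):
        self.results = deque(maxlen=window)
        self.alert_rate = alert_rate
        self.min_samples = min_samples  # avoid alerting on tiny samples

    def record(self, hallucinated: bool) -> None:
        self.results.append(hallucinated)

    @property
    def rate(self) -> float:
        return sum(self.results) / len(self.results) if self.results else 0.0

    def should_alert(self) -> bool:
        return len(self.results) >= self.min_samples and self.rate > self.alert_rate
```

Each detection method above (NLI, confessions, judge) can feed `record()`, so the monitor tracks whichever signal is cheapest to run on every output.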
Citation Verification Logging
Log all citations and verify against source documents. Track citation accuracy over time. Identify frequently hallucinated sources. Build dashboards showing grounding quality.
User Feedback Integration
Collect and analyze user feedback on factual accuracy. Implement 'flag as incorrect' mechanisms. Use feedback to identify hallucination patterns and improve detection.
Defense in Depth: No single technique eliminates hallucinations. Combine prevention (RAG, citations, uncertainty), detection (NLI, confessions, LLM-as-judge), and monitoring (rate tracking, user feedback) for comprehensive coverage.
