# AI Guardrails

## What Are Guardrails?

Guardrails are programmatic checks that validate what goes into and comes out of a model, enforcing safety, privacy, and policy constraints around every call. They fall into two broad groups:

### Input Guardrails
Validate user inputs before processing:
- Detect prompt injection attempts
- Filter malicious or harmful content
- Redact PII before model processing
- Enforce topic boundaries
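A minimal deterministic input guardrail might combine the checks above. The patterns and function names below are illustrative assumptions, not taken from any specific library:

```python
import re

# Hypothetical injection signatures -- real systems use far larger,
# regularly updated pattern sets or a classifier.
INJECTION_PATTERNS = [
    r"ignore (all )?previous instructions",
    r"you are now (in )?developer mode",
    r"reveal your system prompt",
]
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def check_input(text: str) -> dict:
    """Run input guardrails: detect injection attempts, redact PII."""
    lowered = text.lower()
    injection = any(re.search(p, lowered) for p in INJECTION_PATTERNS)
    redacted = EMAIL_RE.sub("[REDACTED_EMAIL]", text)
    return {"allowed": not injection, "text": redacted}
```

Redaction runs even when the input is allowed, so PII never reaches the model regardless of the injection verdict.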
### Output Guardrails
Validate model responses before delivery:
- Check for hallucinated content
- Ensure factual grounding
- Filter inappropriate responses
- Validate schema compliance
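Schema compliance is the easiest of these to check deterministically. A sketch, assuming the model is expected to return JSON with hypothetical `answer` and `confidence` fields:

```python
import json

# Expected schema for the model's response (field names are assumptions).
REQUIRED_FIELDS = {"answer": str, "confidence": float}

def validate_output(raw: str) -> tuple[bool, str]:
    """Return (ok, reason) for a model response before delivery."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False, "response is not valid JSON"
    for field, ftype in REQUIRED_FIELDS.items():
        if field not in data:
            return False, f"missing field: {field}"
        if not isinstance(data[field], ftype):
            return False, f"wrong type for {field}"
    return True, "ok"
```

Grounding and hallucination checks, by contrast, typically require a model-based evaluator rather than a structural check like this.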
## Implementation Approaches
| Approach | Description | Pros | Cons |
|---|---|---|---|
| Deterministic | Rule-based validation using regex, keywords, explicit checks | Fast, predictable, cost-effective | May miss nuanced violations |
| Model-based | LLMs or classifiers evaluate content semantically | Catches subtle issues, context-aware | Higher cost, added latency |
| Hybrid | Combine deterministic pre-filters with model-based deep checks | Best coverage, optimized cost | More complex to implement |
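The hybrid row can be sketched as a two-stage check: a cheap rule-based pre-filter short-circuits obvious violations, and only ambiguous inputs pay for a model-based pass. The blocklist and the stubbed classifier below are illustrative assumptions:

```python
# Hypothetical blocklist for the deterministic stage.
BLOCKLIST = {"drop table", "rm -rf"}

def deterministic_prefilter(text: str) -> bool:
    """Fast rule-based check; True means 'clearly violating'."""
    return any(term in text.lower() for term in BLOCKLIST)

def model_based_check(text: str) -> bool:
    """Placeholder for an LLM/classifier call. A real system would
    invoke a moderation model here; this stub assumes clean content."""
    return False

def is_violation(text: str) -> bool:
    if deterministic_prefilter(text):   # fast path: no model cost
        return True
    return model_based_check(text)      # deep path: semantic check
```

Ordering cheap checks first is what delivers the "optimized cost" in the table: the expensive stage only runs on inputs the rules could not decide.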
## Guardrail Types

### Content Moderation

Filter harmful, toxic, or inappropriate content including hate speech, violence, and explicit material.

### PII Protection

Detect and handle personally identifiable information through redaction, masking, or blocking.

### Prompt Attack Detection

Identify and block prompt injection attempts, jailbreaking, and prompt leaking attacks.

### Topic Boundaries

Enforce domain-specific constraints and prevent off-topic or denied subject discussions.

### Hallucination Detection

Verify factual accuracy and ensure responses are grounded in provided context or knowledge.

### Output Validation

Ensure outputs conform to expected formats, schemas, and quality standards.
## Execution Patterns

### Parallel Mode
Run guardrails concurrently with model inference to optimize latency. Check results after model completes.
Trade-off: Lower latency, but uses tokens even on violation.
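Parallel mode can be sketched with `asyncio.gather`: the guardrail and the model run concurrently, and the verdict gates delivery only after both finish. The sleeps stand in for real latencies; all names are illustrative:

```python
import asyncio

async def run_guardrail(prompt: str) -> bool:
    await asyncio.sleep(0.01)           # simulated guardrail latency
    return "forbidden" not in prompt    # True = allowed

async def run_model(prompt: str) -> str:
    await asyncio.sleep(0.02)           # simulated inference latency
    return f"response to: {prompt}"

async def parallel_mode(prompt: str) -> str:
    allowed, response = await asyncio.gather(
        run_guardrail(prompt), run_model(prompt)
    )
    # Tokens were already spent; the check only gates delivery.
    return response if allowed else "[BLOCKED]"
```

Total latency is roughly `max(guardrail, model)` rather than their sum, which is exactly the trade-off the text describes: faster, but inference cost is incurred even on a violation.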
### Blocking Mode
Run guardrails before model execution. Prevents token usage on policy violations.
Trade-off: Cost efficient, but higher latency.
### Tripwire Pattern

Guardrail functions return a tripwire result; when the tripwire is triggered, execution halts and an exception is raised.
Use case: Hard stops for critical policy violations.
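A generic tripwire sketch, not tied to any particular SDK (the result class, exception, and PII check are all assumptions for illustration):

```python
from dataclasses import dataclass

@dataclass
class GuardrailResult:
    tripwire_triggered: bool
    reason: str = ""

class TripwireError(Exception):
    """Raised to hard-stop the pipeline on a critical violation."""

def pii_guardrail(text: str) -> GuardrailResult:
    if "ssn" in text.lower():
        return GuardrailResult(True, "possible SSN disclosure")
    return GuardrailResult(False)

def run_with_tripwire(text: str) -> str:
    result = pii_guardrail(text)
    if result.tripwire_triggered:
        raise TripwireError(result.reason)  # hard stop, no model call
    return f"model would process: {text}"
```

The OpenAI Agents SDK listed in the tools table below implements this shape natively; the version here is a framework-free approximation.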
### Shadow Mode
Monitor violations without blocking. Log triggers for analysis and tuning before enforcing policies.
Use case: Safe rollout and threshold tuning.
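Shadow mode is usually just an `enforce` flag on an otherwise ordinary guardrail: log every trigger, block nothing until the flag is flipped. The scorer below is a stand-in assumption for a real classifier:

```python
import logging

logger = logging.getLogger("guardrails.shadow")

def toxicity_score(text: str) -> float:
    """Placeholder scorer; a real system would call a classifier here."""
    return 0.9 if "hate" in text.lower() else 0.1

def check(text: str, threshold: float = 0.5, enforce: bool = False) -> bool:
    """Return True if the text is allowed through."""
    score = toxicity_score(text)
    if score >= threshold:
        # Logged with full context so thresholds can be tuned offline.
        logger.warning("guardrail triggered: score=%.2f text=%r", score, text)
        if enforce:
            return False        # enforcing mode blocks
    return True                 # shadow mode always allows
```

Once the logged false-positive rate at a given threshold looks acceptable, flipping `enforce=True` turns the same code into a blocking guardrail.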
## PII Handling Strategies
| Strategy | Behavior | Example | Use Case |
|---|---|---|---|
| Redact | Replace with [REDACTED_TYPE] | user@example.com → [REDACTED_EMAIL] | Full privacy protection |
| Mask | Partially obscure the value | 4111111111111111 → ****1111 | Verification while protecting |
| Hash | Replace with deterministic hash | SSN: 123-45-6789 → SSN: a1b2c3d4 | De-identification with consistency |
| Block | Raise exception if detected | Request rejected with error | Strict compliance environments |
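The four strategies in the table map to four small functions. This is a sketch with an illustrative email pattern; production systems would use a dedicated PII detector covering many more entity types:

```python
import hashlib
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def redact(text: str) -> str:
    """Redact: replace the value entirely."""
    return EMAIL_RE.sub("[REDACTED_EMAIL]", text)

def mask_card(number: str) -> str:
    """Mask: obscure all but the last four digits."""
    return "*" * (len(number) - 4) + number[-4:]

def hash_value(value: str) -> str:
    """Hash: deterministic, so the same input always maps to the
    same token -- useful for joining de-identified records."""
    return hashlib.sha256(value.encode()).hexdigest()[:8]

class PIIDetectedError(Exception):
    pass

def block(text: str) -> str:
    """Block: reject the request outright if PII is present."""
    if EMAIL_RE.search(text):
        raise PIIDetectedError("PII detected; request rejected")
    return text
```

Note the hash is a truncated SHA-256 for readability here; real de-identification would keep the full digest and typically add a secret salt.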
## Guardrails Platforms & Tools
| Tool | Provider | Key Features | Type |
|---|---|---|---|
| Bedrock Guardrails | AWS | 6 safeguards: content, prompt attacks, topics, PII, grounding, automated reasoning | Managed Service |
| Guardrails AI | Open Source | Validators for hallucination, PII, toxicity, tone, facts | Framework |
| NeMo Guardrails | NVIDIA | Programmable rails, topical/safety/security controls | Framework |
| OpenAI Agents SDK | OpenAI | Input/output guardrails, tripwire pattern | SDK |
| LangChain Guardrails | LangChain | PII detection, human-in-loop, custom hooks | Framework |
| Cloudflare AI Gateway | Cloudflare | Proxy-based, cross-model moderation, rate limiting | Managed Service |
| LLM Guard | Open Source | Input/output scanners, prompt injection detection | Library |
| Rebuff | Open Source | Prompt injection detection, self-hardening | Library |
## Amazon Bedrock Guardrails

### Content Moderation

Filter harmful content in text and images across configurable categories.

### Prompt Attack Detection

Detect and block prompt injection and jailbreak attempts.

### Topic Classification

Define denied topics to keep conversations within approved domains.

### PII Redaction

Automatically detect and redact 30+ types of personally identifiable information.

### Contextual Grounding

Detect hallucinations by verifying responses against provided context.

### Automated Reasoning

Use formal logic to verify policy compliance and response accuracy.
## Implementation Best Practices

### Defense in Depth

Layer multiple guardrails at different execution points. Combine deterministic pre-filters with model-based semantic checks.
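Layering can be as simple as running ordered lists of checks at both the input and output stages, cheapest first. All names below are illustrative, and the semantic layer is stubbed:

```python
def blocklist_check(text: str) -> bool:          # layer 1: deterministic
    return "forbidden" not in text.lower()

def semantic_check(text: str) -> bool:           # layer 2: model-based (stubbed)
    return True

INPUT_LAYERS = [blocklist_check, semantic_check]
OUTPUT_LAYERS = [blocklist_check]

def guarded_call(prompt: str, model) -> str:
    """Run input layers, then the model, then output layers."""
    if not all(layer(prompt) for layer in INPUT_LAYERS):
        return "[BLOCKED: input]"
    output = model(prompt)
    if not all(layer(output) for layer in OUTPUT_LAYERS):
        return "[BLOCKED: output]"
    return output
```

Checking the output independently matters because a clean prompt can still elicit a policy-violating response.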
### Shadow Mode First

Deploy new guardrails in monitoring mode before enforcement. Collect data to tune thresholds and reduce false positives.

### Balance Security & UX

Overly aggressive guardrails frustrate users. Find the right balance between protection and usability.

### Comprehensive Logging

Log all guardrail triggers with full context for analysis, debugging, and compliance auditing.

### Regular Tuning

Review guardrail performance regularly. Adjust thresholds based on false positive/negative rates.

### Human-in-the-Loop

For high-stakes decisions, route uncertain cases to human reviewers rather than auto-blocking.
