AI Guardrails

What Are Guardrails?

Guardrails are checks that validate the inputs sent to an AI system and the outputs it returns, enforcing safety, privacy, and policy constraints around the model. They fall into two broad groups:

Input Guardrails

Validate user inputs before processing:

  • Detect prompt injection attempts
  • Filter malicious or harmful content
  • Redact PII before model processing
  • Enforce topic boundaries
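A minimal deterministic input guardrail might look like the following sketch; the patterns and verdict shape are illustrative, not a production rule set:

```python
import re

# Hypothetical injection patterns for illustration only; real systems
# combine many such rules with model-based semantic checks.
INJECTION_PATTERNS = [
    r"ignore (all )?(previous|prior) instructions",
    r"reveal your system prompt",
]

def check_input(user_text: str) -> dict:
    """Return a verdict: {'allowed': bool, 'reason': str | None}."""
    lowered = user_text.lower()
    for pattern in INJECTION_PATTERNS:
        if re.search(pattern, lowered):
            return {"allowed": False, "reason": "possible prompt injection"}
    return {"allowed": True, "reason": None}
```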

Output Guardrails

Validate model responses before delivery:

  • Check for hallucinated content
  • Ensure factual grounding
  • Filter inappropriate responses
  • Validate schema compliance
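Schema compliance is the most concrete of these checks. A sketch, assuming the response should be a JSON object with two typed fields (the field names are illustrative):

```python
import json

# Assumed response contract for illustration.
REQUIRED_FIELDS = {"answer": str, "confidence": float}

def validate_output(raw: str) -> tuple[bool, str]:
    """Check that the raw model output parses and matches the contract."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False, "not valid JSON"
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in data:
            return False, f"missing field: {field}"
        if not isinstance(data[field], expected_type):
            return False, f"wrong type for {field}"
    return True, "ok"
```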

Implementation Approaches

| Approach | Description | Pros | Cons |
|---|---|---|---|
| Deterministic | Rule-based validation using regex, keywords, explicit checks | Fast, predictable, cost-effective | May miss nuanced violations |
| Model-based | LLMs or classifiers evaluate content semantically | Catches subtle issues, context-aware | Higher cost, added latency |
| Hybrid | Combine deterministic pre-filters with model-based deep checks | Best coverage, optimized cost | More complex to implement |
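The hybrid approach can be sketched as a cheap deterministic pre-filter that escalates only suspicious inputs to a model-based classifier. Everything here is illustrative: the blocklist, the hint words, and `model_classify`, which stands in for a real moderation-model call:

```python
BLOCKLIST = {"how to build a bomb"}                    # assumed examples
SUSPICIOUS_HINTS = ("ignore", "jailbreak", "system prompt")

def model_classify(text: str) -> bool:
    """Stub for an expensive model-based semantic check; True means safe."""
    return "jailbreak" not in text.lower()

def hybrid_check(text: str) -> bool:
    lowered = text.lower()
    if lowered in BLOCKLIST:            # fast deterministic rejection
        return False
    if any(hint in lowered for hint in SUSPICIOUS_HINTS):
        return model_classify(text)     # escalate only suspicious inputs
    return True                         # cheap path: allow without a model call
```

Only the middle branch pays model cost and latency, which is the point of the hybrid design.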

Guardrail Types

Content Moderation

Filter harmful, toxic, or inappropriate content including hate speech, violence, and explicit material.

Toxicity · Hate Speech · Violence

PII Protection

Detect and handle personally identifiable information through redaction, masking, or blocking.

Detection · Redaction · Masking

Prompt Attack Detection

Identify and block prompt injection attempts, jailbreaking, and prompt leaking attacks.

Injection · Jailbreaking · Prompt Leaking

Topic Boundaries

Enforce domain-specific constraints and prevent discussion of off-topic or denied subjects.

Denied Topics · Domain Lock

Hallucination Detection

Verify factual accuracy and ensure responses are grounded in provided context or knowledge.

Grounding · Fact Verification

Output Validation

Ensure outputs conform to expected formats, schemas, and quality standards.

Schema Check · Quality Gates

Execution Patterns

Parallel Mode

Run guardrails concurrently with model inference to minimize latency, then check the guardrail results once the model completes.

Trade-off: Lower latency, but uses tokens even on violation.
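In Python, parallel mode is naturally expressed with `asyncio.gather`. The guardrail and model calls below are stand-ins with simulated latency:

```python
import asyncio

async def run_guardrail(text: str) -> bool:
    await asyncio.sleep(0.01)            # simulated guardrail latency
    return "forbidden" not in text.lower()

async def call_model(text: str) -> str:
    await asyncio.sleep(0.02)            # simulated model latency
    return f"model answer to: {text}"

async def parallel_mode(text: str) -> str:
    # Both coroutines run concurrently; total latency ~= the slower one.
    verdict, response = await asyncio.gather(run_guardrail(text), call_model(text))
    # Note: tokens were already spent on the model call even if the check fails.
    return response if verdict else "[blocked by guardrail]"
```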

Blocking Mode

Run guardrails before model execution. Prevents token usage on policy violations.

Trade-off: Cost efficient, but higher latency.

Tripwire Pattern

Guardrail functions return a tripwire result to signal a violation; when a tripwire fires, execution halts and an exception is raised.

Use case: Hard stops for critical policy violations.
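A minimal tripwire sketch; the `GuardrailResult` shape and the SSN rule are illustrative, not any particular SDK's API:

```python
import re
from dataclasses import dataclass

@dataclass
class GuardrailResult:
    tripwire_triggered: bool
    info: str = ""

class TripwireTriggered(Exception):
    """Raised to hard-stop execution on a critical violation."""

def pii_guardrail(text: str) -> GuardrailResult:
    # Illustrative check: trip on anything that looks like a US SSN.
    hit = re.search(r"\b\d{3}-\d{2}-\d{4}\b", text)
    return GuardrailResult(bool(hit), "SSN detected" if hit else "")

def run_guarded(text: str) -> str:
    result = pii_guardrail(text)
    if result.tripwire_triggered:
        raise TripwireTriggered(result.info)   # hard stop, no model call
    return "proceeding to model"
```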

Shadow Mode

Monitor violations without blocking. Log triggers for analysis and tuning before enforcing policies.

Use case: Safe rollout and threshold tuning.
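Shadow mode can be as simple as recording the verdict without acting on it. The rule and the in-memory sink below are placeholders for a real policy and metrics pipeline:

```python
import logging

logger = logging.getLogger("guardrails.shadow")
violations: list[str] = []   # stands in for a metrics/log sink

def shadow_check(text: str, enforce: bool = False) -> bool:
    """Log violations; block only after the rule graduates to enforcement."""
    violated = "badword" in text.lower()   # illustrative rule
    if violated:
        violations.append(text)            # record for threshold tuning
        logger.warning("shadow violation: %r", text)
        if enforce:
            return False                   # flip to blocking after tuning
    return True                            # shadow mode never blocks
```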

Execution Flow

  1. User Input
  2. Input Guardrails
  3. LLM Processing
  4. Output Guardrails
  5. Response
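The five steps above can be wired together as a simple pipeline; every stage here is a placeholder:

```python
def input_guardrails(text: str) -> bool:
    return "attack" not in text.lower()   # stub input check

def llm(text: str) -> str:
    return f"echo: {text}"                # stands in for model inference

def output_guardrails(response: str) -> bool:
    return len(response) < 1000           # e.g. a simple quality gate

def handle(text: str) -> str:
    if not input_guardrails(text):
        return "[input blocked]"          # blocked before spending tokens
    response = llm(text)
    if not output_guardrails(response):
        return "[output blocked]"         # blocked before delivery
    return response
```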

PII Handling Strategies

| Strategy | Behavior | Example | Use Case |
|---|---|---|---|
| Redact | Replace with [REDACTED_TYPE] | user@example.com → [REDACTED_EMAIL] | Full privacy protection |
| Mask | Partially obscure the value | 4111111111111111 → ****1111 | Verification while protecting |
| Hash | Replace with deterministic hash | SSN: 123-45-6789 → SSN: a1b2c3d4 | De-identification with consistency |
| Block | Raise exception if detected | Request rejected with error | Strict compliance environments |
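A rough sketch of the first three strategies, assuming email redaction, card masking, and truncated SHA-256 hashing; the regex, mask shape, and hash length are all illustrative choices:

```python
import hashlib
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")   # simplified email pattern

def redact(text: str) -> str:
    """Replace every email address with a typed redaction token."""
    return EMAIL_RE.sub("[REDACTED_EMAIL]", text)

def mask(card_number: str) -> str:
    """Show only the last four digits of a card number."""
    return "****" + card_number[-4:]

def hash_pii(value: str) -> str:
    """Deterministic hash: the same input always maps to the same token."""
    return hashlib.sha256(value.encode()).hexdigest()[:8]
```

Because hashing is deterministic, the same person can be tracked across records without storing the raw identifier.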

Guardrails Platforms & Tools

| Tool | Provider | Key Features | Type |
|---|---|---|---|
| Bedrock Guardrails | AWS | 6 safeguards: content, prompt attacks, topics, PII, grounding, automated reasoning | Managed Service |
| Guardrails AI | Open Source | Validators for hallucination, PII, toxicity, tone, facts | Framework |
| NeMo Guardrails | NVIDIA | Programmable rails, topical/safety/security controls | Framework |
| OpenAI Agents SDK | OpenAI | Input/output guardrails, tripwire pattern | SDK |
| LangChain Guardrails | LangChain | PII detection, human-in-loop, custom hooks | Framework |
| Cloudflare AI Gateway | Cloudflare | Proxy-based, cross-model moderation, rate limiting | Managed Service |
| LLM Guard | Open Source | Input/output scanners, prompt injection detection | Library |
| Rebuff | Open Source | Prompt injection detection, self-hardening | Library |

Amazon Bedrock Guardrails

Content Moderation

Filter harmful content in text and images across configurable categories.

Prompt Attack Detection

Detect and block prompt injection and jailbreak attempts.

Topic Classification

Define denied topics to keep conversations within approved domains.

PII Redaction

Automatically detect and redact 30+ types of personally identifiable information.

Contextual Grounding

Detect hallucinations by verifying responses against provided context.

Automated Reasoning

Use formal logic to verify policy compliance and response accuracy.

Implementation Best Practices

Defense in Depth

Layer multiple guardrails at different execution points. Combine deterministic pre-filters with model-based semantic checks.

Shadow Mode First

Deploy new guardrails in monitoring mode before enforcement. Collect data to tune thresholds and reduce false positives.

Balance Security & UX

Overly aggressive guardrails frustrate users. Find the right balance between protection and usability.

Comprehensive Logging

Log all guardrail triggers with full context for analysis, debugging, and compliance auditing.

Regular Tuning

Review guardrail performance regularly. Adjust thresholds based on false positive/negative rates.

Human-in-the-Loop

For high-stakes decisions, route uncertain cases to human reviewers rather than auto-blocking.