# AI Guardrails

## What Are Guardrails?

Guardrails are programmatic checks that validate what goes into and comes out of a model, enforcing safety, privacy, and policy constraints around every call. They fall into two broad groups:

### Input Guardrails
Validate user inputs before processing:
- Detect prompt injection attempts
- Filter malicious or harmful content
- Redact PII before model processing
- Enforce topic boundaries
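A minimal deterministic input guardrail might combine the checks above. The patterns and function names below are illustrative assumptions, not taken from any specific library:

```python
import re

# Hypothetical injection signatures -- real systems use far larger,
# regularly updated pattern sets or a classifier.
INJECTION_PATTERNS = [
    r"ignore (all )?previous instructions",
    r"you are now (in )?developer mode",
    r"reveal your system prompt",
]
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def check_input(text: str) -> dict:
    """Run input guardrails: detect injection attempts, redact PII."""
    lowered = text.lower()
    injection = any(re.search(p, lowered) for p in INJECTION_PATTERNS)
    redacted = EMAIL_RE.sub("[REDACTED_EMAIL]", text)
    return {"allowed": not injection, "text": redacted}
```

Redaction runs even when the input is allowed, so PII never reaches the model regardless of the injection verdict.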
### Output Guardrails
Validate model responses before delivery:
- Check for hallucinated content
- Ensure factual grounding
- Filter inappropriate responses
- Validate schema compliance
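Schema compliance is the easiest of these to check deterministically. A sketch, assuming the model is expected to return JSON with hypothetical `answer` and `confidence` fields:

```python
import json

# Expected schema for the model's response (field names are assumptions).
REQUIRED_FIELDS = {"answer": str, "confidence": float}

def validate_output(raw: str) -> tuple[bool, str]:
    """Return (ok, reason) for a model response before delivery."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False, "response is not valid JSON"
    for field, ftype in REQUIRED_FIELDS.items():
        if field not in data:
            return False, f"missing field: {field}"
        if not isinstance(data[field], ftype):
            return False, f"wrong type for {field}"
    return True, "ok"
```

Grounding and hallucination checks, by contrast, typically require a model-based evaluator rather than a structural check like this.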
## Implementation Approaches
| Approach | Description | Pros | Cons |
|---|---|---|---|
| Deterministic | Rule-based validation using regex, keywords, explicit checks | Fast, predictable, cost-effective | May miss nuanced violations |
| Model-based | LLMs or classifiers evaluate content semantically | Catches subtle issues, context-aware | Higher cost, added latency |
| Hybrid | Combine deterministic pre-filters with model-based deep checks | Best coverage, optimized cost | More complex to implement |
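The hybrid row can be sketched as a two-stage check: a cheap rule-based pre-filter short-circuits obvious violations, and only ambiguous inputs pay for a model-based pass. The blocklist and the stubbed classifier below are illustrative assumptions:

```python
# Hypothetical blocklist for the deterministic stage.
BLOCKLIST = {"drop table", "rm -rf"}

def deterministic_prefilter(text: str) -> bool:
    """Fast rule-based check; True means 'clearly violating'."""
    return any(term in text.lower() for term in BLOCKLIST)

def model_based_check(text: str) -> bool:
    """Placeholder for an LLM/classifier call. A real system would
    invoke a moderation model here; this stub assumes clean content."""
    return False

def is_violation(text: str) -> bool:
    if deterministic_prefilter(text):   # fast path: no model cost
        return True
    return model_based_check(text)      # deep path: semantic check
```

Ordering cheap checks first is what delivers the "optimized cost" in the table: the expensive stage only runs on inputs the rules could not decide.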
## Guardrail Types

### Content Moderation

Filter harmful, toxic, or inappropriate content including hate speech, violence, and explicit material.

### PII Protection

Detect and handle personally identifiable information through redaction, masking, or blocking.

### Prompt Attack Detection

Identify and block prompt injection attempts, jailbreaking, and prompt leaking attacks.

### Topic Boundaries

Enforce domain-specific constraints and prevent off-topic or denied subject discussions.

### Hallucination Detection

Verify factual accuracy and ensure responses are grounded in provided context or knowledge.

### Output Validation

Ensure outputs conform to expected formats, schemas, and quality standards.
## Execution Patterns

### Parallel Mode
Run guardrails concurrently with model inference to optimize latency. Check results after model completes.
Trade-off: Lower latency, but uses tokens even on violation.
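Parallel mode can be sketched with `asyncio.gather`: the guardrail and the model run concurrently, and the verdict gates delivery only after both finish. The sleeps stand in for real latencies; all names are illustrative:

```python
import asyncio

async def run_guardrail(prompt: str) -> bool:
    await asyncio.sleep(0.01)           # simulated guardrail latency
    return "forbidden" not in prompt    # True = allowed

async def run_model(prompt: str) -> str:
    await asyncio.sleep(0.02)           # simulated inference latency
    return f"response to: {prompt}"

async def parallel_mode(prompt: str) -> str:
    allowed, response = await asyncio.gather(
        run_guardrail(prompt), run_model(prompt)
    )
    # Tokens were already spent; the check only gates delivery.
    return response if allowed else "[BLOCKED]"
```

Total latency is roughly `max(guardrail, model)` rather than their sum, which is exactly the trade-off the text describes: faster, but inference cost is incurred even on a violation.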
### Blocking Mode
Run guardrails before model execution. Prevents token usage on policy violations.
Trade-off: Cost efficient, but higher latency.
### Tripwire Pattern

Guardrail functions return a tripwire result; when the tripwire is triggered, execution halts and an exception is raised.
Use case: Hard stops for critical policy violations.
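A generic tripwire sketch, not tied to any particular SDK (the result class, exception, and PII check are all assumptions for illustration):

```python
from dataclasses import dataclass

@dataclass
class GuardrailResult:
    tripwire_triggered: bool
    reason: str = ""

class TripwireError(Exception):
    """Raised to hard-stop the pipeline on a critical violation."""

def pii_guardrail(text: str) -> GuardrailResult:
    if "ssn" in text.lower():
        return GuardrailResult(True, "possible SSN disclosure")
    return GuardrailResult(False)

def run_with_tripwire(text: str) -> str:
    result = pii_guardrail(text)
    if result.tripwire_triggered:
        raise TripwireError(result.reason)  # hard stop, no model call
    return f"model would process: {text}"
```

The OpenAI Agents SDK listed in the tools table below implements this shape natively; the version here is a framework-free approximation.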
### Shadow Mode
Monitor violations without blocking. Log triggers for analysis and tuning before enforcing policies.
Use case: Safe rollout and threshold tuning.
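Shadow mode is usually just an `enforce` flag on an otherwise ordinary guardrail: log every trigger, block nothing until the flag is flipped. The scorer below is a stand-in assumption for a real classifier:

```python
import logging

logger = logging.getLogger("guardrails.shadow")

def toxicity_score(text: str) -> float:
    """Placeholder scorer; a real system would call a classifier here."""
    return 0.9 if "hate" in text.lower() else 0.1

def check(text: str, threshold: float = 0.5, enforce: bool = False) -> bool:
    """Return True if the text is allowed through."""
    score = toxicity_score(text)
    if score >= threshold:
        # Logged with full context so thresholds can be tuned offline.
        logger.warning("guardrail triggered: score=%.2f text=%r", score, text)
        if enforce:
            return False        # enforcing mode blocks
    return True                 # shadow mode always allows
```

Once the logged false-positive rate at a given threshold looks acceptable, flipping `enforce=True` turns the same code into a blocking guardrail.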
## PII Handling Strategies
| Strategy | Behavior | Example | Use Case |
|---|---|---|---|
| Redact | Replace with [REDACTED_TYPE] | user@example.com → [REDACTED_EMAIL] | Full privacy protection |
| Mask | Partially obscure the value | 4111111111111111 → ****1111 | Verification while protecting |
| Hash | Replace with deterministic hash | SSN: 123-45-6789 → SSN: a1b2c3d4 | De-identification with consistency |
| Block | Raise exception if detected | Request rejected with error | Strict compliance environments |
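The four strategies in the table map to four small functions. This is a sketch with an illustrative email pattern; production systems would use a dedicated PII detector covering many more entity types:

```python
import hashlib
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def redact(text: str) -> str:
    """Redact: replace the value entirely."""
    return EMAIL_RE.sub("[REDACTED_EMAIL]", text)

def mask_card(number: str) -> str:
    """Mask: obscure all but the last four digits."""
    return "*" * (len(number) - 4) + number[-4:]

def hash_value(value: str) -> str:
    """Hash: deterministic, so the same input always maps to the
    same token -- useful for joining de-identified records."""
    return hashlib.sha256(value.encode()).hexdigest()[:8]

class PIIDetectedError(Exception):
    pass

def block(text: str) -> str:
    """Block: reject the request outright if PII is present."""
    if EMAIL_RE.search(text):
        raise PIIDetectedError("PII detected; request rejected")
    return text
```

Note the hash is a truncated SHA-256 for readability here; real de-identification would keep the full digest and typically add a secret salt.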
## Guardrails Platforms & Tools
| Tool | Provider | Key Features | Type |
|---|---|---|---|
| Bedrock Guardrails | AWS | 6 safeguards: content, prompt attacks, topics, PII, grounding, automated reasoning | Managed Service |
| Guardrails AI | Open Source | Validators for hallucination, PII, toxicity, tone, facts | Framework |
| NeMo Guardrails | NVIDIA | Programmable rails, topical/safety/security controls | Framework |
| OpenAI Agents SDK | OpenAI | Input/output guardrails, tripwire pattern | SDK |
| LangChain Guardrails | LangChain | PII detection, human-in-loop, custom hooks | Framework |
| Cloudflare AI Gateway | Cloudflare | Proxy-based, cross-model moderation, rate limiting | Managed Service |
| LLM Guard | Open Source | Input/output scanners, prompt injection detection | Library |
| Rebuff | Open Source | Prompt injection detection, self-hardening | Library |
## Amazon Bedrock Guardrails

### Content Moderation

Filter harmful content in text and images across configurable categories.

### Prompt Attack Detection

Detect and block prompt injection and jailbreak attempts.

### Topic Classification

Define denied topics to keep conversations within approved domains.

### PII Redaction

Automatically detect and redact 30+ types of personally identifiable information.

### Contextual Grounding

Detect hallucinations by verifying responses against provided context.

### Automated Reasoning

Use formal logic to verify policy compliance and response accuracy.
## Implementation Best Practices

### Defense in Depth

Layer multiple guardrails at different execution points. Combine deterministic pre-filters with model-based semantic checks.
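Layering can be as simple as running ordered lists of checks at both the input and output stages, cheapest first. All names below are illustrative, and the semantic layer is stubbed:

```python
def blocklist_check(text: str) -> bool:          # layer 1: deterministic
    return "forbidden" not in text.lower()

def semantic_check(text: str) -> bool:           # layer 2: model-based (stubbed)
    return True

INPUT_LAYERS = [blocklist_check, semantic_check]
OUTPUT_LAYERS = [blocklist_check]

def guarded_call(prompt: str, model) -> str:
    """Run input layers, then the model, then output layers."""
    if not all(layer(prompt) for layer in INPUT_LAYERS):
        return "[BLOCKED: input]"
    output = model(prompt)
    if not all(layer(output) for layer in OUTPUT_LAYERS):
        return "[BLOCKED: output]"
    return output
```

Checking the output independently matters because a clean prompt can still elicit a policy-violating response.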
### Shadow Mode First

Deploy new guardrails in monitoring mode before enforcement. Collect data to tune thresholds and reduce false positives.

### Balance Security & UX

Overly aggressive guardrails frustrate users. Find the right balance between protection and usability.

### Comprehensive Logging

Log all guardrail triggers with full context for analysis, debugging, and compliance auditing.

### Regular Tuning

Review guardrail performance regularly. Adjust thresholds based on false positive/negative rates.

### Human-in-the-Loop

For high-stakes decisions, route uncertain cases to human reviewers rather than auto-blocking.
