AI Security
OWASP LLM Top 10 Vulnerabilities (2025)
| Rank | Vulnerability | Description | Mitigation |
|---|---|---|---|
| LLM01 | Prompt Injection | User prompts alter the LLM's behavior by overriding or manipulating system instructions | Input validation, prompt isolation, output filtering |
| LLM02 | Sensitive Information Disclosure | LLM reveals training data, PII, or confidential information affecting both model and application | Data filtering, PII detection, output monitoring, redaction |
| LLM03 | Supply Chain | LLM supply chains susceptible to vulnerabilities from compromised models, datasets, or dependencies | Model verification, dependency scanning, provenance tracking |
| LLM04 | Data and Model Poisoning | Manipulation of pre-training, fine-tuning, or embedding data to introduce vulnerabilities or biases | Data provenance, validation, anomaly detection, sanitization |
| LLM05 | Improper Output Handling | Insufficient validation, sanitization, and handling of LLM outputs before downstream use | Sanitize outputs, validate before execution, sandbox environments |
| LLM06 | Excessive Agency | LLM-based systems granted excessive autonomy or permissions beyond intended scope | Principle of least privilege, human-in-loop, action guardrails |
| LLM07 | System Prompt Leakage | Exposure of system prompts that reveal internal instructions, configurations, or sensitive logic | Prompt protection, input filtering, access controls, monitoring |
| LLM08 | Vector and Embedding Weaknesses | Vulnerabilities in vector databases and embeddings present security risks in RAG systems | Vector validation, embedding security, access controls, monitoring |
| LLM09 | Misinformation | LLMs generate false, misleading, or fabricated information (hallucinations) in responses | Human review, fact-checking, confidence scores, source citations |
| LLM10 | Unbounded Consumption | LLM processes consume excessive resources leading to DoS, cost overruns, or service degradation | Rate limiting, input length limits, cost monitoring, auto-scaling |
AI-Specific Attack Vectors
Adversarial Attacks
Definition: Crafted inputs designed to fool ML models
Types:
- Evasion: Bypass detection (e.g., spam filters)
- Poisoning: Corrupt training data
- Model Inversion: Extract training data
- Model Extraction: Steal model via queries
Defense: Adversarial training, input validation, robustness testing
Prompt Injection
Definition: Manipulate LLM behavior via malicious prompts
Types:
- Direct: User input overrides system prompt
- Indirect: Malicious content in retrieved data
- Jailbreaking: Bypass safety guardrails
- Prompt Leaking: Expose system prompts
Defense: Prompt isolation, input sanitization, output filtering
Data Poisoning
Definition: Inject malicious data into training set
Impact:
- Backdoors: Trigger specific behaviors
- Bias Injection: Introduce systematic bias
- Performance Degradation: Reduce accuracy
- Targeted Attacks: Misclassify specific inputs
Defense: Data provenance, anomaly detection, sanitization
Model Extraction
Definition: Steal model by querying it repeatedly
Techniques:
- Equation-solving: Reconstruct decision boundaries
- Path-finding: Discover model structure
- Knowledge distillation: Train copy via queries
- API abuse: Extract via excessive queries
Defense: Query monitoring, rate limiting, watermarking
AI Security Best Practices
Input Security
- Validate and sanitize all inputs
- Enforce input length limits
- Detect and block malicious patterns
- Use allowlists over blocklists
- Implement rate limiting per user/IP
- Log suspicious inputs for analysis
Output Security
- Sanitize outputs before use
- Validate before code execution
- Filter sensitive information (PII, secrets)
- Implement output guardrails
- Monitor for data leakage
- Use confidence thresholds
Model Security
- Verify model provenance
- Scan for embedded malware
- Use model signing/checksums
- Restrict model access (RBAC)
- Encrypt models at rest and in transit
- Monitor for model theft attempts
Data Security
- Encrypt training data
- Implement data access controls
- Validate data provenance
- Detect poisoned data
- Use differential privacy
- Anonymize/pseudonymize PII
Infrastructure Security
- Isolate ML workloads (VPC, containers)
- Use least privilege access
- Enable audit logging
- Implement network segmentation
- Regular security patching
- DDoS protection and WAF
Monitoring & Response
- Monitor for anomalous queries
- Track model behavior changes
- Alert on security events
- Implement incident response plan
- Regular security audits
- Threat modeling and pen testing
AI Security Tools & Frameworks
Guardrails Frameworks
- NVIDIA NeMo Guardrails: LLM safety rails, content filtering
- Guardrails AI: Validate LLM outputs against specs
- AWS Bedrock Guardrails: Content filtering, PII detection
- LangKit: LLM observability and security
- Rebuff: Prompt injection detection
Security Scanning
- Garak: LLM vulnerability scanner
- PyRIT: Microsoft's AI red team tool
- PromptInject: Prompt injection tester
- Adversarial Robustness Toolbox (ART): IBM's defense toolkit
- CleverHans: Adversarial example library
Model Security
- ModelScan: Malware scanning for ML models
- Modelsafe: Model integrity verification
- Hugging Face Security Scanner: Model card validation
- ONNX Model Zoo Security: Verified models
Privacy Tools
- Opacus: PyTorch differential privacy
- TensorFlow Privacy: DP-SGD implementation
- PySyft: Federated learning and encrypted ML
- Microsoft Presidio: PII detection and anonymization
