Prompt Engineering
Prompting Techniques
Directly ask the model to perform a task without providing examples. Relies on the model's pre-trained knowledge and instruction-following capabilities. Works best for straightforward tasks where the model has strong prior knowledge.
- No examples needed in prompt
- Fastest to implement
- Lower token usage
- Works for common tasks
- May lack precision for complex tasks
Provide 2-5 examples of input-output pairs before the actual query. Helps the model understand the expected format, style, and reasoning pattern. Critical for tasks requiring specific output formats or domain knowledge.
- Examples guide model behavior
- Improves output consistency
- Demonstrates expected format
- Better for complex/novel tasks
- Higher token usage
Encourage step-by-step reasoning by asking the model to 'think through' the problem or by showing reasoning examples. Dramatically improves performance on math, logic, and multi-step reasoning tasks.
- Explicit reasoning steps
- Better for math/logic problems
- Reduces errors in complex tasks
- Can be zero-shot ('Let's think step by step')
- Higher latency due to longer outputs
Generate multiple reasoning paths and select the most common answer. Samples diverse Chain-of-Thought responses and uses majority voting. Significantly improves accuracy at the cost of multiple API calls.
- Multiple reasoning samples
- Majority voting for answer
- Higher accuracy on reasoning tasks
- Temperature > 0 for diversity
- Higher cost (multiple calls)
Explore multiple reasoning branches systematically, evaluating and pruning paths. Combines deliberate search with LLM reasoning. Best for complex problems requiring exploration like puzzles, planning, or creative tasks.
- Branching reasoning paths
- Self-evaluation of branches
- Backtracking capability
- BFS or DFS search strategies
- High token usage, best for hard problems
Interleave reasoning traces with actions (tool calls). Model thinks about what to do, executes an action, observes the result, and continues. Foundation for most modern AI agents and tool-using systems.
- Thought-Action-Observation loop
- Integrates with external tools
- Transparent reasoning process
- Handles multi-step tasks
- Core pattern for AI agents
When Chain-of-Thought Isn't What It Seems
CoT improves performance on many tasks, but interpretability research from Anthropic found that the relationship between written reasoning and actual internal computation isn't always faithful.
Faithful CoT
On tractable problems, the written reasoning trace genuinely reflects internal computation. Asked to compute the square root of 0.64, Claude's internal features represented the intermediate step of computing √64 — the explanation matched the process.
Post-Hoc Reconstruction
On harder problems, the model can generate a plausible-looking derivation after the fact, without any corresponding internal calculation. The chain-of-thought is a performance — constructed to look like reasoning rather than recording it. When given a hint about an expected answer, models engage in motivated reasoning: working backwards from the target to construct justifying steps.
When you ask a model to "show its work," you may be getting a plausible reconstruction rather than a faithful record. CoT-based evaluation is most reliable for tasks where the reasoning trace can be independently verified — and least reliable for hard problems where the model might not actually know the answer.
Prompt Patterns
| Pattern | Description | Example Use | Best For |
|---|---|---|---|
| Persona | Assign a role or character to the model | "You are an expert Python developer..." | Domain expertise, tone control |
| Template | Structured format with placeholders | "Given {context}, answer {question}" | Consistent outputs, automation |
| Structured Output | Request specific format (JSON, XML, Markdown) | "Respond in valid JSON with keys: name, description" | API integration, parsing |
| System Prompt | Persistent instructions for conversation context | Setting behavior, constraints, guardrails | Chatbots, assistants |
| Delimiter | Use markers to separate prompt sections | ###, ```, <context></context> | Long prompts, multi-part inputs |
| Output Primer | Start the response to guide format | "The answer is: {" | Forcing specific formats |
Advanced Techniques
Use an LLM to generate or optimize prompts for another task. Have the model analyze, critique, and improve prompts iteratively. Enables automated prompt engineering at scale.
- LLM generates prompts
- Automated optimization
- Prompt critique and refinement
- A/B testing at scale
- Self-improving systems
Break complex tasks into sequential prompts where each output feeds into the next. Enables sophisticated workflows, error handling between steps, and specialized prompts per stage.
- Sequential prompt execution
- Output becomes next input
- Error handling per step
- Specialized prompts per stage
- Complex workflow orchestration
Define principles (a 'constitution') the model should follow, then have it self-critique and revise responses. Used by Anthropic for Claude's safety training. Can be applied in prompts for safer outputs.
- Define behavioral principles
- Self-critique against rules
- Iterative revision
- Harmlessness training
- Values alignment
Provide hints or keywords that guide the model toward a desired direction without fully specifying the answer. Useful for creative tasks where you want influence without over-constraining.
- Keyword hints
- Directional guidance
- Maintains creativity
- Subtle steering
- Good for generation tasks
Algorithmically search for optimal prompts using techniques like evolutionary search or gradient-based optimization. Tools like DSPy enable programmatic prompt optimization with evaluation metrics.
- Automated prompt search
- Evolutionary optimization
- Metric-driven selection
- DSPy framework
- Requires evaluation dataset
Decompose complex problems into simpler subproblems, solve them in order from easiest to hardest, with each solution informing the next. Effective for compositional reasoning.
- Problem decomposition
- Easiest to hardest ordering
- Progressive complexity
- Compositional reasoning
- Better than CoT for some tasks
Prompt Optimization
Systematically measure prompt quality using automated metrics and human evaluation. Track accuracy, relevance, format compliance, latency, and cost. Build evaluation datasets and run regression tests.
- Accuracy/correctness measurement
- Format compliance checking
- Latency and cost tracking
- Human evaluation workflows
- Regression test suites
Compare prompt variants systematically in production. Track key metrics, statistical significance, and user feedback. Iterate based on data rather than intuition.
- Variant comparison
- Statistical significance testing
- Production traffic splitting
- User feedback integration
- Continuous improvement
Version control prompts like code. Track changes, enable rollbacks, maintain audit trails, and manage deployment across environments. Critical for production prompt management.
- Git-like version control
- Change tracking and diffs
- Rollback capability
- Environment management
- Audit trails
Minimize token usage while maintaining quality. Compress verbose prompts, remove redundancy, use efficient encodings, and cache common prompt prefixes.
- Prompt compression
- Redundancy removal
- Efficient instruction writing
- Prompt caching (Anthropic)
- Cost reduction
Prompt Security
| Threat | Description | Mitigation | Tools |
|---|---|---|---|
| Prompt Injection | User input manipulates system behavior | Input sanitization, delimiters, instruction hierarchy | Guardrails AI, Rebuff, LLM Guard |
| Jailbreaking | Bypassing safety guidelines | Multi-layer filtering, output validation | OpenAI Moderation, Perspective API |
| Data Leakage | Extracting training data or system prompts | Don't include secrets in prompts, output filtering | Presidio, custom regex filters |
| Indirect Injection | Malicious instructions in retrieved content | Content sanitization, source verification | Input validation, content scanning |
Prompt Management Tools
LangChain's platform for prompt management, tracing, evaluation, and monitoring. Hub for sharing prompts, datasets for testing, and production observability for LLM applications.
- Prompt hub and versioning
- Trace visualization
- Evaluation datasets
- Production monitoring
- LangChain integration
Open-source tool for testing and evaluating prompts. Define test cases in YAML, run against multiple providers, compare outputs, and catch regressions. CI/CD integration for prompt testing.
- YAML test definitions
- Multi-provider testing
- Assertion-based evaluation
- CI/CD integration
- Open source
Enterprise platform for prompt management with collaboration features. Version control, A/B testing, fine-tuning management, and analytics. Built for teams managing prompts at scale.
- Collaborative editing
- Version control
- A/B testing
- Fine-tuning integration
- Enterprise analytics
Middleware for logging and managing prompts. Wraps API calls to track all requests, responses, and latency. Template management, versioning, and analytics dashboard.
- Request/response logging
- Template management
- Latency tracking
- Analytics dashboard
- Easy integration
Stanford framework for programmatic prompt optimization. Define modules and signatures, then compile to optimized prompts using training data. Enables systematic prompt engineering with code.
- Programmatic prompts
- Signature-based modules
- Automatic optimization
- Training data compilation
- Reproducible pipelines
Microsoft's library for constrained generation. Define output structure with templates, enforce JSON schemas, control generation token-by-token. More reliable structured outputs.
- Template-based generation
- Schema enforcement
- Token-level control
- Interleaved generation
- Reliable JSON output
