# LLM Evaluation
## Evaluation Frameworks
| Framework | Provider | Key Features | Best For |
|---|---|---|---|
| HELM | Stanford | Holistic evaluation across 42 scenarios, 7 metrics categories | Comprehensive model comparison |
| lm-evaluation-harness | EleutherAI | 200+ tasks, standardized prompts, reproducible benchmarking | Research benchmarking |
| DeepEval | Confident AI | 14+ metrics, pytest integration, CI/CD ready | Production testing |
| Inspect AI | UK AI Safety Institute | Agent evaluation, tool use testing, safety benchmarks | Agent & safety evaluation |
| RAGAS | Explodinggradients | RAG-specific metrics: faithfulness, relevance, context | RAG pipeline evaluation |
| promptfoo | Open Source | Prompt testing, A/B comparisons, assertions | Prompt engineering |
| Strands Evals SDK | AWS | Output evaluation, trajectory analysis, tool usage assessment, LLM-as-Judge, dynamic simulators for multi-turn testing | Agent evaluation |
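Whatever the framework, the core loop is the same: run the model on a set of tasks, score each output with a metric, and aggregate. A minimal sketch of that loop (the `model`, `tasks`, and `metric` names here are hypothetical stand-ins, not any framework's real API):

```python
# Minimal sketch of the evaluate-and-score loop these frameworks automate.
from typing import Callable

def run_eval(model: Callable[[str], str],
             tasks: list[dict],
             metric: Callable[[str, str], float]) -> dict:
    """Run `model` on each task prompt and aggregate a single metric."""
    scores = [metric(model(t["prompt"]), t["expected"]) for t in tasks]
    return {"n": len(scores), "mean": sum(scores) / len(scores)}

def exact_match(prediction: str, expected: str) -> float:
    """1.0 if the stripped strings match exactly, else 0.0."""
    return float(prediction.strip() == expected.strip())

# Toy usage with a fake "model" that returns canned answers
tasks = [{"prompt": "2+2?", "expected": "4"},
         {"prompt": "Capital of France?", "expected": "Paris"}]
fake_model = lambda p: {"2+2?": "4", "Capital of France?": "Lyon"}[p]
print(run_eval(fake_model, tasks, exact_match))  # mean = 0.5
```

What the real frameworks add on top of this loop is standardized prompts, versioned task definitions, and reproducible reporting.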
## Benchmark Datasets
### MMLU
57 subjects from STEM to humanities. Tests world knowledge and problem-solving.
### HumanEval
164 hand-written Python problems. Tests code generation with the pass@k metric.
### GSM8K
8.5K grade-school math word problems. Tests multi-step arithmetic reasoning.
### HellaSwag
Sentence-completion benchmark for commonsense reasoning about everyday situations.
### MT-Bench
Multi-turn conversation quality. 80 questions across 8 categories, scored by an LLM judge.
### TruthfulQA
817 questions testing truthfulness against common misconceptions.
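HumanEval's pass@k has a well-known unbiased estimator from the original paper: draw n samples per problem, count the c that pass the tests, and compute 1 − C(n−c, k)/C(n, k) rather than naively sampling k of n:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (HumanEval paper):
    1 - C(n - c, k) / C(n, k), given n samples of which c passed."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# 3 of 10 samples passed: pass@1 reduces to c/n = 0.3
print(pass_at_k(10, 3, 1))
```

For k = 1 the estimator reduces to c/n; for larger k it avoids the high variance of averaging over explicit k-sized draws.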
## Evaluation Metrics
| Metric | Type | What It Measures | Range |
|---|---|---|---|
| BLEU | N-gram | N-gram overlap with reference text | 0-1 (higher = better) |
| ROUGE-L | N-gram | Longest common subsequence with reference | 0-1 (higher = better) |
| BERTScore | Semantic | Semantic similarity using BERT embeddings | 0-1 (higher = better) |
| Perplexity | Probabilistic | How surprised the model is by the text (exponentiated average negative log-likelihood) | 1-∞ (lower = better) |
| pass@k | Functional | Probability of correct code in k samples | 0-1 (higher = better) |
| Exact Match | Deterministic | Exact string match with expected answer | 0-1 (higher = better) |
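Perplexity is straightforward to compute once you have per-token log-probabilities from the model:

```python
import math

def perplexity(token_logprobs: list[float]) -> float:
    """Perplexity = exp of the average negative log-likelihood per token.
    `token_logprobs` are natural-log probabilities assigned by the model."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# A model assigning uniform probability 1/4 to every token has perplexity 4
print(perplexity([math.log(0.25)] * 8))  # 4.0
```

Note that perplexities are only comparable between models that share a tokenizer, since the metric is per-token.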
## LLM-as-Judge Patterns
### Single Judge
One LLM scores output against criteria (1-5 scale or pass/fail). Simple, but subject to judge biases such as verbosity and self-preference.
Use: Quick quality checks, filtering
### Pairwise Comparison
LLM compares two outputs and picks the better one. More reliable than absolute scoring; swap the candidates' order to control for position bias.
Use: A/B testing, model comparison
### Reference-Guided
LLM evaluates output against a gold reference answer. Good for factual accuracy.
Use: QA evaluation, fact checking
### Multi-Judge Panel
Multiple LLMs vote or average scores. Reduces individual model biases.
Use: High-stakes evaluation, research
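Panel aggregation itself is simple; a sketch of the two common schemes (the judge names here are illustrative placeholders):

```python
from statistics import mean, median

def panel_score(judge_scores: dict[str, float], method: str = "mean") -> float:
    """Aggregate per-judge scores (e.g. on a 1-5 scale) across a panel.
    Median is more robust to a single outlier judge than the mean."""
    values = list(judge_scores.values())
    return mean(values) if method == "mean" else median(values)

def panel_verdict(judge_votes: dict[str, bool]) -> bool:
    """Strict majority vote for pass/fail judging."""
    votes = list(judge_votes.values())
    return sum(votes) > len(votes) / 2

scores = {"judge_a": 4.0, "judge_b": 5.0, "judge_c": 2.0}
print(panel_score(scores))             # 3.666...
print(panel_score(scores, "median"))   # 4.0
print(panel_verdict({"judge_a": True, "judge_b": True, "judge_c": False}))  # True
```

Use an odd number of judges for pass/fail panels so the majority vote cannot tie.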
### Judge Prompt Template
You are evaluating an AI response. Score from 1-5 on:
- Relevance: Does it address the question?
- Accuracy: Is the information correct?
- Completeness: Does it cover all aspects?
Question: {question}
Response: {response}
Provide scores and brief justification.
## RAG Evaluation Metrics
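In practice you fill the template and then parse the judge's free-text reply back into structured scores. A minimal sketch (the regex-based parser assumes the judge follows the requested `Criterion: N` format; production code should handle, or re-ask on, malformed output):

```python
import re

JUDGE_TEMPLATE = """You are evaluating an AI response. Score from 1-5 on:
- Relevance: Does it address the question?
- Accuracy: Is the information correct?
- Completeness: Does it cover all aspects?

Question: {question}
Response: {response}

Provide scores and brief justification."""

def build_prompt(question: str, response: str) -> str:
    """Fill the judge template with the item under evaluation."""
    return JUDGE_TEMPLATE.format(question=question, response=response)

def parse_scores(judge_output: str) -> dict[str, int]:
    """Extract 'Criterion: N' pairs from the judge's reply."""
    pattern = r"(Relevance|Accuracy|Completeness)\s*:\s*([1-5])"
    return {name: int(score) for name, score in re.findall(pattern, judge_output)}

reply = "Relevance: 5\nAccuracy: 4\nCompleteness: 3\nCovers the basics well."
print(parse_scores(reply))  # {'Relevance': 5, 'Accuracy': 4, 'Completeness': 3}
```

Asking the judge for structured (e.g. JSON) output makes this parsing step more reliable than regex over prose.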
| Metric | Component | What It Measures |
|---|---|---|
| Context Precision | Retrieval | Are retrieved chunks relevant to the query? |
| Context Recall | Retrieval | Did we retrieve all relevant information? |
| Faithfulness | Generation | Is the answer grounded in retrieved context? |
| Answer Relevance | Generation | Does the answer address the original question? |
| Answer Correctness | End-to-End | Is the final answer factually correct? |
See RAG Architecture for detailed RAG evaluation patterns.
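Context precision is rank-aware: relevant chunks near the top of the retrieved list count for more. One common formulation (used by RAGAS-style evaluators) averages precision@k over the ranks where a relevant chunk appears; the per-chunk relevance labels are typically produced by an LLM judging each chunk against the query:

```python
def context_precision(relevance: list[int]) -> float:
    """Rank-aware context precision: mean of precision@k taken at each
    relevant position. `relevance` holds 1/0 per retrieved chunk, in
    rank order."""
    hits, score = 0, 0.0
    for k, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            score += hits / k  # precision@k at this relevant position
    return score / hits if hits else 0.0

# Relevant chunks at ranks 1 and 3 of 3 retrieved:
# (1/1 + 2/3) / 2 = 0.833...
print(context_precision([1, 0, 1]))
```

An irrelevant chunk ranked above a relevant one drags the score down, which is exactly the reranking failure this metric is meant to surface.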
## Evaluation Best Practices
### Avoid Contamination
Ensure test data wasn't in the training set. Use held-out sets and recent data.
### Multiple Metrics
No single metric captures everything. Use a combination of automated and human evaluation.
### Statistical Significance
Run multiple trials. Report confidence intervals, not just point estimates.
### Task-Specific Evals
Generic benchmarks don't predict task performance. Build evals for your use case.
### Human Baseline
Compare to human performance where possible. Understand the ceiling and floor.
### Versioned Datasets
Track eval dataset versions. Results aren't comparable across different versions.
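For the confidence intervals mentioned above, a percentile bootstrap over per-item scores is a simple, assumption-light option:

```python
import random

def bootstrap_ci(scores: list[float], n_resamples: int = 2000,
                 alpha: float = 0.05, seed: int = 0) -> tuple[float, float]:
    """Percentile bootstrap confidence interval for the mean eval score:
    resample the per-item scores with replacement, collect the resampled
    means, and take the alpha/2 and 1-alpha/2 percentiles."""
    rng = random.Random(seed)
    n = len(scores)
    means = sorted(
        sum(rng.choices(scores, k=n)) / n for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# 0/1 accuracy over 100 hypothetical test items
scores = [1.0] * 72 + [0.0] * 28
lo, hi = bootstrap_ci(scores)
print(f"accuracy = 0.72, 95% CI ~ [{lo:.2f}, {hi:.2f}]")
```

Overlapping intervals between two models are a warning that an apparent ranking may not survive a rerun; for a sharper comparison, bootstrap the per-item score *differences* instead.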
