# LLM Evaluation
## Evaluation Frameworks
| Framework | Provider | Key Features | Best For |
|---|---|---|---|
| HELM | Stanford | Holistic evaluation across 42 scenarios, 7 metrics categories | Comprehensive model comparison |
| lm-evaluation-harness | EleutherAI | 200+ tasks, standardized prompts, reproducible benchmarking | Research benchmarking |
| DeepEval | Confident AI | 14+ metrics, pytest integration, CI/CD ready | Production testing |
| Inspect AI | UK AI Safety Institute | Agent evaluation, tool use testing, safety benchmarks | Agent & safety evaluation |
| RAGAS | Explodinggradients | RAG-specific metrics: faithfulness, relevance, context | RAG pipeline evaluation |
| promptfoo | Open Source | Prompt testing, A/B comparisons, assertions | Prompt engineering |
| Strands Evals SDK | AWS | Output evaluation, trajectory analysis, tool usage assessment, LLM-as-Judge, dynamic simulators for multi-turn testing | Agent evaluation |
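Whatever the framework, the core loop is the same: run the model on a set of tasks, score each output with a metric, and aggregate. A minimal sketch of that loop (the `model`, `tasks`, and `metric` names here are hypothetical stand-ins, not any framework's real API):

```python
# Minimal sketch of the evaluate-and-score loop these frameworks automate.
from typing import Callable

def run_eval(model: Callable[[str], str],
             tasks: list[dict],
             metric: Callable[[str, str], float]) -> dict:
    """Run `model` on each task prompt and aggregate a single metric."""
    scores = [metric(model(t["prompt"]), t["expected"]) for t in tasks]
    return {"n": len(scores), "mean": sum(scores) / len(scores)}

def exact_match(prediction: str, expected: str) -> float:
    """1.0 if the stripped strings match exactly, else 0.0."""
    return float(prediction.strip() == expected.strip())

# Toy usage with a fake "model" that returns canned answers
tasks = [{"prompt": "2+2?", "expected": "4"},
         {"prompt": "Capital of France?", "expected": "Paris"}]
fake_model = lambda p: {"2+2?": "4", "Capital of France?": "Lyon"}[p]
print(run_eval(fake_model, tasks, exact_match))  # mean = 0.5
```

What the real frameworks add on top of this loop is standardized prompts, versioned task definitions, and reproducible reporting.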
## Benchmark Datasets
### MMLU
57 subjects from STEM to humanities. Tests world knowledge and problem-solving.
### HumanEval
164 hand-written Python problems. Tests code generation with the pass@k metric.
### GSM8K
8.5K grade-school math word problems. Tests multi-step arithmetic reasoning.
### HellaSwag
Sentence-completion benchmark for commonsense reasoning about everyday situations.
### MT-Bench
Multi-turn conversation quality. 80 questions across 8 categories, scored by an LLM judge.
### TruthfulQA
817 questions testing truthfulness against common misconceptions.
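HumanEval's pass@k has a well-known unbiased estimator from the original paper: draw n samples per problem, count the c that pass the tests, and compute 1 − C(n−c, k)/C(n, k) rather than naively sampling k of n:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (HumanEval paper):
    1 - C(n - c, k) / C(n, k), given n samples of which c passed."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# 3 of 10 samples passed: pass@1 reduces to c/n = 0.3
print(pass_at_k(10, 3, 1))
```

For k = 1 the estimator reduces to c/n; for larger k it avoids the high variance of averaging over explicit k-sized draws.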
## Evaluation Metrics
| Metric | Type | What It Measures | Range |
|---|---|---|---|
| BLEU | N-gram | N-gram overlap with reference text | 0-1 (higher = better) |
| ROUGE-L | N-gram | Longest common subsequence with reference | 0-1 (higher = better) |
| BERTScore | Semantic | Semantic similarity using BERT embeddings | 0-1 (higher = better) |
| Perplexity | Probabilistic | How surprised the model is by the text (exponentiated average negative log-likelihood) | 1-∞ (lower = better) |
| pass@k | Functional | Probability of correct code in k samples | 0-1 (higher = better) |
| Exact Match | Deterministic | Exact string match with expected answer | 0-1 (higher = better) |
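Perplexity is straightforward to compute once you have per-token log-probabilities from the model:

```python
import math

def perplexity(token_logprobs: list[float]) -> float:
    """Perplexity = exp of the average negative log-likelihood per token.
    `token_logprobs` are natural-log probabilities assigned by the model."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# A model assigning uniform probability 1/4 to every token has perplexity 4
print(perplexity([math.log(0.25)] * 8))  # 4.0
```

Note that perplexities are only comparable between models that share a tokenizer, since the metric is per-token.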
## LLM-as-Judge Patterns
### Single Judge
One LLM scores output against criteria (1-5 scale or pass/fail). Simple, but subject to judge biases such as verbosity and self-preference.
Use: Quick quality checks, filtering
### Pairwise Comparison
LLM compares two outputs and picks the better one. More reliable than absolute scoring; swap the candidates' order to control for position bias.
Use: A/B testing, model comparison
### Reference-Guided
LLM evaluates output against a gold reference answer. Good for factual accuracy.
Use: QA evaluation, fact checking
### Multi-Judge Panel
Multiple LLMs vote or average scores. Reduces individual model biases.
Use: High-stakes evaluation, research
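Panel aggregation itself is simple; a sketch of the two common schemes (the judge names here are illustrative placeholders):

```python
from statistics import mean, median

def panel_score(judge_scores: dict[str, float], method: str = "mean") -> float:
    """Aggregate per-judge scores (e.g. on a 1-5 scale) across a panel.
    Median is more robust to a single outlier judge than the mean."""
    values = list(judge_scores.values())
    return mean(values) if method == "mean" else median(values)

def panel_verdict(judge_votes: dict[str, bool]) -> bool:
    """Strict majority vote for pass/fail judging."""
    votes = list(judge_votes.values())
    return sum(votes) > len(votes) / 2

scores = {"judge_a": 4.0, "judge_b": 5.0, "judge_c": 2.0}
print(panel_score(scores))             # 3.666...
print(panel_score(scores, "median"))   # 4.0
print(panel_verdict({"judge_a": True, "judge_b": True, "judge_c": False}))  # True
```

Use an odd number of judges for pass/fail panels so the majority vote cannot tie.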
### Judge Prompt Template
You are evaluating an AI response. Score from 1-5 on:
- Relevance: Does it address the question?
- Accuracy: Is the information correct?
- Completeness: Does it cover all aspects?
Question: {question}
Response: {response}
Provide scores and brief justification.
## RAG Evaluation Metrics
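In practice you fill the template and then parse the judge's free-text reply back into structured scores. A minimal sketch (the regex-based parser assumes the judge follows the requested `Criterion: N` format; production code should handle, or re-ask on, malformed output):

```python
import re

JUDGE_TEMPLATE = """You are evaluating an AI response. Score from 1-5 on:
- Relevance: Does it address the question?
- Accuracy: Is the information correct?
- Completeness: Does it cover all aspects?

Question: {question}
Response: {response}

Provide scores and brief justification."""

def build_prompt(question: str, response: str) -> str:
    """Fill the judge template with the item under evaluation."""
    return JUDGE_TEMPLATE.format(question=question, response=response)

def parse_scores(judge_output: str) -> dict[str, int]:
    """Extract 'Criterion: N' pairs from the judge's reply."""
    pattern = r"(Relevance|Accuracy|Completeness)\s*:\s*([1-5])"
    return {name: int(score) for name, score in re.findall(pattern, judge_output)}

reply = "Relevance: 5\nAccuracy: 4\nCompleteness: 3\nCovers the basics well."
print(parse_scores(reply))  # {'Relevance': 5, 'Accuracy': 4, 'Completeness': 3}
```

Asking the judge for structured (e.g. JSON) output makes this parsing step more reliable than regex over prose.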
| Metric | Component | What It Measures |
|---|---|---|
| Context Precision | Retrieval | Are retrieved chunks relevant to the query? |
| Context Recall | Retrieval | Did we retrieve all relevant information? |
| Faithfulness | Generation | Is the answer grounded in retrieved context? |
| Answer Relevance | Generation | Does the answer address the original question? |
| Answer Correctness | End-to-End | Is the final answer factually correct? |
See RAG Architecture for detailed RAG evaluation patterns.
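Context precision is rank-aware: relevant chunks near the top of the retrieved list count for more. One common formulation (used by RAGAS-style evaluators) averages precision@k over the ranks where a relevant chunk appears; the per-chunk relevance labels are typically produced by an LLM judging each chunk against the query:

```python
def context_precision(relevance: list[int]) -> float:
    """Rank-aware context precision: mean of precision@k taken at each
    relevant position. `relevance` holds 1/0 per retrieved chunk, in
    rank order."""
    hits, score = 0, 0.0
    for k, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            score += hits / k  # precision@k at this relevant position
    return score / hits if hits else 0.0

# Relevant chunks at ranks 1 and 3 of 3 retrieved:
# (1/1 + 2/3) / 2 = 0.833...
print(context_precision([1, 0, 1]))
```

An irrelevant chunk ranked above a relevant one drags the score down, which is exactly the reranking failure this metric is meant to surface.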
## Evaluation Best Practices
### Avoid Contamination
Ensure test data wasn't in the training set. Use held-out sets and recent data.
### Multiple Metrics
No single metric captures everything. Use a combination of automated and human evaluation.
### Statistical Significance
Run multiple trials. Report confidence intervals, not just point estimates.
### Task-Specific Evals
Generic benchmarks don't predict task performance. Build evals for your use case.
### Human Baseline
Compare to human performance where possible. Understand the ceiling and floor.
### Versioned Datasets
Track eval dataset versions. Results aren't comparable across different versions.
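For the confidence intervals mentioned above, a percentile bootstrap over per-item scores is a simple, assumption-light option:

```python
import random

def bootstrap_ci(scores: list[float], n_resamples: int = 2000,
                 alpha: float = 0.05, seed: int = 0) -> tuple[float, float]:
    """Percentile bootstrap confidence interval for the mean eval score:
    resample the per-item scores with replacement, collect the resampled
    means, and take the alpha/2 and 1-alpha/2 percentiles."""
    rng = random.Random(seed)
    n = len(scores)
    means = sorted(
        sum(rng.choices(scores, k=n)) / n for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# 0/1 accuracy over 100 hypothetical test items
scores = [1.0] * 72 + [0.0] * 28
lo, hi = bootstrap_ci(scores)
print(f"accuracy = 0.72, 95% CI ~ [{lo:.2f}, {hi:.2f}]")
```

Overlapping intervals between two models are a warning that an apparent ranking may not survive a rerun; for a sharper comparison, bootstrap the per-item score *differences* instead.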
