LLM Evaluation


Evaluation Frameworks

| Framework | Provider | Key Features | Best For |
|---|---|---|---|
| HELM | Stanford | Holistic evaluation across 42 scenarios, 7 metric categories | Comprehensive model comparison |
| lm-evaluation-harness | EleutherAI | 200+ tasks, standardized prompts, reproducible benchmarking | Research benchmarking |
| DeepEval | Confident AI | 14+ metrics, pytest integration, CI/CD ready | Production testing |
| Inspect AI | UK AI Safety Institute | Agent evaluation, tool use testing, safety benchmarks | Agent & safety evaluation |
| RAGAS | Exploding Gradients | RAG-specific metrics: faithfulness, relevance, context | RAG pipeline evaluation |
| promptfoo | Open source | Prompt testing, A/B comparisons, assertions | Prompt engineering |
| Strands Evals SDK | AWS | Output evaluation, trajectory analysis, tool usage assessment, LLM-as-Judge, dynamic simulators for multi-turn testing | Agent evaluation |

Benchmark Datasets

MMLU

57 subjects from STEM to humanities. Tests world knowledge and problem-solving.

Knowledge · 57 Tasks

HumanEval

164 hand-written Python problems. Tests code generation with pass@k metric.

Coding · Python
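The pass@k metric used by HumanEval can be computed with the standard unbiased estimator from the original HumanEval paper: generate n samples per problem, count the c correct ones, and estimate the probability that at least one of k drawn samples passes. A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples,
    drawn without replacement from n generations (c correct), passes."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# With n=10 samples and c=3 correct, pass@1 reduces to c/n = 0.3
print(round(pass_at_k(10, 3, 1), 4))  # 0.3
```

Averaging `pass_at_k` across all 164 problems yields the reported benchmark score.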

GSM8K

8.5K grade school math problems. Tests multi-step reasoning.

Math · Reasoning

HellaSwag

Commonsense reasoning about everyday situations. Tests world model.

Commonsense · Completion

MT-Bench

Multi-turn conversation quality. 80 questions across 8 categories.

Chat · Multi-turn

TruthfulQA

817 questions testing truthfulness vs common misconceptions.

Truthfulness · Factuality

Evaluation Metrics

| Metric | Type | What It Measures | Range |
|---|---|---|---|
| BLEU | N-gram | N-gram overlap with reference text | 0-1 (higher = better) |
| ROUGE-L | N-gram | Longest common subsequence with reference | 0-1 (higher = better) |
| BERTScore | Semantic | Semantic similarity using BERT embeddings | 0-1 (higher = better) |
| Perplexity | Probabilistic | How surprised the model is by the text | 1-∞ (lower = better) |
| pass@k | Functional | Probability of correct code in k samples | 0-1 (higher = better) |
| Exact Match | Deterministic | Exact string match with expected answer | 0-1 (higher = better) |
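ROUGE-L is one of the simpler metrics above to implement from scratch: it is just an F1 score over the longest common subsequence of tokens. A minimal sketch, tokenizing naively on whitespace:

```python
def lcs_length(a: list, b: list) -> int:
    # Dynamic-programming longest common subsequence over token lists.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F1: harmonic mean of LCS precision and recall."""
    c, r = candidate.split(), reference.split()
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)
```

Production evaluations typically use a library implementation with proper tokenization and stemming; this sketch only illustrates the underlying computation.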

LLM-as-Judge Patterns

Single Judge

One LLM scores output against criteria (1-5 scale or pass/fail). Simple, but susceptible to biases such as verbosity preference and self-preference.

Use: Quick quality checks, filtering

Pairwise Comparison

LLM compares two outputs and picks the better one. More reliable than absolute scoring, but swap or randomize presentation order to control position bias.

Use: A/B testing, model comparison
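One common way to control position bias in pairwise comparison is to query the judge twice with the answers swapped and only declare a winner when both orderings agree. A minimal sketch, where `judge` is a hypothetical callable (not from the source) returning `"first"`, `"second"`, or `"tie"`:

```python
def pairwise_winner(question: str, answer_a: str, answer_b: str, judge) -> str:
    """Run the judge in both orders; a disagreement collapses to a tie,
    which cancels out a judge that simply favors one position."""
    first_pass = judge(question, answer_a, answer_b)
    second_pass = judge(question, answer_b, answer_a)
    if first_pass == "first" and second_pass == "second":
        return "A"  # A won from both positions
    if first_pass == "second" and second_pass == "first":
        return "B"  # B won from both positions
    return "tie"
```

A judge that always prefers the first-listed answer produces `"tie"` under this scheme rather than a spurious win.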

Reference-Guided

LLM evaluates output against a gold reference answer. Good for factual accuracy.

Use: QA evaluation, fact checking

Multi-Judge Panel

Multiple LLMs vote or average scores. Reduces individual model biases.

Use: High-stakes evaluation, research

Judge Prompt Template

You are evaluating an AI response. Score from 1-5 on:
- Relevance: Does it address the question?
- Accuracy: Is the information correct?
- Completeness: Does it cover all aspects?

Question: {question}
Response: {response}

Provide scores and brief justification.
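In practice the template above is filled per example and the judge's free-text reply is parsed back into structured scores. A minimal sketch; the regex-based parser is an illustrative assumption, and real pipelines often request JSON output instead:

```python
import re

JUDGE_TEMPLATE = """You are evaluating an AI response. Score from 1-5 on:
- Relevance: Does it address the question?
- Accuracy: Is the information correct?
- Completeness: Does it cover all aspects?

Question: {question}
Response: {response}

Provide scores and brief justification."""

def parse_scores(judge_output: str) -> dict:
    """Extract 'Criterion: N' pairs from free-text judge output."""
    pattern = r"(Relevance|Accuracy|Completeness)\s*:\s*([1-5])"
    return {name: int(score) for name, score in re.findall(pattern, judge_output)}

prompt = JUDGE_TEMPLATE.format(
    question="What is BLEU?", response="An n-gram overlap metric."
)
scores = parse_scores("Relevance: 5\nAccuracy: 4\nCompleteness: 3\nSolid answer.")
```

Validating that all expected criteria were parsed (and retrying otherwise) guards against judges that drift from the requested format.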

RAG Evaluation Metrics

| Metric | Component | What It Measures |
|---|---|---|
| Context Precision | Retrieval | Are retrieved chunks relevant to the query? |
| Context Recall | Retrieval | Did we retrieve all relevant information? |
| Faithfulness | Generation | Is the answer grounded in retrieved context? |
| Answer Relevance | Generation | Does the answer address the original question? |
| Answer Correctness | End-to-End | Is the final answer factually correct? |
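The two retrieval metrics reduce to simple ratios once relevance judgments are available. A minimal sketch, where `is_relevant` and `supports` are hypothetical callables (in frameworks like RAGAS these judgments come from an LLM):

```python
def context_precision(retrieved_chunks: list, is_relevant) -> float:
    """Fraction of retrieved chunks judged relevant to the query."""
    if not retrieved_chunks:
        return 0.0
    return sum(bool(is_relevant(c)) for c in retrieved_chunks) / len(retrieved_chunks)

def context_recall(retrieved_chunks: list, gold_facts: list, supports) -> float:
    """Fraction of gold facts supported by at least one retrieved chunk."""
    if not gold_facts:
        return 1.0
    covered = sum(
        any(supports(chunk, fact) for chunk in retrieved_chunks)
        for fact in gold_facts
    )
    return covered / len(gold_facts)
```

Precision penalizes noisy retrieval; recall penalizes missing evidence, so the two are usually reported together.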

See RAG Architecture for detailed RAG evaluation patterns.


Evaluation Best Practices

Avoid Contamination

Ensure test data wasn't in the training set. Use held-out sets and recent data.
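A common contamination screen (used, for example, in the GPT-3 paper) flags test examples that share long n-grams with the training corpus. A minimal whitespace-tokenized sketch; production checks use larger n and normalized tokenization:

```python
def ngrams(text: str, n: int = 8) -> set:
    """Set of lowercase whitespace-token n-grams in a string."""
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def contamination_rate(test_examples: list, train_ngrams: set, n: int = 8) -> float:
    """Fraction of test examples sharing at least one n-gram with training data."""
    flagged = sum(bool(ngrams(ex, n) & train_ngrams) for ex in test_examples)
    return flagged / len(test_examples)
```

Flagged examples are typically removed or reported separately rather than silently kept.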

Multiple Metrics

No single metric captures everything. Use a combination of automated and human eval.

Statistical Significance

Run multiple trials. Report confidence intervals, not just point estimates.
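A percentile bootstrap is a simple, distribution-free way to attach a confidence interval to a mean eval score. A minimal sketch using only the standard library:

```python
import random

def bootstrap_ci(scores: list, n_resamples: int = 2000,
                 alpha: float = 0.05, seed: int = 0) -> tuple:
    """Percentile bootstrap confidence interval for the mean score."""
    rng = random.Random(seed)
    means = sorted(
        sum(rng.choices(scores, k=len(scores))) / len(scores)
        for _ in range(n_resamples)
    )
    lo = means[int(alpha / 2 * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi
```

When comparing two models, bootstrap the *difference* in their per-example scores: if that interval excludes zero, the gap is unlikely to be sampling noise.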

Task-Specific Evals

Generic benchmarks don't predict task performance. Build evals for your use case.

Human Baseline

Compare to human performance where possible. Understand ceiling and floor.

Versioned Datasets

Track eval dataset versions. Results aren't comparable across different versions.
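One lightweight way to version an eval dataset is a content hash logged alongside every result, so any silent edit to the data is detectable. A minimal sketch, assuming examples are JSON-serializable dicts:

```python
import hashlib
import json

def dataset_fingerprint(examples: list) -> str:
    """Order-independent SHA-256 fingerprint of an eval dataset.
    Sorting canonical JSON strings makes the hash stable under reordering."""
    h = hashlib.sha256()
    for row in sorted(json.dumps(ex, sort_keys=True) for ex in examples):
        h.update(row.encode("utf-8"))
    return h.hexdigest()[:16]
```

Storing the fingerprint with each eval run makes it immediately obvious when two scores were produced against different dataset versions.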