evaluation-metrics
Automatically applies when evaluating LLM performance. Ensures proper eval datasets, metrics computation, A/B testing, LLM-as-judge patterns, and experiment tracking.
When & Why to Use This Skill
This Claude skill provides a comprehensive framework for evaluating Large Language Model (LLM) performance. It enables developers to build rigorous evaluation pipelines using structured datasets, diverse metrics (Exact Match, F1, Semantic Similarity), and advanced patterns like LLM-as-judge, A/B testing, and experiment tracking to ensure production-grade reliability and reproducibility.
Use Cases
- Benchmarking LLM Performance: Systematically evaluate model outputs against gold-standard datasets using quantitative metrics like F1, BLEU, and Semantic Similarity.
- Qualitative Assessment with LLM-as-Judge: Automate the evaluation of complex criteria such as tone, reasoning, and helpfulness by using a secondary LLM to grade model responses.
- A/B Testing Model Variants: Compare different models or prompt versions in a controlled environment to measure improvements in both output quality and system latency.
- Experiment Tracking and Versioning: Maintain a detailed log of evaluation runs, configurations, and results to ensure reproducibility and track performance trends over time.
- CI/CD Integration for LLMs: Implement automated evaluation checks in the development lifecycle to prevent performance regression before deploying new model updates.
| name | evaluation-metrics |
|---|---|
| description | Automatically applies when evaluating LLM performance. Ensures proper eval datasets, metrics computation, A/B testing, LLM-as-judge patterns, and experiment tracking. |
| category | ai-llm |
Evaluation Metrics for LLM Applications
When evaluating LLM performance, follow these patterns for rigorous, reproducible evaluation.
Trigger Keywords: evaluation, eval, metrics, benchmark, test set, A/B test, LLM judge, performance testing, accuracy, precision, recall, F1, BLEU, ROUGE, experiment tracking
Agent Integration: Used by ml-system-architect, performance-and-cost-engineer-llm, llm-app-engineer
✅ Correct Pattern: Evaluation Dataset
from typing import Any, Dict, List, Optional
from pydantic import BaseModel, Field
from datetime import datetime
import json
class EvalExample(BaseModel):
"""Single evaluation example."""
id: str
input: str
expected_output: str
    metadata: Dict[str, Any] = Field(default_factory=dict)
tags: List[str] = Field(default_factory=list)
class EvalDataset(BaseModel):
"""Evaluation dataset with metadata."""
name: str
description: str
version: str
created_at: datetime = Field(default_factory=datetime.utcnow)
examples: List[EvalExample]
def save(self, path: str):
"""Save dataset to JSON file."""
with open(path, "w") as f:
json.dump(self.model_dump(), f, indent=2, default=str)
@classmethod
def load(cls, path: str) -> "EvalDataset":
"""Load dataset from JSON file."""
with open(path) as f:
data = json.load(f)
return cls(**data)
def filter_by_tag(self, tag: str) -> "EvalDataset":
"""Filter dataset by tag."""
filtered = [ex for ex in self.examples if tag in ex.tags]
return EvalDataset(
name=f"{self.name}_{tag}",
description=f"Filtered by tag: {tag}",
version=self.version,
examples=filtered
)
# Create evaluation dataset
eval_dataset = EvalDataset(
name="summarization_eval",
description="Evaluation set for document summarization",
version="1.0",
examples=[
EvalExample(
id="sum_001",
input="Long document text...",
expected_output="Concise summary...",
tags=["short", "technical"]
),
EvalExample(
id="sum_002",
input="Another document...",
expected_output="Another summary...",
tags=["long", "business"]
)
]
)
eval_dataset.save("eval_data/summarization_v1.json")
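filter_by_tag can then slice the dataset into focused subsets for targeted evaluation, for example:
# Build a focused subset using the tags defined above
technical_subset = eval_dataset.filter_by_tag("technical")
print(f"{technical_subset.name}: {len(technical_subset.examples)} examples")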
Evaluation Metrics
from typing import Dict, List, Protocol
import numpy as np
class Metric(Protocol):
"""Protocol for evaluation metrics."""
def compute(
self,
predictions: List[str],
references: List[str]
) -> float:
"""Compute metric score."""
...
class ExactMatch:
"""Exact match metric (case-insensitive)."""
def compute(
self,
predictions: List[str],
references: List[str]
) -> float:
"""
Compute exact match accuracy.
Returns:
Fraction of exact matches (0-1)
"""
matches = sum(
p.strip().lower() == r.strip().lower()
for p, r in zip(predictions, references)
)
return matches / len(predictions)
class TokenOverlap:
"""Token overlap metric (precision, recall, F1)."""
def tokenize(self, text: str) -> set:
"""Simple whitespace tokenization."""
return set(text.lower().split())
def compute_f1(
self,
prediction: str,
reference: str
) -> Dict[str, float]:
"""
Compute precision, recall, F1 for single example.
Returns:
Dict with precision, recall, f1 scores
"""
pred_tokens = self.tokenize(prediction)
ref_tokens = self.tokenize(reference)
if not pred_tokens or not ref_tokens:
return {"precision": 0.0, "recall": 0.0, "f1": 0.0}
overlap = pred_tokens & ref_tokens
precision = len(overlap) / len(pred_tokens)
recall = len(overlap) / len(ref_tokens)
if precision + recall == 0:
f1 = 0.0
else:
f1 = 2 * (precision * recall) / (precision + recall)
return {
"precision": precision,
"recall": recall,
"f1": f1
}
def compute(
self,
predictions: List[str],
references: List[str]
) -> Dict[str, float]:
"""
Compute average metrics across all examples.
Returns:
Dict with average precision, recall, f1
"""
scores = [
self.compute_f1(p, r)
for p, r in zip(predictions, references)
]
return {
"precision": np.mean([s["precision"] for s in scores]),
"recall": np.mean([s["recall"] for s in scores]),
"f1": np.mean([s["f1"] for s in scores])
}
class SemanticSimilarity:
"""Semantic similarity using embeddings."""
def __init__(self, embedding_model):
self.embedding_model = embedding_model
async def compute(
self,
predictions: List[str],
references: List[str]
) -> float:
"""
Compute average cosine similarity.
Returns:
Average similarity score (0-1)
"""
# Embed predictions and references
pred_embeddings = await self.embedding_model.embed(predictions)
ref_embeddings = await self.embedding_model.embed(references)
# Compute cosine similarities
similarities = []
for pred_emb, ref_emb in zip(pred_embeddings, ref_embeddings):
similarity = np.dot(pred_emb, ref_emb) / (
np.linalg.norm(pred_emb) * np.linalg.norm(ref_emb)
)
similarities.append(similarity)
return float(np.mean(similarities))
# Usage
exact_match = ExactMatch()
token_overlap = TokenOverlap()
predictions = ["The cat sat on mat", "Python is great"]
references = ["The cat sat on the mat", "Python is awesome"]
em_score = exact_match.compute(predictions, references)
overlap_scores = token_overlap.compute(predictions, references)
print(f"Exact Match: {em_score:.2f}")
print(f"F1 Score: {overlap_scores['f1']:.2f}")
LLM-as-Judge Evaluation
from typing import Any, Dict, List, Optional

class LLMJudge:
"""Use LLM to evaluate outputs."""
def __init__(self, llm_client):
self.llm = llm_client
async def judge_single(
self,
input: str,
prediction: str,
reference: Optional[str] = None,
        criteria: Optional[List[str]] = None
    ) -> Dict[str, Any]:
"""
Evaluate single prediction using LLM.
Args:
input: Original input
prediction: Model prediction
reference: Optional reference answer
criteria: Evaluation criteria
Returns:
Dict with score and reasoning
"""
criteria = criteria or [
"accuracy",
"relevance",
"completeness",
"clarity"
]
prompt = self._build_judge_prompt(
input, prediction, reference, criteria
)
response = await self.llm.complete(prompt, temperature=0.0)
# Parse response (expects JSON)
import json
try:
result = json.loads(response)
return result
except json.JSONDecodeError:
return {
"score": 0,
"reasoning": "Failed to parse response",
"raw_response": response
}
def _build_judge_prompt(
self,
input: str,
prediction: str,
reference: Optional[str],
criteria: List[str]
) -> str:
"""Build prompt for LLM judge."""
criteria_str = ", ".join(criteria)
prompt = f"""Evaluate this model output on: {criteria_str}
Input:
{input}
Model Output:
{prediction}"""
if reference:
prompt += f"""
Reference Answer:
{reference}"""
prompt += """
Provide evaluation as JSON:
{
"score": <1-10>,
"reasoning": "<explanation>",
"criteria_scores": {
"accuracy": <1-10>,
"relevance": <1-10>,
...
}
}"""
return prompt
async def batch_judge(
self,
examples: List[Dict[str, str]],
        criteria: Optional[List[str]] = None
    ) -> List[Dict[str, Any]]:
"""
Judge multiple examples in batch.
Args:
examples: List of dicts with input, prediction, reference
criteria: Evaluation criteria
Returns:
List of judgment results
"""
import asyncio
tasks = [
self.judge_single(
input=ex["input"],
prediction=ex["prediction"],
reference=ex.get("reference"),
criteria=criteria
)
for ex in examples
]
return await asyncio.gather(*tasks)
# Usage
judge = LLMJudge(llm_client)
result = await judge.judge_single(
input="What is Python?",
prediction="Python is a programming language.",
reference="Python is a high-level programming language.",
criteria=["accuracy", "completeness", "clarity"]
)
print(f"Score: {result['score']}/10")
print(f"Reasoning: {result['reasoning']}")
A/B Testing Framework
from typing import Any, Callable, Dict, List, Optional
from dataclasses import dataclass
from datetime import datetime
import random
import numpy as np
@dataclass
class Variant:
"""A/B test variant."""
name: str
model_fn: Callable
traffic_weight: float = 0.5
@dataclass
class ABTestResult:
"""Result from A/B test."""
variant_name: str
example_id: str
prediction: str
metrics: Dict[str, float]
latency_ms: float
timestamp: datetime
class ABTest:
"""A/B testing framework for LLM variants."""
def __init__(
self,
name: str,
variants: List[Variant],
metrics: List[Metric]
):
self.name = name
self.variants = variants
self.metrics = metrics
self.results: List[ABTestResult] = []
# Normalize weights
total_weight = sum(v.traffic_weight for v in variants)
for v in variants:
v.traffic_weight /= total_weight
def select_variant(self) -> Variant:
"""Select variant based on traffic weight."""
r = random.random()
cumulative = 0.0
for variant in self.variants:
cumulative += variant.traffic_weight
if r <= cumulative:
return variant
return self.variants[-1]
async def run_test(
self,
eval_dataset: EvalDataset,
samples_per_variant: Optional[int] = None
    ) -> Dict[str, Any]:
"""
Run A/B test on evaluation dataset.
Args:
eval_dataset: Evaluation dataset
samples_per_variant: Samples per variant (None = all)
Returns:
Test results with metrics per variant
"""
import time
samples = samples_per_variant or len(eval_dataset.examples)
# Run predictions for each variant
for variant in self.variants:
for i, example in enumerate(eval_dataset.examples[:samples]):
start = time.time()
# Get prediction from variant
prediction = await variant.model_fn(example.input)
latency = (time.time() - start) * 1000
# Compute metrics
variant_metrics = {}
for metric in self.metrics:
                    score = metric.compute([prediction], [example.expected_output])
                    # Metrics like TokenOverlap return a dict of sub-scores;
                    # flatten them so analyze_results can aggregate plain floats
                    if isinstance(score, dict):
                        for sub_name, sub_score in score.items():
                            variant_metrics[f"{metric.__class__.__name__}_{sub_name}"] = sub_score
                    else:
                        variant_metrics[metric.__class__.__name__] = score
# Store result
self.results.append(ABTestResult(
variant_name=variant.name,
example_id=example.id,
prediction=prediction,
metrics=variant_metrics,
latency_ms=latency,
timestamp=datetime.utcnow()
))
return self.analyze_results()
    def analyze_results(self) -> Dict[str, Any]:
"""
Analyze A/B test results.
Returns:
Statistics per variant
"""
variant_stats = {}
for variant in self.variants:
variant_results = [
r for r in self.results
if r.variant_name == variant.name
]
if not variant_results:
continue
# Aggregate metrics
metric_names = variant_results[0].metrics.keys()
avg_metrics = {}
for metric_name in metric_names:
scores = [r.metrics[metric_name] for r in variant_results]
avg_metrics[metric_name] = {
"mean": np.mean(scores),
"std": np.std(scores),
"min": np.min(scores),
"max": np.max(scores)
}
# Latency stats
latencies = [r.latency_ms for r in variant_results]
variant_stats[variant.name] = {
"samples": len(variant_results),
"metrics": avg_metrics,
"latency": {
"mean_ms": np.mean(latencies),
"p50_ms": np.percentile(latencies, 50),
"p95_ms": np.percentile(latencies, 95),
"p99_ms": np.percentile(latencies, 99)
}
}
return variant_stats
# Usage
variants = [
Variant(
name="baseline",
model_fn=lambda x: model_v1.complete(x),
traffic_weight=0.5
),
Variant(
name="candidate",
model_fn=lambda x: model_v2.complete(x),
traffic_weight=0.5
)
]
ab_test = ABTest(
name="summarization_v1_vs_v2",
variants=variants,
metrics=[ExactMatch(), TokenOverlap()]
)
results = await ab_test.run_test(eval_dataset, samples_per_variant=100)
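analyze_results reports means and latency percentiles but says nothing about whether the difference between variants is statistically significant. A simple two-sided permutation test over per-example scores can be layered on top; this is a minimal numpy-only sketch, and the "ExactMatch" metric key is illustrative.
import numpy as np

def permutation_test(scores_a, scores_b, n_permutations: int = 10_000, seed: int = 0) -> float:
    """Two-sided permutation test for the difference in mean scores between two variants."""
    rng = np.random.default_rng(seed)
    scores_a = np.asarray(scores_a, dtype=float)
    scores_b = np.asarray(scores_b, dtype=float)
    observed = abs(scores_a.mean() - scores_b.mean())
    pooled = np.concatenate([scores_a, scores_b])
    count = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        perm_a = pooled[: len(scores_a)]
        perm_b = pooled[len(scores_a):]
        if abs(perm_a.mean() - perm_b.mean()) >= observed:
            count += 1
    return count / n_permutations

# Usage: compare per-example ExactMatch scores between the two variants
baseline_scores = [r.metrics["ExactMatch"] for r in ab_test.results if r.variant_name == "baseline"]
candidate_scores = [r.metrics["ExactMatch"] for r in ab_test.results if r.variant_name == "candidate"]
p_value = permutation_test(baseline_scores, candidate_scores)
print(f"p-value for ExactMatch difference: {p_value:.3f}")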
Experiment Tracking
from typing import Any, Dict, List, Optional
import json
from pathlib import Path
class ExperimentTracker:
"""Track experiments and results."""
def __init__(self, experiments_dir: str = "experiments"):
self.experiments_dir = Path(experiments_dir)
self.experiments_dir.mkdir(exist_ok=True)
def log_experiment(
self,
name: str,
config: Dict[str, Any],
metrics: Dict[str, float],
metadata: Optional[Dict[str, Any]] = None
) -> str:
"""
Log experiment configuration and results.
Args:
name: Experiment name
config: Model configuration
metrics: Evaluation metrics
metadata: Additional metadata
Returns:
Experiment ID
"""
from datetime import datetime
import uuid
experiment_id = str(uuid.uuid4())[:8]
timestamp = datetime.utcnow()
experiment = {
"id": experiment_id,
"name": name,
"timestamp": timestamp.isoformat(),
"config": config,
"metrics": metrics,
"metadata": metadata or {}
}
# Save to file
filename = f"{timestamp.strftime('%Y%m%d_%H%M%S')}_{name}_{experiment_id}.json"
filepath = self.experiments_dir / filename
with open(filepath, "w") as f:
json.dump(experiment, f, indent=2)
return experiment_id
def load_experiment(self, experiment_id: str) -> Optional[Dict[str, Any]]:
"""Load experiment by ID."""
for filepath in self.experiments_dir.glob(f"*_{experiment_id}.json"):
with open(filepath) as f:
return json.load(f)
return None
def list_experiments(
self,
name: Optional[str] = None
) -> List[Dict[str, Any]]:
"""List all experiments, optionally filtered by name."""
experiments = []
for filepath in sorted(self.experiments_dir.glob("*.json")):
with open(filepath) as f:
exp = json.load(f)
if name is None or exp["name"] == name:
experiments.append(exp)
return experiments
def compare_experiments(
self,
experiment_ids: List[str]
) -> Dict[str, Any]:
"""Compare multiple experiments."""
experiments = [
self.load_experiment(exp_id)
for exp_id in experiment_ids
]
# Extract metrics for comparison
comparison = {
"experiments": []
}
for exp in experiments:
if exp:
comparison["experiments"].append({
"id": exp["id"],
"name": exp["name"],
"metrics": exp["metrics"]
})
return comparison
# Usage
tracker = ExperimentTracker()
exp_id = tracker.log_experiment(
name="summarization_v2",
config={
"model": "claude-sonnet-4",
"temperature": 0.3,
"max_tokens": 512,
"prompt_version": "2.0"
},
metrics={
"exact_match": 0.45,
"f1": 0.78,
"semantic_similarity": 0.85
},
metadata={
"dataset": "summarization_v1.json",
"num_examples": 100
}
)
print(f"Logged experiment: {exp_id}")
❌ Anti-Patterns
# ❌ No evaluation dataset
def test_model():
result = model("test this") # Single example!
print("Works!")
# ✅ Better: Use proper eval dataset
eval_dataset = EvalDataset.load("eval_data.json")
results = await evaluator.run(model, eval_dataset)
# ❌ Only exact match metric
score = sum(p == r for p, r in zip(preds, refs)) / len(preds)
# ✅ Better: Multiple metrics
metrics = {
"exact_match": ExactMatch().compute(preds, refs),
"f1": TokenOverlap().compute(preds, refs)["f1"],
"semantic_sim": await SemanticSimilarity().compute(preds, refs)
}
# ❌ No experiment tracking
model_v2_score = 0.78 # Lost context!
# ✅ Better: Track all experiments
tracker.log_experiment(
name="model_v2",
config={"version": "2.0"},
metrics={"f1": 0.78}
)
# ❌ Cherry-picking examples
good_examples = [ex for ex in dataset if model(ex) == expected]
# ✅ Better: Use full representative dataset
results = evaluate_on_full_dataset(model, dataset)
Best Practices Checklist
- ✅ Create representative evaluation datasets
- ✅ Version control eval datasets
- ✅ Use multiple complementary metrics
- ✅ Include LLM-as-judge for qualitative evaluation
- ✅ Run A/B tests for variant comparison
- ✅ Track all experiments with config and metrics
- ✅ Measure latency alongside quality metrics
- ✅ Use statistical significance testing
- ✅ Evaluate on diverse examples (easy, medium, hard)
- ✅ Include edge cases and adversarial examples
- ✅ Document evaluation methodology
- ✅ Set up automated evaluation in CI/CD (a minimal gate sketch follows this checklist)
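For the CI/CD item above, a minimal pytest-style quality gate might look like the sketch below. It assumes the pytest-asyncio plugin, an async model_fn, and that the eval helpers above are importable in the test module; the 0.70 threshold is illustrative.
import pytest  # the async test below assumes the pytest-asyncio plugin

@pytest.mark.asyncio
async def test_summarization_quality_gate():
    """Fail the build when F1 drops below the agreed threshold (0.70 is illustrative)."""
    dataset = EvalDataset.load("eval_data/summarization_v1.json")
    predictions = [await model_fn(ex.input) for ex in dataset.examples]
    references = [ex.expected_output for ex in dataset.examples]
    scores = TokenOverlap().compute(predictions, references)
    assert scores["f1"] >= 0.70, f"F1 regression: {scores['f1']:.2f} < 0.70"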
Auto-Apply
When evaluating LLM systems:
- Create EvalDataset with representative examples
- Compute multiple metrics (exact match, F1, semantic similarity)
- Use LLM-as-judge for qualitative assessment
- Run A/B tests comparing variants
- Track experiments with ExperimentTracker
- Measure latency alongside quality
- Save results for reproducibility
Related Skills
- prompting-patterns - For prompt engineering
- llm-app-architecture - For LLM integration
- monitoring-alerting - For production metrics
- model-selection - For choosing models
- performance-profiling - For optimization