evaluation-metrics
Automatically applies when evaluating LLM performance. Ensures proper eval datasets, metrics computation, A/B testing, LLM-as-judge patterns, and experiment tracking.
When & Why to Use This Skill
This Claude skill provides a comprehensive framework for evaluating Large Language Model (LLM) performance. It enables developers to build rigorous evaluation pipelines using structured datasets, diverse metrics (Exact Match, F1, Semantic Similarity), and advanced patterns like LLM-as-judge, A/B testing, and experiment tracking to ensure production-grade reliability and reproducibility.
Use Cases
- Benchmarking LLM Performance: Systematically evaluate model outputs against gold-standard datasets using quantitative metrics like F1, BLEU, and Semantic Similarity.
- Qualitative Assessment with LLM-as-Judge: Automate the evaluation of complex criteria such as tone, reasoning, and helpfulness by using a secondary LLM to grade model responses.
- A/B Testing Model Variants: Compare different models or prompt versions in a controlled environment to measure improvements in both output quality and system latency.
- Experiment Tracking and Versioning: Maintain a detailed log of evaluation runs, configurations, and results to ensure reproducibility and track performance trends over time.
- CI/CD Integration for LLMs: Implement automated evaluation checks in the development lifecycle to prevent performance regression before deploying new model updates.
| name | evaluation-metrics |
|---|---|
| description | Automatically applies when evaluating LLM performance. Ensures proper eval datasets, metrics computation, A/B testing, LLM-as-judge patterns, and experiment tracking. |
| category | ai-llm |
Evaluation Metrics for LLM Applications
When evaluating LLM performance, follow these patterns for rigorous, reproducible evaluation.
Trigger Keywords: evaluation, eval, metrics, benchmark, test set, A/B test, LLM judge, performance testing, accuracy, precision, recall, F1, BLEU, ROUGE, experiment tracking
Agent Integration: Used by ml-system-architect, performance-and-cost-engineer-llm, llm-app-engineer
✅ Correct Pattern: Evaluation Dataset
from typing import Any, Dict, List, Optional
from pydantic import BaseModel, Field
from datetime import datetime
import json
class EvalExample(BaseModel):
"""Single evaluation example."""
id: str
input: str
expected_output: str
    metadata: Dict[str, Any] = Field(default_factory=dict)
tags: List[str] = Field(default_factory=list)
class EvalDataset(BaseModel):
"""Evaluation dataset with metadata."""
name: str
description: str
version: str
created_at: datetime = Field(default_factory=datetime.utcnow)
examples: List[EvalExample]
def save(self, path: str):
"""Save dataset to JSON file."""
with open(path, "w") as f:
json.dump(self.model_dump(), f, indent=2, default=str)
@classmethod
def load(cls, path: str) -> "EvalDataset":
"""Load dataset from JSON file."""
with open(path) as f:
data = json.load(f)
return cls(**data)
def filter_by_tag(self, tag: str) -> "EvalDataset":
"""Filter dataset by tag."""
filtered = [ex for ex in self.examples if tag in ex.tags]
return EvalDataset(
name=f"{self.name}_{tag}",
description=f"Filtered by tag: {tag}",
version=self.version,
examples=filtered
)
# Create evaluation dataset
eval_dataset = EvalDataset(
name="summarization_eval",
description="Evaluation set for document summarization",
version="1.0",
examples=[
EvalExample(
id="sum_001",
input="Long document text...",
expected_output="Concise summary...",
tags=["short", "technical"]
),
EvalExample(
id="sum_002",
input="Another document...",
expected_output="Another summary...",
tags=["long", "business"]
)
]
)
eval_dataset.save("eval_data/summarization_v1.json")
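filter_by_tag can then slice the dataset into focused subsets for targeted evaluation, for example:
# Build a focused subset using the tags defined above
technical_subset = eval_dataset.filter_by_tag("technical")
print(f"{technical_subset.name}: {len(technical_subset.examples)} examples")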
Evaluation Metrics
from typing import Dict, List, Protocol
import numpy as np
class Metric(Protocol):
"""Protocol for evaluation metrics."""
def compute(
self,
predictions: List[str],
references: List[str]
) -> float:
"""Compute metric score."""
...
class ExactMatch:
"""Exact match metric (case-insensitive)."""
def compute(
self,
predictions: List[str],
references: List[str]
) -> float:
"""
Compute exact match accuracy.
Returns:
Fraction of exact matches (0-1)
"""
matches = sum(
p.strip().lower() == r.strip().lower()
for p, r in zip(predictions, references)
)
return matches / len(predictions)
class TokenOverlap:
"""Token overlap metric (precision, recall, F1)."""
def tokenize(self, text: str) -> set:
"""Simple whitespace tokenization."""
return set(text.lower().split())
def compute_f1(
self,
prediction: str,
reference: str
) -> Dict[str, float]:
"""
Compute precision, recall, F1 for single example.
Returns:
Dict with precision, recall, f1 scores
"""
pred_tokens = self.tokenize(prediction)
ref_tokens = self.tokenize(reference)
if not pred_tokens or not ref_tokens:
return {"precision": 0.0, "recall": 0.0, "f1": 0.0}
overlap = pred_tokens & ref_tokens
precision = len(overlap) / len(pred_tokens)
recall = len(overlap) / len(ref_tokens)
if precision + recall == 0:
f1 = 0.0
else:
f1 = 2 * (precision * recall) / (precision + recall)
return {
"precision": precision,
"recall": recall,
"f1": f1
}
def compute(
self,
predictions: List[str],
references: List[str]
) -> Dict[str, float]:
"""
Compute average metrics across all examples.
Returns:
Dict with average precision, recall, f1
"""
scores = [
self.compute_f1(p, r)
for p, r in zip(predictions, references)
]
return {
"precision": np.mean([s["precision"] for s in scores]),
"recall": np.mean([s["recall"] for s in scores]),
"f1": np.mean([s["f1"] for s in scores])
}
class SemanticSimilarity:
"""Semantic similarity using embeddings."""
def __init__(self, embedding_model):
self.embedding_model = embedding_model
async def compute(
self,
predictions: List[str],
references: List[str]
) -> float:
"""
Compute average cosine similarity.
Returns:
Average similarity score (0-1)
"""
# Embed predictions and references
pred_embeddings = await self.embedding_model.embed(predictions)
ref_embeddings = await self.embedding_model.embed(references)
# Compute cosine similarities
similarities = []
for pred_emb, ref_emb in zip(pred_embeddings, ref_embeddings):
similarity = np.dot(pred_emb, ref_emb) / (
np.linalg.norm(pred_emb) * np.linalg.norm(ref_emb)
)
similarities.append(similarity)
return float(np.mean(similarities))
# Usage
exact_match = ExactMatch()
token_overlap = TokenOverlap()
predictions = ["The cat sat on mat", "Python is great"]
references = ["The cat sat on the mat", "Python is awesome"]
em_score = exact_match.compute(predictions, references)
overlap_scores = token_overlap.compute(predictions, references)
print(f"Exact Match: {em_score:.2f}")
print(f"F1 Score: {overlap_scores['f1']:.2f}")
LLM-as-Judge Evaluation
from typing import Any, Dict, List, Optional

class LLMJudge:
"""Use LLM to evaluate outputs."""
def __init__(self, llm_client):
self.llm = llm_client
async def judge_single(
self,
input: str,
prediction: str,
reference: Optional[str] = None,
        criteria: Optional[List[str]] = None
    ) -> Dict[str, Any]:
"""
Evaluate single prediction using LLM.
Args:
input: Original input
prediction: Model prediction
reference: Optional reference answer
criteria: Evaluation criteria
Returns:
Dict with score and reasoning
"""
criteria = criteria or [
"accuracy",
"relevance",
"completeness",
"clarity"
]
prompt = self._build_judge_prompt(
input, prediction, reference, criteria
)
response = await self.llm.complete(prompt, temperature=0.0)
# Parse response (expects JSON)
import json
try:
result = json.loads(response)
return result
except json.JSONDecodeError:
return {
"score": 0,
"reasoning": "Failed to parse response",
"raw_response": response
}
def _build_judge_prompt(
self,
input: str,
prediction: str,
reference: Optional[str],
criteria: List[str]
) -> str:
"""Build prompt for LLM judge."""
criteria_str = ", ".join(criteria)
prompt = f"""Evaluate this model output on: {criteria_str}
Input:
{input}
Model Output:
{prediction}"""
if reference:
prompt += f"""
Reference Answer:
{reference}"""
prompt += """
Provide evaluation as JSON:
{
"score": <1-10>,
"reasoning": "<explanation>",
"criteria_scores": {
"accuracy": <1-10>,
"relevance": <1-10>,
...
}
}"""
return prompt
async def batch_judge(
self,
examples: List[Dict[str, str]],
        criteria: Optional[List[str]] = None
    ) -> List[Dict[str, Any]]:
"""
Judge multiple examples in batch.
Args:
examples: List of dicts with input, prediction, reference
criteria: Evaluation criteria
Returns:
List of judgment results
"""
import asyncio
tasks = [
self.judge_single(
input=ex["input"],
prediction=ex["prediction"],
reference=ex.get("reference"),
criteria=criteria
)
for ex in examples
]
return await asyncio.gather(*tasks)
# Usage
judge = LLMJudge(llm_client)
result = await judge.judge_single(
input="What is Python?",
prediction="Python is a programming language.",
reference="Python is a high-level programming language.",
criteria=["accuracy", "completeness", "clarity"]
)
print(f"Score: {result['score']}/10")
print(f"Reasoning: {result['reasoning']}")
A/B Testing Framework
from typing import Any, Callable, Dict, List, Optional
from dataclasses import dataclass
from datetime import datetime
import random
import numpy as np
@dataclass
class Variant:
"""A/B test variant."""
name: str
model_fn: Callable
traffic_weight: float = 0.5
@dataclass
class ABTestResult:
"""Result from A/B test."""
variant_name: str
example_id: str
prediction: str
metrics: Dict[str, float]
latency_ms: float
timestamp: datetime
class ABTest:
"""A/B testing framework for LLM variants."""
def __init__(
self,
name: str,
variants: List[Variant],
metrics: List[Metric]
):
self.name = name
self.variants = variants
self.metrics = metrics
self.results: List[ABTestResult] = []
# Normalize weights
total_weight = sum(v.traffic_weight for v in variants)
for v in variants:
v.traffic_weight /= total_weight
def select_variant(self) -> Variant:
"""Select variant based on traffic weight."""
r = random.random()
cumulative = 0.0
for variant in self.variants:
cumulative += variant.traffic_weight
if r <= cumulative:
return variant
return self.variants[-1]
async def run_test(
self,
eval_dataset: EvalDataset,
samples_per_variant: Optional[int] = None
    ) -> Dict[str, Any]:
"""
Run A/B test on evaluation dataset.
Args:
eval_dataset: Evaluation dataset
samples_per_variant: Samples per variant (None = all)
Returns:
Test results with metrics per variant
"""
import time
samples = samples_per_variant or len(eval_dataset.examples)
# Run predictions for each variant
for variant in self.variants:
for i, example in enumerate(eval_dataset.examples[:samples]):
start = time.time()
# Get prediction from variant
prediction = await variant.model_fn(example.input)
latency = (time.time() - start) * 1000
# Compute metrics
variant_metrics = {}
for metric in self.metrics:
                    score = metric.compute([prediction], [example.expected_output])
                    # Metrics like TokenOverlap return a dict of sub-scores;
                    # flatten them so analyze_results can aggregate plain floats
                    if isinstance(score, dict):
                        for sub_name, sub_score in score.items():
                            variant_metrics[f"{metric.__class__.__name__}_{sub_name}"] = sub_score
                    else:
                        variant_metrics[metric.__class__.__name__] = score
# Store result
self.results.append(ABTestResult(
variant_name=variant.name,
example_id=example.id,
prediction=prediction,
metrics=variant_metrics,
latency_ms=latency,
timestamp=datetime.utcnow()
))
return self.analyze_results()
    def analyze_results(self) -> Dict[str, Any]:
"""
Analyze A/B test results.
Returns:
Statistics per variant
"""
variant_stats = {}
for variant in self.variants:
variant_results = [
r for r in self.results
if r.variant_name == variant.name
]
if not variant_results:
continue
# Aggregate metrics
metric_names = variant_results[0].metrics.keys()
avg_metrics = {}
for metric_name in metric_names:
scores = [r.metrics[metric_name] for r in variant_results]
avg_metrics[metric_name] = {
"mean": np.mean(scores),
"std": np.std(scores),
"min": np.min(scores),
"max": np.max(scores)
}
# Latency stats
latencies = [r.latency_ms for r in variant_results]
variant_stats[variant.name] = {
"samples": len(variant_results),
"metrics": avg_metrics,
"latency": {
"mean_ms": np.mean(latencies),
"p50_ms": np.percentile(latencies, 50),
"p95_ms": np.percentile(latencies, 95),
"p99_ms": np.percentile(latencies, 99)
}
}
return variant_stats
# Usage
variants = [
Variant(
name="baseline",
model_fn=lambda x: model_v1.complete(x),
traffic_weight=0.5
),
Variant(
name="candidate",
model_fn=lambda x: model_v2.complete(x),
traffic_weight=0.5
)
]
ab_test = ABTest(
name="summarization_v1_vs_v2",
variants=variants,
metrics=[ExactMatch(), TokenOverlap()]
)
results = await ab_test.run_test(eval_dataset, samples_per_variant=100)
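analyze_results reports means and latency percentiles but says nothing about whether the difference between variants is statistically significant. A simple two-sided permutation test over per-example scores can be layered on top; this is a minimal numpy-only sketch, and the "ExactMatch" metric key is illustrative.
import numpy as np

def permutation_test(scores_a, scores_b, n_permutations: int = 10_000, seed: int = 0) -> float:
    """Two-sided permutation test for the difference in mean scores between two variants."""
    rng = np.random.default_rng(seed)
    scores_a = np.asarray(scores_a, dtype=float)
    scores_b = np.asarray(scores_b, dtype=float)
    observed = abs(scores_a.mean() - scores_b.mean())
    pooled = np.concatenate([scores_a, scores_b])
    count = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        perm_a = pooled[: len(scores_a)]
        perm_b = pooled[len(scores_a):]
        if abs(perm_a.mean() - perm_b.mean()) >= observed:
            count += 1
    return count / n_permutations

# Usage: compare per-example ExactMatch scores between the two variants
baseline_scores = [r.metrics["ExactMatch"] for r in ab_test.results if r.variant_name == "baseline"]
candidate_scores = [r.metrics["ExactMatch"] for r in ab_test.results if r.variant_name == "candidate"]
p_value = permutation_test(baseline_scores, candidate_scores)
print(f"p-value for ExactMatch difference: {p_value:.3f}")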
Experiment Tracking
from typing import Any, Dict, List, Optional
import json
from pathlib import Path
class ExperimentTracker:
"""Track experiments and results."""
def __init__(self, experiments_dir: str = "experiments"):
self.experiments_dir = Path(experiments_dir)
self.experiments_dir.mkdir(exist_ok=True)
def log_experiment(
self,
name: str,
config: Dict[str, Any],
metrics: Dict[str, float],
metadata: Optional[Dict[str, Any]] = None
) -> str:
"""
Log experiment configuration and results.
Args:
name: Experiment name
config: Model configuration
metrics: Evaluation metrics
metadata: Additional metadata
Returns:
Experiment ID
"""
from datetime import datetime
import uuid
experiment_id = str(uuid.uuid4())[:8]
timestamp = datetime.utcnow()
experiment = {
"id": experiment_id,
"name": name,
"timestamp": timestamp.isoformat(),
"config": config,
"metrics": metrics,
"metadata": metadata or {}
}
# Save to file
filename = f"{timestamp.strftime('%Y%m%d_%H%M%S')}_{name}_{experiment_id}.json"
filepath = self.experiments_dir / filename
with open(filepath, "w") as f:
json.dump(experiment, f, indent=2)
return experiment_id
def load_experiment(self, experiment_id: str) -> Optional[Dict[str, Any]]:
"""Load experiment by ID."""
for filepath in self.experiments_dir.glob(f"*_{experiment_id}.json"):
with open(filepath) as f:
return json.load(f)
return None
def list_experiments(
self,
name: Optional[str] = None
) -> List[Dict[str, Any]]:
"""List all experiments, optionally filtered by name."""
experiments = []
for filepath in sorted(self.experiments_dir.glob("*.json")):
with open(filepath) as f:
exp = json.load(f)
if name is None or exp["name"] == name:
experiments.append(exp)
return experiments
def compare_experiments(
self,
experiment_ids: List[str]
) -> Dict[str, Any]:
"""Compare multiple experiments."""
experiments = [
self.load_experiment(exp_id)
for exp_id in experiment_ids
]
# Extract metrics for comparison
comparison = {
"experiments": []
}
for exp in experiments:
if exp:
comparison["experiments"].append({
"id": exp["id"],
"name": exp["name"],
"metrics": exp["metrics"]
})
return comparison
# Usage
tracker = ExperimentTracker()
exp_id = tracker.log_experiment(
name="summarization_v2",
config={
"model": "claude-sonnet-4",
"temperature": 0.3,
"max_tokens": 512,
"prompt_version": "2.0"
},
metrics={
"exact_match": 0.45,
"f1": 0.78,
"semantic_similarity": 0.85
},
metadata={
"dataset": "summarization_v1.json",
"num_examples": 100
}
)
print(f"Logged experiment: {exp_id}")
❌ Anti-Patterns
# ❌ No evaluation dataset
def test_model():
result = model("test this") # Single example!
print("Works!")
# ✅ Better: Use proper eval dataset
eval_dataset = EvalDataset.load("eval_data.json")
results = await evaluator.run(model, eval_dataset)
# ❌ Only exact match metric
score = sum(p == r for p, r in zip(preds, refs)) / len(preds)
# ✅ Better: Multiple metrics
metrics = {
"exact_match": ExactMatch().compute(preds, refs),
"f1": TokenOverlap().compute(preds, refs)["f1"],
"semantic_sim": await SemanticSimilarity().compute(preds, refs)
}
# ❌ No experiment tracking
model_v2_score = 0.78 # Lost context!
# ✅ Better: Track all experiments
tracker.log_experiment(
name="model_v2",
config={"version": "2.0"},
metrics={"f1": 0.78}
)
# ❌ Cherry-picking examples
good_examples = [ex for ex in dataset if model(ex) == expected]
# ✅ Better: Use full representative dataset
results = evaluate_on_full_dataset(model, dataset)
Best Practices Checklist
- ✅ Create representative evaluation datasets
- ✅ Version control eval datasets
- ✅ Use multiple complementary metrics
- ✅ Include LLM-as-judge for qualitative evaluation
- ✅ Run A/B tests for variant comparison
- ✅ Track all experiments with config and metrics
- ✅ Measure latency alongside quality metrics
- ✅ Use statistical significance testing
- ✅ Evaluate on diverse examples (easy, medium, hard)
- ✅ Include edge cases and adversarial examples
- ✅ Document evaluation methodology
- ✅ Set up automated evaluation in CI/CD (a minimal gate sketch follows this checklist)
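For the CI/CD item above, a minimal pytest-style quality gate might look like the sketch below. It assumes the pytest-asyncio plugin, an async model_fn, and that the eval helpers above are importable in the test module; the 0.70 threshold is illustrative.
import pytest  # the async test below assumes the pytest-asyncio plugin

@pytest.mark.asyncio
async def test_summarization_quality_gate():
    """Fail the build when F1 drops below the agreed threshold (0.70 is illustrative)."""
    dataset = EvalDataset.load("eval_data/summarization_v1.json")
    predictions = [await model_fn(ex.input) for ex in dataset.examples]
    references = [ex.expected_output for ex in dataset.examples]
    scores = TokenOverlap().compute(predictions, references)
    assert scores["f1"] >= 0.70, f"F1 regression: {scores['f1']:.2f} < 0.70"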
Auto-Apply
When evaluating LLM systems:
- Create EvalDataset with representative examples
- Compute multiple metrics (exact match, F1, semantic similarity)
- Use LLM-as-judge for qualitative assessment
- Run A/B tests comparing variants
- Track experiments with ExperimentTracker
- Measure latency alongside quality
- Save results for reproducibility
Related Skills
- prompting-patterns - For prompt engineering
- llm-app-architecture - For LLM integration
- monitoring-alerting - For production metrics
- model-selection - For choosing models
- performance-profiling - For optimization