evaluation-methodology
Methods for evaluating AI model outputs - exact match, semantic similarity, LLM-as-judge, comparative evaluation, ELO ranking. Use when measuring AI quality, building eval pipelines, or comparing models.
When & Why to Use This Skill
This Claude skill provides a comprehensive framework for evaluating AI model outputs using diverse methodologies such as exact match, semantic similarity, and LLM-as-judge. It enables developers to build robust evaluation pipelines, perform comparative analysis via ELO ranking, and ensure the quality and reliability of foundation model responses through systematic benchmarking.
Use Cases
- Case 1: Building automated evaluation pipelines to measure the accuracy, helpfulness, and safety of AI-generated content across different versions.
- Case 2: Comparing multiple LLM outputs using ELO ranking and comparative evaluation to determine the superior model for specific business use cases.
- Case 3: Implementing 'LLM-as-judge' workflows to provide scalable, rubric-based grading for open-ended queries where traditional metrics fail.
- Case 4: Assessing technical performance in specialized domains like coding or translation using functional correctness and semantic similarity metrics.
Evaluation Methodology
Methods for evaluating Foundation Model outputs.
Evaluation Approaches
1. Exact Evaluation
| Method | Use Case | Example |
|---|---|---|
| Exact Match | QA, Math | "5" == "5" |
| Functional Correctness | Code | Pass test cases |
| BLEU/ROUGE | Translation | N-gram overlap |
| Semantic Similarity | Open-ended | Embedding cosine |
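The first two rows need no library at all. Below is a minimal sketch, assuming short string answers for exact match and, for functional correctness, a list of `(inputs, expected)` tuples plus a hypothetical `solution` callable built from the model's generated code:

```python
# Exact match: normalize whitespace and case before comparing short answers
def exact_match(generated: str, reference: str) -> bool:
    return generated.strip().lower() == reference.strip().lower()

# Functional correctness: fraction of test cases the generated code passes
def functional_correctness(solution, test_cases) -> float:
    passed = 0
    for inputs, expected in test_cases:
        try:
            if solution(*inputs) == expected:
                passed += 1
        except Exception:
            pass  # a runtime error counts as a failed case
    return passed / len(test_cases)

exact_match("5", "5")                                            # True
functional_correctness(lambda x: x * 2, [((2,), 4), ((3,), 6)])  # 1.0
```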
```python
# Semantic similarity: cosine distance between sentence embeddings
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer('all-MiniLM-L6-v2')

generated = "Paris is the capital of France."   # model output to score
reference = "The capital of France is Paris."   # gold answer

emb1 = model.encode([generated])
emb2 = model.encode([reference])
similarity = cosine_similarity(emb1, emb2)[0][0]  # close to 1.0 = near-identical meaning
```
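The BLEU/ROUGE row can likewise be scored with off-the-shelf packages rather than hand-rolled n-gram counting. A minimal sketch, assuming the `sacrebleu` and `rouge_score` packages are installed (any equivalent scorer works):

```python
# N-gram overlap metrics for translation / summarization
import sacrebleu
from rouge_score import rouge_scorer

generated = "The cat sat on the mat."
reference = "A cat was sitting on the mat."

bleu = sacrebleu.sentence_bleu(generated, [reference]).score   # 0-100, higher is better
rouge_l = rouge_scorer.RougeScorer(["rougeL"]).score(reference, generated)["rougeL"].fmeasure
```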
2. AI as Judge
```python
# Rubric prompt for an LLM judge; {{ }} escapes the literal JSON braces so str.format
# only substitutes {query} and {response}
JUDGE_PROMPT = """Rate the response on a scale of 1-5.
Criteria:
- Accuracy: Is the information correct?
- Helpfulness: Does it address the user's need?
- Clarity: Is it easy to understand?
Query: {query}
Response: {response}
Return JSON: {{"score": N, "reasoning": "..."}}"""

# Multi-judge for reliability: average the scores of several judge models
judges = ["gpt-4", "claude-3"]
scores = [get_score(judge, response) for judge in judges]   # get_score sketched below
final_score = sum(scores) / len(scores)
```
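`get_score` is referenced but not defined above. A minimal sketch, assuming a hypothetical `call_model(model_name, prompt)` wrapper around whichever LLM API you use, returning the raw text completion:

```python
import json

def get_score(judge: str, response: str, query: str = "") -> float:
    """Ask one judge model to grade a response and parse the JSON verdict."""
    prompt = JUDGE_PROMPT.format(query=query, response=response)
    raw = call_model(judge, prompt)              # hypothetical API wrapper returning text
    try:
        return float(json.loads(raw)["score"])
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        return float("nan")                      # flag unparseable judgments for manual review
```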
3. Comparative Evaluation (ELO)
```python
# Pairwise comparison prompt for a judge model
COMPARE_PROMPT = """Compare these responses.
Query: {query}
A: {response_a}
B: {response_b}
Which is better? Return: A, B, or tie"""

def update_elo(rating_a, rating_b, winner, k=32):
    """Return A's new ELO rating after one comparison; k controls the update size."""
    expected_a = 1 / (1 + 10 ** ((rating_b - rating_a) / 400))   # A's expected win probability
    score_a = 1 if winner == "A" else 0 if winner == "B" else 0.5
    return rating_a + k * (score_a - expected_a)
```
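`update_elo` returns only the first model's new rating, so a full comparison updates both sides by calling it twice with the roles flipped. A minimal usage sketch, assuming a hypothetical `judge_pair(query, a, b)` wrapper that sends COMPARE_PROMPT to a judge model and returns "A", "B", or "tie", and an `eval_pairs` dataset of (query, response_a, response_b) tuples:

```python
ratings = {"model_a": 1000.0, "model_b": 1000.0}        # common starting rating

for query, resp_a, resp_b in eval_pairs:
    winner = judge_pair(query, resp_a, resp_b)           # "A", "B", or "tie"
    flipped = {"A": "B", "B": "A"}.get(winner, "tie")    # same verdict from B's perspective
    new_a = update_elo(ratings["model_a"], ratings["model_b"], winner)
    new_b = update_elo(ratings["model_b"], ratings["model_a"], flipped)
    ratings["model_a"], ratings["model_b"] = new_a, new_b
```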
Evaluation Pipeline
1. Define Criteria (accuracy, helpfulness, safety)
↓
2. Create Scoring Rubric with Examples
↓
3. Select Methods (exact + AI judge + human)
↓
4. Create Evaluation Dataset
↓
5. Run Evaluation
↓
6. Analyze & Iterate
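A minimal sketch of steps 4-6, assuming a hypothetical `generate()` call to the model under test and reusing the `exact_match` and `get_score` helpers sketched above:

```python
# Step 4: a tiny evaluation dataset (version this alongside your code)
dataset = [
    {"query": "What is 2 + 3?", "reference": "5"},
    {"query": "What is the capital of France?", "reference": "Paris"},
]

# Step 5: run the model and score each item with more than one method
results = []
for item in dataset:
    response = generate(item["query"])          # hypothetical call to the model under test
    results.append({
        "exact": exact_match(response, item["reference"]),
        "judge": get_score("claude-3", response, item["query"]),
    })

# Step 6: aggregate, then inspect the failures and iterate
exact_rate = sum(r["exact"] for r in results) / len(results)
mean_judge = sum(r["judge"] for r in results) / len(results)
print(f"exact match: {exact_rate:.2%}, mean judge score: {mean_judge:.2f}/5")
```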
Best Practices
- Use multiple evaluation methods
- Calibrate AI judges with human data
- Include both automatic and human evaluation
- Version your evaluation datasets
- Track metrics over time
- Test for position bias in comparisons
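For the last point, a cheap position-bias check is to run every comparison twice with the answer order swapped and measure how often the verdict survives the swap. A minimal sketch, reusing the hypothetical `judge_pair` wrapper from the ELO section:

```python
def position_consistency(eval_pairs) -> float:
    """Fraction of comparisons whose verdict is unchanged when A and B swap positions."""
    consistent = 0
    for query, resp_a, resp_b in eval_pairs:
        first = judge_pair(query, resp_a, resp_b)               # A shown first
        second = judge_pair(query, resp_b, resp_a)              # order swapped
        swapped_back = {"A": "B", "B": "A"}.get(second, "tie")  # map verdict back to original labels
        if first == swapped_back:
            consistent += 1
    return consistent / len(eval_pairs)   # well below 1.0 suggests position bias
```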