When & Why to Use This Skill
This Claude skill provides a framework for end-to-end AI system evaluation, helping developers and architects with model selection, performance benchmarking, and cost-benefit analysis. It covers generation-quality and latency metrics (TTFT/TPOT) as well as strategic 'build vs. buy' decision-making, so that infrastructure and deployment choices rest on measured evidence rather than vendor claims.
Use Cases
- Model Selection & Comparison: Evaluating different LLMs (proprietary vs. open-source) based on task-specific requirements, quality thresholds, and budget constraints.
- Performance Benchmarking: Designing and running evaluation pipelines using domain-specific datasets like GSM-8K for reasoning or HumanEval for coding to measure accuracy and reliability.
- Cost and Latency Optimization: Analyzing operational metrics including Time to First Token (TTFT) and throughput to balance user experience with infrastructure expenses.
- Architectural Decision Making: Conducting 'Build vs. Buy' assessments to determine whether to utilize managed APIs or self-hosted models based on data privacy and customization needs.
| name | ai-system-evaluation |
|---|---|
| description | End-to-end AI system evaluation - model selection, benchmarks, cost/latency analysis, build vs buy decisions. Use when selecting models, designing eval pipelines, or making architecture decisions. |
AI System Evaluation
Evaluating AI systems end-to-end.
Evaluation Criteria
1. Domain-Specific Capability
| Domain | Benchmarks |
|---|---|
| Math & Reasoning | GSM-8K, MATH |
| Code | HumanEval, MBPP |
| Knowledge | MMLU, ARC |
| Multi-turn Chat | MT-Bench |
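Public benchmarks give a first signal, but the same loop works on your own data. Below is a minimal sketch of an exact-match accuracy harness; `generate` stands in for whatever model client you are evaluating, and the dataset format is an assumption made for illustration.

```python
def evaluate_accuracy(generate, dataset):
    """Exact-match accuracy over a task-specific eval set.

    dataset: list of {"prompt": str, "answer": str} items (e.g. a GSM-8K-style subset).
    generate: callable taking a prompt string and returning the model's reply.
    """
    correct = 0
    for item in dataset:
        prediction = generate(item["prompt"]).strip()
        correct += prediction == item["answer"].strip()
    return correct / len(dataset)
```

Exact match suits math- and code-style tasks with a single correct answer; open-ended tasks usually need the quality measures in the next section.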
2. Generation Quality
| Criterion | Measurement |
|---|---|
| Factual Consistency | NLI, SAFE, SelfCheckGPT |
| Coherence | AI judge rubric |
| Relevance | Semantic similarity |
| Fluency | Perplexity |
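As one concrete example from the table above, relevance is often scored as the cosine similarity between embeddings of the query (or reference) and the generated answer. The sketch below shows only the scoring step; how the embeddings are produced (e.g. a sentence-embedding model) is left open.

```python
import numpy as np

def relevance_score(query_embedding, answer_embedding):
    """Cosine similarity between two embedding vectors, in [-1, 1]."""
    q = np.asarray(query_embedding, dtype=float)
    a = np.asarray(answer_embedding, dtype=float)
    return float(np.dot(q, a) / (np.linalg.norm(q) * np.linalg.norm(a)))
```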
3. Cost & Latency
```python
from dataclasses import dataclass

@dataclass
class PerformanceMetrics:
    ttft: float        # Time to First Token (seconds)
    tpot: float        # Time Per Output Token (seconds per token)
    throughput: float  # Tokens per second

    def cost(self, input_tokens: int, output_tokens: int, prices: dict) -> float:
        # prices: per-token rates, e.g. {"input": ..., "output": ...}
        return input_tokens * prices["input"] + output_tokens * prices["output"]
```
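A quick usage sketch: end-to-end latency is roughly TTFT plus TPOT times the number of output tokens, and per-request cost follows from the token counts. The per-token prices below are illustrative placeholders, not vendor pricing.

```python
metrics = PerformanceMetrics(ttft=0.4, tpot=0.03, throughput=33.0)

latency_s = metrics.ttft + metrics.tpot * 500  # ~15.4 s for a 500-token reply
request_cost = metrics.cost(1_000, 500, {"input": 3e-6, "output": 15e-6})  # ~$0.0105 at these rates
```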
Model Selection Workflow
1. Define Requirements
├── Task type
├── Quality threshold
├── Latency requirements (<2s TTFT)
├── Cost budget
└── Deployment constraints
2. Filter Options
├── API vs Self-hosted
├── Open source vs Proprietary
└── Size constraints
3. Benchmark on Your Data
├── Create eval dataset (100+ examples)
├── Run experiments
└── Analyze results
4. Make Decision
└── Balance quality, cost, latency (see the scoring sketch below)
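For step 4, one simple approach is a weighted score that rewards quality and penalizes cost and latency. The candidates, weights, and normalization constants below are illustrative assumptions; tune them to your own budget and TTFT target.

```python
# Quality in [0, 1], cost in $ per 1M tokens, TTFT in seconds (all illustrative).
CANDIDATES = {
    "model-a": {"quality": 0.86, "cost": 15.0, "ttft": 0.9},
    "model-b": {"quality": 0.78, "cost": 2.0,  "ttft": 0.4},
}
WEIGHTS = {"quality": 0.5, "cost": 0.3, "latency": 0.2}

def weighted_score(c):
    # Higher quality is better; lower cost and TTFT are better, so they subtract.
    return (WEIGHTS["quality"] * c["quality"]
            - WEIGHTS["cost"] * c["cost"] / 15.0      # normalized to the priciest option
            - WEIGHTS["latency"] * c["ttft"] / 2.0)   # normalized to the 2 s TTFT budget

best = max(CANDIDATES, key=lambda name: weighted_score(CANDIDATES[name]))
```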
Build vs Buy
| Factor | API | Self-Host |
|---|---|---|
| Data Privacy | Less control | Full control |
| Performance | Best models | Slightly behind |
| Cost at Scale | Expensive | Amortized |
| Customization | Limited | Full control |
| Maintenance | Zero | Significant |
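The "Cost at Scale" row is easiest to reason about with a rough break-even calculation: API spend grows linearly with token volume, while self-hosting is a roughly fixed monthly amount once hardware is provisioned. All figures below are placeholder assumptions, not real vendor or GPU prices.

```python
def monthly_api_cost(tokens_per_month, price_per_million_usd=10.0):
    return tokens_per_month / 1_000_000 * price_per_million_usd

def monthly_selfhost_cost(gpu_hourly_usd=2.5, gpus=2, ops_overhead=1.3):
    # 24/7 reserved GPUs plus a multiplier for ops/maintenance effort.
    return gpu_hourly_usd * gpus * 24 * 30 * ops_overhead

for tokens in (50e6, 500e6, 5e9):
    print(f"{tokens / 1e6:>6.0f}M tokens/mo: "
          f"API ${monthly_api_cost(tokens):>8,.0f} vs self-host ${monthly_selfhost_cost():>8,.0f}")
```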
Public Benchmarks
| Benchmark | Focus |
|---|---|
| MMLU | Knowledge (57 subjects) |
| HumanEval | Code generation |
| GSM-8K | Math reasoning |
| TruthfulQA | Factuality |
| MT-Bench | Multi-turn chat |
Caution: Benchmarks can be gamed. Data contamination is common. Always evaluate on YOUR data.
Best Practices
- Test on domain-specific data
- Measure both quality and cost
- Consider latency requirements
- Plan for fallback models (see the sketch after this list)
- Re-evaluate periodically
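A minimal sketch of the fallback pattern: try each model client in order and return the first reply that succeeds within the latency budget. The clients are passed in as plain callables, so the concrete SDK calls are left to you.

```python
import time

def generate_with_fallback(prompt, clients, timeout_s=10.0):
    """clients: ordered list of callables, each taking a prompt and returning a reply."""
    last_error = None
    for call in clients:
        try:
            start = time.monotonic()
            reply = call(prompt)
            if time.monotonic() - start <= timeout_s:
                return reply
            last_error = TimeoutError(f"reply exceeded the {timeout_s}s budget")
        except Exception as exc:  # rate limits, network errors, provider outages
            last_error = exc
    raise RuntimeError("all models in the fallback chain failed") from last_error
```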