ai-system-evaluation


End-to-end AI system evaluation - model selection, benchmarks, cost/latency analysis, build vs buy decisions. Use when selecting models, designing eval pipelines, or making architecture decisions.


When & Why to Use This Skill

This Claude skill provides a comprehensive framework for end-to-end AI system evaluation, assisting developers and architects in model selection, performance benchmarking, and cost-benefit analysis. It covers critical metrics such as generation quality, latency (TTFT/TPOT), and strategic 'build vs. buy' decision-making to ensure optimal AI infrastructure and deployment.

Use Cases

  • Model Selection & Comparison: Evaluating different LLMs (proprietary vs. open-source) based on task-specific requirements, quality thresholds, and budget constraints.
  • Performance Benchmarking: Designing and running evaluation pipelines using domain-specific datasets like GSM-8K for reasoning or HumanEval for coding to measure accuracy and reliability.
  • Cost and Latency Optimization: Analyzing operational metrics including Time to First Token (TTFT) and throughput to balance user experience with infrastructure expenses.
  • Architectural Decision Making: Conducting 'Build vs. Buy' assessments to determine whether to utilize managed APIs or self-hosted models based on data privacy and customization needs.
name: ai-system-evaluation
description: End-to-end AI system evaluation - model selection, benchmarks, cost/latency analysis, build vs buy decisions. Use when selecting models, designing eval pipelines, or making architecture decisions.

AI System Evaluation

Evaluating AI systems end-to-end.

Evaluation Criteria

1. Domain-Specific Capability

Domain            Benchmarks
Math & Reasoning  GSM-8K, MATH
Code              HumanEval, MBPP
Knowledge         MMLU, ARC
Multi-turn Chat   MT-Bench
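
As a concrete starting point, here is a minimal sketch of scoring a model on GSM-8K-style items by exact match on the final numeric answer. The generate(prompt) callable and the answer-extraction heuristic are assumptions; adapt both to your model client and dataset format.

import re

def extract_final_number(text: str) -> str | None:
    # GSM-8K references end with "#### <number>"; model output may not,
    # so fall back to the last number that appears in the text.
    numbers = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return numbers[-1] if numbers else None

def exact_match_accuracy(examples, generate):
    # examples: list of {"question": ..., "answer": ...} dicts
    # generate: your model call (hypothetical), e.g. lambda prompt: client.complete(prompt)
    correct = 0
    for ex in examples:
        prediction = extract_final_number(generate(ex["question"]))
        reference = extract_final_number(ex["answer"])
        correct += int(prediction is not None and prediction == reference)
    return correct / len(examples)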

2. Generation Quality

Criterion            Measurement
Factual Consistency  NLI, SAFE, SelfCheckGPT
Coherence            AI judge rubric
Relevance            Semantic similarity
Fluency              Perplexity
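
The relevance row can be approximated with embedding similarity. A minimal sketch, assuming the sentence-transformers package is available; the model name is just an example.

from sentence_transformers import SentenceTransformer, util

_embedder = SentenceTransformer("all-MiniLM-L6-v2")  # example embedding model

def relevance_score(query: str, answer: str) -> float:
    # Cosine similarity between query and answer embeddings (higher = more relevant)
    query_emb, answer_emb = _embedder.encode([query, answer], convert_to_tensor=True)
    return util.cos_sim(query_emb, answer_emb).item()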

3. Cost & Latency

from dataclasses import dataclass

@dataclass
class PerformanceMetrics:
    ttft: float        # Time to First Token (seconds)
    tpot: float        # Time Per Output Token (seconds per token)
    throughput: float  # Tokens per second

    def cost(self, input_tokens: int, output_tokens: int, prices: dict) -> float:
        # prices: per-token rates, e.g. {"input": ..., "output": ...}
        return input_tokens * prices["input"] + output_tokens * prices["output"]
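
TTFT and TPOT are easiest to capture from a streaming response. A rough sketch, assuming you can obtain an iterator that yields output tokens as they arrive (the stream argument is whatever your client returns):

import time

def measure_stream(stream) -> PerformanceMetrics:
    # stream: any iterable that yields tokens as they are generated
    # (assumes at least one token is produced)
    start = time.perf_counter()
    first = None
    n_tokens = 0
    for _token in stream:
        if first is None:
            first = time.perf_counter()
        n_tokens += 1
    end = time.perf_counter()
    ttft = first - start
    tpot = (end - first) / max(n_tokens - 1, 1)
    return PerformanceMetrics(ttft=ttft, tpot=tpot, throughput=n_tokens / (end - start))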

Model Selection Workflow

1. Define Requirements
   ├── Task type
   ├── Quality threshold
   ├── Latency requirements (e.g., < 2s TTFT)
   ├── Cost budget
   └── Deployment constraints

2. Filter Options
   ├── API vs Self-hosted
   ├── Open source vs Proprietary
   └── Size constraints

3. Benchmark on Your Data
   ├── Create eval dataset (100+ examples)
   ├── Run experiments
   └── Analyze results

4. Make Decision
   └── Balance quality, cost, latency
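
Steps 3 and 4 can be scripted as a single comparison loop over your eval set. A sketch, where the generate callables, price tables, and the 4-chars-per-token estimate are all placeholders to replace with your own client and tokenizer:

def compare_models(eval_set, candidates, score_fn):
    # candidates: {"model-name": {"generate": callable,
    #                             "prices": {"input": per-token $, "output": per-token $}}}
    results = {}
    for name, cfg in candidates.items():
        scores, total_cost = [], 0.0
        for ex in eval_set:
            output = cfg["generate"](ex["prompt"])
            scores.append(score_fn(output, ex["reference"]))
            in_tok, out_tok = len(ex["prompt"]) // 4, len(output) // 4  # crude token estimate
            total_cost += in_tok * cfg["prices"]["input"] + out_tok * cfg["prices"]["output"]
        results[name] = {"quality": sum(scores) / len(scores), "cost": total_cost}
    return results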

Build vs Buy

Factor         API             Self-Host
Data Privacy   Less control    Full control
Performance    Best models     Slightly behind
Cost at Scale  Expensive       Amortized
Customization  Limited         Full control
Maintenance    Zero            Significant
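
The cost-at-scale row usually reduces to a break-even volume: the monthly token count at which a fixed self-hosting bill matches pay-per-token API pricing. A back-of-the-envelope sketch with made-up numbers; it ignores utilization, engineering time, and quality differences.

def breakeven_tokens_per_month(api_price_per_1m_tokens: float, selfhost_cost_per_month: float) -> float:
    # Monthly token volume above which self-hosting becomes cheaper than the API
    return selfhost_cost_per_month / api_price_per_1m_tokens * 1_000_000

# Illustrative only: $10 per 1M tokens vs a $4,000/month GPU reservation
# -> break-even around 400M tokens/month
print(breakeven_tokens_per_month(10.0, 4000.0))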

Public Benchmarks

Benchmark   Focus
MMLU        Knowledge (57 subjects)
HumanEval   Code generation
GSM-8K      Math reasoning
TruthfulQA  Factuality
MT-Bench    Multi-turn chat

Caution: Benchmarks can be gamed. Data contamination is common. Always evaluate on YOUR data.

Best Practices

  1. Test on domain-specific data
  2. Measure both quality and cost
  3. Consider latency requirements
  4. Plan for fallback models
  5. Re-evaluate periodically
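
Best practice 4, planning for fallback models, can be as simple as an ordered retry chain. A sketch, where the (name, generate) pairs are placeholders for your primary and backup clients:

def generate_with_fallback(prompt, models):
    # models: ordered list of (name, generate_callable) pairs, primary first
    last_error = None
    for name, generate in models:
        try:
            return name, generate(prompt)
        except Exception as exc:  # e.g. timeouts, rate limits, provider outages
            last_error = exc
    raise RuntimeError("all candidate models failed") from last_error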