# eval-framework
Framework for capturing, storing, and comparing AI evaluations to measure consistency and completeness. Use when: comparing reviews, measuring evaluation quality, running reproducibility tests, auditing AI outputs, validating findings across runs. Triggers: "compare evaluations", "measure consistency", "evaluation framework", "reproducible review", "compare reviews", "validate findings", "audit evaluation".
## When & Why to Use This Skill
The Evaluation Framework skill provides a structured meta-framework for capturing, storing, and comparing AI-generated evaluations. It solves the problem of AI inconsistency and 'hallucination' in reviews by standardizing outputs into a comparable schema, enabling developers to measure the consistency, completeness, and reproducibility of AI audits, code reviews, and architecture assessments.
### Use Cases
- Reproducibility Testing: Running the same evaluation multiple times on the same codebase to ensure the AI identifies the same critical issues consistently.
- Cross-Model Benchmarking: Comparing evaluation results between different models (e.g., Claude 3.5 Sonnet vs. Opus) to determine which model provides higher precision and recall for specific technical audits.
- Regression Testing: Comparing current AI evaluations against historical baselines after code changes to identify which findings have been resolved and which new issues have been introduced.
- Audit Trail Management: Creating a version-controlled repository of AI-generated security and architecture reviews for compliance and long-term quality tracking.
- Evaluation Quality Scoring: Using automated metrics like Jaccard overlap, precision, and recall to quantitatively score the reliability of AI-generated findings.
| name | eval-framework |
|---|---|
| description | Framework for capturing, storing, and comparing AI evaluations to measure consistency and completeness. |
| Use when | comparing reviews, measuring evaluation quality, running reproducibility tests, auditing AI outputs, validating findings across runs |
| Triggers | "compare evaluations", "measure consistency", "evaluation framework", "reproducible review", "compare reviews", "validate findings", "audit evaluation" |
# Evaluation Framework Skill
A meta-framework for measuring the quality, consistency, and completeness of AI-generated evaluations.
## Purpose
When you ask Claude to perform evaluations (architecture reviews, code reviews, security audits), how do you know the output is:
- Consistent - Would it find the same issues if run again?
- Complete - Is it missing important findings?
- Accurate - Are severity ratings calibrated correctly?
- Reproducible - Can another model/run replicate results?
This framework answers those questions by:
- Standardizing evaluation output into a comparable schema
- Storing results in version-controllable format
- Providing tools to compare and score evaluations
## Quick Start

### 1. Run an Evaluation with Structured Output
"Perform an architecture review of deployment/boot-manager/ using the eval-framework output format"
### 2. Compare Two Evaluations
"Compare evaluation results from .eval-results/arch-review-001.yaml and .eval-results/arch-review-002.yaml"
### 3. Generate Consistency Report
"Generate a consistency report for all architecture reviews in .eval-results/"
## Output Protocol
When producing evaluations for comparison, you MUST output this exact structure:
### Step 1: Evaluation Header
```yaml
---
evaluation:
  id: "eval-[8-char-hex]"               # Unique identifier
  type: "architecture-review"           # Type of evaluation
  date: "2025-12-12T10:30:00Z"          # ISO 8601 timestamp
  model: "claude-opus-4-5-20251101"     # Model that produced this
  target:
    path: "deployment/boot-manager/"    # What was evaluated
    commit: "abc1234"                   # Git commit if applicable
    description: "Boot manager for IoT devices"
  context:
    criteria: "IoT production readiness"  # Evaluation criteria
    scope: "full"                         # full, partial, focused
    time_spent_minutes: 45                # Approximate time
---
```
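Results written in this format can be consumed with any YAML parser. Below is a minimal loading sketch (not part of the framework's own tooling), assuming PyYAML is installed and treating the `---` markers as YAML document separators:

```python
import yaml  # PyYAML

def load_evaluation(path: str) -> dict:
    """Load an evaluation result file, merging its YAML documents into one dict."""
    merged: dict = {}
    with open(path, "r", encoding="utf-8") as fh:
        # The header sits between --- markers, so the file may contain more
        # than one YAML document; merge them into a single mapping.
        for doc in yaml.safe_load_all(fh):
            if doc:
                merged.update(doc)
    # Sanity-check the required header fields before using the document.
    for field in ("id", "type", "date", "model", "target"):
        if field not in merged.get("evaluation", {}):
            raise ValueError(f"missing header field: {field}")
    return merged
```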
### Step 2: Findings Array
Each finding MUST have this structure:
```yaml
findings:
  - id: "CRITICAL-001"
    severity: "critical"        # critical, high, medium, low, info
    category: "thread-safety"   # Normalized category
    location:
      file: "state/machine.py"
      line: 391
      function: "_notify_boot_callbacks"
    title: "Callbacks invoked inside lock"
    evidence: |
      The _notify_boot_callbacks method is called while holding
      self._lock, which can cause deadlocks if callbacks attempt
      to acquire the same lock.
    reasoning: |
      RLock is reentrant within the same thread, but callbacks may:
      - Block on I/O while holding lock
      - Call other locked methods
      - Spawn threads that need the lock
      IoT systems run for months - even rare deadlocks are unacceptable.
    impact: "System hang during state transitions"
    recommendation: |
      Copy callback list inside lock, invoke outside lock.
    fix_applied: true           # Was this fixed in the session?
    work_item: "AB#592"         # Associated work item if any
```
### Step 3: Scores
```yaml
scores:
  categories:
    thread_safety: 6
    resource_management: 8
    error_handling: 7
    state_management: 8
    external_operations: 7
    api_web_layer: 8
    configuration: 9
    code_consistency: 7
  overall: 7.5
  production_ready: false
```
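In the example, `overall` equals the arithmetic mean of the category scores ((6 + 8 + 7 + 8 + 7 + 8 + 9 + 7) / 8 = 7.5). A sketch of that aggregation, assuming the mean is the intended formula (the schema itself does not mandate one):

```python
def overall_score(categories: dict[str, float]) -> float:
    """Aggregate category scores into an overall score (simple mean, one decimal)."""
    return round(sum(categories.values()) / len(categories), 1)

categories = {
    "thread_safety": 6, "resource_management": 8, "error_handling": 7,
    "state_management": 8, "external_operations": 7, "api_web_layer": 8,
    "configuration": 9, "code_consistency": 7,
}
print(overall_score(categories))  # 7.5
```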
### Step 4: Summary
```yaml
summary:
  total_findings: 15
  by_severity:
    critical: 5
    high: 5
    medium: 3
    low: 2
    info: 0
  top_issues:
    - "CRITICAL-001: Callbacks invoked inside lock"
    - "CRITICAL-002: ContainerManager missing thread safety"
    - "CRITICAL-003: HealthMonitor missing thread safety"
  positive_observations:
    - "Excellent separation of concerns"
    - "Good use of typing"
    - "Comprehensive logging"
  # Hash for quick comparison (hash of finding IDs + severities)
  fingerprint: "a1b2c3d4e5f6"
```
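A sketch of how `by_severity` and `fingerprint` can be derived from the findings list. The SHA-256 digest truncated to 12 hex characters is an assumption; the schema only requires that identical findings produce identical fingerprints:

```python
import hashlib
from collections import Counter

def summarize(findings: list[dict]) -> dict:
    """Build the summary block: totals, per-severity counts, and a fingerprint."""
    by_severity = Counter(f["severity"] for f in findings)
    # Order-independent hash of finding IDs + severities, truncated to 12 hex chars.
    parts = sorted(f"{f['id']}:{f['severity']}" for f in findings)
    fingerprint = hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()[:12]
    return {
        "total_findings": len(findings),
        "by_severity": {s: by_severity.get(s, 0)
                        for s in ("critical", "high", "medium", "low", "info")},
        "fingerprint": fingerprint,
    }
```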
## Storage Convention
Store evaluation results in the evaluated project:
```
[project-root]/
└── .eval-results/
    ├── arch-review-2025-12-12-eval-a1b2c3d4.yaml
    ├── arch-review-2025-12-12-eval-e5f6g7h8.yaml
    └── comparison-2025-12-12.md
```

Naming convention: `[type]-[date]-eval-[id].yaml`
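A small helper that follows this convention (illustrative only; the ISO date and 8-character hex id mirror the examples above):

```python
from datetime import date
import secrets

def result_filename(eval_type: str, eval_id: str | None = None, day: date | None = None) -> str:
    """Build a result filename of the form [type]-[date]-eval-[id].yaml."""
    eval_id = eval_id or secrets.token_hex(4)   # 8-char hex id, as in the evaluation header
    day = day or date.today()
    return f"{eval_type}-{day.isoformat()}-eval-{eval_id}.yaml"

print(result_filename("arch-review", "a1b2c3d4", date(2025, 12, 12)))
# arch-review-2025-12-12-eval-a1b2c3d4.yaml
```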
## Comparison Protocol

### Matching Findings
Two findings are considered "matching" if:
- Same location (file + function/line within 10 lines), OR
- Similar title (>70% token overlap), OR
- Same evidence pattern (key code snippet matches)
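A sketch of the first two matching rules, operating on findings as parsed from the YAML. The 10-line tolerance and 70% threshold come from the list above; the title tokenization (lowercase, split on non-alphanumerics) and the omission of the evidence-pattern check are simplifications:

```python
import re

def _tokens(text: str) -> set[str]:
    """Lowercase word tokens for title comparison."""
    return set(re.findall(r"[a-z0-9_]+", text.lower()))

def same_location(a: dict, b: dict, line_tolerance: int = 10) -> bool:
    """Same file and either same function or line numbers within the tolerance."""
    la, lb = a.get("location", {}), b.get("location", {})
    if la.get("file") != lb.get("file"):
        return False
    if la.get("function") and la.get("function") == lb.get("function"):
        return True
    if la.get("line") is not None and lb.get("line") is not None:
        return abs(la["line"] - lb["line"]) <= line_tolerance
    return False

def similar_title(a: dict, b: dict, threshold: float = 0.7) -> bool:
    """Token overlap between the two titles above the threshold."""
    ta, tb = _tokens(a.get("title", "")), _tokens(b.get("title", ""))
    if not ta or not tb:
        return False
    return len(ta & tb) / len(ta | tb) > threshold

def findings_match(a: dict, b: dict) -> bool:
    """Two findings match if they share a location or have sufficiently similar titles."""
    return same_location(a, b) or similar_title(a, b)
```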
### Comparison Metrics
| Metric | Definition | Formula |
|---|---|---|
| Overlap | Findings in both evaluations | \|A ∩ B\| / \|A ∪ B\| |
| Precision | Of A's findings, how many in B? | \|A ∩ B\| / \|A\| |
| Recall | Of B's findings, how many in A? | \|A ∩ B\| / \|B\| |
| Severity Agreement | Matching findings with same severity | matching_severity / total_matched |
| Category Agreement | Matching findings with same category | matching_category / total_matched |
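A sketch that computes these metrics from two finding lists, assuming a `findings_match`-style predicate such as the one above and that each finding in B is matched at most once:

```python
from typing import Callable

def compare(a: list[dict], b: list[dict],
            match: Callable[[dict, dict], bool]) -> dict[str, float]:
    """Compute overlap, precision, recall, and agreement metrics for two evaluations."""
    matched: list[tuple[dict, dict]] = []
    unmatched_b = list(b)
    for fa in a:
        for fb in unmatched_b:
            if match(fa, fb):
                matched.append((fa, fb))
                unmatched_b.remove(fb)   # each B finding matches at most once
                break
    n = len(matched)
    union = len(a) + len(b) - n
    return {
        "overlap_jaccard": n / union if union else 1.0,
        "precision": n / len(a) if a else 1.0,
        "recall": n / len(b) if b else 1.0,
        "severity_agreement": (sum(fa["severity"] == fb["severity"] for fa, fb in matched) / n) if n else 1.0,
        "category_agreement": (sum(fa["category"] == fb["category"] for fa, fb in matched) / n) if n else 1.0,
    }
```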
### Comparison Report Format
```markdown
# Evaluation Comparison Report

## Evaluations Compared
- **A**: arch-review-2025-12-12-eval-a1b2c3d4
- **B**: arch-review-2025-12-12-eval-e5f6g7h8

## Metrics
| Metric | Value |
|--------|-------|
| Overlap (Jaccard) | 0.85 |
| Precision (A→B) | 0.90 |
| Recall (A→B) | 0.80 |
| Severity Agreement | 0.95 |
| Category Agreement | 0.88 |

## Finding Comparison

### Found in Both (Matched)
| A Finding | B Finding | Severity Match | Category Match |
|-----------|-----------|----------------|----------------|
| CRITICAL-001 | CRITICAL-001 | ✅ | ✅ |
| HIGH-002 | HIGH-003 | ✅ | ❌ |

### Only in A (Potentially Missed by B)
- HIGH-004: Some finding only A found

### Only in B (Potentially Missed by A)
- MEDIUM-007: Some finding only B found

## Consistency Score: 87%
```
## Category Normalization
To enable comparison across different evaluation types, normalize categories:
| Raw Category | Normalized |
|---|---|
| thread safety, concurrency, race condition, deadlock | thread-safety |
| resource leak, memory leak, fd leak, connection leak | resource-management |
| error handling, exception, recovery | error-handling |
| state machine, persistence, atomic | state-management |
| timeout, retry, external call, api | external-operations |
| validation, input, web, api endpoint | api-web-layer |
| config, secrets, credentials | configuration |
| pattern, consistency, dead code, naming | code-consistency |
| security, auth, injection, xss | security |
| performance, optimization, complexity | performance |
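A keyword-based normalizer that mirrors this table. The lookup order and the `other` fallback are assumptions; some keywords (such as "api") appear in more than one row, so the first match wins:

```python
# Keyword → normalized category, mirroring the table above.
CATEGORY_KEYWORDS = {
    "thread-safety": ["thread safety", "concurrency", "race condition", "deadlock"],
    "resource-management": ["resource leak", "memory leak", "fd leak", "connection leak"],
    "error-handling": ["error handling", "exception", "recovery"],
    "state-management": ["state machine", "persistence", "atomic"],
    "external-operations": ["timeout", "retry", "external call", "api"],
    "api-web-layer": ["validation", "input", "web", "api endpoint"],
    "configuration": ["config", "secrets", "credentials"],
    "code-consistency": ["pattern", "consistency", "dead code", "naming"],
    "security": ["security", "auth", "injection", "xss"],
    "performance": ["performance", "optimization", "complexity"],
}

def normalize_category(raw: str) -> str:
    """Map a free-form category string onto one of the normalized categories."""
    raw = raw.lower()
    for normalized, keywords in CATEGORY_KEYWORDS.items():
        if any(keyword in raw for keyword in keywords):
            return normalized
    return "other"  # fallback; not defined by the table
```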
## Workflow Examples

### Example 1: Reproducibility Test
Run the same review twice and compare:
```
User: "Perform an architecture review of src/ using eval-framework format, save to .eval-results/"
[Run 1 completes, saved as arch-review-...-eval-abc123.yaml]

User: "Perform the same architecture review again with eval-framework format"
[Run 2 completes, saved as arch-review-...-eval-def456.yaml]

User: "Compare the two architecture reviews and generate consistency report"
[Comparison report generated]
```
### Example 2: Cross-Model Comparison
Compare Opus vs Sonnet evaluations:
```
User: "Using Opus, perform security review with eval-framework format"
User: "Using Sonnet, perform security review with eval-framework format"
User: "Compare the two security reviews"
```
### Example 3: Regression Testing
After code changes, verify findings are still valid:
```
User: "Load previous evaluation .eval-results/arch-review-prev.yaml"
User: "Re-evaluate the same scope and compare to previous"
User: "Which findings are now fixed? Which are new?"
```
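The fixed/new classification in the last prompt amounts to a set difference over matched findings. A simplified sketch, keying findings by file and title rather than the full matching rules:

```python
def diff_findings(previous: list[dict], current: list[dict]) -> dict[str, list[dict]]:
    """Classify findings as fixed (only in the previous run) or new (only in the current run)."""
    def key(f: dict) -> tuple:
        loc = f.get("location", {})
        return (loc.get("file"), f.get("title"))   # simplification of the matching rules

    prev_keys = {key(f) for f in previous}
    curr_keys = {key(f) for f in current}
    return {
        "fixed": [f for f in previous if key(f) not in curr_keys],
        "new": [f for f in current if key(f) not in prev_keys],
        "still_open": [f for f in current if key(f) in prev_keys],
    }
```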
## Integration with Other Skills

### With code-review Skill
"Perform a code review of AuthService.cs with eval-framework output"
### With architecture-review Skill (when created)
"Perform an architecture review with eval-framework format"
### With security-review Skill (when created)
"Perform a security audit with eval-framework format"
## Files Reference
| File | Purpose |
|---|---|
| SKILL.md | This file - framework documentation |
| schemas/evaluation.schema.yaml | JSON Schema for evaluation output |
| templates/comparison-report.md | Template for comparison reports |
| scripts/compare-evaluations.py | Python script for comparison |
| examples/architecture-review.md | Example: architecture review with framework |
## Best Practices
- Always include evidence - Code snippets make findings matchable
- Use consistent categories - Refer to normalization table
- Include reasoning - Explains why, not just what
- Generate fingerprint - Enables quick change detection
- Store in version control - Track evaluation evolution over time
- Run multiple times - Single run may miss issues
- Compare across models - Different models find different things