eval-patterns
This skill provides common evaluation patterns and integration guidance. Use when:
- Integrating eval-framework with other plugins
- Designing evaluation workflows
- Choosing between content vs behavior evaluation
- Setting up project-local rubrics
When & Why to Use This Skill
The eval-patterns skill is a comprehensive guide and toolkit for implementing robust evaluation workflows within the Claude environment. It provides standardized patterns for assessing both static content (like code and documentation) and dynamic agent behaviors. By leveraging project-local rubrics and integration strategies, it enables developers to automate quality assurance, ensure brand consistency, and maintain high security standards throughout the development lifecycle.
Use Cases
- Automated Code Review: Integrate the judge agent to automatically evaluate new implementations against security, performance, and style rubrics before merging.
- Brand Voice & Documentation Audit: Use project-local rubrics to verify that marketing copy and technical documentation adhere to specific brand guidelines and quality standards.
- Pre-commit Quality Gates: Run manual or semi-automated checks on staged files to identify potential issues in configuration, API design, or test coverage before code is committed.
- Agent Behavior Monitoring: Evaluate the actions and outputs of AI agents in real-time to ensure they follow prescribed workflows and meet behavioral expectations.
- Cross-Plugin Integration: Programmatically invoke the evaluation framework from other Claude skills to provide structured, rubric-based feedback loops within complex agentic workflows.
| name | eval-patterns |
|---|---|
| description | Common patterns for using the eval-framework effectively in different contexts |
| version | 1.0.0 |
Evaluation Patterns & Integration
Common patterns for using the eval-framework effectively in different contexts.
Evaluation Types
Content Evaluation
Evaluates static content: copy, documentation, code files.
Use for:
- Marketing copy review
- Documentation quality
- Code style/patterns
- Configuration validation
Invocation:
/eval-run brand-voice app/routes/sell-on-vouchline.tsx
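As a point of reference, a project-local content rubric might look like the minimal sketch below. The field names (name, scope, criteria, pattern, message, weight) are illustrative assumptions; the actual schema is defined by the eval-framework itself.

```yaml
# .claude/evals/brand-voice.yaml
# Illustrative sketch only; these field names are assumptions,
# not the eval-framework's confirmed schema.
name: brand-voice
description: Checks marketing copy against brand voice guidelines
scope:
  - "app/routes/**/*.tsx"
criteria:
  - id: no-jargon
    description: Avoid internal jargon in customer-facing copy
    pattern: "\\b(synergy|leverage|best-in-class)\\b"
    message: Replace jargon with plain language the customer would use
    weight: 1
  - id: active-voice
    description: Prefer active voice in headings and calls to action
    message: Rewrite passive headings in active voice
    weight: 2
```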
Behavior Evaluation
Evaluates actions and outputs: what Claude did, not just what exists.
Use for:
- Code review after implementation
- Commit message quality
- Test coverage verification
- API response validation
Invocation:
Judge agent triggered: "Review what I just implemented against the code-security rubric"
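Behavior criteria usually cannot be captured by a regex, so they tend to rely on custom (LLM-evaluated) checks rather than pattern checks. One possible shape is sketched below; the check and prompt fields are assumptions, not the framework's confirmed schema.

```yaml
# Fragment of a hypothetical code-security.yaml; the check/prompt fields
# are assumptions about how an LLM-evaluated criterion might be expressed.
criteria:
  - id: input-validated
    check: custom        # LLM-evaluated rather than regex-based
    prompt: >
      Did the implementation validate and sanitize all user-supplied input
      before it reaches the database layer?
    message: Add validation at the route boundary before persisting input
    required: true
```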
Combined Evaluation
Evaluates both content and behavior together.
Use for:
- Full code review (style + security + behavior)
- Documentation with examples (accuracy + completeness)
- Feature implementation review
Project-Local Setup
Directory Structure
your-project/
├── .claude/
│ └── evals/
│ ├── brand-voice.yaml # Project rubrics
│ ├── code-security.yaml
│ └── api-design.yaml
Quick Setup
- Create directory: mkdir -p .claude/evals
- Create rubric: /eval-create brand-voice --from docs/brand/voice.md
- Run evaluation: /eval-run brand-voice
Rubric Discovery
The judge agent automatically discovers rubrics in:
- .claude/evals/*.yaml (project-local)
- .claude/evals/*.yml (alternate extension)
- Explicit paths passed to commands
Integration Patterns
Pattern 1: Post-Implementation Review
After completing significant work, invoke judge for quality check:
User: "I just finished the authentication module"
Claude: [Uses judge agent to evaluate against code-security rubric]
The judge agent's when_to_use description enables proactive triggering after code review requests.
Pattern 2: Command-Based Validation
Explicit validation during development:
/eval-run brand-voice app/routes/sell-on-vouchline.tsx
Returns structured feedback before committing.
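The exact shape of the feedback depends on the eval-framework itself, but you can expect a score, per-criterion results, and actionable messages, roughly along these lines (illustrative only):

```yaml
# Illustrative shape of an evaluation result; not the framework's literal output.
rubric: brand-voice
score: 78
threshold: 75
passed: true
criteria:
  - id: no-jargon
    passed: false
    message: Replace "leverage" in the hero heading with plain language
  - id: active-voice
    passed: true
```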
Pattern 3: Plugin Integration
Other plugins can invoke the judge programmatically:
## In your plugin's agent/command:
Invoke the eval-framework judge agent with:
- Rubric: [name or path]
- Content: [what to evaluate]
- Context: [additional context]
The judge will return structured evaluation results.
Pattern 4: Pre-Commit Workflow
Manual pre-commit check (not automated hook):
User: "Check my changes before I commit"
Claude: [Runs relevant rubrics against staged files]
Choosing Rubrics
By Content Type
| Content | Recommended Rubric |
|---|---|
| Marketing copy | brand-voice |
| API code | code-security, api-design |
| Documentation | docs-quality |
| Test files | test-coverage |
| Config files | config-validation |
By Quality Gate
| Gate | Threshold | Required Criteria |
|---|---|---|
| Draft review | 60% | None |
| PR review | 75% | Core criteria |
| Production | 85% | All security |
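If you want to encode these gates alongside your rubrics, the table above might translate into a configuration like the following sketch; the gates, threshold, and required_criteria keys are assumptions based on the terminology used in this guide, not a documented eval-framework feature.

```yaml
# Hypothetical gate configuration; key names are assumed from the
# terminology used in this guide (threshold, required_criteria).
gates:
  draft-review:
    threshold: 60
  pr-review:
    threshold: 75
    required_criteria: [core-checks]      # "Core criteria" in the table above
  production:
    threshold: 85
    required_criteria: [security-checks]  # "All security" in the table above
```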
Rubric Composition
Layered Rubrics
Create focused rubrics that can be run together:
# code-style.yaml - formatting, naming
# code-security.yaml - vulnerabilities
# code-perf.yaml - performance patterns
Run multiple: /eval-run code-style && /eval-run code-security
Domain-Specific Rubrics
Create rubrics for specific features:
# auth-flow.yaml - authentication patterns
# payment-handling.yaml - financial code
# user-input.yaml - input validation
Best Practices
Start Simple: Begin with 2-3 criteria, add more as needed.
Iterate Rubrics: Version your rubrics and refine based on false positives/negatives.
Context Matters: Include file patterns in scope to auto-filter relevant files.
Required vs Optional: Use required_criteria for must-pass items, let others contribute to score.
Actionable Feedback: Every check message should explain how to fix the issue.
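Several of these practices combine naturally in a single rubric. The sketch below pairs a scoped rubric with one required criterion and one scored criterion, each carrying an actionable message; as elsewhere, the field names are illustrative assumptions rather than the framework's confirmed schema.

```yaml
# Illustrative only; combines scope filtering, required vs scored criteria,
# and actionable messages. Field names are assumptions.
name: api-design
scope:
  - "app/api/**/*.ts"      # auto-filter to relevant files
required_criteria:
  - id: no-hardcoded-secrets
    pattern: "(api_key|secret)\\s*="
    message: Move credentials to environment variables and rotate any exposed key
criteria:
  - id: consistent-error-shape
    description: Error responses follow the shared error format
    message: Return errors through the shared error helper instead of ad-hoc objects
    weight: 2
```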
Troubleshooting
Rubric not found: Check .claude/evals/ exists and rubric name matches file.
False positives: Refine regex patterns or use custom checks for nuance.
Score too low: Review thresholds - they might be too strict for your context.
Slow evaluation: Reduce custom checks (LLM-evaluated) where pattern checks work.
Reference Files
See references/ for additional patterns:
- integration-examples.md: Real-world integration examples