rubric-design
This skill provides guidance on designing effective evaluation rubrics. Use when:
- Creating criteria for content or code quality assessment
- Defining weights and thresholds for evaluation
- Designing check types (pattern-based vs custom)
- Structuring rubrics for maintainability and reusability
When & Why to Use This Skill
The Rubric Design skill provides a framework for creating structured, measurable, and actionable evaluation rubrics. It moves assessment beyond subjective judgment by defining clear criteria, weighted importance, and explicit thresholds for content and code quality. By combining deterministic pattern-based checks with nuanced LLM-evaluated custom checks, it supports consistent feedback and objective benchmarking of outputs.
Use Cases
- Software Quality Assurance: Designing rubrics to evaluate code security, maintainability, and documentation standards during the development lifecycle.
- Content Strategy & Marketing: Creating brand voice rubrics to ensure marketing copy aligns with specific tone, directness, and audience engagement goals.
- AI Agent Benchmarking: Developing systematic scoring systems to measure the accuracy, safety, and reliability of AI-generated responses against predefined benchmarks.
- Technical Writing Audits: Establishing criteria for technical documentation to verify structural hierarchy, clarity of instructions, and adherence to style guides.
| name | rubric-design |
|---|---|
| description | Guidance on designing effective evaluation rubrics: criteria, weights, thresholds, and check types |
| version | 1.0.0 |
Designing Effective Evaluation Rubrics
Design rubrics that are clear, measurable, and actionable. Good rubrics produce consistent results and provide useful feedback.
Core Principles
Measurable over subjective: Every criterion should have concrete checks that can be evaluated consistently. "Writes well" is bad. "Uses active voice and leads with verbs" is good.
Weighted by importance: Not all criteria are equal. Assign weights that reflect actual impact. Security issues might be 40% of a code review rubric while style is 10%.
Thresholds reflect reality: Set thresholds that match real-world expectations. A brand voice rubric for marketing copy might require 80%, while a security rubric for production code might require 95%.
Actionable feedback: Every check should produce feedback that tells the user exactly how to fix the issue.
Rubric Schema
See references/schema.md for the complete YAML schema reference.
Essential Fields
name: rubric-name        # Identifier (kebab-case)
version: 1.0.0           # Semantic version
description: |           # When to use this rubric
  Evaluates marketing copy for brand voice alignment
scope:
  type: content          # content | behavior | both
  file_patterns:         # Optional file filters
    - "*.md"
    - "app/routes/**/*.tsx"
criteria:
  criterion-name:
    weight: 25           # Percentage (all weights sum to 100)
    threshold: 80        # Minimum score to pass this criterion
    description: "..."   # What this measures
    checks: [...]        # Evaluation checks
    examples:            # Pass/fail examples
      pass: "Good example"
      fail: "Bad example"
passing:
  min_score: 75          # Overall minimum
  required_criteria: []  # Must-pass regardless of score
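To make the schema concrete, here is a minimal, hypothetical rubric that fills in every essential field. The name, patterns, and examples are illustrative only, and the two-criterion weight split is kept deliberately simple (see Weight Distribution below for realistic spreads):

# Hypothetical rubric for illustration; two criteria only to keep the sketch short
name: headline-quality
version: 1.0.0
description: |
  Evaluates landing-page headlines for directness and capitalization
scope:
  type: content
  file_patterns:
    - "*.md"
criteria:
  directness:
    weight: 60
    threshold: 80
    description: "Headlines address the reader and avoid hedging"
    checks:
      - type: absence
        pattern: "\\b(might|could|potentially)\\b"
        message: "Remove hedge words from headlines"
      - type: presence
        pattern: "\\b(you|your)\\b"
        message: "Address the reader directly"
    examples:
      pass: "Ship your first rubric today"
      fail: "This might help you write better headlines"
  capitalization:
    weight: 40
    threshold: 90
    description: "Headlines start with a capital letter"
    checks:
      - type: pattern
        pattern: "^[A-Z]"
        message: "Capitalize the first word of the headline"
passing:
  min_score: 75
  required_criteria: [capitalization]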
Check Types
Pattern-Based Checks (Fast, Deterministic)
Use for objective, pattern-matchable criteria:
checks:
  # Absence: Content should NOT contain pattern
  - type: absence
    pattern: "\\b(might|could|potentially)\\b"
    message: "Remove hedge words for more confident tone"

  # Presence: Content MUST contain pattern
  - type: presence
    pattern: "\\b(you|your)\\b"
    message: "Address the reader directly"

  # Pattern: Content should match format
  - type: pattern
    pattern: "^[A-Z]"
    message: "Headlines should start with capital letter"
When to use pattern checks:
- Detecting forbidden words/phrases
- Enforcing required elements
- Validating format/structure
- Checking naming conventions
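As a concrete instance of the last item above, a hypothetical naming-convention check (the pattern and message are illustrative, not part of the bundled examples):

checks:
  # Hypothetical check: flag snake_case function declarations in a camelCase codebase
  - type: absence
    pattern: "function [a-z]+_[a-z]"
    message: "Use camelCase for function names, e.g. parseInput instead of parse_input"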
Custom Checks (LLM-Evaluated)
Use for nuanced, context-dependent criteria:
checks:
  - type: custom
    prompt: "Does this content lead with action verbs and avoid passive voice?"
    message: "Use active voice and lead with verbs"

  - type: custom
    prompt: "Is the tone confident without being arrogant?"
    message: "Adjust tone: confident but approachable"
When to use custom checks:
- Evaluating tone/voice
- Assessing logical flow
- Checking context-appropriate content
- Nuanced quality assessments
Criterion Design
Good Criteria Have
- Clear name: directness, error-handling, test-coverage
- Focused scope: One quality dimension per criterion
- Multiple checks: 2-5 checks that together assess the criterion
- Concrete examples: Pass and fail that clarify expectations
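Putting these guidelines together, a hypothetical code-review criterion with a focused scope, three distinct checks (two pattern-based, one custom), and concrete pass/fail examples:

criteria:
  error-handling:  # Hypothetical criterion for illustration
    weight: 30
    threshold: 85
    description: "Errors are caught, not swallowed, and explained to the caller"
    checks:
      - type: presence
        pattern: "try\\s*\\{"
        message: "Wrap failure-prone calls in try/catch blocks"
      - type: absence
        pattern: "catch\\s*\\([^)]*\\)\\s*\\{\\s*\\}"
        message: "Do not swallow errors with empty catch blocks"
      - type: custom
        prompt: "Do error messages explain what failed and how to recover?"
        message: "Add error context: what failed and how to fix"
    examples:
      pass: "throw new Error(`Config load failed: ${path} missing; run setup first`)"
      fail: "catch (e) {}"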
Common Criterion Categories
Content criteria (what it says):
- Directness, specificity, accuracy, completeness
Style criteria (how it's written):
- Tone, voice, formatting, readability
Structural criteria (how it's organized):
- Hierarchy, flow, sections, navigation
Behavioral criteria (what it does):
- Error handling, logging, testing, security
Weight Distribution
Weights should reflect actual importance:
# Brand Voice Rubric Example
criteria:
  directness: { weight: 30 }    # Core brand attribute
  specificity: { weight: 25 }   # Important for trust
  tone: { weight: 20 }          # Supports brand
  audience: { weight: 15 }      # Enables connection
  formatting: { weight: 10 }    # Nice to have
Guidelines:
- All weights must sum to 100
- Most important criterion: 25-40%
- Supporting criteria: 15-25%
- Minor criteria: 5-15%
- Avoid equal weights (they signal a lack of prioritization)
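As a worked example with hypothetical criterion scores, and assuming the overall score is the weighted average of criterion scores (the schema above does not mandate this aggregation, so treat it as a sketch): scores of directness 90, specificity 80, tone 70, audience 85, and formatting 100 combine as

overall = 0.30(90) + 0.25(80) + 0.20(70) + 0.15(85) + 0.10(100)
        = 27 + 20 + 14 + 12.75 + 10
        = 83.75

which clears a min_score of 75, provided every criterion also meets its own threshold.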
Threshold Setting
Choose thresholds based on context:
| Context | Threshold Range | Rationale |
|---|---|---|
| Security-critical | 90-100% | Can't compromise |
| Production code | 80-90% | High standards |
| Marketing copy | 70-85% | Room for creativity |
| Draft content | 60-75% | Early feedback |
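Thresholds compose with required criteria: a hypothetical security rubric might pair a high overall bar with a must-pass criterion, so a single failed criterion sinks the run even when the weighted score is high.

passing:
  min_score: 90                              # Security-critical range
  required_criteria: [injection-prevention]  # Hypothetical criterion; must pass regardless of score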
Examples
See examples/ for working rubric examples:
- brand-voice.yaml - Marketing copy evaluation
- code-security.yaml - Security audit rubric
- api-design.yaml - API review criteria
Anti-Patterns
Avoid:
- Vague criteria: "Code quality" (unmeasurable)
- Overlapping checks: Testing same thing twice
- Extreme thresholds: 100% (nothing passes) or 50% (everything passes)
- Missing examples: Leaves room for interpretation
- Generic messages: "Fix this" (not actionable)
Prefer:
- Specific criteria: "Error messages include context and recovery steps"
- Distinct checks: Each check tests something unique
- Reasonable thresholds: Based on real-world expectations
- Clear examples: Both pass and fail cases
- Actionable messages: "Add error context: what failed and how to fix"
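To make the contrast concrete, here is the same hypothetical check written both ways:

# Anti-pattern: vague prompt, generic message
- type: custom
  prompt: "Is the code quality good?"
  message: "Fix this"

# Preferred: specific prompt, actionable message
- type: custom
  prompt: "Do error messages include what failed and how to recover?"
  message: "Add error context: what failed and how to fix"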