rubric-design


This skill provides guidance on designing effective evaluation rubrics. Use when:

  • Creating criteria for content or code quality assessment
  • Defining weights and thresholds for evaluation
  • Designing check types (pattern-based vs custom)
  • Structuring rubrics for maintainability and reusability


When & Why to Use This Skill

The Rubric Design skill provides a framework for creating structured, measurable, and actionable evaluation rubrics. It moves assessment beyond subjective judgment by defining clear criteria, weighted importance, and explicit thresholds for content and code quality. By combining deterministic pattern-based checks with nuanced, LLM-evaluated custom checks, it supports consistent feedback and objective benchmarking of any output.

Use Cases

  • Software Quality Assurance: Designing rubrics to evaluate code security, maintainability, and documentation standards during the development lifecycle.
  • Content Strategy & Marketing: Creating brand voice rubrics to ensure marketing copy aligns with specific tone, directness, and audience engagement goals.
  • AI Agent Benchmarking: Developing systematic scoring systems to measure the accuracy, safety, and reliability of AI-generated responses against predefined benchmarks.
  • Technical Writing Audits: Establishing criteria for technical documentation to verify structural hierarchy, clarity of instructions, and adherence to style guides.

Designing Effective Evaluation Rubrics

Design rubrics that are clear, measurable, and actionable. Good rubrics produce consistent results and provide useful feedback.

Core Principles

Measurable over subjective: Every criterion should have concrete checks that can be evaluated consistently. "Writes well" is bad. "Uses active voice and leads with verbs" is good.

Weighted by importance: Not all criteria are equal. Assign weights that reflect actual impact. Security issues might be 40% of a code review rubric while style is 10%.

Thresholds reflect reality: Set thresholds that match real-world expectations. A brand voice rubric for marketing copy might require 80%, while a security rubric for production code might require 95%.

Actionable feedback: Every check should produce feedback that tells the user exactly how to fix the issue.
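
For instance, a check whose message names both the problem and the fix (a minimal sketch; the pattern and wording are illustrative):

checks:
  - type: absence
    pattern: "\\bTODO\\b"
    message: "Remove TODO markers: resolve the task or link a tracked issue"

Compare that with a generic "Fix this" — the message above tells the author exactly what to change.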

Rubric Schema

See references/schema.md for the complete YAML schema reference.

Essential Fields

name: rubric-name          # Identifier (kebab-case)
version: 1.0.0             # Semantic version
description: |             # When to use this rubric
  Evaluates marketing copy for brand voice alignment

scope:
  type: content            # content | behavior | both
  file_patterns:           # Optional file filters
    - "*.md"
    - "app/routes/**/*.tsx"

criteria:
  criterion-name:
    weight: 25             # Percentage (all weights sum to 100)
    threshold: 80          # Minimum score to pass this criterion
    description: "..."     # What this measures
    checks: [...]          # Evaluation checks
    examples:              # Pass/fail examples
      pass: "Good example"
      fail: "Bad example"

passing:
  min_score: 75            # Overall minimum
  required_criteria: []    # Must-pass regardless of score

Check Types

Pattern-Based Checks (Fast, Deterministic)

Use for objective, pattern-matchable criteria:

checks:
  # Absence: Content should NOT contain pattern
  - type: absence
    pattern: "\\b(might|could|potentially)\\b"
    message: "Remove hedge words for more confident tone"

  # Presence: Content MUST contain pattern
  - type: presence
    pattern: "\\b(you|your)\\b"
    message: "Address the reader directly"

  # Pattern: Content should match format
  - type: pattern
    pattern: "^[A-Z]"
    message: "Headlines should start with capital letter"

When to use pattern checks:

  • Detecting forbidden words/phrases
  • Enforcing required elements
  • Validating format/structure
  • Checking naming conventions (see the sketch below)
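
The naming-convention case from the last bullet might look like this (a sketch; the kebab-case pattern is illustrative):

checks:
  - type: pattern
    pattern: "^[a-z][a-z0-9]*(-[a-z0-9]+)*$"
    message: "Use kebab-case: lowercase words separated by hyphens"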

Custom Checks (LLM-Evaluated)

Use for nuanced, context-dependent criteria:

checks:
  - type: custom
    prompt: "Does this content lead with action verbs and avoid passive voice?"
    message: "Use active voice and lead with verbs"

  - type: custom
    prompt: "Is the tone confident without being arrogant?"
    message: "Adjust tone: confident but approachable"

When to use custom checks:

  • Evaluating tone/voice
  • Assessing logical flow (see the sketch after this list)
  • Checking context-appropriate content
  • Nuanced quality assessments
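
The logical-flow case, as a sketch (the prompt wording is illustrative):

checks:
  - type: custom
    prompt: "Does each section build on the previous one without unexplained jumps?"
    message: "Reorder or bridge sections so each builds on established context"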

Criterion Design

Good Criteria Have

  1. Clear name: directness, error-handling, test-coverage
  2. Focused scope: One quality dimension per criterion
  3. Multiple checks: 2-5 checks that together assess the criterion
  4. Concrete examples: Pass and fail that clarify expectations
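
Putting the four together: a directness criterion assembled from the checks shown earlier (the pass/fail examples are illustrative):

criteria:
  directness:
    weight: 30
    threshold: 80
    description: "Copy leads with verbs and avoids hedging"
    checks:
      - type: absence
        pattern: "\\b(might|could|potentially)\\b"
        message: "Remove hedge words for more confident tone"
      - type: custom
        prompt: "Does this content lead with action verbs and avoid passive voice?"
        message: "Use active voice and lead with verbs"
    examples:
      pass: "Ship your first rubric in five minutes."
      fail: "You could potentially get started fairly quickly."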

Common Criterion Categories

Content criteria (what it says):

  • Directness, specificity, accuracy, completeness

Style criteria (how it's written):

  • Tone, voice, formatting, readability

Structural criteria (how it's organized):

  • Hierarchy, flow, sections, navigation

Behavioral criteria (what it does):

  • Error handling, logging, testing, security

Weight Distribution

Weights should reflect actual importance:

# Brand Voice Rubric Example
criteria:
  directness:    { weight: 30 }  # Core brand attribute
  specificity:   { weight: 25 }  # Important for trust
  tone:          { weight: 20 }  # Supports brand
  audience:      { weight: 15 }  # Enables connection
  formatting:    { weight: 10 }  # Nice to have

Guidelines:

  • All weights must sum to 100
  • Most important criterion: 25-40%
  • Supporting criteria: 15-25%
  • Minor criteria: 5-15%
  • Avoid equal weights (shows lack of prioritization)

Threshold Setting

Choose thresholds based on context:

Context             Threshold Range   Rationale
Security-critical   90-100%           Can't compromise
Production code     80-90%            High standards
Marketing copy      70-85%            Room for creativity
Draft content       60-75%            Early feedback
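
A security-critical rubric sits at the top of this table; its passing block might look like this (a sketch; the criterion names are hypothetical):

passing:
  min_score: 95              # Security-critical: can't compromise
  required_criteria:
    - injection-prevention   # Must pass regardless of overall score
    - secrets-handling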

Examples

See examples/ for working rubric examples:

  • brand-voice.yaml - Marketing copy evaluation
  • code-security.yaml - Security audit rubric
  • api-design.yaml - API review criteria

Anti-Patterns

Avoid:

  • Vague criteria: "Code quality" (unmeasurable)
  • Overlapping checks: Testing same thing twice
  • Extreme thresholds: 100% (nothing passes) or 50% (everything passes)
  • Missing examples: Leaves room for interpretation
  • Generic messages: "Fix this" (not actionable)

Prefer:

  • Specific criteria: "Error messages include context and recovery steps"
  • Distinct checks: Each check tests something unique
  • Reasonable thresholds: Based on real-world expectations
  • Clear examples: Both pass and fail cases
  • Actionable messages: "Add error context: what failed and how to fix"
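
To make the contrast concrete, here is a vague criterion rewritten as a measurable one, built from the specific criterion and message above (a sketch; names and weights are illustrative):

# Avoid: vague and unmeasurable
criteria:
  code-quality:
    weight: 100
    description: "Code should be high quality"

# Prefer: specific, weighted, actionable
criteria:
  error-messages:
    weight: 30
    threshold: 80
    description: "Error messages include context and recovery steps"
    checks:
      - type: custom
        prompt: "Does each error message state what failed and how to recover?"
        message: "Add error context: what failed and how to fix"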