rubric-design
This skill provides guidance on designing effective evaluation rubrics. Use when:
- Creating criteria for content or code quality assessment
- Defining weights and thresholds for evaluation
- Designing check types (pattern-based vs custom)
- Structuring rubrics for maintainability and reusability
When & Why to Use This Skill
The Rubric Design skill provides a framework for creating structured, measurable, and actionable evaluation rubrics. It moves assessment beyond subjective judgment by defining clear criteria, weighted importance, and explicit thresholds for content and code quality. By combining deterministic pattern-based checks with nuanced LLM-evaluated custom checks, it supports consistent feedback and objective benchmarking of outputs.
Use Cases
- Software Quality Assurance: Designing rubrics to evaluate code security, maintainability, and documentation standards during the development lifecycle.
- Content Strategy & Marketing: Creating brand voice rubrics to ensure marketing copy aligns with specific tone, directness, and audience engagement goals.
- AI Agent Benchmarking: Developing systematic scoring systems to measure the accuracy, safety, and reliability of AI-generated responses against predefined benchmarks.
- Technical Writing Audits: Establishing criteria for technical documentation to verify structural hierarchy, clarity of instructions, and adherence to style guides.
| name | rubric-design |
|---|---|
| description | Guidance on designing effective evaluation rubrics: criteria, weights, thresholds, and check types |
| version | 1.0.0 |
Designing Effective Evaluation Rubrics
Design rubrics that are clear, measurable, and actionable. Good rubrics produce consistent results and provide useful feedback.
Core Principles
Measurable over subjective: Every criterion should have concrete checks that can be evaluated consistently. "Writes well" is bad. "Uses active voice and leads with verbs" is good.
Weighted by importance: Not all criteria are equal. Assign weights that reflect actual impact. Security issues might be 40% of a code review rubric while style is 10%.
Thresholds reflect reality: Set thresholds that match real-world expectations. A brand voice rubric for marketing copy might require 80%, while a security rubric for production code might require 95%.
Actionable feedback: Every check should produce feedback that tells the user exactly how to fix the issue.
Rubric Schema
See references/schema.md for the complete YAML schema reference.
Essential Fields
name: rubric-name        # Identifier (kebab-case)
version: 1.0.0           # Semantic version
description: |           # When to use this rubric
  Evaluates marketing copy for brand voice alignment
scope:
  type: content          # content | behavior | both
  file_patterns:         # Optional file filters
    - "*.md"
    - "app/routes/**/*.tsx"
criteria:
  criterion-name:
    weight: 25           # Percentage (all weights sum to 100)
    threshold: 80        # Minimum score to pass this criterion
    description: "..."   # What this measures
    checks: [...]        # Evaluation checks
    examples:            # Pass/fail examples
      pass: "Good example"
      fail: "Bad example"
passing:
  min_score: 75          # Overall minimum
  required_criteria: []  # Must-pass regardless of score
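To make the schema concrete, here is a minimal, hypothetical rubric that fills in every essential field. The name, patterns, and examples are illustrative only, and the two-criterion weight split is kept deliberately simple (see Weight Distribution below for realistic spreads):

# Hypothetical rubric for illustration; two criteria only to keep the sketch short
name: headline-quality
version: 1.0.0
description: |
  Evaluates landing-page headlines for directness and capitalization
scope:
  type: content
  file_patterns:
    - "*.md"
criteria:
  directness:
    weight: 60
    threshold: 80
    description: "Headlines address the reader and avoid hedging"
    checks:
      - type: absence
        pattern: "\\b(might|could|potentially)\\b"
        message: "Remove hedge words from headlines"
      - type: presence
        pattern: "\\b(you|your)\\b"
        message: "Address the reader directly"
    examples:
      pass: "Ship your first rubric today"
      fail: "This might help you write better headlines"
  capitalization:
    weight: 40
    threshold: 90
    description: "Headlines start with a capital letter"
    checks:
      - type: pattern
        pattern: "^[A-Z]"
        message: "Capitalize the first word of the headline"
passing:
  min_score: 75
  required_criteria: [capitalization]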
Check Types
Pattern-Based Checks (Fast, Deterministic)
Use for objective, pattern-matchable criteria:
checks:
  # Absence: Content should NOT contain pattern
  - type: absence
    pattern: "\\b(might|could|potentially)\\b"
    message: "Remove hedge words for more confident tone"

  # Presence: Content MUST contain pattern
  - type: presence
    pattern: "\\b(you|your)\\b"
    message: "Address the reader directly"

  # Pattern: Content should match format
  - type: pattern
    pattern: "^[A-Z]"
    message: "Headlines should start with capital letter"
When to use pattern checks:
- Detecting forbidden words/phrases
- Enforcing required elements
- Validating format/structure
- Checking naming conventions
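As a concrete instance of the last item above, a hypothetical naming-convention check (the pattern and message are illustrative, not part of the bundled examples):

checks:
  # Hypothetical check: flag snake_case function declarations in a camelCase codebase
  - type: absence
    pattern: "function [a-z]+_[a-z]"
    message: "Use camelCase for function names, e.g. parseInput instead of parse_input"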
Custom Checks (LLM-Evaluated)
Use for nuanced, context-dependent criteria:
checks:
  - type: custom
    prompt: "Does this content lead with action verbs and avoid passive voice?"
    message: "Use active voice and lead with verbs"

  - type: custom
    prompt: "Is the tone confident without being arrogant?"
    message: "Adjust tone: confident but approachable"
When to use custom checks:
- Evaluating tone/voice
- Assessing logical flow
- Checking context-appropriate content
- Nuanced quality assessments
Criterion Design
Good Criteria Have
- Clear name: directness, error-handling, test-coverage
- Focused scope: One quality dimension per criterion
- Multiple checks: 2-5 checks that together assess the criterion
- Concrete examples: Pass and fail that clarify expectations
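Putting these guidelines together, a hypothetical code-review criterion with a focused scope, three distinct checks (two pattern-based, one custom), and concrete pass/fail examples:

criteria:
  error-handling:  # Hypothetical criterion for illustration
    weight: 30
    threshold: 85
    description: "Errors are caught, not swallowed, and explained to the caller"
    checks:
      - type: presence
        pattern: "try\\s*\\{"
        message: "Wrap failure-prone calls in try/catch blocks"
      - type: absence
        pattern: "catch\\s*\\([^)]*\\)\\s*\\{\\s*\\}"
        message: "Do not swallow errors with empty catch blocks"
      - type: custom
        prompt: "Do error messages explain what failed and how to recover?"
        message: "Add error context: what failed and how to fix"
    examples:
      pass: "throw new Error(`Config load failed: ${path} missing; run setup first`)"
      fail: "catch (e) {}"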
Common Criterion Categories
Content criteria (what it says):
- Directness, specificity, accuracy, completeness
Style criteria (how it's written):
- Tone, voice, formatting, readability
Structural criteria (how it's organized):
- Hierarchy, flow, sections, navigation
Behavioral criteria (what it does):
- Error handling, logging, testing, security
Weight Distribution
Weights should reflect actual importance:
# Brand Voice Rubric Example
criteria:
  directness: { weight: 30 }    # Core brand attribute
  specificity: { weight: 25 }   # Important for trust
  tone: { weight: 20 }          # Supports brand
  audience: { weight: 15 }      # Enables connection
  formatting: { weight: 10 }    # Nice to have
Guidelines:
- All weights must sum to 100
- Most important criterion: 25-40%
- Supporting criteria: 15-25%
- Minor criteria: 5-15%
- Avoid equal weights (they signal a lack of prioritization)
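As a worked example with hypothetical criterion scores, and assuming the overall score is the weighted average of criterion scores (the schema above does not mandate this aggregation, so treat it as a sketch): scores of directness 90, specificity 80, tone 70, audience 85, and formatting 100 combine as

overall = 0.30(90) + 0.25(80) + 0.20(70) + 0.15(85) + 0.10(100)
        = 27 + 20 + 14 + 12.75 + 10
        = 83.75

which clears a min_score of 75, provided every criterion also meets its own threshold.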
Threshold Setting
Choose thresholds based on context:
| Context | Threshold Range | Rationale |
|---|---|---|
| Security-critical | 90-100% | Can't compromise |
| Production code | 80-90% | High standards |
| Marketing copy | 70-85% | Room for creativity |
| Draft content | 60-75% | Early feedback |
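Thresholds compose with required criteria: a hypothetical security rubric might pair a high overall bar with a must-pass criterion, so a single failed criterion sinks the run even when the weighted score is high.

passing:
  min_score: 90                              # Security-critical range
  required_criteria: [injection-prevention]  # Hypothetical criterion; must pass regardless of score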
Examples
See examples/ for working rubric examples:
- brand-voice.yaml - Marketing copy evaluation
- code-security.yaml - Security audit rubric
- api-design.yaml - API review criteria
Anti-Patterns
Avoid:
- Vague criteria: "Code quality" (unmeasurable)
- Overlapping checks: Testing same thing twice
- Extreme thresholds: 100% (nothing passes) or 50% (everything passes)
- Missing examples: Leaves room for interpretation
- Generic messages: "Fix this" (not actionable)
Prefer:
- Specific criteria: "Error messages include context and recovery steps"
- Distinct checks: Each check tests something unique
- Reasonable thresholds: Based on real-world expectations
- Clear examples: Both pass and fail cases
- Actionable messages: "Add error context: what failed and how to fix"
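To make the contrast concrete, here is the same hypothetical check written both ways:

# Anti-pattern: vague prompt, generic message
- type: custom
  prompt: "Is the code quality good?"
  message: "Fix this"

# Preferred: specific prompt, actionable message
- type: custom
  prompt: "Do error messages include what failed and how to recover?"
  message: "Add error context: what failed and how to fix"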