📏 Evaluation and Benchmarks Skills
Browse skills in the Evaluation and Benchmarks category.
Skill Judge
A powerful skill for Claude agents.
fpf-skillmetric-evaluate-cslc
Evaluates a value against an FPF CSLC (Characteristic/Scale/Level/Coordinate) definition (A.18).
confidence-evaluator
Evaluate requirement clarity and completeness using ISO/IEC/IEEE 29148:2018 criteria. Use when user asks to implement features, fix bugs, or make changes. Automatically invoked when confidence_policy is enabled in ai-settings.json.
fpf-skillmetric-score-usability
Calculates the SkillUsabilityScore (U.Metric) for Zero-Shot Enactment.
fpf-skillhello-world
Minimal reference skill used to validate parsing and loading.
fpf-skillplanning-initialize-baseline
Creates an initial SlotFillingsPlanItem (A.15.3) baseline.
validation-test
A test skill to validate that SessionStart hooks can create symlinks before skill discovery. If you can see this skill, the hook timing works correctly.
edu-demo-evaluator-free
Watch an educational demo like a learner (BLIND evaluation). No test cases. No benchmark. No rubric. Honest assessment of: impression, what works, what doesn't, learner impact, recommendation. Output: agent_X_free_eval.json
improve-skill
Analyze Claude Code session transcripts to improve existing skills or create new ones. Use when you want to review a past session to identify what worked, what didn't, and how to enhance skill documentation. Extracts session data and provides structured analysis prompts. Triggers on "improve skill", "analyze session", "review session", "skill improvement", "create skill from session", "skill not working", "skill missed", "skill didn't trigger", "enhance skill", "refine skill", "skill feedback", "session transcript", "what went wrong", "skill optimization", "better triggers".
prompt-iteration
Use when iteratively improving agent prompts through automated LLM-as-Judge evaluation. Runs eval→fix→commit loop with circuit breakers.
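The eval→fix→commit loop with circuit breakers can be pictured roughly as in the sketch below; the helper callables (run_eval, apply_fix, commit), the iteration cap, and the score threshold are hypothetical placeholders, not the skill's actual interface.

```python
# Rough sketch of an eval -> fix -> commit loop with circuit breakers.
# run_eval, apply_fix, and commit are hypothetical placeholders supplied
# by the caller; the iteration cap and target score are illustrative.

MAX_ITERATIONS = 5    # circuit breaker: hard cap on loop iterations
TARGET_SCORE = 0.90   # circuit breaker: stop once the judge is satisfied

def iterate_prompt(prompt: str, run_eval, apply_fix, commit) -> str:
    """Iteratively improve a prompt until a breaker trips or the target is hit."""
    best_score = run_eval(prompt)
    for _ in range(MAX_ITERATIONS):
        if best_score >= TARGET_SCORE:
            break  # good enough: stop iterating
        candidate = apply_fix(prompt, best_score)
        score = run_eval(candidate)
        if score <= best_score:
            break  # no improvement: trip the breaker instead of looping forever
        prompt, best_score = candidate, score
        commit(prompt, score)
    return prompt
```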
nixtla-benchmark-reporter
Generate comprehensive markdown benchmark reports from forecast accuracy metrics with model comparisons, statistical analysis, and regression detection. Use when analyzing baseline performance, comparing forecast models, or validating model quality. Trigger with 'generate benchmark report', 'analyze forecast metrics', or 'create performance summary'.
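Regression detection of the kind described here usually reduces to comparing a candidate's error metric against a baseline with a tolerance. A minimal sketch, assuming MAE as the metric; the model names, values, and 5% tolerance are chosen purely for illustration.

```python
# Sketch of regression detection on forecast accuracy metrics: flag any
# model whose MAE worsened by more than a tolerance versus the baseline.
# The metric values and 5% tolerance are illustrative assumptions.

TOLERANCE = 0.05  # allow up to 5% relative degradation before flagging

def detect_regressions(baseline_mae: dict, candidate_mae: dict) -> list:
    """Return the models whose MAE regressed beyond the tolerance."""
    regressed = []
    for model, base in baseline_mae.items():
        cand = candidate_mae.get(model)
        if cand is not None and cand > base * (1 + TOLERANCE):
            regressed.append(model)
    return regressed

print(detect_regressions(
    {"auto_arima": 12.4, "timegpt": 9.8},
    {"auto_arima": 12.5, "timegpt": 11.0},
))  # -> ['timegpt']  (9.8 * 1.05 = 10.29 < 11.0)
```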
growth-learning
Analytics, feedback processing, and continuous improvement
eval-patterns
This skill provides common evaluation patterns and integration guidance. Use when integrating eval-framework with other plugins, designing evaluation workflows, choosing between content vs. behavior evaluation, or setting up project-local rubrics.
nixtla-universal-validator
Validate Nixtla skills and plugins with deterministic evidence bundles and strict schema gates. Use when auditing changes or enforcing compliance. Trigger with 'run validation' or 'audit validators'.
org-universal-validator
Validate skills and plugins with deterministic evidence bundles and strict schema gates. Use when auditing changes or enforcing compliance. Trigger with 'run validation' or 'audit validators'.
edu-demo-evaluator
Evaluate educational demos using Chrome tools for E2E testing. Executes test cases from test_cases.json, captures screenshots, verifies learning outcomes. Scores QUALITY relative to the benchmark. Uses real browser interaction via mcp__claude-in-chrome__* tools.
model-evaluator
Evaluate and compare ML model performance with rigorous testing methodologies
marker-engine-rl
Extends the Marker Engine skill with SFT/RL fine-tuning based on LeanDeep 4.0; loads markers from Supabase/ZIP and learns a policy for precise, contextualized marker application under strict bottom-up logic.
rubric-design
This skill provides guidance on designing effective evaluation rubrics. Use when creating criteria for content or code quality assessment, defining weights and thresholds for evaluation, designing check types (pattern-based vs. custom), or structuring rubrics for maintainability and reusability.
data-training-manager
Manage AI training data, monitor content freshness, detect repetition, and update training samples for continuous learning. Use when managing training data, checking content quality, updating AI models, or preventing repetitive content.
ai-safety-auditor
Audit AI systems for safety, bias, and responsible deployment
org-verification-pipeline
Produces verified datasets, verified evaluation results, and a deployable contract bundle for a workflow. Use when you need provable correctness at data and evaluation boundaries. Trigger with 'verify workflow', 'validate contract', or 'run verification pipeline'.
use-skill-create
Use when creating new skills, editing existing skills, or verifying skills work before deployment
evaluation-methodology
Methods for evaluating AI model outputs - exact match, semantic similarity, LLM-as-judge, comparative evaluation, ELO ranking. Use when measuring AI quality, building eval pipelines, or comparing models.
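As one concrete example of the comparative-evaluation methods listed here, a minimal Elo-style rating update for pairwise model comparisons; the K-factor of 32 and the 1500 starting ratings are conventional defaults, not values prescribed by the skill.

```python
# Minimal Elo update for pairwise model comparison, illustrating the
# comparative evaluation / ELO ranking method mentioned above.
# K=32 and the 1500 starting rating are conventional defaults.

def elo_update(rating_a: float, rating_b: float, a_wins: bool, k: float = 32.0):
    """Return updated (rating_a, rating_b) after one A-vs-B comparison."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    score_a = 1.0 if a_wins else 0.0
    rating_a += k * (score_a - expected_a)
    rating_b += k * ((1.0 - score_a) - (1.0 - expected_a))
    return rating_a, rating_b

print(elo_update(1500.0, 1500.0, a_wins=True))  # -> (1516.0, 1484.0)
```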
ai-system-evaluation
End-to-end AI system evaluation - model selection, benchmarks, cost/latency analysis, build vs buy decisions. Use when selecting models, designing eval pipelines, or making architecture decisions.
prompt-quality-validator
Evaluate a prompt or Agent Skill (SKILL.md) for clarity, constraints, robustness, and operational fit; output a numeric score.
user-feedback
Collecting and using user feedback - explicit/implicit signals, feedback analysis, improvement loops, A/B testing. Use when improving AI systems, understanding user satisfaction, or iterating on quality.
skill-validator
Validate skills against production-level criteria. Use when reviewing, auditing, or improving skills to ensure they meet quality standards. Evaluates structure, content quality, user interaction patterns, documentation completeness, domain standards compliance, and technical robustness. Returns actionable validation report with scores and improvement recommendations.
reward
Reward model training for RLHF pipelines. Covers RewardTrainer, preference dataset preparation, sequence classification heads, and reward scaling for stable reinforcement learning. Includes thinking quality scoring patterns.
skill-feedback
Generate improvement reports for skills or CLI packages you authored. Use when ending a session where you worked on your own skill, when the user mentions "skill-feedback", "capture improvements", "session learnings", or when friction was observed during skill/package usage.
skill-builder
Create, evaluate, and improve Agent skills to production quality (100/100). Use when the user wants to create a new skill, review an existing skill, score a skill against best practices, or improve a skill's quality. Also use when the user mentions skill development, skill templates, or skill optimization.
grading-claude-agents-md
Grades and improves CLAUDE.md (Claude Code) and AGENTS.md (Codex/OpenCode) configuration files. Use when asked to grade, score, evaluate, audit, review, improve, fix, optimize, or refactor agent config files. Triggers on 'grade my CLAUDE.md', 'score my AGENTS.md', 'is my CLAUDE.md too big', 'improve my agent config', 'fix my CLAUDE.md', 'optimize context usage', 'reduce tokens in CLAUDE.md', or 'audit my config files'. Automatically grades both files if present, generates improvement plan, and implements changes on approval.
content-evaluation-framework
This skill should be used when evaluating the quality of book chapters, lessons, or educational content. It provides a systematic 6-category rubric with weighted scoring (Technical Accuracy 30%, Pedagogical Effectiveness 25%, Writing Quality 20%, Structure & Organization 15%, AI-First Teaching 10%, Constitution Compliance Pass/Fail) and multi-tier assessment (Excellent/Good/Needs Work/Insufficient). Use this during iterative drafting, after content completion, on-demand review requests, or before validation phases.
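To illustrate how such a weighted rubric combines into an overall result, here is a minimal sketch in Python. Only the category weights and the pass/fail gate come from the description above; the sample scores and tier thresholds are hypothetical.

```python
# Minimal sketch of the weighted 6-category rubric described above.
# Category weights and the constitution pass/fail gate come from the
# skill description; sample scores and tier cut-offs are hypothetical.

WEIGHTS = {
    "technical_accuracy": 0.30,
    "pedagogical_effectiveness": 0.25,
    "writing_quality": 0.20,
    "structure_organization": 0.15,
    "ai_first_teaching": 0.10,
}

def overall_score(category_scores: dict, constitution_pass: bool) -> str:
    """Combine per-category scores (0-100) into a weighted total and tier."""
    if not constitution_pass:
        return "Insufficient (constitution compliance failed)"
    total = sum(WEIGHTS[c] * category_scores[c] for c in WEIGHTS)
    # Hypothetical tier thresholds, for illustration only.
    if total >= 90:
        tier = "Excellent"
    elif total >= 75:
        tier = "Good"
    elif total >= 60:
        tier = "Needs Work"
    else:
        tier = "Insufficient"
    return f"{total:.1f} -> {tier}"

print(overall_score(
    {"technical_accuracy": 92, "pedagogical_effectiveness": 85,
     "writing_quality": 80, "structure_organization": 88,
     "ai_first_teaching": 75},
    constitution_pass=True,
))  # -> "85.6 -> Good"
```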
mcp-tester
Test and evaluate MCP server tools in the current session. Use when auditing MCP configurations, validating tool quality, testing MCP servers, generating test cases, checking tool descriptions, or analyzing tool efficiency and redundancy.
evaluation-reporting-framework
Evaluation and reporting for code quality, performance, security, architecture, team processes, AI/LLM outputs, A/B tests, ROI analysis, and compliance. Scoring systems, benchmarking, dashboard creation, and multi-format report generation (PDF, HTML, Markdown, JSON).
evaluation-metrics
Automatically applies when evaluating LLM performance. Ensures proper eval datasets, metrics computation, A/B testing, LLM-as-judge patterns, and experiment tracking.
skill-quality-validator
A Claude skill quality checker that automatically verifies whether a skill follows official best-practice standards. Use when a newly created skill needs specification validation, a modified skill needs a quality check, a skill obtained from others needs quality assessment, or multiple skills need batch compliance checks. Applicable to: (1) quality validation after new skill development (2) compliance checks after skill updates (3) quality assessment of third-party skills (4) standardized management of a team skill library (5) final review before skill packaging.
production-eval-strategy
Strategies for evaluating agents in production - sampling, baselines, and regression detection
llm-call
External LLM invocation. Triggered ONLY by @council, @probe, @crossref, @gpt, @gemini, @grok, @qwen.
test-mcp-connector
ONLY trigger this skill when the user EXPLICITLY asks for MCP-based testing. Required triggers (all must mention "MCP" explicitly): "test connector with mcp", "test mcp connector", "test [provider] with mcp", "use mcp to test [provider]", "run mcp connector test", "mcp test for [provider]". DO NOT trigger for generic "test the connector" requests (use stackone run / test_actions instead), "test [provider]" without explicit MCP mention, regular validation or testing requests, or any testing that doesn't explicitly mention MCP. This skill builds a REAL agent with the Claude Agent SDK that sends natural language prompts to evaluate whether action descriptions are agent-friendly. It is more intensive than regular testing and should only be used when explicitly requested.
evaluation-quality
Instrument evaluation metrics, quality scores, and feedback loops
agent-certifier
Given a human certification or license (e.g. PL-300, SAP B1, Azure AI Engineer), create a production-ready agent skill profile and certification ladder, including skills.yaml entries, agent YAML, and skills documentation, using the anthropics/skills SKILL.md conventions.
agent-audit
Validates agent configurations for model selection appropriateness, tool restriction accuracy, focus area quality, and approach completeness. Use when reviewing, auditing, improving, or troubleshooting agents, checking model choice (Sonnet/Haiku/Opus), validating tool permissions, assessing focus area specificity, or ensuring approach methodology is complete. Also triggers when user asks about agent best practices, wants to optimize agent design, needs help with agent validation, or is debugging agent issues.
kpi-pr-throughput
KPI for measuring and improving PR throughput. Defines metrics, measurement methods, and improvement strategies. Use to optimize how many quality PRs get merged.
wolf-scripts-core
Core automation scripts for archetype selection, evidence validation, quality scoring, and safe bash execution
artifact-validator
Validate and grade Claude Code Skills, Commands, Subagents, and Hooks for quality and correctness. Check YAML syntax, verify naming conventions, validate required fields, test activation patterns, assess description quality. Generate quality scores using Q = 0.40R + 0.30C + 0.20S + 0.10E framework with specific improvement recommendations. Use when validating artifacts, checking quality, troubleshooting activation issues, or ensuring artifact correctness before deployment.
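To make the quality formula concrete, a short sketch of the weighted Q score follows. Only the weights (0.40/0.30/0.20/0.10) come from the description above; what R, C, S, and E stand for and the sample values are assumptions for illustration.

```python
# Sketch of the Q = 0.40R + 0.30C + 0.20S + 0.10E scoring framework.
# The weights are from the skill description; the meaning of R, C, S, E
# and the sample values below are assumptions for illustration.

def quality_score(r: float, c: float, s: float, e: float) -> float:
    """Weighted quality score on the same 0-100 scale as its inputs."""
    return 0.40 * r + 0.30 * c + 0.20 * s + 0.10 * e

# Example: an artifact scoring 90/80/70/100 on the four components.
print(quality_score(90, 80, 70, 100))  # 36 + 24 + 14 + 10 = 84.0
```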
agentv-eval-builder
Create and maintain AgentV YAML evaluation files for testing AI agent performance. Use this skill when creating new eval files, adding eval cases, or configuring custom evaluators (code validators or LLM judges) for agent testing workflows.
ai-output-validator
A skill that automatically validates the quality of AI output. Performs fact checking, logical soundness, consistency, hallucination detection, bias analysis, and safety checks, and provides improvement suggestions.
eval-framework
Framework for capturing, storing, and comparing AI evaluations to measure consistency and completeness. Use when: comparing reviews, measuring evaluation quality, running reproducibility tests, auditing AI outputs, validating findings across runs. Triggers: "compare evaluations", "measure consistency", "evaluation framework", "reproducible review", "compare reviews", "validate findings", "audit evaluation".
decision-critic
Invoke IMMEDIATELY via Python script to stress-test decisions and reasoning. Do NOT analyze first - the script orchestrates the critique workflow.