📏 Evaluation and Benchmarks Skills
Browse skills in the Evaluation and Benchmarks category.
Skill Judge
A powerful skill for Claude agents.
fpf-skillmetric-evaluate-cslc
Evaluates a value against an FPF CSLC (Characteristic/Scale/Level/Coordinate) definition (A.18).
confidence-evaluator
Evaluate requirement clarity and completeness using ISO/IEC/IEEE 29148:2018 criteria. Use when user asks to implement features, fix bugs, or make changes. Automatically invoked when confidence_policy is enabled in ai-settings.json.
fpf-skillmetric-score-usability
Calculates the SkillUsabilityScore (U.Metric) for Zero-Shot Enactment.
fpf-skillhello-world
Minimal reference skill used to validate parsing and loading.
fpf-skillplanning-initialize-baseline
Creates an initial SlotFillingsPlanItem (A.15.3) baseline.
validation-test
A test skill to validate that SessionStart hooks can create symlinks before skill discovery. If you can see this skill, the hook timing works correctly.
edu-demo-evaluator-free
Watch an educational demo like a learner (BLIND evaluation). No test cases. No benchmark. No rubric. Honest assessment of: impression, what works, what doesn't, learner impact, recommendation. Output: agent_X_free_eval.json
improve-skill
Analyze Claude Code session transcripts to improve existing skills or create new ones. Use when you want to review a past session to identify what worked, what didn't, and how to enhance skill documentation. Extracts session data and provides structured analysis prompts. Triggers on "improve skill", "analyze session", "review session", "skill improvement", "create skill from session", "skill not working", "skill missed", "skill didn't trigger", "enhance skill", "refine skill", "skill feedback", "session transcript", "what went wrong", "skill optimization", "better triggers".
prompt-iteration
Use when iteratively improving agent prompts through automated LLM-as-Judge evaluation. Runs eval→fix→commit loop with circuit breakers.
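The eval→fix→commit loop with circuit breakers can be pictured roughly as in the sketch below; the helper callables (run_eval, apply_fix, commit), the iteration cap, and the score threshold are hypothetical placeholders, not the skill's actual interface.

```python
# Rough sketch of an eval -> fix -> commit loop with circuit breakers.
# run_eval, apply_fix, and commit are hypothetical placeholders supplied
# by the caller; the iteration cap and target score are illustrative.

MAX_ITERATIONS = 5    # circuit breaker: hard cap on loop iterations
TARGET_SCORE = 0.90   # circuit breaker: stop once the judge is satisfied

def iterate_prompt(prompt: str, run_eval, apply_fix, commit) -> str:
    """Iteratively improve a prompt until a breaker trips or the target is hit."""
    best_score = run_eval(prompt)
    for _ in range(MAX_ITERATIONS):
        if best_score >= TARGET_SCORE:
            break  # good enough: stop iterating
        candidate = apply_fix(prompt, best_score)
        score = run_eval(candidate)
        if score <= best_score:
            break  # no improvement: trip the breaker instead of looping forever
        prompt, best_score = candidate, score
        commit(prompt, score)
    return prompt
```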
nixtla-benchmark-reporter
Generate comprehensive markdown benchmark reports from forecast accuracy metrics with model comparisons, statistical analysis, and regression detection. Use when analyzing baseline performance, comparing forecast models, or validating model quality. Trigger with 'generate benchmark report', 'analyze forecast metrics', or 'create performance summary'.
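Regression detection of the kind described here usually reduces to comparing a candidate's error metric against a baseline with a tolerance. A minimal sketch, assuming MAE as the metric; the model names, values, and 5% tolerance are chosen purely for illustration.

```python
# Sketch of regression detection on forecast accuracy metrics: flag any
# model whose MAE worsened by more than a tolerance versus the baseline.
# The metric values and 5% tolerance are illustrative assumptions.

TOLERANCE = 0.05  # allow up to 5% relative degradation before flagging

def detect_regressions(baseline_mae: dict, candidate_mae: dict) -> list:
    """Return the models whose MAE regressed beyond the tolerance."""
    regressed = []
    for model, base in baseline_mae.items():
        cand = candidate_mae.get(model)
        if cand is not None and cand > base * (1 + TOLERANCE):
            regressed.append(model)
    return regressed

print(detect_regressions(
    {"auto_arima": 12.4, "timegpt": 9.8},
    {"auto_arima": 12.5, "timegpt": 11.0},
))  # -> ['timegpt']  (9.8 * 1.05 = 10.29 < 11.0)
```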
growth-learning
Analytics, feedback processing, and continuous improvement
eval-patterns
This skill provides common evaluation patterns and integration guidance. Use when integrating eval-framework with other plugins, designing evaluation workflows, choosing between content vs. behavior evaluation, or setting up project-local rubrics.
nixtla-universal-validator
Validate Nixtla skills and plugins with deterministic evidence bundles and strict schema gates. Use when auditing changes or enforcing compliance. Trigger with 'run validation' or 'audit validators'.
org-universal-validator
Validate skills and plugins with deterministic evidence bundles and strict schema gates. Use when auditing changes or enforcing compliance. Trigger with 'run validation' or 'audit validators'.
edu-demo-evaluator
Evaluate educational demos using Chrome tools for E2E testing. Executes test cases from test_cases.json, captures screenshots, verifies learning outcomes. Scores QUALITY relative to the benchmark. Uses real browser interaction via mcp__claude-in-chrome__* tools.
model-evaluator
Evaluate and compare ML model performance with rigorous testing methodologies
marker-engine-rl
Extends the Marker Engine skill with SFT/RL fine-tuning based on LeanDeep 4.0; loads markers from Supabase/ZIP and learns a policy for precise, contextualized marker application under strict bottom-up logic.
rubric-design
This skill provides guidance on designing effective evaluation rubrics. Use when creating criteria for content or code quality assessment, defining weights and thresholds for evaluation, designing check types (pattern-based vs. custom), or structuring rubrics for maintainability and reusability.
data-training-manager
Manage AI training data, monitor content freshness, detect repetition, and update training samples for continuous learning. Use when managing training data, checking content quality, updating AI models, or preventing repetitive content.
ai-safety-auditor
Audit AI systems for safety, bias, and responsible deployment
org-verification-pipeline
Produces verified datasets, verified evaluation results, and a deployable contract bundle for a workflow. Use when you need provable correctness at data and evaluation boundaries. Trigger with 'verify workflow', 'validate contract', or 'run verification pipeline'.
use-skill-create
Use when creating new skills, editing existing skills, or verifying skills work before deployment
evaluation-methodology
Methods for evaluating AI model outputs - exact match, semantic similarity, LLM-as-judge, comparative evaluation, ELO ranking. Use when measuring AI quality, building eval pipelines, or comparing models.
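As one concrete example of the comparative-evaluation methods listed here, a minimal Elo-style rating update for pairwise model comparisons; the K-factor of 32 and the 1500 starting ratings are conventional defaults, not values prescribed by the skill.

```python
# Minimal Elo update for pairwise model comparison, illustrating the
# comparative evaluation / ELO ranking method mentioned above.
# K=32 and the 1500 starting rating are conventional defaults.

def elo_update(rating_a: float, rating_b: float, a_wins: bool, k: float = 32.0):
    """Return updated (rating_a, rating_b) after one A-vs-B comparison."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    score_a = 1.0 if a_wins else 0.0
    rating_a += k * (score_a - expected_a)
    rating_b += k * ((1.0 - score_a) - (1.0 - expected_a))
    return rating_a, rating_b

print(elo_update(1500.0, 1500.0, a_wins=True))  # -> (1516.0, 1484.0)
```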
ai-system-evaluation
End-to-end AI system evaluation - model selection, benchmarks, cost/latency analysis, build vs buy decisions. Use when selecting models, designing eval pipelines, or making architecture decisions.
prompt-quality-validator
Evaluate a prompt or Agent Skill (SKILL.md) for clarity, constraints, robustness, and operational fit; output a numeric score.
user-feedback
Collecting and using user feedback - explicit/implicit signals, feedback analysis, improvement loops, A/B testing. Use when improving AI systems, understanding user satisfaction, or iterating on quality.
skill-validator
Validate skills against production-level criteria. Use when reviewing, auditing, or improving skills to ensure they meet quality standards. Evaluates structure, content quality, user interaction patterns, documentation completeness, domain standards compliance, and technical robustness. Returns actionable validation report with scores and improvement recommendations.
reward
Reward model training for RLHF pipelines. Covers RewardTrainer, preference dataset preparation, sequence classification heads, and reward scaling for stable reinforcement learning. Includes thinking quality scoring patterns.
skill-feedback
Generate improvement reports for skills or CLI packages you authored. Use when ending a session where you worked on your own skill, when the user mentions "skill-feedback", "capture improvements", "session learnings", or when friction was observed during skill/package usage.
skill-builder
Create, evaluate, and improve Agent skills to production quality (100/100). Use when the user wants to create a new skill, review an existing skill, score a skill against best practices, or improve a skill's quality. Also use when the user mentions skill development, skill templates, or skill optimization.
grading-claude-agents-md
Grades and improves CLAUDE.md (Claude Code) and AGENTS.md (Codex/OpenCode) configuration files. Use when asked to grade, score, evaluate, audit, review, improve, fix, optimize, or refactor agent config files. Triggers on 'grade my CLAUDE.md', 'score my AGENTS.md', 'is my CLAUDE.md too big', 'improve my agent config', 'fix my CLAUDE.md', 'optimize context usage', 'reduce tokens in CLAUDE.md', or 'audit my config files'. Automatically grades both files if present, generates improvement plan, and implements changes on approval.
content-evaluation-framework
This skill should be used when evaluating the quality of book chapters, lessons, or educational content. It provides a systematic 6-category rubric with weighted scoring (Technical Accuracy 30%, Pedagogical Effectiveness 25%, Writing Quality 20%, Structure & Organization 15%, AI-First Teaching 10%, Constitution Compliance Pass/Fail) and multi-tier assessment (Excellent/Good/Needs Work/Insufficient). Use this during iterative drafting, after content completion, on-demand review requests, or before validation phases.
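To illustrate how such a weighted rubric combines into an overall result, here is a minimal sketch in Python. Only the category weights and the pass/fail gate come from the description above; the sample scores and tier thresholds are hypothetical.

```python
# Minimal sketch of the weighted 6-category rubric described above.
# Category weights and the constitution pass/fail gate come from the
# skill description; sample scores and tier cut-offs are hypothetical.

WEIGHTS = {
    "technical_accuracy": 0.30,
    "pedagogical_effectiveness": 0.25,
    "writing_quality": 0.20,
    "structure_organization": 0.15,
    "ai_first_teaching": 0.10,
}

def overall_score(category_scores: dict, constitution_pass: bool) -> str:
    """Combine per-category scores (0-100) into a weighted total and tier."""
    if not constitution_pass:
        return "Insufficient (constitution compliance failed)"
    total = sum(WEIGHTS[c] * category_scores[c] for c in WEIGHTS)
    # Hypothetical tier thresholds, for illustration only.
    if total >= 90:
        tier = "Excellent"
    elif total >= 75:
        tier = "Good"
    elif total >= 60:
        tier = "Needs Work"
    else:
        tier = "Insufficient"
    return f"{total:.1f} -> {tier}"

print(overall_score(
    {"technical_accuracy": 92, "pedagogical_effectiveness": 85,
     "writing_quality": 80, "structure_organization": 88,
     "ai_first_teaching": 75},
    constitution_pass=True,
))  # -> "85.6 -> Good"
```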
mcp-tester
Test and evaluate MCP server tools in the current session. Use when auditing MCP configurations, validating tool quality, testing MCP servers, generating test cases, checking tool descriptions, or analyzing tool efficiency and redundancy.
evaluation-reporting-framework
Evaluation and reporting for code quality, performance, security, architecture, team processes, AI/LLM outputs, A/B tests, ROI analysis, and compliance. Scoring systems, benchmarking, dashboard creation, and multi-format report generation (PDF, HTML, Markdown, JSON).
evaluation-metrics
Automatically applies when evaluating LLM performance. Ensures proper eval datasets, metrics computation, A/B testing, LLM-as-judge patterns, and experiment tracking.
skill-quality-validator
A Claude skill quality checker that automatically verifies whether a skill follows official best-practice standards. Use when a newly created skill needs specification validation, a modified skill needs a quality check, a skill obtained from others needs quality assessment, or multiple skills need batch compliance checks. Applicable to: (1) quality validation after new skill development (2) compliance checks after skill updates (3) quality assessment of third-party skills (4) standardized management of a team skill library (5) final review before skill packaging.
production-eval-strategy
Strategies for evaluating agents in production - sampling, baselines, and regression detection
llm-call
External LLM invocation. Triggered ONLY by @council, @probe, @crossref, @gpt, @gemini, @grok, @qwen.
test-mcp-connector
ONLY trigger this skill when the user EXPLICITLY asks for MCP-based testing. Required triggers (all must mention "MCP" explicitly): "test connector with mcp", "test mcp connector", "test [provider] with mcp", "use mcp to test [provider]", "run mcp connector test", "mcp test for [provider]". DO NOT trigger for generic "test the connector" requests (use stackone run / test_actions instead), "test [provider]" without explicit MCP mention, regular validation or testing requests, or any testing that doesn't explicitly mention MCP. This skill builds a REAL agent with the Claude Agent SDK that sends natural language prompts to evaluate whether action descriptions are agent-friendly. It is more intensive than regular testing and should only be used when explicitly requested.
evaluation-quality
Instrument evaluation metrics, quality scores, and feedback loops
agent-certifier
Given a human certification or license (e.g. PL-300, SAP B1, Azure AI Engineer), create a production-ready agent skill profile and certification ladder, including skills.yaml entries, agent YAML, and skills documentation, using the anthropics/skills SKILL.md conventions.
agent-audit
Validates agent configurations for model selection appropriateness, tool restriction accuracy, focus area quality, and approach completeness. Use when reviewing, auditing, improving, or troubleshooting agents, checking model choice (Sonnet/Haiku/Opus), validating tool permissions, assessing focus area specificity, or ensuring approach methodology is complete. Also triggers when user asks about agent best practices, wants to optimize agent design, needs help with agent validation, or is debugging agent issues.
kpi-pr-throughput
KPI for measuring and improving PR throughput. Defines metrics, measurement methods, and improvement strategies. Use to optimize how many quality PRs get merged.
wolf-scripts-core
Core automation scripts for archetype selection, evidence validation, quality scoring, and safe bash execution
artifact-validator
Validate and grade Claude Code Skills, Commands, Subagents, and Hooks for quality and correctness. Check YAML syntax, verify naming conventions, validate required fields, test activation patterns, assess description quality. Generate quality scores using Q = 0.40R + 0.30C + 0.20S + 0.10E framework with specific improvement recommendations. Use when validating artifacts, checking quality, troubleshooting activation issues, or ensuring artifact correctness before deployment.
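To make the quality formula concrete, a short sketch of the weighted Q score follows. Only the weights (0.40/0.30/0.20/0.10) come from the description above; what R, C, S, and E stand for and the sample values are assumptions for illustration.

```python
# Sketch of the Q = 0.40R + 0.30C + 0.20S + 0.10E scoring framework.
# The weights are from the skill description; the meaning of R, C, S, E
# and the sample values below are assumptions for illustration.

def quality_score(r: float, c: float, s: float, e: float) -> float:
    """Weighted quality score on the same 0-100 scale as its inputs."""
    return 0.40 * r + 0.30 * c + 0.20 * s + 0.10 * e

# Example: an artifact scoring 90/80/70/100 on the four components.
print(quality_score(90, 80, 70, 100))  # 36 + 24 + 14 + 10 = 84.0
```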
agentv-eval-builder
Create and maintain AgentV YAML evaluation files for testing AI agent performance. Use this skill when creating new eval files, adding eval cases, or configuring custom evaluators (code validators or LLM judges) for agent testing workflows.
ai-output-validator
A skill that automatically validates the quality of AI output. Performs fact checking, logical soundness, consistency, hallucination detection, bias analysis, and safety checks, and provides improvement suggestions.
eval-framework
Framework for capturing, storing, and comparing AI evaluations to measure consistency and completeness. Use when: comparing reviews, measuring evaluation quality, running reproducibility tests, auditing AI outputs, validating findings across runs. Triggers: "compare evaluations", "measure consistency", "evaluation framework", "reproducible review", "compare reviews", "validate findings", "audit evaluation".
decision-critic
Invoke IMMEDIATELY via Python script to stress-test decisions and reasoning. Do NOT analyze first - the script orchestrates the critique workflow.