📏 Evaluation and Benchmarks Skills

Browse skills in the Evaluation and Benchmarks category.

Skill Judge

from softaworks

A powerful skill for Claude agents.

[Evaluation and Benchmarks]

fpf-skillmetric-evaluate-cslc

from venikman

Evaluates a value against an FPF CSLC (Characteristic/Scale/Level/Coordinate) definition (A.18).

[Evaluation and Benchmarks]

confidence-evaluator

from davjdk

Evaluate requirement clarity and completeness using ISO/IEC/IEEE 29148:2018 criteria. Use when user asks to implement features, fix bugs, or make changes. Automatically invoked when confidence_policy is enabled in ai-settings.json.

[Evaluation and Benchmarks]

fpf-skillmetric-score-usability

from venikman

Calculates the SkillUsabilityScore (U.Metric) for Zero-Shot Enactment.

[Evaluation and Benchmarks]

fpf-skillhello-world

from venikman

Minimal reference skill used to validate parsing and loading.

[Evaluation and Benchmarks]

fpf-skillplanning-initialize-baseline

from venikman

Creates an initial SlotFillingsPlanItem (A.15.3) baseline.

[Evaluation and Benchmarks]

validation-test

from nevir

A test skill to validate that SessionStart hooks can create symlinks before skill discovery. If you can see this skill, the hook timing works correctly.

[Evaluation and Benchmarks]

edu-demo-evaluator-free

from hanialshater

Watch an educational demo like a learner (BLIND evaluation). No test cases. No benchmark. No rubric. Honest assessment of: impression, what works, what doesn't, learner impact, recommendation. Output: agent_X_free_eval.json

[Evaluation and Benchmarks]

improve-skill

from mauromedda

Analyze Claude Code session transcripts to improve existing skills or create new ones. Use when you want to review a past session to identify what worked, what didn't, and how to enhance skill documentation. Extracts session data and provides structured analysis prompts. Triggers on "improve skill", "analyze session", "review session", "skill improvement", "create skill from session", "skill not working", "skill missed", "skill didn't trigger", "enhance skill", "refine skill", "skill feedback", "session transcript", "what went wrong", "skill optimization", "better triggers".

[Evaluation and Benchmarks]

prompt-iteration

from darthmolen

Use when iteratively improving agent prompts through automated LLM-as-Judge evaluation. Runs eval→fix→commit loop with circuit breakers.

[Evaluation and Benchmarks]
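
As a rough illustration of the eval→fix→commit pattern described above, here is a minimal, hypothetical sketch in Python. The callables and default thresholds are assumptions for illustration only, not the prompt-iteration skill's actual interface.

```python
from typing import Callable

def iterate_prompt(
    prompt: str,
    run_eval: Callable[[str], float],             # LLM-as-judge scorer returning a value in [0, 1]
    propose_fix: Callable[[str, float], str],     # asks a model for a revised prompt
    commit_change: Callable[[str, float], None],  # persists an accepted revision (e.g. a git commit)
    max_iterations: int = 5,                      # circuit breaker: hard cap on iterations
    target_score: float = 0.90,                   # stop once the judge score clears this threshold
    max_regressions: int = 2,                     # circuit breaker: abort after repeated non-improvements
) -> str:
    best_score = run_eval(prompt)
    regressions = 0
    for _ in range(max_iterations):
        if best_score >= target_score:
            break
        candidate = propose_fix(prompt, best_score)
        score = run_eval(candidate)
        if score > best_score:
            prompt, best_score = candidate, score
            commit_change(prompt, score)
            regressions = 0
        else:
            regressions += 1
            if regressions >= max_regressions:
                break
    return prompt
```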

nixtla-benchmark-reporter

from intent-solutions-io

Generate comprehensive markdown benchmark reports from forecast accuracy metrics with model comparisons, statistical analysis, and regression detection. Use when analyzing baseline performance, comparing forecast models, or validating model quality. Trigger with 'generate benchmark report', 'analyze forecast metrics', or 'create performance summary'.

[Evaluation and Benchmarks]

growth-learning

from randysalars

Analytics, feedback processing, and continuous improvement

[Evaluation and Benchmarks]

eval-patterns

from therealchrisrock

This skill provides common evaluation patterns and integration guidance. Use when integrating eval-framework with other plugins, designing evaluation workflows, choosing between content vs. behavior evaluation, or setting up project-local rubrics.

[Evaluation and Benchmarks]

nixtla-universal-validator

from intent-solutions-io

Validate Nixtla skills and plugins with deterministic evidence bundles and strict schema gates. Use when auditing changes or enforcing compliance. Trigger with 'run validation' or 'audit validators'.

[Evaluation and Benchmarks]

org-universal-validator

from intent-solutions-io

Validate skills and plugins with deterministic evidence bundles and strict schema gates. Use when auditing changes or enforcing compliance. Trigger with 'run validation' or 'audit validators'.

[Evaluation and Benchmarks]

edu-demo-evaluator

from hanialshater

Evaluate educational demos using Chrome tools for E2E testing. Executes test cases from test_cases.json, captures screenshots, and verifies learning outcomes. Scores quality relative to a benchmark. Uses real browser interaction via mcp__claude-in-chrome__* tools.

[Evaluation and Benchmarks]

model-evaluator

from eddiebe147

Evaluate and compare ML model performance with rigorous testing methodologies

[Evaluation and Benchmarks]

marker-engine-rl

from DYAI2025

Extends the Marker-Engine skill with SFT/RL fine-tuning using LeanDeep 4.0; loads markers from Supabase/ZIP and learns a policy for precise, contextualized marker application under strict bottom-up logic.

[Evaluation and Benchmarks]

rubric-design

from therealchrisrock

This skill provides guidance on designing effective evaluation rubrics. Use when creating criteria for content or code quality assessment, defining weights and thresholds for evaluation, designing check types (pattern-based vs. custom), or structuring rubrics for maintainability and reusability.

[Evaluation and Benchmarks]

data-training-manager

from codatta

Manage AI training data, monitor content freshness, detect repetition, and update training samples for continuous learning. Use when managing training data, checking content quality, updating AI models, or preventing repetitive content.

[Evaluation and Benchmarks]

ai-safety-auditor

from eddiebe147

Audit AI systems for safety, bias, and responsible deployment

[Evaluation and Benchmarks]

org-verification-pipeline

from intent-solutions-io

Produces verified datasets, verified evaluation results, and a deployable contract bundle for a workflow. Use when you need provable correctness at data and evaluation boundaries. Trigger with 'verify workflow', 'validate contract', or 'run verification pipeline'.

[Evaluation and Benchmarks]

use-skill-create

from mtthsnc

Use when creating new skills, editing existing skills, or verifying skills work before deployment

[Evaluation and Benchmarks]

evaluation-methodology

from doanchienthangdev

Methods for evaluating AI model outputs - exact match, semantic similarity, LLM-as-judge, comparative evaluation, ELO ranking. Use when measuring AI quality, building eval pipelines, or comparing models.

[Evaluation and Benchmarks]
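
Two of the methods listed above lend themselves to short sketches: exact match and Elo ranking. The code below is an illustrative example under generic assumptions (normalization choice, K-factor), not part of the evaluation-methodology skill itself.

```python
def exact_match(prediction: str, reference: str) -> bool:
    """Strict comparison after trivial normalization (assumed: trim + lowercase)."""
    return prediction.strip().lower() == reference.strip().lower()

def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0) -> tuple[float, float]:
    """Update two model ratings after one pairwise comparison.

    score_a is 1.0 if model A wins, 0.0 if it loses, 0.5 for a tie.
    """
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b

# Example: two models start at 1500 and model A wins one head-to-head judgment.
print(elo_update(1500.0, 1500.0, 1.0))  # -> (1516.0, 1484.0)
```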

ai-system-evaluation

from doanchienthangdev

End-to-end AI system evaluation - model selection, benchmarks, cost/latency analysis, build vs buy decisions. Use when selecting models, designing eval pipelines, or making architecture decisions.

[Evaluation and Benchmarks]

prompt-quality-validator

from isymchych

Evaluate a prompt or Agent Skill (SKILL.md) for clarity, constraints, robustness, and operational fit; output a numeric score.

[Evaluation and Benchmarks]

user-feedback

from doanchienthangdev

Collecting and using user feedback - explicit/implicit signals, feedback analysis, improvement loops, A/B testing. Use when improving AI systems, understanding user satisfaction, or iterating on quality.

[Evaluation and Benchmarks]

skill-validator

from salmanparacha

Validate skills against production-level criteria. Use when reviewing, auditing, or improving skills to ensure they meet quality standards. Evaluates structure, content quality, user interaction patterns, documentation completeness, domain standards compliance, and technical robustness. Returns actionable validation report with scores and improvement recommendations.

[Evaluation and Benchmarks]

reward

from atrawog

Reward model training for RLHF pipelines. Covers RewardTrainer, preference dataset preparation, sequence classification heads, and reward scaling for stable reinforcement learning. Includes thinking quality scoring patterns.

[Evaluation and Benchmarks]

skill-feedback

from dparedesi

Generate improvement reports for skills or CLI packages you authored. Use when ending a session where you worked on your own skill, when the user mentions "skill-feedback", "capture improvements", "session learnings", or when friction was observed during skill/package usage.

[Evaluation and Benchmarks]

skill-builder

from dparedesi

Create, evaluate, and improve Agent skills to production quality (100/100). Use when the user wants to create a new skill, review an existing skill, score a skill against best practices, or improve a skill's quality. Also use when the user mentions skill development, skill templates, or skill optimization.

[Evaluation and Benchmarks]

grading-claude-agents-md

from SpillwaveSolutions

Grades and improves CLAUDE.md (Claude Code) and AGENTS.md (Codex/OpenCode) configuration files. Use when asked to grade, score, evaluate, audit, review, improve, fix, optimize, or refactor agent config files. Triggers on 'grade my CLAUDE.md', 'score my AGENTS.md', 'is my CLAUDE.md too big', 'improve my agent config', 'fix my CLAUDE.md', 'optimize context usage', 'reduce tokens in CLAUDE.md', or 'audit my config files'. Automatically grades both files if present, generates improvement plan, and implements changes on approval.

[Evaluation and Benchmarks]

content-evaluation-framework

from majiayu000

This skill should be used when evaluating the quality of book chapters, lessons, or educational content. It provides a systematic 6-category rubric with weighted scoring (Technical Accuracy 30%, Pedagogical Effectiveness 25%, Writing Quality 20%, Structure & Organization 15%, AI-First Teaching 10%, Constitution Compliance Pass/Fail) and multi-tier assessment (Excellent/Good/Needs Work/Insufficient). Use this during iterative drafting, after content completion, on-demand review requests, or before validation phases.

[Evaluation and Benchmarks]
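
As a sketch only: the weighted rubric above can be computed as a simple weighted sum with a pass/fail gate. The category weights and the constitution gate come from the listing; the 0-100 sub-scores, tier cutoffs, and function name are assumptions for illustration.

```python
# Weights taken from the listing; everything else here is illustrative.
WEIGHTS = {
    "technical_accuracy": 0.30,
    "pedagogical_effectiveness": 0.25,
    "writing_quality": 0.20,
    "structure_organization": 0.15,
    "ai_first_teaching": 0.10,
}

def score_chapter(scores: dict[str, float], constitution_pass: bool) -> tuple[float, str]:
    if not constitution_pass:                 # pass/fail gate overrides the weighted score
        return 0.0, "Insufficient"
    total = sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)
    if total >= 90:                           # tier cutoffs are assumed, not documented
        tier = "Excellent"
    elif total >= 75:
        tier = "Good"
    elif total >= 60:
        tier = "Needs Work"
    else:
        tier = "Insufficient"
    return total, tier

# Example: strong technical content, weaker pedagogy.
print(score_chapter(
    {"technical_accuracy": 95, "pedagogical_effectiveness": 70,
     "writing_quality": 85, "structure_organization": 80, "ai_first_teaching": 75},
    constitution_pass=True,
))  # -> (82.5, 'Good')
```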

mcp-tester

from ckorhonen

Test and evaluate MCP server tools in the current session. Use when auditing MCP configurations, validating tool quality, testing MCP servers, generating test cases, checking tool descriptions, or analyzing tool efficiency and redundancy.

[Evaluation and Benchmarks]

evaluation-reporting-framework

from majiayu000

Evaluation and reporting for code quality, performance, security, architecture, team processes, AI/LLM outputs, A/B tests, ROI analysis, and compliance. Scoring systems, benchmarking, dashboard creation, and multi-format report generation (PDF, HTML, Markdown, JSON).

[Evaluation and Benchmarks]

evaluation-metrics

from majiayu000

Automatically applies when evaluating LLM performance. Ensures proper eval datasets, metrics computation, A/B testing, LLM-as-judge patterns, and experiment tracking.

[Evaluation and Benchmarks]

skill-quality-validator

from majiayu000

A Claude skill quality checker that automatically verifies whether skills comply with the official best-practice standards. Use when a newly created skill needs validation against the spec, a modified skill needs a quality check, a skill obtained from others needs a quality assessment, or multiple skills need batch compliance checks. Applies to: (1) quality validation after developing a new skill, (2) compliance checks after skill updates, (3) quality assessment of third-party skills, (4) standardized management of a team skill library, (5) final review before packaging a skill.

[Evaluation and Benchmarks]

production-eval-strategy

from majiayu000

Strategies for evaluating agents in production - sampling, baselines, and regression detection

[Evaluation and Benchmarks]

llm-call

from majiayu000

External LLM invocation. Triggered ONLY by @council, @probe, @crossref, @gpt, @gemini, @grok, @qwen.

[Evaluation and Benchmarks]

test-mcp-connector

from majiayu000

ONLY trigger this skill when the user EXPLICITLY asks for MCP-based testing. Required triggers (all must mention "MCP" explicitly): "test connector with mcp", "test mcp connector", "test [provider] with mcp", "use mcp to test [provider]", "run mcp connector test", "mcp test for [provider]". DO NOT trigger for generic "test the connector" requests (use stackone run / test_actions instead), "test [provider]" without explicit MCP mention, regular validation or testing requests, or any testing that doesn't explicitly mention MCP. This skill builds a REAL agent with the Claude Agent SDK that sends natural language prompts to evaluate if action descriptions are agent-friendly. It's more intensive than regular testing and should only be used when explicitly requested.

[Evaluation and Benchmarks]

evaluation-quality

from majiayu000

Instrument evaluation metrics, quality scores, and feedback loops

[Evaluation and Benchmarks]

agent-certifier

from majiayu000

Given a human certification or license (e.g. PL-300, SAP B1, Azure AI Engineer), create a production-ready agent skill profile and certification ladder, including skills.yaml entries, agent YAML, and skills documentation, using the anthropics/skills SKILL.md conventions.

[Evaluation and Benchmarks]

agent-audit

from majiayu000

Validates agent configurations for model selection appropriateness, tool restriction accuracy, focus area quality, and approach completeness. Use when reviewing, auditing, improving, or troubleshooting agents, checking model choice (Sonnet/Haiku/Opus), validating tool permissions, assessing focus area specificity, or ensuring approach methodology is complete. Also triggers when user asks about agent best practices, wants to optimize agent design, needs help with agent validation, or is debugging agent issues.

[Evaluation and Benchmarks]

kpi-pr-throughput

from majiayu000

KPI for measuring and improving PR throughput. Defines metrics, measurement methods, and improvement strategies. Use to optimize how many quality PRs get merged.

[Evaluation and Benchmarks]

wolf-scripts-core

from majiayu000

Core automation scripts for archetype selection, evidence validation, quality scoring, and safe bash execution

[Evaluation and Benchmarks]

artifact-validator

from majiayu000

Validate and grade Claude Code Skills, Commands, Subagents, and Hooks for quality and correctness. Check YAML syntax, verify naming conventions, validate required fields, test activation patterns, assess description quality. Generate quality scores using Q = 0.40R + 0.30C + 0.20S + 0.10E framework with specific improvement recommendations. Use when validating artifacts, checking quality, troubleshooting activation issues, or ensuring artifact correctness before deployment.

[Evaluation and Benchmarks]
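
The Q = 0.40R + 0.30C + 0.20S + 0.10E framework mentioned above is a plain weighted sum. Only the weights come from the listing; the component scale (0-100) and names in this worked example are assumptions.

```python
def quality_score(r: float, c: float, s: float, e: float) -> float:
    """Weighted artifact quality score; R, C, S, E are component scores (assumed 0-100)."""
    return 0.40 * r + 0.30 * c + 0.20 * s + 0.10 * e

# Worked example: 0.40*90 + 0.30*80 + 0.20*70 + 0.10*60 = 36 + 24 + 14 + 6 = 80.0
print(quality_score(90, 80, 70, 60))  # -> 80.0
```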

agentv-eval-builder

from majiayu000

Create and maintain AgentV YAML evaluation files for testing AI agent performance. Use this skill when creating new eval files, adding eval cases, or configuring custom evaluators (code validators or LLM judges) for agent testing workflows.

[Evaluation and Benchmarks]

ai-output-validator

from majiayu000

A skill that automatically validates the quality of AI output. Performs fact-checking, logic and consistency checks, hallucination detection, bias analysis, and safety checks, and provides improvement suggestions.

[Evaluation and Benchmarks]

eval-framework

from majiayu000

Framework for capturing, storing, and comparing AI evaluations to measure consistency and completeness. Use when: comparing reviews, measuring evaluation quality, running reproducibility tests, auditing AI outputs, validating findings across runs. Triggers: "compare evaluations", "measure consistency", "evaluation framework", "reproducible review", "compare reviews", "validate findings", "audit evaluation".

[Evaluation and Benchmarks]

decision-critic

from timmye

Invoke IMMEDIATELY via python script to stress-test decisions and reasoning. Do NOT analyze first - the script orchestrates the critique workflow.

[Evaluation and Benchmarks]