# experiment-analysis
Analyze GRPO training runs for learning dynamics and pipeline performance. Use when diagnosing training issues, reviewing Elo progression, checking throughput, or updating experiment results.
## When & Why to Use This Skill
This Claude skill streamlines monitoring and diagnosis of GRPO (Group Relative Policy Optimization) training runs by synthesizing WandB metrics and Axiom logs. It gives developers automated insight into learning dynamics, Elo progression against fixed references, and pipeline throughput, so training instabilities can be spotted and resolved quickly.
## Use Cases
- Diagnosing training stagnation or instability by analyzing Elo trajectories, KL divergence, and gradient norms to ensure the model is effectively learning.
- Monitoring real-time pipeline performance, including rollout timing, inference engine throughput, and error rates via Axiom log integration.
- Comparing experiment sweeps to identify the most effective hyperparameters and configurations for reinforcement learning tasks.
- Evaluating model quality using fixed reference benchmarks to calculate the Elo gap and track genuine performance improvements over time.
- Tracking system health and extraction rates to ensure data processing and reward mechanisms are functioning correctly during long-running training jobs.
# Experiment Analysis
Diagnose GRPO training runs using WandB metrics and Axiom logs.
## Quick Reference
| Question | Command |
|---|---|
| Full Elo analysis | `uv run python .claude/skills/experiment-analysis/analyze_elo.py <run>` |
| Compare sweep runs | `uv run python .claude/skills/experiment-analysis/analyze_sweep.py --sweep <prefix>` |
| Is the model learning? | `uv run python scripts/wandb_cli.py get-metrics -r <run> --all-metrics` |
| Rollout throughput? | `uv run python scripts/axiom_cli.py rollout-timing --last 6h` |
| Any errors? | `uv run python scripts/axiom_cli.py errors --last 1h` |
| Extraction rate? | `uv run python scripts/axiom_cli.py extraction-stats --last 24h` |
| System health? | `uv run python scripts/axiom_cli.py health --last 1h` |
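For scripted checks, the same commands can be chained from Python. A minimal sketch, assuming the CLI flags in the table above and that each script exits non-zero on failure:

```python
import subprocess

# Wrapper around one documented CLI command; captures stdout as text.
def run_check(args: list[str]) -> str:
    result = subprocess.run(
        ["uv", "run", "python", *args],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

# Chain the error and throughput checks from the Quick Reference.
errors = run_check(["scripts/axiom_cli.py", "errors", "--last", "1h"])
timing = run_check(["scripts/axiom_cli.py", "rollout-timing", "--last", "6h"])
print(errors)
print(timing)
```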
## Tools Overview
### WandB CLI (`scripts/wandb_cli.py`)
Training metrics and Elo ratings. Use for:
- Elo trajectory analysis (learning signal)
- Reward/loss curves
- KL divergence and grad norm
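If the CLI does not expose a metric you need, the underlying run history can be pulled with the wandb Python API directly. A minimal sketch; the run path and the metric key names (`train/kl`, `train/grad_norm`) are assumptions and should be matched to whatever the run actually logs:

```python
import wandb

api = wandb.Api()
# Hypothetical run path: replace with your <entity>/<project>/<run_id>.
run = api.run("my-entity/grpo-training/run_id")

# Pull only the learning-signal columns; key names are assumed, not confirmed.
history = run.history(keys=["train/kl", "train/grad_norm"])

# Flag the bad signs from the Key Metrics table: KL spikes >0.2, grad norm >50.
kl_spikes = history[history["train/kl"] > 0.2]
big_grads = history[history["train/grad_norm"] > 50]
print(f"KL spikes: {len(kl_spikes)}, grad norm >50 steps: {len(big_grads)}")
```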
### Axiom CLI (`scripts/axiom_cli.py`)
Real-time logs and events. Use for:
- Rollout timing and throughput
- Inference engine performance
- Error monitoring
- Order extraction stats
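The p95 target from the Performance table below can also be checked by hand from any list of rollout durations, however they were exported from Axiom. A minimal sketch; the `durations` values are placeholders, not real log data:

```python
import statistics

# Placeholder rollout durations in seconds, e.g. parsed from exported Axiom logs.
durations = [88.2, 95.1, 101.4, 76.9, 130.5, 99.8, 92.3, 87.0, 105.6, 118.2]

# statistics.quantiles with n=20 yields 19 cut points; index 18 is the p95.
p95 = statistics.quantiles(durations, n=20)[18]
print(f"rollout p95 = {p95:.1f}s ({'OK' if p95 < 120 else 'check inference engine'})")
```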
## Detailed Guides
- Learning Dynamics - Elo, rewards, KL analysis
- Pipeline Performance - Throughput, timing, errors
- Experiment Tracker Guide - Updating `docs/experiment-tracker.md`
- Examples - Real analysis walkthrough
## Key Metrics
### Learning Signal (Fixed Reference Analysis)
Key insight: win rate against a dynamic league is meaningless, because the opponents are improving too. Use FIXED references.
| Metric | Good Sign | Bad Sign |
|---|---|---|
| `base_model` Elo | Declining | Stable/rising |
| Baseline bot Elo | Declining (exploited) | Rising |
| Best checkpoint − `base_model` Elo gap | Growing | Shrinking |
| Older checkpoint Elo | Declining | Stable |
| KL divergence | Stable, <0.1 | Spikes >0.2 |
Fixed references (`base_model`, `chaos_bot`, etc.) never change, so any movement in their Elo reflects learning in the trained model. The Elo gap (best checkpoint − `base_model`) measures how much better the trained model has become.
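The gap translates directly into an expected head-to-head win rate via the standard Elo formula, which is why a growing gap is the signal to watch. A short illustration (the 1080-vs-1000 ratings are made up):

```python
# Standard Elo expected score: P(A beats B) = 1 / (1 + 10**((R_B - R_A) / 400)).
def expected_win_rate(rating_a: float, rating_b: float) -> float:
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

# Hypothetical ratings: an 80-point gap over the frozen base_model
# corresponds to roughly a 61% win rate against it.
print(f"{expected_win_rate(1080, 1000):.2%}")
```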
### Performance
| Metric | Target | Action on Miss |
|---|---|---|
| Rollout p95 duration | <120s | Check inference engine |
| Extraction rate | >95% | Check logits processor |
| Error rate | <1% | Check Axiom errors |
| Grad norm | <50 | Check for policy instability |
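These targets are easy to encode as a single pass/fail sweep. A minimal sketch, assuming the four values have already been pulled from the CLIs above; the parameter names are hypothetical:

```python
# Targets from the Performance table; parameter names here are hypothetical.
def check_performance(p95_s: float, extraction: float,
                      error_rate: float, grad_norm: float) -> list[str]:
    actions = []
    if p95_s >= 120:
        actions.append("rollout p95 >= 120s: check inference engine")
    if extraction <= 0.95:
        actions.append("extraction rate <= 95%: check logits processor")
    if error_rate >= 0.01:
        actions.append("error rate >= 1%: check Axiom errors")
    if grad_norm >= 50:
        actions.append("grad norm >= 50: check for policy instability")
    return actions

# Example with made-up values: only the grad-norm target is missed.
for action in check_performance(98.0, 0.97, 0.004, 63.0):
    print(action)
```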