# experiment-analysis
Analyze GRPO training runs for learning dynamics and pipeline performance. Use when diagnosing training issues, reviewing Elo progression, checking throughput, or updating experiment results.
## When & Why to Use This Skill
This Claude skill streamlines monitoring and diagnosis of GRPO (Group Relative Policy Optimization) training runs by synthesizing WandB metrics and Axiom logs. It gives developers automated insight into learning dynamics, Elo progression against fixed references, and pipeline throughput, so training instabilities can be spotted and resolved quickly.
## Use Cases
- Diagnosing training stagnation or instability by analyzing Elo trajectories, KL divergence, and gradient norms to ensure the model is effectively learning.
- Monitoring real-time pipeline performance, including rollout timing, inference engine throughput, and error rates via Axiom log integration.
- Comparing experiment sweeps to identify the most effective hyperparameters and configurations for reinforcement learning tasks.
- Evaluating model quality using fixed reference benchmarks to calculate the Elo gap and track genuine performance improvements over time.
- Tracking system health and extraction rates to ensure data processing and reward mechanisms are functioning correctly during long-running training jobs.
# Experiment Analysis
Diagnose GRPO training runs using WandB metrics and Axiom logs.
## Quick Reference
| Question | Command |
|---|---|
| Full Elo analysis | `uv run python .claude/skills/experiment-analysis/analyze_elo.py <run>` |
| Compare sweep runs | `uv run python .claude/skills/experiment-analysis/analyze_sweep.py --sweep <prefix>` |
| Is the model learning? | `uv run python scripts/wandb_cli.py get-metrics -r <run> --all-metrics` |
| Rollout throughput? | `uv run python scripts/axiom_cli.py rollout-timing --last 6h` |
| Any errors? | `uv run python scripts/axiom_cli.py errors --last 1h` |
| Extraction rate? | `uv run python scripts/axiom_cli.py extraction-stats --last 24h` |
| System health? | `uv run python scripts/axiom_cli.py health --last 1h` |
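For scripted checks, the same commands can be chained from Python. A minimal sketch, assuming the CLI flags in the table above and that each script exits non-zero on failure:

```python
import subprocess

# Wrapper around one documented CLI command; captures stdout as text.
def run_check(args: list[str]) -> str:
    result = subprocess.run(
        ["uv", "run", "python", *args],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

# Chain the error and throughput checks from the Quick Reference.
errors = run_check(["scripts/axiom_cli.py", "errors", "--last", "1h"])
timing = run_check(["scripts/axiom_cli.py", "rollout-timing", "--last", "6h"])
print(errors)
print(timing)
```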
## Tools Overview
### WandB CLI (`scripts/wandb_cli.py`)
Training metrics and Elo ratings. Use for:
- Elo trajectory analysis (learning signal)
- Reward/loss curves
- KL divergence and grad norm
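If the CLI does not expose a metric you need, the underlying run history can be pulled with the wandb Python API directly. A minimal sketch; the run path and the metric key names (`train/kl`, `train/grad_norm`) are assumptions and should be matched to whatever the run actually logs:

```python
import wandb

api = wandb.Api()
# Hypothetical run path: replace with your <entity>/<project>/<run_id>.
run = api.run("my-entity/grpo-training/run_id")

# Pull only the learning-signal columns; key names are assumed, not confirmed.
history = run.history(keys=["train/kl", "train/grad_norm"])

# Flag the bad signs from the Key Metrics table: KL spikes >0.2, grad norm >50.
kl_spikes = history[history["train/kl"] > 0.2]
big_grads = history[history["train/grad_norm"] > 50]
print(f"KL spikes: {len(kl_spikes)}, grad norm >50 steps: {len(big_grads)}")
```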
### Axiom CLI (`scripts/axiom_cli.py`)
Real-time logs and events. Use for:
- Rollout timing and throughput
- Inference engine performance
- Error monitoring
- Order extraction stats
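The p95 target from the Performance table below can also be checked by hand from any list of rollout durations, however they were exported from Axiom. A minimal sketch; the `durations` values are placeholders, not real log data:

```python
import statistics

# Placeholder rollout durations in seconds, e.g. parsed from exported Axiom logs.
durations = [88.2, 95.1, 101.4, 76.9, 130.5, 99.8, 92.3, 87.0, 105.6, 118.2]

# statistics.quantiles with n=20 yields 19 cut points; index 18 is the p95.
p95 = statistics.quantiles(durations, n=20)[18]
print(f"rollout p95 = {p95:.1f}s ({'OK' if p95 < 120 else 'check inference engine'})")
```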
## Detailed Guides
- Learning Dynamics - Elo, rewards, KL analysis
- Pipeline Performance - Throughput, timing, errors
- Experiment Tracker Guide - Updating `docs/experiment-tracker.md`
- Examples - Real analysis walkthrough
## Key Metrics
### Learning Signal (Fixed Reference Analysis)
Key insight: win rate against a dynamic league is meaningless, because the opponents are improving too. Use FIXED references.
| Metric | Good Sign | Bad Sign |
|---|---|---|
| `base_model` Elo | Declining | Stable/rising |
| Baseline bot Elo | Declining (exploited) | Rising |
| Best checkpoint − `base_model` Elo gap | Growing | Shrinking |
| Older checkpoint Elo | Declining | Stable |
| KL divergence | Stable, <0.1 | Spikes >0.2 |
Fixed references (`base_model`, `chaos_bot`, etc.) never change, so any movement in their Elo reflects learning in the trained model. The Elo gap (best checkpoint − `base_model`) measures how much better the trained model has become.
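The gap translates directly into an expected head-to-head win rate via the standard Elo formula, which is why a growing gap is the signal to watch. A short illustration (the 1080-vs-1000 ratings are made up):

```python
# Standard Elo expected score: P(A beats B) = 1 / (1 + 10**((R_B - R_A) / 400)).
def expected_win_rate(rating_a: float, rating_b: float) -> float:
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

# Hypothetical ratings: an 80-point gap over the frozen base_model
# corresponds to roughly a 61% win rate against it.
print(f"{expected_win_rate(1080, 1000):.2%}")
```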
### Performance
| Metric | Target | Action on Miss |
|---|---|---|
| Rollout p95 duration | <120s | Check inference engine |
| Extraction rate | >95% | Check logits processor |
| Error rate | <1% | Check Axiom errors |
| Grad norm | <50 | Check for policy instability |
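These targets are easy to encode as a single pass/fail sweep. A minimal sketch, assuming the four values have already been pulled from the CLIs above; the parameter names are hypothetical:

```python
# Targets from the Performance table; parameter names here are hypothetical.
def check_performance(p95_s: float, extraction: float,
                      error_rate: float, grad_norm: float) -> list[str]:
    actions = []
    if p95_s >= 120:
        actions.append("rollout p95 >= 120s: check inference engine")
    if extraction <= 0.95:
        actions.append("extraction rate <= 95%: check logits processor")
    if error_rate >= 0.01:
        actions.append("error rate >= 1%: check Axiom errors")
    if grad_norm >= 50:
        actions.append("grad norm >= 50: check for policy instability")
    return actions

# Example with made-up values: only the grad-norm target is missed.
for action in check_performance(98.0, 0.97, 0.004, 63.0):
    print(action)
```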