experiment-analysis

from majiayu000

Analyze GRPO training runs for learning dynamics and pipeline performance. Use when diagnosing training issues, reviewing Elo progression, checking throughput, or updating experiment results.


When & Why to Use This Skill

This Claude skill streamlines the monitoring and diagnosis of GRPO (Group Relative Policy Optimization) training runs by synthesizing WandB metrics and Axiom logs. It provides developers with automated insights into learning dynamics, Elo progression against fixed references, and pipeline throughput, ensuring high-performance model training and rapid troubleshooting of training instabilities.

Use Cases

  • Diagnosing training stagnation or instability by analyzing Elo trajectories, KL divergence, and gradient norms to ensure the model is effectively learning.
  • Monitoring real-time pipeline performance, including rollout timing, inference engine throughput, and error rates via Axiom log integration.
  • Comparing experiment sweeps to identify the most effective hyperparameters and configurations for reinforcement learning tasks.
  • Evaluating model quality using fixed reference benchmarks to calculate the Elo gap and track genuine performance improvements over time.
  • Tracking system health and extraction rates to ensure data processing and reward mechanisms are functioning correctly during long-running training jobs.
name: experiment-analysis
description: Analyze GRPO training runs for learning dynamics and pipeline performance. Use when diagnosing training issues, reviewing Elo progression, checking throughput, or updating experiment results.

Experiment Analysis

Diagnose GRPO training runs using WandB metrics and Axiom logs.

Quick Reference

| Question | Command |
| --- | --- |
| Full Elo analysis | `uv run python .claude/skills/experiment-analysis/analyze_elo.py <run>` |
| Compare sweep runs | `uv run python .claude/skills/experiment-analysis/analyze_sweep.py --sweep <prefix>` |
| Is model learning? | `uv run python scripts/wandb_cli.py get-metrics -r <run> --all-metrics` |
| Rollout throughput? | `uv run python scripts/axiom_cli.py rollout-timing --last 6h` |
| Any errors? | `uv run python scripts/axiom_cli.py errors --last 1h` |
| Extraction rate? | `uv run python scripts/axiom_cli.py extraction-stats --last 24h` |
| System health? | `uv run python scripts/axiom_cli.py health --last 1h` |
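When several of these checks need to run together (for example during incident triage), a small wrapper can batch them. This is a minimal sketch, assuming the commands above are run from the repository root exactly as listed in the table; it adds nothing beyond invoking them in sequence.

```python
import subprocess

# Pipeline health checks, taken verbatim from the Quick Reference table above.
CHECKS = {
    "errors (1h)": ["uv", "run", "python", "scripts/axiom_cli.py", "errors", "--last", "1h"],
    "rollout timing (6h)": ["uv", "run", "python", "scripts/axiom_cli.py", "rollout-timing", "--last", "6h"],
    "extraction stats (24h)": ["uv", "run", "python", "scripts/axiom_cli.py", "extraction-stats", "--last", "24h"],
    "system health (1h)": ["uv", "run", "python", "scripts/axiom_cli.py", "health", "--last", "1h"],
}


def run_triage() -> None:
    """Run each pipeline check in sequence and print its output."""
    for label, cmd in CHECKS.items():
        print(f"=== {label} ===")
        result = subprocess.run(cmd, capture_output=True, text=True)
        print(result.stdout or result.stderr)


if __name__ == "__main__":
    run_triage()
```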

Tools Overview

WandB CLI (scripts/wandb_cli.py)

Training metrics and Elo ratings. Use for:

  • Elo trajectory analysis (learning signal)
  • Reward/loss curves
  • KL divergence and grad norm

Axiom CLI (scripts/axiom_cli.py)

Real-time logs and events. Use for:

  • Rollout timing and throughput
  • Inference engine performance
  • Error monitoring
  • Order extraction stats

Detailed Guides

Key Metrics

Learning Signal (Fixed Reference Analysis)

Key insight: Win rate against a dynamic league is meaningless. Use FIXED references.

| Metric | Good Sign | Bad Sign |
| --- | --- | --- |
| base_model Elo | Declining | Stable/Rising |
| Baseline bot Elo | Declining (exploited) | Rising |
| Best checkpoint - base_model gap | Growing | Shrinking |
| Older checkpoint Elo | Declining | Stable |
| KL divergence | Stable <0.1 | Spikes >0.2 |

Fixed references (base_model, chaos_bot, etc.) don't change, so any shift in their Elo reflects learning by the policy. The Elo gap (best checkpoint minus base_model) measures how much better the trained model is than its starting point.
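The same check can be done directly against the WandB run history. A minimal sketch using the standard `wandb` public API; the metric keys (`elo/base_model`, `elo/best_checkpoint`, `kl_divergence`) are assumptions and should be replaced with whatever the run actually logs:

```python
import wandb

# Hypothetical metric keys; substitute the names the run actually logs.
BASE_KEY = "elo/base_model"
BEST_KEY = "elo/best_checkpoint"
KL_KEY = "kl_divergence"


def elo_gap_trend(run_path: str) -> None:
    """Report the Elo gap (best checkpoint - base_model) at the start and end
    of a run, plus the worst KL spike, using the WandB public API."""
    api = wandb.Api()
    run = api.run(run_path)  # e.g. "entity/project/run_id"
    history = run.history(keys=[BASE_KEY, BEST_KEY, KL_KEY], pandas=True).dropna()

    first, last = history.iloc[0], history.iloc[-1]
    gap_start = first[BEST_KEY] - first[BASE_KEY]
    gap_end = last[BEST_KEY] - last[BASE_KEY]
    max_kl = history[KL_KEY].max()

    print(f"Elo gap: {gap_start:.1f} -> {gap_end:.1f} "
          f"({'growing, good' if gap_end > gap_start else 'shrinking, investigate'})")
    print(f"Max KL divergence: {max_kl:.3f} "
          f"({'spike >0.2, bad sign' if max_kl > 0.2 else 'within range'})")
```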

Performance

| Metric | Target | Action if Miss |
| --- | --- | --- |
| Rollout p95 duration | <120s | Check inference engine |
| Extraction rate | >95% | Check logits processor |
| Error rate | <1% | Check Axiom errors |
| Grad norm | <50 | Policy may be unstable |
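To make these targets actionable in automation, they can be encoded as a small check. A minimal sketch; the threshold numbers and follow-up actions are copied from the table above, and the measured values are assumed to come from the WandB/Axiom CLIs (the example numbers are illustrative only):

```python
# Targets and follow-up actions from the Performance table above.
TARGETS = [
    ("rollout_p95_s", lambda v: v < 120, "Check inference engine"),
    ("extraction_rate", lambda v: v > 0.95, "Check logits processor"),
    ("error_rate", lambda v: v < 0.01, "Check Axiom errors"),
    ("grad_norm", lambda v: v < 50, "Policy may be unstable"),
]


def check_performance(measured: dict) -> list[str]:
    """Return a warning for each metric that misses its target."""
    warnings = []
    for name, within_target, action in TARGETS:
        value = measured.get(name)
        if value is not None and not within_target(value):
            warnings.append(f"{name}={value}: {action}")
    return warnings


# Example with illustrative measurements: only the rollout p95 misses its target.
print(check_performance({"rollout_p95_s": 150, "extraction_rate": 0.97,
                         "error_rate": 0.004, "grad_norm": 32}))
```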