nixtla-benchmark-reporter

from intent-solutions-io

Generate comprehensive markdown benchmark reports from forecast accuracy metrics with model comparisons, statistical analysis, and regression detection. Use when analyzing baseline performance, comparing forecast models, or validating model quality. Trigger with 'generate benchmark report', 'analyze forecast metrics', or 'create performance summary'.


When & Why to Use This Skill

The Nixtla Benchmark Reporter is a specialized Claude skill designed to automate the generation of production-ready markdown reports from forecasting accuracy metrics. It streamlines the evaluation of time-series models (such as TimeGPT or StatsForecast) by calculating key statistics, comparing model performance, and detecting regressions against baselines. By transforming raw CSV data into actionable insights and GitHub-ready documentation, it reduces manual analysis time from hours to minutes, ensuring systematic quality control in forecasting workflows.

Use Cases

  • Model Selection: Compare multiple forecasting models (e.g., AutoTheta vs. AutoETS) across hundreds of time series to identify the most accurate and consistent performer for production deployment.
  • Regression Detection: Automatically identify performance degradation by comparing current experiment results against historical baselines, triggering alerts if metrics like sMAPE or MASE exceed defined thresholds.
  • Automated Reporting: Generate comprehensive markdown reports and executive summaries for stakeholders, featuring model win rates, statistical breakdowns, and failure case analysis.
  • CI/CD Pipeline Integration: Incorporate automated benchmarking into data science workflows to ensure every model update meets quality standards and generate GitHub issue templates for any detected regressions.
name: nixtla-benchmark-reporter
description: Generate comprehensive markdown benchmark reports from forecast accuracy metrics with model comparisons, statistical analysis, and regression detection. Use when analyzing baseline performance, comparing forecast models, or validating model quality. Trigger with 'generate benchmark report', 'analyze forecast metrics', or 'create performance summary'.
allowed-tools: "Read,Write,Glob,Bash(python:*)"
version: "1.0.0"
author: "Jeremy Longshore <jeremy@intentsolutions.io>"
license: MIT

Nixtla Benchmark Reporter

Purpose

Generate production-ready benchmark reports from forecasting accuracy metrics, enabling systematic model comparison and regression detection for Nixtla forecasting workflows.

Overview

This skill transforms raw forecast metrics (sMAPE, MASE, MAE, RMSE) into actionable insights. It:

  • Parses benchmark results CSV files from statsforecast/TimeGPT experiments
  • Calculates summary statistics (mean, median, std dev, percentiles)
  • Generates model comparison tables with winners highlighted
  • Creates regression detection reports comparing current vs. baseline results
  • Produces GitHub issue templates for performance degradations
  • Generates markdown reports with embedded charts and recommendations

Key Benefits:

  • Automates tedious manual benchmarking analysis (2-3 hours → 2 minutes)
  • Provides consistent reporting format across all forecasting experiments
  • Detects performance regressions automatically
  • Generates shareable, version-controlled markdown reports

Prerequisites

  • Benchmark results CSV files with metrics per series and model
  • CSV format: columns series_id, model, sMAPE, MASE (minimum)
  • Optional: Baseline results CSV for regression comparison
  • Python 3.8+ with pandas, numpy installed

Expected CSV Structure:

series_id,model,sMAPE,MASE,MAE,RMSE
D1,SeasonalNaive,15.23,1.05,12.5,18.3
D1,AutoETS,13.45,0.92,10.2,15.1
D1,AutoTheta,12.34,0.87,9.8,14.5
D2,SeasonalNaive,18.67,1.23,15.1,22.4
...
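For reference, here is a minimal sketch (assuming pandas is installed) of how a file in this structure can be loaded and checked for the required columns. The load_results helper and the benchmark_results.csv path are illustrative, not part of the skill's actual API; the error messages echo the ones listed under Error Handling below.

import pandas as pd

REQUIRED_COLUMNS = {"series_id", "model", "sMAPE", "MASE"}

def load_results(path: str) -> pd.DataFrame:
    # Read the CSV and verify the minimum required columns are present.
    df = pd.read_csv(path)
    missing = REQUIRED_COLUMNS - set(df.columns)
    if missing:
        raise ValueError(f"Required columns missing: {', '.join(sorted(missing))}")
    if df.empty:
        raise ValueError("No metrics found in CSV file")
    return df

results = load_results("benchmark_results.csv")  # hypothetical path
print(f"{results['series_id'].nunique()} series, {results['model'].nunique()} models")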

Instructions

Step 1: Parse Benchmark Results

The script automatically:

  1. Reads benchmark CSV file(s)
  2. Validates CSV structure (required columns present)
  3. Extracts unique models and series
  4. Groups metrics by model

Usage:

python {baseDir}/scripts/generate_benchmark_report.py \
    --results /path/to/benchmark_results.csv \
    --output /path/to/report.md

Step 2: Calculate Summary Statistics

For each model, calculates:

  • Mean: Average metric across all series
  • Median: Middle value (less sensitive to outliers)
  • Std Dev: Measure of consistency
  • Min/Max: Best and worst performance
  • Percentiles: 25th, 50th, 75th, and 95th
  • Win Rate: Percentage of series where model performed best
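The sketch below shows one way these statistics and the win rate could be computed with pandas groupby. The summarize helper is hypothetical (not the script's actual implementation) and assumes a results DataFrame in the structure shown under Prerequisites; a model "wins" a series when it has the lowest value of the chosen metric for that series.

import pandas as pd

def summarize(results: pd.DataFrame, metric: str = "sMAPE") -> pd.DataFrame:
    # Per-model summary statistics for the chosen metric.
    stats = results.groupby("model")[metric].agg(
        mean="mean", median="median", std="std", min="min", max="max",
        p25=lambda s: s.quantile(0.25),
        p75=lambda s: s.quantile(0.75),
        p95=lambda s: s.quantile(0.95),
    )
    # Win rate: share of series where this model has the lowest metric value.
    winners = results.loc[results.groupby("series_id")[metric].idxmin(), "model"]
    n_series = results["series_id"].nunique()
    stats["win_rate"] = winners.value_counts().reindex(stats.index, fill_value=0) / n_series
    return stats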

Step 3: Generate Comparison Table

Creates markdown table comparing all models:

## Model Comparison (sMAPE)

| Model | Mean | Median | Std Dev | Min | Max | Wins |
|-------|------|--------|---------|-----|-----|------|
| AutoTheta | 12.3% | 11.8% | 4.2% | 5.1% | 28.9% | 32/50 (64%) |
| AutoETS | 13.5% | 12.9% | 5.1% | 6.2% | 31.2% | 18/50 (36%) |
| SeasonalNaive | 15.2% | 14.5% | 6.3% | 7.8% | 35.4% | 0/50 (0%) |
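As a rough illustration of how such a table can be rendered from the summary statistics, here is a hedged sketch; comparison_table is a hypothetical helper that consumes the output of the summarize sketch above, not the script's actual code.

import pandas as pd

def comparison_table(stats: pd.DataFrame, n_series: int, metric: str = "sMAPE") -> str:
    # Lowest mean first, matching the example table above.
    lines = [
        f"## Model Comparison ({metric})",
        "",
        "| Model | Mean | Median | Std Dev | Min | Max | Wins |",
        "|-------|------|--------|---------|-----|-----|------|",
    ]
    for model, row in stats.sort_values("mean").iterrows():
        wins = int(round(row["win_rate"] * n_series))
        lines.append(
            f"| {model} | {row['mean']:.1f}% | {row['median']:.1f}% | {row['std']:.1f}% "
            f"| {row['min']:.1f}% | {row['max']:.1f}% | {wins}/{n_series} ({row['win_rate']:.0%}) |"
        )
    return "\n".join(lines)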

Step 4: Identify Winner and Recommendations

Determines overall best model based on:

  1. Primary metric: Lowest mean sMAPE/MASE
  2. Consistency: Lowest standard deviation
  3. Win rate: Most series won

Generates recommendations:

  • Production baseline model selection
  • When to use alternatives (e.g., AutoETS for seasonal data)
  • Failure case analysis (series where all models struggle)
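The sketch below illustrates the selection and failure-case logic described above, assuming lower metric values are better. pick_winner and failure_cases are hypothetical helpers; the 30% sMAPE cutoff matches the failure-case threshold used in the report output.

import pandas as pd

def pick_winner(stats: pd.DataFrame) -> str:
    # Rank by mean (lower is better), then consistency (std dev), then win rate.
    ranked = stats.sort_values(["mean", "std", "win_rate"], ascending=[True, True, False])
    return ranked.index[0]

def failure_cases(results: pd.DataFrame, metric: str = "sMAPE", cutoff: float = 30.0) -> list:
    # Series where even the best model stays above the cutoff, i.e. all models struggle.
    best_per_series = results.groupby("series_id")[metric].min()
    return best_per_series[best_per_series > cutoff].index.tolist()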

Step 5: Regression Detection (Optional)

If baseline results provided, compares current vs. baseline:

python {baseDir}/scripts/generate_benchmark_report.py \
    --results current_results.csv \
    --baseline baseline_results.csv \
    --output regression_report.md \
    --threshold 5.0  # Alert if sMAPE degrades >5%

Regression Report Includes:

  • Models with performance degradation
  • Severity of regression (% change)
  • Affected series
  • GitHub issue template for regressions
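A simplified sketch of the comparison (not the script's actual implementation) could look like this: per-model means are compared between runs, and the change is measured relative to the baseline, matching how the example outputs below report percentages.

import pandas as pd

def detect_regressions(current: pd.DataFrame, baseline: pd.DataFrame,
                       metric: str = "sMAPE", threshold: float = 5.0) -> pd.DataFrame:
    # Compare per-model means; a positive pct_change means the error metric got worse.
    cur = current.groupby("model")[metric].mean()
    base = baseline.groupby("model")[metric].mean()
    pct_change = (cur - base) / base * 100
    report = pd.DataFrame({"baseline": base, "current": cur, "pct_change": pct_change})
    return report[report["pct_change"] > threshold].sort_values("pct_change", ascending=False)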

Step 6: Customize Report Format

Supports multiple output formats:

Standard Report (default):

python {baseDir}/scripts/generate_benchmark_report.py --results metrics.csv

Executive Summary (1-page):

python {baseDir}/scripts/generate_benchmark_report.py \
    --results metrics.csv \
    --format executive \
    --output summary.md

GitHub Issue Template:

python {baseDir}/scripts/generate_benchmark_report.py \
    --results metrics.csv \
    --format github \
    --output .github/ISSUE_TEMPLATE/regression.md

Output

The script generates:

Standard Report (report.md):

  1. Executive Summary (1-2 paragraphs)
  2. Model Comparison Table (all metrics)
  3. Statistical Analysis (means, std devs, percentiles)
  4. Winner Declaration with justification
  5. Per-Series Breakdown (optional)
  6. Recommendations for production use
  7. Failure Case Analysis (series with sMAPE > 30%)

Regression Report (if baseline provided):

  1. Regression Summary (models degraded)
  2. Severity Analysis (% change per model)
  3. Affected Series List
  4. GitHub Issue Template

GitHub Issue Template:

---
title: "Performance Regression Detected: {model_name}"
labels: ["regression", "performance"]
assignees: ["team-lead"]
---

## Regression Summary
Model: {model_name}
Metric: sMAPE degraded by {X}%
Baseline: {baseline_value}%
Current: {current_value}%

## Affected Series
- {series_1}: {baseline}% → {current}% ({delta}%)
- {series_2}: {baseline}% → {current}% ({delta}%)
...

## Acceptance Criteria
- [ ] Investigate root cause
- [ ] Restore performance to within 2% of baseline
- [ ] Add regression test to CI/CD
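How the script fills this template internally is not documented; one plausible approach, shown here purely as an illustration, is plain str.format substitution with the placeholder names above (this sketch trims the assignees and affected-series parts for brevity and reuses a row from the regression-detection sketch earlier).

import pandas as pd

ISSUE_TEMPLATE = """---
title: "Performance Regression Detected: {model_name}"
labels: ["regression", "performance"]
---

## Regression Summary
Model: {model_name}
Metric: sMAPE degraded by {pct_change:.1f}%
Baseline: {baseline:.1f}%
Current: {current:.1f}%
"""

def issue_for(model_name: str, row: pd.Series) -> str:
    # Fill the template for one regressed model (a row from detect_regressions).
    return ISSUE_TEMPLATE.format(model_name=model_name, **row.to_dict())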

Error Handling

Missing Metrics File:

Error: Benchmark results not found at /path/to/results.csv
Solution: Verify path and ensure CSV file exists

Invalid CSV Structure:

Error: Required columns missing: series_id, model, sMAPE
Solution: Ensure CSV has minimum required columns

Empty Results:

Warning: No metrics found in CSV file
Solution: Verify CSV has data rows (not just headers)

Regression Threshold Exceeded:

🚨 REGRESSION DETECTED: AutoTheta sMAPE degraded by 12.2%
  Baseline: 12.3%
  Current: 13.8%
  Threshold: 5.0%
Solution: Review recent model changes, check data quality

Examples

Example 1: Generate Standard Benchmark Report

python {baseDir}/scripts/generate_benchmark_report.py \
    --results nixtla_baseline_m4/results_M4_Daily_h14.csv \
    --output reports/m4_daily_baseline.md \
    --verbose

Output:

✓ Loaded 150 results (50 series × 3 models)
✓ Calculated summary statistics
✓ Identified winner: AutoTheta (mean sMAPE: 12.3%)
✓ Generated report: reports/m4_daily_baseline.md (1,245 words)

Example 2: Detect Regressions vs. Baseline

python {baseDir}/scripts/generate_benchmark_report.py \
    --results current_run/results.csv \
    --baseline baseline/v1.0_results.csv \
    --output regression_report.md \
    --threshold 3.0

Output:

⚠️  REGRESSION DETECTED in 2/3 models:
  - AutoETS: sMAPE 13.5% → 14.8% (+9.6%)
  - AutoTheta: sMAPE 12.3% → 12.7% (+3.3%)
✓ Generated regression report with GitHub issue template

Example 3: Generate Executive Summary

python {baseDir}/scripts/generate_benchmark_report.py \
    --results quarterly_benchmark.csv \
    --format executive \
    --output Q1_summary.md

Output:

# Q1 2025 Forecast Baseline Report

**Winner**: AutoTheta with 12.3% sMAPE (vs. 13.5% AutoETS, 15.2% Naive)

**Key Findings**:
- AutoTheta won 64% of series (32/50)
- Most consistent performance (std dev 4.2%)
- Recommended for production baseline

**Action Items**:
- Deploy AutoTheta as default model
- Use AutoETS for highly seasonal data (criteria: seasonal_strength > 0.8)
- Investigate 3 failure cases (sMAPE > 30%)

Example 4: Custom Metric Focus

python {baseDir}/scripts/generate_benchmark_report.py \
    --results results.csv \
    --primary-metric MASE \
    --output mase_focused_report.md

Best Practices

  1. Version Control Reports: Commit generated reports to track performance over time
  2. Automate in CI/CD: Generate reports automatically on every benchmark run
  3. Set Regression Thresholds: Use --threshold to catch regressions early (recommend 3-5%)
  4. Include Timestamps: Reports automatically include generation date/time
  5. Document Assumptions: Reports include metadata about benchmark setup
  6. Share with Stakeholders: Markdown reports render nicely on GitHub/GitLab
  7. Archive Baselines: Keep historical baseline CSVs for regression comparison
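
As a hedged example of practice 2, the sketch below wraps the script in a small CI gate: it invokes the documented CLI flags via subprocess and fails the job if the console output contains the "REGRESSION DETECTED" marker shown in the examples above. The paths, script location, and threshold are placeholders, and the script's own exit-code behavior is an assumption.

import subprocess
import sys

# Paths, script location, and threshold are placeholders; adjust to your repo layout.
cmd = [
    "python", "scripts/generate_benchmark_report.py",
    "--results", "current_run/results.csv",
    "--baseline", "baseline/v1.0_results.csv",
    "--output", "regression_report.md",
    "--threshold", "3.0",
]
proc = subprocess.run(cmd, capture_output=True, text=True)
print(proc.stdout)

# Fail the job if the output contains the documented regression marker;
# relying on the script's exit code alone is an assumption, so check both.
if proc.returncode != 0 or "REGRESSION DETECTED" in proc.stdout:
    sys.exit(1)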

Resources