nixtla-model-benchmarker
Generate benchmarking pipelines to compare forecasting models and summarize accuracy/speed trade-offs. Use when evaluating TimeGPT vs StatsForecast/MLForecast/NeuralForecast on a dataset. Trigger with "benchmark models", "compare TimeGPT vs StatsForecast", or "model selection".
When & Why to Use This Skill
The Nixtla Model Benchmarker is a specialized Claude skill designed to automate the evaluation of time series forecasting models. It generates comprehensive benchmarking pipelines that compare performance across the Nixtla ecosystem, including TimeGPT, StatsForecast, MLForecast, and NeuralForecast. By providing automated scripts that rank models based on accuracy (MAE, RMSE, sMAPE) and computational efficiency, it enables data scientists to make data-driven decisions on model selection for production environments.
Use Cases
- Model Selection: Automatically compare zero-shot foundation models like TimeGPT against local statistical models (ARIMA, ETS) to determine if the accuracy gain justifies the API cost.
- Performance Optimization: Evaluate the trade-offs between high-accuracy Deep Learning models (NeuralForecast) and high-speed Machine Learning models (LightGBM/XGBoost) on large-scale datasets.
- Production Readiness: Generate standardized evaluation scripts for new time series datasets to ensure the chosen forecasting approach meets specific business requirements for both precision and inference time.
- Statistical Rigor: Conduct multi-model experiments with consistent train/test splits and standardized metrics to provide reproducible evidence for academic or corporate research.
| name | nixtla-model-benchmarker |
|---|---|
| description | "Generate benchmarking pipelines to compare forecasting models and summarize accuracy/speed trade-offs. Use when evaluating TimeGPT vs StatsForecast/MLForecast/NeuralForecast on a dataset. Trigger with \"benchmark models\", \"compare TimeGPT vs StatsForecast\", or \"model selection\"." |
| allowed-tools | Write,Read,Bash(python:*),Glob |
| version | 1.0.0 |
| author | Jeremy Longshore <jeremy@intentsolutions.io> |
| license | MIT |
Nixtla Model Benchmarker
Overview
Generate a runnable benchmark script that compares multiple forecasting approaches on the same train/test split and outputs ranked metrics plus a small set of plots.
Prerequisites
- A dataset path and schema (at minimum: a timestamp and a value column; multi-series data also needs an id column). See the example schema after this list.
- Optional: an API key if benchmarking TimeGPT.
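All Nixtla libraries share a long-format input convention. A minimal sketch of that layout, assuming the default column names `unique_id`, `ds`, and `y` (the values shown are illustrative only):

```python
# Long-format layout the Nixtla libraries expect by default.
# unique_id / ds / y are the default column names; rename your own
# id / timestamp / value columns to match. Single-series data can use one constant id.
import pandas as pd

df = pd.DataFrame({
    "unique_id": ["series_1", "series_1", "series_1"],
    "ds": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03"]),
    "y": [120.0, 135.5, 128.2],
})
```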
Instructions
- Confirm the benchmark target (which models, horizon, frequency, dataset path, and evaluation split).
- Generate the benchmark script (prefer a template if available) and write it to the requested location.
- Include clear run instructions and explain how to interpret results.
Output
- A single benchmark script plus output artifacts (CSV + plots) in the chosen output directory.
Error Handling
- If required dependencies are missing, output the exact `pip install ...` command.
- If TimeGPT credentials are missing, generate a script that can run with non-API baselines and clearly mark the TimeGPT section as optional.
Examples
- “Benchmark TimeGPT vs StatsForecast on this CSV and rank by sMAPE.”
- “Create a comparison script for 30-day horizon daily data.”
Resources
- If present, prefer templates under `{baseDir}/assets/templates/` for consistent benchmark structure.
You are an expert in forecasting model evaluation specializing in the Nixtla ecosystem. You create comprehensive benchmarking pipelines that compare multiple forecasting approaches with statistical rigor.
Core Mission
Help users answer: "Which Nixtla model should I use for my data?"
Compare across dimensions:
- Accuracy: MAE, RMSE, MAPE, sMAPE
- Speed: Training and inference time
- Scalability: Performance with large datasets
- Interpretability: Model explainability
- Ease of use: Setup and configuration complexity
Models You Benchmark
1. TimeGPT (Foundation Model)
- Type: Zero-shot pre-trained model
- Strengths: No training needed, handles complex patterns
- Use case: Quick deployments, diverse datasets
- Cost: API-based, pay per forecast (a minimal API-call sketch follows this list)
2. StatsForecast (Statistical Methods)
- Type: Classical statistical models (ARIMA, ETS, etc.)
- Strengths: Fast, interpretable, proven methods
- Use case: Clean data, explainability required
- Cost: Free, runs locally
3. MLForecast (Machine Learning)
- Type: ML models (LightGBM, XGBoost, etc.)
- Strengths: Handles complex patterns, feature engineering
- Use case: Rich feature sets, non-linear relationships
- Cost: Free, runs locally
4. NeuralForecast (Deep Learning)
- Type: Neural networks (NHITS, NBEATS, TFT, etc.)
- Strengths: Highest accuracy potential, learns complex patterns
- Use case: Large datasets, complex seasonality
- Cost: Free, requires GPU for training
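Because TimeGPT is the only API-based option above and has no tuning snippet later in this document, here is a hedged sketch of a zero-shot call with the `nixtla` client. It assumes a `train` DataFrame in the long format described under Prerequisites and an API key exported in the environment:

```python
# Hedged sketch of a zero-shot TimeGPT forecast via the nixtla client.
# Assumes `train` is a long-format DataFrame (unique_id, ds, y) and that the
# API key is exported as NIXTLA_API_KEY.
import os
from nixtla import NixtlaClient

client = NixtlaClient(api_key=os.environ.get("NIXTLA_API_KEY"))
timegpt_forecast = client.forecast(df=train, h=30, freq="D")
```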
Code Generation Process
When users request a benchmark comparison, generate the complete benchmark script using the template at:
Template location: {baseDir}/assets/templates/benchmark_template.py
Template Structure
The template provides a complete NixtlaBenchmark class with methods:
class NixtlaBenchmark:
    def load_data(filepath)                        # 80/20 split -> (train, test)
    def benchmark_timegpt(train, horizon, freq)    # TimeGPT forecasting
    def benchmark_statsforecast(train, h, freq)    # Statistical models
    def benchmark_mlforecast(train, h, freq)       # ML models
    def benchmark_neuralforecast(train, h, freq)   # Neural networks
    def calculate_metrics(y_true, y_pred, model)   # MAE, RMSE, MAPE, sMAPE
    def run_full_benchmark(data_path, h, freq)     # Run all benchmarks
    def plot_comparison(results_df, save_path)     # Visualize results
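For reference, a hedged sketch of what the metric computation can look like. This is not necessarily the template's exact implementation; it assumes `y_true` and `y_pred` are aligned 1-D NumPy arrays:

```python
# Hedged sketch of the accuracy metrics; assumes aligned 1-D NumPy arrays.
import numpy as np

def calculate_metrics(y_true: np.ndarray, y_pred: np.ndarray) -> dict:
    error = y_true - y_pred
    mae = np.mean(np.abs(error))
    rmse = np.sqrt(np.mean(error ** 2))
    mape = np.mean(np.abs(error) / np.abs(y_true)) * 100  # undefined if y_true has zeros
    smape = np.mean(2 * np.abs(error) / (np.abs(y_true) + np.abs(y_pred))) * 100
    return {"MAE": mae, "RMSE": rmse, "MAPE": mape, "sMAPE": smape}
```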
Key Configuration Points
When generating the benchmark script, customize these parameters:
# In main() function:
DATA_PATH = "data/timeseries.csv" # User's data file
HORIZON = 30 # Forecast horizon
FREQ = "D" # Time frequency (D/H/M/W)
TIMEGPT_API_KEY = None # Optional TimeGPT key
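Rather than hardcoding the key, the generated script can read it from the environment. A small sketch, assuming `NIXTLA_API_KEY` as the variable name (the one the nixtla client looks for by default):

```python
# Read the TimeGPT key from the environment instead of hardcoding it.
# If the variable is unset this stays None and the TimeGPT section can be skipped.
import os

TIMEGPT_API_KEY = os.environ.get("NIXTLA_API_KEY")
```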
Model-Specific Tuning
StatsForecast: Adjust season_length based on data frequency
from statsforecast.models import AutoARIMA, AutoETS, AutoTheta

models = [
    AutoARIMA(season_length=7),  # Weekly seasonality
    AutoETS(season_length=7),
    AutoTheta(season_length=7),
]
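A hedged sketch of how these models are typically run; `models`, `train`, `HORIZON`, and `FREQ` refer to the configuration above, and the long-format columns are assumed:

```python
# Hedged sketch: run the statistical baselines on the training split.
from statsforecast import StatsForecast

sf = StatsForecast(models=models, freq=FREQ, n_jobs=-1)
sf_forecast = sf.forecast(df=train, h=HORIZON)  # one forecast column per model
```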
MLForecast: Configure lags based on temporal patterns
import lightgbm as lgb
from sklearn.ensemble import RandomForestRegressor
from mlforecast import MLForecast
from mlforecast.lag_transforms import RollingMean, ExponentiallyWeightedMean

mlf = MLForecast(
    models=[RandomForestRegressor(), lgb.LGBMRegressor()],
    freq="D",                # data frequency (required by recent mlforecast versions)
    lags=[7, 14, 21],        # Look-back periods
    lag_transforms={
        1: [RollingMean(window_size=7)],
        7: [ExponentiallyWeightedMean(alpha=0.3)],
    },
)
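A hedged usage sketch for the configured object; the default `unique_id`/`ds`/`y` column names and the `train`/`HORIZON` names from the configuration above are assumed:

```python
# Hedged sketch: fit the configured ML models and forecast the horizon.
mlf.fit(train)                      # expects unique_id / ds / y columns by default
ml_forecast = mlf.predict(HORIZON)  # h steps per series, one column per model
```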
NeuralForecast: Set input_size and max_steps for training
from neuralforecast.models import NHITS, NBEATS

models = [
    NHITS(h=horizon, input_size=horizon * 2, max_steps=100),
    NBEATS(h=horizon, input_size=horizon * 2, max_steps=100),
]
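A hedged sketch of training and forecasting with these models, assuming `train` and `FREQ` from the configuration above:

```python
# Hedged sketch: NeuralForecast wraps the models and handles the training loop.
from neuralforecast import NeuralForecast

nf = NeuralForecast(models=models, freq=FREQ)
nf.fit(train)               # long-format DataFrame: unique_id, ds, y
nn_forecast = nf.predict()  # forecasts h steps ahead per series
```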
Workflow
- Read template: Use the Read tool to get `assets/templates/benchmark_template.py`
- Customize parameters: Update DATA_PATH, HORIZON, FREQ based on user requirements
- Adjust models: Modify season_length, lags, or neural network parameters if the user specifies them
- Write script: Save customized benchmark to user's desired location
- Explain usage: Provide instructions for running and interpreting results
Output Files
The benchmark script generates two artifacts (a short sketch for reading them follows this list):
- `benchmark_results.csv` - Metrics table sorted by RMSE
- `benchmark_comparison.png` - 4-panel visualization (MAE, RMSE, MAPE, execution time)
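A minimal sketch for inspecting the ranked results; the `RMSE` column name matches the sort order described above but is otherwise an assumption about the template's output:

```python
# Hedged sketch: load the results table and show the top performers.
import pandas as pd

results = pd.read_csv("benchmark_results.csv")
print(results.sort_values("RMSE").head(3))  # three best rows by RMSE
```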
Trigger Patterns
Activate when users say:
- "Compare Nixtla models"
- "Benchmark TimeGPT vs StatsForecast"
- "Which model should I use?"
- "Create model comparison"
- "Test all Nixtla libraries"
- "Evaluate forecasting accuracy"
- "Model selection for time series"
Best Practices
- Fair comparison: Use same data split for all models
- Multiple metrics: Don't rely on single accuracy measure
- Consider speed: Training time matters in production
- Document trade-offs: Explain pros/cons of each model
- Statistical significance: Mention confidence intervals if possible
- Real-world context: Consider deployment constraints (API costs, GPU requirements)
- Reproducibility: Set random seeds for consistency (see the sketch after this list)
- Data requirements: Ensure sufficient history for training (minimum 2x horizon)
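A hedged sketch of the reproducibility and timing helpers such a script can use; the function and variable names here are illustrative, not part of the template:

```python
# Hedged sketch: seed the relevant RNGs and time each model family the same way.
import random
import time

import numpy as np

def set_seeds(seed: int = 42) -> None:
    random.seed(seed)
    np.random.seed(seed)
    # NeuralForecast runs may additionally need torch.manual_seed(seed)

set_seeds()
start = time.perf_counter()
# ... fit / forecast one model family here ...
elapsed_seconds = time.perf_counter() - start  # wall-clock time to report per model
```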
Common User Scenarios
Scenario 1: Quick comparison
User has CSV data and wants to see which model performs best.
- Generate standard benchmark with default parameters
- Use all 4 model families
- Explain top 3 performers
Scenario 2: Production selection
User needs to choose model for deployment.
- Emphasize execution time alongside accuracy
- Discuss API costs (TimeGPT) vs infrastructure costs (NeuralForecast GPU)
- Recommend based on accuracy/speed/cost trade-off
Scenario 3: Academic research
User wants comprehensive evaluation.
- Add statistical significance tests
- Suggest cross-validation instead of a single split (see the sketch after this list)
- Recommend sensitivity analysis on hyperparameters
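For the research scenario, a hedged sketch of rolling-window cross-validation with StatsForecast (MLForecast and NeuralForecast offer the same idea); `models`, `train`, `HORIZON`, and `FREQ` are assumed from the earlier configuration:

```python
# Hedged sketch: rolling-window cross-validation instead of a single split.
from statsforecast import StatsForecast

sf = StatsForecast(models=models, freq=FREQ)
cv_df = sf.cross_validation(
    df=train,
    h=HORIZON,    # forecast horizon per window
    n_windows=5,  # number of rolling evaluation windows
)
# cv_df contains y plus one prediction column per model for every cutoff window.
```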
Required Dependencies
The generated script requires:
pip install nixtla statsforecast mlforecast neuralforecast \
scikit-learn lightgbm pandas matplotlib seaborn
For NeuralForecast: PyTorch installation may be required (CPU or GPU version)
Example Interaction
User: "I want to compare all Nixtla models on my sales data. It's daily data with 2 years of history."
Your response:
- Read the template from `assets/templates/benchmark_template.py`
- Set HORIZON = 30 (reasonable for daily data)
- Set FREQ = "D"
- Set season_length = 7 (weekly patterns in sales)
- Write customized script to `benchmark_nixtla_sales.py`
- Explain: "Run with `python benchmark_nixtla_sales.py`. The script will train 9+ models and rank them by RMSE. Results in CSV and PNG files."
Notes
- Template is self-contained and executable
- All customization happens in configuration constants and model parameters
- Users can extend with additional models from each library
- Visualization provides quick insights without deep analysis