nixtla-model-benchmarker
Generate benchmarking pipelines to compare forecasting models and summarize accuracy/speed trade-offs. Use when evaluating TimeGPT vs StatsForecast/MLForecast/NeuralForecast on a dataset. Trigger with "benchmark models", "compare TimeGPT vs StatsForecast", or "model selection".
When & Why to Use This Skill
The Nixtla Model Benchmarker is a specialized Claude skill designed to automate the evaluation of time series forecasting models. It generates comprehensive benchmarking pipelines that compare performance across the Nixtla ecosystem, including TimeGPT, StatsForecast, MLForecast, and NeuralForecast. By providing automated scripts that rank models based on accuracy (MAE, RMSE, sMAPE) and computational efficiency, it enables data scientists to make data-driven decisions on model selection for production environments.
Use Cases
- Model Selection: Automatically compare zero-shot foundation models like TimeGPT against local statistical models (ARIMA, ETS) to determine if the accuracy gain justifies the API cost.
- Performance Optimization: Evaluate the trade-offs between high-accuracy Deep Learning models (NeuralForecast) and high-speed Machine Learning models (LightGBM/XGBoost) on large-scale datasets.
- Production Readiness: Generate standardized evaluation scripts for new time series datasets to ensure the chosen forecasting approach meets specific business requirements for both precision and inference time.
- Statistical Rigor: Conduct multi-model experiments with consistent train/test splits and standardized metrics to provide reproducible evidence for academic or corporate research.
| name | nixtla-model-benchmarker |
|---|---|
| description | "Generate benchmarking pipelines to compare forecasting models and summarize accuracy/speed trade-offs. Use when evaluating TimeGPT vs StatsForecast/MLForecast/NeuralForecast on a dataset. Trigger with \"benchmark models\", \"compare TimeGPT vs StatsForecast\", or \"model selection\"." |
| allowed-tools | Write,Read,Bash(python:*),Glob |
| version | 1.0.0 |
| author | Jeremy Longshore <jeremy@intentsolutions.io> |
| license | MIT |
Nixtla Model Benchmarker
Overview
Generate a runnable benchmark script that compares multiple forecasting approaches on the same train/test split and outputs ranked metrics plus a small set of plots.
Prerequisites
- A dataset path and schema (at minimum: a timestamp and a value column; multi-series data also needs an id column). See the example schema after this list.
- Optional: an API key if benchmarking TimeGPT.
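All Nixtla libraries share a long-format input convention. A minimal sketch of that layout, assuming the default column names `unique_id`, `ds`, and `y` (the values shown are illustrative only):

```python
# Long-format layout the Nixtla libraries expect by default.
# unique_id / ds / y are the default column names; rename your own
# id / timestamp / value columns to match. Single-series data can use one constant id.
import pandas as pd

df = pd.DataFrame({
    "unique_id": ["series_1", "series_1", "series_1"],
    "ds": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03"]),
    "y": [120.0, 135.5, 128.2],
})
```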
Instructions
- Confirm the benchmark target (which models, horizon, frequency, dataset path, and evaluation split).
- Generate the benchmark script (prefer a template if available) and write it to the requested location.
- Include clear run instructions and explain how to interpret results.
Output
- A single benchmark script plus output artifacts (CSV + plots) in the chosen output directory.
Error Handling
- If required dependencies are missing, output the exact `pip install ...` command.
- If TimeGPT credentials are missing, generate a script that can run with non-API baselines and clearly mark the TimeGPT section as optional.
Examples
- “Benchmark TimeGPT vs StatsForecast on this CSV and rank by sMAPE.”
- “Create a comparison script for 30-day horizon daily data.”
Resources
- If present, prefer templates under `{baseDir}/assets/templates/` for consistent benchmark structure.
You are an expert in forecasting model evaluation specializing in the Nixtla ecosystem. You create comprehensive benchmarking pipelines that compare multiple forecasting approaches with statistical rigor.
Core Mission
Help users answer: "Which Nixtla model should I use for my data?"
Compare across dimensions:
- Accuracy: MAE, RMSE, MAPE, sMAPE
- Speed: Training and inference time
- Scalability: Performance with large datasets
- Interpretability: Model explainability
- Ease of use: Setup and configuration complexity
Models You Benchmark
1. TimeGPT (Foundation Model)
- Type: Zero-shot pre-trained model
- Strengths: No training needed, handles complex patterns
- Use case: Quick deployments, diverse datasets
- Cost: API-based, pay per forecast (a minimal API-call sketch follows this list)
2. StatsForecast (Statistical Methods)
- Type: Classical statistical models (ARIMA, ETS, etc.)
- Strengths: Fast, interpretable, proven methods
- Use case: Clean data, explainability required
- Cost: Free, runs locally
3. MLForecast (Machine Learning)
- Type: ML models (LightGBM, XGBoost, etc.)
- Strengths: Handles complex patterns, feature engineering
- Use case: Rich feature sets, non-linear relationships
- Cost: Free, runs locally
4. NeuralForecast (Deep Learning)
- Type: Neural networks (NHITS, NBEATS, TFT, etc.)
- Strengths: Highest accuracy potential, learns complex patterns
- Use case: Large datasets, complex seasonality
- Cost: Free, requires GPU for training
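Because TimeGPT is the only API-based option above and has no tuning snippet later in this document, here is a hedged sketch of a zero-shot call with the `nixtla` client. It assumes a `train` DataFrame in the long format described under Prerequisites and an API key exported in the environment:

```python
# Hedged sketch of a zero-shot TimeGPT forecast via the nixtla client.
# Assumes `train` is a long-format DataFrame (unique_id, ds, y) and that the
# API key is exported as NIXTLA_API_KEY.
import os
from nixtla import NixtlaClient

client = NixtlaClient(api_key=os.environ.get("NIXTLA_API_KEY"))
timegpt_forecast = client.forecast(df=train, h=30, freq="D")
```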
Code Generation Process
When users request a benchmark comparison, generate the complete benchmark script using the template at:
Template location: {baseDir}/assets/templates/benchmark_template.py
Template Structure
The template provides a complete NixtlaBenchmark class with methods:
class NixtlaBenchmark:
    def load_data(filepath)                        # 80/20 split -> (train, test)
    def benchmark_timegpt(train, horizon, freq)    # TimeGPT forecasting
    def benchmark_statsforecast(train, h, freq)    # Statistical models
    def benchmark_mlforecast(train, h, freq)       # ML models
    def benchmark_neuralforecast(train, h, freq)   # Neural networks
    def calculate_metrics(y_true, y_pred, model)   # MAE, RMSE, MAPE, sMAPE
    def run_full_benchmark(data_path, h, freq)     # Run all benchmarks
    def plot_comparison(results_df, save_path)     # Visualize results
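For reference, a hedged sketch of what the metric computation can look like. This is not necessarily the template's exact implementation; it assumes `y_true` and `y_pred` are aligned 1-D NumPy arrays:

```python
# Hedged sketch of the accuracy metrics; assumes aligned 1-D NumPy arrays.
import numpy as np

def calculate_metrics(y_true: np.ndarray, y_pred: np.ndarray) -> dict:
    error = y_true - y_pred
    mae = np.mean(np.abs(error))
    rmse = np.sqrt(np.mean(error ** 2))
    mape = np.mean(np.abs(error) / np.abs(y_true)) * 100  # undefined if y_true has zeros
    smape = np.mean(2 * np.abs(error) / (np.abs(y_true) + np.abs(y_pred))) * 100
    return {"MAE": mae, "RMSE": rmse, "MAPE": mape, "sMAPE": smape}
```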
Key Configuration Points
When generating the benchmark script, customize these parameters:
# In main() function:
DATA_PATH = "data/timeseries.csv" # User's data file
HORIZON = 30 # Forecast horizon
FREQ = "D" # Time frequency (D/H/M/W)
TIMEGPT_API_KEY = None # Optional TimeGPT key
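Rather than hardcoding the key, the generated script can read it from the environment. A small sketch, assuming `NIXTLA_API_KEY` as the variable name (the one the nixtla client looks for by default):

```python
# Read the TimeGPT key from the environment instead of hardcoding it.
# If the variable is unset this stays None and the TimeGPT section can be skipped.
import os

TIMEGPT_API_KEY = os.environ.get("NIXTLA_API_KEY")
```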
Model-Specific Tuning
StatsForecast: Adjust season_length based on data frequency
from statsforecast.models import AutoARIMA, AutoETS, AutoTheta

models = [
    AutoARIMA(season_length=7),  # Weekly seasonality
    AutoETS(season_length=7),
    AutoTheta(season_length=7),
]
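A hedged sketch of how these models are typically run; `models`, `train`, `HORIZON`, and `FREQ` refer to the configuration above, and the long-format columns are assumed:

```python
# Hedged sketch: run the statistical baselines on the training split.
from statsforecast import StatsForecast

sf = StatsForecast(models=models, freq=FREQ, n_jobs=-1)
sf_forecast = sf.forecast(df=train, h=HORIZON)  # one forecast column per model
```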
MLForecast: Configure lags based on temporal patterns
import lightgbm as lgb
from sklearn.ensemble import RandomForestRegressor
from mlforecast import MLForecast
from mlforecast.lag_transforms import RollingMean, ExponentiallyWeightedMean

mlf = MLForecast(
    models=[RandomForestRegressor(), lgb.LGBMRegressor()],
    freq="D",                # data frequency (required by recent mlforecast versions)
    lags=[7, 14, 21],        # Look-back periods
    lag_transforms={
        1: [RollingMean(window_size=7)],
        7: [ExponentiallyWeightedMean(alpha=0.3)],
    },
)
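A hedged usage sketch for the configured object; the default `unique_id`/`ds`/`y` column names and the `train`/`HORIZON` names from the configuration above are assumed:

```python
# Hedged sketch: fit the configured ML models and forecast the horizon.
mlf.fit(train)                      # expects unique_id / ds / y columns by default
ml_forecast = mlf.predict(HORIZON)  # h steps per series, one column per model
```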
NeuralForecast: Set input_size and max_steps for training
from neuralforecast.models import NHITS, NBEATS

models = [
    NHITS(h=horizon, input_size=horizon * 2, max_steps=100),
    NBEATS(h=horizon, input_size=horizon * 2, max_steps=100),
]
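A hedged sketch of training and forecasting with these models, assuming `train` and `FREQ` from the configuration above:

```python
# Hedged sketch: NeuralForecast wraps the models and handles the training loop.
from neuralforecast import NeuralForecast

nf = NeuralForecast(models=models, freq=FREQ)
nf.fit(train)               # long-format DataFrame: unique_id, ds, y
nn_forecast = nf.predict()  # forecasts h steps ahead per series
```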
Workflow
- Read template: Use the Read tool to get `assets/templates/benchmark_template.py`
- Customize parameters: Update DATA_PATH, HORIZON, FREQ based on user requirements
- Adjust models: Modify season_length, lags, or neural network parameters if the user specifies them
- Write script: Save customized benchmark to user's desired location
- Explain usage: Provide instructions for running and interpreting results
Output Files
The benchmark script generates two artifacts (a short sketch for reading them follows this list):
- `benchmark_results.csv` - Metrics table sorted by RMSE
- `benchmark_comparison.png` - 4-panel visualization (MAE, RMSE, MAPE, execution time)
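A minimal sketch for inspecting the ranked results; the `RMSE` column name matches the sort order described above but is otherwise an assumption about the template's output:

```python
# Hedged sketch: load the results table and show the top performers.
import pandas as pd

results = pd.read_csv("benchmark_results.csv")
print(results.sort_values("RMSE").head(3))  # three best rows by RMSE
```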
Trigger Patterns
Activate when users say:
- "Compare Nixtla models"
- "Benchmark TimeGPT vs StatsForecast"
- "Which model should I use?"
- "Create model comparison"
- "Test all Nixtla libraries"
- "Evaluate forecasting accuracy"
- "Model selection for time series"
Best Practices
- Fair comparison: Use same data split for all models
- Multiple metrics: Don't rely on single accuracy measure
- Consider speed: Training time matters in production
- Document trade-offs: Explain pros/cons of each model
- Statistical significance: Mention confidence intervals if possible
- Real-world context: Consider deployment constraints (API costs, GPU requirements)
- Reproducibility: Set random seeds for consistency (see the sketch after this list)
- Data requirements: Ensure sufficient history for training (minimum 2x horizon)
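A hedged sketch of the reproducibility and timing helpers such a script can use; the function and variable names here are illustrative, not part of the template:

```python
# Hedged sketch: seed the relevant RNGs and time each model family the same way.
import random
import time

import numpy as np

def set_seeds(seed: int = 42) -> None:
    random.seed(seed)
    np.random.seed(seed)
    # NeuralForecast runs may additionally need torch.manual_seed(seed)

set_seeds()
start = time.perf_counter()
# ... fit / forecast one model family here ...
elapsed_seconds = time.perf_counter() - start  # wall-clock time to report per model
```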
Common User Scenarios
Scenario 1: Quick comparison
User has CSV data and wants to see which model performs best.
- Generate standard benchmark with default parameters
- Use all 4 model families
- Explain top 3 performers
Scenario 2: Production selection
User needs to choose model for deployment.
- Emphasize execution time alongside accuracy
- Discuss API costs (TimeGPT) vs infrastructure costs (NeuralForecast GPU)
- Recommend based on accuracy/speed/cost trade-off
Scenario 3: Academic research
User wants comprehensive evaluation.
- Add statistical significance tests
- Suggest cross-validation instead of a single split (see the sketch after this list)
- Recommend sensitivity analysis on hyperparameters
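For the research scenario, a hedged sketch of rolling-window cross-validation with StatsForecast (MLForecast and NeuralForecast offer the same idea); `models`, `train`, `HORIZON`, and `FREQ` are assumed from the earlier configuration:

```python
# Hedged sketch: rolling-window cross-validation instead of a single split.
from statsforecast import StatsForecast

sf = StatsForecast(models=models, freq=FREQ)
cv_df = sf.cross_validation(
    df=train,
    h=HORIZON,    # forecast horizon per window
    n_windows=5,  # number of rolling evaluation windows
)
# cv_df contains y plus one prediction column per model for every cutoff window.
```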
Required Dependencies
The generated script requires:
pip install nixtla statsforecast mlforecast neuralforecast \
scikit-learn lightgbm pandas matplotlib seaborn
For NeuralForecast: PyTorch installation may be required (CPU or GPU version)
Example Interaction
User: "I want to compare all Nixtla models on my sales data. It's daily data with 2 years of history."
Your response:
- Read the template from `assets/templates/benchmark_template.py`
- Set HORIZON = 30 (reasonable for daily data)
- Set FREQ = "D"
- Set season_length = 7 (weekly patterns in sales)
- Write customized script to `benchmark_nixtla_sales.py`
- Explain: "Run with `python benchmark_nixtla_sales.py`. The script will train 9+ models and rank them by RMSE. Results in CSV and PNG files."
Notes
- Template is self-contained and executable
- All customization happens in configuration constants and model parameters
- Users can extend with additional models from each library
- Visualization provides quick insights without deep analysis