cv-strategy

from NaoyaTakashima

Cross-validation configuration and fold management for this competition

Updated Jan 10, 2026

When & Why to Use This Skill

The cv-strategy skill helps machine learning practitioners manage cross-validation configuration and keep folds consistent across models. It provides standardized splitting strategies, rules for preventing data leakage, and a tracking layout for Out-of-Fold (OOF) predictions and leaderboard scores. Consistent folds and aligned OOF predictions are what make stacked and blended ensembles trustworthy.

Use Cases

  • Standardizing fold splits across multiple models (e.g., XGBoost, LightGBM) to ensure valid stacking and ensembling results.
  • Implementing StratifiedGroupKFold strategies to handle grouped data and prevent leakage between training and validation sets.
  • Monitoring the correlation between local Cross-Validation (CV) scores and Public Leaderboard (LB) scores to identify potential overfitting or validation gaps.
  • Automating the 'Leakage Checklist' to ensure feature engineering and target encoding are performed strictly within training folds.
  • Organizing and retrieving Out-of-Fold (OOF) predictions for systematic model performance analysis and meta-model training.
name: cv-strategy
description: Cross-validation configuration and fold management for this competition
allowed-tools: Read, Grep, Glob

CV Strategy

Fold Configuration

N_FOLDS = 5
SEED = 42

# Tabular data: stratify on the target
from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=N_FOLDS, shuffle=True, random_state=SEED)

# Image data with groups: stratify on the target, keep each group in a single fold
from sklearn.model_selection import StratifiedGroupKFold
sgkf = StratifiedGroupKFold(n_splits=N_FOLDS, shuffle=True, random_state=SEED)

Golden Rules

  1. Same folds for ALL models - Required for proper stacking, since OOF predictions must line up row by row
  2. No data leakage - Fit target encoding on the training rows of each fold only
  3. Group awareness - Samples from the same source go in the same fold
  4. Reproducibility - Always set random_state
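Rule 2 can be illustrated with a minimal sketch of out-of-fold mean target encoding. It assumes a pandas DataFrame and a precomputed array of fold ids; the function name `target_encode_oof` and its arguments are hypothetical, not part of this skill.

```python
import numpy as np
import pandas as pd

def target_encode_oof(df, cat_col, target_col, folds):
    """Mean-encode cat_col so each row only sees targets from other folds.

    `folds` is an array of fold ids, one per row (hypothetical layout).
    """
    folds = np.asarray(folds)
    encoded = np.full(len(df), np.nan)
    for k in np.unique(folds):
        val_mask = folds == k
        train = df[~val_mask]
        # Category means come from the training rows of this fold only.
        means = train.groupby(cat_col)[target_col].mean()
        # Categories unseen in the training rows fall back to the training mean.
        prior = train[target_col].mean()
        encoded[val_mask] = (
            df.loc[val_mask, cat_col].map(means).fillna(prior).to_numpy()
        )
    return encoded
```

Because the encoding for a validation row never uses that row's own target, the CV score is not inflated by the encoded feature. Test rows would typically be encoded with means computed over the full training set.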

Current Competition

  • Competition: [Competition Name]
  • Metric: [Evaluation Metric]
  • Target: target column
  • Groups: [Group column if applicable]

Fold Splits (Saved)

models/folds.csv
- fold_0: train=[...], val=[...]
- fold_1: train=[...], val=[...]
...

OOF Predictions

models/oof/
├── xgb_v1_oof.npy
├── lgb_v1_oof.npy
├── catboost_v1_oof.npy
└── efficientnet_b3_oof.npy
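The files above could be produced by a loop like the following sketch. It assumes a binary target, a precomputed array of fold ids, and a model object with scikit-learn style `fit` and `predict_proba` methods; `collect_oof` and `make_model` are hypothetical names.

```python
import numpy as np

def collect_oof(make_model, X, y, folds):
    """Fill an OOF vector: each row is predicted by a model that never saw it."""
    X, y, folds = np.asarray(X), np.asarray(y), np.asarray(folds)
    oof = np.full(len(y), np.nan)
    for k in np.unique(folds):
        val = folds == k
        model = make_model()  # fresh, unfitted model for every fold
        model.fit(X[~val], y[~val])
        # Keep the positive-class probability for the validation rows.
        oof[val] = model.predict_proba(X[val])[:, 1]
    # Every row should have been a validation row exactly once.
    assert not np.isnan(oof).any()
    return oof
```

The result can then be written with `np.save("models/oof/xgb_v1_oof.npy", oof)`. Since every model uses the same fold ids, the saved vectors align row by row and can be stacked as meta-model features.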

Best CV Scores

Model        | CV Score | LB Score | Notes
XGBoost v1   | 0.8523   | 0.8501   | Baseline
LightGBM v1  | 0.8545   | 0.8520   | + target encoding
Ensemble v1  | 0.8612   | 0.8590   | XGB + LGB + CatBoost
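A small sketch of how the CV and LB columns might be compared, using the scores listed above. The helper names `cv_lb_gaps` and `same_ranking` are hypothetical, and the sketch assumes a higher-is-better metric.

```python
# (CV score, LB score) per model, copied from the table above.
SCORES = {
    "XGBoost v1": (0.8523, 0.8501),
    "LightGBM v1": (0.8545, 0.8520),
    "Ensemble v1": (0.8612, 0.8590),
}

def cv_lb_gaps(scores):
    """CV minus LB for each model; a stable gap suggests CV tracks the LB."""
    return {name: round(cv - lb, 4) for name, (cv, lb) in scores.items()}

def same_ranking(scores):
    """True when models sort in the same order by CV score and by LB score."""
    by_cv = sorted(scores, key=lambda name: scores[name][0])
    by_lb = sorted(scores, key=lambda name: scores[name][1])
    return by_cv == by_lb

gaps = cv_lb_gaps(SCORES)
ranking_agrees = same_ranking(SCORES)
```

For these three entries the gap stays between 0.0022 and 0.0025 and the two rankings agree, which is the pattern one hopes for. A gap that grows as models get more complex would be a reason to revisit the leakage checklist.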

Leakage Checklist

  • Target encoding uses train fold only
  • Time-based features respect temporal order
  • Group-based splits for related samples
  • No test data in feature engineering
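The group-based item of the checklist can be verified mechanically. The sketch below assumes one group label and one fold id per row; `leaked_groups` is a hypothetical helper name.

```python
import numpy as np

def leaked_groups(groups, folds):
    """Return the set of groups that appear on both sides of any fold split."""
    groups, folds = np.asarray(groups), np.asarray(folds)
    leaked = set()
    for k in np.unique(folds):
        val = folds == k
        val_groups = set(groups[val].tolist())
        train_groups = set(groups[~val].tolist())
        # A group present in both sets leaks information into validation.
        leaked |= val_groups & train_groups
    return leaked
```

An empty result means every group lives in exactly one fold. Any returned label points at samples whose near-duplicates sit in both the training and validation rows.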