# cv-strategy

Cross-validation configuration and fold management for this competition.
## When & Why to Use This Skill
The cv-strategy skill helps machine learning practitioners manage cross-validation configuration and fold consistency. It streamlines model evaluation by providing standardized splitting strategies, preventing data leakage, and rigorously tracking Out-of-Fold (OOF) predictions and leaderboard scores, which is essential for building reliable ensemble models.
## Use Cases
- Standardizing fold splits across multiple models (e.g., XGBoost, LightGBM) to ensure valid stacking and ensembling results.
- Implementing StratifiedGroupKFold strategies to handle grouped data and prevent leakage between training and validation sets.
- Monitoring the correlation between local Cross-Validation (CV) scores and Public Leaderboard (LB) scores to identify potential overfitting or validation gaps.
- Automating the 'Leakage Checklist' to ensure feature engineering and target encoding are performed strictly within training folds.
- Organizing and retrieving Out-of-Fold (OOF) predictions for systematic model performance analysis and meta-model training.
| name | cv-strategy |
|---|---|
| description | Cross-validation configuration and fold management for this competition |
| allowed-tools | Read, Grep, Glob |
## CV Strategy

### Fold Configuration
```python
N_FOLDS = 5
SEED = 42

# Tabular
from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=N_FOLDS, shuffle=True, random_state=SEED)

# Image with groups
from sklearn.model_selection import StratifiedGroupKFold
sgkf = StratifiedGroupKFold(n_splits=N_FOLDS, shuffle=True, random_state=SEED)
```
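As a minimal sketch of how these splitters assign folds (toy arrays stand in for the real training data), each row lands in exactly one validation fold:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

N_FOLDS = 5
SEED = 42

# Toy data standing in for the competition training set.
X = np.arange(20).reshape(-1, 1)
y = np.array([0, 1] * 10)  # binary target to stratify on

skf = StratifiedKFold(n_splits=N_FOLDS, shuffle=True, random_state=SEED)
fold_of = np.full(len(y), -1)
for fold, (train_idx, val_idx) in enumerate(skf.split(X, y)):
    fold_of[val_idx] = fold  # each row gets exactly one validation fold
```

Because `random_state` is fixed, rerunning this assignment yields identical folds, which is what makes the "same folds for all models" rule enforceable.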
### Golden Rules
- **Same folds for ALL models**: required for proper stacking
- **No data leakage**: target encoding within fold only
- **Group awareness**: same source → same fold
- **Reproducibility**: always set random_state
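The "target encoding within fold only" rule can be sketched with pandas (the column names `cat`, `y`, and `fold` are hypothetical): each validation fold is encoded using category means fitted on the other folds only.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: categorical feature `cat`, binary target `y`,
# and a precomputed `fold` assignment.
df = pd.DataFrame({
    "cat":  ["a", "a", "b", "b", "a", "b", "a", "b"],
    "y":    [1, 0, 1, 1, 0, 0, 1, 1],
    "fold": [0, 1, 0, 1, 0, 1, 0, 1],
})

df["cat_te"] = np.nan
global_mean = df["y"].mean()
for fold in sorted(df["fold"].unique()):
    trn = df[df["fold"] != fold]            # encoding fitted on training folds only
    means = trn.groupby("cat")["y"].mean()  # never sees the validation fold's targets
    mask = df["fold"] == fold
    df.loc[mask, "cat_te"] = df.loc[mask, "cat"].map(means).fillna(global_mean)
```

Computing the means on the full frame instead of `trn` would leak each row's own target into its feature, inflating CV relative to LB.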
### Current Competition
- Competition: [Competition Name]
- Metric: [Evaluation Metric]
- Target: [Target column]
- Groups: [Group column if applicable]
### Fold Splits (Saved)
Stored at `models/folds.csv`:

- fold_0: train=[...], val=[...]
- fold_1: train=[...], val=[...]
- ...
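One way to persist fold assignments so every model reads the same splits (a sketch; an in-memory buffer stands in for `models/folds.csv` here):

```python
import io

import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold

# Assign each row a fold id once, persist it, and have every model read the same file.
y = pd.Series([0, 1] * 10)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
folds = pd.DataFrame({"row_id": y.index, "fold": -1})
for fold, (_, val_idx) in enumerate(skf.split(np.zeros((len(y), 1)), y)):
    folds.loc[val_idx, "fold"] = fold

buf = io.StringIO()              # stands in for models/folds.csv
folds.to_csv(buf, index=False)
buf.seek(0)
loaded = pd.read_csv(buf)

# Any model reconstructs identical splits from the saved fold column.
val_idx_fold0 = loaded.index[loaded["fold"] == 0].to_numpy()
```

Reading the saved `fold` column, rather than re-running the splitter inside each training script, removes any chance of fold drift between models.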
### OOF Predictions
```
models/oof/
├── xgb_v1_oof.npy
├── lgb_v1_oof.npy
├── catboost_v1_oof.npy
└── efficientnet_b3_oof.npy
```
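These arrays are typically consumed together for blending or meta-model training. A sketch with random stand-ins (real code would `np.load` the `.npy` files above):

```python
import numpy as np

# Random stand-ins; real code would np.load("models/oof/xgb_v1_oof.npy"), etc.
rng = np.random.default_rng(42)
y_true = rng.integers(0, 2, size=100)
oof = {
    "xgb_v1": np.clip(y_true + rng.normal(0, 0.3, size=100), 0, 1),
    "lgb_v1": np.clip(y_true + rng.normal(0, 0.3, size=100), 0, 1),
}

# Rows align across models because all used the same folds, so averaging is valid.
blend = np.mean(list(oof.values()), axis=0)
```

This row alignment is exactly why the same-folds rule matters: OOF arrays from mismatched folds cannot be blended or stacked row-wise.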
### Best CV Scores
| Model | CV Score | LB Score | Notes |
|---|---|---|---|
| XGBoost v1 | 0.8523 | 0.8501 | Baseline |
| LightGBM v1 | 0.8545 | 0.8520 | + target encoding |
| Ensemble v1 | 0.8612 | 0.8590 | XGB + LGB + CatBoost |
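The CV→LB gap from this table can be tracked per submission; a widening gap is the overfitting signal mentioned in the use cases. A minimal sketch using the numbers above:

```python
# CV/LB pairs from the score table; a widening gap flags an unreliable CV setup.
scores = [
    ("XGBoost v1",  0.8523, 0.8501),
    ("LightGBM v1", 0.8545, 0.8520),
    ("Ensemble v1", 0.8612, 0.8590),
]
gaps = {name: round(cv - lb, 4) for name, cv, lb in scores}
max_gap = max(gaps.values())
```

Here the gap stays in a narrow band (about 0.002), which suggests local CV is tracking the leaderboard well.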
### Leakage Checklist
- [ ] Target encoding uses train fold only
- [ ] Time-based features respect temporal order
- [ ] Group-based splits for related samples
- [ ] No test data in feature engineering