---
name: cv-strategy
description: Cross-validation configuration and fold management for this competition
allowed-tools: Read, Grep, Glob
---

# CV Strategy

## Fold Configuration

```python
from sklearn.model_selection import StratifiedGroupKFold, StratifiedKFold

N_FOLDS = 5
SEED = 42

# Tabular: preserve the class balance of the target in every fold
skf = StratifiedKFold(n_splits=N_FOLDS, shuffle=True, random_state=SEED)

# Image with groups: stratify on the target and keep each group in a single fold
sgkf = StratifiedGroupKFold(n_splits=N_FOLDS, shuffle=True, random_state=SEED)
```

## Golden Rules

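1. **Same folds for ALL models** - OOF predictions line up row-for-row for stacking only when every model uses identical splits
2. **No data leakage** - Fit target encoding on the training part of each fold only, never on its validation rows
3. **Group awareness** - Samples from the same source (group) go in the same fold
4. **Reproducibility** - Always set `random_state` (together with `shuffle=True`)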
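
Rule 2 can be illustrated with a minimal sketch of out-of-fold mean-target encoding. The helper name `oof_target_encode` and the toy column names `city` and `target` are assumptions made for illustration, not part of this competition's code; smoothing is omitted for brevity.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold

N_FOLDS = 5
SEED = 42


def oof_target_encode(df, cat_col, target_col, n_splits=N_FOLDS, seed=SEED):
    """Return an out-of-fold mean-target encoding of `cat_col`.

    Each row is encoded with statistics computed from the other folds only,
    so a row's own target never contributes to its encoded value.
    """
    encoded = np.full(len(df), np.nan)
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, val_idx in skf.split(df, df[target_col]):
        train = df.iloc[train_idx]
        # Category means and the fallback prior come from the training part only.
        means = train.groupby(cat_col)[target_col].mean()
        prior = train[target_col].mean()
        encoded[val_idx] = (
            df[cat_col].iloc[val_idx].map(means).fillna(prior).to_numpy()
        )
    return encoded


# Toy data; `city` and `target` are illustrative column names.
toy = pd.DataFrame({
    "city": ["a", "b"] * 10,
    "target": [0, 1, 1, 0] * 5,
})
toy["city_te"] = oof_target_encode(toy, "city", "target")
```

Categories unseen in the training part fall back to that part's overall target mean, which is one reasonable choice among several.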

## Current Competition

- Competition: [Competition Name]
- Metric: [Evaluation Metric]
- Target: `target` column
- Groups: [Group column if applicable]

## Fold Splits (Saved)

```
models/folds.csv
- fold_0: train=[...], val=[...]
- fold_1: train=[...], val=[...]
...
```

## OOF Predictions

```
models/oof/
├── xgb_v1_oof.npy
├── lgb_v1_oof.npy
├── catboost_v1_oof.npy
└── efficientnet_b3_oof.npy
```
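
Files like these could be produced along the following lines. This is a sketch under stated assumptions: a binary target, `LogisticRegression` standing in for the actual models, synthetic data from `make_classification`, and a hypothetical helper `make_oof`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

N_FOLDS = 5
SEED = 42


def make_oof(model_factory, X, y, fold_ids):
    """Fit one fresh model per fold and return out-of-fold probabilities."""
    oof = np.zeros(len(y), dtype=float)
    for fold in np.unique(fold_ids):
        val_mask = fold_ids == fold
        model = model_factory()
        # Train on every other fold, predict only the held-out rows.
        model.fit(X[~val_mask], y[~val_mask])
        oof[val_mask] = model.predict_proba(X[val_mask])[:, 1]
    return oof


X, y = make_classification(n_samples=200, n_features=8, random_state=SEED)

# Derive fold ids once; in practice they would come from the saved fold file.
fold_ids = np.empty(len(y), dtype=int)
skf = StratifiedKFold(n_splits=N_FOLDS, shuffle=True, random_state=SEED)
for fold, (_, val_idx) in enumerate(skf.split(X, y)):
    fold_ids[val_idx] = fold

oof = make_oof(lambda: LogisticRegression(max_iter=1000), X, y, fold_ids)
# np.save("models/oof/logreg_v1_oof.npy", oof)  # hypothetical file name
```

Because every model is driven by the same `fold_ids`, row `i` of each OOF array is a prediction from a model that never saw row `i`, which is what a stacker needs.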

## Best CV Scores

| Model | CV Score | LB Score | Notes |
|-------|----------|----------|-------|
| XGBoost v1 | 0.8523 | 0.8501 | Baseline |
| LightGBM v1 | 0.8545 | 0.8520 | + target encoding |
| Ensemble v1 | 0.8612 | 0.8590 | XGB + LGB + CatBoost |

## Leakage Checklist

- [ ] Target encoding is fit on the training part of each fold only
- [ ] Time-based features use only information available at or before each row's timestamp
- [ ] Related samples (same group) are kept in the same fold
- [ ] No test data in feature engineering
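
The group item in the checklist can be verified mechanically. A minimal sketch, assuming a frame with `group` and `fold` columns (both names are illustrative) and a hypothetical helper `groups_split_across_folds`:

```python
import pandas as pd


def groups_split_across_folds(folds, group_col="group", fold_col="fold"):
    """Return the group ids whose rows land in more than one fold."""
    folds_per_group = folds.groupby(group_col)[fold_col].nunique()
    return folds_per_group[folds_per_group > 1].index.tolist()


clean = pd.DataFrame({"group": ["a", "a", "b", "b"], "fold": [0, 0, 1, 1]})
leaky = pd.DataFrame({"group": ["a", "a", "b", "b"], "fold": [0, 1, 1, 1]})

print(groups_split_across_folds(clean))  # []
print(groups_split_across_folds(leaky))  # ['a']
```

An empty list means no group straddles a fold boundary; any returned id points at rows to re-split.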
