ml-fundamentals

from pluginagentmarketplace

Master machine learning foundations - algorithms, preprocessing, feature engineering, and evaluation

1 star · 1 fork · Updated Jan 5, 2026

When & Why to Use This Skill

The ML Fundamentals skill is a comprehensive toolkit designed to streamline the end-to-end machine learning workflow. It automates critical stages including data preprocessing, feature engineering, and rigorous model evaluation using industry-standard best practices. By leveraging Scikit-learn pipelines, it ensures reproducible results and helps developers avoid common pitfalls like data leakage, making it an essential asset for building robust predictive models and mastering data science foundations.

Use Cases

  • Automating Data Preprocessing: Efficiently manage missing values and feature scaling to ensure high-quality input for machine learning models.
  • Implementing Feature Engineering: Enhance model performance by creating interaction terms, polynomial features, and discretized bins from raw variables.
  • Standardizing Model Evaluation: Execute robust cross-validation and performance metrics (Precision, Recall, F1) to validate model stability and generalization.
  • Building Reproducible ML Pipelines: Construct end-to-end workflows using Scikit-learn to maintain consistency from development to production and prevent data leakage.

Skill Metadata

name: ml-fundamentals
description: Master machine learning foundations - algorithms, preprocessing, feature engineering, and evaluation
version: "1.4.0"
sasmp_version: "1.4.0"
bonded_agent: 01-ml-fundamentals
bond_type: PRIMARY_BOND
- name: test_size
  type: float
  validation: "0.1 <= x <= 0.4"
  default: 0.2
strategy: exponential_backoff
max_attempts: 3
base_delay_ms: 1000
level: info
metrics: [execution_time, memory_usage, data_shape]

ML Fundamentals Skill

Master the building blocks of machine learning: from raw data to trained models.

Quick Start

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier

# 1. Split the data (assumes feature matrix X and labels y are already loaded)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# 2. Create pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', RandomForestClassifier(random_state=42))
])

# 3. Train and evaluate
pipeline.fit(X_train, y_train)
score = pipeline.score(X_test, y_test)
print(f"Accuracy: {score:.4f}")

Key Topics

1. Data Preprocessing

| Step | Purpose | Implementation |
| --- | --- | --- |
| Missing values | Handle NaN/None | SimpleImputer(strategy='median') |
| Scaling | Normalize ranges | StandardScaler() or MinMaxScaler() |
| Encoding | Convert categories | OneHotEncoder() or OrdinalEncoder() (LabelEncoder is intended for targets, not features) |
| Outliers | Remove extremes | IQR method or Z-score |

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer

# Define column types
numeric_features = ['age', 'income', 'score']
categorical_features = ['gender', 'city', 'category']

# Create preprocessor
preprocessor = ColumnTransformer([
    ('num', Pipeline([
        ('imputer', SimpleImputer(strategy='median')),
        ('scaler', StandardScaler())
    ]), numeric_features),
    ('cat', Pipeline([
        ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
        ('encoder', OneHotEncoder(handle_unknown='ignore'))
    ]), categorical_features)
])
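
The preprocessor can then be chained with an estimator so that imputation, scaling, and encoding are all fit inside a single pipeline. A minimal, self-contained sketch; the toy DataFrame, its values, and the RandomForestClassifier choice are illustrative assumptions, not part of the skill itself:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier

# Toy data matching the column names above (values are made up)
df = pd.DataFrame({
    'age': [25, 32, np.nan, 41],
    'income': [50_000, 64_000, 58_000, np.nan],
    'score': [0.7, 0.4, 0.9, 0.6],
    'gender': ['F', 'M', np.nan, 'F'],
    'city': ['NYC', 'LA', 'NYC', 'SF'],
    'category': ['a', 'b', 'a', 'c'],
})
y = [0, 1, 0, 1]

numeric_features = ['age', 'income', 'score']
categorical_features = ['gender', 'city', 'category']

preprocessor = ColumnTransformer([
    ('num', Pipeline([
        ('imputer', SimpleImputer(strategy='median')),
        ('scaler', StandardScaler()),
    ]), numeric_features),
    ('cat', Pipeline([
        ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
        ('encoder', OneHotEncoder(handle_unknown='ignore')),
    ]), categorical_features),
])

# Chaining preprocessing and model keeps fit/predict leak-free:
# all statistics are learned inside fit(), on training data only
model = Pipeline([
    ('preprocess', preprocessor),
    ('classifier', RandomForestClassifier(random_state=42)),
])
model.fit(df, y)
print(model.predict(df).shape)  # one prediction per row
```

Because the preprocessor lives inside the pipeline, calling `model.fit` on the training split alone guarantees that imputation medians and scaler statistics never see test data.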

2. Feature Engineering

| Technique | Use Case | Example |
| --- | --- | --- |
| Polynomial | Non-linear relationships | PolynomialFeatures(degree=2) |
| Binning | Discretize continuous variables | KBinsDiscretizer(n_bins=5) |
| Log transform | Right-skewed data | np.log1p(x) |
| Interaction | Feature combinations | x1 * x2 |
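
Each technique in the table maps to a one-liner in scikit-learn. A minimal sketch on an invented toy array:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures, KBinsDiscretizer

X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]])

# Polynomial expansion of 2 features at degree=2 yields
# [1, x1, x2, x1^2, x1*x2, x2^2] -> 6 columns, interactions included
X_poly = PolynomialFeatures(degree=2).fit_transform(X)

# Binning: discretize each continuous column into ordinal bins
binner = KBinsDiscretizer(n_bins=2, encode='ordinal', strategy='uniform')
X_binned = binner.fit_transform(X)

# Log transform compresses right-skewed, non-negative data;
# log1p(x) = log(1 + x) stays defined at x = 0
skewed = np.array([0.0, 10.0, 100.0, 1000.0])
X_log = np.log1p(skewed)

print(X_poly.shape)  # (4, 6)
```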

3. Model Evaluation

from sklearn.model_selection import cross_val_score
from sklearn.metrics import classification_report

# Cross-validation
cv_scores = cross_val_score(model, X, y, cv=5, scoring='f1_weighted')
print(f"CV F1: {cv_scores.mean():.4f} (+/- {cv_scores.std()*2:.4f})")

# Detailed report
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))

4. Cross-Validation Strategies

| Strategy | When to Use |
| --- | --- |
| KFold | Standard, balanced data |
| StratifiedKFold | Imbalanced classification |
| TimeSeriesSplit | Temporal data |
| GroupKFold | Grouped samples |
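
Two of these splitters are worth seeing in action. A short sketch on invented data showing what each one guarantees:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, TimeSeriesSplit

X = np.arange(20).reshape(10, 2)
y = np.array([0] * 8 + [1] * 2)  # imbalanced: only 2 positives

# StratifiedKFold preserves the class ratio in every fold,
# so each test fold here receives exactly one positive sample
skf = StratifiedKFold(n_splits=2, shuffle=True, random_state=42)
for _, test_idx in skf.split(X, y):
    assert y[test_idx].sum() == 1

# TimeSeriesSplit never lets training indices come after test indices,
# which is what prevents look-ahead leakage on temporal data
tss = TimeSeriesSplit(n_splits=3)
for train_idx, test_idx in tss.split(X):
    assert train_idx.max() < test_idx.min()
```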

Best Practices

DO

  • Split data BEFORE any preprocessing
  • Use pipelines for reproducibility
  • Stratify splits for classification
  • Log all preprocessing parameters
  • Version your feature engineering code

DON'T

  • Don't fit on test data
  • Don't ignore data leakage
  • Don't use accuracy for imbalanced data
  • Don't hard-code parameters
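
The first two DON'Ts can be made concrete. A small sketch on synthetic data showing why the scaler must be fit on the training split only:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
X_train, X_test = X[:80], X[80:]

# WRONG: fitting on all rows bakes test-set statistics into the scaler
scaler_leaky = StandardScaler().fit(X)

# RIGHT: fit on the training split, then reuse those statistics for test
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# The two scalers learned different means: that difference is the leak
print(scaler_leaky.mean_ - scaler.mean_)
```

The leaked version would make cross-validation scores look slightly better than the model deserves, which is exactly the failure mode pipelines are designed to rule out.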

Exercises

Exercise 1: Basic Pipeline

# TODO: Create a pipeline that:
# 1. Imputes missing values
# 2. Scales features
# 3. Trains a logistic regression

Exercise 2: Cross-Validation

# TODO: Implement 5-fold stratified CV
# and report mean and std of F1 score

Unit Test Template

import pytest
import numpy as np
from sklearn.datasets import make_classification

def test_preprocessing_pipeline():
    """Test that preprocessing handles missing values."""
    X, y = make_classification(n_samples=100, n_features=10)
    X[0, 0] = np.nan  # introduce a missing value

    pipeline = create_preprocessing_pipeline()  # your pipeline factory under test
    X_transformed = pipeline.fit_transform(X)

    assert not np.isnan(X_transformed).any()
    assert X_transformed.shape[0] == X.shape[0]

def test_no_data_leakage():
    """Verify preprocessing uses training statistics only."""
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler

    X, _ = make_classification(n_samples=100, n_features=10)
    X_train, X_test = X[:80], X[80:]

    pipeline = Pipeline([('scaler', StandardScaler())])
    pipeline.fit(X_train)
    X_test_transformed = pipeline.transform(X_test)

    # Transforming the test split reuses statistics learned from training data
    assert pipeline.named_steps['scaler'].mean_ is not None
    assert X_test_transformed.shape == X_test.shape

Troubleshooting

| Problem | Cause | Solution |
| --- | --- | --- |
| NaN in predictions | Missing imputer | Add SimpleImputer to the pipeline |
| Shape mismatch | Inconsistent features | Use ColumnTransformer |
| Memory error | Too many one-hot features | Use max_categories or feature hashing |
| Unstable CV scores | Data leakage | Check preprocessing order |


Version: 1.4.0 | Status: Production Ready
