feature-stores
Master feature stores - Feast, data validation, versioning, online/offline serving
When & Why to Use This Skill
Master the implementation and management of production-grade feature stores for machine learning systems. This skill provides comprehensive expertise in using Feast for feature orchestration, ensuring data integrity with Great Expectations, and managing dataset lineage with DVC. It bridges the gap between data engineering and model deployment by optimizing both high-throughput offline training and low-latency online inference.
Use Cases
- Case 1: Building a centralized feature registry to enable cross-team feature sharing and discovery, reducing redundant engineering efforts in large organizations.
- Case 2: Implementing real-time feature serving for low-latency applications like fraud detection or recommendation engines using Redis-backed online stores.
- Case 3: Establishing automated data validation pipelines to detect and prevent training-serving skew and data drift before they impact model performance.
- Case 4: Managing complex ML experiment reproducibility by versioning large-scale datasets and feature sets using DVC and Git-based workflows.
| name | feature-stores |
|---|---|
| version | "2.0.0" |
| sasmp_version | "1.3.0" |
| description | Master feature stores - Feast, data validation, versioning, online/offline serving |
| bonded_agent | 03-data-pipelines |
| bond_type | PRIMARY_BOND |
| category | data_engineering |
| difficulty | intermediate_to_advanced |
| estimated_hours | 35 |
Feature Stores Skill
Learn: Build production feature stores for ML systems.
Skill Overview
| Attribute | Value |
|---|---|
| Bonded Agent | 03-data-pipelines |
| Difficulty | Intermediate to Advanced |
| Duration | 35 hours |
| Prerequisites | mlops-basics |
Learning Objectives
- Understand feature store architecture
- Implement features with Feast
- Validate data quality with Great Expectations
- Serve features online and offline
- Version datasets with DVC
Topics Covered
Module 1: Feature Store Architecture (8 hours)
Components:
┌─────────────────────────────────────────────────────────────┐
│ FEATURE STORE ARCHITECTURE │
├─────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Offline │ │ Feature │ │ Online │ │
│ │ Store │───▶│ Registry │◀───│ Store │ │
│ │ (Parquet) │ │ (Metadata) │ │ (Redis) │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ [Training] [Discovery] [Inference] │
│ │
└─────────────────────────────────────────────────────────────┘
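In Feast, this layout is declared in a single configuration file. A minimal sketch of a feature_store.yaml matching the diagram, assuming a local file-based registry and offline store plus a Redis online store (the project name, paths, and connection string are illustrative placeholders):
# feature_store.yaml -- minimal local configuration (illustrative values)
project: ecommerce
registry: data/registry.db        # feature registry (metadata)
provider: local
offline_store:
  type: file                      # Parquet files on disk
online_store:
  type: redis                     # low-latency serving
  connection_string: "localhost:6379"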
Exercises:
- Design feature store for e-commerce use case
- Compare Feast vs Tecton vs Hopsworks
Module 2: Feast Implementation (12 hours)
Feature Definition Example (current Feast Field/schema API):
from datetime import timedelta

from feast import Entity, FeatureView, Field, FileSource
from feast.types import Float32, Int64

# Entity definition (the join key used to look up features)
customer = Entity(
    name="customer_id",
    join_keys=["customer_id"],
    description="Customer identifier",
)

# Batch source backing the feature view
customer_stats_source = FileSource(
    path="data/customer_stats.parquet",
    timestamp_field="event_timestamp",
)

# Feature view
customer_features = FeatureView(
    name="customer_features",
    entities=[customer],
    ttl=timedelta(days=7),
    schema=[
        Field(name="total_purchases", dtype=Float32),
        Field(name="avg_order_value", dtype=Float32),
        Field(name="days_since_last_order", dtype=Int64),
    ],
    online=True,
    source=customer_stats_source,
)
Exercises:
- Set up Feast repository locally
- Create entity and feature views
- Materialize features to online store
- Retrieve features for training and inference (see the sketch below)
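A minimal sketch of the last two exercises, assuming the repository above has been applied with feast apply and that the Parquet source contains data; entity IDs and timestamps are illustrative:
from datetime import datetime

import pandas as pd
from feast import FeatureStore

store = FeatureStore(repo_path=".")

# Load recent feature values into the online store (Redis)
store.materialize_incremental(end_date=datetime.utcnow())

# Online retrieval for low-latency inference (illustrative entity ID)
online_features = store.get_online_features(
    features=[
        "customer_features:total_purchases",
        "customer_features:avg_order_value",
    ],
    entity_rows=[{"customer_id": 1001}],
).to_dict()

# Point-in-time-correct retrieval for training
entity_df = pd.DataFrame({
    "customer_id": [1001, 1002],
    "event_timestamp": [datetime(2024, 6, 1), datetime(2024, 6, 2)],
})
training_df = store.get_historical_features(
    entity_df=entity_df,
    features=["customer_features:total_purchases"],
).to_df()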
Module 3: Data Validation (8 hours)
Great Expectations Setup (GX 1.x API):
import great_expectations as gx

# Get a data context (ephemeral by default, file-backed if configured)
context = gx.get_context()

# Create a validation suite
suite = context.suites.add(gx.ExpectationSuite(name="ml_data_validation"))

# Add expectations
suite.add_expectation(
    gx.expectations.ExpectColumnValuesToNotBeNull(
        column="target",
        mostly=0.99,
    )
)
suite.add_expectation(
    gx.expectations.ExpectColumnMeanToBeBetween(
        column="feature_a",
        min_value=0.0,
        max_value=100.0,
    )
)
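To actually run the suite, it is validated against a batch of data. A sketch continuing from the setup above, assuming the GX 1.x fluent API and an in-memory pandas DataFrame (the source, asset, and batch definition names are arbitrary):
import pandas as pd

df = pd.DataFrame({"target": [0, 1, 1], "feature_a": [10.0, 25.0, 40.0]})

# Register an in-memory pandas source and a whole-dataframe batch definition
batch_def = (
    context.data_sources.add_pandas("pandas_src")
    .add_dataframe_asset("training_frame")
    .add_batch_definition_whole_dataframe("full_batch")
)
batch = batch_def.get_batch(batch_parameters={"dataframe": df})

# Validate the batch against the suite and inspect the overall outcome
results = batch.validate(suite)
print(results.success)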
Module 4: Data Versioning (7 hours)
DVC Workflow:
# Initialize DVC inside an existing Git repository
dvc init

# Track the dataset with DVC (creates a small .dvc pointer file)
dvc add data/training_data.parquet

# Commit the pointer file to Git so the version is recorded
git add data/training_data.parquet.dvc data/.gitignore
git commit -m "Track training data"

# Configure a remote (illustrative S3 bucket) and push the data
dvc remote add -d storage s3://my-bucket/dvc-store
dvc push

# Check out a specific tagged version later
git checkout v1.0.0
dvc checkout
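Once versions are tagged, training code can pull a specific revision programmatically. A sketch using the dvc.api module, with the path and tag mirroring the workflow above:
# Load a specific data version from Python via dvc.api
import io

import dvc.api
import pandas as pd

# Read the raw bytes of the file as it existed at the v1.0.0 tag
data = dvc.api.read("data/training_data.parquet", rev="v1.0.0", mode="rb")
df = pd.read_parquet(io.BytesIO(data))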
Code Templates
Template: Feature Engineering Pipeline
# templates/feature_pipeline.py
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class FeaturePipeline(BaseEstimator, TransformerMixin):
    """Production feature engineering pipeline."""

    def __init__(self, config: dict):
        self.config = config

    def fit(self, X: pd.DataFrame, y=None):
        """Learn feature statistics from the training data."""
        numeric = X.select_dtypes(include=["number"])
        self.means_ = numeric.mean()
        # Guard against zero variance to avoid division by zero
        self.stds_ = numeric.std().replace(0.0, 1.0)
        return self

    def transform(self, X: pd.DataFrame) -> pd.DataFrame:
        """Apply the learned transformations to new data."""
        X = X.copy()

        # Numerical normalization using statistics learned in fit(),
        # so training and serving apply identical transformations
        for col in self.means_.index:
            X[f"{col}_normalized"] = (X[col] - self.means_[col]) / self.stds_[col]

        # Temporal features derived from configured datetime columns
        for col in self.config.get("datetime_columns", []):
            dt = pd.to_datetime(X[col])
            X[f"{col}_hour"] = dt.dt.hour
            X[f"{col}_dow"] = dt.dt.dayofweek

        return X
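A quick usage sketch of the template on a toy frame (column names are illustrative):
import pandas as pd

df = pd.DataFrame({
    "amount": [10.0, 20.0, 30.0],
    "created_at": pd.to_datetime(
        ["2024-06-01 08:00", "2024-06-01 12:30", "2024-06-02 18:45"]
    ),
})

pipeline = FeaturePipeline(config={"datetime_columns": ["created_at"]})
features = pipeline.fit(df).transform(df)
print(features.columns.tolist())
# ['amount', 'created_at', 'amount_normalized', 'created_at_hour', 'created_at_dow']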
Troubleshooting Guide
| Issue | Cause | Solution |
|---|---|---|
| Slow feature serving | Online store bottleneck | Scale Redis, add caching |
| Training-serving skew | Different transformations | Use unified feature pipeline |
| Stale features | Materialization lag | Increase refresh frequency |
Resources
- Feast Documentation: https://docs.feast.dev
- Great Expectations Docs: https://docs.greatexpectations.io
- DVC Documentation: https://dvc.org/doc
- [See: training-pipelines] - Use features in training
Version History
| Version | Date | Changes |
|---|---|---|
| 2.0.0 | 2024-12 | Production-grade with Feast examples |
| 1.0.0 | 2024-11 | Initial release |