What is empirical-config-builder?

The Empirical Config Builder is a data-driven utility designed to replace arbitrary hardcoded parameters in quantitative systems with dynamic thresholds derived from real-world market data. By utilizing statistical percentiles and historical analysis, it ensures that selection criteria for assets—such as volume, price, and sector distribution—are automatically optimized and validated against current market conditions, eliminating the risks associated with 'magic numbers'.

When should I use empirical-config-builder?

empirical-config-builder is useful in the following scenarios: • Optimizing Trading Universes: Automatically adjust asset selection thresholds (like minimum volume or price) based on current market medians rather than static, outdated values to ensure a consistent flow of candidates. • Refactoring Legacy Codebases: Identify and replace arbitrary 'magic numbers' in configuration files with empirical derivations to improve system robustness, transparency, and maintainability. • Dynamic Portfolio Scaling: Calculate sector-specific filtering percentages to maintain a stable number of target candidates regardless of market expansion or contraction. • Statistical Risk Parameterization: Derive correlation thresholds and outlier filters using P75 or P99 percentiles to ensure risk management parameters are grounded in actual data distributions rather than theory alone.

name	empirical-config-builder
description	"Derive selection thresholds from market data instead of hardcoding. Trigger when: (1) reviewing hardcoded parameters, (2) volume/price thresholds seem arbitrary, (3) selection returns too many/few candidates."
author	Claude Code
date	2026-01-08

Empirical Config Builder - Research Notes

Experiment Overview

Item	Details
Date	2026-01-08
Goal	Replace hardcoded selection thresholds with data-driven values
Environment	Python 3.10+, SymbolDatabase, numpy
Status	Success

Context

Universe selection had many hardcoded "magic numbers":

MIN_VOLUME_USD_EQUITY = 1_000_000 - why $1M?
MIN_PRICE_EQUITY = 5.0 - why $5?
SECTOR_TOP_PCT = 0.30 - why 30%?

These values were originally guessed and never validated against actual market data. The empirical config builder derives these from the SymbolDatabase using percentiles.

Parameters Analysis

Can Be Data-Driven (6 parameters)

Parameter	Derivation Method	Code
min_volume_equity	P50 of daily volume	`volume_pct[50]`
min_price	P5 of equity prices	`price_pct[5]`
max_price	P99 of equity prices	`price_pct[99]`
sector_top_pct	`target_candidates / equities_passing_volume`	Calculated
min_per_sector	`median_sector_size / 10`	Calculated
max_per_sector	`median_sector_size`	Calculated

Should Stay Hardcoded (Theory-Based)

Parameter	Value	Why Fixed
hurst_short_target	(0.30, 0.50)	Literature: H<0.5 = mean-reverting
hurst_long_target	(0.50, 0.70)	Literature: H>0.5 = trending
half_life_target_hours	(4, 24)	Trading frequency constraint
regime_duration_target	(5, 20)	Markov model requirement
scoring weights	Sum to 1.0	Design decision

Verified Workflow

1. Basic Usage (Notebook)

# In training notebook cell-14:
USE_EMPIRICAL_THRESHOLDS = True   # Enable empirical mode
TARGET_CANDIDATES = 1500          # Target candidate count

# Thresholds are automatically derived in cell-16

2. Programmatic Usage

from alpaca_trading.selection import SymbolDatabase
from alpaca_trading.selection.empirical_config import build_config_from_database

db = SymbolDatabase(db_path='data/symbol_database.db')
result = build_config_from_database(
    db=db,
    target_candidates=1500,       # How many candidates you want
    volume_percentile=50,         # P50 = median (top 50% by volume)
    price_percentile_low=5,       # Exclude bottom 5% (penny stocks)
    price_percentile_high=99,     # Exclude top 1% (too expensive)
)

# Use the derived config
config = result.config

# See what was derived
print(result.describe())
# Output:
# ======================================================================
# EMPIRICAL CONFIGURATION (derived from market data)
# ======================================================================
#
# DERIVED THRESHOLDS:
#   min_volume_equity             : $180,432  [P50 of equity volume]
#   min_price                     : 1.25  [P5 of equity price]
#   max_price                     : 892.50  [P99 of equity price]
#   sector_top_pct                : 41.67%  [calculated for 1500 target candidates]
#   min_per_sector                : 45  [median_sector_size / 10]
#   max_per_sector                : 450  [median_sector_size]

3. With Correlation Estimation (Advanced)

from alpaca_trading.selection.empirical_config import build_full_empirical_config

result = build_full_empirical_config(
    db=db,
    data_fetcher=fetcher,         # Required for correlation
    target_candidates=1500,
    estimate_correlations=True,   # Compute actual correlations
)

# max_correlation is now derived from P75 of pairwise correlations
print(f"max_correlation: {result.config.max_correlation:.2f}")

Output Structure

@dataclass
class EmpiricalConfigResult:
    config: SelectionConfig       # Ready-to-use config
    thresholds: Dict[str, Any]    # All derived values
    derivation_method: Dict[str, str]  # How each was derived
    data_summary: Dict[str, Any]  # Market stats used

Failed Attempts (Critical)

Attempt	Why it Failed	Lesson Learned
Fetching snapshots directly	Redundant API calls when DB exists	Use SymbolDatabase
Fixed percentiles for all	Different markets need different P values	Crypto uses P25 for volume
Using mean instead of median	Outliers skew mean significantly	Always use median (P50)
Deriving Hurst targets	Theory-based, not market-dependent	Keep as hardcoded
Same min_per_sector everywhere	Small sectors need protection	Use median_sector_size / 10

Key Insights

Volume percentile choice matters:
- P25: Very inclusive (~7000 candidates)
- P50: Balanced (~3600 candidates)
- P75: Selective (~1800 candidates)
Price percentiles:
- P5 excludes penny stocks without guessing "$5"
- P99 excludes extremely expensive stocks naturally
Sector filtering auto-calculation:
- sector_top_pct = target_candidates / equities_passing_volume
- Clamped to [0.15, 0.50] to prevent extremes
- min/max per sector derived from actual sector sizes
Correlation threshold:
- P75 of pairwise correlations is a reasonable threshold
- Computing this requires historical data (expensive)
- Optional - default 0.60 is usually fine

Files Modified

File	Changes
`alpaca_trading/selection/empirical_config.py`	Added `build_config_from_database()`, `EmpiricalConfigResult`
`notebooks/training.ipynb`	Added `USE_EMPIRICAL_THRESHOLDS` option
`CLAUDE.md`	Added empirical config documentation

Typical Results

Parameter	Hardcoded	Empirical (P50)
min_volume_equity	$1,000,000	$180,432
min_price	$5.00	$1.25
max_price	$10,000	$892.50
sector_top_pct	30%	42%

Observation: Hardcoded values were MORE restrictive than P50 (median). This explains why selection sometimes returned fewer candidates than expected.

References

Skill: symbol-database-selection - SymbolDatabase infrastructure
Skill: per-sector-candidate-filtering - Sector filtering parameters
Skill: symbol-selection-statistical - Statistical selection theory

empirical-config-builder

When & Why to Use This Skill

Use Cases