persistent-sector-caching

from smith6jt-cop

Cache yfinance sector data permanently to avoid Yahoo Finance rate limits. Trigger when: (1) 401 errors during sector fetching, (2) sector updates take too long, (3) need selective sector refresh.


When & Why to Use This Skill

This Claude skill implements a persistent caching layer for yfinance sector data, backed by SQLite, to overcome Yahoo Finance's aggressive rate limits and 401 "Invalid Crumb" errors. It speeds up data acquisition for large-scale financial datasets by tracking update timestamps, marking failed fetches so they are not retried indefinitely, and using conflict-aware SQL upserts that preserve existing cached columns.

Use Cases

  • Scenario 1: Large-scale financial data scraping where fetching data for 10,000+ symbols triggers IP-based rate limiting and 401 Unauthorized errors.
  • Scenario 2: Building local stock market databases that require stable company metadata (sector/industry) without the latency of repeated API calls.
  • Scenario 3: Optimizing data engineering pipelines to reduce processing time from hours to minutes by implementing incremental updates instead of full refreshes.
  • Scenario 4: Debugging and fixing database synchronization issues where 'INSERT OR REPLACE' statements inadvertently wipe out existing cached columns.

name: persistent-sector-caching
description: "Cache yfinance sector data permanently to avoid Yahoo Finance rate limits. Trigger when: (1) 401 errors during sector fetching, (2) sector updates take too long, (3) need selective sector refresh."
author: Claude Code
date: 2026-01-09

Persistent Sector Caching - Research Notes

Experiment Overview

| Item | Details |
| --- | --- |
| Date | 2026-01-09 |
| Goal | Prevent yfinance rate limit errors (401) during sector updates |
| Environment | Python 3.10-3.13, yfinance, SQLite |
| Status | Success |

Context

The training notebook was failing during STAGE 1b (sector updates) because Yahoo Finance aggressively rate-limits API requests. With ~11,000 equity symbols needing sector data, the system would hit 401 "Invalid Crumb" errors after fetching ~1,400-1,800 symbols regardless of parallelism or delays.

Symptoms:

  • HTTP Error 401: {"finance":{"error":{"code":"Unauthorized","description":"Invalid Crumb"}}}
  • Errors flood in after ~12-15% of symbols have been processed
  • Reducing workers from 8 to 3 didn't help
  • Adding delays between chunks didn't help
  • User had to abort and re-run, losing progress

Root Cause:

  • Yahoo Finance rate-limits by IP, not by request count
  • Multiple workers make rate limiting WORSE (more requests per second)
  • Sector data was being re-fetched for ALL symbols every run
  • "other" sector symbols were being retried unnecessarily

The Solution: Persistent Caching

Since sector data rarely changes (companies don't switch sectors often), cache it permanently:

  1. New column: sector_last_updated tracks when each symbol's sector was fetched
  2. Default behavior: Only fetch sectors for symbols where sector_last_updated IS NULL
  3. Failed fetches: Mark as fetched so we don't retry them
  4. Force refresh: Optional parameter for manual full refresh

Database Migration (v3)

-- Add column for tracking when sector was fetched
ALTER TABLE symbols ADD COLUMN sector_last_updated TEXT;

-- Mark existing non-'other' sectors as already fetched (preserve cache)
UPDATE symbols
SET sector_last_updated = datetime('now')
WHERE sector IS NOT NULL AND sector != 'other';
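
A minimal sketch of applying this migration idempotently from Python, gated on SQLite's PRAGMA user_version (the helper name and connection handling are illustrative, not the repository's actual migration code; DB_VERSION matches the Final Parameters section below):

import sqlite3

DB_VERSION = 3  # schema version that introduces sector_last_updated

def migrate_to_v3(conn: sqlite3.Connection) -> None:
    """Add sector_last_updated and backfill it for already-known sectors."""
    (current,) = conn.execute('PRAGMA user_version').fetchone()
    if current >= DB_VERSION:
        return  # already migrated; ALTER TABLE would fail if re-run
    conn.execute('ALTER TABLE symbols ADD COLUMN sector_last_updated TEXT')
    # Preserve the existing cache: any real sector counts as already fetched
    conn.execute(
        "UPDATE symbols SET sector_last_updated = datetime('now') "
        "WHERE sector IS NOT NULL AND sector != 'other'"
    )
    conn.execute(f'PRAGMA user_version = {DB_VERSION}')
    conn.commit()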

Modified Query Logic

def _get_symbols_missing_sector(self, force_refresh=False):
    if force_refresh:
        # Force refresh: get all symbols with 'other' or NULL sector
        query = """
            SELECT symbol FROM symbols
            WHERE asset_type = ? AND (sector IS NULL OR sector = 'other')
            AND NOT (symbol LIKE '%.%')
        """
    else:
        # Default: only symbols NEVER fetched (persistent cache)
        query = """
            SELECT symbol FROM symbols
            WHERE asset_type = ? AND sector_last_updated IS NULL
            AND NOT (symbol LIKE '%.%')
        """
    # Execute and flatten to a list of tickers
    # (assumes self.conn is the open sqlite3 connection)
    rows = self.conn.execute(query, ('equity',)).fetchall()
    return [row[0] for row in rows]

Update Logic

from datetime import datetime, timezone

now = datetime.now(timezone.utc).isoformat()  # ISO-8601 timestamp, stored as TEXT

# Update sector AND mark as fetched
cursor.execute(
    'UPDATE symbols SET sector = ?, sector_last_updated = ? WHERE symbol = ?',
    (sector, now, symbol)
)

# Mark failed fetches as fetched too (won't retry)
for symbol in failed_symbols:
    cursor.execute(
        'UPDATE symbols SET sector_last_updated = ? WHERE symbol = ? AND sector_last_updated IS NULL',
        (now, symbol)
    )
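
To show how the pieces fit together, here is a hedged, serial sketch of update_sectors(). The repository's real implementation uses worker pools and chunking, self.conn is an assumed sqlite3 connection, and Ticker(...).info is just one way yfinance exposes sector metadata:

from datetime import datetime, timezone

import yfinance as yf

def update_sectors(self, force_refresh=False, max_symbols=3000, verbose=False):
    symbols = self._get_symbols_missing_sector(force_refresh)[:max_symbols]
    now = datetime.now(timezone.utc).isoformat()
    failed = []
    for symbol in symbols:
        try:
            sector = yf.Ticker(symbol).info.get('sector') or 'other'
        except Exception:
            failed.append(symbol)
            continue
        # Lowercased to match the 'other' sentinel (an assumption here)
        self.conn.execute(
            'UPDATE symbols SET sector = ?, sector_last_updated = ? WHERE symbol = ?',
            (sector.lower(), now, symbol),
        )
    # Failed fetches are marked as fetched too, so they are never retried
    for symbol in failed:
        self.conn.execute(
            'UPDATE symbols SET sector_last_updated = ? WHERE symbol = ? AND sector_last_updated IS NULL',
            (now, symbol),
        )
    self.conn.commit()
    if verbose:
        print(f"Fetched {len(symbols) - len(failed)} sectors, {len(failed)} failed")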

Failed Attempts (Critical)

| Attempt | Why it Failed | Lesson Learned |
| --- | --- | --- |
| 8 parallel workers | 401 errors after ~1,400 symbols | More workers = faster rate-limit hit |
| Reduce to 3 workers | 401 errors after ~1,800 symbols | Worker count doesn't help; Yahoo limits by IP |
| 100-symbol chunks + 5s delays | Still hit 401 after 15-20% | Delays don't reset Yahoo's rate-limit window |
| Longer delays (10-15s) | Too slow (~4 hours for 11k symbols) | Not practical |
| Re-fetch 'other' sectors | Same symbols fail repeatedly | Some symbols just don't have sector data |
| Retry on 401 with backoff | Still fails, wastes time | 401 means "blocked", not "retry" |

Bug Fix: INSERT OR REPLACE Wipes Sector Cache (2026-01-11)

Problem: After implementing persistent caching, sector data was STILL being re-fetched on every run.

Root Cause: update_equities() and update_crypto() used INSERT OR REPLACE, which deletes the entire row before inserting. This wiped out the sector and sector_last_updated columns because they weren't included in the INSERT statement.

Fix: Changed to INSERT ... ON CONFLICT(symbol) DO UPDATE SET ..., which updates only the specified columns and preserves the sector data.

# BEFORE (broken): Wipes sector columns
cursor.execute('''
    INSERT OR REPLACE INTO symbols (symbol, asset_type, price, ...)
    VALUES (?, ?, ?, ...)
''', ...)

# AFTER (fixed): Preserves sector columns
cursor.execute('''
    INSERT INTO symbols (symbol, asset_type, price, ...)
    VALUES (?, ?, ?, ...)
    ON CONFLICT(symbol) DO UPDATE SET
        asset_type = excluded.asset_type,
        price = excluded.price,
        ...  -- sector and sector_last_updated NOT listed, so preserved
''', ...)

Lesson Learned: SQLite's INSERT OR REPLACE is actually DELETE + INSERT. Use INSERT ... ON CONFLICT ... DO UPDATE SET to preserve columns not listed in the statement.
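
The behavior is easy to reproduce in isolation. A self-contained demonstration with a toy three-column table (nothing here depends on the project's schema):

import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE symbols (symbol TEXT PRIMARY KEY, price REAL, sector TEXT)')
conn.execute("INSERT INTO symbols VALUES ('AAPL', 150.0, 'technology')")

# INSERT OR REPLACE: the old row is deleted, so sector falls back to NULL
conn.execute("INSERT OR REPLACE INTO symbols (symbol, price) VALUES ('AAPL', 151.0)")
print(conn.execute("SELECT sector FROM symbols WHERE symbol = 'AAPL'").fetchone())
# (None,)  <- cached sector wiped

conn.execute("UPDATE symbols SET sector = 'technology' WHERE symbol = 'AAPL'")

# Upsert: only the listed columns change, sector survives
conn.execute(
    "INSERT INTO symbols (symbol, price) VALUES ('AAPL', 152.0) "
    "ON CONFLICT(symbol) DO UPDATE SET price = excluded.price"
)
print(conn.execute("SELECT sector FROM symbols WHERE symbol = 'AAPL'").fetchone())
# ('technology',)  <- cached sector preserved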

Key Insights

  1. Yahoo rate limits by IP - No amount of delays or reduced parallelism helps once blocked
  2. Sector data is stable - Companies rarely change sectors, cache is safe
  3. 'other' is a valid result - Some symbols legitimately have no sector (ETFs, SPACs, etc.)
  4. Mark failures as fetched - Prevents retrying symbols that will never work
  5. First run is expensive - Accept that initial population takes 30-60+ min
  6. Subsequent runs are fast - Only ~100-500 new symbols per day

Final Parameters

# Database version
DB_VERSION = 3  # Added sector_last_updated

# Workers (doesn't matter much now, few symbols to fetch)
SECTOR_MAX_WORKERS = 3

# Max symbols per session (prevents timeout)
SECTOR_MAX_SYMBOLS = 3000

Timing Expectations

| Scenario | Symbols | Time |
| --- | --- | --- |
| First run (empty cache) | ~11,000 | 30-60+ min |
| Typical run (cached) | ~100-500 | 1-5 min |
| After migration | 0 | Instant |
| Force refresh (3k limit) | 3,000 | 10-20 min |

API Usage

from alpaca_trading.selection import SymbolDatabase

db = SymbolDatabase(db_path='data/symbol_database.db')

# Default: only fetch new symbols (persistent cache)
db.update_sectors(verbose=True)

# Check what would be fetched
stats = db.get_sector_cache_stats()
print(f"Never fetched: {stats['never_fetched']}")
print(f"Cached with sector: {stats['with_sector']}")
print(f"Cached as 'other': {stats['cached_other']}")

# Force refresh all 'other' sectors (use sparingly)
db.update_sectors(force_refresh=True, max_symbols=3000)
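
For reference, a minimal sketch of the counts get_sector_cache_stats() could return, assuming the schema above and a self.conn sqlite3 connection; the repository's actual query may differ:

def get_sector_cache_stats(self) -> dict:
    """Summarize cache state; a sketch, not the repository's exact query."""
    row = self.conn.execute("""
        SELECT
            SUM(CASE WHEN sector_last_updated IS NULL THEN 1 ELSE 0 END),
            SUM(CASE WHEN sector IS NOT NULL AND sector != 'other' THEN 1 ELSE 0 END),
            SUM(CASE WHEN sector = 'other' AND sector_last_updated IS NOT NULL THEN 1 ELSE 0 END)
        FROM symbols WHERE asset_type = 'equity'
    """).fetchone()
    return {'never_fetched': row[0], 'with_sector': row[1], 'cached_other': row[2]}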

Files Modified

| File | Changes |
| --- | --- |
| alpaca_trading/selection/symbol_database.py | Added sector_last_updated column; modified _get_symbols_missing_sector() and update_sectors(); added get_sector_cache_stats() |
| CLAUDE.md | Added "Persistent Sector Caching" section; updated Common Issues table |

References

  • Skill: dotted-symbol-exclusion - Related exclusion of special symbols
  • Skill: selection-data-caching - Earlier caching attempt (superseded)
  • Skill: persistent-cache-gap-filling - Similar pattern for OHLCV data
  • yfinance rate limiting: https://github.com/ranaroussi/yfinance/issues