batch-ingest

From houfu

Process large volumes of bank statements (50+ PDFs) in batches with checkpointing and progress tracking. Orchestrates the standard ingestion skill across multiple batches for resumable processing.

Updated Jan 3, 2026

When & Why to Use This Skill

This Claude skill automates the high-volume ingestion of financial documents, specifically bank statements, by processing them in manageable batches. It features robust progress tracking, checkpointing, and resumable workflows to ensure data integrity and efficiency when handling 50+ PDFs, making it ideal for large-scale financial data migration and automated bookkeeping.

Use Cases

  • Bulk Financial Data Entry: Automatically process and categorize hundreds of monthly bank statements into a central database for accounting and tax preparation.
  • Audit and Compliance: Efficiently ingest years of historical financial records with granular progress tracking to ensure 100% data coverage and auditability.
  • Resumable Document Workflows: Manage large-scale PDF processing tasks that can be paused and resumed across different sessions without duplicating work or losing progress.
  • Error-Tolerant Data Extraction: Handle high-volume document ingestion where individual file errors are isolated, allowing the rest of the batch to complete while providing a clear report for manual review.
name: batch-ingest
description: Process large volumes of bank statements (50+ PDFs) in batches with checkpointing and progress tracking. Orchestrates the standard ingestion skill across multiple batches for resumable processing.

Batch Ingestion Skill

Purpose

Process large volumes of bank statements (50+ PDFs) in manageable batches with:

  • Progress tracking via TodoWrite
  • User verification between batches
  • Resume capability using database ingestion_log
  • Error recovery without losing progress

When to Use

  • Processing 50+ PDFs at once
  • Reviewing progress between batches
  • Pausing and resuming across sessions
  • Checkpointing in case of errors

Prerequisites

  • Database initialized (uv run python scripts/init_db.py)
  • PDFs in staging folder (data/statements/staging/)
  • Dashboard stopped (docker compose down)
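The prerequisites above can be verified with a short pre-flight sketch. The staging path follows the layout named in this skill; the database location `data/finance.db` is an assumption for illustration, not confirmed by the skill itself:

```python
from pathlib import Path

def count_staged_pdfs(staging_dir: str) -> int:
    """Count PDFs waiting in the staging folder (returns 0 if the folder is missing)."""
    return len(list(Path(staging_dir).glob("*.pdf")))

if __name__ == "__main__":
    n = count_staged_pdfs("data/statements/staging")
    print(f"Staged PDFs: {n}")
    # Assumed database location -- adjust to your config["database_path"].
    if not Path("data/finance.db").exists():
        print("Database not initialized -- run: uv run python scripts/init_db.py")
```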

Workflow

Phase 1: Discover and Plan

  1. Count PDFs in staging:
ls data/statements/staging/*.pdf | wc -l
  2. Ask user for batch size:
  • Recommend 10-15 PDFs per batch
  • Smaller batches = more checkpoints
  • Larger batches = faster but less granular
  3. Create TodoWrite plan:
total_pdfs = 73  # from ls count
batch_size = 15
num_batches = (total_pdfs + batch_size - 1) // batch_size

todos = [
    {"content": f"Process batch {i+1} of {num_batches} (up to {batch_size} PDFs)",
     "status": "pending",
     "activeForm": f"Processing batch {i+1}"}
    for i in range(num_batches)  # last batch may hold fewer than batch_size PDFs
]
# Add final todo
todos.append({
    "content": "Restart dashboard after completion",
    "status": "pending",
    "activeForm": "Restarting dashboard"
})

# Use TodoWrite tool to create the plan

Phase 2: Process Batches

For each batch:

  1. Mark batch as in_progress using TodoWrite

  2. Get next batch of PDFs:

ls data/statements/staging/*.pdf | head -n 15
  3. Run ingestion skill on this batch:
  • Use the standard skills/ingestion/SKILL.md
  • Process all PDFs in the batch
  • The ingestion skill will handle:
    • PDF reading and parsing
    • Database insertion
    • Categorization
    • Archiving
    • Error handling
  4. Mark batch as completed using TodoWrite

  5. Show progress summary:

from src.database.models import Database
from src.config import get_config

config = get_config()
db = Database(config["database_path"])

# Get most recent ingestion log
log = db.get_last_ingestion_log()
if log:
    print(f"✓ Batch completed")
    print(f"  PDFs processed: {log.pdfs_processed}")
    print(f"  Transactions added: {log.transactions_added}")
    print(f"  Status: {log.status}")
  6. Ask user to continue:
Batch X of Y completed. Continue to next batch? (yes/no)

If user says no: Stop and remind them they can resume later.
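The batch-slicing step above can be sketched as a small helper; the actual per-file work is delegated to the standard ingestion skill:

```python
def make_batches(paths: list, batch_size: int) -> list:
    """Split sorted PDF paths into consecutive fixed-size batches.

    The last batch may be smaller than batch_size.
    """
    return [paths[i:i + batch_size] for i in range(0, len(paths), batch_size)]
```

With 73 staged PDFs and a batch size of 15 this yields 5 batches, the last holding the remaining 13 files.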

Phase 3: Resume Capability

To resume an interrupted batch ingestion:

  1. Check remaining PDFs:
ls data/statements/staging/*.pdf | wc -l
  2. Check TodoWrite list to see which batches are pending

  3. Continue from where you left off: process the remaining batches
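Because the ingestion skill archives processed PDFs out of staging, resuming is just a recount: the number of batches still to run falls out of the remaining file count. A minimal sketch, assuming that archiving behavior:

```python
import math

def remaining_batches(staged_count: int, batch_size: int = 15) -> int:
    """Batches still to run, given how many PDFs remain in staging.

    Processed PDFs have already been archived out of the staging folder,
    so the staged count alone determines the remaining work.
    """
    return math.ceil(staged_count / batch_size)
```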

Error Handling

If a PDF fails during batch processing:

  • The ingestion skill will log the error
  • Continue with remaining PDFs in batch
  • Summarize failed PDFs at end of batch
  • User can review and retry failed PDFs separately
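The error-isolation behavior above follows a simple pattern: catch per-file failures, record them, and keep going. In this sketch `ingest_one` is a hypothetical stand-in for the ingestion skill's per-file step, not a real function from this codebase:

```python
def process_batch(pdf_paths: list, ingest_one) -> list:
    """Run ingest_one on each PDF, collecting failures instead of aborting.

    Returns a list of (path, error_message) pairs for manual review,
    so one bad PDF never stops the rest of the batch.
    """
    failed = []
    for path in pdf_paths:
        try:
            ingest_one(path)
        except Exception as exc:
            failed.append((path, str(exc)))  # isolate the error, keep going
    return failed
```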

Progress Tracking

The ingestion skill automatically updates the ingestion_log table:

  • Tracks PDFs processed
  • Tracks transactions added/updated
  • Records errors
  • Stores summary

Query progress:

from src.database.models import Database

db = Database("/path/to/finance.db")

# Get all ingestion runs
conn = db.get_connection()
logs = conn.execute("""
    SELECT started_at, completed_at, status, pdfs_processed, transactions_added
    FROM ingestion_log
    ORDER BY started_at DESC
    LIMIT 10
""").fetchall()

for log in logs:
    print(f"{log[0]}: {log[3]} PDFs, {log[4]} transactions ({log[2]})")

Advantages Over Single Run

  • Checkpoint progress - Resume anytime
  • User control - Review between batches
  • Memory management - Process in chunks
  • Error isolation - One bad PDF doesn't stop everything

Example Session

User: "I have 73 PDFs to process. Run batch ingestion."

Claude: I'll process them in batches of 15 PDFs each (5 batches total).

[Creates TodoWrite plan with 5 batch todos]

Processing batch 1 of 5...
[Runs ingestion skill on first 15 PDFs]
✓ Batch 1 completed: 15 PDFs, 287 transactions

Continue to batch 2?

User: yes

Processing batch 2 of 5...
[Continues...]

Notes

  • Uses the standard skills/ingestion/SKILL.md for actual processing
  • No complex Python scripts required
  • All state tracked in database + TodoWrite
  • Can stop and resume anytime
  • Simpler and more maintainable than old batch system