receipt-parser-engineer

from ThomasMcCrossin

Create, fix, and refine deterministic receipt/invoice parsers in curlys-books, including vendor detection, OCR text extraction routing (pdfplumber for text PDFs → AWS Textract fallback; images → Textract), golden fixture creation, and updates to the vendor dispatcher/registry. Use when adding a new vendor parser, debugging mis-detections, improving totals/tax/date/line extraction, or deciding when to rely on Claude Vision fallback for vendors without a tested parser.

Updated Jan 11, 2026

When & Why to Use This Skill

The Receipt Parser Engineer is a Claude skill for developing and maintaining deterministic receipt and invoice parsers. It routes each file to either direct PDF text extraction or AWS Textract OCR, then relies on deterministic parsing logic backed by golden fixture tests to extract financial fields such as tax, totals, and line items. For high-volume financial processing, this gives a repeatable alternative to non-deterministic AI extraction.

Use Cases

  • Developing new vendor-specific parsers: Generate deterministic parsing logic for high-volume vendors so that the same receipt text always yields the same extracted data.
  • Automated OCR Routing: Decide whether to use low-cost PDF text extraction or AWS Textract OCR based on the file type and whether the PDF contains usable embedded text.
  • Regression Testing & Quality Assurance: Create and maintain 'golden fixtures' (ground-truth datasets) to ensure parser updates don't break existing extraction logic.
  • Financial Data Normalization: Convert messy OCR output from various receipt formats into structured, standardized JSON for accounting software integration.
  • Hybrid AI Extraction: Implementing a fallback mechanism that uses Claude Vision for complex or unknown receipt layouts while maintaining deterministic speed for known formats.

Receipt Parser Engineer

Overview

Build and maintain vendor parsers that turn extracted receipt text into ReceiptNormalized with high accuracy, backed by golden fixtures. Keep parsing deterministic where possible and treat Claude Vision as the safety net for unknown vendors.

Workflow Decision Tree

  1. Start from the file type

    • PDF: use embedded text when possible; if the PDF has little/no embedded text, OCR it.
    • Image: OCR it.
    • Email HTML/text: parse directly (no OCR).
  2. Decide whether to write/extend a deterministic parser

    • Write/refine a deterministic parser when the vendor is high-volume, has structured line items, or needs reliable tax/subtotal/total.
    • Prefer Claude Vision fallback when the vendor is low-volume, highly variable, or you lack enough samples to stabilize patterns.
  3. Choose the OCR/text-extraction path

    • Rule: pdfplumber is for text-based PDFs; AWS Textract is for images and anything pdfplumber can’t extract meaningfully from a PDF.
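
The decision tree above can be sketched as a small routing function. This is a minimal illustration, not the code in packages/parsers/ocr/factory.py: the function name, the `pdf_text` and `ocr` callables, the suffix sets, and the `min_chars` threshold are all assumptions made for the example. In practice `pdf_text` would wrap pdfplumber and `ocr` would wrap Textract.

```python
from pathlib import Path
from typing import Callable, Tuple

# Hypothetical suffix sets; the real pipeline may recognize other types.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".tif", ".tiff"}
DIRECT_SUFFIXES = {".html", ".htm", ".txt"}


def extract_text(
    path: str,
    pdf_text: Callable[[str], str],  # assumed wrapper around embedded-text extraction
    ocr: Callable[[str], str],       # assumed wrapper around the OCR provider
    min_chars: int = 40,             # assumed cutoff for "meaningful" embedded text
) -> Tuple[str, str]:
    """Return (text, method), where method names the route that was taken."""
    suffix = Path(path).suffix.lower()

    if suffix == ".pdf":
        # Try embedded text first; fall back to OCR if there is little or none.
        text = pdf_text(path) or ""
        if len(text.strip()) >= min_chars:
            return text, "pdf_text"
        return ocr(path), "ocr"

    if suffix in IMAGE_SUFFIXES:
        # Images carry no embedded text, so they always go to OCR.
        return ocr(path), "ocr"

    if suffix in DIRECT_SUFFIXES:
        # Email HTML/text is parsed directly; no OCR involved.
        return Path(path).read_text(encoding="utf-8", errors="replace"), "direct"

    raise ValueError(f"unsupported file type: {suffix!r}")
```

Passing the extractors in as callables keeps the routing rule testable without AWS credentials or sample files.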

Core Invariants

  • Do not introduce any local OCR-binary wrapper or dependency; use Textract for OCR.
  • Keep vendor parsers deterministic: parse from ocr_text (and pdf_path only when table extraction is required).
  • Every parser change ships with a golden fixture test that reproduces the bug and prevents regressions.
  • Avoid “magic balancing” lines: prefer validation_warnings for missing/faded items rather than inventing data.
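
As an illustration of the last invariant, the sketch below reports an inconsistency as a warning instead of adding a balancing line. The `validation_warnings` name comes from this skill; the function name, its arguments, the list-of-strings shape, and the two-cent tolerance are assumptions for the example.

```python
from decimal import Decimal
from typing import List


def check_totals(
    line_totals: List[Decimal],
    subtotal: Decimal,
    tax: Decimal,
    total: Decimal,
    tolerance: Decimal = Decimal("0.02"),  # assumed rounding allowance
) -> List[str]:
    """Return warnings describing inconsistencies; never alter the parsed data."""
    warnings: List[str] = []

    line_sum = sum(line_totals, Decimal("0"))
    if abs(line_sum - subtotal) > tolerance:
        # A gap here usually means a faded or unread line. Report it rather
        # than inventing a line item to make the numbers balance.
        warnings.append(
            f"line items sum to {line_sum} but subtotal is {subtotal}"
        )

    if abs(subtotal + tax - total) > tolerance:
        warnings.append(
            f"subtotal {subtotal} + tax {tax} does not equal total {total}"
        )

    return warnings
```

The returned strings would be appended to the receipt's validation_warnings, leaving the extracted lines exactly as they were read.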

Workflow: Add a New Vendor Parser (Deterministic)

  1. Collect samples

    • Target 3–10 real receipts/invoices with known-good totals.
    • Prefer multiple layouts (thermal vs letter, refunds vs purchases, discounts/deposits).
  2. Generate OCR text for fixtures

    • Use the OCR factory (pdfplumber → Textract fallback) from the worker container:
      • docker compose exec worker python scripts/test_vendor_parsers.py /path/to/receipt.pdf
    • Save to tests/fixtures/golden_receipts/<vendor>/<name>_ocr.txt.
  3. Create expected outputs

    • Copy a known-good parse (or fill by hand) into:
      • tests/fixtures/golden_receipts/<vendor>/<name>_expected.json
    • Keep expected JSON minimal: only assert the fields you truly want stable.
  4. Implement the parser

    • Add packages/invoice_parsers/vendors/<vendor>_parser.py:
      • detect_format(ocr_text) -> bool should be strict enough to avoid false positives.
      • parse(ocr_text, entity, pdf_path=...) -> ReceiptNormalized should be resilient to OCR noise.
    • If table extraction is required, accept pdf_path and use pdfplumber inside the parser.
  5. Register the parser

    • Update packages/invoice_parsers/vendor_dispatcher.py:
      • import your parser
      • add it to the parser list (before GenericParser)
      • add the vendor key to GOLDEN_VENDORS when it has golden coverage
  6. Add golden tests

    • Update tests/unit/test_invoice_parsers.py with a new @pytest.mark.golden class for the vendor.
    • Assert at minimum: vendor_guess, totals, purchase date, invoice number (if applicable), and line count.
  7. Run tests

    • make test-golden
    • If you touch shared parsing behavior, run make test-unit too.
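
Steps 4 and 5 can be sketched as follows. Everything vendor-specific here is made up for illustration: "Acme Wholesale", the `INV-` invoice pattern, and the regexes. `ReceiptSketch` is a hypothetical stand-in for ReceiptNormalized, whose real fields may differ, and `CatchAllParser` only assumes that GenericParser accepts any text.

```python
import re
from dataclasses import dataclass, field
from datetime import date, datetime
from decimal import Decimal
from typing import List, Optional


@dataclass
class ReceiptSketch:
    """Hypothetical stand-in for ReceiptNormalized."""
    vendor_guess: str
    purchase_date: Optional[date]
    total: Optional[Decimal]
    validation_warnings: List[str] = field(default_factory=list)


class AcmeParser:
    """Parser for a made-up vendor, used only to show the shape."""

    _NAME = re.compile(r"ACME\s+WHOLESALE", re.IGNORECASE)
    _INVOICE = re.compile(r"\bINV-\d{6}\b")
    # Anchored at line start so a SUBTOTAL line is not mistaken for TOTAL;
    # accepts "," as the decimal separator to tolerate OCR noise.
    _TOTAL = re.compile(
        r"^[ \t]*TOTAL[ \t:]*\$?[ \t]*(\d+[.,]\d{2})[ \t]*$",
        re.IGNORECASE | re.MULTILINE,
    )
    _DATE = re.compile(r"\b(\d{4}-\d{2}-\d{2})\b")

    def detect_format(self, ocr_text: str) -> bool:
        # Strict: require two independent signals to avoid false positives.
        return bool(self._NAME.search(ocr_text) and self._INVOICE.search(ocr_text))

    def parse(self, ocr_text: str, entity: str, pdf_path: Optional[str] = None) -> ReceiptSketch:
        warnings: List[str] = []

        total_match = self._TOTAL.search(ocr_text)
        total = Decimal(total_match.group(1).replace(",", ".")) if total_match else None
        if total is None:
            warnings.append("total not found")

        purchase_date = None
        date_match = self._DATE.search(ocr_text)
        if date_match:
            try:
                purchase_date = datetime.strptime(date_match.group(1), "%Y-%m-%d").date()
            except ValueError:
                warnings.append(f"unparseable date: {date_match.group(1)}")
        else:
            warnings.append("purchase date not found")

        return ReceiptSketch("Acme Wholesale", purchase_date, total, warnings)


class CatchAllParser:
    """Assumed behavior of a generic fallback: it accepts everything."""

    def detect_format(self, ocr_text: str) -> bool:
        return True


def dispatch(ocr_text: str, parsers: list):
    """Return the first parser that claims the text, or None."""
    for parser in parsers:
        if parser.detect_format(ocr_text):
            return parser
    return None
```

With a first-match dispatcher like this one, a catch-all placed ahead of the vendor parsers would claim every receipt, which is one reason to register new parsers before GenericParser.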

Workflow: Fix/Refine an Existing Parser

  1. Add a failing fixture first (new <name>_ocr.txt + <name>_expected.json).
  2. Make the smallest parser change that fixes the issue.
  3. Re-run make test-golden until green.
  4. Add/adjust validation_warnings when the receipt can’t be made internally consistent.
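
One way to keep a failing fixture focused on the bug is to compare only the fields present in <name>_expected.json and ignore everything else the parser returns. The helper below is a sketch of that idea. It is not taken from tests/unit/test_invoice_parsers.py, and the function name and the JSON keys in the usage lines are hypothetical.

```python
import json
from typing import Any, List


def diff_expected(expected: Any, actual: Any, path: str = "") -> List[str]:
    """Compare only what `expected` specifies; extra keys in `actual` are ignored."""
    if isinstance(expected, dict):
        if not isinstance(actual, dict):
            return [f"{path or '<root>'}: expected an object, got {type(actual).__name__}"]
        problems: List[str] = []
        for key, value in expected.items():
            child = f"{path}.{key}" if path else key
            if key not in actual:
                problems.append(f"{child}: missing")
            else:
                problems.extend(diff_expected(value, actual[key], child))
        return problems

    # Scalars and lists are compared by plain equality.
    if expected != actual:
        return [f"{path}: expected {expected!r}, got {actual!r}"]
    return []


# Usage with hypothetical field names:
expected = json.loads('{"vendor_guess": "Acme", "totals": {"total": "47.81"}}')
actual = {"vendor_guess": "Acme", "totals": {"total": "47.81", "tax": "6.24"}, "lines": []}
problems = diff_expected(expected, actual)  # empty: unasserted fields are ignored
```

A golden test can then assert that the returned list is empty, and the list itself doubles as a readable failure message.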

Debugging Playbook

  • One-shot OCR + parse (local workbench):

    • python3 skills/receipt-parser-engineer/scripts/parser_workbench.py --file /path/to/receipt.pdf --entity corp
    • python3 skills/receipt-parser-engineer/scripts/parser_workbench.py --ocr-text tests/fixtures/golden_receipts/<vendor>/<name>_ocr.txt --entity corp
  • OCR routing: packages/parsers/ocr/factory.py

  • Textract provider: packages/parsers/ocr/provider_textract.py

  • PDF embedded text extraction: packages/parsers/ocr/pdf_text_extractor.py

  • Dispatcher + golden vendor list: packages/invoice_parsers/vendor_dispatcher.py

  • Worker parse routing (incl. Claude Vision heuristic): services/worker/tasks/pipeline/parse.py

  • Golden fixtures: tests/fixtures/golden_receipts/

References (load as needed)

  • Agent prompt (Claude/Codex): references/agent-prompt.md
  • Repo/pipeline map: references/pipeline-map.md