receipt-parser-engineer
Create, fix, and refine deterministic receipt/invoice parsers in curlys-books, including vendor detection, OCR text extraction routing (pdfplumber for text PDFs → AWS Textract fallback; images → Textract), golden fixture creation, and updates to the vendor dispatcher/registry. Use when adding a new vendor parser, debugging mis-detections, improving totals/tax/date/line extraction, or deciding when to rely on Claude Vision fallback for vendors without a tested parser.
When & Why to Use This Skill
The Receipt Parser Engineer is a specialized Claude skill designed to automate the development and maintenance of high-precision receipt and invoice parsers. It optimizes document workflows by intelligently routing files between direct PDF text extraction and AWS Textract OCR. By focusing on deterministic parsing logic and rigorous golden fixture testing, this skill ensures reliable extraction of financial metadata like tax, totals, and line items, providing a robust alternative to non-deterministic AI extraction for high-volume financial processing.
Use Cases
- Developing new vendor-specific parsers: Generate deterministic parsing logic for high-volume vendors so that the same input text always yields the same extracted data.
- Automated OCR Routing: Decide whether to use low-cost embedded-text extraction (pdfplumber) or AWS Textract OCR, based on the file type and on whether a PDF contains usable embedded text.
- Regression Testing & Quality Assurance: Create and maintain 'golden fixtures' (ground-truth datasets) to ensure parser updates don't break existing extraction logic.
- Financial Data Normalization: Converting messy OCR output from various receipt formats into structured, standardized JSON formats for accounting software integration.
- Hybrid AI Extraction: Implementing a fallback mechanism that uses Claude Vision for complex or unknown receipt layouts while maintaining deterministic speed for known formats.
Receipt Parser Engineer
Overview
Build and maintain vendor parsers that turn extracted receipt text into ReceiptNormalized with high accuracy, backed by golden fixtures. Keep parsing deterministic where possible and treat Claude Vision as the safety net for unknown vendors.
Workflow Decision Tree
Start from the file type
- PDF: use embedded text when possible; if the PDF has little/no embedded text, OCR it.
- Image: OCR it.
- Email HTML/text: parse directly (no OCR).
Decide whether to write/extend a deterministic parser
- Write/refine a deterministic parser when the vendor is high-volume, has structured line items, or needs reliable tax/subtotal/total.
- Prefer Claude Vision fallback when the vendor is low-volume, highly variable, or you lack enough samples to stabilize patterns.
Choose the OCR/text-extraction path
- Rule: pdfplumber is for text-based PDFs; AWS Textract is for images and anything pdfplumber can’t extract meaningfully from a PDF.
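The decision tree above can be sketched as a small routing function. This is only an illustration: `choose_extraction_path`, `embedded_pdf_text`, the suffix sets, and the `MIN_EMBEDDED_CHARS` threshold are hypothetical names and values, not the repo's actual OCR factory.

```python
from pathlib import Path

# Hypothetical threshold: a PDF with fewer non-whitespace embedded characters
# than this is treated as scanned and sent to OCR.
MIN_EMBEDDED_CHARS = 50

IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".tif", ".tiff"}
TEXT_SUFFIXES = {".html", ".htm", ".txt", ".eml"}


def choose_extraction_path(filename: str, embedded_text: str = "") -> str:
    """Return 'direct', 'pdfplumber', or 'textract' for an input file."""
    suffix = Path(filename).suffix.lower()
    if suffix in TEXT_SUFFIXES:
        return "direct"  # email HTML/text: parse directly, no OCR
    if suffix in IMAGE_SUFFIXES:
        return "textract"  # images always go to OCR
    if suffix == ".pdf":
        # Count only non-whitespace characters so blank pages do not pass.
        meaningful = sum(1 for ch in embedded_text if not ch.isspace())
        return "pdfplumber" if meaningful >= MIN_EMBEDDED_CHARS else "textract"
    raise ValueError(f"unsupported file type: {suffix or filename}")


def embedded_pdf_text(pdf_path: str) -> str:
    """Read embedded text with pdfplumber (imported lazily; needs pdfplumber installed)."""
    import pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        # extract_text() may return None for pages without a text layer.
        return "\n".join(page.extract_text() or "" for page in pdf.pages)
```

A caller would pass `embedded_pdf_text(path)` as the second argument for PDFs. The right threshold depends on the receipts being processed, so treat the value here as a placeholder.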
Core Invariants
- Do not introduce any local OCR-binary wrapper or dependency; use Textract for OCR.
- Keep vendor parsers deterministic: parse from ocr_text (and pdf_path only when table extraction is required).
- Every parser change ships with a golden fixture test that reproduces the bug and prevents regressions.
- Avoid “magic balancing” lines: prefer validation_warnings for missing/faded items rather than inventing data.
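As an illustration of the last invariant, here is a minimal sketch of a totals check that reports inconsistencies as warnings instead of adding a synthetic line to force the numbers to balance. The function name `check_totals`, its arguments, and the one-cent `TOLERANCE` are assumptions made for this example; the repo's own validation may be structured differently.

```python
from decimal import Decimal

TOLERANCE = Decimal("0.01")  # hypothetical rounding tolerance


def check_totals(line_amounts, subtotal, tax, total):
    """Return validation warnings; never invent a line item to make totals match."""
    warnings = []
    line_sum = sum(line_amounts, Decimal("0"))
    if abs(line_sum - subtotal) > TOLERANCE:
        warnings.append(
            f"line items sum to {line_sum} but subtotal is {subtotal}; "
            "some items may be missing or faded"
        )
    if abs(subtotal + tax - total) > TOLERANCE:
        warnings.append(f"subtotal {subtotal} + tax {tax} != total {total}")
    return warnings
```

The parser would attach the returned strings to validation_warnings and leave the extracted line items untouched.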
Workflow: Add a New Vendor Parser (Deterministic)
Collect samples
- Target 3–10 real receipts/invoices with known-good totals.
- Prefer multiple layouts (thermal vs letter, refunds vs purchases, discounts/deposits).
Generate OCR text for fixtures
- Use the OCR factory (pdfplumber → Textract fallback) from the worker container:
  docker compose exec worker python scripts/test_vendor_parsers.py /path/to/receipt.pdf
- Save to tests/fixtures/golden_receipts/<vendor>/<name>_ocr.txt.
Create expected outputs
- Copy a known-good parse (or fill by hand) into:
  tests/fixtures/golden_receipts/<vendor>/<name>_expected.json
- Keep expected JSON minimal: only assert the fields you truly want stable.
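One way to honor "only assert the fields you truly want stable" is a subset comparison: every field listed in the expected JSON must match, and extra fields in the parser output are ignored. The helper below is a hypothetical sketch. `subset_mismatches` is not a repo function, and apart from vendor_guess the field names in the sample data are invented.

```python
import json


def subset_mismatches(expected, actual, path=""):
    """List fields present in `expected` whose values are missing or differ in `actual`."""
    mismatches = []
    if isinstance(expected, dict):
        if not isinstance(actual, dict):
            return [f"{path or '<root>'}: expected an object, got {type(actual).__name__}"]
        for key, exp_value in expected.items():
            child = f"{path}.{key}" if path else key
            if key not in actual:
                mismatches.append(f"{child}: missing")
            else:
                mismatches.extend(subset_mismatches(exp_value, actual[key], child))
    elif expected != actual:
        # Scalars and lists are compared by plain equality.
        mismatches.append(f"{path or '<root>'}: expected {expected!r}, got {actual!r}")
    return mismatches


# Invented sample data: the expected file pins two fields, the parse has three.
expected = json.loads('{"vendor_guess": "Acme Supply", "total": "11.50"}')
parsed = {"vendor_guess": "Acme Supply", "total": "11.50", "ocr_confidence": 0.93}
```

With this approach, adding a new output field to the parser does not force every existing expected file to be regenerated.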
Implement the parser
- Add packages/invoice_parsers/vendors/<vendor>_parser.py:
  - detect_format(ocr_text) -> bool should be strict enough to avoid false positives.
  - parse(ocr_text, entity, pdf_path=...) -> ReceiptNormalized should be resilient to OCR noise.
- If table extraction is required, accept pdf_path and use pdfplumber inside the parser.
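A minimal sketch of the shape such a parser might take, for a made-up vendor called "Acme Supply". `ReceiptSketch` is a hypothetical stand-in for ReceiptNormalized, whose real fields are not shown in this document, and the regexes and sample text are invented for illustration.

```python
import re
from dataclasses import dataclass, field
from datetime import date
from decimal import Decimal
from typing import Optional


@dataclass
class ReceiptSketch:
    """Hypothetical stand-in for the repo's ReceiptNormalized; real fields may differ."""
    vendor_guess: str
    total: Optional[Decimal] = None
    purchase_date: Optional[date] = None
    validation_warnings: list = field(default_factory=list)


class AcmeSupplyParser:
    """Parser for an invented vendor, used only to show the two-method shape."""

    # Strict detection: require the vendor name AND a vendor-specific marker.
    _NAME = re.compile(r"\bACME\s+SUPPLY\b", re.IGNORECASE)
    _MARKER = re.compile(r"\bINVOICE\s*#\s*AS-\d+", re.IGNORECASE)
    # Tolerant extraction: accept 'T0TAL' for 'TOTAL' and ',' as the decimal mark.
    _TOTAL = re.compile(
        r"^\s*T[O0]TAL\b[^\d\n]*(\d+[.,]\d{2})\s*$", re.IGNORECASE | re.MULTILINE
    )
    _DATE = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")

    def detect_format(self, ocr_text: str) -> bool:
        return bool(self._NAME.search(ocr_text) and self._MARKER.search(ocr_text))

    def parse(self, ocr_text: str, entity: str, pdf_path: Optional[str] = None) -> ReceiptSketch:
        # `entity` and `pdf_path` mirror the signature in the text; unused in this sketch.
        receipt = ReceiptSketch(vendor_guess="Acme Supply")

        total_match = self._TOTAL.search(ocr_text)
        if total_match:
            receipt.total = Decimal(total_match.group(1).replace(",", "."))
        else:
            receipt.validation_warnings.append("total not found")

        date_match = self._DATE.search(ocr_text)
        if date_match:
            try:
                receipt.purchase_date = date(*map(int, date_match.groups()))
            except ValueError:
                receipt.validation_warnings.append(f"invalid date: {date_match.group(0)}")
        else:
            receipt.validation_warnings.append("purchase date not found")
        return receipt


SAMPLE = "Acme Supply Ltd.\nInvoice # AS-10442\n2024-03-15\nSUBTOTAL 10.00\nTAX 1.50\nT0TAL  $11,50\n"
```

The two methods are deliberately asymmetric: detection demands two independent signals so that a stray mention of the vendor name does not claim the receipt, while extraction tolerates common OCR substitutions.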
Register the parser
- Update packages/invoice_parsers/vendor_dispatcher.py:
  - import your parser
  - add it to the parser list (before GenericParser)
  - add the vendor key to GOLDEN_VENDORS when it has golden coverage
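The instruction to place a new parser before GenericParser suggests a first-match dispatch, where a catch-all listed too early would shadow every vendor parser after it. The sketch below illustrates that assumption with stub classes. It is not the contents of vendor_dispatcher.py; `select_parser`, `vendor_key`, and the set-typed `GOLDEN_VENDORS` are guesses made for the example.

```python
class StubAcmeParser:
    """Stub vendor parser for an invented vendor."""
    vendor_key = "acme_supply"

    def detect_format(self, ocr_text: str) -> bool:
        return "ACME SUPPLY" in ocr_text.upper()


class StubGenericParser:
    """Stub catch-all: accepts any text, so it must come last."""
    vendor_key = "generic"

    def detect_format(self, ocr_text: str) -> bool:
        return True


# Order matters under first-match dispatch: specific parsers first, catch-all last.
PARSERS = [StubAcmeParser(), StubGenericParser()]

# Assumed here to be a set of vendor keys that have golden fixture coverage.
GOLDEN_VENDORS = {"acme_supply"}


def select_parser(ocr_text: str, parsers=PARSERS):
    """Return the first parser whose detect_format accepts the text."""
    for parser in parsers:
        if parser.detect_format(ocr_text):
            return parser
    raise LookupError("no parser matched")


def has_golden_coverage(parser) -> bool:
    return parser.vendor_key in GOLDEN_VENDORS
```

Reversing the list makes `select_parser` return the generic stub even for Acme text, which is the failure mode that the ordering rule guards against.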
Add golden tests
- Update tests/unit/test_invoice_parsers.py with a new @pytest.mark.golden class for the vendor.
- Assert at minimum: vendor_guess, totals, purchase date, invoice number (if applicable), and line count.
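Golden tests need each `<name>_ocr.txt` paired with its `<name>_expected.json`. A small discovery helper, sketched below, can do the pairing and fail fast when one half is missing. `discover_golden_cases` is a hypothetical name; the existing test module may load fixtures in another way.

```python
from pathlib import Path


def discover_golden_cases(vendor_dir):
    """Pair each <name>_ocr.txt with <name>_expected.json; raise if a pair is incomplete."""
    cases = []
    for ocr_file in sorted(Path(vendor_dir).glob("*_ocr.txt")):
        name = ocr_file.name[: -len("_ocr.txt")]
        expected_file = ocr_file.with_name(f"{name}_expected.json")
        if not expected_file.exists():
            raise FileNotFoundError(f"missing expected output for fixture '{name}'")
        cases.append((name, ocr_file, expected_file))
    return cases


# Possible use inside a golden test class (requires pytest; shown as comments
# so that this sketch stays stdlib-only):
#
#   @pytest.mark.golden
#   class TestAcmeSupplyGolden:
#       @pytest.mark.parametrize(
#           "name,ocr_file,expected_file",
#           discover_golden_cases("tests/fixtures/golden_receipts/acme_supply"),
#       )
#       def test_fixture(self, name, ocr_file, expected_file):
#           ...  # parse ocr_file, then compare against expected_file
```

Parametrizing over discovered pairs means that adding a fixture to the vendor directory adds a test case without editing the test module.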
Run tests
- make test-golden
- If you touch shared parsing behavior, run make test-unit too.
Workflow: Fix/Refine an Existing Parser
- Add a failing fixture first (new <name>_ocr.txt + <name>_expected.json).
- Make the smallest parser change that fixes the issue.
- Re-run make test-golden until green.
- Add/adjust validation_warnings when the receipt can’t be made internally consistent.
Debugging Playbook
- One-shot OCR + parse (local workbench):
  python3 skills/receipt-parser-engineer/scripts/parser_workbench.py --file /path/to/receipt.pdf --entity corp
  python3 skills/receipt-parser-engineer/scripts/parser_workbench.py --ocr-text tests/fixtures/golden_receipts/<vendor>/<name>_ocr.txt --entity corp
- OCR routing: packages/parsers/ocr/factory.py
- Textract provider: packages/parsers/ocr/provider_textract.py
- PDF embedded text extraction: packages/parsers/ocr/pdf_text_extractor.py
- Dispatcher + golden vendor list: packages/invoice_parsers/vendor_dispatcher.py
- Worker parse routing (incl. Claude Vision heuristic): services/worker/tasks/pipeline/parse.py
- Golden fixtures: tests/fixtures/golden_receipts/
References (load as needed)
- Agent prompt (Claude/Codex): references/agent-prompt.md
- Repo/pipeline map: references/pipeline-map.md