# Curlys Books parser pipeline map (code-first)

## OCR / text extraction

Primary entry point:
- `packages/parsers/ocr/factory.py::extract_text_from_receipt(file_path)`

Routing rules:
- **Email artifacts** (`.html`, `.htm`, `.txt`): read file contents directly (`method=email_html|email_text`).
- **Images** (`.jpg`, `.png`, `.heic`, …): **AWS Textract** (`method=textract`, includes bounding boxes).
- **PDFs**:
  1. Try **pdfplumber** embedded text extraction (`method=pdfplumber`).
  2. If embedded text is insufficient: **AWS Textract** (`method=textract`).
  3. If Textract rejects the PDF: convert PDF → images and retry (`method=textract_via_image_conversion`).
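
The routing rules above can be sketched as a lookup from file extension to the ordered list of methods to attempt. This is a minimal illustration, not the code in `factory.py`: `planned_methods`, the two constant names, and the `ValueError` for unrecognized extensions are assumptions, and the image list covers only the extensions named above.

```python
from pathlib import Path

# Hypothetical tables; the real routing lives in packages/parsers/ocr/factory.py.
EMAIL_METHODS = {".html": "email_html", ".htm": "email_html", ".txt": "email_text"}
IMAGE_EXTENSIONS = {".jpg", ".png", ".heic"}  # the documented list is open-ended


def planned_methods(file_path: str) -> list[str]:
    """Return the extraction methods to attempt, in order, for one file."""
    suffix = Path(file_path).suffix.lower()
    if suffix in EMAIL_METHODS:
        return [EMAIL_METHODS[suffix]]
    if suffix in IMAGE_EXTENSIONS:
        return ["textract"]
    if suffix == ".pdf":
        # Each later entry is only tried if the previous one fails or is insufficient.
        return ["pdfplumber", "textract", "textract_via_image_conversion"]
    # Assumption: unknown extensions are rejected rather than guessed at.
    raise ValueError(f"unsupported receipt file type: {suffix!r}")
```

For a PDF the list is a fallback chain, so the recorded `method` tells you how far down the chain extraction had to go.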

Key modules:
- Textract: `packages/parsers/ocr/provider_textract.py`
- PDF embedded text: `packages/parsers/ocr/pdf_text_extractor.py`

## Worker pipeline (receipt processing)

Stages (Celery tasks):
- Stage 1 ingest: `services/worker/tasks/pipeline/ingest.py` (normalizes storage layout + queues OCR)
- Stage 2 OCR: `services/worker/tasks/pipeline/ocr.py` (stores `ocr_text`, `ocr_method`, `confidence`, `bounding_boxes`)
- Stage 3 parse: `services/worker/tasks/pipeline/parse.py` (vendor normalization + parser routing + categorization)
- Stage 4 persist: `services/worker/tasks/pipeline/persist.py` (writes normalized receipt + line items)
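
As a rough mental model of the data flow, the four stages can be pictured as functions that each add fields to a shared state. The sketch below is plain Python rather than Celery, and every function body and value in it is a placeholder assumption; only the stage order and the OCR field names come from the list above.

```python
from typing import Callable

Stage = Callable[[dict], dict]


def ingest(state: dict) -> dict:
    # Placeholder: the real task normalizes storage layout and queues OCR.
    return {**state, "stored": True}


def ocr(state: dict) -> dict:
    # Field names match Stage 2 above; the values are dummy data.
    return {**state, "ocr_text": "TOTAL 12.34", "ocr_method": "textract",
            "confidence": 0.99, "bounding_boxes": []}


def parse(state: dict) -> dict:
    # Placeholder for vendor normalization, parser routing, categorization.
    return {**state, "vendor": "unknown", "line_items": []}


def persist(state: dict) -> dict:
    # Placeholder for writing the normalized receipt and line items.
    return {**state, "persisted": True}


STAGES: list[tuple[str, Stage]] = [
    ("ingest", ingest), ("ocr", ocr), ("parse", parse), ("persist", persist),
]


def run_pipeline(state: dict) -> dict:
    """Apply the stages in order, recording which ones have completed."""
    for name, stage in STAGES:
        state = {**stage(state), "completed": [*state.get("completed", []), name]}
    return state
```

In the worker each stage is a separate Celery task, so a receipt that stalls will show the last completed stage rather than a full run.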

Where to look when debugging:
- `shared.ops_tasks.meta` holds the pipeline inputs and outputs for each task (e.g., `ocr_text`, truncated for storage, plus the OCR method and confidence).

## Parser routing (deterministic vs fallback)

Dispatcher:
- `packages/invoice_parsers/vendor_dispatcher.py`

Rules of thumb:
- Add/keep deterministic vendor parsers for repeatable formats and high-volume vendors.
- Keep `GenericParser` last in the dispatch order: it always matches, so a parser placed after it would never be selected.
- Add a vendor key to `GOLDEN_VENDORS` only after it has golden fixture coverage.
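
The ordering rule is easiest to see in a first-match dispatcher. The sketch below assumes that is how `vendor_dispatcher.py` selects a parser; `AcmeParser`, the `matches` method, `vendor_key`, and `select_parser` are hypothetical names, and the `GenericParser` here is a stand-in for the real class.

```python
class AcmeParser:
    """Hypothetical deterministic parser for one vendor's fixed format."""

    vendor_key = "acme"

    def matches(self, ocr_text: str) -> bool:
        return "ACME WHOLESALE" in ocr_text.upper()


class GenericParser:
    """Stand-in for the catch-all parser: it accepts any input."""

    vendor_key = "generic"

    def matches(self, ocr_text: str) -> bool:
        return True


# Order matters: deterministic parsers first, the catch-all last.
PARSERS = [AcmeParser(), GenericParser()]


def select_parser(ocr_text: str):
    """Return the first parser whose matches() accepts the OCR text."""
    for parser in PARSERS:
        if parser.matches(ocr_text):
            return parser
    # Only reachable if the catch-all is removed from PARSERS.
    raise LookupError("no parser matched")
```

With this shape, moving `GenericParser` ahead of `AcmeParser` would silently route every Acme receipt to the generic path, which is the failure the rule of thumb guards against.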

Claude Vision fallback:
- Parse-stage heuristic lives in `services/worker/tasks/pipeline/parse.py` (used for some unknown-vendor image receipts).
- Forced (manual) reparse is implemented in `services/worker/tasks/reparse_with_claude.py`.

## Golden tests (regression safety net)

Fixtures:
- `tests/fixtures/golden_receipts/<vendor>/`
  - `<name>_ocr.txt`
  - `<name>_expected.json`
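
Given that layout, fixture pairs can be discovered by globbing for `*_ocr.txt` and deriving the sibling `_expected.json` path. This is a sketch under the naming convention above; `golden_cases` is a hypothetical helper, not something known to exist in the test suite.

```python
from pathlib import Path

OCR_SUFFIX = "_ocr.txt"


def golden_cases(root: Path) -> list[tuple[str, str, Path, Path]]:
    """Return (vendor, name, ocr_path, expected_path) for every fixture pair."""
    cases = []
    for ocr_path in sorted(root.glob(f"*/*{OCR_SUFFIX}")):
        name = ocr_path.name[: -len(OCR_SUFFIX)]
        expected_path = ocr_path.with_name(f"{name}_expected.json")
        if not expected_path.exists():
            # Fail loudly: an OCR file without expectations tests nothing.
            raise FileNotFoundError(f"missing expected output for {ocr_path}")
        cases.append((ocr_path.parent.name, name, ocr_path, expected_path))
    return cases
```

A list like this could feed `pytest.mark.parametrize`, and raising on an unpaired file keeps half-added fixtures from being skipped unnoticed.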

Tests:
- `tests/unit/test_invoice_parsers.py` (class-based `@pytest.mark.golden` suite)

Commands:
- `make test-golden`
- `pytest tests/unit/test_invoice_parsers.py -m golden -v`

