---
name: pdf-to-docx
description: Convert PDF pages to editable Word documents (.docx) while preserving layout. Use when users want to (1) convert PDF to Word, (2) make PDF content editable, (3) extract PDF pages to docx format, (4) preserve two-column academic paper layout, (5) OCR PDF images to text. Handles PDFs with embedded images by rendering each page as an image first, then running OCR on it.
---

# PDF to Word Converter

Convert PDF pages to editable Word documents while preserving layout structure.

## Workflow

1. **Extract the PDF page as an image** - Use pdfplumber to render the page at high resolution
2. **Run OCR** - Use tesseract to extract text from the image
3. **Create the Word document** - Use python-docx to build a document with a matching layout
4. **Verify the result** - Compare the generated document with the original PDF

## Quick Start

### Extract a single page:
```bash
python scripts/extract_pdf_page.py /path/to/document.pdf 1 -o /output/dir
```

### Create two-column Word document:
```bash
python scripts/create_two_column_docx.py /output/dir/page1_text.txt output.docx \
  --title "Document Title" \
  --author "Author Name" \
  --page-number 1 \
  --total-pages 8
```

## Manual Workflow (for custom layouts)

When the scripts don't produce the layout you need, follow this manual process:

### Step 1: Extract page as image
```python
import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    page = pdf.pages[0]  # 0-indexed
    pil_image = page.to_image(resolution=200).original
    pil_image.save("page1.png", "PNG")
```
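
Before running OCR, it can be worth checking whether the page already carries a text layer, since text-based PDFs can be read directly with pdfplumber. The sketch below is one way to make that decision; `text_layer_or_none` and the `min_chars` threshold are hypothetical choices for illustration, not part of the bundled scripts.

```python
def text_layer_or_none(page, min_chars=50):
    """Return the page's embedded text, or None if OCR looks necessary.

    `page` is expected to be a pdfplumber page; only its extract_text()
    method is used. `min_chars` is an arbitrary cut-off (an assumption):
    pages with less text than this are treated as scanned images.
    """
    text = (page.extract_text() or "").strip()  # None or "" when there is no text layer
    return text if len(text) >= min_chars else None


# Usage with pdfplumber (requires the package to be installed):
#   import pdfplumber
#   with pdfplumber.open("document.pdf") as pdf:
#       text = text_layer_or_none(pdf.pages[0])
#       if text is None:
#           ...  # fall back to rendering the page and running OCR
```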

### Step 2: Run OCR
```bash
tesseract page1.png page1_text -l eng
```
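
When the OCR step needs to run from Python instead of the shell, the same command can be wrapped with `subprocess`. This is a minimal sketch; the helper names are hypothetical, and the optional `psm` argument maps to Tesseract's `--psm` page segmentation flag, which you may or may not need for your layout.

```python
import subprocess
from pathlib import Path


def tesseract_command(image_path, output_base, lang="eng", psm=None):
    """Build the tesseract argument list used in Step 2."""
    cmd = ["tesseract", str(image_path), str(output_base), "-l", lang]
    if psm is not None:
        cmd += ["--psm", str(psm)]  # page segmentation mode, e.g. 4 or 6
    return cmd


def run_ocr(image_path, output_base, lang="eng", psm=None):
    """Run tesseract and return the recognized text.

    Tesseract writes its result to `<output_base>.txt`.
    Raises CalledProcessError if tesseract exits with an error.
    """
    subprocess.run(tesseract_command(image_path, output_base, lang, psm), check=True)
    return Path(f"{output_base}.txt").read_text(encoding="utf-8")


# Usage (requires the tesseract binary on PATH):
#   text = run_ocr("page1.png", "page1_text")
```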

### Step 3: View original to understand layout
Read the extracted image to understand:
- Column structure (single, two-column, etc.)
- Header/footer content
- Section headers and formatting
- Image/diagram placements
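
Visual inspection remains the reference, but the column structure can sometimes be estimated from word positions when they are available (for example from pdfplumber's `page.extract_words()` on a page with a text layer). The heuristic below is only a sketch: it assumes a two-column page has almost no words crossing the vertical centre line and a fair share of words on each side. The function name and both thresholds are hypothetical.

```python
def looks_two_column(words, page_width, max_crossing_ratio=0.02, min_side_share=0.25):
    """Guess whether a page is laid out in two columns.

    `words` is a list of dicts with "x0" and "x1" keys (left and right edge
    of each word), the shape returned by pdfplumber's extract_words().
    Full-width titles or figures will add crossing words, so treat the
    result as a hint, not a verdict.
    """
    if not words:
        return False
    center = page_width / 2
    total = len(words)
    crossing = sum(1 for w in words if w["x0"] < center < w["x1"])
    left = sum(1 for w in words if w["x1"] <= center)
    right = sum(1 for w in words if w["x0"] >= center)
    return (
        crossing / total <= max_crossing_ratio
        and left / total >= min_side_share
        and right / total >= min_side_share
    )


# Usage with pdfplumber:
#   looks_two_column(page.extract_words(), page.width)
```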

### Step 4: Create Word document with python-docx
```python
from docx import Document
from docx.shared import Pt, Cm
from docx.enum.text import WD_ALIGN_PARAGRAPH
from docx.oxml.ns import qn
from docx.oxml import OxmlElement

doc = Document()

# Set margins
for section in doc.sections:
    section.top_margin = Cm(1.5)
    section.bottom_margin = Cm(1.5)
    section.left_margin = Cm(1.5)
    section.right_margin = Cm(1.5)

# Two-column layout using a borderless table
table = doc.add_table(rows=1, cols=2)

def remove_borders(cell):
    """Set the four outer borders of a table cell to 'nil' (no border)."""
    tcPr = cell._tc.get_or_add_tcPr()
    tcBorders = OxmlElement('w:tcBorders')
    # Child order follows the OOXML schema sequence: top, left, bottom, right
    for edge in ('top', 'left', 'bottom', 'right'):
        el = OxmlElement(f'w:{edge}')
        el.set(qn('w:val'), 'nil')
        tcBorders.append(el)
    tcPr.append(tcBorders)

for cell in table.rows[0].cells:
    remove_borders(cell)
    cell.width = Cm(8.5)  # ~8.5 cm per column

# Add content to left column
left_cell = table.rows[0].cells[0]
p = left_cell.paragraphs[0]  # reuse the empty paragraph every new cell starts with
p.alignment = WD_ALIGN_PARAGRAPH.JUSTIFY
run = p.add_run("Content here...")
run.font.size = Pt(9)

doc.save("output.docx")
```
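
OCR output usually keeps the hard line breaks of the scanned page, so it tends to need reflowing before it goes into a cell. The sketch below assumes paragraphs are separated by blank lines in the OCR text file; `reflow_ocr_text` is a hypothetical helper, not part of the bundled scripts.

```python
import re


def reflow_ocr_text(raw_text):
    """Split OCR output into paragraphs and undo hard line wraps.

    A line ending in "-" is joined to the next line without a space.
    This also merges genuine hyphenated compounds that happen to break
    at a line end, so review the result by eye.
    """
    paragraphs = []
    for block in re.split(r"\n\s*\n", raw_text.strip()):
        lines = [line.strip() for line in block.splitlines() if line.strip()]
        if not lines:
            continue
        text = lines[0]
        for line in lines[1:]:
            if text.endswith("-"):
                text = text[:-1] + line  # rejoin a word split across lines
            else:
                text += " " + line
        paragraphs.append(text)
    return paragraphs


# Usage with the Step 4 document:
#   for para in reflow_ocr_text(open("page1_text.txt", encoding="utf-8").read()):
#       left_cell.add_paragraph(para)
```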

## Layout Patterns

### Academic Paper (Two-Column)
- Use borderless table with 2 columns
- Column width: ~8.5 cm each
- Font size: 9-10pt for body, 10-12pt for headers
- Justified text alignment
- Section headers in bold
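
The ~8.5 cm figure depends on page size and margins, so it may help to compute it instead of hard-coding it. A small sketch, where `gutter_cm` is a hypothetical allowance for the space between columns: on A4 (21.0 cm wide) with 1.5 cm margins and a 1 cm gutter, each column comes out at 8.5 cm.

```python
def column_width_cm(page_width_cm, left_margin_cm, right_margin_cm,
                    gutter_cm=1.0, columns=2):
    """Width of each column after subtracting margins and gutters."""
    usable = page_width_cm - left_margin_cm - right_margin_cm
    width = (usable - gutter_cm * (columns - 1)) / columns
    if width <= 0:
        raise ValueError("margins and gutters leave no room for columns")
    return width


# A4 page, 1.5 cm margins, 1 cm gutter
a4_width = column_width_cm(21.0, 1.5, 1.5)
# US Letter page (21.59 cm wide), same margins and gutter
letter_width = column_width_cm(21.59, 1.5, 1.5)
```

The result can be passed straight to `Cm(...)` when setting `cell.width`.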

### Single Column Document
- Standard paragraph formatting
- No table needed
- Wider margins acceptable

### With Images/Diagrams
- Mark image positions with placeholder text: `[Figure X - See original PDF]`
- Images must be manually extracted and inserted
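
For the manual extraction, one option is to crop the figure out of the rendered page image. If the figure's bounding box is known in PDF points (for instance from pdfplumber's `page.images` entries), it has to be scaled to pixels first. The sketch below assumes the page origin is at (0, 0) and that the image was rendered at the given resolution; the helper name is hypothetical.

```python
def bbox_points_to_pixels(bbox, resolution):
    """Convert an (x0, top, x1, bottom) box from PDF points to pixels.

    PDF points are 1/72 inch, so a page rendered at `resolution` DPI is
    scaled by resolution / 72. The result matches the (left, upper,
    right, lower) box order that Pillow's Image.crop() expects.
    """
    scale = resolution / 72
    x0, top, x1, bottom = bbox
    return (round(x0 * scale), round(top * scale),
            round(x1 * scale), round(bottom * scale))


# Usage with Pillow and python-docx (both must be installed):
#   from PIL import Image
#   box = bbox_points_to_pixels((72, 144, 288, 360), resolution=200)
#   Image.open("page1.png").crop(box).save("figure1.png")
#   cell.paragraphs[0].add_run().add_picture("figure1.png", width=Cm(8))
```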

## Dependencies

Required:
- **pdfplumber**: PDF parsing and image extraction
- **pillow**: Image processing
- **python-docx**: Word document creation
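- **tesseract**: OCR engine, installed separately from the Python packages (e.g. `brew install tesseract` on macOS, `apt install tesseract-ocr` on Debian/Ubuntu)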

Install Python packages:
```bash
pip install pdfplumber pillow python-docx
# Or use uvx:
uvx --with pdfplumber --with pillow --with python-docx python script.py
```

## Tips

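- Render pages at 200 DPI or higher for better OCR accuracy; Tesseract's documentation recommends at least 300 DPI
- For scanned PDFs, OCR is required
- For text-based PDFs, pdfplumber can extract text directly with `page.extract_text()`, so OCR can be skipped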
- Compare final document with original to verify layout accuracy
- Bold/italic formatting must be applied manually based on visual inspection
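
The comparison with the original can be partly automated on the text level. The sketch below scores how closely the converted text matches the OCR (or extracted) text using the standard library's `difflib`; it says nothing about layout, and both function names are hypothetical.

```python
import difflib
import re


def normalize_words(text):
    """Lowercase the text and split it into words, ignoring whitespace runs."""
    return re.sub(r"\s+", " ", text).strip().lower().split()


def text_similarity(original_text, converted_text):
    """Return a 0.0 to 1.0 word-level similarity ratio between two texts."""
    matcher = difflib.SequenceMatcher(
        None,
        normalize_words(original_text),
        normalize_words(converted_text),
        autojunk=False,  # keep frequent words in the comparison
    )
    return matcher.ratio()


# Usage with python-docx. Text inside table cells is not part of
# doc.paragraphs, so collect it from the cells for two-column layouts:
#   doc = Document("output.docx")
#   parts = [p.text for p in doc.paragraphs]
#   parts += [c.text for t in doc.tables for r in t.rows for c in r.cells]
#   score = text_similarity(ocr_text, "\n".join(parts))
```

A low score is a prompt to look at the document again, not proof of a conversion error, since OCR noise affects both sides of the comparison.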
