---
name: pdf
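description: PDF processing toolkit. Extracts text and tables, creates new PDFs, merges and splits documents, rotates pages, and handles forms. Use this skill when you need to process, generate, or analyze PDF documents programmatically.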
---

# PDF Processing Guide

## Quick Start

```python
from pypdf import PdfReader, PdfWriter

# Read a PDF
reader = PdfReader("document.pdf")
print(f"Pages: {len(reader.pages)}")

# Extract text from every page
text = ""
for page in reader.pages:
    # Fall back to "" in case a page yields no text
    text += page.extract_text() or ""
```

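## Choosing a Python Library

| Task | Recommended library | Use for |
|------|--------|------|
| Basic operations | pypdf | Merge, split, rotate, metadata |
| Text extraction | pdfplumber | Text and table extraction |
| Creating PDFs | reportlab | Generating new PDFs |
| OCR for scanned documents | pytesseract | Recognizing text in images |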
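
The table lists pytesseract for scanned documents, so here is a minimal sketch of how OCR might fit in. It assumes the `pdf2image` package (not mentioned elsewhere in this guide) plus the poppler and tesseract binaries are installed; `needs_ocr` and `ocr_pdf` are hypothetical helper names, and the 20-character threshold is an arbitrary illustration.

```python
def needs_ocr(extracted_text, min_chars=20):
    """Heuristic: treat a page as scanned if almost no text was extracted."""
    return len((extracted_text or "").strip()) < min_chars


def ocr_pdf(path, dpi=300, lang="eng"):
    """Return a list with the OCR text of each page of a scanned PDF."""
    # Third-party imports live here so needs_ocr() works without them.
    from pdf2image import convert_from_path  # assumption: pdf2image + poppler
    import pytesseract  # requires the tesseract binary on PATH

    # Render each page to an image, then run OCR on it.
    images = convert_from_path(path, dpi=dpi)
    return [pytesseract.image_to_string(image, lang=lang) for image in images]
```

One way to use it: run `page.extract_text()` first and call `ocr_pdf` only when `needs_ocr` reports that the result is nearly empty, since OCR is much slower than direct extraction.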

## Common Operations

### Merge PDFs
```python
from pypdf import PdfWriter, PdfReader

writer = PdfWriter()
for pdf_file in ["doc1.pdf", "doc2.pdf"]:
    reader = PdfReader(pdf_file)
    for page in reader.pages:
        writer.add_page(page)

with open("merged.pdf", "wb") as output:
    writer.write(output)
```

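### Split a PDF
```python
from pypdf import PdfReader, PdfWriter

# Write every page of the input to its own single-page file
reader = PdfReader("input.pdf")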
for i, page in enumerate(reader.pages):
    writer = PdfWriter()
    writer.add_page(page)
    with open(f"page_{i+1}.pdf", "wb") as output:
        writer.write(output)
```
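
Splitting one page per file is not always what is wanted. The sketch below pulls a chosen set of pages into a single new file instead. The helper names `parse_page_ranges` and `extract_pages` and the `"1-3,5"` spec syntax are assumptions made for illustration, not part of pypdf.

```python
def parse_page_ranges(spec, total_pages):
    """Convert a 1-based spec such as "1-3,5" into zero-based page indices."""
    indices = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        start, _, end = part.partition("-")
        first = int(start)
        last = int(end) if end else first
        if not 1 <= first <= last <= total_pages:
            raise ValueError(f"invalid page range: {part!r}")
        indices.extend(range(first - 1, last))
    return indices


def extract_pages(src, dst, spec):
    """Copy the pages selected by spec from src into a new file dst."""
    # Imported here so the parser above has no third-party dependency.
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(src)
    writer = PdfWriter()
    for index in parse_page_ranges(spec, len(reader.pages)):
        writer.add_page(reader.pages[index])
    with open(dst, "wb") as output:
        writer.write(output)
```

For example, `extract_pages("input.pdf", "excerpt.pdf", "1-3,5")` would write pages 1, 2, 3, and 5 to `excerpt.pdf`.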

### Extract Tables
```python
import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    for page in pdf.pages:
        tables = page.extract_tables()
        for table in tables:
            for row in table:
                print(row)
```
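
Printing rows is fine for inspection, but extracted tables usually end up in a file. The sketch below serializes one table to CSV text. It assumes each table is a list of rows whose cells are strings or `None` (empty cells); `table_to_csv` is a hypothetical helper name.

```python
import csv
import io


def table_to_csv(table):
    """Serialize one extracted table (a list of rows) to CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in table:
        # Write None cells (assumed to mean "empty") as empty strings.
        writer.writerow(["" if cell is None else cell for cell in row])
    return buffer.getvalue()


# Sample data shaped like an extracted table
sample = [["Name", "Qty"], ["Widget", None], ["Gadget, large", "3"]]
print(table_to_csv(sample))
```

Using the `csv` module rather than joining cells with commas keeps cells that themselves contain commas or quotes intact.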

### Create a PDF
```python
from reportlab.lib.pagesizes import letter
from reportlab.pdfgen import canvas

c = canvas.Canvas("new.pdf", pagesize=letter)
c.drawString(100, 750, "Hello World!")
c.save()
```
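
`drawString` places a single line at fixed coordinates, so longer text needs explicit line spacing and page breaks. The following is a rough sketch under assumed layout values (start at y=750, stop at y=50, 14 points between lines); `layout_lines` and `write_text_pdf` are hypothetical helper names.

```python
def layout_lines(lines, top=750, bottom=50, leading=14):
    """Assign each line a (page_number, y) position, top to bottom."""
    per_page = (top - bottom) // leading + 1
    return [
        (i // per_page, top - (i % per_page) * leading)
        for i in range(len(lines))
    ]


def write_text_pdf(path, lines, left=100):
    """Write lines of text to a letter-size PDF, adding pages as needed."""
    # Imported here so layout_lines() works without reportlab installed.
    from reportlab.lib.pagesizes import letter
    from reportlab.pdfgen import canvas

    c = canvas.Canvas(path, pagesize=letter)
    current_page = 0
    for line, (page, y) in zip(lines, layout_lines(lines)):
        if page != current_page:
            c.showPage()  # finish the current page and start a new one
            current_page = page
        c.drawString(left, y, line)
    c.save()
```

This does not wrap long lines; each string is drawn as-is, so text wider than the page would run off the right edge.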

### Rotate Pages
```python
from pypdf import PdfReader, PdfWriter

reader = PdfReader("input.pdf")
writer = PdfWriter()

page = reader.pages[0]
page.rotate(90)  # rotate 90 degrees clockwise
writer.add_page(page)

with open("rotated.pdf", "wb") as output:
    writer.write(output)
```
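
The example above rotates only the first page. A sketch for rotating every page might look like the following; `normalize_rotation` and `rotate_all` are hypothetical helper names, and the up-front check reflects the assumption that `page.rotate` only accepts multiples of 90.

```python
def normalize_rotation(angle):
    """Map an angle to 0, 90, 180 or 270; reject non-multiples of 90."""
    if angle % 90 != 0:
        raise ValueError("rotation must be a multiple of 90 degrees")
    return angle % 360


def rotate_all(src, dst, angle):
    """Rotate every page of src clockwise by angle and write it to dst."""
    # Imported here so normalize_rotation() has no third-party dependency.
    from pypdf import PdfReader, PdfWriter

    rotation = normalize_rotation(angle)
    reader = PdfReader(src)
    writer = PdfWriter()
    for page in reader.pages:
        if rotation:  # skip the call entirely for a 0-degree rotation
            page.rotate(rotation)
        writer.add_page(page)
    with open(dst, "wb") as output:
        writer.write(output)
```

Normalizing first means a counterclockwise request such as `-90` is handled as the equivalent clockwise `270`.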

## Command-Line Tools

```bash
# Extract text (poppler-utils)
pdftotext input.pdf output.txt
pdftotext -layout input.pdf output.txt  # preserve layout

# Merge PDFs (qpdf)
qpdf --empty --pages file1.pdf file2.pdf -- merged.pdf

# Extract pages 1-5 ("." refers to the input file)
qpdf input.pdf --pages . 1-5 -- pages1-5.pdf
```
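
When these tools are driven from Python, passing arguments as a list avoids shell quoting problems with file names. Here is a minimal sketch that wraps the qpdf merge command shown above; `qpdf_merge_command` and `merge_with_qpdf` are hypothetical helper names.

```python
import shutil
import subprocess


def qpdf_merge_command(inputs, output):
    """Build the argument list for merging inputs into output with qpdf."""
    if not inputs:
        raise ValueError("at least one input file is required")
    return ["qpdf", "--empty", "--pages", *inputs, "--", output]


def merge_with_qpdf(inputs, output):
    """Run qpdf to merge inputs into output; raise if qpdf is unavailable."""
    if shutil.which("qpdf") is None:
        raise RuntimeError("qpdf is not installed or not on PATH")
    # check=True raises CalledProcessError on a non-zero exit status.
    subprocess.run(qpdf_merge_command(inputs, output), check=True)
```

Keeping command construction separate from execution makes the argument list easy to log or inspect before anything is run.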

## Quick Reference

| Task | Code |
|------|------|
| Merge | `writer.add_page(page)` |
| Split | Save each page to its own file |
| Extract text | `page.extract_text()` |
| Extract tables | `page.extract_tables()` |
| Create | `canvas.Canvas()` |
| Rotate | `page.rotate(90)` |