When & Why to Use This Skill
This Claude skill is a PDF processing toolkit for programmatic document manipulation and data extraction. It can extract text and tables, merge or split PDF files, rotate pages, and generate new documents from scratch using the Python libraries pypdf, pdfplumber, and reportlab.
Use Cases
- Automated Data Extraction: Extracting structured text and complex tables from financial reports, invoices, or academic papers for data analysis.
- Document Management & Assembly: Merging multiple PDF reports into a single file or splitting large documents into individual pages for easier distribution.
- Programmatic PDF Generation: Creating dynamic, custom PDF documents such as automated invoices, certificates, or personalized business reports.
- Document Correction & Formatting: Rotating misaligned pages, handling PDF forms, and modifying document metadata programmatically.
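| name | pdf |
|---|---|
| description | PDF processing toolkit. Extracts text and tables, creates new PDFs, merges and splits documents, rotates pages, and handles forms. Use this skill when you need to process, generate, or analyze PDF documents programmatically. |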
PDF Processing Guide
Quick Start
from pypdf import PdfReader, PdfWriter
# Read the PDF
reader = PdfReader("document.pdf")
print(f"Pages: {len(reader.pages)}")
# Extract text
text = ""
for page in reader.pages:
    text += page.extract_text()
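Scanned pages usually have no text layer, so extract_text() tends to return an empty string for them. The sketch below shows one way to flag such pages so they can be sent to OCR instead. `blank_pages` is a hypothetical helper name, and it works on a plain list of per-page strings rather than on pypdf objects.

```python
def blank_pages(page_texts):
    """Return the 1-based numbers of pages whose extracted text is empty.

    `page_texts` is assumed to be a list such as
    [page.extract_text() for page in reader.pages].
    """
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if not (text or "").strip()  # treat None, "" and whitespace as blank
    ]


# Made-up sample data standing in for real extraction results.
sample = ["First page", "   ", "", "Last page"]

# Joining with newlines keeps words at page boundaries from running together.
full_text = "\n".join(text for text in sample if text.strip())
```

With the sample above, pages 2 and 3 would be reported as blank and left out of `full_text`.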
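Choosing a Python Library
| Task | Recommended library | Purpose |
|---|---|---|
| Basic operations | pypdf | Merge, split, rotate, metadata |
| Text extraction | pdfplumber | Text and table extraction |
| Creating PDFs | reportlab | Generate new PDFs |
| OCR for scanned documents | pytesseract | Recognize text in images |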
Common Operations
Merging PDFs
from pypdf import PdfWriter, PdfReader
# Append every page of each input file to a single writer
writer = PdfWriter()
for pdf_file in ["doc1.pdf", "doc2.pdf"]:
    reader = PdfReader(pdf_file)
    for page in reader.pages:
        writer.add_page(page)
with open("merged.pdf", "wb") as output:
    writer.write(output)
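When the input list comes from a directory listing rather than being typed by hand, plain string sorting puts "doc10.pdf" before "doc2.pdf", so the merged pages end up in the wrong order. A minimal sketch of a numeric-aware sort key follows. `natural_key` is a hypothetical helper, not part of pypdf, and the file names are made up.

```python
import re


def natural_key(filename):
    """Sort key that compares embedded numbers numerically."""
    # Split into digit and non-digit runs: "doc10.pdf" -> ["doc", "10", ".pdf"]
    parts = re.split(r"(\d+)", filename)
    return [int(part) if part.isdigit() else part.lower() for part in parts]


files = ["doc10.pdf", "doc2.pdf", "doc1.pdf"]
ordered = sorted(files, key=natural_key)
```

The `ordered` list can then replace the hard-coded file list in the merge loop.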
Splitting PDFs
from pypdf import PdfReader, PdfWriter
# Write each page to its own single-page file
reader = PdfReader("input.pdf")
for i, page in enumerate(reader.pages):
    writer = PdfWriter()
    writer.add_page(page)
    with open(f"page_{i+1}.pdf", "wb") as output:
        writer.write(output)
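Splitting often means pulling out a page range rather than every page. The sketch below turns a range string written with 1-based page numbers into the 0-based indices that `reader.pages` expects. `parse_page_ranges` and its range syntax ("1-3,5", with "4-" meaning "page 4 to the end") are assumptions made for illustration, not a pypdf feature.

```python
def parse_page_ranges(spec, total_pages):
    """Convert a 1-based range string such as "1-3,5" to 0-based page indices."""
    indices = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        start_text, sep, end_text = part.partition("-")
        start = int(start_text)
        if not sep:
            end = start            # single page, e.g. "5"
        elif end_text:
            end = int(end_text)    # closed range, e.g. "1-3"
        else:
            end = total_pages      # open range, e.g. "4-"
        if not 1 <= start <= end <= total_pages:
            raise ValueError(f"invalid page range {part!r} for {total_pages} pages")
        indices.extend(range(start - 1, end))
    return indices


selected = parse_page_ranges("1-3,5", total_pages=6)
```

Each index in `selected` can be passed to `writer.add_page(reader.pages[index])` to build the extracted document, with `total_pages` taken from `len(reader.pages)`.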
Extracting Tables
import pdfplumber
# Print every row of every table found on each page
with pdfplumber.open("document.pdf") as pdf:
    for page in pdf.pages:
        tables = page.extract_tables()
        for table in tables:
            for row in table:
                print(row)
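Printed rows are hard to reuse, so extracted tables are commonly saved as CSV for later analysis. The sketch below assumes each table is a list of rows whose cells are strings or None for empty cells, which is how `extract_tables()` output generally looks. `table_to_csv` is a hypothetical helper and the sample data is made up.

```python
import csv
import io


def table_to_csv(table):
    """Serialize one extracted table (a list of rows) to CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in table:
        # Empty cells may arrive as None; write them as empty strings.
        writer.writerow(["" if cell is None else str(cell).strip() for cell in row])
    return buffer.getvalue()


sample_table = [["Item", "Qty"], ["Widget", None], [" Gadget ", "3"]]
csv_text = table_to_csv(sample_table)
```

Using the csv module rather than joining with commas keeps cells that themselves contain commas or quotes correctly escaped.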
Creating PDFs
from reportlab.lib.pagesizes import letter
from reportlab.pdfgen import canvas
c = canvas.Canvas("new.pdf", pagesize=letter)
c.drawString(100, 750, "Hello World!")  # x, y in points, measured from the bottom-left corner
c.save()
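The coordinates passed to drawString are in points (1/72 inch) with the origin at the bottom-left corner of the page, which is why y=750 lands near the top of a 792-point-tall letter page. A small sketch of converting from the more familiar "inches from the top-left corner" follows. `from_top_left` and the constants are illustrative names, not reportlab API.

```python
POINTS_PER_INCH = 72
LETTER_WIDTH, LETTER_HEIGHT = 612, 792  # US Letter in points (8.5 x 11 inches)


def from_top_left(x_inches, y_inches, page_height=LETTER_HEIGHT):
    """Map an offset from the top-left corner (in inches) to canvas points."""
    x = x_inches * POINTS_PER_INCH
    y = page_height - y_inches * POINTS_PER_INCH  # flip the vertical axis
    return x, y


# One inch in from the left edge and one inch down from the top edge.
x, y = from_top_left(1, 1)
```

The resulting pair can be passed directly as the first two arguments of `c.drawString(x, y, "...")`.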
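Rotating Pages
from pypdf import PdfReader, PdfWriter
reader = PdfReader("input.pdf")
writer = PdfWriter()
page = reader.pages[0]
page.rotate(90)  # rotate 90 degrees clockwise
writer.add_page(page)  # only the first page is written to the output
with open("rotated.pdf", "wb") as output:
    writer.write(output)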
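The rotate call expects a multiple of 90 degrees, so angles coming from user input are worth validating first. The sketch below reduces any valid angle to 0, 90, 180, or 270 and rejects everything else. `normalize_rotation` is a hypothetical helper, not a pypdf function.

```python
def normalize_rotation(angle):
    """Reduce a clockwise angle in degrees to 0, 90, 180, or 270."""
    if angle % 90 != 0:
        raise ValueError(f"rotation must be a multiple of 90 degrees, got {angle}")
    return angle % 360  # Python's % always returns a non-negative result here


# A quarter turn counter-clockwise is the same as three quarter turns clockwise.
clockwise = normalize_rotation(-90)
```

The normalized value can then be passed to `page.rotate(...)`, and a result of 0 means the page can be copied unchanged.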
Command-Line Tools
# Extract text (poppler-utils)
pdftotext input.pdf output.txt
pdftotext -layout input.pdf output.txt  # preserve layout
# Merge PDFs (qpdf)
qpdf --empty --pages file1.pdf file2.pdf -- merged.pdf
# Extract pages 1-5
qpdf input.pdf --pages . 1-5 -- pages1-5.pdf
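These tools can also be driven from Python when a script needs them. The sketch below builds the same qpdf merge command shown above as an argument list and runs it with subprocess. `qpdf_merge_command` and `run_merge` are hypothetical wrapper names, and the sketch assumes qpdf is installed and on the PATH.

```python
import subprocess


def qpdf_merge_command(inputs, output):
    """Build the argument list for: qpdf --empty --pages <inputs...> -- <output>."""
    if not inputs:
        raise ValueError("at least one input file is required")
    return ["qpdf", "--empty", "--pages", *inputs, "--", output]


def run_merge(inputs, output):
    """Run the merge; raises CalledProcessError on a non-zero exit status."""
    subprocess.run(qpdf_merge_command(inputs, output), check=True)


command = qpdf_merge_command(["file1.pdf", "file2.pdf"], "merged.pdf")
```

Passing a list instead of a shell string avoids quoting problems with file names that contain spaces.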
Quick Reference
| Task | Code |
|---|---|
| Merge | writer.add_page(page) |
| Split | Save each page separately |
| Extract text | page.extract_text() |
| Extract tables | page.extract_tables() |
| Create | canvas.Canvas() |
| Rotate | page.rotate(90) |