pdf-tools

from caiopizzol

Search and extract content from PDF files. Use when searching PDFs, finding text in documents, or extracting specific pages without reading the entire file.

Updated Jan 7, 2026

When & Why to Use This Skill

This Claude skill searches PDF documents and extracts specific content without loading entire files into the model's context window. It relies on the command-line utilities pdfgrep and poppler-utils (pdftotext, pdfinfo) to locate keywords, extract exact page ranges, and read metadata. Because only the relevant pages reach the model, token usage drops and large documents stay within context limits, which makes the skill useful for developers and researchers analyzing large document collections.

Use Cases

  • Case 1: Searching for specific technical specifications or legal clauses across massive PDF manuals without exceeding context limits.
  • Case 2: Extracting targeted page ranges or specific sections from long reports to focus analysis on relevant data points.
  • Case 3: Automating the retrieval of document metadata, such as page counts, to organize and categorize large digital libraries.
  • Case 4: Isolating text from specific pages to feed into downstream LLM workflows for summarization or data cleaning.
name: pdf-tools
description: Search and extract content from PDF files. Use when searching PDFs, finding text in documents, or extracting specific pages without reading the entire file.
allowed-tools: Bash, Read, Glob

PDF Tools

Search and extract content from PDFs without loading entire files into context.

Installation

# macOS
brew install pdfgrep poppler

# Ubuntu/Debian
sudo apt install pdfgrep poppler-utils

Quick Reference

Task                       Command
Search                     pdfgrep "term" file.pdf
Search with page numbers   pdfgrep -n "term" file.pdf
Search with context        pdfgrep -n -C 2 "term" file.pdf
Get page count             pdfinfo file.pdf | grep Pages
Extract pages 5-10         pdftotext -f 5 -l 10 file.pdf -

Core Workflow

Step 1: Search - Find where content lives

pdfgrep -n -i "authentication" large-manual.pdf
# Output (page number, then the matching line):
#         42: User authentication requires...
#         45: Authentication tokens expire...

Step 2: Extract - Get just those pages, plus one page of margin on each side

pdftotext -f 41 -l 46 large-manual.pdf -
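The jump from matching pages (42 and 45) to an extraction range (41 to 46) can be automated. The following is a minimal sketch, not part of pdfgrep or poppler: `page_range` is a hypothetical helper that assumes each line of `pdfgrep -n` output for a single file starts with the page number followed by a colon. It does not clamp the upper bound to the document's page count.

```shell
# Hypothetical helper: reads "PAGE:matched text" lines on stdin and prints
# "FIRST LAST", widened by PAD pages on each side (FIRST never drops below 1).
page_range() {
  pad="${1:-1}"
  awk -F: -v pad="$pad" '
    $1 ~ /^[0-9]+$/ {
      if (min == "" || $1 + 0 < min) min = $1 + 0
      if ($1 + 0 > max) max = $1 + 0
    }
    END {
      if (min == "") exit 1            # no page numbers seen on stdin
      first = min - pad
      if (first < 1) first = 1
      print first, max + pad
    }'
}

# Example using the two matches from Step 1:
printf '42: User authentication requires...\n45: Authentication tokens expire...\n' | page_range 1
# prints: 41 46

# Possible use with the real tools (not run here):
# range=$(pdfgrep -n -i "authentication" large-manual.pdf | page_range 1) &&
#   pdftotext -f "${range% *}" -l "${range#* }" large-manual.pdf -
```

Keeping the parsing separate from the pdfgrep call means the helper can be checked with plain text input before it touches a real PDF.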

Search Commands

# Basic search
pdfgrep "search term" document.pdf

# Case-insensitive
pdfgrep -i "search term" document.pdf

# With page numbers
pdfgrep -n "search term" document.pdf

# With context (2 lines before/after)
pdfgrep -n -C 2 "search term" document.pdf

# Count occurrences
pdfgrep -c "search term" document.pdf

# Search all PDFs in directory
pdfgrep -r "term" /path/to/pdfs/
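When searching a whole directory, it can help to see which files are worth opening first. The sketch below assumes that combining `-r` with `-c` prints one `FILE:COUNT` line per PDF, as grep does; verify this against your pdfgrep version. `rank_by_count` is a hypothetical helper name, not a pdfgrep feature.

```shell
# Hypothetical helper: reads "FILE:COUNT" lines on stdin, drops files with
# zero matches, and prints "COUNT<TAB>FILE" sorted by count, highest first.
rank_by_count() {
  awk -F: '$NF ~ /^[0-9]+$/ && $NF > 0 {
    count = $NF
    sub(/:[0-9]+$/, "")                # strip the trailing ":COUNT", keep the path
    print count "\t" $0
  }' | sort -rn
}

# Example with made-up counts for three files:
printf 'a.pdf:0\nb.pdf:3\nspecs/c.pdf:12\n' | rank_by_count
# prints:
# 12	specs/c.pdf
# 3	b.pdf

# Possible use with the real tool (not run here):
# pdfgrep -r -c "term" /path/to/pdfs/ | rank_by_count | head -5
```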

Extract Commands

# Extract specific page range
pdftotext -f 10 -l 15 document.pdf -

# Extract single page
pdftotext -f 42 -l 42 document.pdf -

# Preserve layout (for tables)
pdftotext -layout -f 10 -l 10 document.pdf -

# Extract and limit output
pdftotext -f 10 -l 15 document.pdf - | head -50

Metadata

# Get page count
pdfinfo document.pdf | grep Pages

# Full metadata
pdfinfo document.pdf
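`grep Pages` matches any line that contains the word, so a document whose title includes "Pages" would produce two lines. If a bare number is needed, for example to cap an extraction range, a stricter filter may be preferable. The sketch below uses a hypothetical helper, `page_count`, and assumes pdfinfo prints the count on a line whose first field is exactly `Pages:`.

```shell
# Hypothetical helper: reads pdfinfo output on stdin and prints only the
# number from the line whose first field is exactly "Pages:".
page_count() {
  awk '$1 == "Pages:" { print $2 }'
}

# Example with sample pdfinfo-style output (values are made up):
printf 'Title:          Yellow Pages Guide\nPages:          120\nPage size:      612 x 792 pts (letter)\n' | page_count
# prints: 120

# Possible use with the real tool (not run here):
# total=$(pdfinfo document.pdf | page_count)
```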

Troubleshooting

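Empty output from pdftotext: The PDF is most likely image-based (scanned) and has no text layer. These tools only work on PDFs that contain embedded text.

pdfgrep missing matches: Searches are case-sensitive by default, so try case-insensitive (-i). Also check that the PDF has selectable text.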
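Both problems come down to whether the PDF has a text layer, which can be checked before searching. The sketch below uses a hypothetical helper, `has_text`, that succeeds only if its input contains at least one non-whitespace character. It assumes that an image-only PDF yields nothing but whitespace and page-break characters from pdftotext.

```shell
# Hypothetical helper: exit status 0 if stdin contains any non-whitespace
# character, non-zero otherwise. Form feeds count as whitespace.
has_text() {
  grep -q '[^[:space:]]'
}

# Example: output consisting only of a form feed and blank lines.
printf '\f\n   \n' | has_text || echo "no text layer found"
# prints: no text layer found

# Possible use with the real tool (not run here), sampling the first pages:
# pdftotext -f 1 -l 3 document.pdf - | has_text || echo "likely a scanned PDF"
```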