gemini-pdf

from odysseus0

Process multimodal documents using Gemini CLI, leveraging Gemini's multimodal capabilities. Use for PDFs, scanned documents, image-heavy documents, or any file where visual understanding matters. Ideal for extracting content from complex layouts, tables, diagrams, handwritten notes, or mixed text/image documents. Triggers on PDF processing, document extraction, "use Gemini for this", or when a document has visual complexity that benefits from multimodal understanding.

Updated Jan 7, 2026

When & Why to Use This Skill

The gemini-pdf skill leverages the Gemini CLI to provide advanced multimodal document processing capabilities. It is specifically designed to handle complex PDFs, scanned documents, and image-heavy files where traditional text extraction fails. By utilizing Gemini's visual understanding, it accurately extracts content from intricate layouts, tables, diagrams, and even handwritten notes, converting them into structured formats like Markdown.

Use Cases

  • Faithful Conversion: Converting scanned or complex multi-column PDFs into clean Markdown while preserving headers, lists, and formatting.
  • Table Extraction: Automatically identifying and extracting data from complex tables within financial reports or technical manuals into Markdown tables.
  • Visual Content Analysis: Describing and interpreting diagrams, charts, and figures found in academic papers or architectural documents.
  • Digitizing Handwritten Records: Extracting text from handwritten notes or legacy scanned forms that require high-quality OCR and contextual understanding.
  • Structured Data Harvesting: Using targeted prompts to extract named fields and data points from mixed-media documents for database entry.

Gemini Document Processing

Delegate document processing to Gemini CLI for superior multimodal understanding. Use when documents have visual complexity - layouts, tables, diagrams, scans, mixed content.

Workspace Restriction

Gemini CLI sandboxes file access to the current working directory. For files outside that directory (here, the vault), run Gemini from the file's own directory:

# Use subshell to preserve cwd
(cd /path/to/files && gemini "Summarize: ./document.pdf")

The --include-directories flag exists but doesn't work reliably. Running from the target directory is the workaround.
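
The subshell workaround can be wrapped in a small function. This is a minimal sketch: the name `gemini_in_dir` is hypothetical and not part of Gemini CLI, and it assumes only the `gemini "<prompt>"` invocation shown above.

```shell
# Hypothetical helper (not a Gemini CLI feature): run gemini from the
# directory that contains the file, so the file sits inside the workspace.
# The subshell means the caller's working directory is left untouched.
gemini_in_dir() {
  prompt=$1
  file=$2
  dir=$(dirname -- "$file")
  base=$(basename -- "$file")
  (cd -- "$dir" && gemini "$prompt: ./$base")
}

# Usage:
#   gemini_in_dir "Summarize" /path/to/files/document.pdf
```

The function passes the file as a relative `./name` path, matching the form used in the example above.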

Basic Usage

Reference file paths directly in your prompt - Gemini reads them via its file system tools:

gemini "Convert this to markdown: /path/to/document.pdf"

# Save output
gemini "Convert to markdown: /path/to/doc.pdf" > output.md
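
One caveat with plain redirection: the shell truncates `output.md` before `gemini` starts, so a failed run can replace an earlier result with an empty file. The sketch below writes to a temporary file first. `gemini_save` is a hypothetical helper, and it assumes Gemini CLI exits non-zero on failure, which is worth checking against your installed version.

```shell
# Hypothetical helper: only overwrite the target file when gemini
# succeeds and actually produced output.
gemini_save() {
  prompt=$1
  target=$2
  scratch=$(mktemp) || return 1
  if gemini "$prompt" > "$scratch" && [ -s "$scratch" ]; then
    cat "$scratch" > "$target"
    rm -f "$scratch"
  else
    rm -f "$scratch"
    echo "gemini produced no output; $target left unchanged" >&2
    return 1
  fi
}

# Usage:
#   gemini_save "Convert to markdown: ./doc.pdf" output.md
```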

Common Tasks

Faithful Conversion:

gemini "Convert this PDF to clean markdown. Preserve all content including headers, lists, tables. Output only markdown, no commentary: /path/to/document.pdf"

Table Extraction:

gemini "Extract all tables as markdown tables: /path/to/document.pdf"

Structured Extraction:

gemini "Extract and structure as markdown:
- [list the fields you want]
- [be specific about format]
File: /path/to/document.pdf"
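
As an illustration of filling in the bracketed placeholders, here is a sketch for an invoice. The field names and formats are hypothetical examples, not fields the skill prescribes, and `build_extraction_prompt` is an illustrative helper name.

```shell
# Hypothetical example: build a structured-extraction prompt for an invoice.
# Replace the field list with whatever your document actually contains.
build_extraction_prompt() {
  cat <<EOF
Extract and structure as markdown:
- Invoice number (plain text)
- Invoice date (YYYY-MM-DD)
- Total amount (number followed by currency code)
Output only markdown, no commentary.
File: $1
EOF
}

# Usage:
#   gemini "$(build_extraction_prompt /path/to/document.pdf)"
```

Keeping the prompt in a function makes it easy to reuse the same field list across several files.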

Diagram/Image Description:

gemini "Describe the diagrams and figures in this document: /path/to/document.pdf"

When to Use Gemini vs Other Tools

Use Gemini:

  • Scanned documents / OCR needed
  • Complex layouts (multi-column, mixed content)
  • Tables, diagrams, charts
  • Handwritten content
  • Image-heavy documents

Use pypdf/pdfplumber:

  • Simple text-only PDFs
  • Programmatic batch processing
  • When you need raw text extraction without interpretation
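
The choice above can be roughly automated by checking whether a PDF already has a usable text layer. This is a heuristic sketch only: it assumes `pdftotext` from Poppler is installed (a tool this skill does not otherwise mention), the 200-character threshold is an arbitrary default, and `pick_pdf_tool` is a hypothetical name. It says nothing about layout complexity, so a text-rich PDF with intricate tables may still be better handled by Gemini.

```shell
# Hypothetical heuristic: count non-whitespace characters in the PDF's
# embedded text layer. Little or no text suggests a scan, so route to gemini.
# $1 = PDF path, $2 = optional minimum character count (default 200).
pick_pdf_tool() {
  chars=$(pdftotext "$1" - 2>/dev/null | tr -d '[:space:]' | wc -c | tr -d ' ')
  if [ "${chars:-0}" -ge "${2:-200}" ]; then
    echo "text-extraction"   # e.g. pypdf or pdfplumber
  else
    echo "gemini"
  fi
}

# Usage:
#   pick_pdf_tool /path/to/document.pdf
```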