docling
Document reading and conversion using Docling. Use this skill when the user asks to read, open, or process document files in these formats: PDF, DOCX, PPTX, XLSX, HTML, Markdown, AsciiDoc, or images (PNG, JPG, TIFF). Supports OCR for scanned documents. Trigger when:
(1) The user asks to read or open a document file (e.g., "このPDFを読んで" ("read this PDF"), "read this document", "ファイルの内容を確認して" ("check the contents of the file"))
(2) The file extension is .pdf, .docx, .pptx, .xlsx, .html, .md, .adoc, .png, .jpg, or .tiff
(3) The user wants to extract text from scanned documents with OCR
(4) The user wants to convert documents to Markdown, JSON, or HTML
(5) The user wants to process documents with tables, figures, or photos
(6) The user wants to extract images or figures from documents
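Trigger condition (2) amounts to a file-extension check. The following is a minimal sketch of that check, not part of the skill itself: the function name should_use_docling is hypothetical, and the .jpeg and .tif aliases are an assumption that does not appear in the list above.

```python
from pathlib import Path

# Extensions named in the skill description.
DOCLING_EXTENSIONS = {
    ".pdf", ".docx", ".pptx", ".xlsx", ".html", ".md", ".adoc",
    ".png", ".jpg", ".tiff",
}
# Assumption: common aliases that the description does not list.
ASSUMED_ALIASES = {".jpeg", ".tif"}


def should_use_docling(path: str) -> bool:
    """Return True when the file extension matches a listed format."""
    suffix = Path(path).suffix.lower()  # compare case-insensitively
    return suffix in DOCLING_EXTENSIONS or suffix in ASSUMED_ALIASES
```

The comparison is case-insensitive, so report.PDF and report.pdf are treated the same.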
When & Why to Use This Skill
This Claude skill uses the Docling library to convert unstructured documents, including PDF, DOCX, PPTX, and image files, into structured Markdown, JSON, and HTML. It handles multi-column layouts, table extraction, and embedded figures, which makes it useful for preparing data for RAG (Retrieval-Augmented Generation) pipelines and automated data analysis.
Use Cases
- RAG Pipeline Optimization: Convert large PDF repositories into structured Markdown to improve the accuracy of LLM retrieval and context processing.
- Automated Financial Analysis: Extract complex tables from financial reports or spreadsheets and convert them into structured JSON for programmatic data processing.
- Legacy Document Digitization: Use OCR (EasyOCR or Tesseract) to extract text and diagrams from scanned PNG, JPG, or TIFF files into editable formats.
- Technical Documentation Conversion: Convert technical manuals from DOCX or AsciiDoc into HTML or Markdown while preserving image references and document structure.
- Batch Document Processing: Convert many documents in one run, and split large files (50+ pages) into page batches with a generated batch script.
Docling Document Conversion
Convert documents (PDF, DOCX, PPTX, HTML, Markdown, etc.) to structured formats with image extraction.
Installation
pip install docling
# For page range processing (optional but recommended)
pip install pymupdf
# For Tesseract OCR (optional):
# macOS: brew install tesseract
# Ubuntu: apt-get install tesseract-ocr
Quick Start
CLI
# Basic conversion (the docling CLI names the Markdown format "md")
docling document.pdf --to md
# With OCR for scanned documents
docling scanned.pdf --ocr --ocr-engine easyocr --to md
# Batch conversion
docling file1.pdf file2.docx --output ./converted
Python
from docling.document_converter import DocumentConverter
converter = DocumentConverter()
result = converter.convert("document.pdf")
print(result.document.export_to_markdown())
Using the Enhanced Conversion Script
Execute scripts/convert_document.py for advanced conversions with image extraction:
# Basic PDF to Markdown with image extraction
python scripts/convert_document.py document.pdf -o ./output
# All formats (Markdown, JSON, HTML) with images
python scripts/convert_document.py document.pdf -o ./output -f all
# With OCR (Japanese + English)
python scripts/convert_document.py scanned.pdf -o ./output --ocr --languages ja en
# High accuracy table extraction
python scripts/convert_document.py document.pdf -o ./output --table-mode accurate
# Process specific page range
python scripts/convert_document.py large.pdf -o ./output --pages 1-20
# Generate batch script for large files (50+ pages)
python scripts/convert_document.py large.pdf -o ./output --generate-script
Output Structure
output/
├── document.md          # Markdown with embedded image links
├── document.json        # Structured JSON data
├── document.html        # HTML output (optional)
└── images/
    ├── figure_001.png   # Diagrams, charts as PNG
    ├── figure_002.png
    ├── photo_001.jpg    # Photos as JPEG
    └── photo_002.jpg
Script Options
| Option | Description | Default |
|---|---|---|
| -o, --output | Output directory (required) | - |
| -f, --format | Output format (markdown, json, html, all) | markdown |
| --ocr | Enable OCR for scanned documents | disabled |
| --ocr-engine | OCR engine (easyocr, tesseract) | easyocr |
| --languages | OCR languages | en ja |
| --table-mode | Table extraction (fast, accurate) | fast |
| --pages | Page range (e.g., 1-20) | all pages |
| --generate-script | Generate batch script for large files | - |
| --batch-size | Pages per batch | 20 |
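For orientation, the options in the table could be declared with argparse roughly as follows. This is a sketch reconstructed from the table alone, not the actual source of scripts/convert_document.py; the positional argument name input and the function name build_parser are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the options from the table (hypothetical reconstruction)."""
    parser = argparse.ArgumentParser(prog="convert_document.py")
    parser.add_argument("input", help="Document to convert (assumed name)")
    parser.add_argument("-o", "--output", required=True, help="Output directory")
    parser.add_argument("-f", "--format", default="markdown",
                        choices=["markdown", "json", "html", "all"])
    parser.add_argument("--ocr", action="store_true",
                        help="Enable OCR for scanned documents")
    parser.add_argument("--ocr-engine", default="easyocr",
                        choices=["easyocr", "tesseract"])
    parser.add_argument("--languages", nargs="+", default=["en", "ja"])
    parser.add_argument("--table-mode", default="fast",
                        choices=["fast", "accurate"])
    parser.add_argument("--pages", default=None, help="Page range, e.g. 1-20")
    parser.add_argument("--generate-script", action="store_true")
    parser.add_argument("--batch-size", type=int, default=20)
    return parser


# Parse the OCR example from the usage section above.
args = build_parser().parse_args(
    ["scanned.pdf", "-o", "./output", "--ocr", "--languages", "ja", "en"]
)
```

Options that are not passed keep the defaults from the table, so args.format is "markdown" and args.batch_size is 20 in this call.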
Large File Handling
For files with 50+ pages or 50+ MB:
Option 1: Page Range Processing
# Process pages 1-20
python scripts/convert_document.py large.pdf -o ./output --pages 1-20
# Then process pages 21-40
python scripts/convert_document.py large.pdf -o ./output2 --pages 21-40
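A value such as 1-20 has to be turned into a start and end page before it can be used. The helper below is one way to do that; parse_page_range is a hypothetical name, pages are assumed to be 1-based and inclusive, and accepting a single page number is an assumption that the script may not share.

```python
def parse_page_range(spec: str) -> tuple[int, int]:
    """Parse "START-END" (or a single page) into a 1-based inclusive pair."""
    start_text, sep, end_text = spec.strip().partition("-")
    start = int(start_text)                 # raises ValueError on bad input
    end = int(end_text) if sep else start   # "7" means pages 7 to 7
    if start < 1 or end < start:
        raise ValueError(f"invalid page range: {spec!r}")
    return start, end
```

Reversed ranges such as 20-1 and non-numeric input raise ValueError instead of silently selecting no pages.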
Option 2: Generate Batch Script
# Generate a batch processing script
python scripts/convert_document.py large.pdf -o ./output --generate-script --batch-size 20
# Run the generated script
python ./output/batch_process.py
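The contents of the generated batch_process.py are not shown here, but the batching idea can be sketched as follows. Both function names and the part_NN output directory naming are hypothetical; only the --pages and -o flags come from the options table.

```python
def page_batches(total_pages: int, batch_size: int = 20) -> list[tuple[int, int]]:
    """Split pages 1..total_pages into inclusive (start, end) batches."""
    if total_pages < 0 or batch_size < 1:
        raise ValueError("total_pages must be >= 0 and batch_size >= 1")
    return [
        (start, min(start + batch_size - 1, total_pages))
        for start in range(1, total_pages + 1, batch_size)
    ]


def batch_commands(pdf: str, out_dir: str, total_pages: int,
                   batch_size: int = 20) -> list[str]:
    """Build one convert_document.py command line per page batch."""
    return [
        f"python scripts/convert_document.py {pdf} "
        f"-o {out_dir}/part_{index:02d} --pages {start}-{end}"
        for index, (start, end) in enumerate(
            page_batches(total_pages, batch_size), start=1
        )
    ]
```

Each batch writes to its own directory, which mirrors Option 1 above, where the second range goes to a separate output folder so the results do not overwrite each other.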
Advanced Configuration
OCR Setup
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions, EasyOcrOptions
pipeline = PdfPipelineOptions()
pipeline.do_ocr = True
pipeline.ocr_options = EasyOcrOptions(
    lang=["ja", "en"],  # Languages for OCR
    confidence_threshold=0.5
)
converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline)}
)
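When switching to Tesseract, note that it names languages differently from EasyOCR: EasyOCR uses two-letter codes such as ja and en, while Tesseract uses traineddata names such as jpn and eng. The sketch below maps between the two for a few languages. The mapping table and both function names are illustrative, and the use of TesseractCliOcrOptions (the option class for the system tesseract binary) is an assumption about the best fit for the install steps above.

```python
# Partial, illustrative mapping: EasyOCR code -> Tesseract traineddata name.
EASYOCR_TO_TESSERACT = {"en": "eng", "ja": "jpn", "de": "deu", "fr": "fra"}


def to_tesseract_langs(codes: list[str]) -> list[str]:
    """Translate EasyOCR-style codes, failing loudly on unknown ones."""
    try:
        return [EASYOCR_TO_TESSERACT[code] for code in codes]
    except KeyError as exc:
        raise ValueError(f"no Tesseract mapping for {exc.args[0]!r}") from exc


def build_tesseract_options(codes: list[str]):
    """Build Tesseract OCR options for Docling (requires docling installed)."""
    # Imported lazily so the mapping helper works without docling.
    from docling.datamodel.pipeline_options import TesseractCliOcrOptions
    return TesseractCliOcrOptions(lang=to_tesseract_langs(codes))
```

The result of build_tesseract_options(["ja", "en"]) would be assigned to pipeline.ocr_options in place of the EasyOcrOptions instance.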
Table Extraction
from docling.datamodel.pipeline_options import TableFormerMode
pipeline.do_table_structure = True
pipeline.table_structure_options.mode = TableFormerMode.ACCURATE # or FAST
pipeline.table_structure_options.do_cell_matching = True
Image Extraction (Python API)
from pathlib import Path
from docling.document_converter import DocumentConverter
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.datamodel.base_models import InputFormat
from docling.document_converter import PdfFormatOption
# Enable image generation
pipeline = PdfPipelineOptions()
pipeline.generate_picture_images = True
converter = DocumentConverter(
format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline)}
)
result = converter.convert("document.pdf")
# Save images
images_dir = Path("./output/images")
images_dir.mkdir(parents=True, exist_ok=True)
for idx, picture in enumerate(result.document.pictures):
    if picture.image and picture.image.pil_image:
        pil_img = picture.image.pil_image
        pil_img.save(images_dir / f"image_{idx:03d}.png", "PNG")
Export Options
# doc is the converted document from an earlier converter.convert(...) call
doc = result.document
# Markdown
markdown = doc.export_to_markdown()
# JSON (dict)
data = doc.export_to_dict()
# HTML
html = doc.export_to_html()
# Save with different image modes
from docling_core.types.doc import ImageRefMode
doc.save_as_markdown("output.md", image_mode=ImageRefMode.REFERENCED)
Supported Formats
| Direction | Formats |
|---|---|
| Input | PDF, DOCX, PPTX, XLSX, HTML, Markdown, AsciiDoc, images (PNG, JPG, TIFF) |
| Output | Markdown, JSON, HTML |
OCR Engines
| Engine | Install | Languages |
|---|---|---|
| EasyOCR | pip install easyocr | 80+ languages |
| Tesseract | System package | 100+ languages |
Common Patterns
Batch Processing
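from pathlib import Path
from docling.document_converter import DocumentConverter
# Reuse one converter for every file
converter = DocumentConverter()
for pdf in Path("docs").glob("*.pdf"):
    result = converter.convert(str(pdf))
    # Write the Markdown next to the source file (docs/name.md)
    output = pdf.with_suffix(".md")
    output.write_text(result.document.export_to_markdown(), encoding="utf-8")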
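In a long batch run, one unreadable file would stop the loop above. A variant that records failures and keeps going might look like this. The names convert_all and make_docling_converter are hypothetical, and the conversion step is passed in as a callable so the loop can be exercised without Docling installed.

```python
from pathlib import Path
from typing import Callable, Iterable


def convert_all(paths: Iterable[Path],
                to_markdown: Callable[[Path], str]) -> dict[Path, str]:
    """Convert each file to a sibling .md file; return {path: error message}."""
    failures: dict[Path, str] = {}
    for path in paths:
        try:
            text = to_markdown(path)
            path.with_suffix(".md").write_text(text, encoding="utf-8")
        except Exception as exc:  # keep going; report the failure at the end
            failures[path] = str(exc)
    return failures


def make_docling_converter() -> Callable[[Path], str]:
    """Return a to_markdown callable backed by one shared DocumentConverter."""
    # Imported lazily so convert_all can be used and tested without docling.
    from docling.document_converter import DocumentConverter
    converter = DocumentConverter()

    def to_markdown(path: Path) -> str:
        return converter.convert(str(path)).document.export_to_markdown()

    return to_markdown
```

A typical call would be convert_all(Path("docs").glob("*.pdf"), make_docling_converter()), followed by printing any entries in the returned dictionary.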
RAG Chunking
from docling.chunking import HierarchicalChunker
chunker = HierarchicalChunker()
chunks = list(chunker.chunk(result.document))
for chunk in chunks:
    print(chunk.text)
Extract Tables to CSV
for idx, table in enumerate(result.document.tables):
    # Each table converts to a pandas DataFrame
    df = table.export_to_dataframe()
    df.to_csv(f"table_{idx}.csv", index=False)
    print(df.to_markdown())  # Print as Markdown (requires the tabulate package)