Processing Documents
Quick Start
```typescript
import pdfParse from 'pdf-parse';
import ExcelJS from 'exceljs';
import { Document, Packer, Paragraph, TextRun } from 'docx';

// Extract text from PDF (pdf-lib does not expose text content; pdf-parse does)
async function extractPDFText(buffer: Buffer): Promise<string> {
  const { text } = await pdfParse(buffer);
  return text;
}

// Read Excel spreadsheet
async function readExcel(buffer: Buffer) {
  const workbook = new ExcelJS.Workbook();
  await workbook.xlsx.load(buffer);
  return workbook.worksheets.map(sheet => ({
    name: sheet.name,
    rows: sheet.getSheetValues(),
  }));
}

// Generate Word document
async function generateDOCX(title: string, content: string[]): Promise<Buffer> {
  const doc = new Document({
    sections: [{
      children: [
        new Paragraph({ children: [new TextRun({ text: title, bold: true, size: 48 })] }),
        ...content.map(text => new Paragraph({ children: [new TextRun(text)] })),
      ],
    }],
  });
  return await Packer.toBuffer(doc);
}
```
Features
| Feature | Description | Guide |
| --- | --- | --- |
| PDF Extraction | Extract text, tables, images, and metadata from PDFs | Use pdf-parse or pdfjs-dist for text extraction |
| PDF Generation | Create PDFs from templates with data binding | Use pdf-lib with text, images, and table elements |
| DOCX Parsing | Parse Word documents preserving structure | Use mammoth or the docx library for parsing (sketch below) |
| DOCX Generation | Generate Word documents with formatting | Use the docx package with paragraphs and tables |
| Excel Reading | Read spreadsheets with formulas and formatting | Use exceljs to iterate sheets and cells |
| Excel Generation | Create spreadsheets with charts and styling | Use exceljs with conditional formatting |
| PPTX Generation | Create presentations with slides and charts | Use pptxgenjs for slide creation (sketch below) |
| Batch Processing | Process multiple documents with concurrency | Use p-queue for controlled parallel processing |
| Template Engine | Generate documents from templates with placeholders | Use docxtemplater for DOCX templates |
| Streaming | Handle large files without memory exhaustion | Process files in chunks with streams |
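The DOCX parsing and PPTX generation rows have no worked example in the patterns below. Two minimal sketches, assuming mammoth and pptxgenjs are installed; the slide layout values and deck structure here are illustrative:

```typescript
import mammoth from 'mammoth';
import pptxgen from 'pptxgenjs';

// DOCX parsing: mammoth converts the document body to plain text or HTML.
async function parseDOCX(buffer: Buffer) {
  const text = await mammoth.extractRawText({ buffer });
  const html = await mammoth.convertToHtml({ buffer });
  return { text: text.value, html: html.value, warnings: html.messages };
}

// PPTX generation: one title slide plus one bullet slide per section.
async function generateDeck(title: string, sections: { heading: string; bullets: string[] }[]) {
  const deck = new pptxgen();
  deck.addSlide().addText(title, { x: 0.5, y: 2.0, w: 9, fontSize: 36, bold: true });
  for (const section of sections) {
    const slide = deck.addSlide();
    slide.addText(section.heading, { x: 0.5, y: 0.4, fontSize: 24, bold: true });
    slide.addText(
      section.bullets.map(text => ({ text, options: { bullet: true } })),
      { x: 0.5, y: 1.2, w: 9, h: 4, fontSize: 16 },
    );
  }
  await deck.writeFile({ fileName: 'deck.pptx' }); // or deck.write(...) for an in-memory buffer
}
```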
Common Patterns
Batch Document Processing
```typescript
import PQueue from 'p-queue';
import { promises as fs } from 'fs';

async function processBatch(files: string[], transform: (buffer: Buffer) => Promise<Buffer>) {
  const queue = new PQueue({ concurrency: 4 });
  const results: { file: string; success: boolean; error?: string }[] = [];

  for (const file of files) {
    queue.add(async () => {
      try {
        const buffer = await fs.readFile(file);
        const output = await transform(buffer);
        await fs.writeFile(file.replace(/\.\w+$/, '_processed.pdf'), output);
        results.push({ file, success: true });
      } catch (error) {
        results.push({ file, success: false, error: error instanceof Error ? error.message : String(error) });
      }
    });
  }

  await queue.onIdle();
  return results;
}
```
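For example, a transform that simply re-saves each PDF through pdf-lib (a plausible normalization pass; the filenames are placeholders):

```typescript
import { PDFDocument } from 'pdf-lib';

const results = await processBatch(['a.pdf', 'b.pdf'], async (buffer) => {
  const doc = await PDFDocument.load(buffer); // parse the original
  return Buffer.from(await doc.save());       // re-serialize it
});
console.log(results.filter(r => !r.success)); // inspect any failures
```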
Excel Report Generation
```typescript
import ExcelJS from 'exceljs';

async function generateReport(data: Record<string, any>[]): Promise<Buffer> {
  const workbook = new ExcelJS.Workbook();
  const sheet = workbook.addWorksheet('Report');

  // Add headers with styling
  const headers = Object.keys(data[0] ?? {});
  sheet.addRow(headers);
  sheet.getRow(1).font = { bold: true };
  sheet.getRow(1).fill = { type: 'pattern', pattern: 'solid', fgColor: { argb: 'FFE0E0E0' } };

  // Add data rows
  data.forEach(row => sheet.addRow(headers.map(h => row[h])));

  // Set a uniform column width (exceljs has no built-in auto-fit)
  sheet.columns.forEach(col => { col.width = 15; });

  return Buffer.from(await workbook.xlsx.writeBuffer());
}
```
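The Features table also mentions conditional formatting. A sketch of how that could be layered onto the report sheet with exceljs's addConditionalFormatting; the column range and threshold are illustrative:

```typescript
// Highlight values over a threshold in column B (rows 2-101 here are arbitrary).
sheet.addConditionalFormatting({
  ref: 'B2:B101',
  rules: [{
    type: 'cellIs',
    operator: 'greaterThan',
    formulae: ['1000'],
    priority: 1,
    style: { fill: { type: 'pattern', pattern: 'solid', bgColor: { argb: 'FFFFC7CE' } } },
  }],
});
```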
Invoice Generation from Template
```typescript
import Docxtemplater from 'docxtemplater';
import PizZip from 'pizzip';
import { promises as fs } from 'fs';

interface InvoiceData {
  number: string;
  date: string;
  customer: string;
  items: Record<string, unknown>[]; // shape depends on the template's loop placeholders
  total: number;
}

async function generateInvoice(templatePath: string, invoiceData: InvoiceData): Promise<Buffer> {
  const templateBuffer = await fs.readFile(templatePath);
  const zip = new PizZip(templateBuffer);
  const doc = new Docxtemplater(zip, { paragraphLoop: true, linebreaks: true });

  doc.render({
    invoiceNumber: invoiceData.number,
    date: invoiceData.date,
    customer: invoiceData.customer,
    items: invoiceData.items,
    total: invoiceData.total,
  });

  return doc.getZip().generate({ type: 'nodebuffer', compression: 'DEFLATE' });
}
```
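A hypothetical call, assuming a template whose placeholders ({invoiceNumber}, {#items}…{/items}, {total}) match the keys passed to render; the paths and values are made up:

```typescript
const docxBuffer = await generateInvoice('templates/invoice.docx', {
  number: 'INV-2024-001',
  date: '2024-06-30',
  customer: 'Acme Corp',
  items: [{ description: 'Consulting', quantity: 10, price: 150 }],
  total: 1500,
});
await fs.writeFile('invoice-INV-2024-001.docx', docxBuffer);
```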
PDF Table Extraction
```typescript
import { PDFDocument } from 'pdf-lib';

// Illustrative shape; the real fields depend on detectTableStructures.
interface ExtractedTable { rows: string[][]; pageNumber: number }

async function extractTables(pdfBuffer: Buffer): Promise<ExtractedTable[]> {
  const pdfDoc = await PDFDocument.load(pdfBuffer);
  const tables: ExtractedTable[] = [];
  for (let i = 0; i < pdfDoc.getPageCount(); i++) {
    const page = pdfDoc.getPage(i);
    // extractPageContent / detectTableStructures are application-specific helpers:
    // pdf-lib exposes page geometry but not text (see the pdfjs-dist sketch below).
    const content = await extractPageContent(page);
    const detectedTables = detectTableStructures(content);
    tables.push(...detectedTables.map(t => ({ ...t, pageNumber: i + 1 })));
  }
  return tables;
}
```
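A real implementation of extractPageContent needs a text-extraction engine, since pdf-lib cannot read text. A minimal sketch using pdfjs-dist that works from the raw buffer and a 1-based page number instead of the pdf-lib page object; the legacy-build import path is an assumption about the installed pdfjs-dist version:

```typescript
import { getDocument } from 'pdfjs-dist/legacy/build/pdf.mjs';

interface PositionedText { text: string; x: number; y: number }

// Returns each text run with its page coordinates, which detectTableStructures
// could cluster into rows (shared y) and columns (shared x).
async function extractPageText(pdfBuffer: Buffer, pageNumber: number): Promise<PositionedText[]> {
  const doc = await getDocument({ data: new Uint8Array(pdfBuffer) }).promise;
  const page = await doc.getPage(pageNumber);
  const content = await page.getTextContent();
  // transform[4] and transform[5] are the x/y offsets of each text item.
  return content.items.flatMap((item: any) =>
    'str' in item ? [{ text: item.str, x: item.transform[4], y: item.transform[5] }] : []
  );
}
```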
Best Practices
| Do | Avoid |
| --- | --- |
| Stream large files (>10 MB) to prevent memory issues | Loading entire large files into memory |
| Validate file types before processing (sketch below) | Assuming file extensions match content |
| Handle password-protected documents gracefully | Ignoring encrypted document errors |
| Preserve original formatting when transforming | Stripping formatting without user consent |
| Cache parsed results for repeated access | Re-parsing the same document multiple times |
| Use appropriate libraries per format | Building custom parsers for standard formats |
| Set file size limits for uploads | Processing unbounded file sizes |
| Sanitize filenames and paths | Using untrusted paths directly |
| Handle encoding issues (UTF-8, BOM) | Assuming all files use the same encoding |
| Log processing errors with context | Silently failing on corrupt files |
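For the file-type validation row, a dependency-free sketch that checks magic bytes instead of trusting the extension: PDF files start with `%PDF`, and DOCX/XLSX/PPTX are ZIP containers starting with `PK\x03\x04`.

```typescript
const PDF_MAGIC = Buffer.from('%PDF');
const ZIP_MAGIC = Buffer.from([0x50, 0x4b, 0x03, 0x04]); // DOCX/XLSX/PPTX are ZIP containers

function sniffFileType(buffer: Buffer): 'pdf' | 'office-zip' | 'unknown' {
  if (buffer.subarray(0, 4).equals(PDF_MAGIC)) return 'pdf';
  if (buffer.subarray(0, 4).equals(ZIP_MAGIC)) return 'office-zip';
  return 'unknown';
}
```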
Related Skills
- media-processing - Video and audio processing
- image-processing - Image manipulation with Sharp
- typescript - Type-safe document handling
References