docx
Comprehensive document creation, editing, and analysis with support for tracked changes, comments, formatting preservation, and text extraction. Use when working with professional documents (.docx files) for creating new documents, modifying content, working with tracked changes, or adding comments.
When & Why to Use This Skill
This Claude skill creates, edits, and analyzes .docx files. It handles redlining, tracked changes, and comments while preserving existing formatting, and it can extract text or convert documents to Markdown, PDF, and JPEG for business and legal workflows.
Use Cases
- Legal and Business Redlining: Automate the review process by programmatically applying tracked changes and managing comments in contracts and official agreements while maintaining document integrity.
- Automated Document Generation: Programmatically create professional reports, proposals, or templates from scratch using structured data with full control over paragraphs, text runs, and formatting.
- Content Extraction and Migration: Convert complex .docx files into clean Markdown for knowledge bases or extract raw XML data for deep document analysis and data mining.
- Format Conversion and Publishing: Streamline document distribution by converting professional documents into PDF or image formats (JPEG) for easy sharing and presentation.
- Collaborative Workflow Automation: Batch-process edits and feedback in academic or corporate documents, allowing for precise 'minimal-edit' updates that respect existing document structures.
DOCX Creation, Editing, and Analysis
Overview
A .docx file is essentially a ZIP archive containing XML files that you can read or edit.
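Because the container is an ordinary ZIP archive, Python's standard zipfile module is enough to look inside one. The following is a minimal sketch that uses a tiny in-memory archive as a stand-in for a real .docx; the helper names list_parts and build_sample_archive and the sample contents are illustrative and not part of this skill.

```python
import io
import zipfile


def list_parts(docx_bytes: bytes) -> list[str]:
    """Return the names of the files stored inside a .docx (ZIP) archive."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        return archive.namelist()


def build_sample_archive() -> bytes:
    """Build a tiny stand-in archive; a real .docx contains more parts."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("[Content_Types].xml", "<Types/>")
        archive.writestr("word/document.xml", "<document/>")
    return buffer.getvalue()


# Parts are listed in the order they were written to the archive.
print(list_parts(build_sample_archive()))
```

For a file on disk, read its bytes with pathlib.Path(path).read_bytes() and pass them to list_parts.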
Workflow Decision Tree
Reading/Analyzing Content
Use text extraction or raw XML access
Creating New Document
Use docx-js workflow
Editing Existing Document
- Your own document + simple changes: Basic OOXML editing
- Someone else's document: Redlining workflow (recommended)
- Legal, academic, business docs: Redlining workflow (required)
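The decision tree can be read as a small lookup function. This is only a sketch of the rules listed above: the function name, parameter names, and return strings are hypothetical, and it assumes that edits to your own document are simple ones, since the tree does not say what to do with complex changes to your own document.

```python
def choose_workflow(task: str, own_document: bool = True, formal: bool = False) -> str:
    """Map a task to the workflow named in the decision tree.

    task: "read", "create", or "edit".
    own_document: True if you authored the document (only used for "edit").
    formal: True for legal, academic, or business documents.
    """
    if task == "read":
        return "text extraction or raw XML access"
    if task == "create":
        return "docx-js"
    if task == "edit":
        if formal:
            # Formal documents always go through redlining.
            return "redlining (required)"
        if not own_document:
            return "redlining (recommended)"
        # Assumption: changes to your own document are simple.
        return "basic OOXML editing"
    raise ValueError(f"unknown task: {task!r}")


print(choose_workflow("edit", own_document=False))
```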
Reading Content
Text Extraction
Convert to markdown using pandoc:
# Convert document to markdown with tracked changes
pandoc --track-changes=all path-to-file.docx -o output.md
Raw XML Access
Needed for: comments, complex formatting, document structure, embedded media.
# Unpack a file
python ooxml/scripts/unpack.py <input.docx> <output_dir>
Key file structures:
- word/document.xml - Main document contents
- word/comments.xml - Comments referenced in document.xml
- word/media/ - Embedded images and media files
- Tracked changes use <w:ins> and <w:del> tags
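To show what those tags look like in practice, here is a hedged sketch that pulls inserted and deleted text out of a hand-written WordprocessingML fragment. The sample XML and the function name tracked_changes are made up for illustration. The sketch uses the standard library parser to stay self-contained; for untrusted files, the defusedxml package listed under Dependencies is the safer choice.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}

# Hand-written sample: "30" is a tracked deletion, "60" a tracked insertion.
SAMPLE = f"""<w:document xmlns:w="{W_NS}">
  <w:body>
    <w:p>
      <w:r><w:t xml:space="preserve">The term is </w:t></w:r>
      <w:del w:id="1" w:author="Reviewer"><w:r><w:delText>30</w:delText></w:r></w:del>
      <w:ins w:id="2" w:author="Reviewer"><w:r><w:t>60</w:t></w:r></w:ins>
      <w:r><w:t xml:space="preserve"> days.</w:t></w:r>
    </w:p>
  </w:body>
</w:document>"""


def tracked_changes(document_xml: str) -> dict[str, list[str]]:
    """Collect the text of every <w:ins> and <w:del> element."""
    root = ET.fromstring(document_xml)
    # Inserted runs keep their text in <w:t>; deleted runs use <w:delText>.
    inserted = [
        "".join(t.text or "" for t in ins.iterfind(".//w:t", NS))
        for ins in root.iterfind(".//w:ins", NS)
    ]
    deleted = [
        "".join(t.text or "" for t in dele.iterfind(".//w:delText", NS))
        for dele in root.iterfind(".//w:del", NS)
    ]
    return {"inserted": inserted, "deleted": deleted}


print(tracked_changes(SAMPLE))
```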
Creating New Documents
Use docx-js (JavaScript/TypeScript):
- Create a JavaScript/TypeScript file using Document, Paragraph, TextRun components
- Export as .docx using Packer.toBuffer()
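import * as fs from "fs";
import { Document, Paragraph, TextRun, Packer } from "docx";

// A document holds sections; each section holds paragraphs made of text runs.
const doc = new Document({
  sections: [{
    properties: {},
    children: [
      new Paragraph({
        children: [new TextRun("Hello World")],
      }),
    ],
  }],
});

// Packer.toBuffer() resolves to a Buffer containing the .docx bytes.
const buffer = await Packer.toBuffer(doc);
// Example output path; top-level await requires an ES module.
fs.writeFileSync("hello-world.docx", buffer);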
Editing Existing Documents
Use the Document library (Python):
- Unpack:
python ooxml/scripts/unpack.py <input.docx> <output_dir>
- Create and run a Python script using the Document library
- Pack:
python ooxml/scripts/pack.py <unpacked_dir> <output.docx>
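The unpack, edit, pack cycle can be approximated with the standard library alone. The sketch below is a stand-in to show the shape of the round trip, not the implementation of ooxml/scripts/unpack.py or pack.py (which may do more), and the plain string replacement stands in for whatever the Document library would do. All function names and the sample archive are illustrative.

```python
import tempfile
import zipfile
from pathlib import Path


def unpack(docx_path: Path, output_dir: Path) -> None:
    """Extract every part of the archive into output_dir."""
    with zipfile.ZipFile(docx_path) as archive:
        archive.extractall(output_dir)


def pack(unpacked_dir: Path, docx_path: Path) -> None:
    """Zip the directory back up, storing paths relative to its root."""
    with zipfile.ZipFile(docx_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(unpacked_dir.rglob("*")):
            if path.is_file():
                archive.write(path, path.relative_to(unpacked_dir).as_posix())


def roundtrip_demo() -> str:
    """Unpack a stand-in archive, edit one part, repack, and read it back."""
    with tempfile.TemporaryDirectory() as tmp:
        tmp_path = Path(tmp)
        source = tmp_path / "input.docx"
        with zipfile.ZipFile(source, "w") as archive:
            archive.writestr("word/document.xml", "<doc>old</doc>")

        unpacked = tmp_path / "unpacked"
        unpack(source, unpacked)

        # Placeholder edit: a real script would modify the XML structurally.
        part = unpacked / "word" / "document.xml"
        text = part.read_text(encoding="utf-8")
        part.write_text(text.replace("old", "new"), encoding="utf-8")

        output = tmp_path / "output.docx"
        pack(unpacked, output)
        with zipfile.ZipFile(output) as archive:
            return archive.read("word/document.xml").decode("utf-8")


print(roundtrip_demo())
```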
Redlining Workflow
For document review with tracked changes:
Principle: Minimal, Precise Edits
Only mark text that actually changes. Break each replacement into: [unchanged text] + [deletion] + [insertion] + [unchanged text]
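One way to find those four pieces is to strip the words that the old and new text share at the start and at the end, leaving only the span that differs. This is a minimal sketch of that idea at word granularity; the function name split_replacement is hypothetical, and the approach is an illustration of the principle rather than how the skill's tooling computes edits.

```python
def split_replacement(old: str, new: str) -> tuple[str, str, str, str]:
    """Split a replacement into (unchanged, deletion, insertion, unchanged).

    Words are compared on single-space boundaries; the spaces that separate
    the four returned parts are not included in any of them.
    """
    old_words, new_words = old.split(" "), new.split(" ")
    limit = min(len(old_words), len(new_words))

    # Count the words both versions share at the start.
    prefix = 0
    while prefix < limit and old_words[prefix] == new_words[prefix]:
        prefix += 1

    # Count the words both versions share at the end, without overlapping
    # the shared prefix.
    suffix = 0
    while (
        suffix < limit - prefix
        and old_words[len(old_words) - 1 - suffix] == new_words[len(new_words) - 1 - suffix]
    ):
        suffix += 1

    unchanged_before = " ".join(old_words[:prefix])
    deletion = " ".join(old_words[prefix:len(old_words) - suffix])
    insertion = " ".join(new_words[prefix:len(new_words) - suffix])
    unchanged_after = " ".join(old_words[len(old_words) - suffix:])
    return unchanged_before, deletion, insertion, unchanged_after


print(split_replacement("The term is 30 days.", "The term is 60 days."))
```

In this example only "30" would be wrapped in a deletion and only "60" in an insertion; the surrounding words stay untouched.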
Workflow
- Get markdown representation:
pandoc --track-changes=all path-to-file.docx -o current.md
- Identify and group changes into batches of 3-10
- Unpack the document:
python ooxml/scripts/unpack.py <input.docx> <output_dir>
- Implement changes in batches using Document library
- Pack the document:
python ooxml/scripts/pack.py unpacked reviewed-document.docx
- Final verification:
pandoc --track-changes=all reviewed-document.docx -o verification.md
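The grouping step (batches of 3-10 changes) can be sketched as a simple chunking helper. The function name, the default batch size of 5, and the sample change descriptions are assumptions for illustration; note that the final batch may hold fewer than 3 changes when the total does not divide evenly.

```python
def batch_changes(changes: list[str], size: int = 5) -> list[list[str]]:
    """Group planned changes into consecutive batches of `size` items."""
    if not 3 <= size <= 10:
        raise ValueError("batch size should stay within 3-10")
    return [changes[start:start + size] for start in range(0, len(changes), size)]


# Hypothetical change descriptions, one per planned edit.
planned = [f"change {number}" for number in range(1, 13)]
for index, batch in enumerate(batch_changes(planned), start=1):
    print(index, batch)
```

Implementing and checking one batch at a time keeps a failed edit easy to locate.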
Converting to Images
# Convert DOCX to PDF
soffice --headless --convert-to pdf document.docx
# Convert PDF pages to JPEG
pdftoppm -jpeg -r 150 document.pdf page
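When the two commands need to run from a script, they can be assembled as argument lists and passed to subprocess. This is a sketch under stated assumptions: the function names are hypothetical, both tools must be on PATH, and it relies on soffice writing the PDF into the current working directory when no output directory is given.

```python
import subprocess
from pathlib import Path


def conversion_commands(docx_path: str, dpi: int = 150, prefix: str = "page") -> list[list[str]]:
    """Build the soffice and pdftoppm argument lists shown above."""
    # Assumption: the PDF lands in the working directory under the same stem.
    pdf_name = Path(docx_path).with_suffix(".pdf").name
    return [
        ["soffice", "--headless", "--convert-to", "pdf", docx_path],
        ["pdftoppm", "-jpeg", "-r", str(dpi), pdf_name, prefix],
    ]


def convert_to_images(docx_path: str) -> None:
    """Run both steps in order, stopping at the first failure."""
    for command in conversion_commands(docx_path):
        subprocess.run(command, check=True)


# Printing the commands does not require LibreOffice or Poppler.
for command in conversion_commands("document.docx"):
    print(" ".join(command))
```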
Dependencies
- pandoc: Text extraction
- docx: Creating new documents (npm)
- LibreOffice: PDF conversion
- Poppler: PDF to image conversion
- defusedxml: Secure XML parsing