docx
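Word document processing toolkit. Use it to create new documents, edit existing ones, handle tracked changes, add comments, and extract text. Use this skill whenever a .docx file needs to be created, modified, or analyzed.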
When & Why to Use This Skill
The DOCX Processing toolkit is a comprehensive suite for programmatically managing Microsoft Word documents. It enables seamless creation, editing, and analysis of .docx files, featuring advanced capabilities such as track changes (redlining), comment management, and multi-format conversion. By integrating tools like Pandoc, python-docx, and docx-js, it provides a robust bridge between AI-generated content and professional document standards.
Use Cases
- Automated Report Generation: Programmatically generate structured business reports, invoices, or technical manuals with custom headings, tables, and formatting.
- Legal and Editorial Workflows: Manage complex document revisions by tracking changes, adding annotations, and processing redlines during contract reviews or collaborative writing.
- Content Migration and Publishing: Convert legacy Word documents into Markdown for web documentation or transform them into PDFs and images for standardized distribution.
- Data Extraction and Analysis: Batch process large volumes of .docx files to extract text, metadata, or table data for integration into databases or analytical tools.
| name | docx |
|---|---|
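| description | Word document processing toolkit. Use it to create new documents, edit existing ones, handle tracked changes, add comments, and extract text. Use this skill whenever a .docx file needs to be created, modified, or analyzed. |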
DOCX Processing Guide
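Workflow Decision Tree
| Task | Recommended approach |
|---|---|
| Read or analyze content | Convert to Markdown with pandoc |
| Create a new document | docx-js (JavaScript) |
| Edit an existing document | python-docx or raw OOXML |
| Track changes | Redlining workflow |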
Reading Documents
Convert to Markdown
# Basic conversion
pandoc document.docx -o output.md
# Preserve tracked changes (insertions, deletions, comments)
pandoc --track-changes=all document.docx -o output.md
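When the conversion has to run from a script rather than a shell, the two commands above can be wrapped as follows. This is a minimal sketch: `pandoc_command` and `convert_docx` are hypothetical helper names, not part of pandoc or of this toolkit, and the sketch assumes `pandoc` is on the `PATH`.

```python
import subprocess
from typing import List

# Values pandoc accepts for --track-changes
TRACK_CHANGES_MODES = {"accept", "reject", "all"}


def pandoc_command(src: str, dest: str, track_changes: str = "") -> List[str]:
    """Build the pandoc argument list for a .docx -> Markdown conversion.

    Hypothetical helper: it only assembles arguments and runs nothing.
    """
    cmd = ["pandoc"]
    if track_changes:
        if track_changes not in TRACK_CHANGES_MODES:
            raise ValueError(f"unsupported track-changes mode: {track_changes!r}")
        cmd.append(f"--track-changes={track_changes}")
    cmd += [src, "-o", dest]
    return cmd


def convert_docx(src: str, dest: str, track_changes: str = "") -> None:
    """Run pandoc and raise CalledProcessError if it exits with an error."""
    subprocess.run(pandoc_command(src, dest, track_changes), check=True)
```

Calling `convert_docx("document.docx", "output.md", track_changes="all")` would run the second command shown above. Passing an argument list instead of a shell string avoids quoting problems with file names that contain spaces.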
Creating a New Document (JavaScript)
const fs = require('fs');
const { Document, Packer, Paragraph, TextRun } = require('docx');
const doc = new Document({
  sections: [{
    children: [
      // Title paragraph: bold, 16 pt (size is given in half-points)
      new Paragraph({
        children: [
          new TextRun({ text: "Title", bold: true, size: 32 }),
        ],
      }),
      // Body paragraph with default formatting
      new Paragraph({
        children: [
          new TextRun("Body text"),
        ],
      }),
    ],
  }],
});
// Export to disk
Packer.toBuffer(doc).then(buffer => {
  fs.writeFileSync("output.docx", buffer);
});
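Editing a Document (Python)
from docx import Document
# Open an existing document
doc = Document('existing.docx')
# Append a paragraph
doc.add_paragraph('New paragraph text')
# Append a level-1 heading
doc.add_heading('New heading', level=1)
# Append a 2x2 table and fill its first cell
table = doc.add_table(rows=2, cols=2)
table.cell(0, 0).text = 'Cell content'
# Save under a new name so the original stays untouched
doc.save('modified.docx')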
Extracting Text
from docx import Document
doc = Document('document.docx')
# Collect the text of each body paragraph (text inside tables is not included)
full_text = []
for para in doc.paragraphs:
    full_text.append(para.text)
print('\n'.join(full_text))
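If python-docx is not available, the same extraction can be approximated with the standard library alone, because a .docx file is a ZIP archive of XML parts. The sketch below assumes the main body lives in `word/document.xml`, the conventional location in files written by Word; `extract_paragraphs` is a hypothetical helper name.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by <w:p>, <w:r>, <w:t>
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def extract_paragraphs(docx_file):
    """Return the text of every <w:p> element, in document order.

    docx_file may be a path or a binary file-like object.
    Assumption: the body part is stored at word/document.xml.
    """
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(f"{{{W_NS}}}p"):
        # A paragraph's text is split across runs; join every <w:t> inside it
        paragraphs.append("".join(t.text or "" for t in para.iter(f"{{{W_NS}}}t")))
    return paragraphs
```

Because this walks every `<w:p>` in the part, it also picks up paragraphs inside table cells. Deleted text in tracked changes is stored in `<w:delText>` rather than `<w:t>`, so it is skipped here.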
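Redlining Workflow
- Convert and review: pandoc --track-changes=all file.docx -o current.md
- Identify changes: mark the locations that need to be modified
- Apply changes: use OOXML to add <w:ins> and <w:del> tags
- Verify the result: convert again and confirm the changes are correct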
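The "apply changes" step can be sketched with the standard library as below. `tracked_change` is a hypothetical helper, not part of python-docx or of this toolkit. It builds only the `<w:ins>` or `<w:del>` element; splicing that element into the right paragraph of `word/document.xml` is left out.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
XML_NS = "http://www.w3.org/XML/1998/namespace"
ET.register_namespace("w", W_NS)  # serialize with the conventional w: prefix


def _w(name):
    """Qualify a tag or attribute name with the WordprocessingML namespace."""
    return f"{{{W_NS}}}{name}"


def tracked_change(kind, text, author, change_id, when=None):
    """Build a <w:ins> or <w:del> element wrapping a single run of text.

    kind is "ins" or "del"; when should be a UTC datetime (defaults to now).
    """
    if kind not in ("ins", "del"):
        raise ValueError("kind must be 'ins' or 'del'")
    when = when or datetime.now(timezone.utc)
    change = ET.Element(_w(kind), {
        _w("id"): str(change_id),
        _w("author"): author,
        _w("date"): when.strftime("%Y-%m-%dT%H:%M:%SZ"),
    })
    run = ET.SubElement(change, _w("r"))
    # Inserted text goes in <w:t>; deleted text must go in <w:delText>
    text_el = ET.SubElement(run, _w("t" if kind == "ins" else "delText"))
    text_el.set(f"{{{XML_NS}}}space", "preserve")  # keep leading/trailing spaces
    text_el.text = text
    return change
```

For example, `ET.tostring(tracked_change("del", "old wording", "Editor", 2), encoding="unicode")` yields a `<w:del>` fragment. Each tracked change in a document needs its own `w:id`, so the caller is responsible for choosing unused values.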
Converting Documents to Images
# DOCX → PDF
soffice --headless --convert-to pdf document.docx
# PDF → images (one JPEG per page at 150 DPI, file names prefixed with "page")
pdftoppm -jpeg -r 150 document.pdf page
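The two-stage conversion can also be driven from Python. This is a sketch under stated assumptions: `image_pipeline` and `docx_to_images` are hypothetical names, and both `soffice` and `pdftoppm` are assumed to be on the `PATH`. It passes LibreOffice's `--outdir` option so that the location of the intermediate PDF is explicit.

```python
import subprocess
from pathlib import Path


def image_pipeline(docx_path, out_dir=".", prefix="page", dpi=150):
    """Return the two argument lists: DOCX -> PDF, then PDF -> JPEG pages."""
    docx = Path(docx_path)
    out = Path(out_dir)
    pdf = out / (docx.stem + ".pdf")  # soffice names the PDF after the input file
    return [
        ["soffice", "--headless", "--convert-to", "pdf", "--outdir", str(out), str(docx)],
        ["pdftoppm", "-jpeg", "-r", str(dpi), str(pdf), str(out / prefix)],
    ]


def docx_to_images(docx_path, out_dir=".", prefix="page", dpi=150):
    """Run both stages in order, stopping at the first failure."""
    for cmd in image_pipeline(docx_path, out_dir, prefix, dpi):
        subprocess.run(cmd, check=True)
```

Keeping command construction separate from execution lets the arguments be inspected or logged before anything runs.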
Installing Dependencies
# Python
pip install python-docx
# JavaScript
npm install docx
# Command-line tools (Debian/Ubuntu)
sudo apt-get install pandoc libreoffice poppler-utils
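Before running any workflow, it can help to confirm that the command-line tools are actually installed. A small check, assuming the executables are named `pandoc`, `soffice`, and `pdftoppm` as in the commands in this guide; `missing_tools` is a hypothetical helper.

```python
import shutil

# Executables used by the commands in this guide
REQUIRED_TOOLS = ("pandoc", "soffice", "pdftoppm")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the names from tools that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing tools:", ", ".join(missing))
    else:
        print("All command-line tools found.")
```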