excel-parser
Smart Excel/CSV file parsing with intelligent routing based on file complexity analysis. Analyzes file structure (merged cells, row count, table layout) using lightweight metadata scanning, then recommends optimal processing strategy - either high-speed Pandas mode for standard tables or semantic HTML mode for complex reports. Use when processing Excel/CSV files with unknown or varying structure where optimization between speed and accuracy is needed.
When & Why to Use This Skill
This Claude skill provides an intelligent routing system for parsing Excel and CSV files using a 'Scout Pattern.' By performing a lightweight metadata scan to analyze file complexity—including merged cells, row counts, and table layouts—it automatically selects the optimal processing strategy: high-speed Pandas mode for standard datasets or semantic HTML extraction for complex, irregular reports. This ensures a perfect balance between processing speed, token efficiency, and data accuracy.
Use Cases
- Complex Financial Reporting: Accurately extracting data from financial statements that utilize merged cells and multi-level headers which typically break standard parsers.
- Large-Scale Data Ingestion: Automatically routing simple, high-volume CSV files to Pandas to maintain high performance and minimize LLM token consumption.
- Legacy Document Digitization: Processing older Excel files with inconsistent structures, empty row interruptions, or multiple sub-tables within a single sheet.
- Automated Data Pipelines: Serving as a robust front-end for data pipelines where the incoming file structure is unknown or varies significantly between batches.
Excel Parser
Overview
Provide intelligent routing strategies for parsing Excel/CSV files by analyzing complexity and choosing the optimal processing path. The skill implements a "Scout Pattern" that scans file metadata before processing to balance speed (Pandas) with accuracy (semantic extraction).
Core Philosophy: Scout Pattern
Before processing data, deploy a lightweight "scout" to analyze file metadata and make intelligent routing decisions:
- Metadata Scanning - Use `openpyxl` to scan file structure without loading data
- Complexity Scoring - Calculate a score based on merged cells, row count, and layout
- Path Selection - Choose between Pandas (fast) or HTML (accurate) processing
- Optimized Execution - Execute with the most appropriate tool for the file type
Key Principle: "LLM handles metadata decisions, Pandas/HTML processes bulk data"
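As a sketch of that division of labor (assuming the `analyze_sheet_complexity`, `process_pandas_mode`, and `process_html_mode` methods defined in the implementation template later in this document):

```python
# Scout first (cheap metadata scan), then route the bulk work to the right tool.
def route_sheet(router, sheet_name):
    analysis = router.analyze_sheet_complexity(sheet_name)  # LLM-free metadata decision
    if analysis["is_complex"]:
        return router.process_html_mode(sheet_name)    # accurate semantic extraction
    return router.process_pandas_mode(sheet_name)      # fast native bulk read
```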
When to Use This Skill
Use excel-parser when:
- Processing Excel/CSV files with unknown structure or varying complexity
- Handling files ranging from simple data tables to complex financial reports
- Balancing processing speed against extraction accuracy
- Working with files that may contain merged cells, multi-level headers, or irregular layouts
Skip this skill when:
- File structure is already known and documented
- Processing simple, well-structured tables with confirmed format
- Using predefined scripts for specific file formats
Processing Workflow
Step 1: Analyze File Complexity
Use `scripts/complexity_analyzer.py` to scan file metadata:

```bash
python scripts/complexity_analyzer.py <file_path> [sheet_name]
```
What it analyzes (without loading data):
- Merged cell distribution (shallow vs deep in the table)
- Row count and data continuity
- Empty row interruptions (indicates multi-table layouts)
Output (JSON format):
```json
{
  "is_complex": false,
  "recommended_strategy": "pandas",
  "reasons": ["Too many rows (>1000); forcing Pandas mode"],
  "stats": {
    "total_rows": 5000,
    "deep_merges": 0,
    "empty_interruptions": 0
  }
}
```
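If you drive the analyzer from Python rather than the shell, a small wrapper can capture and parse its report (a minimal sketch, assuming the script prints the JSON shown above to stdout):

```python
import json
import subprocess

def analyze(file_path, sheet_name=None):
    """Run the scout script and parse the JSON report it prints."""
    cmd = ["python", "scripts/complexity_analyzer.py", file_path]
    if sheet_name:
        cmd.append(sheet_name)
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return json.loads(result.stdout)

report = analyze("example.xlsx")
strategy = report["recommended_strategy"]  # "pandas" or "html"
```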
Step 2: Route to Optimal Strategy
Based on complexity analysis:
- is_complex = false → Use Path A (Pandas Standard Mode)
- is_complex = true → Use Path B (HTML Semantic Mode)
Step 3: Execute Processing
Follow the selected path's workflow to extract data.
Complexity Scoring Rules
Rule 1: Deep Merged Cells
- Condition: Merged cells appearing beyond row 5
- Interpretation: Complex table structure (not just header formatting)
- Decision: Mark as complex if >2 deep merges detected
- Example: Financial reports with merged category labels in data region
Rule 2: Empty Row Interruptions
- Condition: Multiple empty rows within the table
- Interpretation: Multiple sub-tables in single sheet
- Decision: Mark as complex if >2 empty row interruptions found
- Example: Summary table + detail table in one sheet
Rule 3: Row Count Override
- Condition: Total rows >1000
- Interpretation: Too large for HTML processing (token explosion)
- Decision: Force Pandas mode regardless of complexity
- Rationale: HTML conversion would exceed token limits
Rule 4: Default (Standard Table)
- Condition: No deep merges, continuous data, moderate size
- Interpretation: Standard data table
- Decision: Use Pandas for optimal speed
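To make the interaction between the rules concrete, here is a condensed sketch of Rules 1-4 (the authoritative version is `analyze_sheet_complexity` in the implementation template below):

```python
def is_complex(total_rows: int, deep_merges: int, empty_interruptions: int) -> bool:
    """Condensed form of the scoring rules."""
    if total_rows > 1000:                             # Rule 3: size override wins
        return False
    if deep_merges > 2:                               # Rule 1: merges in data region
        return True
    if total_rows < 300 and empty_interruptions > 2:  # Rule 2: multi-table layout
        return True
    return False                                      # Rule 4: standard table

assert is_complex(150, 4, 0) is True     # merged category labels -> HTML mode
assert is_complex(5000, 4, 3) is False   # too many rows -> forced Pandas mode
```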
Path A: Pandas Standard Mode
When: Simple/large tables (most common case)
Strategy: Let the LLM analyze ONLY the first 20 rows to determine the header position, then use Pandas to read the full data at native speed.
Workflow:
1. Sample First 20 Rows

```python
import pandas as pd

df_sample = pd.read_excel(file_path, sheet_name=sheet_name, header=None, nrows=20)
csv_sample = df_sample.to_csv(index=False)
```

2. LLM Analyzes Header Position

Prompt template:

```
You are a Pandas expert. Analyze the following first 20 rows of an Excel file (in CSV format):

{csv_sample}

Task: Identify the true header row index (0-based). If row 0 is the header, return 0. If the first two rows are titles and row 2 is the header, return 2.

Return JSON format:
{ "header_row": <int>, "explanation": "<reasoning>" }
```

3. Parse LLM Response

```python
import json, re

json_match = re.search(r'\{.*\}', llm_response, re.DOTALL)
config = json.loads(json_match.group())
header_idx = config.get("header_row", 0)
```

4. Read Full Data with Correct Parameters

```python
full_df = pd.read_excel(file_path, sheet_name=sheet_name, header=header_idx)
print(f"Successfully loaded DataFrame: {full_df.shape}")
```
Token Cost: ~500 tokens (only 20 rows analyzed by LLM)
Processing Speed: Very fast (Pandas native speed)
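The workflow above leaves the actual LLM call abstract. A minimal stub for step 2, with a defensive fallback when the response cannot be parsed (`call_llm` and `detect_header_row` are hypothetical helpers, not part of the skill's scripts):

```python
import json
import re

def call_llm(prompt: str) -> str:
    """Placeholder: wire this to whatever LLM client your environment provides."""
    raise NotImplementedError("connect your LLM provider here")

def detect_header_row(csv_sample: str) -> int:
    """Ask the LLM for the 0-based header row; fall back to row 0 on any failure."""
    prompt = f"You are a Pandas expert. ... {csv_sample} ..."  # full template above
    try:
        match = re.search(r'\{.*\}', call_llm(prompt), re.DOTALL)
        return int(json.loads(match.group()).get("header_row", 0))
    except (NotImplementedError, AttributeError, ValueError, TypeError):
        return 0  # conservative default: treat the first row as the header
```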
Path B: HTML Semantic Mode
When: Complex/irregular tables (merged cells, multi-level headers)
Strategy: Convert to semantic HTML preserving structure (rowspan/colspan), then let the LLM extract the data with an understanding of the visual layout.
Workflow:
1. Convert to Semantic HTML

```python
import openpyxl

wb = openpyxl.load_workbook(file_path, data_only=True)
sheet = wb[sheet_name]

html_parts = ['<table border="1">']

# Track merged cell spans: origin cell -> span sizes, plus all covered (non-origin) cells
merge_map = {}
covered = set()
for merge in sheet.merged_cells.ranges:
    min_col, min_row, max_col, max_row = merge.bounds
    merge_map[(min_row, min_col)] = {
        'rowspan': max_row - min_row + 1,
        'colspan': max_col - min_col + 1,
    }
    for r in range(min_row, max_row + 1):
        for c in range(min_col, max_col + 1):
            if (r, c) != (min_row, min_col):
                covered.add((r, c))

# Build HTML with rowspan/colspan
for row_idx, row in enumerate(sheet.iter_rows(), start=1):
    html_parts.append('<tr>')
    for col_idx, cell in enumerate(row, start=1):
        # Skip cells that are part of a merge (not the top-left origin)
        if (row_idx, col_idx) in covered:
            continue
        value = cell.value or ''
        # Add rowspan/colspan if this is a merged cell origin
        attrs = ''
        if (row_idx, col_idx) in merge_map:
            span = merge_map[(row_idx, col_idx)]
            if span['rowspan'] > 1:
                attrs += f' rowspan="{span["rowspan"]}"'
            if span['colspan'] > 1:
                attrs += f' colspan="{span["colspan"]}"'
        html_parts.append(f'<td{attrs}>{value}</td>')
    html_parts.append('</tr>')
html_parts.append('</table>')

html_content = '\n'.join(html_parts)
```

2. LLM Extracts Structured Data

Prompt template:

```
Analyze the following HTML table and extract key data as JSON. Pay attention to merged cells (rowspan/colspan) which indicate hierarchical headers or grouped data.

HTML (first 2000 chars):
{html_content[:2000]}

Task: Extract the data preserving the semantic structure. Return JSON format suitable for the data type.
```

3. Parse and Return

```python
import json, re

json_match = re.search(r'\{.*\}', llm_response, re.DOTALL)
extracted_data = json.loads(json_match.group())
```
Token Cost: Higher (full HTML structure analyzed)
Processing Speed: Slower (LLM semantic extraction)
Use Case: Only for small, complex files where Pandas would fail
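Because token cost grows with sheet size, it is worth guarding the prompt before sending it (an illustrative budget check; `MAX_HTML_CHARS` is an assumed tuning knob, not part of the skill):

```python
MAX_HTML_CHARS = 8000  # assumed budget; tune to your model's context window

if len(html_content) > MAX_HTML_CHARS:
    # A sheet this large should normally be caught by Rule 3 and routed to
    # Pandas; bail out rather than send a truncated or oversized prompt.
    raise ValueError(
        f"HTML too large ({len(html_content)} chars); fall back to Pandas mode"
    )
```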
Implementation Code Template
Complete executable example combining both paths:
```python
import openpyxl
from openpyxl.utils import range_boundaries
import pandas as pd

class SmartExcelRouter:
    def __init__(self, file_path):
        self.file_path = file_path
        self.wb = openpyxl.load_workbook(file_path, read_only=False, data_only=True)

    def analyze_sheet_complexity(self, sheet_name):
        """Scout function: Calculate complexity score without loading data."""
        sheet = self.wb[sheet_name]
        max_row = sheet.max_row
        merged_ranges = sheet.merged_cells.ranges

        # Analyze merge distribution
        deep_merges = 0
        for merge in merged_ranges:
            min_col, min_row, max_col, max_row_merge = range_boundaries(str(merge))
            if min_row > 5:  # Beyond header region
                deep_merges += 1

        # Check empty row interruptions
        empty_interruptions = 0
        if max_row < 200:  # Only check short tables
            for row in sheet.iter_rows(min_row=1, max_row=max_row):
                if all(cell.value is None for cell in row):
                    empty_interruptions += 1

        # Apply scoring rules
        is_complex = False
        reasons = []
        if deep_merges > 2:
            is_complex = True
            reasons.append(f"Detected {deep_merges} merged-cell ranges in the data region")
        if max_row < 300 and empty_interruptions > 2:
            is_complex = True
            reasons.append("Detected multiple empty rows; likely a multi-sub-table layout")
        if max_row > 1000:
            is_complex = False
            reasons = ["Too many rows (>1000); forcing Pandas mode"]
        if not is_complex and not reasons:
            reasons.append("Regular structure; no complex layout detected")

        return {
            "is_complex": is_complex,
            "recommended_strategy": "html" if is_complex else "pandas",
            "reasons": reasons,
            "stats": {
                "total_rows": max_row,
                "deep_merges": deep_merges,
                "empty_interruptions": empty_interruptions
            }
        }

    def process_pandas_mode(self, sheet_name):
        """Path A: Read first 20 rows, LLM determines header, Pandas loads full data."""
        print(f"[Pandas Mode] Processing {sheet_name}...")
        # Step 1: Sample
        df_sample = pd.read_excel(self.file_path, sheet_name=sheet_name, header=None, nrows=20)
        csv_sample = df_sample.to_csv(index=False)
        # Step 2: LLM analyzes the sample (implement a call_llm hook; see Path A)
        # For now, assume header is at row 0
        header_idx = 0  # Replace with LLM analysis
        # Step 3: Full read
        full_df = pd.read_excel(self.file_path, sheet_name=sheet_name, header=header_idx)
        print(f"  Loaded DataFrame: {full_df.shape}")
        return full_df

    def process_html_mode(self, sheet_name):
        """Path B: Convert to HTML, LLM extracts semantically."""
        print(f"[HTML Mode] Processing {sheet_name}...")
        # Implementation would include HTML conversion as shown above
        # For brevity, returning placeholder
        return {"message": "Use HTML conversion code from Path B section"}

# Usage
if __name__ == "__main__":
    router = SmartExcelRouter("example.xlsx")
    for sheet_name in router.wb.sheetnames:
        analysis = router.analyze_sheet_complexity(sheet_name)
        print(f"\nSheet: {sheet_name}")
        print(f"Strategy: {analysis['recommended_strategy']}")
        print(f"Reasons: {analysis['reasons']}")
        if analysis['is_complex']:
            router.process_html_mode(sheet_name)
        else:
            router.process_pandas_mode(sheet_name)
```
Best Practices
1. Trust the Scout
Always run the complexity analysis before processing. The metadata scan is fast (<1 second) and prevents wasted effort on the wrong approach.
2. Respect the Row Count Rule
Never attempt HTML mode on files >1000 rows. Token limits will cause failures.
3. Pandas First for Unknown Files
When in doubt, try Pandas mode first. It fails fast and clearly when structure is incompatible.
4. Cache Analysis Results
If processing multiple sheets from the same file, run the analysis once and cache the results, as shown in the sketch below.
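A dict keyed by sheet name is enough (a minimal sketch around the `SmartExcelRouter` class from the template above):

```python
class CachedRouter(SmartExcelRouter):
    """SmartExcelRouter with per-sheet memoization of the complexity scan."""

    def __init__(self, file_path):
        super().__init__(file_path)
        self._analysis_cache = {}

    def analyze_sheet_complexity(self, sheet_name):
        # Each sheet's metadata is scanned at most once per router instance
        if sheet_name not in self._analysis_cache:
            self._analysis_cache[sheet_name] = super().analyze_sheet_complexity(sheet_name)
        return self._analysis_cache[sheet_name]
```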
5. Preserve Original Files
Never modify the original Excel file during analysis or processing.
Dependencies
Required Python packages:
- `openpyxl` - Metadata scanning and Excel file manipulation
- `pandas` - High-speed data reading and manipulation
- Access to Bash tool for executing Python scripts
Resources
This skill includes:
- `scripts/complexity_analyzer.py` - Standalone executable for complexity analysis
- Implementation code templates in this document for both processing paths