# data-cleaning-standards
Clean and standardize vehicle insurance CSV/Excel data. Use when handling missing values, fixing data formats, removing duplicates, or standardizing fields. Mentions "clean data", "handle nulls", "standardize", "duplicates", or "normalize".
## When & Why to Use This Skill
This Claude skill automates the cleaning and standardization of vehicle insurance datasets in CSV or Excel formats. It streamlines data preparation by resolving missing values, eliminating duplicates based on specific business logic, and normalizing inconsistent date and numeric formats to ensure high-quality, analysis-ready data.
## Use Cases
- Standardizing raw insurance policy exports by normalizing date formats and numeric fields for consistent monthly reporting.
- Handling missing data points such as institution names or staff IDs using automated lookup and mapping strategies.
- Deduplicating policy records using composite keys (Policy No. + Time) to ensure only the most recent records are retained.
- Preparing large-scale insurance datasets for analysis by cleaning string variations and flagging outliers without deleting critical negative premium records.
| name | data-cleaning-standards |
|---|---|
| description | Clean and standardize vehicle insurance CSV/Excel data. Use when handling missing values, fixing data formats, removing duplicates, or standardizing fields. Mentions "clean data", "handle nulls", "standardize", "duplicates", or "normalize". |
| allowed-tools | Read, Edit, Grep, Glob |
# Data Cleaning Standards
Clean and standardize vehicle insurance data following established business rules.
## When to Activate
Use this skill when the user:
- Says "clean the data" or "standardize data"
- Mentions "missing values", "null handling", or "fill missing"
- Asks "remove duplicates" or "deduplicate"
- Wants to "normalize dates" or "standardize formats"
- Mentions data preparation or preprocessing
## Quick Start Workflow

```
Step 1: Handle Missing Values
        ↓
Step 2: Remove Duplicates
        ↓
Step 3: Standardize Formats
        ↓
Step 4: Handle Outliers
```
## Step 1: Handle Missing Values

### 1.1 Missing Value Strategy by Field Type
| Field Type | Strategy | Fill Value | Why |
|---|---|---|---|
| 三级机构 | Lookup from mapping | From mapping table | Authoritative source |
| 团队简称 | Keep null | `''` (empty string) | Optional field |
| 签单/批改保费 | Delete row | N/A | Critical metric |
| 手续费含税 | Fill zero | `0` | Legitimate zero commission |
| 是否续保 | Keep null | `''` | Display as "Unknown" |
| String fields | Fill empty | `''` | Avoid None errors |
| Numeric fields | Delete or 0 | Depends on field | Case by case |
### 1.2 Implementation

```python
def handle_missing_values(df, staff_mapping):
    """Apply missing value strategy"""
    # 1. 三级机构 - lookup from mapping
    if '三级机构' in df.columns:
        for idx in df[df['三级机构'].isnull()].index:
            staff = df.at[idx, '业务员']
            mapped_info = lookup_staff_info(staff, staff_mapping)
            if mapped_info:
                df.at[idx, '三级机构'] = mapped_info['三级机构']

    # 2. 手续费 - fill zero
    if '手续费含税' in df.columns:
        df['手续费含税'] = df['手续费含税'].fillna(0)

    # 3. 签单保费 - delete missing rows
    before = len(df)
    df = df[df['签单/批改保费'].notnull()]
    after = len(df)
    if before > after:
        print(f"Deleted {before - after} rows with missing premium")

    # 4. String fields - fill empty
    string_cols = df.select_dtypes(include=['object']).columns
    df[string_cols] = df[string_cols].fillna('')
    return df
```
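The helper `lookup_staff_info()` is referenced above but not defined in this skill. A minimal sketch, assuming `staff_mapping` is a dict keyed by salesperson name; the real structure belongs to the staff-mapping-management skill, so treat the shape shown here as an assumption:

```python
def lookup_staff_info(staff, staff_mapping):
    """Return the mapping entry for a salesperson, or None if unmapped.

    Assumed shape: {"张三": {"三级机构": "XX三级机构", "团队简称": "A团队"}, ...}
    """
    if not staff:
        return None
    return staff_mapping.get(str(staff).strip())
```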
## Step 2: Remove Duplicates

### 2.1 Deduplication Rules

- **Composite Key**: 保单号 + 投保确认时间
- **Keep Strategy**: `keep='last'` (most recent record)
- **Why**: Same policy may have multiple updates (批改)

### 2.2 Implementation
```python
def remove_duplicates(df):
    """Remove duplicate records"""
    # Ensure date is datetime
    df['投保确认时间'] = pd.to_datetime(df['投保确认时间'], errors='coerce')

    # Use duplicated() to avoid type issues
    before = len(df)
    dup_mask = df.duplicated(subset=['保单号', '投保确认时间'], keep='last')
    df = df[~dup_mask]
    after = len(df)
    if before > after:
        print(f"Removed {before - after} duplicates")
    return df
```
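Note that with a composite key, `keep='last'` only chooses between rows that share both 保单号 and 投保确认时间, and it keeps whichever appears last in file order. If the export also carries 刷新时间, sorting by it first makes "most recent" deterministic; a sketch under that assumption:

```python
import pandas as pd

def remove_duplicates_by_refresh(df):
    """Deduplicate on 保单号 + 投保确认时间, keeping the most recently refreshed row."""
    df['投保确认时间'] = pd.to_datetime(df['投保确认时间'], errors='coerce')
    df['刷新时间'] = pd.to_datetime(df['刷新时间'], errors='coerce')
    # Oldest refresh first, newest last, so keep='last' picks the latest update
    df = df.sort_values('刷新时间')
    return df.drop_duplicates(subset=['保单号', '投保确认时间'], keep='last')
```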
## Step 3: Standardize Formats

### 3.1 Date Standardization

**Target Format**: `datetime64[ns]`
```python
def standardize_dates(df):
    """Convert all date fields to datetime"""
    date_cols = ['刷新时间', '投保确认时间', '保险起期']
    for col in date_cols:
        if col in df.columns:
            df[col] = pd.to_datetime(df[col], errors='coerce')
            invalid = df[col].isnull().sum()
            if invalid > 0:
                print(f"⚠️ {col}: {invalid} invalid dates converted to NaT")
    return df
```
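If the warning fires, it helps to look at the raw values that fail to parse before choosing a fix. Run this on the frame *before* `standardize_dates()`, since conversion replaces the originals with NaT; a small diagnostic sketch:

```python
import pandas as pd

def report_unparseable_dates(df, col='投保确认时间'):
    """List raw values that pandas cannot parse, without modifying df."""
    parsed = pd.to_datetime(df[col], errors='coerce')
    bad = df.loc[parsed.isnull() & df[col].notnull(), col]
    if len(bad) > 0:
        print(f"⚠️ {col}: {len(bad)} unparseable values, e.g. {bad.head(5).tolist()}")
    return bad
```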
### 3.2 Numeric Standardization

**Target Format**: `float64`
```python
def standardize_numerics(df):
    """Convert numeric fields to float"""
    numeric_cols = ['签单/批改保费', '签单数量', '手续费', '手续费含税', '增值税']
    for col in numeric_cols:
        if col in df.columns:
            df[col] = pd.to_numeric(df[col], errors='coerce')
    return df
```
### 3.3 String Standardization
Rules:
- Strip leading/trailing whitespace
- Map variations (Y/N → 是/否)
```python
def standardize_strings(df):
    """Clean string fields"""
    # Strip whitespace
    string_cols = df.select_dtypes(include=['object']).columns
    for col in string_cols:
        df[col] = df[col].astype(str).str.strip()

    # Standardize yes/no fields
    yes_no_cols = ['是否续保', '是否新能源', '是否过户车', '是否异地车']
    for col in yes_no_cols:
        if col in df.columns:
            df[col] = df[col].map({
                'Y': '是', 'N': '否',
                'y': '是', 'n': '否',
                '1': '是', '0': '否'
            }).fillna(df[col])
    return df
```
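One caveat if `standardize_strings()` is run on its own: when missing values have not yet been filled (Step 1), `astype(str)` turns NaN into the literal string `'nan'`. Filling object columns first avoids that; a minimal guard:

```python
# Fill missing object-typed cells before astype(str),
# otherwise NaN becomes the literal string 'nan'
string_cols = df.select_dtypes(include=['object']).columns
df[string_cols] = df[string_cols].fillna('')
df = standardize_strings(df)
```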
## Step 4: Handle Outliers

### 4.1 Outlier Detection Rules

**Important**: NEVER delete negative premiums (legitimate business data)
| Outlier Type | Detection Rule | Action |
|---|---|---|
| Negative premium | `< 0` | ✅ KEEP (refunds/adjustments) |
| Extreme premium | `< -1M` or `> 100K` | ⚠️ Flag only |
| Zero commission | `== 0` | ✅ KEEP (normal) |
| Negative amount | 保额 `< 0` | ⚠️ Flag only |
### 4.2 Implementation
```python
def detect_outliers(df):
    """Detect outliers but DO NOT delete"""
    outliers = {}

    # 1. Extreme premium (flag only)
    if '签单/批改保费' in df.columns:
        extreme = df[
            (df['签单/批改保费'] < -1000000) |
            (df['签单/批改保费'] > 100000)
        ]
        if len(extreme) > 0:
            outliers['extreme_premium'] = {
                'count': len(extreme),
                'samples': extreme['保单号'].head(5).tolist()
            }

    # 2. Negative amounts (flag only)
    if '签单/批改保额' in df.columns:
        negative_amt = df[df['签单/批改保额'] < 0]
        if len(negative_amt) > 0:
            outliers['negative_amount'] = {
                'count': len(negative_amt),
                'samples': negative_amt['保单号'].head(5).tolist()
            }

    # Report outliers without deleting
    if outliers:
        print("⚠️ Detected outliers (kept in data):")
        for key, info in outliers.items():
            print(f"  - {key}: {info['count']} records")
    return df, outliers
```
**Critical Rule**: Never filter out negative premiums:

```python
# ❌ WRONG - DO NOT DO THIS
df = df[df['签单/批改保费'] > 0]

# ✅ CORRECT - Keep all values
total_premium = df['签单/批改保费'].sum()  # May be negative
```
## Complete Cleaning Pipeline

### All-in-One Function
```python
def clean_data(df, staff_mapping):
    """Complete cleaning pipeline"""
    print("Starting data cleaning pipeline...")
    initial_count = len(df)

    # Step 1: Handle missing values
    df = handle_missing_values(df, staff_mapping)
    print("✓ Step 1: Handled missing values")

    # Step 2: Remove duplicates
    df = remove_duplicates(df)
    print("✓ Step 2: Removed duplicates")

    # Step 3: Standardize formats
    df = standardize_dates(df)
    df = standardize_numerics(df)
    df = standardize_strings(df)
    print("✓ Step 3: Standardized formats")

    # Step 4: Detect outliers (no deletion)
    df, outliers = detect_outliers(df)
    print("✓ Step 4: Detected outliers")

    final_count = len(df)
    print(f"\nCleaning complete: {initial_count} → {final_count} records")
    return df, outliers
```
## Common Use Cases

### Case 1: "Clean my CSV file"
```python
import pandas as pd
import json

# Load data
df = pd.read_csv('data.csv', encoding='utf-8-sig')
with open('staff_mapping.json', encoding='utf-8') as f:
    mapping = json.load(f)

# Run full pipeline
df_clean, outliers = clean_data(df, mapping)

# Save cleaned data
df_clean.to_csv('data_cleaned.csv', index=False, encoding='utf-8-sig')
```
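The same pipeline works for Excel exports; only the load and save calls change. A sketch assuming an `.xlsx` file with the data on the first sheet (requires `openpyxl`; file names are placeholders):

```python
import pandas as pd
import json

# Load an Excel export instead of CSV
df = pd.read_excel('data.xlsx', sheet_name=0)
with open('staff_mapping.json', encoding='utf-8') as f:
    mapping = json.load(f)

# Run full pipeline
df_clean, outliers = clean_data(df, mapping)

# Save cleaned data back to Excel
df_clean.to_excel('data_cleaned.xlsx', index=False)
```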
### Case 2: "Handle missing institution fields"
```python
# Focus on Step 1 - missing value handling
df_original = df.copy()  # keep a copy to compare against
df = handle_missing_values(df, staff_mapping)

# Check results
missing_before = df_original['三级机构'].isnull().sum()
missing_after = df['三级机构'].isnull().sum()
print(f"Fixed {missing_before - missing_after} missing institutions")
```
### Case 3: "Remove duplicate policies"
```python
# Focus on Step 2 - deduplication
df_unique = remove_duplicates(df)
print(f"Removed {len(df) - len(df_unique)} duplicates")
```
## Cleaning Checklist
Before cleaning:
- Backup original data
- Load staff mapping file
- Check file encoding (use `utf-8-sig` for Excel exports); see the sketch below
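A minimal sketch covering the three items above (file names are placeholders):

```python
import json
import shutil
import pandas as pd

# Back up the raw export before touching it
shutil.copy('data.csv', 'data_backup.csv')

# Load the export with the encoding used by Excel exports
df = pd.read_csv('data.csv', encoding='utf-8-sig')

# Load the staff mapping file
with open('staff_mapping.json', encoding='utf-8') as f:
    staff_mapping = json.load(f)
```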
During cleaning:
- Handle missing values (Step 1)
- Remove duplicates (Step 2)
- Standardize formats (Step 3)
- Detect outliers (Step 4)
After cleaning:
- Verify record count change is reasonable
- Check critical fields are complete
- Review outlier report
- Save cleaned data with new filename
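A quick verification sketch for the checks above, assuming `df` is the raw frame and `df_clean` the cleaned result:

```python
# Record count change should be small and explainable
dropped = len(df) - len(df_clean)
print(f"Rows dropped during cleaning: {dropped} ({dropped / len(df):.1%})")

# Critical numeric field should be fully populated
print("Missing premiums:", df_clean['签单/批改保费'].isnull().sum())

# String fields were filled with '', so check for empties rather than nulls
print("Unmapped institutions:", (df_clean['三级机构'] == '').sum())

# Save under a new filename so the raw export stays untouched
df_clean.to_csv('data_cleaned.csv', index=False, encoding='utf-8-sig')
```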
## Troubleshooting

### "Many records deleted"

**Check**: Are you accidentally deleting negative premiums?
```python
# Check negative premium count
negative_count = (df['签单/批改保费'] < 0).sum()
print(f"Negative premiums: {negative_count} (should be kept)")
```
"Date conversion creates many NaT"
Solution: Check date format
```python
# Inspect date column
print(df['投保确认时间'].head())

# Try different format
df['投保确认时间'] = pd.to_datetime(
    df['投保确认时间'],
    format='%Y/%m/%d',  # Adjust format
    errors='coerce'
)
```
"Duplicates not removed"
Check: Ensure date column is datetime type
print(df['投保确认时间'].dtype) # Should be datetime64[ns]
## Related Files

**Data processor**: `backend/data_processor.py`
- See `_clean_data()` method (lines 132-156)
- See `merge_with_existing()` method (lines 158-192)

**Field definitions**: `docs/FIELD_MAPPING.md`

**Related Skills**:
- `field-validation` - Run this BEFORE cleaning
- `staff-mapping-management` - Update mapping table
**Skill Version**: v1.0 | **Created**: 2025-11-09 | **File Size**: ~310 lines | **Focuses On**: Data cleaning only (not validation or mapping)