data-cleaning
Data cleaning, preprocessing, and quality assurance techniques
When & Why to Use This Skill
This Claude skill covers data cleaning, preprocessing, and quality assurance techniques for turning raw, messy data into analysis-ready datasets. It addresses missing value imputation, outlier detection, and data type validation in Python, SQL, and Excel, so that downstream analytics and machine learning workflows receive reliable inputs.
Use Cases
- Automating the identification and removal of duplicate records in large customer databases to ensure a single source of truth.
- Handling missing data in survey results using advanced imputation techniques or strategic deletion to maintain statistical integrity.
- Standardizing inconsistent string formats (e.g., phone numbers, addresses, or categories) across disparate data sources for unified reporting.
- Detecting and treating statistical outliers in financial transaction data to prevent skewed analysis and improve model accuracy.
- Validating data types and schemas during ETL processes to prevent downstream system failures and ensure data governance.
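As an illustration of the deduplication and string-standardization use cases, here is a minimal sketch using only the Python standard library. The function names (`normalize_phone`, `deduplicate`), the sample records, and the rule of dropping a leading `1` country code are assumptions made for this example, not part of the skill itself.

```python
import re


def normalize_phone(raw: str) -> str:
    """Reduce a phone number to digits only.

    Assumption for this sketch: an 11-digit number starting with '1'
    carries a country code that can be dropped.
    """
    digits = re.sub(r"\D", "", raw)
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]
    return digits


def deduplicate(records, key_fields):
    """Keep the first record for each normalized key, preserving order."""
    seen = set()
    unique = []
    for rec in records:
        # Normalize the key so that case and stray whitespace do not hide duplicates.
        key = tuple(str(rec.get(f, "")).strip().lower() for f in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique


# Hypothetical customer records with inconsistent formatting.
customers = [
    {"email": "Ann@Example.com ", "phone": "(555) 123-4567"},
    {"email": "ann@example.com", "phone": "+1 555 123 4567"},
    {"email": "bob@example.com", "phone": "555.987.6543"},
]

for customer in customers:
    customer["phone"] = normalize_phone(customer["phone"])

unique_customers = deduplicate(customers, ["email"])
```

Keeping the first occurrence is only one possible policy; a real pipeline might instead keep the most recent or most complete record.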
| name | data-cleaning |
|---|---|
| description | Data cleaning, preprocessing, and quality assurance techniques |
| version | "2.0.0" |
| sasmp_version | "2.0.0" |
| bonded_agent | 05-programming-expert |
| bond_type | SECONDARY_BOND |
| atomic | true |
| retry_enabled | true |
| max_retries | 3 |
| backoff_strategy | exponential |
| type | string |
| required | false |
| enum | [small, medium, large] |
| default | medium |
| logging_level | info |
| metrics | [rows_cleaned, missing_handled, duplicates_removed] |
Data Cleaning Skill
Overview
Master data cleaning and preprocessing techniques essential for reliable analytics.
Topics Covered
- Missing value handling (imputation, deletion)
- Outlier detection and treatment
- Data type conversion and validation
- Duplicate identification and removal
- String cleaning and normalization
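Two of the topics above, missing value imputation and outlier detection, can be sketched with the standard library alone. The example below assumes median imputation and the common 1.5 × IQR rule; both are illustrative choices, and the helper names and sample values are hypothetical.

```python
from statistics import median, quantiles


def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        return list(values)  # nothing to impute from
    fill = median(observed)
    return [fill if v is None else v for v in values]


def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical transaction amounts with one gap and one extreme value.
amounts = [10.0, 12.0, None, 11.0, 13.0, 250.0, 12.0]

filled = impute_median(amounts)
outliers = iqr_outliers(filled)
```

The median is used rather than the mean because it is less sensitive to the extreme value that the second step is meant to flag.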
Learning Outcomes
- Clean messy datasets
- Handle missing data appropriately
- Detect and treat outliers
- Ensure data quality
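One way to make "ensure data quality" concrete is to check each row against an expected schema before it moves downstream. The sketch below is a minimal illustration under assumed names: `SCHEMA`, its fields, and `validate_row` are hypothetical and not defined by this skill.

```python
# Hypothetical schema: field name -> expected Python type.
SCHEMA = {"id": int, "email": str, "age": int}


def validate_row(row, schema):
    """Return a list of human-readable problems; an empty list means the row passes."""
    errors = []
    for field, expected in schema.items():
        value = row.get(field)
        if value is None:
            errors.append(f"{field}: missing")
        elif type(value) is not expected:
            errors.append(
                f"{field}: expected {expected.__name__}, got {type(value).__name__}"
            )
    return errors


good_row = {"id": 1, "email": "ann@example.com", "age": 41}
bad_row = {"id": 2, "email": "bob@example.com", "age": "41"}

good_errors = validate_row(good_row, SCHEMA)
bad_errors = validate_row(bad_row, SCHEMA)
```

Collecting all problems per row, rather than stopping at the first, makes it easier to report data quality issues in bulk.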
Error Handling
| Error Type | Cause | Recovery |
|---|---|---|
| Memory error | Dataset too large | Use chunking or sampling |
| Type conversion failed | Invalid data format | Apply preprocessing first |
| Encoding issues | Wrong character encoding | Detect and specify encoding |
| Validation failure | Data doesn't meet schema | Review and adjust validation rules |
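The recovery strategies in the table can be sketched as follows, using only the standard library: trying a list of encodings in turn, processing rows in fixed-size chunks, and converting values with a fallback instead of failing. The encoding order, chunk size, helper names, and sample data are assumptions for this example.

```python
import csv
import io
from itertools import islice


def decode_with_fallback(raw: bytes, encodings=("utf-8", "cp1252")):
    """Try each encoding in turn and return (text, encoding_used)."""
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte to a character, so it never raises.
    return raw.decode("latin-1"), "latin-1"


def iter_chunks(rows, size):
    """Yield lists of at most `size` rows so the full dataset is never held at once."""
    it = iter(rows)
    while chunk := list(islice(it, size)):
        yield chunk


def to_float(text):
    """Convert text to float after light preprocessing; return None if it still fails."""
    cleaned = text.strip().replace(",", "")
    try:
        return float(cleaned)
    except ValueError:
        return None


# Hypothetical CSV bytes saved in cp1252, which is not valid UTF-8 here.
raw = 'name,amount\nJosé,"1,200.50"\nAnn,n/a\nBob,7\n'.encode("cp1252")

text, used_encoding = decode_with_fallback(raw)
reader = csv.DictReader(io.StringIO(text))

chunks = list(iter_chunks(reader, size=2))
amounts = [to_float(row["amount"]) for chunk in chunks for row in chunk]
```

Returning `None` for unparseable values keeps the failure visible, so it can be handled later as a missing value rather than silently dropped.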
Related Skills
- programming (for automation)
- foundations (for data quality concepts)
- databases-sql (for SQL-based cleaning)