data-cleaning

from majiayu000

Data cleaning, preprocessing, and quality assurance techniques

5 stars · 1 fork · Updated Jan 11, 2026

When & Why to Use This Skill

This Claude skill provides comprehensive data cleaning, preprocessing, and quality assurance techniques to transform raw, messy data into reliable, analysis-ready datasets. It automates critical tasks such as missing value imputation, outlier detection, and data type validation across multiple platforms including Python, SQL, and Excel, ensuring high-quality inputs for downstream analytics and machine learning workflows.

Use Cases

  • Automating the identification and removal of duplicate records in large customer databases to ensure a single source of truth.
  • Handling missing data in survey results using advanced imputation techniques or strategic deletion to maintain statistical integrity.
  • Standardizing inconsistent string formats (e.g., phone numbers, addresses, or categories) across disparate data sources for unified reporting.
  • Detecting and treating statistical outliers in financial transaction data to prevent skewed analysis and improve model accuracy.
  • Validating data types and schemas during ETL processes to prevent downstream system failures and ensure data governance.
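
Two of the use cases above, string standardization and duplicate removal, can be sketched with the Python standard library alone. This is a minimal illustration, not the skill's own implementation: the helper names `normalize_phone` and `deduplicate`, the sample records, and the assumption that 10-digit numbers take country code `1` are all hypothetical.

```python
import re


def normalize_phone(raw, default_country_code="1"):
    """Reduce a phone number to '+<digits>', or None if it has no digits.

    Assumption for illustration: a bare 10-digit number is a national
    number that needs the default country code prepended.
    """
    digits = re.sub(r"\D", "", raw or "")
    if len(digits) == 10:
        digits = default_country_code + digits
    return "+" + digits if digits else None


def deduplicate(records, key_fields):
    """Keep the first record seen for each case- and whitespace-insensitive key."""
    seen = set()
    unique = []
    for record in records:
        key = tuple(str(record.get(f, "")).strip().lower() for f in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


# Hypothetical sample data: the first two rows describe the same customer.
customers = [
    {"email": "Ann@Example.com ", "phone": "(555) 123-4567"},
    {"email": "ann@example.com", "phone": "555.123.4567"},
    {"email": "bob@example.com", "phone": "+1 555 987 6543"},
]

for customer in customers:
    customer["phone"] = normalize_phone(customer["phone"])

unique_customers = deduplicate(customers, key_fields=["email"])
```

Normalizing before deduplicating matters: the two "Ann" rows only compare equal once case and surrounding whitespace are removed from the key.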
name: data-cleaning
description: Data cleaning, preprocessing, and quality assurance techniques
version: "2.0.0"
sasmp_version: "2.0.0"
bonded_agent: 05-programming-expert
bond_type: SECONDARY_BOND
atomic: true
retry_enabled: true
max_retries: 3
backoff_strategy: exponential
type: string
required: false
enum: [small, medium, large]
default: medium
logging_level: info
metrics: [rows_cleaned, missing_handled, duplicates_removed]

Data Cleaning Skill

Overview

Master data cleaning and preprocessing techniques essential for reliable analytics.

Topics Covered

  • Missing value handling (imputation, deletion)
  • Outlier detection and treatment
  • Data type conversion and validation
  • Duplicate identification and removal
  • String cleaning and normalization
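
As a rough illustration of the first two topics, the sketch below imputes missing values with the median and then flags outliers with the interquartile range (IQR) rule. Both techniques are common choices rather than ones the skill prescribes. The function names, the sample `amounts`, and the `k=1.5` multiplier are assumptions made for the example.

```python
import statistics


def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        return list(values)  # nothing to impute from
    median = statistics.median(observed)
    return [median if v is None else v for v in values]


def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical transaction amounts with two gaps and one extreme value.
amounts = [12.0, 15.0, None, 14.0, 13.0, 250.0, None, 16.0]

filled = impute_median(amounts)
flagged = iqr_outliers(filled)
```

The median is used here instead of the mean because a single extreme value such as 250.0 would pull a mean-based fill far from the typical amount. Whether a flagged value should be removed, capped, or kept is a judgment that depends on the dataset.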

Learning Outcomes

  • Clean messy datasets
  • Handle missing data appropriately
  • Detect and treat outliers
  • Ensure data quality

Error Handling

| Error Type | Cause | Recovery |
|---|---|---|
| Memory error | Dataset too large | Use chunking or sampling |
| Type conversion failed | Invalid data format | Apply preprocessing first |
| Encoding issues | Wrong character encoding | Detect and specify encoding |
| Validation failure | Data doesn't meet schema | Review and adjust validation rules |
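
The first three recovery strategies can be sketched as follows, using only the standard library. This is one possible approach under stated assumptions, not the skill's actual behavior: the helpers `decode_with_fallback`, `iter_chunks`, and `to_float`, the candidate encoding list, and the sample CSV bytes are all hypothetical.

```python
import csv
import io
from itertools import islice


def decode_with_fallback(raw_bytes, encodings=("utf-8", "cp1252")):
    """Try each candidate encoding in order; return (text, encoding_used)."""
    for enc in encodings:
        try:
            return raw_bytes.decode(enc), enc
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte, so it never fails (but may mis-render text).
    return raw_bytes.decode("latin-1"), "latin-1"


def iter_chunks(rows, size):
    """Yield lists of up to `size` rows so the whole dataset is never held at once."""
    iterator = iter(rows)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def to_float(text):
    """Strip common formatting first; return None if the value is still invalid."""
    cleaned = text.strip().replace(",", "").lstrip("$")
    try:
        return float(cleaned)
    except ValueError:
        return None


# Hypothetical input: cp1252-encoded CSV bytes that are not valid UTF-8.
raw = 'id,amount\n1,"$1,200.50"\n2,n/a\n3,café\n'.encode("cp1252")

text, used = decode_with_fallback(raw)
reader = csv.DictReader(io.StringIO(text))

parsed = []
for chunk in iter_chunks(reader, size=2):
    parsed.extend(to_float(row["amount"]) for row in chunk)
```

Returning `None` for unconvertible values, rather than raising, lets the failed rows flow into whatever missing-value handling the pipeline already applies. Trial decoding only confirms that the bytes are valid in an encoding, not that the encoding is the right one, so specifying the encoding explicitly is preferable whenever it is known.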

Related Skills

  • programming (for automation)
  • foundations (for data quality concepts)
  • databases-sql (for SQL-based cleaning)