data-quality-checker

from armanzeroeight

Implement data quality checks, validation rules, and monitoring. Use when ensuring data quality, validating data pipelines, or implementing data governance.

19 stars · 4 forks · Updated Dec 2, 2025

When & Why to Use This Skill

The Data Quality Checker skill automates the work of maintaining data integrity by implementing validation rules, schema checks, and continuous monitoring. It addresses the classic 'garbage in, garbage out' problem in data pipelines, using industry-standard tooling such as Great Expectations so that downstream analytics, reporting, and machine learning models can rely on accurate, consistent, and timely data.

Use Cases

  • ETL Pipeline Validation: Automatically verify data schemas and value ranges during the ingestion process to prevent corrupt or malformed data from entering your data warehouse.
  • Production Data Monitoring: Set up continuous quality checks to detect stale data, unexpected null values, or duplicate records in live databases, triggering alerts before they impact business operations.
  • Data Governance Compliance: Implement and document standardized validation rules across the organization to ensure all datasets meet specific regulatory and quality benchmarks.
  • Automated Quality Audits: Generate comprehensive data quality metrics (completeness, uniqueness, validity) to track data health trends over time and identify areas for improvement.
name: data-quality-checker
description: Implement data quality checks, validation rules, and monitoring. Use when ensuring data quality, validating data pipelines, or implementing data governance.

Data Quality Checker

Implement comprehensive data quality checks and validation.

Quick Start

Use Great Expectations for validation, implement schema checks, monitor data quality metrics, and set up alerts on failures.

Instructions

Great Expectations Setup

import great_expectations as gx

context = gx.get_context()

# Create expectation suite
suite = context.add_expectation_suite("data_quality_suite")

# Get a validator for a batch of data and attach the expectation suite
# (batch_request is assumed to come from a datasource configured in your project)
validator = context.get_validator(
    batch_request=batch_request,
    expectation_suite_name="data_quality_suite"
)

# Schema validation
validator.expect_table_columns_to_match_ordered_list(
    column_list=["id", "name", "email", "created_at"]
)

# Null checks
validator.expect_column_values_to_not_be_null("email")

# Value ranges
validator.expect_column_values_to_be_between("age", min_value=0, max_value=120)

# Uniqueness
validator.expect_column_values_to_be_unique("email")

# Run validation
results = validator.validate()
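
The returned results object can drive control flow in the pipeline. A minimal sketch that aborts the run when any expectation fails; attribute names follow pre-1.0 Great Expectations validation results, so adjust for your version:

# Stop the pipeline if any expectation in the suite failed
if not results.success:
    failed = [
        r.expectation_config.expectation_type
        for r in results.results
        if not r.success
    ]
    raise ValueError(f"Data quality validation failed: {failed}")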

Custom Validation Rules

from datetime import datetime

def validate_data_quality(df):
    """Return a list of data quality issues found in a pandas DataFrame."""
    issues = []
    
    # Check for nulls
    null_counts = df.isnull().sum()
    if null_counts.any():
        issues.append(f"Null values found: {null_counts[null_counts > 0]}")
    
    # Check for duplicates
    duplicates = df.duplicated().sum()
    if duplicates > 0:
        issues.append(f"Found {duplicates} duplicate rows")
    
    # Check data freshness (assumes 'created_at' holds datetime values)
    max_date = df['created_at'].max()
    if (datetime.now() - max_date).days > 1:
        issues.append("Data is stale")
    
    return issues
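
A quick, hypothetical usage example: the sample DataFrame below is constructed to contain null emails, a duplicate row, and stale timestamps, so every check reports an issue.

import pandas as pd
from datetime import datetime, timedelta

# Hypothetical sample data: null emails, one duplicate row, 3-day-old records
sample = pd.DataFrame({
    'id': [1, 2, 2],
    'name': ['Ann', 'Bob', 'Bob'],
    'email': ['ann@example.com', None, None],
    'created_at': [datetime.now() - timedelta(days=3)] * 3,
})

for issue in validate_data_quality(sample):
    print(issue)
# Prints a null-value summary, "Found 1 duplicate rows", and "Data is stale"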

Data Quality Metrics

def calculate_quality_metrics(df):
    """Compute basic quality metrics for a DataFrame with 'email' and 'created_at' columns."""
    return {
        # Fraction of non-null cells across the whole frame
        'completeness': 1 - (df.isnull().sum().sum() / df.size),
        # Fraction of rows that are not exact duplicates
        'uniqueness': df.drop_duplicates().shape[0] / df.shape[0],
        # Fraction of rows with an '@' in the email (nulls count as invalid)
        'validity': df['email'].str.contains('@', na=False).sum() / len(df),
        # Age of the newest record, in days
        'timeliness': (datetime.now() - df['created_at'].max()).days
    }
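
These metrics become actionable when compared against explicit thresholds. A minimal alerting sketch built on the function above; the threshold values and the use of logging as the alert channel are illustrative placeholders:

import logging

# Illustrative minimum acceptable values; tune per dataset
QUALITY_THRESHOLDS = {
    'completeness': 0.99,   # at most 1% null cells
    'uniqueness': 0.999,    # at most 0.1% duplicate rows
    'validity': 0.98,       # at least 98% well-formed emails
}

def check_thresholds(df):
    metrics = calculate_quality_metrics(df)
    for name, minimum in QUALITY_THRESHOLDS.items():
        if metrics[name] < minimum:
            # Swap in your real alert channel (email, Slack, PagerDuty) here
            logging.warning("Quality check failed: %s=%.3f < %.3f",
                            name, metrics[name], minimum)
    # 'timeliness' is measured in days, so it needs a maximum rather than a minimum
    return metrics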

Best Practices

  • Validate at ingestion
  • Monitor quality metrics
  • Set up alerts for failures
  • Document quality rules
  • Regular quality audits
  • Track quality trends over time (one way to persist metrics is sketched below)
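
One simple way to track quality trends is to persist each metrics snapshot with a timestamp and review the history over time. A sketch using the calculate_quality_metrics function above; the metrics_history.csv path is a hypothetical example:

import csv
from datetime import datetime
from pathlib import Path

def record_quality_snapshot(df, path="metrics_history.csv"):
    """Append a timestamped row of quality metrics to a CSV for trend analysis."""
    metrics = calculate_quality_metrics(df)
    row = {'checked_at': datetime.now().isoformat(), **metrics}
    file = Path(path)
    write_header = not file.exists()
    with file.open('a', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=list(row))
        if write_header:
            writer.writeheader()
        writer.writerow(row)
    return metrics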