🧹 Data Cleaning Skills
Browse skills in the Data Cleaning category.
programmatic-eda
Systematic exploratory data analysis following best practices. Use when analyzing any dataset to understand structure, identify data quality issues (duplicates, missing values, inconsistencies, outliers), examine distributions, detect correlations, and generate visualizations. Provides comprehensive data profiling with sanity checks before analysis.
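A minimal sketch of this kind of profiling pass, assuming pandas and a hypothetical data.csv:

```python
import pandas as pd

df = pd.read_csv("data.csv")  # hypothetical input file

# Structure and dtypes
print(df.shape)
print(df.dtypes)

# Data quality: duplicates and missing values
print("duplicate rows:", df.duplicated().sum())
print(df.isna().sum())

# Distributions and correlations (numeric columns only)
print(df.describe())
print(df.corr(numeric_only=True))
```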
pandas-data-analyzer
Performs data analysis, data cleaning, exploratory data analysis (EDA), and visualization using the pandas library. Use this skill when the user shares .csv, .xlsx, or .json files or gives commands such as "run a data analysis", "use pandas", or "create a dataframe".
data-quality-audit
Comprehensive data quality assessment against defined business rules and constraints. Use when validating data against expected schemas, checking referential integrity across tables, or auditing data pipeline outputs before production use.
nixtla-schema-mapper
Transform data sources to Nixtla schema (unique_id, ds, y) with column inference. Use when preparing data for forecasting. Trigger with 'map to Nixtla schema' or 'transform data'.
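A minimal sketch of the target mapping, assuming pandas and made-up source column names (store, date, units):

```python
import pandas as pd

raw = pd.read_csv("sales.csv")  # hypothetical source file

# Rename source columns to the Nixtla long format: one row per (series, timestamp)
nixtla_df = raw.rename(columns={"store": "unique_id", "date": "ds", "units": "y"})
nixtla_df["ds"] = pd.to_datetime(nixtla_df["ds"])
nixtla_df = nixtla_df[["unique_id", "ds", "y"]].sort_values(["unique_id", "ds"])
```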
csv-validator
Validates and fixes BOM CSV files for ECIR tool compatibility. Use when users need to check CSV files before running ECIR comparisons, fix CSV formatting issues, ensure required columns exist, or diagnose why ECIR tool fails to process a CSV file.
nixtla-contract-schema-mapper
Transform prediction market data to Nixtla format (unique_id, ds, y). Use when preparing datasets for forecasting. Trigger with 'convert to Nixtla format' or 'schema mapping'.
dataset-curator
Curate and clean training datasets for high-quality machine learning
dotted-symbol-exclusion
Exclude ALL dotted symbols from yfinance sector lookups and training. Trigger when: (1) yfinance errors on warrants/units/class shares, (2) training notebook fails on excluded symbol types, (3) adding new symbol exclusion patterns.
pandas-data-processing
Pandas for time series analysis, OrcaFlex results processing, and marine engineering data workflows
data-engineering
ML data engineering covering data pipelines, data quality, collection strategies, storage, and versioning for machine learning systems.
training-data
Training data management including labeling strategies, data augmentation, handling imbalanced data, and data splitting best practices.
feature-engineering
Feature engineering techniques including feature extraction, transformation, selection, and feature store management for ML systems.
dataset-engineering
Building and processing datasets - data quality, curation, deduplication, synthesis, annotation, formatting. Use when creating training data, improving data quality, or generating synthetic data.
result-check
A tool that checks whether a file you made came out correctly. It automatically finds blank cells, duplicates, and errors!
field-validation
Validate data quality in CSV/Excel files for vehicle insurance platform. Use when checking required fields, validating data formats, detecting quality issues, or generating quality reports. Mentions "validate", "check fields", "data quality", "missing values", or "quality score".
numpy-sorting
Sorting and searching algorithms including O(n) partitioning, binary search, and hierarchical multi-key sorting. Triggers: sort, argsort, partition, searchsorted, lexsort, nan sort order.
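A short illustration of the NumPy calls this covers:

```python
import numpy as np

a = np.array([7, 2, 9, 4, 4, 1])

order = np.argsort(a)                 # indices that would sort a
top3 = np.partition(a, -3)[-3:]       # O(n) selection of the 3 largest (unordered)
pos = np.searchsorted(np.sort(a), 5)  # binary-search insertion point in a sorted array

# Hierarchical multi-key sort: the last key passed to lexsort is the primary key
last = np.array(["smith", "doe", "smith"])
first = np.array(["bob", "ann", "ann"])
idx = np.lexsort((first, last))       # sort by last name, then by first name
```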
numpy-set-ops
Set-theoretic operations for finding unique elements, membership testing, and array intersections. Triggers: unique, isin, intersect1d, setdiff1d, union1d.
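A quick tour of the named operations:

```python
import numpy as np

a = np.array([1, 2, 2, 3, 5])
b = np.array([2, 3, 4])

np.unique(a)          # [1 2 3 5] -- deduplicated and sorted
np.isin(a, b)         # elementwise membership test against b
np.intersect1d(a, b)  # [2 3]
np.setdiff1d(a, b)    # [1 5] -- elements of a not in b
np.union1d(a, b)      # [1 2 3 4 5]
```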
staff-mapping-management
Manage staff-institution mapping table for vehicle insurance platform. Use when updating mapping files, resolving name conflicts, converting Excel to JSON, or checking mapping coverage. Mentions "update mapping", "staff conflicts", "mapping table", or "institution assignment".
numpy-masked
Masked arrays for robust handling of missing or invalid data, ensuring they are excluded from statistical and mathematical computations. Triggers: masked array, numpy.ma, missing data, invalid values, hard mask.
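A small example of the numpy.ma workflow (the sentinel value is illustrative):

```python
import numpy as np
import numpy.ma as ma

# -999 stands in for invalid readings in this made-up data
raw = np.array([3.1, -999.0, 4.7, 5.2, -999.0])
masked = ma.masked_values(raw, -999.0)

masked.mean()        # ~4.33 -- masked entries are excluded from statistics
masked.std()         # likewise ignores invalid values

masked.harden_mask() # hard mask: assignments can no longer unmask entries
```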
preprocessing-data-with-automated-pipelines
Automate data cleaning, transformation, and validation for ML tasks. Use when requesting "preprocess data", "clean data", "ETL pipeline", or "data transformation".
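One illustrative shape for such a pipeline, sketched with scikit-learn (the toy values are made up):

```python
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# A two-step cleaning pipeline: fill missing values, then scale
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

df = pd.DataFrame({"x": [1.0, None, 3.0], "y": [10.0, 20.0, None]})
clean = pipe.fit_transform(df)  # NumPy array of imputed, scaled values
```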
engineering-features-for-machine-learning
Create, select, and transform features to improve machine learning model performance. Handles feature scaling, encoding, and importance analysis. Use when asked to "engineer features" or "select features".
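A hedged sketch of the scaling-plus-encoding step with scikit-learn (the toy columns are made up):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({"age": [25, 40, 31],
                   "income": [40_000, 85_000, 60_000],
                   "region": ["north", "south", "north"]})

# Scale the numeric features, one-hot encode the categorical one
ct = ColumnTransformer([
    ("num", StandardScaler(), ["age", "income"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["region"]),
])
X = ct.fit_transform(df)
```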
splitting-datasets
Split datasets into training, validation, and testing sets for ML model development. Use when requesting "split dataset", "train-test split", or "data partitioning".
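A sketch of a 60/20/20 split with scikit-learn on toy data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(50, 2)  # toy features
y = np.array([0, 1] * 25)          # toy binary labels

# Hold out 20% for testing, then 25% of the remainder for validation
# (30/10/10 samples here, i.e. 60/20/20 overall)
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=42, stratify=y_trainval)
```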
ml-fundamentals
Master machine learning foundations - algorithms, preprocessing, feature engineering, and evaluation
data-analyst
This skill should be used when analyzing CSV datasets, handling missing values through intelligent imputation, and creating interactive dashboards to visualize data trends. Use this skill for tasks involving data quality assessment, automated missing value detection and filling, statistical analysis, and generating Plotly Dash dashboards for exploratory data analysis.
d3-core-data
Use when working with data transformations, scales, color schemes, formatting, or CSV/TSV parsing. Invoke for data processing pipelines, scale creation, color interpolation, number/date formatting, or data loading/parsing operations.
policyengine-uk-data
UK survey data enhancement - Family Resources Survey (FRS) with Wealth and Assets Survey (WAS) imputation patterns.
date-normalizer
Use when asked to parse, normalize, standardize, or convert dates from various formats to consistent ISO 8601 or custom formats.
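One common way to implement this, sketched with the python-dateutil package (the sample strings are illustrative):

```python
from dateutil import parser  # pip install python-dateutil

raw_dates = ["March 5, 2024", "05/03/2024", "2024-03-05T14:30:00"]

# Normalize heterogeneous inputs to ISO 8601 date strings
iso = [parser.parse(d).date().isoformat() for d in raw_dates]
# ['2024-03-05', '2024-05-03', '2024-03-05']
# note: ambiguous 05/03/2024 parses month-first unless dayfirst=True is passed
```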
ai-annotation-workflow
Data annotation expert. Use for ML labeling, annotation workflows, and quality control.
data-pipeline-patterns
Follow these patterns when implementing data pipelines, ETL, data ingestion, or data validation in OptAIC. Use for point-in-time (PIT) correctness, Arrow schemas, quality checks, and Prefect orchestration.
data-validator
Validates the completeness and correctness of vehicle insurance CSV data: checks 26 required fields and validates data types, enum values, and business rules. Use when the user mentions "validate data", "check data", "data import", or "CSV".
phone-number-formatter
Standardize and format phone numbers with international support, validation, and multiple output formats.
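A sketch using the phonenumbers package (the Python port of Google's libphonenumber; the example number is made up):

```python
import phonenumbers  # pip install phonenumbers

num = phonenumbers.parse("(415) 555-2671", "US")  # region hint for national formats

phonenumbers.is_valid_number(num)  # validation against per-region metadata
phonenumbers.format_number(num, phonenumbers.PhoneNumberFormat.E164)           # '+14155552671'
phonenumbers.format_number(num, phonenumbers.PhoneNumberFormat.INTERNATIONAL)  # '+1 415-555-2671'
```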
data-cleaning
Data cleaning, preprocessing, and quality assurance techniques
address-parser
Parse unstructured addresses into structured components - street, city, state, zip, country with validation.
feature-engineering-kit
Auto-generate features with encodings, scaling, polynomial features, and interaction terms for ML pipelines.
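A minimal sketch of polynomial and interaction features with scikit-learn:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2.0, 3.0], [1.0, 5.0]])

# Degree-2 expansion: adds squared terms and the x1*x2 interaction term
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)  # columns: x1, x2, x1^2, x1*x2, x2^2
```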
outlier-detective
Detect anomalies and outliers in datasets using statistical and ML methods. Use for data cleaning, fraud detection, or quality control analysis.
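A minimal statistical example using Tukey's IQR fences (the sample values are made up):

```python
import numpy as np

values = np.array([10.2, 9.8, 10.5, 10.1, 37.0, 9.9, 10.3])

# Flag points beyond 1.5 * IQR from the quartiles
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
outliers = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]
# array([37.])
```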
geocoder
Convert addresses to coordinates (geocoding) and coordinates to addresses (reverse geocoding). Use for location data enrichment or address validation.
data-quality-auditor
Assess data quality with checks for missing values, duplicates, type issues, and inconsistencies. Use for data validation, ETL pipelines, or dataset documentation.
loading-insurance-data
Loads and preprocesses weekly insurance policy data, with smart period detection, multi-week data loading, and data validation and cleaning. Use at the start of any insurance data analysis task.
numpy-string-ops
Vectorized string manipulation using the char module and modern string alternatives, including cleaning and search operations. Triggers: string operations, numpy.char, text cleaning, substring search.
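A quick illustration with numpy.char:

```python
import numpy as np

names = np.array(["  Alice ", "BOB", "carol"])

# Vectorized cleaning via numpy.char (no Python-level loop)
cleaned = np.char.title(np.char.strip(names))  # ['Alice' 'Bob' 'Carol']
hits = np.char.find(cleaned, "ar")             # substring search; -1 where absent
```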
data-cleaning-standards
Clean and standardize vehicle insurance CSV/Excel data. Use when handling missing values, fixing data formats, removing duplicates, or standardizing fields. Mentions "clean data", "handle nulls", "standardize", "duplicates", or "normalize".
backend-data-processor
Process vehicle insurance Excel data using Pandas - file handling, data cleaning, merging, validation. Use when processing Excel/CSV files, handling data imports, implementing business rules (negative premiums, zero commissions), debugging data pipelines, or optimizing Pandas performance. Keywords: data_processor.py, Excel, CSV, Pandas, merge, deduplication, date normalization.
polars
Lightning-fast DataFrame library written in Rust for high-performance data manipulation and analysis. Use when the user wants blazing-fast data transformations, is working with large datasets, needs lazy-evaluation pipelines, or needs better performance than pandas. Ideal for ETL, data wrangling, aggregations, joins, and reading/writing CSV, Parquet, JSON files.
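A small sketch of a lazy polars pipeline (the file and column names are made up):

```python
import polars as pl

# Lazy pipeline: nothing executes until .collect(), so polars can
# push filters down and prune unused columns from the scan
result = (
    pl.scan_csv("events.csv")  # hypothetical input file
      .filter(pl.col("amount") > 0)
      .group_by("customer_id")
      .agg(pl.col("amount").sum().alias("total"))
      .collect()
)
```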
data-profiler
Profile datasets to understand schema, quality, and characteristics. Use when analyzing data files (CSV, JSON, Parquet), discovering dataset properties, assessing data quality, or when user mentions data profiling, schema detection, data analysis, or quality metrics. Provides basic and intermediate profiling including distributions, uniqueness, and pattern detection.
data-things
A general-purpose helper for assorted data tasks.
data-stuff
General-purpose utilities for working with data.
data-quality-checker
Implement data quality checks, validation rules, and monitoring. Use when ensuring data quality, validating data pipelines, or implementing data governance.
polars
Fast DataFrame library (Apache Arrow). Select, filter, group_by, joins, lazy evaluation, CSV/Parquet I/O, expression API, for high-performance data analysis workflows.
csv-excel-merger
Merge multiple CSV/Excel files with intelligent column matching, data deduplication, and conflict resolution. Handles different schemas, formats, and combines data sources. Use when users need to merge spreadsheets, combine data exports, or consolidate multiple files into one.
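The core of a merge like this, sketched in pandas (the file names and schemas are assumed):

```python
import pandas as pd

# Hypothetical exports with overlapping but not identical schemas
a = pd.read_csv("export_a.csv")
b = pd.read_excel("export_b.xlsx")

# Align columns by name, stack rows, then drop exact duplicate rows
merged = pd.concat([a, b], ignore_index=True, sort=False)
merged = merged.drop_duplicates()
merged.to_csv("merged.csv", index=False)
```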