dataset-engineering

by doanchienthangdev

Building and processing datasets - data quality, curation, deduplication, synthesis, annotation, formatting. Use when creating training data, improving data quality, or generating synthetic data.


When & Why to Use This Skill

The Dataset Engineering skill is a toolkit for building, refining, and optimizing high-quality datasets for AI and machine learning applications. It covers the critical stages of a data pipeline, including deduplication with MinHash LSH, data curation, and synthesis techniques such as Self-Instruct and AI-powered QA generation. By focusing on core data quality dimensions (accuracy, completeness, and consistency), the skill helps developers turn raw information into structured, training-ready formats that improve model performance and reliability.

Use Cases

  • LLM Fine-tuning: Curating and formatting high-quality instruction-response pairs to improve model performance and instruction-following capabilities.
  • Synthetic Data Generation: Using Self-Instruct and augmentation techniques to expand small seed datasets into robust, diverse training sets for niche domains.
  • Data Quality Assurance: Implementing automated pipelines to remove duplicate entries and validate data against specific schemas to ensure dataset integrity.
  • RAG System Optimization: Generating synthetic Question-Answer pairs from technical documentation to create benchmarks for evaluating Retrieval-Augmented Generation systems.
  • Data Formatting & Standardization: Converting raw conversational or unstructured data into standardized chat templates (e.g., OpenAI or ShareGPT formats) for seamless model training.

name: dataset-engineering
description: Building and processing datasets - data quality, curation, deduplication, synthesis, annotation, formatting. Use when creating training data, improving data quality, or generating synthetic data.

Dataset Engineering Skill

Building high-quality datasets for AI applications.

Data Quality Dimensions

Dimension    | Description       | Check
-------------|-------------------|-------------
Accuracy     | Data is correct   | Validation
Completeness | No missing values | Schema check
Consistency  | No contradictions | Dedup
Timeliness   | Up-to-date        | Timestamps
Relevance    | Matches use case  | Filtering
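
As a concrete sketch of the completeness and accuracy checks above, the following validator enforces a simple record schema. The required_fields default and the record layout are illustrative assumptions, not part of the skill.

def validate_record(record, required_fields=("instruction", "output")):
    # Completeness: every required field is present and non-empty
    for field in required_fields:
        value = record.get(field)
        if not isinstance(value, str) or not value.strip():
            return False
    return True

# Usage: keep only records that pass the schema check
# clean = [r for r in raw_data if validate_record(r)]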

Data Curation Pipeline

class DataCurationPipeline:
    def __init__(self, deduplicator, cleaner, filter_, formatter):
        # Stages are injected so each one can be swapped independently
        self.deduplicator = deduplicator
        self.cleaner = cleaner
        self.filter = filter_
        self.formatter = formatter

    def inspect(self, data):
        # Lightweight profiling before any processing
        print(f"{len(data)} records; sample: {data[:3]}")

    def run(self, raw_data):
        # 1. Inspect the raw data first
        self.inspect(raw_data)

        # 2. Deduplicate before the more expensive steps
        data = self.deduplicator.dedupe(raw_data)

        # 3. Clean and filter
        data = self.cleaner.clean(data)
        data = self.filter.filter(data)

        # 4. Format into the final training-ready layout
        return self.formatter.format(data)
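
A usage sketch, assuming cleaner, filter, and formatter objects with the method names the pipeline calls (hypothetical here), plus the Deduplicator defined in the next section:

pipeline = DataCurationPipeline(
    deduplicator=Deduplicator(threshold=0.8),  # defined below
    cleaner=my_cleaner,        # hypothetical: exposes .clean(data)
    filter_=my_filter,         # hypothetical: exposes .filter(data)
    formatter=my_formatter,    # hypothetical: exposes .format(data)
)
dataset = pipeline.run(raw_data)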

Deduplication

from datasketch import MinHash, MinHashLSH

class Deduplicator:
    def __init__(self, threshold=0.8):
        # LSH index flags any pair with estimated Jaccard similarity >= threshold
        self.lsh = MinHashLSH(threshold=threshold, num_perm=128)

    def minhash(self, text):
        # Build a MinHash signature over the document's word set
        m = MinHash(num_perm=128)
        for word in text.split():
            m.update(word.encode('utf8'))
        return m

    def dedupe(self, docs):
        unique = []
        for i, doc in enumerate(docs):
            mh = self.minhash(doc["text"])
            # Keep the doc only if no near-duplicate is already indexed
            if not self.lsh.query(mh):
                self.lsh.insert(f"doc_{i}", mh)
                unique.append(doc)
        return unique
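
A quick usage sketch (datasketch is installable from PyPI; the sample documents are illustrative):

docs = [
    {"text": "The quick brown fox jumps over the lazy dog"},
    {"text": "The quick brown fox jumps over the lazy dog"},  # exact duplicate
    {"text": "An entirely different sentence about datasets"},
]
unique_docs = Deduplicator(threshold=0.8).dedupe(docs)  # 2 docs survive

Because the signature is built over word sets, the 0.8 threshold drops documents that share roughly 80% of their vocabulary, not just exact copies.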

Data Synthesis

AI-Powered QA Generation

import json

def generate_qa(document, model, n=5):
    # Ask for QA pairs as a JSON list so the output is machine-parseable
    prompt = f"""Generate {n} QA pairs from:

{document}

Format: [{{"question": "...", "answer": "..."}}]"""

    # Raises json.JSONDecodeError if the model returns malformed JSON
    return json.loads(model.generate(prompt))
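
Model output is not guaranteed to be valid JSON, so a defensive parse-and-validate step is worth the few extra lines. A minimal sketch; the shape checks simply mirror the schema requested in the prompt above:

def parse_qa_pairs(raw):
    # Return only well-formed pairs; an empty list signals a failed generation
    try:
        pairs = json.loads(raw)
    except json.JSONDecodeError:
        return []
    return [
        p for p in pairs
        if isinstance(p, dict)
        and isinstance(p.get("question"), str)
        and isinstance(p.get("answer"), str)
    ]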

Self-Instruct

import random

def self_instruct(seeds, model, n=100):
    # Grow the task pool by prompting with a mix of seed and recent generated tasks
    generated = []

    for _ in range(n):
        pool = seeds + generated[-20:]
        samples = random.sample(pool, min(5, len(pool)))
        prompt = "Examples:\n" + "\n".join(samples) + "\n\nNew task:"  # tasks as plain strings

        new = model.generate(prompt)
        # is_valid and is_diverse are user-supplied quality and novelty filters
        if is_valid(new) and is_diverse(new, generated):
            generated.append(new)

    return generated
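
The novelty filter is left abstract above. A minimal sketch using word-level Jaccard overlap; the 0.7 cutoff is an assumption (the original Self-Instruct paper uses ROUGE-L for this check):

def is_diverse(new, generated, max_overlap=0.7):
    # Reject the candidate if it overlaps too heavily with anything already kept
    new_words = set(new.lower().split())
    for prev in generated:
        prev_words = set(prev.lower().split())
        union = new_words | prev_words
        if union and len(new_words & prev_words) / len(union) >= max_overlap:
            return False
    return True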

Data Augmentation

import random

def augment_text(text):
    # synonym_replace, back_translate, and model are assumed to be in scope
    methods = [
        synonym_replace,  # swap words for synonyms
        back_translate,   # translate to a pivot language and back
        model.rephrase,   # ask an LLM to rephrase
    ]
    return random.choice(methods)(text)
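
A toy sketch of the synonym-replacement method; the tiny synonym table is purely illustrative (a real pipeline would draw from WordNet or an embedding lookup):

SYNONYMS = {"quick": ["fast", "rapid"], "big": ["large", "huge"]}  # illustrative

def synonym_replace(text, p=0.3):
    # Swap each word for a random synonym with probability p
    words = []
    for word in text.split():
        options = SYNONYMS.get(word.lower())
        if options and random.random() < p:
            words.append(random.choice(options))
        else:
            words.append(word)
    return " ".join(words)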

Data Formatting

Instruction Format

def format_instruction(example):
    # Alpaca-style prompt; following the original Alpaca templates,
    # the Input block is included only when the example has one
    if example.get('input'):
        return f"""### Instruction:
{example['instruction']}

### Input:
{example['input']}

### Response:
{example['output']}"""
    return f"""### Instruction:
{example['instruction']}

### Response:
{example['output']}"""
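
For example, a record with no input renders as:

print(format_instruction({
    "instruction": "Summarize the text.",
    "input": "",
    "output": "A short summary.",
}))
# ### Instruction:
# Summarize the text.
#
# ### Response:
# A short summary.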

Chat Format

def format_chat(conversation):
    # Normalize turns into OpenAI-style {"role", "content"} messages
    return [
        {"role": turn["role"], "content": turn["content"]}
        for turn in conversation
    ]
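
The use cases above mention ShareGPT-format data, which stores each turn under "from"/"value" keys with "human"/"gpt" speakers; a small conversion sketch into the chat format above:

SHAREGPT_ROLES = {"system": "system", "human": "user", "gpt": "assistant"}

def sharegpt_to_chat(conversation):
    # Map ShareGPT turns onto OpenAI-style role/content messages
    return [
        {"role": SHAREGPT_ROLES[turn["from"]], "content": turn["value"]}
        for turn in conversation
    ]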

Best Practices

  1. Inspect data before processing
  2. Deduplicate before expensive operations
  3. Use multiple synthesis methods
  4. Validate synthetic data quality
  5. Track data lineage