# dataset-engineering
Building and processing datasets - data quality, curation, deduplication, synthesis, annotation, formatting. Use when creating training data, improving data quality, or generating synthetic data.
## When & Why to Use This Skill
The Dataset Engineering skill is a comprehensive toolkit designed for building, refining, and optimizing high-quality datasets for AI and machine learning applications. It automates critical data pipeline stages including deduplication using MinHash LSH, data curation, and advanced data synthesis techniques like Self-Instruct and AI-powered QA generation. By focusing on core data quality dimensions—accuracy, completeness, and consistency—this skill enables developers to transform raw information into structured, training-ready formats, significantly enhancing the performance and reliability of AI models.
## Use Cases
- LLM Fine-tuning: Curating and formatting high-quality instruction-response pairs to improve model performance and instruction-following capabilities.
- Synthetic Data Generation: Using Self-Instruct and augmentation techniques to expand small seed datasets into robust, diverse training sets for niche domains.
- Data Quality Assurance: Implementing automated pipelines to remove duplicate entries and validate data against specific schemas to ensure dataset integrity.
- RAG System Optimization: Generating synthetic Question-Answer pairs from technical documentation to create benchmarks for evaluating Retrieval-Augmented Generation systems.
- Data Formatting & Standardization: Converting raw conversational or unstructured data into standardized chat templates (e.g., OpenAI or ShareGPT formats) for seamless model training.
## Dataset Engineering Skill
Building high-quality datasets for AI applications.
## Data Quality Dimensions
| Dimension | Description | Check |
|---|---|---|
| Accuracy | Data is correct | Validation |
| Completeness | No missing values | Schema check |
| Consistency | No contradictions | Dedup |
| Timeliness | Up-to-date | Timestamps |
| Relevance | Matches use case | Filtering |
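
As a minimal sketch of the completeness and accuracy checks above; the field names and the `check_record` helper are illustrative, not part of the skill:

```python
# Illustrative only: REQUIRED_FIELDS is a made-up schema for instruction data.
REQUIRED_FIELDS = ["instruction", "output"]

def check_record(record):
    """Return a list of quality problems found in a single record."""
    problems = []
    for field in REQUIRED_FIELDS:
        if field not in record:
            problems.append(f"missing field: {field}")   # completeness
        elif not str(record[field]).strip():
            problems.append(f"empty field: {field}")      # validation
    return problems
```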
## Data Curation Pipeline

```python
class DataCurationPipeline:
    def __init__(self, deduplicator, cleaner, filterer, formatter):
        self.deduplicator = deduplicator
        self.cleaner = cleaner
        self.filter = filterer
        self.formatter = formatter

    def inspect(self, data):
        # Look at the raw data before any processing
        print(f"Records: {len(data)}")

    def run(self, raw_data):
        # 1. Inspect
        self.inspect(raw_data)
        # 2. Deduplicate before the more expensive steps
        data = self.deduplicator.dedupe(raw_data)
        # 3. Clean and filter
        data = self.cleaner.clean(data)
        data = self.filter.filter(data)
        # 4. Format
        return self.formatter.format(data)
```
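
A quick, hypothetical wiring of the pipeline. `Cleaner`, `Filter`, and `Formatter` are placeholder stubs invented here for illustration; `Deduplicator` is the class defined in the Deduplication section below.

```python
class Cleaner:
    def clean(self, docs):
        # Strip surrounding whitespace from every text field
        return [{**d, "text": d["text"].strip()} for d in docs]

class Filter:
    def filter(self, docs):
        # Drop very short records
        return [d for d in docs if len(d["text"].split()) >= 5]

class Formatter:
    def format(self, docs):
        return docs  # no-op placeholder

pipeline = DataCurationPipeline(Deduplicator(), Cleaner(), Filter(), Formatter())
curated = pipeline.run(raw_docs)  # raw_docs: list of {"text": ...} dicts
```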
## Deduplication

```python
from datasketch import MinHash, MinHashLSH

class Deduplicator:
    def __init__(self, threshold=0.8):
        # LSH index for approximate Jaccard-similarity lookups
        self.lsh = MinHashLSH(threshold=threshold, num_perm=128)

    def minhash(self, text):
        m = MinHash(num_perm=128)
        for word in text.split():
            m.update(word.encode('utf8'))
        return m

    def dedupe(self, docs):
        unique = []
        for i, doc in enumerate(docs):
            mh = self.minhash(doc["text"])
            # Keep the document only if no near-duplicate is already indexed
            if not self.lsh.query(mh):
                self.lsh.insert(f"doc_{i}", mh)
                unique.append(doc)
        return unique
```
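
Usage sketch (the sample records are made up; the second, identical record is dropped):

```python
docs = [
    {"text": "the quick brown fox jumps over the lazy dog"},
    {"text": "the quick brown fox jumps over the lazy dog"},   # exact duplicate
    {"text": "an entirely different sentence about training data"},
]
deduper = Deduplicator(threshold=0.8)
print(len(deduper.dedupe(docs)))  # 2
```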
## Data Synthesis

### AI-Powered QA Generation

```python
import json

def generate_qa(document, model, n=5):
    # `model` is assumed to be any client exposing generate(prompt) -> str
    prompt = f"""Generate {n} QA pairs from:
{document}
Format: [{{"question": "...", "answer": "..."}}]"""
    return json.loads(model.generate(prompt))
```
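
Model output is not guaranteed to be valid JSON or well formed, so the parsed result is worth checking; `validate_qa_pairs` below is a hypothetical post-check, not part of the skill itself:

```python
def validate_qa_pairs(pairs):
    """Keep only well-formed, non-empty question/answer pairs."""
    valid = []
    for p in pairs if isinstance(pairs, list) else []:
        if not isinstance(p, dict):
            continue
        q = str(p.get("question", "")).strip()
        a = str(p.get("answer", "")).strip()
        if q and a:
            valid.append({"question": q, "answer": a})
    return valid
```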
### Self-Instruct

```python
import random

def self_instruct(seeds, model, n=100):
    generated = []
    for _ in range(n):
        # Mix seed tasks with the most recent generations as in-context examples
        samples = random.sample(seeds + generated[-20:], 5)
        examples = "\n".join(samples)
        prompt = f"Examples:\n{examples}\n\nNew task:"
        new = model.generate(prompt)
        # is_valid / is_diverse are filtering hooks (length, format, overlap checks)
        if is_valid(new) and is_diverse(new, generated):
            generated.append(new)
    return generated
```
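
The `is_valid` and `is_diverse` hooks above are left abstract. The original Self-Instruct recipe filters new instructions by ROUGE-L overlap with existing ones; as a rough, hypothetical stand-in, a plain word-overlap check might look like:

```python
def is_diverse(candidate, existing, max_overlap=0.7):
    """Reject a candidate whose word overlap with any existing task is too high."""
    cand_words = set(candidate.lower().split())
    if not cand_words:
        return False
    for task in existing:
        task_words = set(task.lower().split())
        union = cand_words | task_words
        if union and len(cand_words & task_words) / len(union) > max_overlap:
            return False
    return True
```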
## Data Augmentation

```python
import random

def augment_text(text):
    # Each helper is assumed to exist elsewhere in the project
    methods = [
        lambda t: synonym_replace(t),   # lexical: swap words for synonyms
        lambda t: back_translate(t),    # translate to a pivot language and back
        lambda t: model.rephrase(t),    # ask an LLM to rephrase
    ]
    return random.choice(methods)(text)
```
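
Of the three helpers, only `synonym_replace` is simple enough to sketch inline; the synonym table below is invented for illustration (real pipelines often use WordNet or an embedding model instead):

```python
import random

SYNONYMS = {"quick": ["fast", "rapid"], "improve": ["boost", "enhance"]}  # illustrative

def synonym_replace(text, p=0.2):
    """Swap each word for a random synonym with probability p."""
    out = []
    for word in text.split():
        options = SYNONYMS.get(word.lower())
        if options and random.random() < p:
            out.append(random.choice(options))
        else:
            out.append(word)
    return " ".join(out)
```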
## Data Formatting

### Instruction Format

```python
def format_instruction(example):
    return f"""### Instruction:
{example['instruction']}
### Input:
{example.get('input', '')}
### Response:
{example['output']}"""
```
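
Many instruction records have no `input`; one common convention (used by Alpaca-style datasets) is to drop the Input block entirely rather than leave it empty. A hedged variant, with a hypothetical helper name:

```python
def format_instruction_maybe_input(example):
    if example.get("input", "").strip():
        return format_instruction(example)
    return f"""### Instruction:
{example['instruction']}
### Response:
{example['output']}"""
```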
### Chat Format

```python
def format_chat(conversation):
    # Normalize turns into the OpenAI-style messages layout
    return [
        {"role": turn["role"], "content": turn["content"]}
        for turn in conversation
    ]
```
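
ShareGPT-style exports label turns with `from`/`value` instead of `role`/`content`; a small conversion sketch (the role map and helper name are assumptions):

```python
SHAREGPT_ROLES = {"human": "user", "gpt": "assistant", "system": "system"}

def sharegpt_to_chat(conversation):
    """Map ShareGPT 'from'/'value' turns onto role/content messages."""
    return [
        {"role": SHAREGPT_ROLES.get(turn["from"], "user"), "content": turn["value"]}
        for turn in conversation
    ]
```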
## Best Practices

- Inspect data before processing
- Deduplicate before expensive operations
- Use multiple synthesis methods
- Validate synthetic data quality
- Track data lineage (see the sketch below)
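
One hypothetical way to track lineage is to stamp every record with provenance metadata at each pipeline stage; the field names here are illustrative:

```python
import hashlib
import json

def with_lineage(record, source, step):
    """Attach a provenance stamp so a record can be traced back to its origin."""
    payload = json.dumps(record, sort_keys=True).encode("utf8")
    return {
        **record,
        "_lineage": {
            "source": source,   # e.g. "web-crawl-2024" or "self_instruct"
            "step": step,       # pipeline stage that produced this record
            "content_hash": hashlib.sha256(payload).hexdigest(),
        },
    }
```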