ai-training-data-generation

From majiayu000

Generate high-quality training datasets from documents, text corpora, and structured content. Use when creating AI training data from dictionaries, documents, or when generating examples for machine learning models. Optimized for low-resource languages and domain-specific knowledge extraction.

0 stars · 0 forks · Updated Jan 5, 2026

When & Why to Use This Skill

This Claude skill automates the generation of high-quality AI training datasets from documents, text corpora, and structured content. It streamlines the creation of fine-tuning data for LLMs, supporting multiple industry-standard formats like JSONL, HuggingFace, and OpenAI while ensuring data diversity and quality through advanced filtering and validation strategies.

Use Cases

  • Fine-tuning LLMs for Niche Domains: Automatically extract terminology, definitions, and instruction-response pairs from technical manuals to improve model performance in specialized fields.
  • Low-Resource Language Dataset Creation: Generate high-quality training data and dictionary pairs from linguistic documents to support AI development for underrepresented languages.
  • Synthetic Data Generation: Transform unstructured text into structured completion or classification examples to expand training sets for machine learning models.
  • Knowledge Base Structuring: Convert large-scale text corpora into validated JSONL or Ollama-compatible formats for RAG (Retrieval-Augmented Generation) and model alignment.

AI Training Data Generation

Overview

A comprehensive skill for automatically generating high-quality training datasets from documents, text corpora, and structured content. Optimized for low-resource languages, dictionary content, and domain-specific knowledge extraction.

Capabilities

  • Multi-strategy Generation: Dictionary pairs, contextual definitions, completion tasks, classification examples
  • Quality Filtering: Confidence scoring, duplicate removal, and content validation
  • Format Flexibility: Support for multiple AI training formats (JSONL, HuggingFace, Ollama, OpenAI)
  • Language Awareness: Multi-language support with special handling for accented characters
  • Scalable Processing: Generate thousands of examples from large documents
  • Balance Management: Ensure dataset diversity and prevent category imbalance

Core Strategies

1. Dictionary Pair Extraction

Extract word-definition pairs from structured and semi-structured text.

Detection Patterns:

  • Separator-based: word – definition, term: meaning
  • Linguistic indicators: means, is defined as, refers to
  • Structural cues: Indentation, formatting, list structures
  • Context analysis: Surrounding text for validation
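The detection patterns above can be sketched as a small regex-based extractor. This is an illustrative sketch, not the skill's actual implementation: the pattern list, the `extract_pairs` helper, and the per-pattern confidence weights are all hypothetical.

```python
import re

# Hypothetical detection patterns with illustrative confidence weights.
# Pattern 1: separator-based ("word – definition", "term: meaning").
# Pattern 2: linguistic indicators ("means", "is defined as", "refers to").
SEPARATOR_PATTERNS = [
    (re.compile(r"^(?P<term>[\w']+)\s*[–:-]\s*(?P<definition>.+)$"), 0.9),
    (re.compile(
        r"^(?P<term>[\w']+)\s+(?:means|is defined as|refers to)\s+(?P<definition>.+)$",
        re.IGNORECASE), 0.8),
]

def extract_pairs(lines):
    """Yield (term, definition, confidence) tuples from candidate lines.

    Python's \\w matches accented characters by default, so terms like
    'ááfengen' are handled without extra configuration.
    """
    for line in lines:
        line = line.strip()
        for pattern, confidence in SEPARATOR_PATTERNS:
            match = pattern.match(line)
            if match:
                yield match["term"], match["definition"].strip(), confidence
                break  # first matching pattern wins
```

In practice the structural and context checks listed above would then validate each candidate pair before it is accepted.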

2. Implementation Pattern

# Relative import assumes this code runs inside the skill's package
from .ai_training_generator import AITrainingDataGenerator

# Initialize generator
generator = AITrainingDataGenerator(min_confidence=0.7)

# Generate comprehensive training data
training_data = generator.generate_comprehensive_training_data(
    parsed_document,
    target_count=10000
)

# Export in multiple formats
files = generator.export_training_data(
    training_data,
    output_dir="training_output",
    format_type="ollama"
)

Output Format Examples

JSONL Format (Standard)

{"input": "What does 'ááfengen' mean?", "output": "very good, excellent", "type": "dictionary_pair", "confidence": 0.95}

Ollama Format

{"prompt": "Translate this Chuukese word: ngang", "response": "fish", "system": "You are a Chuukese-English translator."}

HuggingFace Format

{"text": "### Instruction:\nWhat does 'chomong' mean in Chuukese?\n\n### Response:\nto help, assist"}

OpenAI Fine-tuning Format

{"messages": [{"role": "user", "content": "Define: kúún"}, {"role": "assistant", "content": "to go, to leave"}]}
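The four formats above differ only in field layout, so a single pair can be rendered into all of them. The `to_formats` helper below is a hypothetical sketch; the key names and prompt wording are copied from the examples in this document, not from the skill's source.

```python
def to_formats(term, definition,
               system="You are a Chuukese-English translator."):
    """Render one dictionary pair in the four export formats shown above.

    Illustrative only: field layouts mirror this document's examples
    (JSONL, Ollama, HuggingFace, OpenAI fine-tuning).
    """
    return {
        "jsonl": {"input": f"What does '{term}' mean?", "output": definition,
                  "type": "dictionary_pair", "confidence": 0.95},
        "ollama": {"prompt": f"Translate this Chuukese word: {term}",
                   "response": definition, "system": system},
        "huggingface": {"text": f"### Instruction:\nWhat does '{term}' mean "
                                f"in Chuukese?\n\n### Response:\n{definition}"},
        "openai": {"messages": [
            {"role": "user", "content": f"Define: {term}"},
            {"role": "assistant", "content": definition}]},
    }
```

Each record can then be serialized with `json.dumps` and written one object per line to produce a valid JSONL file.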

Quality Assurance

  • Content validity: Does the example make linguistic sense?
  • Pattern matching: Does it follow expected language patterns?
  • Context appropriateness: Is the context relevant and helpful?
  • Uniqueness: Is the content free of repetitive or duplicate examples?
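The uniqueness check pairs naturally with the hashlib dependency listed below. A minimal sketch of hash-based duplicate removal, assuming examples carry "input" and "output" fields as in the JSONL format; the normalization rules are illustrative.

```python
import hashlib

def deduplicate(examples):
    """Drop examples whose normalized input/output pair was already seen.

    Sketch only: hashes a lowercased, stripped "input|output" key so that
    near-identical duplicates (case or whitespace variants) collapse to one.
    """
    seen = set()
    unique = []
    for ex in examples:
        key = f"{ex['input'].lower().strip()}|{ex['output'].lower().strip()}"
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(ex)
    return unique
```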

Best Practices

  1. Multiple validation passes: Automated and manual quality checks
  2. Confidence thresholds: Adjust based on use case requirements
  3. Human review sampling: Periodic manual validation of generated examples
  4. Balance management: Ensure even distribution across categories
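Balance management can be sketched as capping any single example type at a share of the dataset. This is a hypothetical policy, not the skill's actual algorithm: `balance_by_type` and the `max_ratio` parameter are illustrative names.

```python
from collections import defaultdict

def balance_by_type(examples, max_ratio=0.4):
    """Cap any single example type at max_ratio of the input dataset size.

    Illustrative sketch: groups examples by their "type" field and truncates
    overrepresented groups so no category dominates the training set.
    """
    by_type = defaultdict(list)
    for ex in examples:
        by_type[ex["type"]].append(ex)
    cap = max(1, int(len(examples) * max_ratio))
    balanced = []
    for group in by_type.values():
        balanced.extend(group[:cap])  # keep at most `cap` per category
    return balanced
```

A production version might instead oversample rare categories or interleave types, depending on the fine-tuning objective.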

Dependencies

  • re: Regular expression pattern matching
  • json: Data serialization and export
  • hashlib: Duplicate detection and content hashing
  • collections: Data structure utilities and counting