article-extraction

from ljchg12-hue

Extract clean article content from web pages, removing ads and clutter for reading and archiving

0 stars, 0 forks. Updated Jan 6, 2026

When & Why to Use This Skill

This Claude skill extracts the main article from a web page by stripping advertisements, navigation menus, and other clutter, and returns clean, readable content. It converts complex pages into structured text or Markdown, which is useful for researchers and knowledge managers who need to archive, analyze, or repurpose web content without distractions.

Use Cases

  • Research Collection: Automatically gathering clean text from multiple news sources and blogs to build a comprehensive knowledge base for academic or market analysis.
  • Content Archiving: Saving distraction-free versions of web articles into personal knowledge management systems like Notion, Obsidian, or Evernote.
  • AI Data Preprocessing: Providing clean, low-noise text to LLMs so that automated summarization, translation, or sentiment analysis is not skewed by ads and navigation text.
  • Reading List Management: Converting cluttered web URLs into a uniform, readable format optimized for e-readers or focused offline reading.
name: article-extraction
description: Extract clean article content from web pages, removing ads and clutter for reading and archiving

Article Extraction Skill

Extract clean article text from web pages, removing ads, navigation, and clutter.

When to Use

  • Content archiving
  • Research collection
  • Reading list management
  • Content analysis

Core Capabilities

  • Main content extraction
  • Metadata extraction (title, author, date)
  • Image extraction
  • Clean HTML/Markdown output
  • Multi-page article handling
  • Paywall bypass (where legal)
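To make the first two capabilities concrete, here is a minimal sketch of main content and metadata extraction using only Python's standard library. It is not how Readability, newspaper3k, or Trafilatura work internally; the tag lists `SKIP_TAGS` and `BLOCK_TAGS`, the `ArticleParser` class, and the `extract_article` function are all hypothetical names chosen for illustration, and real pages usually need the more robust heuristics those libraries provide.

```python
from html.parser import HTMLParser

# Assumption for this sketch: anything inside these tags is clutter.
SKIP_TAGS = {"script", "style", "nav", "aside", "footer", "form"}
# Assumption for this sketch: text inside these tags is article body.
BLOCK_TAGS = {"p", "h1", "h2", "h3", "li", "blockquote"}


class ArticleParser(HTMLParser):
    """Collects the title, author/date meta tags, and body text blocks."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.title = ""
        self.meta = {}
        self.blocks = []
        self._skip_depth = 0    # > 0 while inside a clutter tag
        self._in_title = False
        self._current = None    # text chunks of the block being read

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name") in ("author", "date"):
            self.meta[attrs["name"]] = attrs.get("content")
        elif tag in BLOCK_TAGS and self._skip_depth == 0 and self._current is None:
            self._current = []

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag == "title":
            self._in_title = False
        elif tag in BLOCK_TAGS and self._current is not None:
            # Collapse runs of whitespace before storing the block.
            text = " ".join("".join(self._current).split())
            if text:
                self.blocks.append(text)
            self._current = None

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._skip_depth == 0 and self._current is not None:
            self._current.append(data)


def extract_article(html: str) -> dict:
    """Return title, author, date, and plain text for one HTML page."""
    parser = ArticleParser()
    parser.feed(html)
    parser.close()
    return {
        "title": parser.title.strip(),
        "author": parser.meta.get("author"),
        "date": parser.meta.get("date"),
        "text": "\n\n".join(parser.blocks),
    }


SAMPLE = """<html><head><title>Hello World</title>
<meta name="author" content="Jane Doe"></head>
<body><nav><ul><li>Home</li></ul></nav>
<article><h1>Hello World</h1><p>First   paragraph.</p>
<script>track()</script><p>Second paragraph.</p></article>
<footer><p>Copyright</p></footer></body></html>"""

if __name__ == "__main__":
    article = extract_article(SAMPLE)
    print(article["title"], "by", article["author"])
    print(article["text"])
```

The sketch drops whole subtrees by tag name, which is a blunt rule: it will miss clutter placed in plain `div` elements and may discard content on sites that misuse semantic tags. That gap is the reason to prefer the dedicated tools listed below for real use.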

Tools

# Readability (Node.js); needs a DOM implementation such as jsdom
npm install @mozilla/readability jsdom

# newspaper3k (Python)
pip install newspaper3k
python -c "from newspaper import Article; a = Article('URL'); a.download(); a.parse(); print(a.text)"

# Trafilatura (Python)
pip install trafilatura
trafilatura -u "URL"

Best Practices

  • Respect robots.txt
  • Cache extracted content
  • Preserve attribution
  • Handle different CMS formats
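The first three practices can be sketched with the standard library alone. The example below checks a URL against robots.txt rules with `urllib.robotparser`, and stores extracted text in a small on-disk cache that keeps the source URL, title, and author alongside it. The names `USER_AGENT`, `is_allowed`, and `ArticleCache`, as well as the JSON record layout, are assumptions made for this illustration and are not part of the skill. The robots.txt content is passed in as a string so the sketch runs offline; in practice it could be fetched with `RobotFileParser.set_url()` and `read()`.

```python
import hashlib
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path
from urllib.robotparser import RobotFileParser

USER_AGENT = "article-extractor-sketch"  # hypothetical user agent string


def is_allowed(url: str, robots_txt: str, user_agent: str = USER_AGENT) -> bool:
    """Return True if the given robots.txt text permits fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


class ArticleCache:
    """Stores one JSON record per URL, keyed by the SHA-256 of the URL."""

    def __init__(self, root):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def _path(self, url: str) -> Path:
        digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
        return self.root / f"{digest}.json"

    def get(self, url: str):
        """Return the cached record for url, or None if it is not cached."""
        path = self._path(url)
        if not path.exists():
            return None
        return json.loads(path.read_text(encoding="utf-8"))

    def put(self, url: str, text: str, title=None, author=None) -> dict:
        """Cache extracted text together with its attribution fields."""
        record = {
            "source_url": url,   # attribution: where the text came from
            "title": title,
            "author": author,
            "retrieved_at": datetime.now(timezone.utc).isoformat(),
            "text": text,
        }
        self._path(url).write_text(json.dumps(record), encoding="utf-8")
        return record


if __name__ == "__main__":
    rules = "User-agent: *\nDisallow: /private/\n"
    url = "https://example.com/news/story"
    with tempfile.TemporaryDirectory() as tmp:
        cache = ArticleCache(tmp)
        if is_allowed(url, rules) and cache.get(url) is None:
            cache.put(url, "Extracted body text.", title="Story")
        print(cache.get(url)["source_url"])
```

Keying the cache on a hash of the URL avoids filesystem problems with long or unusual URLs, at the cost of treating URLs that differ only in tracking parameters as separate articles.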

Resources