article-extractor

by jrajasekera

Extract clean article content from URLs and save as markdown. Triggers when user provides a webpage URL and wants to download it, extract content, get a clean version without ads, capture an article for offline reading, save an article, grab content from a page, archive a webpage, clip an article, or read something later. Handles blog posts, news articles, tutorials, documentation pages, and similar web content. Supports Wayback Machine for dead links or paywalled content. This skill handles the entire workflow - do NOT use web_fetch or other tools first, just call the extraction script directly with the URL.

0 stars · 0 forks · Updated Jan 7, 2026

When & Why to Use This Skill

The Article Extractor skill is a powerful utility designed to transform cluttered web pages into clean, structured Markdown files. It solves the problem of 'web noise' by automatically stripping away advertisements, navigation bars, and irrelevant scripts, leaving only the core content. With built-in support for the Wayback Machine and multiple extraction engines (Jina, Trafilatura, Readability), it ensures high reliability even when dealing with dead links, paywalls, or complex JavaScript-heavy websites.

Use Cases

  • Personal Knowledge Management: Seamlessly convert online blog posts, tutorials, and news articles into Markdown format for easy import into tools like Obsidian, Notion, or Logseq.
  • AI Content Analysis: Provide clean, high-signal text input for LLMs to perform accurate summarization, sentiment analysis, or data synthesis without the interference of HTML clutter.
  • Digital Archiving and Offline Reading: Capture and save permanent local copies of web content for offline access, ensuring that valuable information remains available even if the original source goes offline.
  • Research Recovery: Utilize the integrated Wayback Machine fallback to retrieve and extract content from broken URLs or pages that have been moved behind a paywall.
  • Automated Documentation Collection: Build local technical libraries by batch-extracting documentation pages and tutorials into a standardized, readable format.
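The documentation-collection use case above can be sketched as a small shell loop. This is illustrative only: the `urls.txt` file, the `batch_extract` helper, and the `EXTRACT` override are assumptions introduced here, not part of the skill; only `scripts/extract-article.sh` and its `-d`/`-q` flags come from this document.

```shell
# Sketch of batch extraction: read URLs (one per line) from a file and
# save each as markdown under ./library. EXTRACT lets you point at the
# extractor script; it defaults to the path used in this repo.
EXTRACT="${EXTRACT:-scripts/extract-article.sh}"

batch_extract() {
  # $1 = file containing one URL per line
  while IFS= read -r url; do
    [ -z "$url" ] && continue                        # skip blank lines
    "$EXTRACT" "$url" -d ./library -q || echo "failed: $url" >&2
  done < "$1"
}
```

Run it as `batch_extract urls.txt`; failures are reported on stderr so one bad URL does not stop the batch.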

Article Extractor

Extract clean article content from URLs, removing ads, navigation, and clutter. Multi-tool fallback ensures reliability.

Workflow

When the user provides a URL to download or extract:

  1. Call the extraction script directly with the URL (do NOT fetch it first with web_fetch)
  2. The script handles fetching, extraction, and saving automatically
  3. It returns a clean markdown file with frontmatter
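For illustration, the saved file opens with YAML frontmatter followed by the cleaned article body. The field names below (title, url, date) are assumptions, not confirmed output; the script defines the actual frontmatter keys.

```markdown
---
title: Example Article               # illustrative field names only
url: https://example.com/article
date: 2026-01-07
---

# Example Article

Clean article body, free of ads and navigation...
```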

Usage

# Basic extraction
scripts/extract-article.sh "https://example.com/article"

# Specify output location
scripts/extract-article.sh "https://example.com/article" -o my-article.md -d ~/Documents

# Try Wayback Machine if original fails
scripts/extract-article.sh "https://example.com/article" --wayback

Make script executable if needed: chmod +x scripts/extract-article.sh

Key Options

  • -o <file> - Output filename
  • -d <dir> - Output directory
  • -w, --wayback - Try Wayback Machine if extraction fails
  • -t <tool> - Force tool: jina, trafilatura, readability, fallback
  • -q - Quiet mode

For complete options, exit codes, tool details, and examples, see references/tools-and-options.md.

Common Failures

  • Exit 3 (access denied): Paywall or login required - try --wayback
  • Exit 4 (no content): Heavy JavaScript - try different --tool
  • Exit 2 (network): Connection issue - check URL
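The retry advice above can be folded into a small wrapper, sketched here under assumptions: the `extract_with_fallback` helper and `EXTRACT` override are introduced for illustration, while the `--wayback` flag and exit codes 2, 3, and 4 are those documented in this README.

```shell
# Sketch: retry via the Wayback Machine when the extractor exits 3
# (access denied) or 4 (no content). EXTRACT defaults to the script
# path used in this repo.
EXTRACT="${EXTRACT:-scripts/extract-article.sh}"

extract_with_fallback() {
  # $1 = URL to extract
  if "$EXTRACT" "$1"; then
    return 0
  else
    status=$?                           # exit code of the first attempt
  fi
  case "$status" in
    3|4) "$EXTRACT" "$1" --wayback ;;   # paywall or JS-heavy: try the archive
    *)   return "$status" ;;            # e.g. exit 2 (network): check the URL
  esac
}
```

The wrapper only escalates to the archive for the two failure modes where it can help; a network error (exit 2) is returned unchanged so the caller can fix the URL instead.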

Local Tools (Optional)

To install local extraction tools for offline use, run: scripts/install-deps.sh