web-scraping

from tyroneross

Extracts content from blog posts and news articles. Use when user asks to scrape a URL, extract article content, get text from a webpage, discover articles from a blog, parse RSS feeds, or needs LLM-ready content with token counts. Supports single articles, batch processing, and site-wide discovery.

Updated Dec 31, 2025

When & Why to Use This Skill

This Claude skill extracts blog posts and news articles and converts them into clean, structured Markdown for use with LLMs. It also provides token counts, batch processing, and pre-chunked text for RAG pipelines, which simplifies data gathering and knowledge-base maintenance.

Use Cases

  • AI Knowledge Base Construction: Automatically scrape and convert industry blogs into token-optimized markdown for RAG (Retrieval-Augmented Generation) systems.
  • Automated News Monitoring: Use site-wide discovery and RSS parsing to track and extract full-text content from multiple news sources simultaneously.
  • Market Research & Competitive Analysis: Batch process lists of competitor URLs to extract titles, authors, and publication dates for structured reporting.
  • Content Repurposing: Quickly extract clean text and metadata from old blog posts to facilitate rewriting, summarization, or social media distribution.

name: web-scraping
description: Extracts content from blog posts and news articles. Use when user asks to scrape a URL, extract article content, get text from a webpage, discover articles from a blog, parse RSS feeds, or needs LLM-ready content with token counts. Supports single articles, batch processing, and site-wide discovery.
allowed-tools: Bash(npx tsx:*), Bash(node:*), Read, Write, Glob

Web Scraping Skill

Extract blog and news content from any website using the @tyroneross/blog-scraper SDK.

Quick Reference

Single Article (Most Common)

import { extractArticle } from '@tyroneross/blog-scraper';

const article = await extractArticle('https://example.com/blog/post');
// Returns: { title, markdown, text, html, wordCount, readingTime, excerpt, author }

LLM-Ready Output (For AI/RAG)

import { scrapeForLLM } from '@tyroneross/blog-scraper/llm';

const url = 'https://example.com/blog/post'; // placeholder; replace with the page to scrape
const { markdown, tokens, chunks, frontmatter } = await scrapeForLLM(url);
// tokens: estimated count for context window management
// chunks: pre-split for RAG applications
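
The comments above mention an estimated token count and pre-split chunks. How the SDK computes them is not described here, so the following is only a minimal sketch of one common approach, assuming roughly four characters per token and greedy paragraph packing. The names `estimateTokens` and `chunkMarkdown` are hypothetical helpers, not part of `@tyroneross/blog-scraper`.

```typescript
// Hypothetical helpers for illustration; not the SDK's implementation.

// Rough heuristic: about 4 characters per token for English text (an assumption).
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Greedily pack paragraphs into chunks whose estimated size stays within maxTokens.
// A single paragraph larger than maxTokens is kept whole as its own chunk.
function chunkMarkdown(markdown: string, maxTokens: number): string[] {
  const paragraphs = markdown
    .split(/\n{2,}/)
    .map((p) => p.trim())
    .filter((p) => p.length > 0);

  const chunks: string[] = [];
  let current = '';

  for (const paragraph of paragraphs) {
    const candidate = current ? `${current}\n\n${paragraph}` : paragraph;
    if (current && estimateTokens(candidate) > maxTokens) {
      // Adding this paragraph would exceed the budget: close the current chunk.
      chunks.push(current);
      current = paragraph;
    } else {
      current = candidate;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}
```

A character-based estimate is cheap and needs no tokenizer dependency, but it can be off noticeably for code or non-English text, so it is best treated as a budget guide rather than an exact count.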

Discover Articles from Site

import { scrapeWebsite } from '@tyroneross/blog-scraper';

const result = await scrapeWebsite('https://techcrunch.com', {
  maxArticles: 10,
  extractFullContent: true
});

Smart Mode (Auto-Detect)

import { smartScrape } from '@tyroneross/blog-scraper';

const url = 'https://example.com/blog'; // placeholder; may be an article or a listing page
const result = await smartScrape(url);
if (result.mode === 'article') {
  console.log(result.article.title);
} else {
  console.log(result.articles.length, 'articles found');
}

Batch Processing

import { scrapeUrls } from '@tyroneross/blog-scraper/batch';

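// Placeholder URLs; replace with the pages to scrape
const urls = ['https://example.com/blog/post-1', 'https://example.com/blog/post-2'];
const result = await scrapeUrls(urls, { concurrency: 3 });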
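
The `concurrency` option suggests that at most that many URLs are processed at once. The SDK's internals are not shown here, so the sketch below only illustrates the general technique of a bounded worker pool. `mapWithConcurrency` is a hypothetical helper, not an SDK export.

```typescript
// Hypothetical helper for illustration; not part of @tyroneross/blog-scraper.
// Runs `worker` over `items` with at most `limit` calls in flight,
// returning results in the same order as the input.
async function mapWithConcurrency<T, R>(
  items: readonly T[],
  limit: number,
  worker: (item: T, index: number) => Promise<R>
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0; // index of the next unclaimed item

  // Each runner repeatedly claims the next index until none are left.
  async function runner(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index], index);
    }
  }

  const runnerCount = Math.max(1, Math.min(limit, items.length));
  // If any worker call rejects, the returned promise rejects with that error.
  await Promise.all(Array.from({ length: runnerCount }, () => runner()));
  return results;
}
```

A small limit such as 3 keeps the load on the target site modest. Wrapping the worker body in try/catch would let one failed URL be recorded without rejecting the whole batch.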

Validate Before Scraping

import { validateUrl } from '@tyroneross/blog-scraper/validation';

const url = 'https://example.com/blog/post'; // placeholder; replace with the page to check
const { isReachable, robotsAllowed, suggestedAction } = await validateUrl(url);
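
To give a sense of what a `robotsAllowed` check might involve, here is a deliberately simplified sketch that tests a path against the `Disallow` rules in the `User-agent: *` group of a robots.txt file. It is an assumption about the general idea, not the SDK's logic: `isPathAllowed` is a hypothetical function, and it ignores `Allow` rules, wildcards, and agent-specific groups.

```typescript
// Hypothetical helper for illustration; not part of @tyroneross/blog-scraper.
// Simplified: only "Disallow" prefixes in groups that apply to "User-agent: *".
function isPathAllowed(robotsTxt: string, path: string): boolean {
  const disallowed: string[] = [];
  let groupMatches = false; // does the current group apply to "*"?
  let lastWasAgent = false; // consecutive User-agent lines share one group

  for (const rawLine of robotsTxt.split(/\r?\n/)) {
    const line = rawLine.split('#')[0].trim(); // drop comments
    const sep = line.indexOf(':');
    if (sep === -1) continue;

    const field = line.slice(0, sep).trim().toLowerCase();
    const value = line.slice(sep + 1).trim();

    if (field === 'user-agent') {
      groupMatches = lastWasAgent ? groupMatches || value === '*' : value === '*';
      lastWasAgent = true;
    } else {
      lastWasAgent = false;
      // An empty Disallow value means "nothing is disallowed".
      if (field === 'disallow' && groupMatches && value !== '') {
        disallowed.push(value);
      }
    }
  }

  return !disallowed.some((prefix) => path.startsWith(prefix));
}
```

A production check would follow the full Robots Exclusion Protocol, including `Allow` precedence and pattern matching, so this sketch is best read as an outline of the idea.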

Output Properties

| Property | Description |
|---|---|
| title | Article title |
| markdown | Formatted Markdown content |
| text | Plain text (no formatting) |
| html | Raw HTML content |
| excerpt | Short summary |
| author | Author name if detected |
| publishedDate | Publication date |
| wordCount | Total words |
| readingTime | Estimated minutes to read |
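
The properties above could be modeled roughly as the following TypeScript shape. The field types and optionality are assumptions inferred from the descriptions, not the SDK's published typings. `ArticleSketch`, `countWords`, and `estimateReadingTime` are hypothetical names, and the 200 words-per-minute rate is an assumed reading speed.

```typescript
// Hypothetical shape for illustration; the SDK's real types may differ.
interface ArticleSketch {
  title: string;
  markdown: string;
  text: string;
  html: string;
  excerpt: string;
  author?: string;        // "if detected", so assumed optional
  publishedDate?: string; // assumed optional and string-typed
  wordCount: number;
  readingTime: number;    // minutes
}

// Count whitespace-separated words in plain text.
function countWords(text: string): number {
  return text.trim().split(/\s+/).filter((w) => w.length > 0).length;
}

// Estimate reading time in whole minutes, with a minimum of 1.
function estimateReadingTime(wordCount: number, wordsPerMinute = 200): number {
  return Math.max(1, Math.ceil(wordCount / wordsPerMinute));
}
```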

Running Code

Save the code in a script file (for example, script.ts) and run it with:

npx tsx script.ts

When to Use Each Function

| User Request | Function |
|---|---|
| "Extract this article" | extractArticle(url) |
| "Get content for LLM" | scrapeForLLM(url) |
| "Find articles on this site" | scrapeWebsite(url) |
| "Not sure if article or blog" | smartScrape(url) |
| "Process these 5 URLs" | scrapeUrls(urls) |
| "Can I scrape this?" | validateUrl(url) |