# web-scraping

Extracts content from blog posts and news articles. Use when user asks to scrape a URL, extract article content, get text from a webpage, discover articles from a blog, parse RSS feeds, or needs LLM-ready content with token counts. Supports single articles, batch processing, and site-wide discovery.
## When & Why to Use This Skill

This Claude skill provides robust, automated web scraping and content extraction optimized for LLMs. It converts blog posts and news articles into clean, structured markdown, with token counting, batch processing, and RAG-ready text chunking to streamline data gathering and knowledge management.
## Use Cases
- AI Knowledge Base Construction: Automatically scrape and convert industry blogs into token-optimized markdown for RAG (Retrieval-Augmented Generation) systems.
- Automated News Monitoring: Use site-wide discovery and RSS parsing to track and extract full-text content from multiple news sources simultaneously.
- Market Research & Competitive Analysis: Batch process lists of competitor URLs to extract titles, authors, and publication dates for structured reporting.
- Content Repurposing: Quickly extract clean text and metadata from old blog posts to facilitate rewriting, summarization, or social media distribution.
| name | web-scraping |
|---|---|
| description | Extracts content from blog posts and news articles. Use when user asks to scrape a URL, extract article content, get text from a webpage, discover articles from a blog, parse RSS feeds, or needs LLM-ready content with token counts. Supports single articles, batch processing, and site-wide discovery. |
| allowed-tools | Bash(npx tsx:*), Bash(node:*), Read, Write, Glob |
# Web Scraping Skill

Extract blog and news content from any website using the `@tyroneross/blog-scraper` SDK.
## Quick Reference

### Single Article (Most Common)

```typescript
import { extractArticle } from '@tyroneross/blog-scraper';

const article = await extractArticle('https://example.com/blog/post');
// Returns: { title, markdown, text, html, wordCount, readingTime, excerpt, author }
```
### LLM-Ready Output (For AI/RAG)

```typescript
import { scrapeForLLM } from '@tyroneross/blog-scraper/llm';

const { markdown, tokens, chunks, frontmatter } = await scrapeForLLM(url);
// tokens: estimated count for context window management
// chunks: pre-split for RAG applications
```
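For intuition about what `tokens` and `chunks` represent, here is a minimal, self-contained sketch of character-based token estimation and fixed-size chunking. The ~4-characters-per-token heuristic and the overlap-based splitter are illustrative assumptions, not the library's actual implementation.

```typescript
// Rough token estimate: ~4 characters per token (a common heuristic,
// NOT the library's actual tokenizer).
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Naive fixed-size chunker with overlap, for RAG-style splitting.
// Sizes are in characters here purely to keep the sketch simple.
function chunkText(text: string, chunkSize = 200, overlap = 20): string[] {
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += chunkSize - overlap) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break;
  }
  return chunks;
}

const sample = 'lorem '.repeat(100).trim(); // 599 characters
console.log(estimateTokens(sample)); // 150
console.log(chunkText(sample).length); // 4
```

Overlapping chunks help RAG retrieval because a passage that straddles a chunk boundary still appears whole in at least one chunk.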
### Discover Articles from a Site

```typescript
import { scrapeWebsite } from '@tyroneross/blog-scraper';

const result = await scrapeWebsite('https://techcrunch.com', {
  maxArticles: 10,
  extractFullContent: true,
});
```
### Smart Mode (Auto-Detect)

```typescript
import { smartScrape } from '@tyroneross/blog-scraper';

const result = await smartScrape(url);
if (result.mode === 'article') {
  console.log(result.article.title);
} else {
  console.log(result.articles.length, 'articles found');
}
```
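The `smartScrape` result behaves like a discriminated union on `mode`. A small self-contained sketch of how such a result could be modeled and narrowed in TypeScript; the field shapes are assumptions inferred from the snippets in this document, not the SDK's published types:

```typescript
interface Article {
  title: string;
  markdown: string;
}

// Hypothetical result shape inferred from this skill's examples.
type SmartScrapeResult =
  | { mode: 'article'; article: Article }
  | { mode: 'website'; articles: Article[] };

function describe(result: SmartScrapeResult): string {
  // Checking `mode` narrows the type, giving safe access to
  // `article` in one branch and `articles` in the other.
  if (result.mode === 'article') {
    return `Single article: ${result.article.title}`;
  }
  return `${result.articles.length} articles found`;
}

console.log(describe({ mode: 'article', article: { title: 'Hello', markdown: '# Hello' } }));
```

Modeling the result this way lets the compiler reject code that reads `result.article` without first checking `mode`.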
### Batch Processing

```typescript
import { scrapeUrls } from '@tyroneross/blog-scraper/batch';

const result = await scrapeUrls(urls, { concurrency: 3 });
```
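A concurrency limit like `{ concurrency: 3 }` can be pictured as a small worker pool over the URL list. This is a generic, self-contained sketch of that pattern, not the SDK's internal code:

```typescript
// Run `task` over all items with at most `concurrency` tasks in flight.
async function mapWithConcurrency<T, R>(
  items: T[],
  concurrency: number,
  task: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;
  async function worker(): Promise<void> {
    while (next < items.length) {
      const i = next++; // claim the next index before awaiting
      results[i] = await task(items[i]);
    }
  }
  // Start up to `concurrency` workers and wait for all to drain.
  const workers = Array.from({ length: Math.min(concurrency, items.length) }, worker);
  await Promise.all(workers);
  return results;
}

const doubled = await mapWithConcurrency([1, 2, 3], 2, async (n) => n * 2);
console.log(doubled); // [ 2, 4, 6 ]
```

Hypothetically, `scrapeUrls(urls, { concurrency: 3 })` could be emulated as `mapWithConcurrency(urls, 3, extractArticle)`; a pool like this keeps load on the target site bounded while still overlapping network waits.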
### Validate Before Scraping

```typescript
import { validateUrl } from '@tyroneross/blog-scraper/validation';

const { isReachable, robotsAllowed, suggestedAction } = await validateUrl(url);
```
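`validateUrl` reports whether robots.txt permits scraping. As a rough illustration of the underlying idea, here is a tiny, self-contained robots.txt check. Real robots.txt semantics (wildcards, `Allow` precedence, per-agent groups) are considerably richer; this sketch is an assumption for illustration, not the SDK's validator:

```typescript
// Extremely simplified: collect Disallow prefixes under "User-agent: *"
// and test a path against them. Real parsers handle much more.
function isPathAllowed(robotsTxt: string, path: string): boolean {
  let inStarGroup = false;
  const disallowed: string[] = [];
  for (const line of robotsTxt.split('\n')) {
    const [key, ...rest] = line.trim().split(':');
    const value = rest.join(':').trim();
    if (/^user-agent$/i.test(key.trim())) {
      inStarGroup = value === '*';
    } else if (inStarGroup && /^disallow$/i.test(key.trim()) && value) {
      disallowed.push(value);
    }
  }
  return !disallowed.some((prefix) => path.startsWith(prefix));
}

const robots = 'User-agent: *\nDisallow: /private/';
console.log(isPathAllowed(robots, '/blog/post')); // true
console.log(isPathAllowed(robots, '/private/x')); // false
```

Checking robots.txt before scraping is both polite and, on many sites, a terms-of-service requirement; always prefer the library's validator over a hand-rolled check in real use.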
## Output Properties

| Property | Description |
|---|---|
| `title` | Article title |
| `markdown` | Formatted Markdown content |
| `text` | Plain text (no formatting) |
| `html` | Raw HTML content |
| `excerpt` | Short summary |
| `author` | Author name, if detected |
| `publishedDate` | Publication date |
| `wordCount` | Total word count |
| `readingTime` | Estimated minutes to read |
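To show how these properties might be consumed downstream, here is a small sketch that renders an article object (shaped like the table above) into a markdown file with YAML-style frontmatter. The interface follows this document's property list; the SDK's real types may differ.

```typescript
// Article shape assumed from the Output Properties table above.
interface ScrapedArticle {
  title: string;
  markdown: string;
  author?: string;
  publishedDate?: string;
  wordCount: number;
  readingTime: number;
}

// Render YAML-style frontmatter followed by the article body.
function toMarkdownFile(article: ScrapedArticle): string {
  const meta = [
    `title: "${article.title}"`,
    article.author ? `author: "${article.author}"` : null,
    article.publishedDate ? `date: ${article.publishedDate}` : null,
    `words: ${article.wordCount}`,
    `reading_time_min: ${article.readingTime}`,
  ].filter((l): l is string => l !== null);
  return `---\n${meta.join('\n')}\n---\n\n${article.markdown}\n`;
}

console.log(toMarkdownFile({
  title: 'Example Post',
  markdown: '# Example Post\n\nBody text.',
  wordCount: 3,
  readingTime: 1,
}));
```

Optional fields (`author`, `publishedDate`) are simply omitted from the frontmatter when absent, so the output stays valid either way.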
## Running Code

Create a script file and run it with:

```shell
npx tsx script.ts
```
## When to Use Each Function

| User Request | Function |
|---|---|
| "Extract this article" | `extractArticle(url)` |
| "Get content for LLM" | `scrapeForLLM(url)` |
| "Find articles on this site" | `scrapeWebsite(url)` |
| "Not sure if article or blog" | `smartScrape(url)` |
| "Process these 5 URLs" | `scrapeUrls(urls)` |
| "Can I scrape this?" | `validateUrl(url)` |