building-rag-systems
Build Retrieval Augmented Generation (RAG) systems for AI applications. Use when creating document Q&A systems, knowledge bases, semantic search, or any application combining retrieval with LLM generation. Triggers include "RAG", "vector database", "embeddings", "document chunking", "semantic search", or "knowledge base".
When & Why to Use This Skill
This Claude skill provides a comprehensive framework for building production-ready Retrieval Augmented Generation (RAG) systems. It enables developers to bridge the gap between Large Language Models and private data by covering the full pipeline: document ingestion, semantic chunking, vector database integration, and context-aware generation. By implementing these RAG patterns, users can create AI applications that return accurate, cited answers grounded in retrieved sources, which reduces hallucinations.
Use Cases
- Enterprise Document Q&A: Build an internal system that allows employees to query vast libraries of company policies, SOPs, and technical manuals with high precision.
- Customer Support Automation: Develop intelligent support bots that retrieve real-time product information and troubleshooting steps to provide accurate, context-rich responses to user inquiries.
- Academic and Market Research: Create tools that can ingest thousands of research papers or market reports, allowing users to perform semantic searches and synthesize information across multiple sources.
- Technical Documentation Assistants: Implement code-aware RAG systems for developer portals that help users find specific API usage examples and architectural guidance within large codebases.
Building RAG Systems Skill
Build production-ready Retrieval Augmented Generation (RAG) pipelines.
RAG Architecture Overview
┌──────────────────────────────────────────────────────────────┐
│                         RAG Pipeline                          │
├──────────────────────────────────────────────────────────────┤
│  1. INGESTION                                                 │
│     Documents → Chunking → Embeddings → Vector Store          │
├──────────────────────────────────────────────────────────────┤
│  2. RETRIEVAL                                                 │
│     Query → Embed Query → Similarity Search → Top-K Chunks    │
├──────────────────────────────────────────────────────────────┤
│  3. GENERATION                                                │
│     Query + Retrieved Context → LLM → Response                │
└──────────────────────────────────────────────────────────────┘
Quick Start
Installation
pip install openai chromadb langchain tiktoken
# For PDF processing:
pip install pypdf
# For web scraping:
pip install beautifulsoup4 requests
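The optional packages above cover PDF and web ingestion, while the examples below assume you already have plain text. Here is a minimal loader sketch using pypdf and BeautifulSoup; the helper names load_pdf_text and load_web_text are illustrative, not part of either library.

from pypdf import PdfReader
import requests
from bs4 import BeautifulSoup

def load_pdf_text(path: str) -> str:
    """Extract plain text from every page of a PDF."""
    reader = PdfReader(path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)

def load_web_text(url: str) -> str:
    """Fetch a page and strip it down to visible text."""
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    return soup.get_text(separator="\n", strip=True)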
Minimal RAG Implementation
import os
from openai import OpenAI
import chromadb

client = OpenAI()
chroma = chromadb.Client()
collection = chroma.create_collection("docs")

# 1. Ingest documents
def add_document(text: str, doc_id: str):
    # Get embedding
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    embedding = response.data[0].embedding

    # Store in vector DB
    collection.add(
        documents=[text],
        embeddings=[embedding],
        ids=[doc_id]
    )

# 2. Query
def query_rag(question: str, top_k: int = 3) -> str:
    # Embed query
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=question
    )
    query_embedding = response.data[0].embedding

    # Retrieve similar chunks
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=top_k
    )

    # Build context
    context = "\n\n".join(results["documents"][0])

    # Generate response
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": f"Answer based on this context:\n\n{context}"},
            {"role": "user", "content": question}
        ]
    )
    return response.choices[0].message.content
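Example usage of the two functions above (the document text and question are placeholders):

add_document("ChromaDB is an open-source embedding database.", doc_id="doc-1")
add_document("RAG combines retrieval with LLM generation.", doc_id="doc-2")

print(query_rag("What does RAG combine?"))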
Document Chunking Strategies
Fixed Size Chunking
def chunk_fixed_size(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size chunks with overlap."""
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunk = text[start:end]
        chunks.append(chunk)
        start = end - overlap
    return chunks
Semantic Chunking (Recommended)
from langchain.text_splitter import RecursiveCharacterTextSplitter

def chunk_semantic(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Split text semantically at sentence/paragraph boundaries."""
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        separators=["\n\n", "\n", ". ", " ", ""]
    )
    return splitter.split_text(text)
Markdown/Code Aware Chunking
from langchain.text_splitter import MarkdownTextSplitter, Language, RecursiveCharacterTextSplitter

# For Markdown
md_splitter = MarkdownTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = md_splitter.split_text(markdown_text)

# For Python code
code_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=1000,
    chunk_overlap=200
)
chunks = code_splitter.split_text(python_code)
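The splitters above count chunk_size in characters. Since the best-practice guidance later in this skill is stated in tokens, a token-aware variant can help; here is a sketch using the tiktoken-backed splitter, assuming a langchain version that provides from_tiktoken_encoder.

from langchain.text_splitter import RecursiveCharacterTextSplitter

# chunk_size and chunk_overlap are counted in tokens of the chosen encoding
token_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",
    chunk_size=800,
    chunk_overlap=100
)
chunks = token_splitter.split_text(long_text)  # long_text is a placeholder for your document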
Embedding Models
OpenAI Embeddings
from openai import OpenAI

client = OpenAI()

def get_embedding(text: str, model: str = "text-embedding-3-small") -> list[float]:
    """Get embedding for text using OpenAI."""
    response = client.embeddings.create(
        model=model,  # or "text-embedding-3-large" for better quality
        input=text
    )
    return response.data[0].embedding

# Batch embeddings (more efficient)
def get_embeddings_batch(texts: list[str]) -> list[list[float]]:
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=texts
    )
    return [item.embedding for item in response.data]
Local Embeddings (sentence-transformers)
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def get_local_embedding(text: str) -> list[float]:
    return model.encode(text).tolist()

def get_local_embeddings_batch(texts: list[str]) -> list[list[float]]:
    return model.encode(texts).tolist()
Vector Databases
ChromaDB (Local/Simple)
import chromadb

# Persistent storage
client = chromadb.PersistentClient(path="./chroma_db")

# Create collection
collection = client.get_or_create_collection(
    name="documents",
    metadata={"hnsw:space": "cosine"}  # or "l2", "ip"
)

# Add documents
collection.add(
    documents=["doc1 text", "doc2 text"],
    embeddings=[[0.1, 0.2, ...], [0.3, 0.4, ...]],
    metadatas=[{"source": "file1.pdf"}, {"source": "file2.pdf"}],
    ids=["id1", "id2"]
)

# Query
results = collection.query(
    query_embeddings=[query_embedding],
    n_results=5,
    where={"source": "file1.pdf"},  # Optional filter
    include=["documents", "metadatas", "distances"]
)
Pinecone (Cloud/Production)
import os
from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key=os.getenv("PINECONE_API_KEY"))

# Create index
pc.create_index(
    name="documents",
    dimension=1536,  # Match your embedding model
    metric="cosine",
    spec=ServerlessSpec(cloud="aws", region="us-east-1")
)

index = pc.Index("documents")

# Upsert vectors
index.upsert(vectors=[
    {"id": "id1", "values": embedding1, "metadata": {"source": "doc1"}},
    {"id": "id2", "values": embedding2, "metadata": {"source": "doc2"}},
])

# Query
results = index.query(
    vector=query_embedding,
    top_k=5,
    include_metadata=True,
    filter={"source": {"$eq": "doc1"}}
)
Weaviate (Hybrid Search)
import weaviate

client = weaviate.Client("http://localhost:8080")

# Create class
client.schema.create_class({
    "class": "Document",
    "vectorizer": "none",  # We provide our own embeddings
    "properties": [
        {"name": "content", "dataType": ["text"]},
        {"name": "source", "dataType": ["string"]}
    ]
})

# Add documents
client.data_object.create(
    data_object={"content": "text", "source": "file.pdf"},
    class_name="Document",
    vector=embedding
)

# Hybrid search (vector + keyword)
results = client.query.get("Document", ["content", "source"]) \
    .with_hybrid(query="search term", alpha=0.5) \
    .with_limit(5) \
    .do()
Retrieval Strategies
Basic Similarity Search
def retrieve_basic(query: str, top_k: int = 5):
    query_embedding = get_embedding(query)
    return collection.query(
        query_embeddings=[query_embedding],
        n_results=top_k
    )
Hybrid Search (Vector + Keyword)
from rank_bm25 import BM25Okapi
from sklearn.metrics.pairwise import cosine_similarity

class HybridRetriever:
    def __init__(self, documents: list[str], embeddings: list[list[float]]):
        self.documents = documents
        self.embeddings = embeddings
        # BM25 for keyword search
        tokenized = [doc.lower().split() for doc in documents]
        self.bm25 = BM25Okapi(tokenized)

    def search(self, query: str, top_k: int = 5, alpha: float = 0.5):
        # Vector search scores
        query_emb = get_embedding(query)
        vector_scores = cosine_similarity([query_emb], self.embeddings)[0]

        # BM25 scores
        bm25_scores = self.bm25.get_scores(query.lower().split())

        # Normalize and combine (small epsilon avoids division by zero)
        vector_scores = (vector_scores - vector_scores.min()) / (vector_scores.max() - vector_scores.min() + 1e-6)
        bm25_scores = (bm25_scores - bm25_scores.min()) / (bm25_scores.max() - bm25_scores.min() + 1e-6)
        combined = alpha * vector_scores + (1 - alpha) * bm25_scores

        # Get top-k
        top_indices = combined.argsort()[-top_k:][::-1]
        return [self.documents[i] for i in top_indices]
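Example usage of HybridRetriever, assuming embeddings produced with the batch embedding helper defined earlier (the sample texts are placeholders):

docs = ["ChromaDB stores embeddings locally.", "Pinecone is a managed vector database."]
embs = get_embeddings_batch(docs)

retriever = HybridRetriever(docs, embs)
print(retriever.search("managed vector DB", top_k=2, alpha=0.5))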
Reranking
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def retrieve_with_rerank(query: str, initial_k: int = 20, final_k: int = 5):
    # Initial retrieval
    results = collection.query(query_embeddings=[get_embedding(query)], n_results=initial_k)
    candidates = results["documents"][0]

    # Rerank
    pairs = [(query, doc) for doc in candidates]
    scores = reranker.predict(pairs)

    # Get top after reranking
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [doc for doc, score in ranked[:final_k]]
Generation with Context
Basic RAG Prompt
def generate_response(query: str, context: list[str]) -> str:
    context_str = "\n\n---\n\n".join(context)

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": f"""Answer the question based on the provided context.
If the context doesn't contain the answer, say "I don't have enough information."

Context:
{context_str}"""
            },
            {"role": "user", "content": query}
        ],
        temperature=0.3
    )
    return response.choices[0].message.content
With Source Citations
def generate_with_citations(query: str, chunks: list[dict]) -> str:
    # Format context with source markers
    context_parts = []
    for i, chunk in enumerate(chunks):
        context_parts.append(f"[{i+1}] {chunk['text']}\nSource: {chunk['source']}")
    context_str = "\n\n".join(context_parts)

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": f"""Answer based on the context. Cite sources using [1], [2], etc.

Context:
{context_str}"""
            },
            {"role": "user", "content": query}
        ]
    )
    return response.choices[0].message.content
Complete RAG Pipeline
import os
from openai import OpenAI
import chromadb
from langchain.text_splitter import RecursiveCharacterTextSplitter

class RAGPipeline:
    def __init__(self, collection_name: str = "documents"):
        self.client = OpenAI()
        self.chroma = chromadb.PersistentClient(path="./chroma_db")
        self.collection = self.chroma.get_or_create_collection(collection_name)
        self.splitter = RecursiveCharacterTextSplitter(
            chunk_size=1000,
            chunk_overlap=200
        )

    def ingest(self, text: str, source: str):
        """Ingest a document into the RAG system."""
        chunks = self.splitter.split_text(text)
        for i, chunk in enumerate(chunks):
            embedding = self._get_embedding(chunk)
            self.collection.add(
                documents=[chunk],
                embeddings=[embedding],
                metadatas=[{"source": source, "chunk_index": i}],
                ids=[f"{source}_{i}"]
            )

    def query(self, question: str, top_k: int = 5) -> str:
        """Query the RAG system."""
        # Retrieve
        query_embedding = self._get_embedding(question)
        results = self.collection.query(
            query_embeddings=[query_embedding],
            n_results=top_k,
            include=["documents", "metadatas"]
        )

        # Generate
        context = "\n\n".join(results["documents"][0])
        sources = [m["source"] for m in results["metadatas"][0]]

        response = self.client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {
                    "role": "system",
                    "content": f"Answer based on this context:\n\n{context}"
                },
                {"role": "user", "content": question}
            ]
        )
        answer = response.choices[0].message.content
        return f"{answer}\n\nSources: {', '.join(set(sources))}"

    def _get_embedding(self, text: str) -> list[float]:
        response = self.client.embeddings.create(
            model="text-embedding-3-small",
            input=text
        )
        return response.data[0].embedding

# Usage
rag = RAGPipeline()
rag.ingest("Your document text here...", source="document.pdf")
answer = rag.query("What is this document about?")
Best Practices
- Chunk size matters - 500-1000 tokens is usually optimal
- Overlap chunks - 10-20% overlap prevents losing context at boundaries
- Metadata is key - Store source, page number, section for citations
- Hybrid search - Combine vector + keyword for better recall
- Reranking - Improves precision on top results
- Test retrieval first - Bad retrieval = bad RAG, regardless of LLM
- Evaluate - Use metrics like recall@k, MRR, or human evaluation (see the sketch after this list)
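A minimal sketch of retrieval evaluation, assuming each test query is labeled with the ids of its relevant chunks (the data structures here are illustrative):

def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of relevant chunks that appear in the top-k results."""
    hits = sum(1 for doc_id in retrieved_ids[:k] if doc_id in relevant_ids)
    return hits / max(len(relevant_ids), 1)

def mrr(retrieved_ids: list[str], relevant_ids: set[str]) -> float:
    """Reciprocal rank of the first relevant chunk (0.0 if none retrieved)."""
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0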
Common Pitfalls
- Chunks too large - Dilutes relevance, wastes context window
- Chunks too small - Loses context, fragments information
- No overlap - Important info at chunk boundaries gets lost
- Ignoring metadata - Can't filter or cite sources
- Over-relying on LLM - "I don't know" is better than hallucination
References
- references/chunking-strategies.md - Detailed chunking guide
- references/vector-databases.md - Vector DB comparison
- references/evaluation.md - RAG evaluation metrics