rag-skill
Build and integrate production-ready RAG (Retrieval-Augmented Generation) chatbots into documentation sites using OpenAI, Qdrant Cloud, and Neon Postgres. Handles complete stack from backend API to frontend UI integration.
When & Why to Use This Skill
This Claude skill provides a comprehensive, full-stack framework for building and deploying production-ready Retrieval-Augmented Generation (RAG) chatbots specifically for documentation sites. It streamlines the integration of FastAPI, OpenAI, Qdrant vector search, and Neon Postgres to transform static content into an interactive, context-aware Q&A interface featuring source citations and session persistence.
Use Cases
- Case 1: Adding an AI-powered Q&A assistant to technical documentation sites (like Docusaurus) to help users find answers instantly without manual searching.
- Case 2: Implementing semantic search across large-scale internal knowledge bases to improve information discoverability through natural language queries.
- Case 3: Creating interactive learning tools where users can select specific text on a page to trigger context-aware AI explanations and deep dives.
- Case 4: Building customer support interfaces that provide verifiable answers by citing specific documentation sources, increasing user trust and reducing support tickets.
- Case 5: Developing a persistent conversation system for documentation that maintains user context and history across multiple sessions.
| Field | Value |
|---|---|
| name | rag_skill |
| description | Build and integrate production-ready RAG (Retrieval-Augmented Generation) chatbots into documentation sites using OpenAI, Qdrant Cloud, and Neon Postgres. Handles complete stack from backend API to frontend UI integration. |
| version | 1.0.0 |
This skill provides comprehensive guidance for building intelligent chatbots that answer questions based on documentation content using RAG architecture. It includes backend API development (FastAPI + OpenAI + Qdrant), conversation management (PostgreSQL), and frontend UI components (React/TypeScript for Docusaurus).
What This Skill Does
- Backend Setup - Configure FastAPI with OpenAI, Qdrant, and Neon Postgres
- RAG Service Implementation - Build embedding generation, vector search, and LLM response generation
- Document Indexing - Extract, chunk, and embed documentation content
- Frontend Integration - Create React chat components with text selection and source citations
- Session Management - Implement conversation memory and persistence
- Testing & Deployment - Validate functionality and deploy to production
- Performance Optimization - Monitor costs, response times, and accuracy
When to Use This Skill
- Add an AI-powered Q&A chatbot to documentation sites
- Implement semantic search over documentation content
- Build conversational interfaces with source citations
- Create context-aware chatbots with text selection features
- Integrate OpenAI, Qdrant, and PostgreSQL for RAG systems
- Deploy production-ready RAG applications with proper testing
How This Skill Works
Phase 1: Backend Setup
Create project structure:
mkdir backend
cd backend

Install dependencies (requirements.txt):
fastapi==0.115.0
uvicorn[standard]==0.32.0
python-dotenv==1.0.1
openai==1.54.0
qdrant-client==1.12.0
psycopg2-binary==2.9.10
sqlalchemy==2.0.35
pydantic==2.9.2
pydantic-settings==2.6.0
python-multipart==0.0.12
markdown==3.7
beautifulsoup4==4.12.3
tiktoken==0.8.0

Configure environment variables (.env):
- OPENAI_API_KEY: OpenAI API key from platform.openai.com
- QDRANT_URL: Qdrant Cloud cluster URL (https://xxx.cloud.qdrant.io:6333)
- QDRANT_API_KEY: Qdrant Cloud API key
- QDRANT_COLLECTION_NAME: Collection name (e.g., "ai_native_book")
- DATABASE_URL: Neon Postgres connection string
- CORS_ORIGINS: Allowed origins (e.g., "http://localhost:3000")
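A minimal config.py sketch showing how these variables can be loaded with pydantic-settings (field names mirror the list above; defaults are illustrative):

```python
# config.py - load the .env variables above via pydantic-settings (a sketch)
from pydantic_settings import BaseSettings, SettingsConfigDict

class Settings(BaseSettings):
    model_config = SettingsConfigDict(env_file=".env")

    openai_api_key: str
    qdrant_url: str
    qdrant_api_key: str
    qdrant_collection_name: str = "ai_native_book"
    database_url: str
    cors_origins: str = "http://localhost:3000"

settings = Settings()  # matching is case-insensitive, so OPENAI_API_KEY maps to openai_api_key
```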
Implement core services:
RAG Service (rag_service.py):
- Embedding generation using OpenAI text-embedding-3-small
- Vector similarity search in Qdrant
- LLM response generation with GPT-4o-mini
- Context building from retrieved documents
Database Models (models.py):
- ChatSession (session_id, created_at, last_activity)
- ChatMessage (session_id, role, content, timestamp)
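A minimal SQLAlchemy sketch of these two models (table names, key types, and the message id column are illustrative choices, not prescribed by the skill):

```python
# models.py - chat persistence models (a sketch)
import uuid
from datetime import datetime
from sqlalchemy import Column, DateTime, ForeignKey, Integer, String, Text
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class ChatSession(Base):
    __tablename__ = "chat_sessions"
    session_id = Column(String, primary_key=True, default=lambda: str(uuid.uuid4()))
    created_at = Column(DateTime, default=datetime.utcnow)
    last_activity = Column(DateTime, default=datetime.utcnow, onupdate=datetime.utcnow)

class ChatMessage(Base):
    __tablename__ = "chat_messages"
    id = Column(Integer, primary_key=True, autoincrement=True)
    session_id = Column(String, ForeignKey("chat_sessions.session_id"), index=True)
    role = Column(String)   # "user" or "assistant"
    content = Column(Text)
    timestamp = Column(DateTime, default=datetime.utcnow)
```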
API Endpoints (main.py):
- GET /api/health: System status and service connectivity
- POST /api/chat: Send message and get AI response
- GET /api/sessions/{session_id}/history: Retrieve conversation history
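A skeleton of the three endpoints (handler bodies elided; the RAG, session, and history logic described above plugs into the marked steps):

```python
# main.py - endpoint skeleton (a sketch; service wiring elided)
from fastapi import FastAPI

app = FastAPI()

@app.get("/api/health")
def health():
    # Probe OpenAI, Qdrant, and Postgres here and report per-service status
    return {"status": "healthy", "openai": "connected", "qdrant": "connected", "postgres": "connected"}

@app.post("/api/chat")
def chat(payload: dict):
    # 1) embed the query  2) search Qdrant  3) generate the answer  4) persist both messages
    ...

@app.get("/api/sessions/{session_id}/history")
def history(session_id: str):
    # Load this session's messages from Postgres, ordered by timestamp
    ...
```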
Create document indexing pipeline (indexer.py):
- Extract content from markdown/MDX files
- Clean and preprocess text
- Chunk documents (1000 words with 200-word overlap)
- Generate embeddings using OpenAI
- Store vectors in Qdrant with metadata (title, file_path, sidebar_position)
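A minimal word-window chunker for the chunking step above (1000-word chunks with a 200-word overlap):

```python
# indexer.py (excerpt) - simple word-based chunking (a sketch)
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    words = text.split()
    step = chunk_size - overlap  # advance 800 words so adjacent chunks share 200
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # last window already reached the end of the document
    return chunks
```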
Phase 2: Frontend Integration
Create chat component (book/src/components/RAGChatbot/):
index.tsx - Main chat component:
- Message display with user/assistant roles
- Loading states and typing animations
- Source citation display
- Error handling and retry logic
- Session persistence

Add styling (styles.module.css):
- Dark mode support
- Responsive design
- Smooth animations
- Accessibility (ARIA labels, keyboard navigation)
Implement text selection feature:
- Detect text selection on documentation page
- Show yellow context indicator banner
- Include selected text in API requests
- Clear selection after use
Create global integration (book/src/theme/Root.tsx):
- Wrap Docusaurus with chat component
- Configure API endpoint URL
- Confirm the backend's CORS configuration allows the site's origin
Phase 3: Testing & Deployment
Run comprehensive test suite (test_api.py):
- Health Check Test: Verify all services connected
- Basic Q&A Test: Test RAG retrieval and generation
- Context-Aware Test: Test conversation memory
- Text Selection Test: Test selected text integration
- Session Management Test: Test database persistence
Deploy backend:
- Deploy to Railway, Render, or AWS
- Configure production environment variables
- Enable HTTPS and proper CORS
- Set up monitoring and logging
Deploy frontend:
- Update API URL to production endpoint
- Build and deploy Docusaurus site
- Verify CORS and connectivity
Verify deployment:
- Test health endpoint
- Run smoke tests with sample queries
- Monitor performance and errors
- Check cost tracking
Technology Stack
Backend
- Framework: FastAPI (Python 3.9+)
- LLM Provider: OpenAI (GPT-4o-mini for chat, text-embedding-3-small for embeddings)
- Vector Database: Qdrant Cloud (free tier: 1GB storage)
- Relational Database: Neon Serverless Postgres (free tier: 0.5GB storage, 100 hours compute)
- Additional Libraries: SQLAlchemy, Pydantic, python-dotenv, BeautifulSoup4, tiktoken
Frontend
- Framework: React + TypeScript
- Integration: Docusaurus v3+
- Styling: CSS Modules with dark mode support
- Features: Text selection, session management, source citations
Cloud Services Required
OpenAI Account
- API key from platform.openai.com
- Billing enabled
- Cost: ~$5-10/month for moderate usage (1000 queries)
Qdrant Cloud
- Free tier: 1GB storage
- Create cluster at cloud.qdrant.io
- Copy cluster URL and API key
Neon Postgres
- Free tier: 0.5GB storage, 100 hours compute
- Create database at neon.tech
- Copy connection string
Architecture Components
Backend API Structure
backend/
├── src/
│   ├── main.py                # FastAPI application with endpoints
│   ├── config.py              # Configuration and settings
│   ├── models.py              # Pydantic and SQLAlchemy models
│   ├── database.py            # Database connection and session management
│   ├── services/
│   │   ├── rag_service.py     # RAG logic (embeddings + retrieval + generation)
│   │   ├── conversation.py    # Conversation management
│   │   └── vector_store.py    # Qdrant vector operations
│   └── schemas/
│       └── chat.py            # Request/response schemas
├── scripts/
│   ├── index_docs.py          # Document indexing script
│   └── clear_and_reindex.py   # Clear collection and reindex
├── tests/
│   └── test_api.py            # Comprehensive test suite
├── requirements.txt           # Python dependencies
├── requirements-dev.txt       # Development dependencies
├── .env.example               # Environment variables template
├── Dockerfile                 # Container configuration
└── docker-compose.yml         # Multi-service orchestration
Core API Endpoints
Health Check: GET /api/health
- Returns system status and service connectivity
- Example response:
{ "status": "healthy", "openai": "connected", "qdrant": "connected", "postgres": "connected" }
Chat: POST /api/chat
- Input:
{ "message": "What is Physical AI?", "session_id": "uuid-optional", "selected_text": "optional selected context" } - Output:
{ "session_id": "uuid", "message": "AI response with context", "sources": [ { "title": "Introduction to Physical AI", "file_path": "/docs/intro.md", "score": 0.85 } ], "timestamp": "2025-12-02T10:30:00Z" }
Session History: GET /api/sessions/{session_id}/history
- Returns full conversation history for a session
- Example response:
{ "session_id": "uuid", "messages": [ { "role": "user", "content": "What is Physical AI?", "timestamp": "2025-12-02T10:30:00Z" }, { "role": "assistant", "content": "Physical AI refers to...", "timestamp": "2025-12-02T10:30:02Z" } ] }
Frontend Components Structure
book/src/
├── theme/
│   ├── components/
│   │   └── ChatWidget/
│   │       ├── index.tsx           # Main chat component
│   │       ├── ChatWindow.tsx      # Chat window UI
│   │       ├── MessageList.tsx     # Message display
│   │       ├── MessageInput.tsx    # Input field
│   │       ├── SourceCitation.tsx  # Source display
│   │       └── styles.module.css   # Component styles
│   └── Root.tsx                    # Global wrapper for chatbot
└── css/
    └── custom.css                  # Global styles
Key Features Implementation
1. RAG-Powered Responses
Semantic Search Pattern:
import os
from openai import OpenAI
from qdrant_client import QdrantClient

# Initialize clients from environment variables
openai_client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
qdrant_client = QdrantClient(url=os.environ["QDRANT_URL"], api_key=os.environ["QDRANT_API_KEY"])

# Generate query embedding
query_embedding = openai_client.embeddings.create(
    model="text-embedding-3-small",
    input=user_query
).data[0].embedding

# Search in Qdrant
similar_docs = qdrant_client.search(
    collection_name="ai_native_book",
    query_vector=query_embedding,
    limit=5
)

# Build context from top results
context = "\n\n".join([
    f"[{doc.payload['title']}]\n{doc.payload['content']}"
    for doc in similar_docs
])
Response Generation Pattern:
# Generate response with context
response = openai_client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": system_prompt},
        *chat_history[-6:],  # Last 3 exchanges
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {user_query}"}
    ],
    temperature=0.7,
    max_tokens=1000
)
2. Text Selection Context
Frontend Detection:
useEffect(() => {
  const handleSelection = () => {
    const selection = window.getSelection();
    const text = selection?.toString().trim();
    if (text && text.length > 10) {
      setSelectedText(text);
      setShowContextBanner(true);
    }
  };
  document.addEventListener('mouseup', handleSelection);
  return () => document.removeEventListener('mouseup', handleSelection);
}, []);
Backend Integration:
# Include selected text in prompt if provided
if selected_text:
    user_message = f"Selected text: '{selected_text}'\n\n{user_message}"
3. Conversation Memory
Database Storage:
# Store chat history in Postgres
new_message = ChatMessage(
    session_id=session.id,
    role="user",
    content=user_message,
    timestamp=datetime.utcnow()
)
db.add(new_message)
db.commit()

# Retrieve history
chat_history = db.query(ChatMessage).filter(
    ChatMessage.session_id == session.id
).order_by(ChatMessage.timestamp.desc()).limit(10).all()
Context Building:
# Include in LLM context
history_messages = [
    {"role": msg.role, "content": msg.content}
    for msg in reversed(chat_history)
]
4. Source Citations
Return Sources:
sources = [
    {
        "title": doc.payload["title"],
        "file_path": doc.payload["file_path"],
        "score": round(doc.score, 3)
    }
    for doc in similar_docs[:3]
]
Frontend Display:
{sources.map((source, idx) => (
  <div key={idx} className={styles.source}>
    <a href={source.file_path}>{source.title}</a>
    <span className={styles.score}>
      {Math.round(source.score * 100)}% match
    </span>
  </div>
))}
Document Indexing Strategy
Chunking Parameters
- Chunk Size: 1000 words (configurable)
- Overlap: 200 words (maintains context continuity)
- Metadata: Title, file path, chunk index, sidebar position
- Processing: Clean HTML, remove code blocks, normalize whitespace
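A sketch of that processing step using the markdown and beautifulsoup4 packages already in the stack (render Markdown to HTML, drop code blocks, collapse whitespace; MDX-specific JSX may need extra stripping):

```python
# indexer.py (excerpt) - clean a markdown/MDX source before chunking (a sketch)
import re
import markdown
from bs4 import BeautifulSoup

def clean_markdown(raw: str) -> str:
    html = markdown.markdown(raw, extensions=["fenced_code"])
    soup = BeautifulSoup(html, "html.parser")
    for block in soup.find_all(["pre", "code"]):
        block.decompose()                         # remove code blocks
    text = soup.get_text(separator=" ")
    return re.sub(r"\s+", " ", text).strip()      # normalize whitespace
```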
Embedding Model
- Model: text-embedding-3-small (1536 dimensions)
- Cost: ~$0.00002 per 1K tokens
- Performance: ~100ms per embedding
- Batch Size: 100 documents per batch
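The embeddings endpoint accepts a list of inputs, so the 100-document batching above can be a single API call per batch; a sketch:

```python
# indexer.py (excerpt) - embed chunks in batches of 100 (a sketch)
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed_chunks(chunks: list[str], batch_size: int = 100) -> list[list[float]]:
    vectors: list[list[float]] = []
    for i in range(0, len(chunks), batch_size):
        response = client.embeddings.create(
            model="text-embedding-3-small",
            input=chunks[i:i + batch_size],
        )
        vectors.extend(item.embedding for item in response.data)  # results keep input order
    return vectors
```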
Vector Search Configuration
- Distance Metric: Cosine similarity
- Results: Top 5 most relevant chunks
- Threshold: Minimum 0.5 similarity score
- Metadata Filtering: Support filtering by file path, section
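A sketch of a search call applying the 0.5 threshold and an optional metadata filter (the file_path payload key matches the indexing metadata above):

```python
# vector_store.py (excerpt) - thresholded, filtered search (a sketch)
import os
from typing import List, Optional
from qdrant_client import QdrantClient
from qdrant_client.models import FieldCondition, Filter, MatchValue

client = QdrantClient(url=os.environ["QDRANT_URL"], api_key=os.environ["QDRANT_API_KEY"])

def search_docs(query_embedding: List[float], file_path: Optional[str] = None):
    query_filter = None
    if file_path:  # optional metadata filter, e.g. restrict results to one file
        query_filter = Filter(must=[FieldCondition(key="file_path", match=MatchValue(value=file_path))])
    return client.search(
        collection_name="ai_native_book",
        query_vector=query_embedding,
        limit=5,               # top 5 chunks
        score_threshold=0.5,   # drop weak matches
        query_filter=query_filter,
    )
```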
Indexing Script Usage
# Index all documentation
cd backend
source venv/bin/activate
python scripts/index_docs.py
# Clear and reindex
python scripts/clear_and_reindex.py
Performance Metrics
Expected Response Times
- Embedding Generation: ~100ms
- Vector Search: ~50ms
- LLM Generation: ~1-2 seconds
- Database Operations: ~50ms
- Total Response: ~1.5-2.5 seconds (95th percentile < 3s)
Accuracy Metrics
- Relevance Scores: 55-75% for top results
- Context Retrieval: 3-5 relevant chunks per query
- Answer Quality: High when relevant context is found
- Source Attribution: 90%+ accuracy
Performance Targets
- API Response Time: < 3 seconds (95th percentile)
- Vector Search: < 100ms
- Database Queries: < 50ms
- Frontend Render: < 16ms (60fps)
- Uptime: 95%+
Testing Strategy
Test Coverage Areas
Health Checks: Verify all services connected
response = requests.get(f"{BASE_URL}/api/health")
assert response.json()["status"] == "healthy"
assert response.json()["openai"] == "connected"
assert response.json()["qdrant"] == "connected"
assert response.json()["postgres"] == "connected"

Basic Q&A: Test RAG retrieval and generation

response = requests.post(
    f"{BASE_URL}/api/chat",
    json={"message": "What is Physical AI?"}
)
assert "session_id" in response.json()
assert len(response.json()["sources"]) > 0
assert response.json()["message"]

Context Awareness: Test conversation memory

# First message
response1 = requests.post(
    f"{BASE_URL}/api/chat",
    json={"message": "What is ROS 2?"}
)
session_id = response1.json()["session_id"]

# Follow-up message
response2 = requests.post(
    f"{BASE_URL}/api/chat",
    json={
        "message": "What are its main features?",
        "session_id": session_id
    }
)
assert "ROS 2" in response2.json()["message"]

Text Selection: Test selected text integration

response = requests.post(
    f"{BASE_URL}/api/chat",
    json={
        "message": "Explain this",
        "selected_text": "Physical AI combines artificial intelligence with physical robotics"
    }
)
assert response.status_code == 200

Session Management: Test database persistence

response = requests.get(
    f"{BASE_URL}/api/sessions/{session_id}/history"
)
assert len(response.json()["messages"]) >= 2
Sample Test Script
import requests

BASE_URL = "http://localhost:8000"

def test_health():
    response = requests.get(f"{BASE_URL}/api/health")
    assert response.json()["status"] == "healthy"
    print("✅ Health check passed")

def test_basic_qa():
    response = requests.post(
        f"{BASE_URL}/api/chat",
        json={"message": "What is Physical AI?"}
    )
    assert "session_id" in response.json()
    assert len(response.json()["sources"]) > 0
    print("✅ Basic Q&A passed")

if __name__ == "__main__":
    test_health()
    test_basic_qa()
Common Issues & Solutions
Issue: "process is not defined" in browser
Cause: Using Node.js process.env in React browser code
Solution: Use hardcoded values or build-time environment variables
// ❌ Wrong - process.env doesn't exist in browser
<RAGChatbot apiUrl={process.env.REACT_APP_API_URL} />
// ✅ Correct - hardcode or use Docusaurus config
<RAGChatbot apiUrl="http://localhost:8000" />
Issue: OpenAI client initialization error
Cause: Outdated OpenAI SDK version
Solution: Upgrade to latest version
pip install --upgrade openai
Issue: Empty search results
Cause: Documents not indexed in Qdrant
Solution: Run indexing script
cd backend
python scripts/index_docs.py
Issue: CORS errors
Cause: CORS origins not configured properly
Solution: Configure CORS in backend
# config.py
CORS_ORIGINS = "http://localhost:3000,https://your-domain.com"
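Wiring that setting into FastAPI's CORSMiddleware (a sketch; in practice CORS_ORIGINS would be imported from config.py rather than defined inline):

```python
# main.py (excerpt) - apply CORS_ORIGINS to the app (a sketch)
from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware

CORS_ORIGINS = "http://localhost:3000,https://your-domain.com"  # normally from config.py

app = FastAPI()
app.add_middleware(
    CORSMiddleware,
    allow_origins=[origin.strip() for origin in CORS_ORIGINS.split(",")],
    allow_methods=["*"],
    allow_headers=["*"],
)
```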
Issue: Qdrant connection failures
Cause: Incorrect cluster URL or API key
Solution: Verify configuration
- Cluster URL must include the port: https://xxx.cloud.qdrant.io:6333
- API key must be valid
- Test with health endpoint
Issue: Slow responses
Cause: Too many context chunks or slow LLM model
Solution: Optimize retrieval and model
- Reduce chunk retrieval limit (5 → 3)
- Use faster model (GPT-4o-mini)
- Implement caching
- Reduce context size
Issue: Database connection errors
Cause: Invalid Neon connection string
Solution: Verify connection string format
postgresql://user:password@host/database?sslmode=require
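A quick way to check the string is to open a connection with SQLAlchemy (pool_pre_ping is an optional guard against dropped serverless connections, worth validating with Neon):

```python
# database.py (excerpt) - verify the Neon connection string (a sketch)
import os
from sqlalchemy import create_engine, text

engine = create_engine(os.environ["DATABASE_URL"], pool_pre_ping=True)
with engine.connect() as conn:
    conn.execute(text("SELECT 1"))  # raises if the URL, credentials, or SSL mode are wrong
print("Postgres connection OK")
```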
Deployment Checklist
Pre-Deployment
- All environment variables configured
- Documents indexed in Qdrant
- Database tables created in Neon
- Test suite passing (5/5 tests)
- Local build successful
- No broken links or errors
Backend Deployment
- Backend deployed to Railway/Render/AWS
- Production environment variables set
- HTTPS enabled
- CORS configured for production domain
- Health check endpoint returning success
- Monitoring/logging configured
Frontend Deployment
- API URL updated to production endpoint
- Docusaurus site built successfully
- Chat widget visible and functional
- Text selection feature working
- Sources displaying correctly
Post-Deployment
- Smoke tests passing
- Response times acceptable (< 3s)
- No CORS errors
- Conversation memory working
- Cost tracking enabled
- Error monitoring active
Cost Estimation
Monthly costs (moderate usage - 1000 queries):
- OpenAI Embeddings: ~$0.02 for queries (1000 queries × ~1000 tokens × $0.00002 per 1K tokens); the larger embedding cost is the one-time indexing of the corpus
- OpenAI Chat: ~$1-10 depending on context size (1000 responses × roughly 5-10K prompt and completion tokens each, at $0.15 per 1M input tokens and $0.60 per 1M output tokens for GPT-4o-mini)
- Qdrant Cloud: $0 (free tier - 1GB storage)
- Neon Postgres: $0 (free tier - 0.5GB storage, 100 hours compute)
- Hosting: $5-10 (Railway/Render free tier or basic plan)
- Total: $5-20/month
Cost optimization tips:
- Use GPT-4o-mini instead of GPT-4o (10x cheaper)
- Cache embeddings for common queries
- Reduce chunk retrieval limit
- Monitor usage and set limits
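One way to implement the embedding-caching tip above, as an in-process sketch (a shared cache such as Redis would be needed once the API runs multiple workers):

```python
# rag_service.py (excerpt) - cache query embeddings in-process (a sketch)
from functools import lru_cache
from openai import OpenAI

client = OpenAI()

@lru_cache(maxsize=1024)
def embed_query(query: str) -> tuple:
    response = client.embeddings.create(model="text-embedding-3-small", input=query)
    return tuple(response.data[0].embedding)  # tuple keeps the cached value hashable and immutable
```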
Success Criteria
A successful RAG chatbot implementation should achieve:
✅ Functional Requirements:
- User asks question → Gets relevant answer
- User selects text → Context indicator appears
- Follow-up questions → Conversation memory works
- Sources shown → User can verify information
- All tests passing (5/5)
✅ Performance Requirements:
- Response time < 3 seconds (95th percentile)
- Vector search < 100ms
- Database queries < 50ms
- Uptime 95%+
✅ Quality Requirements:
- Relevance score > 70% for top source
- Answer accuracy high when context available
- Source attribution accurate
- Error rate < 5%
- Positive user feedback
✅ Cost Requirements:
- Within budget ($5-20/month for moderate usage)
- Efficient token usage
- Optimized embedding calls
Example Usage Workflow
- User opens documentation → Chat button appears in bottom-right
- User clicks chat button → Chat window opens
- User asks "What is Physical AI?" → Backend searches Qdrant for relevant chunks
- System finds 5 relevant chunks → Combines with query and chat history
- OpenAI generates answer → Returns with 3 source citations
- User sees answer + sources → Can click sources to verify information
- User selects text on page → Yellow context banner appears
- User asks "Explain this" → Context-aware response using selected text
- User asks follow-up → Conversation memory maintained
- User closes chat → Session persisted in database
Quality Gates
Before deployment to production, verify:
- All tests pass (health, Q&A, context, selection, session)
- Local build completes without errors
- No broken links or missing resources
- Performance targets met (< 3s response)
- Accessibility standards verified (ARIA labels, keyboard navigation)
- Security: API keys in environment variables, CORS configured
- Monitoring and error tracking enabled
- Cost tracking and limits configured
Next Steps After Implementation
Monitor and optimize:
- Track usage patterns and costs
- Monitor response times and error rates
- Collect user feedback
- Optimize chunk size and retrieval parameters
Enhance features:
- Add multi-language support
- Implement voice input/output
- Add feedback buttons (thumbs up/down)
- Create analytics dashboard
Scale infrastructure:
- Move to paid tiers as needed
- Implement caching layer
- Add rate limiting
- Set up load balancing
Improve quality:
- Fine-tune prompts
- Optimize chunking strategy
- Add more sophisticated context ranking
- Implement A/B testing
Output Format
When using this skill, you should have:
Input Requirements:
- Documentation content location (e.g., book/docs/)
- Desired chat features (text selection, sources, memory)
- Cloud service credentials (OpenAI, Qdrant, Neon)
- Deployment target (Railway, Render, AWS, etc.)
Skill Returns:
Backend Implementation:
- FastAPI application with health, chat, and history endpoints
- RAG service with embedding generation and vector search
- Database models and session management
- Document indexing script
- Comprehensive test suite
Frontend Implementation:
- React chat component with TypeScript
- Text selection feature
- Source citation display
- Dark mode styling
- Global Docusaurus integration
Configuration Files:
- .env.example with all required variables
- requirements.txt with pinned dependencies
- docker-compose.yml for local development
- API documentation and usage examples
Testing & Deployment:
- Test scripts with all 5 test cases
- Deployment instructions for chosen platform
- Health check verification
- Performance monitoring setup
Documentation:
- Setup guide with step-by-step instructions
- API endpoint documentation
- Troubleshooting guide
- Cost estimation and optimization tips
Version: 1.0.0
Last Updated: 2025-12-02
Tested With: Docusaurus 3.9.2, OpenAI API 1.54.0, Qdrant 1.12.0, FastAPI 0.115.0