A Pragmatic Guide to RAG Systems: What Actually Matters vs. What Looks Fancy
Everyone building with LLMs eventually hits the same wall: the context window isn't enough, and the model doesn't know your specific data.
The solution is RAG (Retrieval-Augmented Generation). The problem is that every RAG tutorial shows you a toy example that breaks in production.
I've built RAG systems that handle millions of queries. Here's what actually matters vs. what looks impressive in demos but falls apart with real data.
What RAG Actually Is
RAG is simple:
- User asks a question
- You retrieve relevant documents from your data
- You stuff those documents into the LLM prompt
- LLM generates answer using your data
The complexity is in step 2: retrieving the RIGHT documents.
Bad retrieval = bad answers, no matter how good your LLM is.
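In code, the whole loop is only a few lines. A minimal sketch (the retrieve and generate helpers here are placeholders; the rest of this post is about filling them in properly):

def answer(question: str) -> str:
    # Steps 1-2: retrieve relevant chunks for the question (the hard part)
    chunks = retrieve(question, k=3)  # placeholder; see hybrid_retrieval later
    # Step 3: stuff them into the prompt
    context = "\n\n".join(chunks)
    prompt = f"{context}\n\nQuestion: {question}\nAnswer:"
    # Step 4: let the LLM answer using your data
    return generate(prompt)  # placeholder; any chat-completion call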
The Parts That Matter
1. Chunking Strategy (More Important Than You Think)
You can't stuff your entire knowledge base into every prompt. You need to break documents into chunks.
Most tutorials show:
# Bad: naive chunking
def chunk_text(text, chunk_size=500):
    return [text[i:i+chunk_size] for i in range(0, len(text), chunk_size)]
This cuts words and sentences in half at arbitrary character boundaries, splits related context across chunks, and ruins retrieval quality.
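A quick illustration (hypothetical sample text):

text = "Retrieval quality depends heavily on chunking. " * 20
print(chunk_text(text, chunk_size=100)[0][-15:])  # first chunk ends mid-word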
What Actually Works:
def chunk_by_semantic_units(text, max_chunk_size=500):
    """
    Chunk by paragraphs, keeping context intact
    """
    paragraphs = text.split('\n\n')
    chunks = []
    current_chunk = ""
    for para in paragraphs:
        # If adding this paragraph keeps us under limit, add it
        if len(current_chunk) + len(para) < max_chunk_size:
            current_chunk += para + "\n\n"
        else:
            # Save current chunk and start new one
            if current_chunk:
                chunks.append(current_chunk.strip())
            current_chunk = para + "\n\n"
    if current_chunk:
        chunks.append(current_chunk.strip())
    return chunks
Better, but still not perfect. For structured content (docs, articles), chunk by headers:
def chunk_by_headers(text):
    """
    Chunk by section headers, preserving structure
    """
    import re
    # Split on markdown headers (MULTILINE so a header at the very start of
    # the text, or back-to-back headers, are caught too)
    sections = re.split(r'(^#{1,3}\s+.+$)', text, flags=re.MULTILINE)
    chunks = []
    current_section = ""
    current_header = ""
    for section in sections:
        if section.startswith('#'):
            # Save previous section
            if current_section.strip():
                chunks.append({
                    "header": current_header,
                    "content": current_section,
                    "text": f"{current_header}\n{current_section}"
                })
            current_header = section
            current_section = ""
        else:
            current_section += section
    # Save last section
    if current_section.strip():
        chunks.append({
            "header": current_header,
            "content": current_section,
            "text": f"{current_header}\n{current_section}"
        })
    return chunks
This preserves context. When you retrieve a chunk about "Installation," you get the full installation section, not a random 500-character slice.
The Rule:
- For prose (articles, books): chunk by paragraphs
- For docs (API docs, guides): chunk by headers/sections
- For code: chunk by functions/classes (see the sketch after this list)
- For conversations: chunk by turns
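For the code case, a minimal sketch using Python's standard ast module (ast.get_source_segment requires Python 3.8+); other languages need their own parser:

import ast

def chunk_python_code(source: str) -> list[dict]:
    """
    One chunk per top-level function or class definition.
    """
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append({
                "name": node.name,
                "text": ast.get_source_segment(source, node),
            })
    return chunks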
2. Embeddings (Not All Are Equal)
You need to convert chunks to vectors for semantic search. Not all embedding models are equal.
Models I've tested:
| Model | Dimension | Quality | Cost | Best For |
|-------|-----------|---------|------|----------|
| OpenAI text-embedding-ada-002 | 1536 | Good | $0.0001/1K tokens | General purpose |
| OpenAI text-embedding-3-small | 1536 | Better | $0.00002/1K tokens | Cost-effective |
| OpenAI text-embedding-3-large | 3072 | Best | $0.00013/1K tokens | High accuracy |
| Cohere embed-v3 | 1024 | Great | $0.0001/1K tokens | Multilingual |
For most use cases, text-embedding-3-small is the sweet spot: cheap and good quality.
Implementation:
from openai import OpenAI

client = OpenAI()

def embed_text(text: str) -> list[float]:
    """
    Generate embeddings for text
    """
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return response.data[0].embedding

def embed_chunks(chunks: list[str]) -> list[dict]:
    """
    Batch embed chunks (more efficient)
    """
    # OpenAI allows up to 2048 inputs per request
    embeddings = []
    batch_size = 2048
    for i in range(0, len(chunks), batch_size):
        batch = chunks[i:i+batch_size]
        response = client.embeddings.create(
            model="text-embedding-3-small",
            input=batch
        )
        embeddings.extend([item.embedding for item in response.data])
    return [
        {"text": chunk, "embedding": emb}
        for chunk, emb in zip(chunks, embeddings)
    ]
Cost Math:
If you have 100K chunks, 200 tokens each:
- 100K chunks × 200 tokens = 20M tokens
- At $0.02 per 1M tokens = $0.40 total
Embedding your entire knowledge base costs less than a coffee.
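The same arithmetic as a tiny helper (prices per the table above; text-embedding-3-small by default):

def embedding_cost(num_chunks: int, avg_tokens_per_chunk: int,
                   price_per_million_tokens: float = 0.02) -> float:
    """One-time cost, in dollars, to embed the whole corpus."""
    return num_chunks * avg_tokens_per_chunk / 1_000_000 * price_per_million_tokens

print(embedding_cost(100_000, 200))  # 0.4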
3. Storage (Postgres > Fancy Vector DBs)
Everyone thinks they need Pinecone or Weaviate. For most RAG systems, Postgres with pgvector is better.
Setup Postgres with pgvector:
-- Enable pgvector extension
CREATE EXTENSION vector;

-- Create table for chunks
CREATE TABLE document_chunks (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    document_id TEXT,
    chunk_index INTEGER,
    header TEXT,
    content TEXT,
    embedding vector(1536),  -- dimension from embedding model
    created_at TIMESTAMP DEFAULT NOW()
);

-- Create index for vector similarity search
CREATE INDEX ON document_chunks USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);

-- Regular indexes
CREATE INDEX ON document_chunks(document_id);
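On the Python side, the embedding column needs a list-to-vector adapter; a minimal sketch assuming the pgvector Python package (pgvector.psycopg2) alongside psycopg2:

import numpy as np
import psycopg2
from pgvector.psycopg2 import register_vector  # pip install pgvector

conn = psycopg2.connect("dbname=rag")  # hypothetical connection string
register_vector(conn)  # teaches psycopg2 how to send/receive vector values

with conn.cursor() as cur:
    cur.execute(
        "INSERT INTO document_chunks (document_id, chunk_index, content, embedding) "
        "VALUES (%s, %s, %s, %s)",
        ("doc_1", 0, "Install the SDK with pip.", np.array(embed_text("Install the SDK with pip."))),
    )
conn.commit()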
Why Postgres:
- You probably already have it
- Supports vector similarity + regular queries
- No additional service to manage
- Good enough for millions of vectors
- Can join with other tables (user data, metadata, etc.)
When to Use Dedicated Vector DB:
- You have 10M+ vectors
- You need sub-50ms retrieval
- You're doing hybrid search at massive scale
Otherwise, Postgres is fine.
4. Retrieval (The Most Important Part)
This is where most RAG systems fail. Naive similarity search returns bad results.
Bad: Just Get Top K Similar:
def bad_retrieval(query: str, k: int = 3):
    query_embedding = embed_text(query)
    # Just get most similar chunks
    results = db.execute("""
        SELECT content
        FROM document_chunks
        ORDER BY embedding <=> %s::vector
        LIMIT %s
    """, (query_embedding, k))
    return [row['content'] for row in results]
This fails because:
- Might return 3 chunks from same document (redundant)
- Ignores recency (old info ranked same as new)
- No diversity (all results about one aspect)
Good: Hybrid Retrieval:
def hybrid_retrieval(query: str, k: int = 3, diversity: float = 0.5):
    """
    Combine semantic search with keyword search and diversity
    """
    query_embedding = embed_text(query)

    # Semantic search (vector similarity)
    semantic_results = db.execute("""
        SELECT id, content, document_id,
               1 - (embedding <=> %s::vector) as similarity
        FROM document_chunks
        ORDER BY embedding <=> %s::vector
        LIMIT %s
    """, (query_embedding, query_embedding, k * 3))

    # Keyword search (full-text)
    keyword_results = db.execute("""
        SELECT id, content, document_id,
               ts_rank(to_tsvector(content), plainto_tsquery(%s)) as rank
        FROM document_chunks
        WHERE to_tsvector(content) @@ plainto_tsquery(%s)
        ORDER BY rank DESC
        LIMIT %s
    """, (query, query, k * 3))

    # Combine and rerank
    combined = merge_and_rerank(
        semantic_results,
        keyword_results,
        k=k,
        diversity=diversity
    )
    return combined
def merge_and_rerank(semantic, keyword, k, diversity):
    """
    Merge results, penalize duplicates from same document
    """
    seen_docs = set()
    results = []

    # Combine with weighted scores
    all_results = {}
    for item in semantic:
        all_results[item['id']] = {
            'content': item['content'],
            'document_id': item['document_id'],
            'score': item['similarity'] * 0.7  # Weight semantic
        }
    for item in keyword:
        if item['id'] in all_results:
            all_results[item['id']]['score'] += item['rank'] * 0.3
        else:
            all_results[item['id']] = {
                'content': item['content'],
                'document_id': item['document_id'],
                'score': item['rank'] * 0.3
            }

    # Greedy selection with a diversity penalty: before each pick, down-weight
    # candidates from documents we've already drawn from, so the penalty
    # actually changes what gets selected (not just the stored score)
    candidates = list(all_results.values())
    while candidates and len(results) < k:
        best = max(
            candidates,
            key=lambda x: x['score'] * ((1 - diversity) if x['document_id'] in seen_docs else 1.0)
        )
        candidates.remove(best)
        results.append(best)
        seen_docs.add(best['document_id'])

    return [r['content'] for r in results]
This combines:
- Semantic similarity (understands intent)
- Keyword matching (catches exact terms)
- Document diversity (avoids redundancy)
Much better results in practice.
5. Prompt Construction (Context Assembly)
Once you have relevant chunks, you need to assemble them into a prompt.
Bad:
def bad_prompt(query, chunks):
    context = "\n\n".join(chunks)
    return f"{context}\n\nQuestion: {query}\nAnswer:"
No structure, no instructions, no quality control.
Good:
def construct_rag_prompt(query: str, chunks: list[str]) -> str:
    """
    Construct well-structured RAG prompt
    """
    # Number chunks for citations
    context_parts = []
    for i, chunk in enumerate(chunks, 1):
        context_parts.append(f"[{i}] {chunk}")
    context = "\n\n".join(context_parts)

    prompt = f"""You are a helpful assistant. Answer the question based on the context provided.

Context:
{context}

Instructions:
- Answer based ONLY on the provided context
- If the context doesn't contain the answer, say "I don't have enough information to answer that"
- Cite sources using [1], [2], etc.
- Be concise and accurate

Question: {query}

Answer:"""
    return prompt
This gives the model:
- Clear structure
- Instructions on how to use context
- Ability to cite sources
- Permission to admit ignorance
All critical for quality.
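For example, with two hypothetical chunks:

chunks = [
    "Install the SDK with pip install example-sdk.",  # hypothetical docs snippet
    "The SDK requires Python 3.9 or newer.",
]
print(construct_rag_prompt("How do I install the SDK?", chunks))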
The Complete RAG Pipeline
Putting it together:
from openai import OpenAI
import psycopg2

client = OpenAI()

class RAGSystem:
    def __init__(self, db_connection):
        self.db = db_connection

    def ingest_document(self, document_id: str, text: str):
        """
        Add document to RAG system
        """
        # Chunk by headers (use chunk_by_semantic_units for unstructured prose)
        chunks = chunk_by_headers(text)
        # Embed chunks
        embeddings = embed_chunks([c['text'] for c in chunks])
        # Store in database (psycopg2 executes through a cursor; the embedding
        # value must be adapted to the vector type, e.g. via pgvector's register_vector)
        with self.db.cursor() as cur:
            for i, (chunk, emb_data) in enumerate(zip(chunks, embeddings)):
                cur.execute("""
                    INSERT INTO document_chunks
                    (document_id, chunk_index, header, content, embedding)
                    VALUES (%s, %s, %s, %s, %s)
                """, (
                    document_id,
                    i,
                    chunk['header'],
                    chunk['content'],
                    emb_data['embedding']
                ))
        self.db.commit()

    def query(self, question: str, k: int = 3) -> str:
        """
        Query RAG system
        """
        # Retrieve relevant chunks
        chunks = hybrid_retrieval(question, k=k)
        # Construct prompt
        prompt = construct_rag_prompt(question, chunks)
        # Generate answer
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
            temperature=0  # Lower temp for factual accuracy
        )
        return response.choices[0].message.content
Usage:
# Ingest documents
rag = RAGSystem(db_connection)
rag.ingest_document("doc_1", "Your document content here...")
# Query
answer = rag.query("How do I install the SDK?")
print(answer)
What Doesn't Matter (But People Obsess Over)
1. Perfect Embedding Model
The difference between good and perfect embeddings is 5-10% quality. The difference between bad and good retrieval strategy is 50-100%.
Focus on retrieval logic, not embedding model optimization.
2. Complex Reranking Models
Some tutorials show elaborate reranking with BERT or other models. For most systems, hybrid search (semantic + keyword) is enough.
Add reranking only if hybrid search isn't working.
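If you do get there, reranking is a small addition. A sketch using the sentence-transformers CrossEncoder (the model name is just a common public checkpoint, not a recommendation from this post):

from sentence_transformers import CrossEncoder  # pip install sentence-transformers

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, chunks: list[str], k: int = 3) -> list[str]:
    # Score every (query, chunk) pair and keep the k highest-scoring chunks
    scores = reranker.predict([(query, chunk) for chunk in chunks])
    ranked = sorted(zip(chunks, scores), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in ranked[:k]]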
3. Fancy Vector Databases
Pinecone, Weaviate, Qdrant are all great. But Postgres handles millions of vectors fine.
Use a fancy vector DB only when:
- You have 10M+ vectors
- Postgres queries are actually slow (measure first)
- You need features Postgres doesn't have
4. Custom Trained Embeddings
Unless you're Google-scale or have highly specialized domain language, pretrained embeddings work great.
I've never needed to train custom embeddings.
Performance and Cost
Real Numbers from Production:
System handling 100K documents, 500K chunks:
- Embedding cost (one-time): ~$10
- Storage: 2GB in Postgres (~$1/month)
- Query latency: 50-200ms retrieval + 1-3s LLM = 1-3s total
- Cost per query: $0.01-0.03 (mostly LLM, retrieval is cheap)
Optimization Tips:
- Cache frequently asked questions
from functools import lru_cache
@lru_cache(maxsize=1000)
def cached_query(question: str) -> str:
    return rag.query(question)
- Batch embedding during ingestion: Embed 2048 chunks at once instead of one at a time. 100x faster.
- Use smaller context when possible: Retrieving 3 chunks vs. 10 chunks saves tokens and improves quality.
When RAG Isn't the Answer
RAG works when:
- You have documents to retrieve from
- Answers are in those documents
- Context can fit in prompt (even 10 chunks)
RAG doesn't work when:
- You need reasoning beyond retrieval
- Your data is structured (use SQL or code instead)
- Context needed is too large (>100K tokens)
For those cases, consider:
- Fine-tuning
- Tool use (function calling)
- Hybrid approaches
The Pragmatic Approach
Start simple:
- Chunk by paragraphs/headers
- Embed with text-embedding-3-small
- Store in Postgres with pgvector
- Retrieve with hybrid search
- Construct clear prompts
This handles 95% of RAG use cases.
Add complexity only when you measure that simplicity isn't working.
Most RAG systems fail not because they're too simple, but because they're too complex too early.
Ship the simple version. Iterate based on actual quality issues, not anticipated ones.
That's how you build RAG that actually works.