A Pragmatic Guide to RAG Systems: What Actually Matters vs. What Looks Fancy
Everyone building with LLMs eventually hits the same wall: the context window isn't enough, and the model doesn't know your specific data.
The solution is RAG (Retrieval-Augmented Generation). The problem is that every RAG tutorial shows you a toy example that breaks in production.
I've built RAG systems that handle millions of queries. Here's what actually matters vs. what looks impressive in demos but falls apart with real data.
What RAG Actually Is
RAG is simple:
- User asks a question
- You retrieve relevant documents from your data
- You stuff those documents into the LLM prompt
- LLM generates answer using your data
The complexity is in step 2: retrieving the RIGHT documents.
Bad retrieval = bad answers, no matter how good your LLM is.
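In code, the whole loop is only a few lines. A minimal sketch (the retrieve and generate helpers here are placeholders; the rest of this post is about filling them in properly):

def answer(question: str) -> str:
    # Steps 1-2: retrieve relevant chunks for the question (the hard part)
    chunks = retrieve(question, k=3)  # placeholder; see hybrid_retrieval later
    # Step 3: stuff them into the prompt
    context = "\n\n".join(chunks)
    prompt = f"{context}\n\nQuestion: {question}\nAnswer:"
    # Step 4: let the LLM answer using your data
    return generate(prompt)  # placeholder; any chat-completion call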
The Parts That Matter
1. Chunking Strategy (More Important Than You Think)
You can't stuff your entire knowledge base into every prompt. You need to break documents into chunks.
Most tutorials show:
# Bad: naive chunking
def chunk_text(text, chunk_size=500):
    return [text[i:i+chunk_size] for i in range(0, len(text), chunk_size)]
This cuts words and sentences in half at arbitrary character boundaries, splits related context across chunks, and ruins retrieval quality.
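A quick illustration (hypothetical sample text):

text = "Retrieval quality depends heavily on chunking. " * 20
print(chunk_text(text, chunk_size=100)[0][-15:])  # first chunk ends mid-word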
What Actually Works:
def chunk_by_semantic_units(text, max_chunk_size=500):
    """
    Chunk by paragraphs, keeping context intact
    """
    paragraphs = text.split('\n\n')
    chunks = []
    current_chunk = ""
    for para in paragraphs:
        # If adding this paragraph keeps us under limit, add it
        if len(current_chunk) + len(para) < max_chunk_size:
            current_chunk += para + "\n\n"
        else:
            # Save current chunk and start new one
            if current_chunk:
                chunks.append(current_chunk.strip())
            current_chunk = para + "\n\n"
    if current_chunk:
        chunks.append(current_chunk.strip())
    return chunks
Better, but still not perfect. For structured content (docs, articles), chunk by headers:
def chunk_by_headers(text):
    """
    Chunk by section headers, preserving structure
    """
    import re
    # Split on markdown headers (MULTILINE so a header at the very start of
    # the text, or back-to-back headers, are caught too)
    sections = re.split(r'(^#{1,3}\s+.+$)', text, flags=re.MULTILINE)
    chunks = []
    current_section = ""
    current_header = ""
    for section in sections:
        if section.startswith('#'):
            # Save previous section
            if current_section.strip():
                chunks.append({
                    "header": current_header,
                    "content": current_section,
                    "text": f"{current_header}\n{current_section}"
                })
            current_header = section
            current_section = ""
        else:
            current_section += section
    # Save last section
    if current_section.strip():
        chunks.append({
            "header": current_header,
            "content": current_section,
            "text": f"{current_header}\n{current_section}"
        })
    return chunks
This preserves context. When you retrieve a chunk about "Installation," you get the full installation section, not a random 500-character slice.
The Rule:
- For prose (articles, books): chunk by paragraphs
- For docs (API docs, guides): chunk by headers/sections
- For code: chunk by functions/classes (see the sketch after this list)
- For conversations: chunk by turns
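For the code case, a minimal sketch using Python's standard ast module (ast.get_source_segment requires Python 3.8+); other languages need their own parser:

import ast

def chunk_python_code(source: str) -> list[dict]:
    """
    One chunk per top-level function or class definition.
    """
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append({
                "name": node.name,
                "text": ast.get_source_segment(source, node),
            })
    return chunks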
2. Embeddings (Not All Are Equal)
You need to convert chunks to vectors for semantic search. Not all embedding models are equal.
Models I've tested:
| Model | Dimension | Quality | Cost | Best For |
|-------|-----------|---------|------|----------|
| OpenAI text-embedding-ada-002 | 1536 | Good | $0.0001/1K tokens | General purpose |
| OpenAI text-embedding-3-small | 1536 | Better | $0.00002/1K tokens | Cost-effective |
| OpenAI text-embedding-3-large | 3072 | Best | $0.00013/1K tokens | High accuracy |
| Cohere embed-v3 | 1024 | Great | $0.0001/1K tokens | Multilingual |
For most use cases, text-embedding-3-small is the sweet spot: cheap and good quality.
Implementation:
from openai import OpenAI

client = OpenAI()

def embed_text(text: str) -> list[float]:
    """
    Generate embeddings for text
    """
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return response.data[0].embedding

def embed_chunks(chunks: list[str]) -> list[dict]:
    """
    Batch embed chunks (more efficient)
    """
    # OpenAI allows up to 2048 inputs per request
    embeddings = []
    batch_size = 2048
    for i in range(0, len(chunks), batch_size):
        batch = chunks[i:i+batch_size]
        response = client.embeddings.create(
            model="text-embedding-3-small",
            input=batch
        )
        embeddings.extend([item.embedding for item in response.data])
    return [
        {"text": chunk, "embedding": emb}
        for chunk, emb in zip(chunks, embeddings)
    ]
Cost Math:
If you have 100K chunks, 200 tokens each:
- 100K chunks × 200 tokens = 20M tokens
- At $0.02 per 1M tokens = $0.40 total
Embedding your entire knowledge base costs less than a coffee.
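The same arithmetic as a tiny helper (prices per the table above; text-embedding-3-small by default):

def embedding_cost(num_chunks: int, avg_tokens_per_chunk: int,
                   price_per_million_tokens: float = 0.02) -> float:
    """One-time cost, in dollars, to embed the whole corpus."""
    return num_chunks * avg_tokens_per_chunk / 1_000_000 * price_per_million_tokens

print(embedding_cost(100_000, 200))  # 0.4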
3. Storage (Postgres > Fancy Vector DBs)
Everyone thinks they need Pinecone or Weaviate. For most RAG systems, Postgres with pgvector is better.
Setup Postgres with pgvector:
-- Enable pgvector extension
CREATE EXTENSION vector;

-- Create table for chunks
CREATE TABLE document_chunks (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    document_id TEXT,
    chunk_index INTEGER,
    header TEXT,
    content TEXT,
    embedding vector(1536),  -- dimension from embedding model
    created_at TIMESTAMP DEFAULT NOW()
);

-- Create index for vector similarity search
CREATE INDEX ON document_chunks USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);

-- Regular indexes
CREATE INDEX ON document_chunks(document_id);
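On the Python side, the embedding column needs a list-to-vector adapter; a minimal sketch assuming the pgvector Python package (pgvector.psycopg2) alongside psycopg2:

import numpy as np
import psycopg2
from pgvector.psycopg2 import register_vector  # pip install pgvector

conn = psycopg2.connect("dbname=rag")  # hypothetical connection string
register_vector(conn)  # teaches psycopg2 how to send/receive vector values

with conn.cursor() as cur:
    cur.execute(
        "INSERT INTO document_chunks (document_id, chunk_index, content, embedding) "
        "VALUES (%s, %s, %s, %s)",
        ("doc_1", 0, "Install the SDK with pip.", np.array(embed_text("Install the SDK with pip."))),
    )
conn.commit()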
Why Postgres:
- You probably already have it
- Supports vector similarity + regular queries
- No additional service to manage
- Good enough for millions of vectors
- Can join with other tables (user data, metadata, etc.)
When to Use Dedicated Vector DB:
- You have 10M+ vectors
- You need sub-50ms retrieval
- You're doing hybrid search at massive scale
Otherwise, Postgres is fine.
4. Retrieval (The Most Important Part)
This is where most RAG systems fail. Naive similarity search returns bad results.
Bad: Just Get Top K Similar:
def bad_retrieval(query: str, k: int = 3):
    query_embedding = embed_text(query)
    # Just get most similar chunks
    results = db.execute("""
        SELECT content
        FROM document_chunks
        ORDER BY embedding <=> %s::vector
        LIMIT %s
    """, (query_embedding, k))
    return [row['content'] for row in results]
This fails because:
- Might return 3 chunks from same document (redundant)
- Ignores recency (old info ranked same as new)
- No diversity (all results about one aspect)
Good: Hybrid Retrieval:
def hybrid_retrieval(query: str, k: int = 3, diversity: float = 0.5):
    """
    Combine semantic search with keyword search and diversity
    """
    query_embedding = embed_text(query)

    # Semantic search (vector similarity)
    semantic_results = db.execute("""
        SELECT id, content, document_id,
               1 - (embedding <=> %s::vector) as similarity
        FROM document_chunks
        ORDER BY embedding <=> %s::vector
        LIMIT %s
    """, (query_embedding, query_embedding, k * 3))

    # Keyword search (full-text)
    keyword_results = db.execute("""
        SELECT id, content, document_id,
               ts_rank(to_tsvector(content), plainto_tsquery(%s)) as rank
        FROM document_chunks
        WHERE to_tsvector(content) @@ plainto_tsquery(%s)
        ORDER BY rank DESC
        LIMIT %s
    """, (query, query, k * 3))

    # Combine and rerank
    combined = merge_and_rerank(
        semantic_results,
        keyword_results,
        k=k,
        diversity=diversity
    )
    return combined
def merge_and_rerank(semantic, keyword, k, diversity):
    """
    Merge results, penalize duplicates from same document
    """
    seen_docs = set()
    results = []

    # Combine with weighted scores
    all_results = {}
    for item in semantic:
        all_results[item['id']] = {
            'content': item['content'],
            'document_id': item['document_id'],
            'score': item['similarity'] * 0.7  # Weight semantic
        }
    for item in keyword:
        if item['id'] in all_results:
            all_results[item['id']]['score'] += item['rank'] * 0.3
        else:
            all_results[item['id']] = {
                'content': item['content'],
                'document_id': item['document_id'],
                'score': item['rank'] * 0.3
            }

    # Greedy selection with a diversity penalty: before each pick, down-weight
    # candidates from documents we've already drawn from, so the penalty
    # actually changes what gets selected (not just the stored score)
    candidates = list(all_results.values())
    while candidates and len(results) < k:
        best = max(
            candidates,
            key=lambda x: x['score'] * ((1 - diversity) if x['document_id'] in seen_docs else 1.0)
        )
        candidates.remove(best)
        results.append(best)
        seen_docs.add(best['document_id'])

    return [r['content'] for r in results]
This combines:
- Semantic similarity (understands intent)
- Keyword matching (catches exact terms)
- Document diversity (avoids redundancy)
Much better results in practice.
5. Prompt Construction (Context Assembly)
Once you have relevant chunks, you need to assemble them into a prompt.
Bad:
def bad_prompt(query, chunks):
    context = "\n\n".join(chunks)
    return f"{context}\n\nQuestion: {query}\nAnswer:"
No structure, no instructions, no quality control.
Good:
def construct_rag_prompt(query: str, chunks: list[str]) -> str:
    """
    Construct well-structured RAG prompt
    """
    # Number chunks for citations
    context_parts = []
    for i, chunk in enumerate(chunks, 1):
        context_parts.append(f"[{i}] {chunk}")
    context = "\n\n".join(context_parts)

    prompt = f"""You are a helpful assistant. Answer the question based on the context provided.

Context:
{context}

Instructions:
- Answer based ONLY on the provided context
- If the context doesn't contain the answer, say "I don't have enough information to answer that"
- Cite sources using [1], [2], etc.
- Be concise and accurate

Question: {query}

Answer:"""
    return prompt
This gives the model:
- Clear structure
- Instructions on how to use context
- Ability to cite sources
- Permission to admit ignorance
All critical for quality.
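For example, with two hypothetical chunks:

chunks = [
    "Install the SDK with pip install example-sdk.",  # hypothetical docs snippet
    "The SDK requires Python 3.9 or newer.",
]
print(construct_rag_prompt("How do I install the SDK?", chunks))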
The Complete RAG Pipeline
Putting it together:
from openai import OpenAI
import psycopg2

client = OpenAI()

class RAGSystem:
    def __init__(self, db_connection):
        self.db = db_connection

    def ingest_document(self, document_id: str, text: str):
        """
        Add document to RAG system
        """
        # Chunk by headers (use chunk_by_semantic_units for unstructured prose)
        chunks = chunk_by_headers(text)
        # Embed chunks
        embeddings = embed_chunks([c['text'] for c in chunks])
        # Store in database (psycopg2 executes through a cursor; the embedding
        # value must be adapted to the vector type, e.g. via pgvector's register_vector)
        with self.db.cursor() as cur:
            for i, (chunk, emb_data) in enumerate(zip(chunks, embeddings)):
                cur.execute("""
                    INSERT INTO document_chunks
                    (document_id, chunk_index, header, content, embedding)
                    VALUES (%s, %s, %s, %s, %s)
                """, (
                    document_id,
                    i,
                    chunk['header'],
                    chunk['content'],
                    emb_data['embedding']
                ))
        self.db.commit()

    def query(self, question: str, k: int = 3) -> str:
        """
        Query RAG system
        """
        # Retrieve relevant chunks
        chunks = hybrid_retrieval(question, k=k)
        # Construct prompt
        prompt = construct_rag_prompt(question, chunks)
        # Generate answer
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
            temperature=0  # Lower temp for factual accuracy
        )
        return response.choices[0].message.content
Usage:
# Ingest documents
rag = RAGSystem(db_connection)
rag.ingest_document("doc_1", "Your document content here...")
# Query
answer = rag.query("How do I install the SDK?")
print(answer)
What Doesn't Matter (But People Obsess Over)
1. Perfect Embedding Model
The difference between good and perfect embeddings is 5-10% quality. The difference between bad and good retrieval strategy is 50-100%.
Focus on retrieval logic, not embedding model optimization.
2. Complex Reranking Models
Some tutorials show elaborate reranking with BERT or other models. For most systems, hybrid search (semantic + keyword) is enough.
Add reranking only if hybrid search isn't working.
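If you do get there, reranking is a small addition. A sketch using the sentence-transformers CrossEncoder (the model name is just a common public checkpoint, not a recommendation from this post):

from sentence_transformers import CrossEncoder  # pip install sentence-transformers

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, chunks: list[str], k: int = 3) -> list[str]:
    # Score every (query, chunk) pair and keep the k highest-scoring chunks
    scores = reranker.predict([(query, chunk) for chunk in chunks])
    ranked = sorted(zip(chunks, scores), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in ranked[:k]]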
3. Fancy Vector Databases
Pinecone, Weaviate, Qdrant are all great. But Postgres handles millions of vectors fine.
Use a fancy vector DB only when:
- You have 10M+ vectors
- Postgres queries are actually slow (measure first)
- You need features Postgres doesn't have
4. Custom Trained Embeddings
Unless you're Google-scale or have highly specialized domain language, pretrained embeddings work great.
I've never needed to train custom embeddings.
Performance and Cost
Real Numbers from Production:
System handling 100K documents, 500K chunks:
- Embedding cost (one-time): ~$10
- Storage: 2GB in Postgres (~$1/month)
- Query latency: 50-200ms retrieval + 1-3s LLM = 1-3s total
- Cost per query: $0.01-0.03 (mostly LLM, retrieval is cheap)
Optimization Tips:
- Cache frequently asked questions
from functools import lru_cache
@lru_cache(maxsize=1000)
def cached_query(question: str) -> str:
    return rag.query(question)
- Batch embedding during ingestion: Embed 2048 chunks at once instead of one at a time. 100x faster.
- Use smaller context when possible: Retrieving 3 chunks vs. 10 chunks saves tokens and improves quality.
When RAG Isn't the Answer
RAG works when:
- You have documents to retrieve from
- Answers are in those documents
- Context can fit in prompt (even 10 chunks)
RAG doesn't work when:
- You need reasoning beyond retrieval
- Your data is structured (use SQL or code instead)
- Context needed is too large (>100K tokens)
For those cases, consider:
- Fine-tuning
- Tool use (function calling)
- Hybrid approaches
The Pragmatic Approach
Start simple:
- Chunk by paragraphs/headers
- Embed with text-embedding-3-small
- Store in Postgres with pgvector
- Retrieve with hybrid search
- Construct clear prompts
This handles 95% of RAG use cases.
Add complexity only when you measure that simplicity isn't working.
Most RAG systems fail not because they're too simple, but because they're too complex too early.
Ship the simple version. Iterate based on actual quality issues, not anticipated ones.
That's how you build RAG that actually works.