Retrieval-Augmented Generation (RAG) sounds simple in tutorials: embed your docs, store in a vector DB, retrieve top-k, stuff into a prompt. In production, that naive approach will fail you in spectacular ways.
This is what I actually learned building the RAG-based Agentic Assistant that now handles 10,000+ retrieval requests across 1,000+ heterogeneous documents.
The Naive Approach and Why It Fails
```python
# Naive RAG — don't do this in production
from llama_index import SimpleDirectoryReader, VectorStoreIndex

docs = SimpleDirectoryReader("./docs").load_data()
index = VectorStoreIndex.from_documents(docs)
query_engine = index.as_query_engine()
response = query_engine.query("What is the refund policy?")
```
This works for demos. It breaks in production because:
- Fixed chunk size ignores document structure — splitting a PDF at 512 tokens may cut a table in half
- Dense-only retrieval misses exact matches — searching for "JWT" with embeddings may miss documents that use "JSON Web Token"
- Top-k with no reranking returns noisy results — the 3rd most similar chunk may be irrelevant
- No metadata filtering — all documents are treated equally
Lesson 1: Chunking Strategy Matters More Than You Think
Different document types need different chunking:
| Document Type | Strategy | Why |
|---|---|---|
| Prose (articles, reports) | Sentence-window (3 sentences, 1 overlap) | Context bleeds across sentences |
| Structured (slides, tables) | Element-aware (one bullet/row = one chunk) | Structure carries meaning |
| Code | Function-level splitting | Function is the atomic unit |
| JSON/structured data | Schema-aware chunking | Keys and values must stay together |
LlamaIndex’s SimpleNodeParser handles prose well, but for slides and tables I wrote custom parsers.
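To make the sentence-window idea concrete, here is a minimal plain-Python sketch — a simplified stand-in, not LlamaIndex's parser; the function name and return shape are my own:

```python
import re

def sentence_window_chunks(text, window=1):
    """Split prose into sentences; each chunk is one sentence plus a
    window of neighbors on each side, so context bleeds across chunks.
    Illustrative only — a real parser needs a proper sentence splitter."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks = []
    for i, sent in enumerate(sentences):
        lo = max(0, i - window)
        hi = min(len(sentences), i + window + 1)
        chunks.append({"sentence": sent, "window": " ".join(sentences[lo:hi])})
    return chunks
```

At retrieval time you embed and match on `sentence`, but feed the wider `window` string to the LLM, which is what makes the strategy work for prose.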
Lesson 2: Hybrid Search Beats Pure Dense Retrieval
Dense embeddings are great for semantic similarity. They’re bad for:
- Exact product names, version numbers, acronyms
- Rare proper nouns
- Code identifiers
BM25 (sparse retrieval) handles these perfectly. The solution: hybrid search.
```python
from llama_index.retrievers import (
    BM25Retriever,
    QueryFusionRetriever,
    VectorIndexRetriever,
)

# Dense retriever over the vector index
vector_retriever = VectorIndexRetriever(index, similarity_top_k=10)
# Sparse (keyword) retriever over the same nodes
bm25_retriever = BM25Retriever.from_defaults(index, similarity_top_k=10)

# Fuse both result lists with Reciprocal Rank Fusion
hybrid_retriever = QueryFusionRetriever(
    [vector_retriever, bm25_retriever],
    similarity_top_k=5,
    num_queries=1,  # no LLM query expansion
    mode="reciprocal_rerank",
)
```
Reciprocal Rank Fusion (RRF) combines results without needing score normalization. This alone improved precision by ~18% in my benchmarks.
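Under the hood, RRF is only a few lines. A minimal sketch (the constant k=60 is the conventional choice from the original RRF paper; the function name is mine, not a library API):

```python
def reciprocal_rank_fusion(rankings, k=60, top_n=5):
    """Fuse several ranked lists of doc IDs.
    Each doc scores sum(1 / (k + rank)) over the lists it appears in,
    so no per-retriever score normalization is needed."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

Documents that appear near the top of both lists win, which is exactly why it rewards chunks that match semantically and lexically.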
Lesson 3: Reranking Is Not Optional
After retrieving top-10 candidates, a cross-encoder reranker re-scores each chunk against the query. Cross-encoders are slower but far more accurate than bi-encoders for scoring.
```python
from llama_index.postprocessor import SentenceTransformerRerank

reranker = SentenceTransformerRerank(
    model="cross-encoder/ms-marco-MiniLM-L-6-v2",
    top_n=3,  # keep only the 3 best chunks after re-scoring
)
```
Total precision improvement: +30% over baseline dense-only retrieval.
Lesson 4: Multi-Mode Embeddings for Different Query Types
Not all queries are equal. I store three embeddings per chunk:
- Summary embedding — created by asking an LLM to summarize the chunk, then embedding the summary. Best for broad “what is this document about?” queries.
- Q&A embedding — generate synthetic Q&A pairs from the chunk, embed the questions. Best for factual lookups.
- Contextual embedding — embed the chunk with its surrounding context. Best for precise passage retrieval.
At query time, route to the right embedding type based on query classification.
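A first-cut router can be purely rule-based before graduating to a learned classifier. The keyword lists in this sketch are illustrative guesses, not my production rules:

```python
def classify_query(query: str) -> str:
    """Route a query to an embedding type: 'summary', 'qa', or 'contextual'.
    Keyword heuristics only — a production router would be a trained classifier."""
    q = query.lower()
    if any(w in q for w in ("summarize", "overview", "what is this about")):
        return "summary"     # broad questions -> summary embeddings
    if any(w in q for w in ("when", "how many", "which version", "who ")):
        return "qa"          # factual lookups -> Q&A embeddings + BM25
    return "contextual"      # default: precise passage retrieval
```

Being fast and deterministic matters here: the classifier runs on every query, before any retrieval, so an LLM call at this stage would dominate your latency budget.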
Lesson 5: Agentic > Single-Pass for Multi-Hop Queries
Some questions require multiple retrieval steps:
“What are the differences between the authentication approach in VBank and Scheme-Saathi?”
A single-pass RAG retrieves chunks about one project or the other, not both. An agentic approach:
- Decompose → “How does VBank do authentication?” and “How does Scheme-Saathi do authentication?”
- Retrieve separately
- Synthesize the comparison
```python
from llama_index.agent import ReActAgent
from llama_index.tools import QueryEngineTool

query_tool = QueryEngineTool.from_defaults(
    query_engine=hybrid_query_engine,
    description="Use this to retrieve information from the document collection",
)
agent = ReActAgent.from_tools([query_tool], llm=llm, verbose=True)
response = agent.chat("Compare authentication in VBank and Scheme-Saathi")
```
LLM Benchmark Results
I tested four models on 50 multi-hop questions from the document set:
| Model | Accuracy | Avg Latency | Cost/1k queries |
|---|---|---|---|
| Gemini 2.5 Pro | 87% | 3.2s | ~$0.50 |
| Gemini 2.5 Flash | 81% | 1.1s | ~$0.05 |
| LLaMA 3.2 (local) | 74% | 2.8s | $0 |
| DeepSeek | 79% | 1.8s | ~$0.02 |
My recommendation: Gemini 2.5 Flash for production (best accuracy/cost/speed tradeoff). LLaMA 3.2 for sensitive data that can’t leave your machine.
The Architecture That Actually Works
```
User Query
    │
    ▼
Query Classifier (fast, rule-based)
    │
    ├── Factual  → Q&A embeddings + BM25 hybrid + rerank
    ├── Semantic → Contextual embeddings + dense + rerank
    └── Summary  → Summary embeddings + dense
    │
    ▼
Hybrid Retrieval (top-10)
    │
    ▼
Cross-Encoder Reranking (top-3)
    │
    ▼
[Agentic? → Multi-step reasoning loop]
    │
    ▼
LLM Generation with citations
```
What I’d Do Differently
- Start with evaluation — build your eval set (50+ Q&A pairs) before writing retrieval code. You can’t optimize what you can’t measure.
- Cache aggressively — common queries repeat. Redis cache on query embeddings cut our p50 latency from 800ms to 45ms.
- Log every retrieval — the query logs will show you exactly where the system fails. I found 30% of failures came from one badly-structured document type.
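The caching idea in a nutshell — an in-memory dict standing in for Redis here, and the key derivation and names are my own:

```python
import hashlib

_cache: dict = {}

def cache_key(query: str) -> str:
    # Normalize so trivially different phrasings ("Hi " vs "hi") share a key
    return hashlib.sha256(query.strip().lower().encode("utf-8")).hexdigest()

def cached_retrieve(query, retrieve_fn):
    """Return cached results when available; otherwise run retrieval once.
    In production this would be Redis with a TTL instead of a module-level dict."""
    key = cache_key(query)
    if key not in _cache:
        _cache[key] = retrieve_fn(query)  # cache miss: full retrieval path
    return _cache[key]
```

Remember to expire entries (a TTL, or invalidation on re-index) so the cache doesn't serve stale chunks after documents change.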
If you’re building a RAG system, start with hybrid retrieval + reranking. That combination alone will get you to production quality faster than any model swap.
Questions? Reach me at indu9128840871@gmail.com or LinkedIn.