By Indu Shekhar Jha

What I Learned Building a Production RAG System

Lessons from building a RAG pipeline that handles 10,000+ retrieval requests — covering chunking strategies, reranking, agentic pipelines, and the LLM benchmarks that actually matter.

Retrieval-Augmented Generation (RAG) sounds simple in tutorials: embed your docs, store in a vector DB, retrieve top-k, stuff into a prompt. In production, that naive approach will fail you in spectacular ways.

This is what I actually learned building the RAG-based Agentic Assistant that now handles 10,000+ retrieval requests across 1,000+ heterogeneous documents.

The Naive Approach and Why It Fails

# Naive RAG — don't do this in production
from llama_index import SimpleDirectoryReader, VectorStoreIndex

docs = SimpleDirectoryReader("./docs").load_data()
index = VectorStoreIndex.from_documents(docs)
query_engine = index.as_query_engine()
response = query_engine.query("What is the refund policy?")

This works for demos. It breaks in production because:

  1. Fixed chunk size ignores document structure — splitting a PDF at 512 tokens may cut a table in half
  2. Dense-only retrieval misses exact matches — searching for "JWT" with embeddings may miss documents that use "JSON Web Token"
  3. Top-k with no reranking returns noisy results — the 3rd most similar chunk may be irrelevant
  4. No metadata filtering — all documents are treated equally
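Failure #1 is easy to see in a toy example. The snippet below (illustration only, not the production code) chunks on a fixed character budget and cuts a table row in half, so a retriever can return a fragment with no header context:

```python
# Toy illustration: fixed-size chunking ignores structure and
# splits a table row across two chunks.
doc = (
    "Refund policy:\n"
    "| Plan  | Window  | Fee |\n"
    "| Pro   | 30 days | $0  |\n"
    "| Basic | 14 days | $5  |\n"
)

def fixed_chunks(text, size):
    """Split text into fixed-size chunks, ignoring structure."""
    return [text[i:i + size] for i in range(0, len(text), size)]

for chunk in fixed_chunks(doc, 40):
    print(repr(chunk))
# The "Basic" row ends up split across two chunks, neither of
# which carries the table header.
```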

Lesson 1: Chunking Strategy Matters More Than You Think

Different document types need different chunking:

| Document Type | Strategy | Why |
| --- | --- | --- |
| Prose (articles, reports) | Sentence-window (3 sentences, 1 overlap) | Context bleeds across sentences |
| Structured (slides, tables) | Element-aware (one bullet/row = one chunk) | Structure carries meaning |
| Code | Function-level splitting | Function is the atomic unit |
| JSON/structured data | Schema-aware chunking | Keys and values must stay together |

LlamaIndex’s SimpleNodeParser handles prose well, but for slides and tables I wrote custom parsers.
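To make "element-aware" concrete, here is a minimal sketch of the idea for pipe tables (a hypothetical helper, not the actual custom parser): each data row becomes one chunk, with the header repeated so every chunk stays self-describing.

```python
# Sketch of element-aware chunking for a pipe table: one row per
# chunk, header attached to each so the chunk is self-describing.
def chunk_table(table_text):
    lines = [l for l in table_text.strip().splitlines() if l.strip()]
    header, rows = lines[0], lines[1:]
    return [f"{header}\n{row}" for row in rows]

table = (
    "| Plan  | Window  | Fee |\n"
    "| Pro   | 30 days | $0  |\n"
    "| Basic | 14 days | $5  |\n"
)
for chunk in chunk_table(table):
    print(chunk, end="\n\n")
```

The same principle applies to slides (one bullet per chunk, slide title attached) and JSON (keys kept with their values).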

Lesson 2: Hybrid Search Beats Pure Dense Retrieval

Dense embeddings are great for semantic similarity. They're bad for exact lexical matches: acronyms, product names, error codes, and identifiers — searching for "JWT" can miss documents that only say "JWT" in passing while surfacing vaguely related auth content.

BM25 (sparse retrieval) handles these perfectly. The solution: hybrid search.

from llama_index.retrievers import (
    BM25Retriever,
    QueryFusionRetriever,
    VectorIndexRetriever,
)

vector_retriever = VectorIndexRetriever(index, similarity_top_k=10)
bm25_retriever = BM25Retriever.from_defaults(index, similarity_top_k=10)

hybrid_retriever = QueryFusionRetriever(
    [vector_retriever, bm25_retriever],
    similarity_top_k=5,
    num_queries=1,
    mode="reciprocal_rerank",
)

Reciprocal Rank Fusion (RRF) combines results without needing score normalization. This alone improved precision by ~18% in my benchmarks.
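RRF is simple enough to sketch in a few lines: each ranked list contributes 1/(k + rank) per document, with k (commonly 60) damping the influence of the top ranks, and documents are re-sorted by the summed score.

```python
# Minimal reciprocal rank fusion: each ranked list contributes
# 1 / (k + rank) per document; k damps the head of each list.
def rrf(ranked_lists, k=60):
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["d3", "d1", "d7"]   # ranking from the embedding retriever
sparse = ["d1", "d9", "d3"]  # ranking from BM25
print(rrf([dense, sparse]))  # → ['d1', 'd3', 'd9', 'd7']
```

Note that "d1" wins despite never ranking first in the dense list: appearing high in both lists beats appearing first in one. No score normalization is needed because only ranks are used.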

Lesson 3: Reranking Is Not Optional

After retrieving top-10 candidates, a cross-encoder reranker re-scores each chunk against the query. Cross-encoders are slower but far more accurate than bi-encoders for scoring.

from llama_index.postprocessor import SentenceTransformerRerank

reranker = SentenceTransformerRerank(
    model="cross-encoder/ms-marco-MiniLM-L-6-v2",
    top_n=3,
)

Total precision improvement: +30% over baseline dense-only retrieval.
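The retrieve-then-rerank flow itself is just "score every candidate against the query, keep the best few". The sketch below stubs the cross-encoder with a toy word-overlap scorer so the pipeline shape is visible; in the real system the scorer is a model that reads query and chunk together.

```python
# Retrieve-then-rerank flow with a stubbed scorer. A real
# cross-encoder replaces `score`; the toy version counts shared words.
def rerank(query, candidates, top_n=3):
    def score(chunk):  # stand-in for cross-encoder inference
        q_words = set(query.lower().split())
        return len(q_words & set(chunk.lower().split()))
    return sorted(candidates, key=score, reverse=True)[:top_n]

candidates = [  # imagine these are the hybrid retriever's top results
    "Refunds are issued within 30 days of purchase.",
    "Our office is closed on public holidays.",
    "The refund policy covers all paid plans.",
    "Shipping takes 5-7 business days.",
]
print(rerank("what is the refund policy", candidates, top_n=2))
```

Because the cross-encoder sees query and chunk jointly, it is O(candidates) model calls per query — which is why it runs on the top-10, never the whole corpus.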

Lesson 4: Multi-Mode Embeddings for Different Query Types

Not all queries are equal. I store three embeddings per chunk:

  1. Q&A embeddings — for factual lookup queries
  2. Contextual embeddings — for semantic, meaning-oriented queries
  3. Summary embeddings — for overview and "summarize X" queries

At query time, route to the right embedding type based on query classification.
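The classifier does not need to be a model. A hedged sketch of the rule-based routing idea (the exact rules here are illustrative, not the production ones):

```python
# Cheap, rule-based query classifier: routes each query to an
# embedding type before any retrieval happens.
def classify_query(query):
    q = query.lower()
    words = q.split()
    if any(w in q for w in ("summarize", "summary", "overview")):
        return "summary"
    if words and words[0] in ("what", "who", "when", "where", "how"):
        return "qa"
    return "contextual"

print(classify_query("Summarize the refund policy"))    # → summary
print(classify_query("What is the refund window?"))     # → qa
print(classify_query("Projects similar to VBank"))      # → contextual
```

Rules misclassify some queries, but they cost microseconds — and a wrong route still retrieves, just from a less-ideal index.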

Lesson 5: Agentic > Single-Pass for Multi-Hop Queries

Some questions require multiple retrieval steps:

“What are the differences between the authentication approach in VBank and Scheme-Saathi?”

A single-pass RAG retrieves chunks about one project or the other, not both. An agentic approach:

  1. Decompose → “How does VBank do authentication?” and “How does Scheme-Saathi do authentication?”
  2. Retrieve separately
  3. Synthesize the comparison

from llama_index.agent import ReActAgent
from llama_index.tools import QueryEngineTool

query_tool = QueryEngineTool.from_defaults(
    query_engine=hybrid_query_engine,
    description="Use this to retrieve information from the document collection",
)

agent = ReActAgent.from_tools([query_tool], llm=llm, verbose=True)
response = agent.chat("Compare authentication in VBank and Scheme-Saathi")
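The decompose step (1) is worth seeing in isolation. In the real system the agent's LLM writes the sub-queries; the string template below only illustrates the shape of the transformation:

```python
# Toy decomposition of a comparison query into per-entity
# sub-queries. In production the ReAct agent's LLM does this;
# the template is purely illustrative.
def decompose_comparison(topic, entities):
    return [f"How does {e} handle {topic}?" for e in entities]

subqueries = decompose_comparison(
    "authentication", ["VBank", "Scheme-Saathi"]
)
for sq in subqueries:
    print(sq)
# Each sub-query is retrieved independently; the agent then
# synthesizes the answers into a single comparison.
```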

LLM Benchmark Results

I tested four models on 50 multi-hop questions from the document set:

| Model | Accuracy | Avg Latency | Cost/1k queries |
| --- | --- | --- | --- |
| Gemini 2.5 Pro | 87% | 3.2s | ~$0.50 |
| Gemini 2.5 Flash | 81% | 1.1s | ~$0.05 |
| LLaMA 3.2 (local) | 74% | 2.8s | $0 |
| DeepSeek | 79% | 1.8s | ~$0.02 |

My recommendation: Gemini 2.5 Flash for production — it gives up 6 points of accuracy versus Pro but is ~3× faster and ~10× cheaper. LLaMA 3.2 for sensitive data that can't leave your machine.

The Architecture That Actually Works

User Query
    ↓
Query Classifier (fast, rule-based)
    ├── Factual → Q&A embeddings + BM25 hybrid + rerank
    ├── Semantic → Contextual embeddings + dense + rerank
    └── Summary → Summary embeddings + dense
    ↓
Hybrid Retrieval (top-10)
    ↓
Cross-Encoder Reranking (top-3)
    ↓
[Agentic? → Multi-step reasoning loop]
    ↓
LLM Generation with citations

What I’d Do Differently

  1. Start with evaluation — build your eval set (50+ Q&A pairs) before writing retrieval code. You can’t optimize what you can’t measure.
  2. Cache aggressively — common queries repeat. Redis cache on query embeddings cut our p50 latency from 800ms to 45ms.
  3. Log every retrieval — the query logs will show you exactly where the system fails. I found 30% of failures came from one badly-structured document type.

If you’re building a RAG system, start with hybrid retrieval + reranking. That combination alone will get you to production quality faster than any model swap.

Questions? Reach me at indu9128840871@gmail.com or LinkedIn.
