Retrieval-Augmented Generation (RAG) sounds simple in tutorials: embed your docs, store in a vector DB, retrieve top-k, stuff into a prompt. In production, that naive approach will fail you in spectacular ways.
This is what I actually learned building the RAG-based Agentic Assistant that now handles 10,000+ retrieval requests across 1,000+ heterogeneous documents.
The Naive Approach and Why It Fails
```python
# Naive RAG — don't do this in production
from llama_index import SimpleDirectoryReader, VectorStoreIndex

docs = SimpleDirectoryReader("./docs").load_data()
index = VectorStoreIndex.from_documents(docs)
query_engine = index.as_query_engine()
response = query_engine.query("What is the refund policy?")
```
This works for demos. It breaks in production because:
- Fixed chunk size ignores document structure — splitting a PDF at 512 tokens may cut a table in half
- Dense-only retrieval misses exact matches — searching for "JWT" with embeddings may miss documents that use "JSON Web Token"
- Top-k with no reranking returns noisy results — the 3rd most similar chunk may be irrelevant
- No metadata filtering — all documents are treated equally
Lesson 1: Chunking Strategy Matters More Than You Think
Different document types need different chunking:
| Document Type | Strategy | Why |
|---|---|---|
| Prose (articles, reports) | Sentence-window (3 sentences, 1 overlap) | Context bleeds across sentences |
| Structured (slides, tables) | Element-aware (one bullet/row = one chunk) | Structure carries meaning |
| Code | Function-level splitting | Function is the atomic unit |
| JSON/structured data | Schema-aware chunking | Keys and values must stay together |
LlamaIndex’s SimpleNodeParser handles prose well, but for slides and tables I wrote custom parsers.
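To make the sentence-window idea concrete, here is a minimal plain-Python sketch — a simplified stand-in, not LlamaIndex's parser; the function name and return shape are my own:

```python
import re

def sentence_window_chunks(text, window=1):
    """Split prose into sentences; each chunk is one sentence plus a
    window of neighbors on each side, so context bleeds across chunks.
    Illustrative only — a real parser needs a proper sentence splitter."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks = []
    for i, sent in enumerate(sentences):
        lo = max(0, i - window)
        hi = min(len(sentences), i + window + 1)
        chunks.append({"sentence": sent, "window": " ".join(sentences[lo:hi])})
    return chunks
```

At retrieval time you embed and match on `sentence`, but feed the wider `window` string to the LLM, which is what makes the strategy work for prose.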
Lesson 2: Hybrid Search Beats Pure Dense Retrieval
Dense embeddings are great for semantic similarity. They’re bad for:
- Exact product names, version numbers, acronyms
- Rare proper nouns
- Code identifiers
BM25 (sparse retrieval) handles these perfectly. The solution: hybrid search.
```python
from llama_index.retrievers import (
    BM25Retriever,
    QueryFusionRetriever,
    VectorIndexRetriever,
)

# Dense retriever over the vector index
vector_retriever = VectorIndexRetriever(index, similarity_top_k=10)
# Sparse (keyword) retriever over the same nodes
bm25_retriever = BM25Retriever.from_defaults(index, similarity_top_k=10)

# Fuse both result lists with Reciprocal Rank Fusion
hybrid_retriever = QueryFusionRetriever(
    [vector_retriever, bm25_retriever],
    similarity_top_k=5,
    num_queries=1,  # no LLM query expansion
    mode="reciprocal_rerank",
)
```
Reciprocal Rank Fusion (RRF) combines results without needing score normalization. This alone improved precision by ~18% in my benchmarks.
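Under the hood, RRF is only a few lines. A minimal sketch (the constant k=60 is the conventional choice from the original RRF paper; the function name is mine, not a library API):

```python
def reciprocal_rank_fusion(rankings, k=60, top_n=5):
    """Fuse several ranked lists of doc IDs.
    Each doc scores sum(1 / (k + rank)) over the lists it appears in,
    so no per-retriever score normalization is needed."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

Documents that appear near the top of both lists win, which is exactly why it rewards chunks that match semantically and lexically.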
Lesson 3: Reranking Is Not Optional
After retrieving top-10 candidates, a cross-encoder reranker re-scores each chunk against the query. Cross-encoders are slower but far more accurate than bi-encoders for scoring.
```python
from llama_index.postprocessor import SentenceTransformerRerank

reranker = SentenceTransformerRerank(
    model="cross-encoder/ms-marco-MiniLM-L-6-v2",
    top_n=3,  # keep only the 3 best chunks after re-scoring
)
```
Total precision improvement: +30% over baseline dense-only retrieval.
Lesson 4: Multi-Mode Embeddings for Different Query Types
Not all queries are equal. I store three embeddings per chunk:
- Summary embedding — created by asking an LLM to summarize the chunk, then embedding the summary. Best for broad “what is this document about?” queries.
- Q&A embedding — generate synthetic Q&A pairs from the chunk, embed the questions. Best for factual lookups.
- Contextual embedding — embed the chunk with its surrounding context. Best for precise passage retrieval.
At query time, route to the right embedding type based on query classification.
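A first-cut router can be purely rule-based before graduating to a learned classifier. The keyword lists in this sketch are illustrative guesses, not my production rules:

```python
def classify_query(query: str) -> str:
    """Route a query to an embedding type: 'summary', 'qa', or 'contextual'.
    Keyword heuristics only — a production router would be a trained classifier."""
    q = query.lower()
    if any(w in q for w in ("summarize", "overview", "what is this about")):
        return "summary"     # broad questions -> summary embeddings
    if any(w in q for w in ("when", "how many", "which version", "who ")):
        return "qa"          # factual lookups -> Q&A embeddings + BM25
    return "contextual"      # default: precise passage retrieval
```

Being fast and deterministic matters here: the classifier runs on every query, before any retrieval, so an LLM call at this stage would dominate your latency budget.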
Lesson 5: Agentic > Single-Pass for Multi-Hop Queries
Some questions require multiple retrieval steps:
“What are the differences between the authentication approach in VBank and Scheme-Saathi?”
A single-pass RAG retrieves chunks about one project or the other, not both. An agentic approach:
- Decompose → “How does VBank do authentication?” and “How does Scheme-Saathi do authentication?”
- Retrieve separately
- Synthesize the comparison
```python
from llama_index.agent import ReActAgent
from llama_index.tools import QueryEngineTool

query_tool = QueryEngineTool.from_defaults(
    query_engine=hybrid_query_engine,
    description="Use this to retrieve information from the document collection",
)
agent = ReActAgent.from_tools([query_tool], llm=llm, verbose=True)
response = agent.chat("Compare authentication in VBank and Scheme-Saathi")
```
LLM Benchmark Results
I tested four models on 50 multi-hop questions from the document set:
| Model | Accuracy | Avg Latency | Cost/1k queries |
|---|---|---|---|
| Gemini 2.5 Pro | 87% | 3.2s | ~$0.50 |
| Gemini 2.5 Flash | 81% | 1.1s | ~$0.05 |
| LLaMA 3.2 (local) | 74% | 2.8s | $0 |
| DeepSeek | 79% | 1.8s | ~$0.02 |
My recommendation: Gemini 2.5 Flash for production (best accuracy/cost/speed tradeoff). LLaMA 3.2 for sensitive data that can’t leave your machine.
The Architecture That Actually Works
```
User Query
    │
    ▼
Query Classifier (fast, rule-based)
    │
    ├── Factual  → Q&A embeddings + BM25 hybrid + rerank
    ├── Semantic → Contextual embeddings + dense + rerank
    └── Summary  → Summary embeddings + dense
    │
    ▼
Hybrid Retrieval (top-10)
    │
    ▼
Cross-Encoder Reranking (top-3)
    │
    ▼
[Agentic? → Multi-step reasoning loop]
    │
    ▼
LLM Generation with citations
```
What I’d Do Differently
- Start with evaluation — build your eval set (50+ Q&A pairs) before writing retrieval code. You can’t optimize what you can’t measure.
- Cache aggressively — common queries repeat. Redis cache on query embeddings cut our p50 latency from 800ms to 45ms.
- Log every retrieval — the query logs will show you exactly where the system fails. I found 30% of failures came from one badly-structured document type.
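The caching idea in a nutshell — an in-memory dict standing in for Redis here, and the key derivation and names are my own:

```python
import hashlib

_cache: dict = {}

def cache_key(query: str) -> str:
    # Normalize so trivially different phrasings ("Hi " vs "hi") share a key
    return hashlib.sha256(query.strip().lower().encode("utf-8")).hexdigest()

def cached_retrieve(query, retrieve_fn):
    """Return cached results when available; otherwise run retrieval once.
    In production this would be Redis with a TTL instead of a module-level dict."""
    key = cache_key(query)
    if key not in _cache:
        _cache[key] = retrieve_fn(query)  # cache miss: full retrieval path
    return _cache[key]
```

Remember to expire entries (a TTL, or invalidation on re-index) so the cache doesn't serve stale chunks after documents change.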
If you’re building a RAG system, start with hybrid retrieval + reranking. That combination alone will get you to production quality faster than any model swap.
Questions? Reach me at indu9128840871@gmail.com or LinkedIn.