Overview
A research and production project exploring the frontier of retrieval-augmented generation. The system ingests heterogeneous document collections (PDF, PPT, TXT, JSON, DOCX) and answers questions with citations, using both agentic (multi-step reasoning) and non-agentic (single-pass) pipelines.
Why RAG?
Fine-tuning LLMs for every new document set is expensive and slow. RAG gives LLMs access to up-to-date, domain-specific knowledge at inference time — no retraining required. The challenge is doing retrieval well: ranking irrelevant chunks first destroys answer quality.
Pipeline Design
Document Ingestion
- Format-aware parsers handle PDF, PPT, TXT, JSON, and DOCX
- Chunking strategy varies by document type (sentences for prose, bullets for slides)
- Three embedding modes per chunk: summarization (for overview queries), Q&A (for factual lookup), contextual search (for semantic similarity)
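The format-keyed chunking strategy above can be sketched as a simple dispatch. This is a minimal illustration, not the project's actual parser API: the function name, the sentence window size, and the naive regex sentence splitter are all assumptions.

```python
import re

def chunk_document(text: str, doc_type: str, max_sentences: int = 3) -> list[str]:
    """Split a document into chunks using a strategy keyed on its format.

    Prose formats (PDF/TXT/DOCX) are grouped into sentence windows; slide
    decks (PPT) are split on bullet lines so each point stays self-contained.
    """
    if doc_type in {"pdf", "txt", "docx"}:
        # Naive sentence split on terminal punctuation followed by whitespace.
        sentences = re.split(r"(?<=[.!?])\s+", text.strip())
        return [
            " ".join(sentences[i : i + max_sentences])
            for i in range(0, len(sentences), max_sentences)
        ]
    if doc_type == "ppt":
        # One chunk per bullet line.
        return [line.lstrip("-• ").strip() for line in text.splitlines() if line.strip()]
    # Fallback: the whole document as a single chunk.
    return [text.strip()]
```

In practice the window size and splitter would be tuned per corpus; the point is only that the chunker branches on document type before embedding.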
Retrieval
- ChromaDB vector store with hybrid search (dense + sparse BM25)
- Re-ranking pass using cross-encoder to boost precision
- Improved retrieval precision by 30% over a dense-only baseline
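One common way to combine a dense ranking with a sparse BM25 ranking before the cross-encoder pass is reciprocal rank fusion. The sketch below shows that fusion step plus a rerank stub; it is a framework-free illustration under the assumption of RRF-style fusion — the actual ChromaDB queries, BM25 scoring, and cross-encoder model are elided, and `score_fn` merely stands in for a cross-encoder's relevance score.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of chunk ids into one, RRF-style.

    Each list contributes 1 / (k + rank) per item; higher fused score wins.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

def rerank(query: str, candidates: list[str], score_fn, top_k: int = 5) -> list[str]:
    # score_fn(query, doc) stands in for a cross-encoder relevance score.
    return sorted(candidates, key=lambda d: score_fn(query, d), reverse=True)[:top_k]
```

Fusion recovers chunks that only one retriever surfaces; the reranker then re-scores just the fused top-k, which is where the precision gain over dense-only retrieval comes from.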
LLM Layer
Experimented with and benchmarked:
- LLaMA 3.2 — best for local/private deployments
- DeepSeek — strong at code and structured reasoning
- Gemini 2.5 Pro/Flash — best accuracy on complex multi-hop queries
- Gemini Embeddings — best embedding quality overall
Agentic Pipeline
The agentic mode uses LlamaIndex’s agent framework:
- Query decomposition → sub-questions
- Tool calls to the retrieval system for each sub-question
- Synthesis pass to combine evidence into a final answer with citations
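The three agentic steps above reduce to a small control loop. This is a framework-free sketch of that flow only — in the project the decomposition, retrieval tool calls, and synthesis are handled by LlamaIndex's agent framework, so the injected callables here are stand-ins, not real APIs:

```python
def agentic_answer(question, decompose, retrieve, synthesize):
    """Agentic RAG loop: decompose -> retrieve per sub-question -> synthesize.

    decompose/retrieve/synthesize are injected callables standing in for the
    LLM-backed tools; only the pipeline's control flow is shown here.
    """
    sub_questions = decompose(question)
    # One retrieval tool call per sub-question.
    evidence = {sq: retrieve(sq) for sq in sub_questions}
    # The synthesis pass sees every sub-question with its retrieved chunks,
    # so the final answer can cite the supporting passages.
    return synthesize(question, evidence)
```

The non-agentic pipeline is the degenerate case where `decompose` returns the question unchanged and synthesis is a single generation call.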
Microservice Architecture
┌──────────────┐ ┌───────────────┐ ┌──────────────┐
│ FastAPI GW │───▶│ Retrieval │───▶│ ChromaDB │
│ (REST API) │ │ Service │ │ Vector DB │
└──────────────┘ └───────────────┘ └──────────────┘
│ │
▼ ▼
┌──────────────┐ ┌───────────────┐
│ Generation │ │ PostgreSQL │
│ Service │ │ (metadata) │
└──────────────┘ └───────────────┘
Separating retrieval and generation services allows independent scaling — generation is GPU-bound; retrieval is IO-bound.
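That split can be sketched from the gateway's point of view with `asyncio`: IO-bound retrieval calls fan out concurrently while generation is awaited per query. The two service stubs below are hypothetical placeholders for the real HTTP calls, not the project's gateway code.

```python
import asyncio

async def retrieve(query: str) -> list[str]:
    # Stub for the IO-bound retrieval-service call (e.g. HTTP to ChromaDB).
    await asyncio.sleep(0.01)
    return [f"chunk-for:{query}"]

async def generate(query: str, chunks: list[str]) -> str:
    # Stub for the GPU-bound generation-service call.
    await asyncio.sleep(0.01)
    return f"answer({query}, ctx={len(chunks)})"

async def gateway(queries: list[str]) -> list[str]:
    """Overlap retrieval round-trips, then generate an answer per query.

    Because retrieval is IO-bound, the gateway can run many retrievals
    concurrently while generation capacity is scaled independently.
    """
    contexts = await asyncio.gather(*(retrieve(q) for q in queries))
    return [await generate(q, c) for q, c in zip(queries, contexts)]
```

A single-process monolith would serialize these stages; separating them lets retrieval replicas scale with request volume and generation replicas with GPU supply.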
Results
| Metric | Value |
|---|---|
| Documents supported | 1,000+ heterogeneous |
| Retrieval precision improvement | +30% |
| Response latency reduction | 35% (microservices) |
| Peak retrieval requests handled | 10,000+ |
Tech Stack
Framework: LlamaIndex · FastAPI
LLMs: Gemini 2.5 Pro/Flash · LLaMA 3.2 · DeepSeek
Vector DB: ChromaDB
Embeddings: Gemini Embeddings
Storage: PostgreSQL
Language: Python