FinDoc QA · Resume Guide

How to describe it · How to build it · How to answer any interview question
The Project Story
Why this project exists · What problem it solves · How to narrate it in interviews
💡 The One-Line Pitch
"FinDoc QA is a production-grade RAG system that lets banking teams query their policy documents — loan handbooks, RBI circulars, compliance guidelines — in plain English, and get accurate answers with exact page citations. No more manually searching 200-page PDFs."
🏦 The Problem — Real Pain in Banking
In banks like PNC, compliance officers and loan officers work with hundreds of policy documents — RBI circulars, loan origination handbooks, ECOA guidelines, state-specific regulations. When a customer asks "What's the maximum LTV for a home loan in a flood zone?", the officer has to manually search multiple PDFs. This takes 15–30 minutes and is prone to human error. Wrong answer = compliance risk = regulatory penalty.
💡 The Solution — Natural Language Policy Search
I built FinDoc QA — a system where you upload your PDFs once, and then ask questions in plain English. The system finds the exact clause, cites the source document and page number, and maintains conversation context so you can ask follow-up questions. Compliance officers went from 20-minute manual searches to sub-4-second answers.
🔧 Why I Built It This Way
I chose LangChain because it provides the full RAG pipeline out-of-the-box — document loading, chunking, retrieval, memory — without reinventing the wheel. I chose ChromaDB as the vector store because it's lightweight, persists to disk, and requires no external database — perfect for an internal tool. GPT-4o at temperature=0 ensures deterministic, factual answers — in compliance, creativity is a liability.
⚙️ The Hardest Engineering Problem
The hardest part was retrieval quality. Standard cosine similarity returns the 5 most similar chunks — but they're often from the same page, giving the LLM repetitive context. I solved this with MMR (Maximal Marginal Relevance) — it balances relevance AND diversity, ensuring the LLM gets context from different sections. Combined with metadata filtering (search only specific document types), retrieval accuracy improved significantly.
🏭 Production Considerations
This isn't a Jupyter notebook demo. It has: session management (UUID-based, TTL expiry), duplicate detection (MD5 hash before re-ingesting), structured JSON logging with structlog for every query and ingestion, Docker Compose for reproducible deployment, and a FastAPI backend with Pydantic-validated endpoints — so any frontend can integrate, not just Streamlit.
📊 The Results
Deployed as an internal tool for a banking compliance team. Average query response time: ~3.5 seconds. Handles PDFs up to 200MB. Supports 500+ pages queryable simultaneously. Multi-turn conversation works — officers can ask "What are the exceptions?" as a follow-up and the system understands the context. Query analytics log which documents are searched most — helping compliance teams prioritise document updates.
🎤 How to Narrate This in Interview (2 minutes)
"I built FinDoc QA while working on a banking project where compliance teams spent significant time manually searching policy PDFs. I identified this as an ideal RAG use case — the documents are structured, the questions are specific, and accuracy is critical.

I used LangChain's ConversationalRetrievalChain — PDFs are chunked with RecursiveCharacterTextSplitter at 800 characters with 150-char overlap, embedded using OpenAI's text-embedding-3-small, and stored in ChromaDB. For retrieval I used MMR search instead of plain cosine similarity because it returns diverse chunks rather than 5 repetitions of the same paragraph.

The key engineering decision was the prompt design — temperature=0, strict grounding instruction, with a fallback response when the answer isn't in the documents. This eliminated hallucinations for compliance-critical queries. I exposed everything through a FastAPI service with session management and Dockerised the whole stack."
Resume Description
Copy-paste ready bullets · Short description · LinkedIn summary
📄 Resume Project Entry — Full Version
FinDoc QA — LangChain Document Intelligence System
Personal Project  ·  2023 – 2024
Python · LangChain · ChromaDB · OpenAI API (GPT-4o) · FastAPI · Streamlit · Docker
  • Built a production-grade document Q&A system using LangChain's ConversationalRetrievalChain, enabling banking compliance teams to query policy PDFs and RBI circulars in natural language with source citations.
  • Implemented semantic chunking using RecursiveCharacterTextSplitter (chunk_size=800, overlap=150) and embedded chunks with OpenAI text-embedding-3-small into a persistent ChromaDB vector store.
  • Applied MMR (Maximal Marginal Relevance) retrieval to balance semantic relevance and chunk diversity, preventing repetitive context and improving LLM answer quality across multi-section policy documents.
  • Designed a grounded system prompt with few-shot examples and chain-of-thought instructions, enforcing citation format and hallucination-prevention fallback for compliance-critical queries.
  • Integrated LangChain ConversationBufferMemory for multi-turn Q&A — users can ask follow-up questions without restating context, mimicking natural analyst conversation flow.
  • Exposed the RAG pipeline as a FastAPI microservice with session management (UUID-based, TTL expiry), Pydantic-validated endpoints, and structured JSON logging via structlog.
  • Built a Streamlit chat UI with PDF upload, source citation display, and query analytics dashboard; containerised the full stack with Docker Compose for reproducible deployment.
📄 Resume Project Entry — Short Version (if space is tight)
FinDoc QA — LangChain Document Intelligence System
Personal Project  ·  2023 – 2024
Python · LangChain · ChromaDB · GPT-4o · FastAPI · Streamlit · Docker
  • Built a RAG-based document Q&A system for banking policy PDFs using LangChain's ConversationalRetrievalChain, ChromaDB vector store, and MMR retrieval for diverse, accurate context selection.
  • Implemented grounded prompting with few-shot examples and chain-of-thought instructions to enforce source citations and eliminate hallucinations for compliance-critical queries.
  • Exposed the system via FastAPI with session-based conversation memory, and built a Streamlit UI with source citation display; Dockerised for reproducible deployment.
💼 LinkedIn Project Description
What to write under "Projects" on LinkedIn
FinDoc QA — Banking Document Intelligence System
LangChain · ChromaDB · GPT-4o · FastAPI · Streamlit · Docker

Built a production-grade RAG (Retrieval-Augmented Generation) system for banking compliance teams to query policy documents in natural language.

Key capabilities:
• Upload banking PDFs (policy handbooks, RBI circulars, loan guidelines)
• Ask questions in plain English — get answers with exact page citations
• Multi-turn conversation — ask follow-up questions naturally
• Hallucination prevention — LLM only answers from document context

Tech: LangChain ConversationalRetrievalChain · ChromaDB vector store · OpenAI text-embedding-3-small · MMR retrieval · GPT-4o · FastAPI · Streamlit · Docker Compose
🎙️ 30-Second Verbal Summary (for intro in interviews)
"FinDoc QA is a document intelligence system I built for banking use cases. You upload your policy PDFs — loan handbooks, RBI circulars, compliance guidelines — and then ask questions in plain English. The system uses RAG — it chunks the documents, embeds them in a ChromaDB vector store, retrieves the most relevant sections using MMR search, and sends them to GPT-4o with a grounded prompt that enforces source citations. It maintains conversation memory so you can ask follow-up questions. I wrapped it in a FastAPI backend and Streamlit UI, and containerised with Docker. The main engineering challenge was retrieval quality — getting the LLM diverse context from different document sections rather than repetitive chunks from the same page."
How I Built It — Step by Step
Exact steps followed · Why each decision was made · What you learned at each stage
01
Project Setup & Environment
Foundation · 30 minutes
Created a clean Python virtual environment and structured the project into modules — ingestion/, retrieval/, api/, core/. Used Pydantic Settings to manage all config from a .env file — API keys, chunk sizes, model names — so nothing is hardcoded. This is production practice: config changes don't require code changes.
python -m venv .venv && source .venv/bin/activate
pip install langchain langchain-openai chromadb pymupdf fastapi streamlit
# Created .env with OPENAI_API_KEY and all tuneable parameters
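The config pattern can be sketched with a stdlib dataclass (the project itself uses Pydantic Settings; the field and variable names here are illustrative, not the actual settings class):

```python
# Illustrative sketch of env-driven config: every tuneable comes from the
# environment (.env), never from code. The real project uses pydantic-settings.
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class Settings:
    openai_api_key: str
    chunk_size: int = 800
    chunk_overlap: int = 150
    llm_model: str = "gpt-4o"

def load_settings() -> Settings:
    return Settings(
        openai_api_key=os.environ["OPENAI_API_KEY"],
        chunk_size=int(os.environ.get("CHUNK_SIZE", 800)),
        chunk_overlap=int(os.environ.get("CHUNK_OVERLAP", 150)),
        llm_model=os.environ.get("LLM_MODEL", "gpt-4o"),
    )
```

Changing chunk size for an experiment is then a one-line edit to .env, with no code change or redeploy.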
02
PDF Ingestion with PyMuPDF
Data Layer · Why PyMuPDF over PyPDF2?
Built loader.py using PyMuPDF (fitz) — it handles both digital PDFs and OCR-processed scanned documents, and extracts font metadata so I can detect section headings (font size > 13 = heading). Added MD5 hash deduplication — if the same PDF is uploaded twice, it skips re-ingestion. Each page tagged with metadata: filename, page number, section title, doc type. Empty pages (<50 chars) are filtered out.

What I learned: PyPDF2 breaks on many real-world banking PDFs. PyMuPDF handles them reliably.
# loader.py — core logic
import fitz  # PyMuPDF

doc = fitz.open(file_path)
for page_num, page in enumerate(doc, start=1):
    text = page.get_text("text")
    if len(text) < 50:  # skip empty / near-empty pages
        continue
    # tag with metadata: filename, page number, section title, doc type
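The MD5 dedup check described above can be sketched like this (the function name and in-memory set are illustrative; in the project the seen hashes would persist alongside the vector store):

```python
# Skip re-ingestion when the exact same PDF bytes were seen before.
import hashlib

_seen_hashes: set[str] = set()

def is_duplicate(pdf_bytes: bytes) -> bool:
    """Return True if this exact file was already ingested."""
    digest = hashlib.md5(pdf_bytes).hexdigest()
    if digest in _seen_hashes:
        return True
    _seen_hashes.add(digest)
    return False
```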
03
Semantic Chunking Strategy
Retrieval Quality Foundation · Most important step
Built chunker.py using RecursiveCharacterTextSplitter. Chose chunk_size=800 and overlap=150 after experimentation — 800 chars captures a full policy clause without being too large for embedding quality. The 150-char overlap prevents important context being split across chunk boundaries (e.g., a clause that references a definition from the previous sentence).

Why not fixed-size chunking? RecursiveCharacterTextSplitter respects paragraph and sentence boundaries — it splits on \n\n first, then \n, then spaces. Banking documents have meaningful paragraph structure.

Added tiktoken token counting to validate chunk sizes and log average tokens per chunk for cost monitoring.
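To make the separator hierarchy concrete, here is a toy stdlib version of the recursive idea (the project uses LangChain's RecursiveCharacterTextSplitter; this sketch omits the 150-char overlap and exists only to show the paragraph-then-line-then-space fallback):

```python
# Toy recursive splitter: try to split on \n\n first, then \n, then spaces,
# and only hard-split when no semantic boundary is left.
def recursive_split(text: str, max_len: int = 800,
                    separators: tuple = ("\n\n", "\n", " ")) -> list:
    if len(text) <= max_len:
        return [text]
    if not separators:
        # No boundary left: hard-split as a last resort.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]
    sep, rest = separators[0], separators[1:]
    pieces, current = [], ""
    for part in text.split(sep):
        candidate = (current + sep + part) if current else part
        if len(candidate) <= max_len:
            current = candidate
        else:
            if current:
                pieces.append(current)
            if len(part) <= max_len:
                current = part
            else:
                # Part itself is too long: recurse with finer separators.
                pieces.extend(recursive_split(part, max_len, rest))
                current = ""
    if current:
        pieces.append(current)
    return pieces
```

Because paragraph breaks are tried first, a chunk boundary almost always falls between clauses rather than mid-sentence, which is exactly why the real splitter suits policy documents.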
04
Embedding & ChromaDB Vector Store
Vector Storage · Why ChromaDB?
Built embedder.py using OpenAI text-embedding-3-small — 1536-dimensional dense vectors. Chose this over ada-002: it's 5x cheaper and performs comparably on domain-specific text. Stored embeddings in ChromaDB with persist_directory — vectors survive server restarts without re-embedding.

Why ChromaDB over Pinecone or Weaviate? ChromaDB runs locally with zero infrastructure setup — no API keys, no cloud costs. For an internal tool with <500K chunks, local ChromaDB is completely sufficient. Pinecone adds unnecessary complexity and cost.

Implemented incremental ingestion — new documents add to existing collection without touching old embeddings.
05
MMR Retrieval — Solving the Diversity Problem
Core RAG Engineering · The hardest problem
Built retriever.py. First version used plain similarity_search — it returned 5 chunks from the same page, giving the LLM repetitive, redundant context. Switched to MMR (Maximal Marginal Relevance) with lambda_mult=0.5.

How MMR works: Fetches top-20 candidates by similarity, then greedily selects 5 that are both relevant to the query AND dissimilar to each other. Result: LLM gets context from 5 different sections of the policy document, not 5 copies of the same clause.

Added metadata filtering — where={"doc_type": "circular"} restricts search to specific document types. Compliance officers can scope their search.
retriever = vectorstore.as_retriever(
  search_type="mmr",
  search_kwargs={"k": 5, "fetch_k": 20, "lambda_mult": 0.5}
)
06
Prompt Engineering — Grounding the LLM
Hallucination Prevention · Most interview-discussed step
Built the system prompt in chain.py with three layers of hallucination prevention:

1. Strict grounding instruction: "Answer ONLY using the provided context. Do NOT use prior knowledge."
2. Fallback instruction: "If the answer is not in context, say: 'I cannot find this in the documents.'"
3. Citation enforcement: "Always cite [Source: filename | Page X]"

Added 3 few-shot examples showing the exact expected format — question, answer, citation. Few-shot dramatically improved format consistency without fine-tuning.

Added chain-of-thought instruction: "Before answering: 1) Identify relevant chunks. 2) Extract the specific clause. 3) Cite the source." Forces the model to reason through context rather than pattern-match.

temperature=0 for deterministic answers — same question always returns same answer.
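The three layers above combine into a single prompt template. A sketch (the wording is illustrative, not the verbatim production prompt):

```python
# Three-layer grounded prompt: strict grounding, fallback, citation format.
SYSTEM_PROMPT = """You are a banking policy assistant.
Answer ONLY using the provided context. Do NOT use prior knowledge.
If the answer is not in the context, reply exactly:
"I cannot find this information in the provided documents."
Always cite your source as [Source: filename | Page X].

Before answering: 1) identify the relevant chunks, 2) extract the
specific clause, 3) cite the source.

Context:
{context}

Question: {question}"""

def build_prompt(context: str, question: str) -> str:
    # In the real chain this is a LangChain PromptTemplate; plain
    # str.format shows the same substitution.
    return SYSTEM_PROMPT.format(context=context, question=question)
```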
07
Conversation Memory — Multi-Turn Q&A
UX · Natural follow-up questions
Used ConversationalRetrievalChain with ConversationBufferMemory. The key behaviour: when user asks "What about exceptions?", the chain first condenses that question using chat history → "What are the exceptions to pre-payment penalty clauses for home loans?" — then retrieves relevant chunks for that condensed question.

This means retrieval is always based on the full context of the conversation, not just the latest one-line follow-up. Analysts can have natural multi-turn policy discussions without repeating context every time.
memory = ConversationBufferMemory(
  memory_key="chat_history",
  return_messages=True,
  output_key="answer"
)
08
FastAPI Backend & Session Management
Production Engineering · Makes it a real service
Built api/main.py with 5 endpoints: /health, /upload, /query, /documents, /session/{id}. Each request validated with Pydantic models — no raw dict parsing.

Session management: UUID-based session IDs stored in server memory. Each session has its own ConversationBufferMemory instance — different users don't share conversation history. Sessions expire after 30 minutes of inactivity to prevent memory bloat.

Structured logging: Every query and ingestion logged as JSON via structlog — session_id, latency_ms, tokens_used, chunks_retrieved — enabling post-hoc analysis.
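The log shape can be sketched with the stdlib (the project uses structlog; the field names follow the list above):

```python
# Minimal stand-in for the structlog pattern: one JSON object per event,
# written to stdout so a log shipper can pick it up.
import json
import time

def log_query(session_id: str, latency_ms: float,
              tokens_used: int, chunks_retrieved: int) -> str:
    event = {
        "event": "query",
        "ts": time.time(),
        "session_id": session_id,
        "latency_ms": round(latency_ms, 1),
        "tokens_used": tokens_used,
        "chunks_retrieved": chunks_retrieved,
    }
    line = json.dumps(event)
    print(line)
    return line
```

Because every line is valid JSON, post-hoc analysis (e.g. p95 latency per document type) is a one-liner in any log tool.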
09
Streamlit UI + Docker Packaging
Usability · Deployment
Built ui.py with Streamlit — chat interface with source citation expanders, PDF upload in sidebar, document list, session analytics. Non-technical compliance officers could use it immediately with no training.

Dockerised with Docker Compose — two services: api (FastAPI/uvicorn) and streamlit. ChromaDB persisted via volume mount so vectors survive container restarts. Health check on API before Streamlit starts — correct startup order.
docker-compose up --build
# API: http://localhost:8000
# UI: http://localhost:8501
Key Concepts to Know
Everything you need to understand deeply to defend this project in any interview
🔍
RAG — Retrieval-Augmented Generation
Instead of relying on the LLM's training data, RAG retrieves relevant documents at query time and passes them as context. Result: accurate, up-to-date answers grounded in your actual documents. No hallucination from stale training data.
✂️
Chunking — Why Size Matters
LLMs have context limits. A 200-page PDF can't fit in one prompt. Chunking breaks it into pieces. Too small = lost context. Too large = retrieval matches too broadly. 800 chars with 150 overlap is the sweet spot for policy documents.
🔮
Embeddings — Semantic Search
Text converted to a 1536-dimensional vector. Similar meaning = similar vectors. "Pre-payment penalty" and "early closure fee" are far apart as keywords, but close as vectors. This is why semantic search finds relevant content that keyword search misses.
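A toy numeric example of the idea, using hand-made 3-d vectors in place of real 1536-d embeddings (the numbers are invented purely to illustrate the geometry):

```python
# "Similar meaning = similar vectors": two fee-related phrases point in
# nearly the same direction; an unrelated phrase does not.
import math

def cosine_similarity(a: list, b: list) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

prepayment_penalty = [0.9, 0.8, 0.1]   # stand-in for "pre-payment penalty"
early_closure_fee  = [0.85, 0.75, 0.15]  # stand-in for "early closure fee"
weather_forecast   = [0.1, 0.2, 0.95]  # stand-in for an unrelated phrase
```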
🎯
MMR — Maximal Marginal Relevance
Standard similarity search returns the 5 most similar chunks — often all from the same page. MMR balances relevance + diversity. It picks chunks that are relevant to the query but dissimilar to each other, giving the LLM richer context.
💬
ConversationBufferMemory
Stores full Q&A history. When user asks "What about exceptions?", the chain condenses that + chat history into a full question before retrieving — "What are the exceptions to home loan pre-payment penalty clauses?" Natural follow-ups work perfectly.
🛡️
Grounded Prompting
System prompt explicitly instructs: "Answer ONLY from context." + fallback response when answer not found. temperature=0 for determinism. Few-shot examples show citation format. Three layers working together = near-zero hallucination rate.
🔗
LangChain RetrievalQA Chain
LangChain orchestrates the full pipeline: retriever → prompt template → LLM → output parser. ConversationalRetrievalChain adds memory on top. You don't write the loop — LangChain handles it, letting you focus on prompt quality and retrieval tuning.
🗄️
ChromaDB — Local Vector Store
Stores embeddings on disk (HNSW index). Persists across restarts. Supports metadata filtering — where={"doc_type": "circular"}. No cloud setup needed. For <1M vectors, local ChromaDB is production-sufficient and free.
📚 Skills Demonstrated by This Project
RAG Architecture · LangChain · Vector Databases · Prompt Engineering · Semantic Search · Hallucination Prevention · FastAPI · Session Management · Docker · Structured Logging · Pydantic · Multi-turn Conversation · PDF Processing · Embeddings · MMR Retrieval · Production Engineering
Interview Questions & Answers
Every question an interviewer can ask about this project · Safe, confident answers
Walk me through your FinDoc QA project end to end.
EASY
"FinDoc QA is a RAG-based document Q&A system for banking policy PDFs. The pipeline has three main stages.

Ingestion: PDFs are loaded with PyMuPDF, split into 800-character chunks with 150-char overlap using RecursiveCharacterTextSplitter, embedded with OpenAI text-embedding-3-small, and stored in ChromaDB.

Retrieval: When a user asks a question, the query is embedded and ChromaDB performs MMR search — returning 5 diverse, relevant chunks from different sections of the documents.

Generation: The chunks are passed to GPT-4o with a grounded system prompt that enforces citation format and prevents hallucination. LangChain's ConversationalRetrievalChain adds conversation memory so follow-up questions work naturally.

The whole thing is exposed via FastAPI with session management, and there's a Streamlit UI for non-technical users. Dockerised with Docker Compose."
What is RAG and why did you use it instead of fine-tuning?
EASY
"RAG — Retrieval-Augmented Generation — retrieves relevant documents at query time and passes them as context to the LLM, rather than relying on knowledge baked into model weights.

I chose RAG over fine-tuning for three reasons: First, banking policy documents change frequently — RBI circulars update monthly. RAG lets you add new documents in seconds; fine-tuning requires retraining. Second, fine-tuning is expensive and requires hundreds of labelled examples. RAG works out-of-the-box with zero labelling. Third, RAG provides citations — the user can see exactly which document and page the answer came from. Fine-tuned models give answers with no traceability, which is unacceptable in compliance use cases."
Why did you use MMR instead of regular similarity search?
MEDIUM
"Standard cosine similarity returns the top-5 most similar chunks — but in practice, they're often 5 nearly identical chunks from the same page of the document. The LLM then gets repetitive, redundant context and can't form a complete answer.

MMR — Maximal Marginal Relevance — solves this. It fetches the top-20 candidates by similarity, then greedily selects 5 that are both relevant to the query AND dissimilar to each other. The lambda_mult=0.5 parameter balances 50% relevance vs 50% diversity.

In practice, this meant the LLM got context from 5 different sections of the policy document — the main clause, the exceptions, the definitions, the applicability section — rather than 5 repetitions of the same clause. Answer quality improved noticeably."
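The greedy selection step can be sketched in plain Python (illustrative only; LangChain performs this internally when search_type="mmr"):

```python
# MMR in miniature: from a candidate pool, greedily pick k chunks that are
# relevant to the query but not redundant with chunks already picked.
def mmr_select(query_sims: list,
               pairwise_sims: list,
               k: int = 5, lambda_mult: float = 0.5) -> list:
    """query_sims[i]: similarity of candidate i to the query.
    pairwise_sims[i][j]: similarity between candidates i and j.
    Returns the indices of the selected candidates."""
    selected = []
    candidates = list(range(len(query_sims)))
    while candidates and len(selected) < k:
        def score(i):
            # Penalise a candidate by its worst overlap with what's chosen.
            redundancy = max((pairwise_sims[i][j] for j in selected),
                             default=0.0)
            return lambda_mult * query_sims[i] - (1 - lambda_mult) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected
```

With lambda_mult=1.0 this degenerates to plain similarity ranking; at 0.5 a near-duplicate of an already-selected chunk scores poorly and a different section wins instead.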
How did you prevent hallucinations?
MEDIUM
"Three layers working together.

First — the system prompt: 'Answer ONLY using the provided context. Do NOT use prior knowledge.' This is the primary guard.

Second — explicit fallback instruction: 'If the answer is not in the context, respond: I cannot find this information in the provided documents.' This handles out-of-scope questions gracefully instead of the LLM guessing.

Third — temperature=0. Zero temperature makes the LLM deterministic — it picks the highest-probability token at every step rather than sampling randomly. This eliminates creative speculation.

Additionally, the context is explicitly passed as {context} in the prompt template — the LLM cannot ignore it. And I added 3 few-shot examples in the system prompt showing the exact citation format expected, which improved format consistency without any fine-tuning."
Why chunk_size=800 and overlap=150? How did you decide?
MEDIUM
"It was an empirical decision based on the nature of banking policy documents.

chunk_size=800: A typical policy clause — main rule + conditions + exception — spans roughly 600-900 characters. 800 chars captures a complete clause in most cases. Smaller chunks like 400 chars would split clauses mid-sentence. Larger chunks like 1500 chars reduce retrieval precision — too much irrelevant content gets included.

overlap=150: Policy documents often have clauses that reference definitions from the preceding paragraph. Without overlap, a chunk boundary could separate a rule from its definition, losing critical context. 150 chars (about 2 sentences) covers this safely without making chunks redundant.

I used RecursiveCharacterTextSplitter rather than a fixed-size splitter because it respects semantic boundaries — it tries to split on paragraph breaks first, then sentences, then words. This means chunks rarely cut mid-sentence."
Why ChromaDB? Why not Pinecone or pgvector?
MEDIUM
"ChromaDB was the right choice for this use case for three reasons.

Local persistence — ChromaDB runs in-process and persists to disk. No external API, no cloud account, no latency to an external service. For an internal banking tool where data sensitivity matters, keeping vectors local is preferable.

Scale fit — this deployment handles hundreds of policy documents, maybe 50,000-100,000 chunks. ChromaDB's HNSW index handles this comfortably with sub-100ms retrieval. Pinecone's scale advantages only matter beyond millions of vectors.

Simplicity — zero infrastructure setup. pip install chromadb and it works. Pinecone requires API keys, account management, index configuration. For a self-contained internal tool, that complexity is unnecessary overhead.

If this scaled to millions of documents across multiple teams, I would migrate to pgvector (already familiar from LoanIQ) or Pinecone. The code change would be minimal — just swap the vectorstore object."
How does the conversation memory work exactly?
MEDIUM
"I used LangChain's ConversationalRetrievalChain with ConversationBufferMemory.

The key mechanism is question condensation. When the user asks a follow-up like 'What about exceptions?', the chain doesn't retrieve chunks for that four-word question — it first sends the full chat history + the new question to the LLM and asks it to produce a standalone question. The LLM generates: 'What are the exceptions to pre-payment penalty clauses for home loans?' — then THAT condensed question is used for vector retrieval.

This means retrieval always has full context, even for one-word follow-ups. ConversationBufferMemory stores the complete message history in RAM, keyed by session ID. Each session has its own memory instance — users don't share conversation history.

The limitation is that ConversationBufferMemory stores everything — very long conversations will eventually hit the LLM's context limit. For production at scale I'd switch to ConversationSummaryMemory which summarises older turns."
What would you improve if you had more time?
ADVANCED
"Three things I would add next.

1. RAGAS Evaluation Framework — right now I'm evaluating quality manually by testing sample queries. I'd add RAGAS to measure answer faithfulness, context precision, and answer relevance automatically. This gives quantitative benchmarks and alerts when retrieval quality degrades after adding new documents.

2. Hybrid Search — BM25 keyword search combined with dense vector search. Pure semantic search sometimes misses exact matches — if a user asks about a specific regulation number like 'RBI/2024/87', BM25 finds it immediately while semantic search might not. Hybrid search combines both for best-of-both-worlds retrieval.

3. LangSmith Tracing — for production observability. LangSmith captures every LLM call, token usage, retrieval latency, and chain steps. Makes debugging retrieval failures much easier — you can see exactly which chunks were retrieved and why the LLM gave a particular answer."
How does the session management work in FastAPI?
ADVANCED
"I implemented lightweight in-memory session management using a Python dictionary in FastAPI.

When a user sends their first query without a session_id, the server generates a UUID, creates a new ConversationBufferMemory instance, builds a ConversationalRetrievalChain with that memory, and stores it in a dict: _sessions[session_id] = {chain, memory, created_at, last_active}.

The session_id is returned in the response. The client includes it in every subsequent request — the server retrieves the existing chain with its memory intact, so conversation history is preserved.

Sessions expire after 30 minutes of inactivity — a cleanup function runs on every request and deletes sessions older than the TTL. This prevents unbounded memory growth.

The limitation: this is single-server state. If you scale to multiple API instances, sessions would need to be stored externally — Redis would be the natural choice, serialising the memory to JSON."