Complete Reference Poster · v2025

Generative AI + RAG Complete Ecosystem

Transformers · Retrieval-Augmented Generation · Agents · Fine-Tuning · Production
01 – 02
Transformer Architecture Self-Attention · Encoder–Decoder · Training
Self-Attention Mechanism
Q, K, V — Query, Key, Value projections of input embedding X via weight matrices WQ, WK, WV
Scaled Dot-Product: Attention(Q,K,V) = softmax(QKᵀ/√d_k)·V
Scaling by √d_k keeps dot-product variance near 1, preventing softmax saturation (and the vanishing gradients it causes) for large d_k
Multi-Head: h parallel attention heads → concat → project. Each head learns different relationship types
Positional Encoding — sinusoidal or learned (RoPE for LLaMA), injected at embedding layer
Attn(Q,K,V) = softmax( QKᵀ / √d_k ) · V
d_model=512 h=8 heads d_k=64 Flash Attn 2 Grouped Query Attn
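As a sketch, the scaled dot-product formula above maps directly to code; this toy version uses plain Python lists for clarity (a real implementation would use batched tensor ops and multiple heads):

```python
import math

def softmax(xs):
    m = max(xs)                                  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention over lists of row vectors."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # QK^T / sqrt(d_k): one score per key
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # weighted sum of value rows
        out.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))])
    return out
```

Each output row is a convex combination of the value rows, so its entries sum to whatever the mixed V rows sum to.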
Encoder–Decoder Flow
Input Tokens
Token + Pos Embedding
Encoder × N
↳ Multi-head Self-Attn → Add&Norm → FFN → Add&Norm
Target Tokens
Decoder × N
↳ Masked Self-Attn → Cross-Attn (enc output) → FFN (each sub-layer + Add&Norm)
Linear+Softmax
Training Pipeline
Loss: Cross-entropy on next-token prediction (CLM) or masked (MLM)
Masking: Causal mask (decoder) prevents attending to future tokens
Backprop: AdamW optimizer, gradient clipping, LR warmup+cosine decay
Mixed Precision: BF16 + gradient checkpointing to save VRAM
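The warmup+cosine LR schedule mentioned above can be sketched in a few lines (the step counts and LR values here are illustrative assumptions, not from the poster):

```python
import math

def lr_at(step, warmup=2000, total=100_000, peak=3e-4, floor=3e-5):
    """Linear warmup to peak LR, then cosine decay down to a floor LR."""
    if step < warmup:
        return peak * step / warmup
    progress = (step - warmup) / (total - warmup)      # 0 -> 1 over the decay phase
    return floor + 0.5 * (peak - floor) * (1 + math.cos(math.pi * progress))
```

The optimizer step would then set `param_group["lr"] = lr_at(step)` before each update.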
03
RAG — Data Ingestion Pipeline 16+ Loaders · Full Preprocessing
Document Loaders (16+)
PDF — PyPDF, PyMuPDF, Unstructured
DOCX/PPT/XLS — python-docx, openpyxl
HTML/Web — BeautifulSoup, Playwright
REST APIs — OpenAPI spec, JSONLoader
Databases — SQLLoader, Mongo, Postgres
Email/Slack — Gmail, Outlook, Slack API
Code — Git repos, GitHub, Jira
S3 / GCS — cloud storage loaders
Markdown / RST — docs sites
CSV / JSON — structured data
YouTube / Video — transcript loaders
Wikipedia — WikipediaLoader
Confluence / Notion — knowledge bases
Arxiv / PubMed — research papers
OCR — Tesseract, AWS Textract
Audio — Whisper transcription
Preprocessing Pipeline
1. 🧹 Cleaning
Strip HTML tags, remove headers/footers/boilerplate, fix encoding issues (UTF-8), normalize whitespace, remove control chars
2. 🔍 Deduplication
MinHash LSH for near-duplicates, exact MD5/SHA hashing, semantic dedup via embedding cosine similarity threshold (>0.95)
3. 🏷️ Metadata Enrichment
Source URL, doc type, author, date, page number, section title, language detection (langdetect), content category, doc ID
4. 🔧 Normalization
Unicode NFC normalization, lowercasing (optional), entity resolution, date standardization, abbreviation expansion
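A minimal stdlib-only sketch of steps 1, 2 and 4 (cleaning, exact-hash dedup, Unicode NFC normalization); MinHash LSH and semantic dedup would need extra libraries:

```python
import hashlib
import re
import unicodedata

def clean(text):
    """NFC-normalize, drop control characters, collapse whitespace."""
    text = unicodedata.normalize("NFC", text)
    text = "".join(ch for ch in text
                   if unicodedata.category(ch)[0] != "C" or ch in "\n\t")
    return re.sub(r"[ \t]+", " ", text).strip()

def dedup_exact(docs):
    """Drop exact duplicates by MD5 of the cleaned text, order-preserving."""
    seen, unique = set(), []
    for doc in docs:
        digest = hashlib.md5(clean(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique
```

Because hashing happens after cleaning, documents that differ only in whitespace or encoding collapse to one copy.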
04
RAG — Chunking Strategies 9 Strategies · WHAT / HOW / USE CASE / PARAMS
1. Recursive Character
WHAT
Split on hierarchy of separators, recursing until size limit
HOW
Try ["\n\n", "\n", ". ", " "] in order, recurse if chunk still too large
USE
General prose docs — maintains paragraph structure
PARAMS
chunk_size=1000, chunk_overlap=200, separators list
2. Token-Based
WHAT
Split at exact token boundaries using tokenizer (tiktoken/HF)
HOW
Tokenize text → slice at N tokens → decode back to string
USE
LLM context windows — precise token budget control
PARAMS
chunk_size=512 tokens, overlap=50 tokens, encoding="cl100k_base"
3. Semantic
WHAT
Embed sentences, split where cosine similarity drops (topic shift)
HOW
Embed consecutive sentences → detect similarity valleys → split
USE
Mixed-topic documents — Wikipedia, reports
PARAMS
threshold=0.3 drop, window=3 sentences, min_chunk=100
4. Sliding Window
WHAT
Fixed-size window sliding with stride over document
HOW
Chunk[i] = text[i*stride : i*stride + size]
USE
Dense technical content where context spans boundaries
PARAMS
window=512, stride=256 (50% overlap), max_chunks=100
5. Parent–Child
WHAT
Small chunks indexed, large parent retrieved on match
HOW
Index child chunks (128 tok); retrieve parent (512 tok) for LLM
USE
Precision + context — legal, medical, long docs
PARAMS
child_size=128, parent_size=512, docstore required
6. Markdown Header
WHAT
Split on markdown H1/H2/H3 heading boundaries
HOW
Regex detect headings → split, inherit header hierarchy in metadata
USE
Documentation sites, READMEs, structured wikis
PARAMS
headers=["#","##","###"], include_header_in_chunk=True
7. Sentence-Level
WHAT
Split at sentence boundaries using spaCy/NLTK
HOW
Sentence tokenize → group N sentences → overlap M
USE
QA tasks where answers fit within 1–3 sentences
PARAMS
sentences_per_chunk=5, overlap=1, min_length=50
8. Proposition
WHAT
LLM extracts atomic, self-contained factual statements
HOW
LLM prompt: "Extract all facts as standalone sentences"
USE
Dense factual QA — scientific papers, knowledge bases
PARAMS
model=gpt-4o-mini, max_props=20, dedup=True
9. Code-Aware
WHAT
AST-based split at function/class boundaries
HOW
Parse AST → split at top-level defs, preserve docstrings
USE
Code search, debugging — GitHub Copilot style RAG
PARAMS
language=python, max_lines=100, include_imports=True
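Strategy 1 (recursive character) can be sketched as a plain-Python splitter; this version omits overlap handling for brevity and falls back to a hard character split when no separator is present:

```python
def recursive_split(text, chunk_size=1000, separators=("\n\n", "\n", ". ", " ")):
    """Split on the coarsest separator first; recurse with finer ones as needed."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    for sep in separators:
        if sep not in text:
            continue
        chunks, buf = [], ""
        for part in text.split(sep):
            candidate = f"{buf}{sep}{part}" if buf else part
            if len(candidate) <= chunk_size:
                buf = candidate
            elif len(part) > chunk_size:       # single piece still too big: recurse
                if buf:
                    chunks.append(buf)
                chunks.extend(recursive_split(part, chunk_size, separators))
                buf = ""
            else:
                if buf:
                    chunks.append(buf)
                buf = part
        if buf:
            chunks.append(buf)
        return chunks
    # no separator found anywhere: hard character split
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
```

Paragraph boundaries (`"\n\n"`) are tried first, so prose structure survives whenever chunks fit.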
05
RAG — Embeddings Dense · Sparse · Hybrid · MMR · Multi-vector
Dense Embedding Models
Model | Dims | Notes
text-embedding-3-small | 1536 | OpenAI, cost-efficient, strong
text-embedding-3-large | 3072 | OpenAI, best quality
BGE-M3 | 1024 | Open source, multilingual SOTA
E5-Mistral-7B | 4096 | LLM-based, top MTEB scores
Cohere Embed v3 | 1024 | Optimized for retrieval
Jina v3 | 1024 | Task-specific, 8192 ctx
Sparse — BM25
score(D,Q) = Σᵢ IDF(qᵢ) · f(qᵢ,D)·(k₁+1) / (f(qᵢ,D) + k₁·(1 − b + b·|D|/avgdl))
Term frequency weighted by document length, great for keyword-heavy queries
Params: k1=1.2–2.0, b=0.75, no ML training needed
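The BM25 formula above translates directly to code; this sketch assumes documents are already tokenized into word lists and uses the common `log((N − df + 0.5)/(df + 0.5) + 1)` IDF variant:

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Okapi BM25 score of each doc against the query (docs = lists of tokens)."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    df = Counter()                              # document frequency per term
    for d in docs:
        df.update(set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for q in query_terms:
            if tf[q] == 0:
                continue
            idf = math.log((N - df[q] + 0.5) / (df[q] + 0.5) + 1)
            norm = tf[q] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[q] * (k1 + 1) / norm
        scores.append(score)
    return scores
```

Higher term frequency raises the score sub-linearly (saturating via k1), while the `b` term penalizes longer documents.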
Hybrid Retrieval — RRF Fusion
RRF(d) = Σ 1 / (k + rank_i(d))    k=60
Combine dense + BM25 rankings without score normalization
MMR — Maximal Marginal Relevance: balance relevance vs diversity
MMR(d) = λ·sim(q,d) − (1-λ)·max[sim(d,dⱼ)] for already-selected dⱼ
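RRF fusion is only a few lines; rank_i(d) is a document's 1-based rank in ranked list i, and documents missing from a list simply contribute nothing:

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion over multiple ranked lists of doc ids."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # highest fused score first
    return sorted(scores, key=scores.get, reverse=True)
```

Because only ranks are used, dense cosine scores and BM25 scores never need to be normalized onto a common scale.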
Multi-Vector Representations
ColBERT: per-token vectors, late interaction MaxSim
SPLADE: sparse expansion via MLM head, hybrid of BM25+dense
Summary + Dense: embed chunk summary for retrieval, return full chunk
Hypothetical Questions: LLM generates Qs → embed Qs → store with chunk
06
Vector Databases + Indexing Pinecone · Weaviate · FAISS · HNSW · IVF
Vector Databases
DB | Strength | Best For
Pinecone | Managed, fast | Production SaaS
Weaviate | GraphQL API, hybrid | Rich metadata filter
FAISS | In-memory, CPU+GPU | Research, prototyping
Chroma | Lightweight, local | Dev/local RAG
pgvector | PostgreSQL native | Existing PG infra
Qdrant | Payload filtering | Metadata-heavy search
Milvus | Open, scalable | Billion-scale search
HNSW Algorithm
Hierarchical Navigable Small World graph — O(log N) search
M=16 connections per node, efConstruction=200 build-time
ef=100 query-time accuracy/speed tradeoff
IVF (Inverted File)
K-means clusters (nlist=1024) → search top nprobe clusters
IVF+PQ: product quantization compresses vectors 4–8× for RAM
Billion-scale: FAISS IVF256,PQ48 = 48 bytes/vec
ScaNN + Annoy
ScaNN (Google) — anisotropic quantization, top recall/QPS on ann-benchmarks
Annoy (Spotify) — forest of binary trees, static index, read-heavy
07
Advanced Retrieval Techniques HyDE · Step-Back · Multi-Query · Decomposition
HyDE — Hypothetical Document Embeddings
HOW
LLM generates hypothetical answer → embed answer → retrieve real docs by proximity
WHY
Bridges query–document embedding space mismatch; better semantic alignment
User Query
LLM: Generate Hypothetical Doc
Embed Hypothesis
Retrieve Real Docs
Step-Back Prompting
Reframe specific question → abstract principle question → retrieve both
"What physics principle governs X?" before retrieving X specifics
Improves grounding for reasoning-heavy queries
Multi-Query
LLM generates N paraphrased versions of original query
Retrieve for each variant → union + deduplicate results
Covers vocabulary mismatch, broader recall
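The multi-query flow above can be sketched as follows; `paraphrase_fn` (an LLM paraphraser) and `retrieve_fn` (any retriever returning doc ids) are stand-ins you would supply:

```python
def multi_query_retrieve(query, paraphrase_fn, retrieve_fn):
    """Retrieve for the query plus its paraphrases; union + dedupe, order-preserving."""
    variants = [query] + list(paraphrase_fn(query))
    seen, merged = set(), []
    for variant in variants:
        for doc_id in retrieve_fn(variant):
            if doc_id not in seen:
                seen.add(doc_id)
                merged.append(doc_id)
    return merged
```

Deduplicating by id while preserving first-seen order keeps the original query's hits ranked ahead of variant-only hits.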
Query Expansion
Add synonyms, related terms (WordNet, LLM) to query
Pseudo-Relevance Feedback: expand from top-K initial results
Query Decomposition
Split complex query into sub-questions → answer each → synthesize
Self-Query Retriever: LLM extracts metadata filters from query (date > 2023, category=finance)
08
Re-ranking Cross-Encoders · ColBERT Deep Dive
Cross-Encoders
Concatenate query+doc → full attention → single relevance score
Much more accurate than bi-encoders but O(N) latency per doc
Models: ms-marco-MiniLM-L6, BGE-reranker, Cohere Rerank API
Strategy: retrieve top-100 (bi-encoder) → rerank → take top-5
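The retrieve-then-rerank strategy is just a two-stage funnel; here `retriever` stands in for a cheap bi-encoder/ANN stage and `score_fn` for a cross-encoder relevance scorer:

```python
def retrieve_then_rerank(query, retriever, score_fn, n_retrieve=100, top_k=5):
    """Cheap high-recall retrieval, then expensive high-precision reranking."""
    candidates = retriever(query, n_retrieve)           # bi-encoder / ANN stage
    reranked = sorted(candidates, key=lambda doc: score_fn(query, doc), reverse=True)
    return reranked[:top_k]
```

The O(N) cross-encoder cost only applies to the 100 candidates, not the whole corpus.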
ColBERT — Deep Dive
Architecture: BERT encodes Q and D independently → per-token vectors
Late Interaction: No full cross-attn at query time (fast!)
MaxSim(Q,D) = Σ_q max_d ( q·dᵀ )
Each query token matches its best passage token → sum scores
PLAID: Centroid interaction pruning for billion-scale ColBERT
ColBERTv2: Residual compression 64→32 bytes/token, quantized
colbert-v2.0 RAGatouille PLAID engine
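The MaxSim operator above is simple enough to write out; this toy version takes per-token vectors as plain lists (real ColBERT batches this on GPU):

```python
def maxsim(query_vecs, doc_vecs):
    """ColBERT late interaction: sum over query tokens of the best doc-token dot product."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)
```

Because Q and D are encoded independently, doc vectors can be precomputed and indexed; only this cheap max-and-sum runs at query time.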
09
Context Optimization Compression · MMR · Lost-in-Middle
Contextual Compression
LLMChainExtractor: LLM extracts only relevant sentences from chunk
EmbeddingsFilter: Keep only sentences above cosine similarity threshold
DocumentCompressorPipeline: Chain multiple compressors
Lost-in-the-Middle Problem
LLMs recall beginning + end best; middle chunks are "lost"
Fix 1: Place most relevant chunks at start/end of context
Fix 2: Reduce context window — fewer, higher-quality chunks
Fix 3: Recursive summarization of long contexts
Fix 4: LongContext models (Gemini 1M, Claude 200K)
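Fix 1 can be sketched as a reordering pass (similar in spirit to LangChain's long-context reorder): chunks arrive best-first and leave with the strongest ones at the edges of the context:

```python
def reorder_for_llm(chunks_best_first):
    """Place top-ranked chunks at the start and end; weakest land in the middle."""
    front, back = [], []
    for i, chunk in enumerate(chunks_best_first):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]
```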
Context Building Strategies
Contextual Retrieval: Prepend doc-level summary to each chunk before embed
Context-Aware Reranking: Score with full retrieved context, not individual chunks
Conversation history integration for multi-turn RAG
10
Full Query Pipeline — 9 Stages User Query → Response
User Query Input
Natural language query received → log, timestamp, session_id attach
Query Preprocessing
Clean input → HyDE / expansion / decomposition / self-query metadata extraction
Hybrid Retrieval
Dense vector search (top-100) + BM25 sparse (top-100) → RRF fusion → top-50
Metadata Filtering
Apply pre-filter (date range, source, category) before or after ANN search
Cross-Encoder Re-ranking
Score top-50 with cross-encoder → sort → select top-K (K=5–10)
Context Optimization
Compress chunks → MMR diversity filter → order (best first/last) → fit context window
Prompt Assembly
System prompt + context chunks + conversation history + user query → final prompt
LLM Generation
Streaming inference → temperature=0.1, top_p=0.9, max_tokens=1024
Response + Citations
Return answer + source citations + confidence score → log for RAGAS evaluation
11
Graph RAG Microsoft Style · Knowledge Graphs
Architecture
Raw Docs
Entity Extraction LLM
Knowledge Graph
Community Detection
Community Reports
Graph Traversal Search
Key Concepts
Entities: Named nodes (Person, Org, Concept) extracted by LLM
Relations: Typed edges between entities with weights
Communities: Leiden algorithm clusters → summarized as text
Local search: Entity-centric — find neighbors, traverse edges
Global search: Map-reduce over community reports for holistic QA
DRIFT search: Dynamic Reasoning and Inference with Flexible Traversal — blends local + global search iteratively
Tools: Microsoft GraphRAG, LightRAG, Neo4j GraphRAG
When to Use
Multi-hop reasoning across entities (who knows whom, causal chains)
Corpus-level summarization, thematic analysis
12
Agentic RAG Self-RAG · CRAG · FLARE · ReAct
Self-RAG
LLM decides IF to retrieve (Retrieve token), evaluates relevance (IsRel), and checks support (IsSup)
Trained with special reflection tokens as discrete actions
CRAG — Corrective RAG
Evaluator grades retrieved docs: Correct / Ambiguous / Incorrect
If incorrect → web search fallback → knowledge refinement before generation
Adaptive RAG
Classifies query complexity → routes to: No RAG / Single-step RAG / Multi-step RAG
FLARE
Forward-Looking Active REtrieval — retrieves when model is uncertain (low probability tokens)
ReAct + Multi-Agent
ReAct: Reason → Act → Observe loop with tool calls
Multi-agent: Orchestrator delegates to specialist RAG agents (legal, financial, HR)
13
LLM Routing Semantic · Model · MoE · Tool
Semantic Router
Embed query → cosine sim to route exemplars → classify intent → select handler
Tool: semantic-router library (Aurelio AI) — fast, no LLM needed
Model Routing
Simple query → GPT-3.5 / Haiku (cheap, fast)
Complex reasoning → GPT-4o / Claude Opus (capable)
LLM-router: meta-model predicts best model per query
RouteLLM: trained preference-based router (LMSYS)
Mixture of Experts (MoE)
Gating network assigns tokens to top-K expert FFN layers
Sparse activation — only K/N experts fire per token (efficient)
Models: Mixtral 8×7B, GPT-4 (rumored MoE), Grok-1
Auxiliary load-balancing loss prevents expert collapse
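Top-K gating can be sketched as a softmax restricted to the K highest expert logits (a simplification: real MoE layers add noise and a load-balancing loss during training):

```python
import math

def top_k_gate(logits, k=2):
    """Sparse MoE gating: softmax over only the top-k expert logits."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    m = max(logits[i] for i in top)                 # stability shift
    exps = {i: math.exp(logits[i] - m) for i in top}
    total = sum(exps.values())
    return {i: exps[i] / total for i in top}        # expert index -> routing weight
```

The token's output is then the weight-averaged output of just those K expert FFNs; the other N−K experts never run.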
Tool Routing
LLM decides which tool to call (search, calculator, code, DB)
Function calling (OpenAI) / tool_use (Anthropic) as structured routing
14
LangChain LCEL · Chains · Agents · Memory
LCEL — LangChain Expression Language
Declarative pipe syntax: chain = prompt | model | parser
Streaming, async, batch built-in via Runnable interface
RunnablePassthrough, RunnableParallel for fan-out patterns
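The pipe syntax is ordinary operator overloading; this is a toy model of the pattern, not LangChain's actual Runnable class (which adds streaming, async, and batch):

```python
class Step:
    """Toy model of the LCEL pipe pattern: `a | b` composes a then b."""
    def __init__(self, fn):
        self.fn = fn
    def __or__(self, other):
        return Step(lambda x: other.fn(self.fn(x)))
    def invoke(self, x):
        return self.fn(x)

prompt = Step(lambda q: f"Answer briefly: {q}")
model = Step(str.upper)                  # stand-in for an LLM call
parser = Step(str.strip)

chain = prompt | model | parser
```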
Chain Types
RetrievalQA: retriever → stuffing/map_reduce/refine → LLM
ConversationalRetrievalChain: chat history + RAG
MapReduceDocumentsChain: parallel map → reduce for long docs
RefineChain: sequential refinement over docs
Agents + Memory
Agent types: Zero-shot ReAct, OpenAI Functions, Structured Input, Plan-Execute
Memory: ConversationBufferMemory, ConversationSummaryMemory, VectorStoreRetrieverMemory, EntityMemory
Tools: Search, Python REPL, SQL, Wikipedia, custom functions
15
LangGraph Nodes · State · Checkpointing · Multi-Agent
Core Concepts
StateGraph: Directed graph where nodes are Python functions transforming TypedDict state
Nodes: graph.add_node("intake", intake_agent) — pure functions: state_in → state_out
Edges: add_edge, add_conditional_edges for branching
START/END: Built-in entry/exit nodes for graph lifecycle
Parallel fan-out: Multiple edges from one node → concurrent execution
LoanIQ Graph Architecture
START → intake_agent
Parse loan application, validate fields
route_after_intake()
Conditional: valid → parallel | invalid → END
ratio_calc ‖ policy_retrieval ‖ compliance
Fan-out parallel — Annotated[list, operator.add] merges
underwriting_agent → decision → audit
Final decisioning, audit trail, END
State Management
TypedDict state: LoanApplicationState — type-safe, immutable per-turn
Annotated reducers: Annotated[list, operator.add] for parallel-safe list merging
State isolation: Each node receives full state snapshot, returns partial update
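How a reducer merges parallel partial updates can be modeled in a few lines; this is a toy of the mechanism, not LangGraph's internal implementation:

```python
import operator
from typing import Annotated, TypedDict

class LoanState(TypedDict):
    findings: Annotated[list, operator.add]   # parallel branches append here
    score: float                              # plain field: last write wins

def apply_update(state, update, reducers):
    """Merge a node's partial update into state, applying reducers where declared."""
    merged = dict(state)
    for key, value in update.items():
        reducer = reducers.get(key)
        merged[key] = reducer(merged[key], value) if reducer else value
    return merged
```

With `operator.add` as the reducer, two fan-out branches can each return `{"findings": [...]}` and the lists concatenate instead of one branch clobbering the other.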
Checkpointing + Persistence
MemorySaver: In-memory, dev/test only
SqliteSaver: Local persistence, single-user
PostgresSaver: Production, multi-tenant
Thread IDs: config={"configurable":{"thread_id":"loan_123"}}
Time travel: Replay from any checkpoint snapshot
Human-in-loop: interrupt_before=["decision_agent"] for approval gates
Multi-Agent Patterns
Supervisor: LLM orchestrator routes to worker agents
Swarm: Agents hand off to each other peer-to-peer
Subgraphs: Compose graphs within graphs (modular)
LangGraph Platform: Deploy, scale, stream via REST+WS
16 – 17
Fine-Tuning LoRA · QLoRA · RLHF · DPO · Alignment · Serving
Methods
MethodVRAMApproach
Full FTHighAll weights updated, best quality, most data
LoRALowLow-rank ΔW=BA (r=8–64) added to frozen weights
QLoRALowest4-bit NF4 quantized base + bf16 LoRA adapters
Prefix/Prompt FTMinimalOnly prepended soft prompt tokens trained
LoRA: W' = W₀ + α/r · B·A  |  B∈ℝ^(d×r), A∈ℝ^(r×k)
Data Requirements
Format: instruction/input/output triplets, chat format, preference pairs
Size: Task-specific: 100–10K examples. General: 100K+
Quality > Quantity: Self-Instruct, Alpaca, ShareGPT style
Tools: Unsloth (fast LoRA), Axolotl, TRL, HF Transformers
Training + Alignment
SFT: Supervised fine-tuning on demonstrations (cross-entropy loss)
RLHF: 1) SFT 2) Train reward model on human preferences 3) PPO optimize policy vs reward − β·KL(π||π_ref)
DPO: Direct Preference Optimization — no RL needed. Implicit reward via preference pairs. β-regulated KL penalty baked into loss
ORPO: Combined SFT+alignment in one pass, no reference model
Constitutional AI: Self-critique + revision against principles (Anthropic)
DPO: L = -log σ( β log π(y_w)/π_ref(y_w) - β log π(y_l)/π_ref(y_l) )
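The DPO loss above for a single preference pair, written out (inputs are sequence log-probabilities under the policy π and frozen reference π_ref):

```python
import math

def dpo_loss(pi_logp_w, pi_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one (chosen y_w, rejected y_l) preference pair."""
    margin = beta * ((pi_logp_w - ref_logp_w) - (pi_logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))   # -log(sigmoid(margin))
```

The loss falls as the policy raises the chosen response's log-prob relative to the reference more than the rejected one's — the implicit reward, no reward model needed.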
Serving Considerations
vLLM: PagedAttention, continuous batching, OpenAI-compatible API
GGUF/llama.cpp: CPU inference with Q4_K_M quantization
AWQ/GPTQ: 4-bit GPU-efficient weight quantization post-FT
Speculative decoding: Draft model + verifier for 2–3× throughput
Merge adapters: merge_and_unload() → single model for serving
18
Production Layer API · Caching · Guardrails · Safety
API Layer
FastAPI — async, OpenAPI auto-docs, Pydantic validation
Streaming: Server-Sent Events (SSE) via EventSourceResponse
Auth: JWT/OAuth2, API key middleware, rate limiting (slowapi)
LoanIQ stack: FastAPI + uvicorn + LangGraph compiled graph singleton
Horizontal scaling: Kubernetes, load balanced, stateless API pods
Caching Layers
Semantic cache: Embed query → check Redis/Faiss for similar past queries (cosine >0.97)
Exact cache: MD5 hash of (query+context) → Redis TTL cache
Embedding cache: Store computed embeddings for re-used chunks in Redis
LangChain GPTCache: Drop-in semantic caching layer
Typical savings: 40–80% LLM call reduction on repeated queries
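The exact-cache layer reduces to hashing (query + context) into a key-value store; this toy keeps the store in a dict, where production would use Redis with a TTL:

```python
import hashlib

class ExactCache:
    """Exact-match LLM response cache keyed by MD5(query|context). Toy: no TTL, no eviction."""
    def __init__(self):
        self._store = {}
    def _key(self, query, context):
        return hashlib.md5(f"{query}|{context}".encode("utf-8")).hexdigest()
    def get(self, query, context):
        return self._store.get(self._key(query, context))
    def put(self, query, context, answer):
        self._store[self._key(query, context)] = answer
```

The semantic cache works the same way but looks up by embedding similarity (cosine above a threshold) instead of an exact hash.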
Guardrails + Safety
Input guardrails: Prompt injection detection, topic filter, PII redaction
Output guardrails: Hallucination detector, toxicity filter (Perspective API), citation validator
NeMo Guardrails: Colang policy language, dialog rails, fact-checking rail
Llama Guard: Fine-tuned safety classifier for input/output
PII: Presidio (Microsoft) for detect + anonymize before embedding
19
Monitoring & Evaluation RAGAS · Observability · Feedback · Human-in-Loop
RAGAS Evaluation Metrics
Metric | Measures | How
Faithfulness | Hallucination rate | Claims in answer supported by context? LLM-judged
Answer Relevancy | On-topic quality | Reverse-generate Qs from answer → cosine sim to original Q
Context Precision | Retrieval signal/noise | How much retrieved context is actually relevant?
Context Recall | Coverage | Ground-truth claims found in retrieved context?
Context Relevancy | Retrieval accuracy | Retrieved chunks relevant to the query?
Answer Correctness | Factual accuracy | Semantic similarity + factual overlap vs ground truth
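Set-based toy versions of context precision and recall show the intent of the metrics; RAGAS itself uses LLM judgment to decide relevance and claim support rather than the exact membership / `supports` predicate assumed here:

```python
def context_precision(retrieved, relevant):
    """Share of retrieved chunks that are relevant (signal vs noise)."""
    if not retrieved:
        return 0.0
    return sum(1 for chunk in retrieved if chunk in relevant) / len(retrieved)

def context_recall(retrieved, ground_truth_claims, supports):
    """Share of ground-truth claims supported by some retrieved chunk."""
    if not ground_truth_claims:
        return 0.0
    found = sum(1 for claim in ground_truth_claims
                if any(supports(chunk, claim) for chunk in retrieved))
    return found / len(ground_truth_claims)
```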
Observability Stack
LangSmith: Trace every LangChain/LangGraph run, latency, token cost per node
Arize Phoenix: LLM observability, embedding drift, retrieval quality dashboards
OpenTelemetry: OpenInference semantic conventions for LLM spans
Prometheus + Grafana: p50/p95/p99 latency, token/s, cost/query, error rate
Key metrics: TTFT (time-to-first-token), E2E latency, retrieval hit rate, faithfulness score
Feedback Loops + Human-in-Loop
Thumbs up/down: Log preference signal → fine-tune reward model (RLHF loop)
A/B testing: Route % of traffic to candidate model → compare RAGAS scores
LangGraph HIL: interrupt_before nodes pause for human approval (LoanIQ: underwriting decisions)
Active learning: Flag low-confidence responses for human annotation → training data
20
End-to-End System Diagram — Swimlane User → Backend → Retrieval → LLM → Monitoring
USER
Type Query
HTTPS POST /ask
Stream Response
Rate Answer 👍/👎
BACKEND
FastAPI Auth
Cache Check
Guardrails Input
Query Preprocess
Prompt Assembly
Guardrails Output
Cache Set
RETRIEVAL
Embed Query
ANN Search (pgvector)
BM25 Sparse
RRF Fusion
Cross-Encoder Rerank
Top-K Chunks
LLM
Router: Model Select
OpenAI / Claude API
Streaming Tokens
Tool Calls (optional)
Final Answer + Citations
MONITOR
LangSmith Trace
RAGAS Score
Prometheus Metrics
Grafana Dashboard
Feedback → RLHF
21
Reference — Model + Tool Selection Guide When to use which
Task / Scenario | Recommended Model / Tool | Rationale
Production RAG (cost-sensitive) | GPT-4o-mini + text-embedding-3-small | 80% quality at 10% cost vs GPT-4o
Complex reasoning / agent | Claude 3.5 Sonnet / GPT-4o | Best long-context, tool use, reasoning
Local / private deployment | Llama 3.1 8B / Mistral 7B via Ollama | No data leaves premises, free
Code generation | Claude 3.5 Sonnet / DeepSeek Coder | Top HumanEval scores
Embeddings (best quality) | text-embedding-3-large / E5-Mistral-7B | Highest MTEB BEIR scores
Embeddings (open source) | BGE-M3 / Jina v3 | Multilingual, self-hosted, strong
Re-ranking | Cohere Rerank v3 / BGE-reranker-v2 | Best retrieval precision gain
Orchestration | LangGraph (stateful) / LangChain (chains) | Cycles + state = LangGraph; simple pipelines = LC
Vector DB (managed) | Pinecone / Weaviate | Zero ops, SOC2, good SLAs
Vector DB (open source) | pgvector (existing PG) / Qdrant | Collocate with app DB / rich filtering
Fine-tuning (low resource) | QLoRA with Unsloth | 4-bit + LoRA = ~70% VRAM reduction
Graph RAG | Microsoft GraphRAG / LightRAG | Multi-hop reasoning, thematic summaries
22
Reference — Hyperparameters + Common Pitfalls Chunk Size · Top-K · Temperature · Mistakes
Critical Hyperparameters
Parameter | Recommended | Impact
chunk_size | 512–1024 tokens | Too small = loss of context; too large = noise + lost-in-middle
chunk_overlap | 10–20% of chunk | Prevents answer split across boundaries
top_k retrieval | 20–100 (rerank to 5) | High recall → rerank for precision; too small = missed answers
top_k final | 3–7 chunks | Context window budget; quality vs completeness tradeoff
temperature | 0.0–0.2 (RAG) · 0.7–1.0 (creative) | Low T = deterministic, factual; high T = diverse, creative
top_p (nucleus) | 0.9 | Truncates low-prob tokens; tune either top_p or temperature, not both
similarity threshold | 0.7–0.8 | Filter irrelevant retrieved chunks before LLM
embed batch_size | 64–256 | Throughput vs memory; larger = faster embedding
LoRA rank (r) | 16–64 | Higher r = more capacity, more memory; r=16 usually sufficient
LoRA alpha | 2× rank | Effective adapter scale; alpha/r = scale factor
⚠ Common Pitfalls
Hallucination: LLM answers without grounding → Fix: faithfulness guardrail, lower temp, explicit "only use context" prompt
Chunk too large: Embedding averages out meaning → retrieval misses → Fix: smaller chunks + parent retrieval
No reranking: ANN has recall errors → Fix: always add cross-encoder reranking stage
Stale index: Docs updated but not re-indexed → Fix: delta indexing pipeline with doc hash change detection
Query-doc mismatch: Query is short, doc is long → Fix: HyDE, multi-query, or doc summary embeddings
Ignoring metadata: Not filtering by date/source → Fix: always add metadata filter layer to retrieval
Lost-in-middle: Relevant chunk buried in context → Fix: reorder (best first/last), reduce K
No deduplication: Repeated chunks inflate context, waste tokens → Fix: MD5/cosine dedup in preprocessing
Embedding model mismatch: Different model for indexing vs query → Fix: Always use same model for both
No eval loop: Shipping RAG without RAGAS baseline → Fix: run offline RAGAS eval before every deployment
RAG Quality Checklist
Preprocessing removes noise before chunking
Chunk strategy matches doc structure (code → AST, prose → recursive)
Hybrid retrieval (dense + BM25) with RRF fusion
Cross-encoder reranking applied after ANN
Contextual compression to remove noise from chunks
Lost-in-middle mitigation (best chunks first/last)
Guardrails on input (injection) and output (faithfulness)
RAGAS metrics tracked per deployment (target: F>0.8, AR>0.8)
Semantic cache for repeated queries
Observability: traces, latency, cost per query in Grafana
Human-in-loop for high-stakes decisions (LoanIQ: approval gate)
Regular index refresh + deduplication pipeline
LoanIQ Tech Stack Summary
FastAPI LangGraph pgvector text-embedding-3-small BM25+RRF Cross-Encoder RAGAS LangSmith PostgresSaver NeMo Guardrails PolicyAgent · ComplianceAgent