Complete Reference Poster · v2025

Generative AI + RAG Complete Ecosystem

Transformers · Retrieval-Augmented Generation · Agents · Fine-Tuning · Production
01 – 02
Transformer Architecture Self-Attention · Encoder–Decoder · Training
Self-Attention Mechanism
Q, K, V — Query, Key, Value projections of input embedding X via weight matrices WQ, WK, WV
Scaled Dot-Product: Attention(Q,K,V) = softmax(QKᵀ/√d_k)·V
Scaling by √d_k keeps dot-product variance near 1, preventing softmax saturation (and the vanishing gradients it causes) for large d_k
Multi-Head: h parallel attention heads → concat → project. Each head learns different relationship types
Positional Encoding — sinusoidal or learned (RoPE for LLaMA), injected at embedding layer
Attn(Q,K,V) = softmax( QKᵀ / √d_k ) · V
d_model=512 h=8 heads d_k=64 Flash Attn 2 Grouped Query Attn
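As a sketch, the scaled dot-product formula above maps directly to code; this toy version uses plain Python lists for clarity (a real implementation would use batched tensor ops and multiple heads):

```python
import math

def softmax(xs):
    m = max(xs)                                  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention over lists of row vectors."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # QK^T / sqrt(d_k): one score per key
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # weighted sum of value rows
        out.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))])
    return out
```

Each output row is a convex combination of the value rows, so its entries sum to whatever the mixed V rows sum to.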
Encoder–Decoder Flow
Input Tokens
Token + Pos Embedding
Encoder × N
↳ Multi-head Self-Attn → Add&Norm → FFN → Add&Norm
Target Tokens
Decoder × N
↳ Masked Self-Attn → Cross-Attn (enc output) → FFN (each sub-layer + Add&Norm)
Linear+Softmax
Training Pipeline
Loss: Cross-entropy on next-token prediction (CLM) or masked (MLM)
Masking: Causal mask (decoder) prevents attending to future tokens
Backprop: AdamW optimizer, gradient clipping, LR warmup+cosine decay
Mixed Precision: BF16 + gradient checkpointing to save VRAM
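The warmup+cosine LR schedule mentioned above can be sketched in a few lines (the step counts and LR values here are illustrative assumptions, not from the poster):

```python
import math

def lr_at(step, warmup=2000, total=100_000, peak=3e-4, floor=3e-5):
    """Linear warmup to peak LR, then cosine decay down to a floor LR."""
    if step < warmup:
        return peak * step / warmup
    progress = (step - warmup) / (total - warmup)      # 0 -> 1 over the decay phase
    return floor + 0.5 * (peak - floor) * (1 + math.cos(math.pi * progress))
```

The optimizer step would then set `param_group["lr"] = lr_at(step)` before each update.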
03
RAG — Data Ingestion Pipeline 16+ Loaders · Full Preprocessing
Document Loaders (16+)
PDF — PyPDF, PyMuPDF, Unstructured
DOCX/PPT/XLS — python-docx, openpyxl
HTML/Web — BeautifulSoup, Playwright
REST APIs — OpenAPI spec, JSONLoader
Databases — SQLLoader, Mongo, Postgres
Email/Slack — Gmail, Outlook, Slack API
Code — Git repos, GitHub, Jira
S3 / GCS — cloud storage loaders
Markdown / RST — docs sites
CSV / JSON — structured data
YouTube / Video — transcript loaders
Wikipedia — WikipediaLoader
Confluence / Notion — knowledge bases
Arxiv / PubMed — research papers
OCR — Tesseract, AWS Textract
Audio — Whisper transcription
Preprocessing Pipeline
1. 🧹 Cleaning
Strip HTML tags, remove headers/footers/boilerplate, fix encoding issues (UTF-8), normalize whitespace, remove control chars
2. 🔍 Deduplication
MinHash LSH for near-duplicates, exact MD5/SHA hashing, semantic dedup via embedding cosine similarity threshold (>0.95)
3. 🏷️ Metadata Enrichment
Source URL, doc type, author, date, page number, section title, language detection (langdetect), content category, doc ID
4. 🔧 Normalization
Unicode NFC normalization, lowercasing (optional), entity resolution, date standardization, abbreviation expansion
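A minimal stdlib-only sketch of steps 1, 2 and 4 (cleaning, exact-hash dedup, Unicode NFC normalization); MinHash LSH and semantic dedup would need extra libraries:

```python
import hashlib
import re
import unicodedata

def clean(text):
    """NFC-normalize, drop control characters, collapse whitespace."""
    text = unicodedata.normalize("NFC", text)
    text = "".join(ch for ch in text
                   if unicodedata.category(ch)[0] != "C" or ch in "\n\t")
    return re.sub(r"[ \t]+", " ", text).strip()

def dedup_exact(docs):
    """Drop exact duplicates by MD5 of the cleaned text, order-preserving."""
    seen, unique = set(), []
    for doc in docs:
        digest = hashlib.md5(clean(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique
```

Because hashing happens after cleaning, documents that differ only in whitespace or encoding collapse to one copy.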
04
RAG — Chunking Strategies 9 Strategies · WHAT / HOW / USE CASE / PARAMS
1. Recursive Character
WHAT
Split on hierarchy of separators, recursing until size limit
HOW
Try ["\n\n", "\n", ". ", " "] in order, recurse if chunk still too large
USE
General prose docs — maintains paragraph structure
PARAMS
chunk_size=1000, chunk_overlap=200, separators list
2. Token-Based
WHAT
Split at exact token boundaries using tokenizer (tiktoken/HF)
HOW
Tokenize text → slice at N tokens → decode back to string
USE
LLM context windows — precise token budget control
PARAMS
chunk_size=512 tokens, overlap=50 tokens, encoding="cl100k_base"
3. Semantic
WHAT
Embed sentences, split where cosine similarity drops (topic shift)
HOW
Embed consecutive sentences → detect similarity valleys → split
USE
Mixed-topic documents — Wikipedia, reports
PARAMS
threshold=0.3 drop, window=3 sentences, min_chunk=100
4. Sliding Window
WHAT
Fixed-size window sliding with stride over document
HOW
Chunk[i] = text[i*stride : i*stride + size]
USE
Dense technical content where context spans boundaries
PARAMS
window=512, stride=256 (50% overlap), max_chunks=100
5. Parent–Child
WHAT
Small chunks indexed, large parent retrieved on match
HOW
Index child chunks (128 tok); retrieve parent (512 tok) for LLM
USE
Precision + context — legal, medical, long docs
PARAMS
child_size=128, parent_size=512, docstore required
6. Markdown Header
WHAT
Split on markdown H1/H2/H3 heading boundaries
HOW
Regex detect headings → split, inherit header hierarchy in metadata
USE
Documentation sites, READMEs, structured wikis
PARAMS
headers=["#","##","###"], include_header_in_chunk=True
7. Sentence-Level
WHAT
Split at sentence boundaries using spaCy/NLTK
HOW
Sentence tokenize → group N sentences → overlap M
USE
QA tasks where answers fit within 1–3 sentences
PARAMS
sentences_per_chunk=5, overlap=1, min_length=50
8. Proposition
WHAT
LLM extracts atomic, self-contained factual statements
HOW
LLM prompt: "Extract all facts as standalone sentences"
USE
Dense factual QA — scientific papers, knowledge bases
PARAMS
model=gpt-4o-mini, max_props=20, dedup=True
9. Code-Aware
WHAT
AST-based split at function/class boundaries
HOW
Parse AST → split at top-level defs, preserve docstrings
USE
Code search, debugging — GitHub Copilot style RAG
PARAMS
language=python, max_lines=100, include_imports=True
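Strategy 1 (recursive character) can be sketched as a plain-Python splitter; this version omits overlap handling for brevity and falls back to a hard character split when no separator is present:

```python
def recursive_split(text, chunk_size=1000, separators=("\n\n", "\n", ". ", " ")):
    """Split on the coarsest separator first; recurse with finer ones as needed."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    for sep in separators:
        if sep not in text:
            continue
        chunks, buf = [], ""
        for part in text.split(sep):
            candidate = f"{buf}{sep}{part}" if buf else part
            if len(candidate) <= chunk_size:
                buf = candidate
            elif len(part) > chunk_size:       # single piece still too big: recurse
                if buf:
                    chunks.append(buf)
                chunks.extend(recursive_split(part, chunk_size, separators))
                buf = ""
            else:
                if buf:
                    chunks.append(buf)
                buf = part
        if buf:
            chunks.append(buf)
        return chunks
    # no separator found anywhere: hard character split
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
```

Paragraph boundaries (`"\n\n"`) are tried first, so prose structure survives whenever chunks fit.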
05
RAG — Embeddings Dense · Sparse · Hybrid · MMR · Multi-vector
Dense Embedding Models
Model | Dims | Notes
text-embedding-3-small | 1536 | OpenAI, cost-efficient, strong
text-embedding-3-large | 3072 | OpenAI, best quality
BGE-M3 | 1024 | Open source, multilingual SOTA
E5-Mistral-7B | 4096 | LLM-based, top MTEB scores
Cohere Embed v3 | 1024 | Optimized for retrieval
Jina v3 | 1024 | Task-specific, 8192 ctx
Sparse — BM25
score(D,Q) = Σᵢ IDF(qᵢ) · f(qᵢ,D)·(k₁+1) / (f(qᵢ,D) + k₁·(1 − b + b·|D|/avgdl))
Term frequency weighted by document length, great for keyword-heavy queries
Params: k1=1.2–2.0, b=0.75, no ML training needed
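The BM25 formula above translates directly to code; this sketch assumes documents are already tokenized into word lists and uses the common `log((N − df + 0.5)/(df + 0.5) + 1)` IDF variant:

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Okapi BM25 score of each doc against the query (docs = lists of tokens)."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    df = Counter()                              # document frequency per term
    for d in docs:
        df.update(set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for q in query_terms:
            if tf[q] == 0:
                continue
            idf = math.log((N - df[q] + 0.5) / (df[q] + 0.5) + 1)
            norm = tf[q] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[q] * (k1 + 1) / norm
        scores.append(score)
    return scores
```

Higher term frequency raises the score sub-linearly (saturating via k1), while the `b` term penalizes longer documents.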
Hybrid Retrieval — RRF Fusion
RRF(d) = Σ 1 / (k + rank_i(d))    k=60
Combine dense + BM25 rankings without score normalization
MMR — Maximal Marginal Relevance: balance relevance vs diversity
MMR(d) = λ·sim(q,d) − (1-λ)·max[sim(d,dⱼ)] for already-selected dⱼ
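RRF fusion is only a few lines; rank_i(d) is a document's 1-based rank in ranked list i, and documents missing from a list simply contribute nothing:

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion over multiple ranked lists of doc ids."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # highest fused score first
    return sorted(scores, key=scores.get, reverse=True)
```

Because only ranks are used, dense cosine scores and BM25 scores never need to be normalized onto a common scale.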
Multi-Vector Representations
ColBERT: per-token vectors, late interaction MaxSim
SPLADE: sparse expansion via MLM head, hybrid of BM25+dense
Summary + Dense: embed chunk summary for retrieval, return full chunk
Hypothetical Questions: LLM generates Qs → embed Qs → store with chunk
06
Vector Databases + Indexing Pinecone · Weaviate · FAISS · HNSW · IVF
Vector Databases
DB | Strength | Best For
Pinecone | Managed, fast | Production SaaS
Weaviate | GraphQL API, hybrid | Rich metadata filter
FAISS | In-memory, CPU+GPU | Research, prototyping
Chroma | Lightweight, local | Dev/local RAG
pgvector | PostgreSQL native | Existing PG infra
Qdrant | Payload filtering | Metadata-heavy search
Milvus | Open, scalable | Billion-scale search
HNSW Algorithm
Hierarchical Navigable Small World graph — O(log N) search
M=16 connections per node, efConstruction=200 build-time
ef=100 query-time accuracy/speed tradeoff
IVF (Inverted File)
K-means clusters (nlist=1024) → search top nprobe clusters
IVF+PQ: product quantization compresses vectors 4–8× for RAM
Billion-scale: FAISS IVF256,PQ48 = 48 bytes/vec
ScaNN + Annoy
ScaNN (Google) — anisotropic quantization, top recall/QPS on ann-benchmarks
Annoy (Spotify) — forest of binary trees, static index, read-heavy
07
Advanced Retrieval Techniques HyDE · Step-Back · Multi-Query · Decomposition
HyDE — Hypothetical Document Embeddings
HOW
LLM generates hypothetical answer → embed answer → retrieve real docs by proximity
WHY
Bridges query–document embedding space mismatch; better semantic alignment
User Query
LLM: Generate Hypothetical Doc
Embed Hypothesis
Retrieve Real Docs
Step-Back Prompting
Reframe specific question → abstract principle question → retrieve both
"What physics principle governs X?" before retrieving X specifics
Improves grounding for reasoning-heavy queries
Multi-Query
LLM generates N paraphrased versions of original query
Retrieve for each variant → union + deduplicate results
Covers vocabulary mismatch, broader recall
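The multi-query flow above can be sketched as follows; `paraphrase_fn` (an LLM paraphraser) and `retrieve_fn` (any retriever returning doc ids) are stand-ins you would supply:

```python
def multi_query_retrieve(query, paraphrase_fn, retrieve_fn):
    """Retrieve for the query plus its paraphrases; union + dedupe, order-preserving."""
    variants = [query] + list(paraphrase_fn(query))
    seen, merged = set(), []
    for variant in variants:
        for doc_id in retrieve_fn(variant):
            if doc_id not in seen:
                seen.add(doc_id)
                merged.append(doc_id)
    return merged
```

Deduplicating by id while preserving first-seen order keeps the original query's hits ranked ahead of variant-only hits.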
Query Expansion
Add synonyms, related terms (WordNet, LLM) to query
Pseudo-Relevance Feedback: expand from top-K initial results
Query Decomposition
Split complex query into sub-questions → answer each → synthesize
Self-Query Retriever: LLM extracts metadata filters from query (date > 2023, category=finance)
08
Re-ranking Cross-Encoders · ColBERT Deep Dive
Cross-Encoders
Concatenate query+doc → full attention → single relevance score
Much more accurate than bi-encoders but O(N) latency per doc
Models: ms-marco-MiniLM-L6, BGE-reranker, Cohere Rerank API
Strategy: retrieve top-100 (bi-encoder) → rerank → take top-5
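The retrieve-then-rerank strategy is just a two-stage funnel; here `retriever` stands in for a cheap bi-encoder/ANN stage and `score_fn` for a cross-encoder relevance scorer:

```python
def retrieve_then_rerank(query, retriever, score_fn, n_retrieve=100, top_k=5):
    """Cheap high-recall retrieval, then expensive high-precision reranking."""
    candidates = retriever(query, n_retrieve)           # bi-encoder / ANN stage
    reranked = sorted(candidates, key=lambda doc: score_fn(query, doc), reverse=True)
    return reranked[:top_k]
```

The O(N) cross-encoder cost only applies to the 100 candidates, not the whole corpus.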
ColBERT — Deep Dive
Architecture: BERT encodes Q and D independently → per-token vectors
Late Interaction: No full cross-attn at query time (fast!)
MaxSim(Q,D) = Σ_q max_d ( q·dᵀ )
Each query token matches its best passage token → sum scores
PLAID: Centroid interaction pruning for billion-scale ColBERT
ColBERTv2: Residual compression 64→32 bytes/token, quantized
colbert-v2.0 RAGatouille PLAID engine
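The MaxSim operator above is simple enough to write out; this toy version takes per-token vectors as plain lists (real ColBERT batches this on GPU):

```python
def maxsim(query_vecs, doc_vecs):
    """ColBERT late interaction: sum over query tokens of the best doc-token dot product."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)
```

Because Q and D are encoded independently, doc vectors can be precomputed and indexed; only this cheap max-and-sum runs at query time.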
09
Context Optimization Compression · MMR · Lost-in-Middle
Contextual Compression
LLMChainExtractor: LLM extracts only relevant sentences from chunk
EmbeddingsFilter: Keep only sentences above cosine similarity threshold
DocumentCompressorPipeline: Chain multiple compressors
Lost-in-the-Middle Problem
LLMs recall beginning + end best; middle chunks are "lost"
Fix 1: Place most relevant chunks at start/end of context
Fix 2: Reduce context window — fewer, higher-quality chunks
Fix 3: Recursive summarization of long contexts
Fix 4: LongContext models (Gemini 1M, Claude 200K)
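Fix 1 can be sketched as a reordering pass (similar in spirit to LangChain's long-context reorder): chunks arrive best-first and leave with the strongest ones at the edges of the context:

```python
def reorder_for_llm(chunks_best_first):
    """Place top-ranked chunks at the start and end; weakest land in the middle."""
    front, back = [], []
    for i, chunk in enumerate(chunks_best_first):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]
```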
Context Building Strategies
Contextual Retrieval: Prepend doc-level summary to each chunk before embed
Context-Aware Reranking: Score with full retrieved context, not individual chunks
Conversation history integration for multi-turn RAG
10
Full Query Pipeline — 9 Stages User Query → Response
User Query Input
Natural language query received → log, timestamp, session_id attach
Query Preprocessing
Clean input → HyDE / expansion / decomposition / self-query metadata extraction
Hybrid Retrieval
Dense vector search (top-100) + BM25 sparse (top-100) → RRF fusion → top-50
Metadata Filtering
Apply pre-filter (date range, source, category) before or after ANN search
Cross-Encoder Re-ranking
Score top-50 with cross-encoder → sort → select top-K (K=5–10)
Context Optimization
Compress chunks → MMR diversity filter → order (best first/last) → fit context window
Prompt Assembly
System prompt + context chunks + conversation history + user query → final prompt
LLM Generation
Streaming inference → temperature=0.1, top_p=0.9, max_tokens=1024
Response + Citations
Return answer + source citations + confidence score → log for RAGAS evaluation
11
Graph RAG Microsoft Style · Knowledge Graphs
Architecture
Raw Docs
Entity Extraction LLM
Knowledge Graph
Community Detection
Community Reports
Graph Traversal Search
Key Concepts
Entities: Named nodes (Person, Org, Concept) extracted by LLM
Relations: Typed edges between entities with weights
Communities: Leiden algorithm clusters → summarized as text
Local search: Entity-centric — find neighbors, traverse edges
Global search: Map-reduce over community reports for holistic QA
DRIFT search: Dynamic Reasoning and Inference with Flexible Traversal — blends local + global search iteratively
Tools: Microsoft GraphRAG, LightRAG, Neo4j GraphRAG
When to Use
Multi-hop reasoning across entities (who knows whom, causal chains)
Corpus-level summarization, thematic analysis
12
Agentic RAG Self-RAG · CRAG · FLARE · ReAct
Self-RAG
LLM decides IF to retrieve (Retrieve token), evaluates relevance (IsRel), and checks support (IsSup)
Trained with special reflection tokens as discrete actions
CRAG — Corrective RAG
Evaluator grades retrieved docs: Correct / Ambiguous / Incorrect
If incorrect → web search fallback → knowledge refinement before generation
Adaptive RAG
Classifies query complexity → routes to: No RAG / Single-step RAG / Multi-step RAG
FLARE
Forward-Looking Active REtrieval — retrieves when model is uncertain (low probability tokens)
ReAct + Multi-Agent
ReAct: Reason → Act → Observe loop with tool calls
Multi-agent: Orchestrator delegates to specialist RAG agents (legal, financial, HR)
13
LLM Routing Semantic · Model · MoE · Tool
Semantic Router
Embed query → cosine sim to route exemplars → classify intent → select handler
Tool: semantic-router library (Aurelio AI) — fast, no LLM needed
Model Routing
Simple query → GPT-3.5 / Haiku (cheap, fast)
Complex reasoning → GPT-4o / Claude Opus (capable)
LLM-router: meta-model predicts best model per query
RouteLLM: trained preference-based router (LMSYS)
Mixture of Experts (MoE)
Gating network assigns tokens to top-K expert FFN layers
Sparse activation — only K/N experts fire per token (efficient)
Models: Mixtral 8×7B, GPT-4 (rumored MoE), Grok-1
Auxiliary load-balancing loss prevents expert collapse
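Top-K gating can be sketched as a softmax restricted to the K highest expert logits (a simplification: real MoE layers add noise and a load-balancing loss during training):

```python
import math

def top_k_gate(logits, k=2):
    """Sparse MoE gating: softmax over only the top-k expert logits."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    m = max(logits[i] for i in top)                 # stability shift
    exps = {i: math.exp(logits[i] - m) for i in top}
    total = sum(exps.values())
    return {i: exps[i] / total for i in top}        # expert index -> routing weight
```

The token's output is then the weight-averaged output of just those K expert FFNs; the other N−K experts never run.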
Tool Routing
LLM decides which tool to call (search, calculator, code, DB)
Function calling (OpenAI) / tool_use (Anthropic) as structured routing
14
LangChain LCEL · Chains · Agents · Memory
LCEL — LangChain Expression Language
Declarative pipe syntax: chain = prompt | model | parser
Streaming, async, batch built-in via Runnable interface
RunnablePassthrough, RunnableParallel for fan-out patterns
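The pipe syntax is ordinary operator overloading; this is a toy model of the pattern, not LangChain's actual Runnable class (which adds streaming, async, and batch):

```python
class Step:
    """Toy model of the LCEL pipe pattern: `a | b` composes a then b."""
    def __init__(self, fn):
        self.fn = fn
    def __or__(self, other):
        return Step(lambda x: other.fn(self.fn(x)))
    def invoke(self, x):
        return self.fn(x)

prompt = Step(lambda q: f"Answer briefly: {q}")
model = Step(str.upper)                  # stand-in for an LLM call
parser = Step(str.strip)

chain = prompt | model | parser
```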
Chain Types
RetrievalQA: retriever → stuffing/map_reduce/refine → LLM
ConversationalRetrievalChain: chat history + RAG
MapReduceDocumentsChain: parallel map → reduce for long docs
RefineChain: sequential refinement over docs
Agents + Memory
Agent types: Zero-shot ReAct, OpenAI Functions, Structured Input, Plan-Execute
Memory: ConversationBufferMemory, ConversationSummaryMemory, VectorStoreRetrieverMemory, EntityMemory
Tools: Search, Python REPL, SQL, Wikipedia, custom functions
15
LangGraph Nodes · State · Checkpointing · Multi-Agent
Core Concepts
StateGraph: Directed graph where nodes are Python functions transforming TypedDict state
Nodes: graph.add_node("intake", intake_agent) — pure functions: state_in → state_out
Edges: add_edge, add_conditional_edges for branching
START/END: Built-in entry/exit nodes for graph lifecycle
Parallel fan-out: Multiple edges from one node → concurrent execution
LoanIQ Graph Architecture
START → intake_agent
Parse loan application, validate fields
route_after_intake()
Conditional: valid → parallel | invalid → END
ratio_calc ‖ policy_retrieval ‖ compliance
Fan-out parallel — Annotated[list, operator.add] merges
underwriting_agent → decision → audit
Final decisioning, audit trail, END
State Management
TypedDict state: LoanApplicationState — type-safe, immutable per-turn
Annotated reducers: Annotated[list, operator.add] for parallel-safe list merging
State isolation: Each node receives full state snapshot, returns partial update
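How a reducer merges parallel partial updates can be modeled in a few lines; this is a toy of the mechanism, not LangGraph's internal implementation:

```python
import operator
from typing import Annotated, TypedDict

class LoanState(TypedDict):
    findings: Annotated[list, operator.add]   # parallel branches append here
    score: float                              # plain field: last write wins

def apply_update(state, update, reducers):
    """Merge a node's partial update into state, applying reducers where declared."""
    merged = dict(state)
    for key, value in update.items():
        reducer = reducers.get(key)
        merged[key] = reducer(merged[key], value) if reducer else value
    return merged
```

With `operator.add` as the reducer, two fan-out branches can each return `{"findings": [...]}` and the lists concatenate instead of one branch clobbering the other.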
Checkpointing + Persistence
MemorySaver: In-memory, dev/test only
SqliteSaver: Local persistence, single-user
PostgresSaver: Production, multi-tenant
Thread IDs: config={"configurable":{"thread_id":"loan_123"}}
Time travel: Replay from any checkpoint snapshot
Human-in-loop: interrupt_before=["decision_agent"] for approval gates
Multi-Agent Patterns
Supervisor: LLM orchestrator routes to worker agents
Swarm: Agents hand off to each other peer-to-peer
Subgraphs: Compose graphs within graphs (modular)
LangGraph Platform: Deploy, scale, stream via REST+WS
16 – 17
Fine-Tuning LoRA · QLoRA · RLHF · DPO · Alignment · Serving
Methods
MethodVRAMApproach
Full FTHighAll weights updated, best quality, most data
LoRALowLow-rank ΔW=BA (r=8–64) added to frozen weights
QLoRALowest4-bit NF4 quantized base + bf16 LoRA adapters
Prefix/Prompt FTMinimalOnly prepended soft prompt tokens trained
LoRA: W' = W₀ + α/r · B·A  |  B∈ℝ^(d×r), A∈ℝ^(r×k)
Data Requirements
Format: instruction/input/output triplets, chat format, preference pairs
Size: Task-specific: 100–10K examples. General: 100K+
Quality > Quantity: Self-Instruct, Alpaca, ShareGPT style
Tools: Unsloth (fast LoRA), Axolotl, TRL, HF Transformers
Training + Alignment
SFT: Supervised fine-tuning on demonstrations (cross-entropy loss)
RLHF: 1) SFT 2) Train reward model on human preferences 3) PPO optimize policy vs reward − β·KL(π||π_ref)
DPO: Direct Preference Optimization — no RL needed. Implicit reward via preference pairs. β-regulated KL penalty baked into loss
ORPO: Combined SFT+alignment in one pass, no reference model
Constitutional AI: Self-critique + revision against principles (Anthropic)
DPO: L = -log σ( β log π(y_w)/π_ref(y_w) - β log π(y_l)/π_ref(y_l) )
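The DPO loss above for a single preference pair, written out (inputs are sequence log-probabilities under the policy π and frozen reference π_ref):

```python
import math

def dpo_loss(pi_logp_w, pi_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one (chosen y_w, rejected y_l) preference pair."""
    margin = beta * ((pi_logp_w - ref_logp_w) - (pi_logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))   # -log(sigmoid(margin))
```

The loss falls as the policy raises the chosen response's log-prob relative to the reference more than the rejected one's — the implicit reward, no reward model needed.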
Serving Considerations
vLLM: PagedAttention, continuous batching, OpenAI-compatible API
GGUF/llama.cpp: CPU inference with Q4_K_M quantization
AWQ/GPTQ: 4-bit GPU-efficient weight quantization post-FT
Speculative decoding: Draft model + verifier for 2–3× throughput
Merge adapters: merge_and_unload() → single model for serving
18
Production Layer API · Caching · Guardrails · Safety
API Layer
FastAPI — async, OpenAPI auto-docs, Pydantic validation
Streaming: Server-Sent Events (SSE) via EventSourceResponse
Auth: JWT/OAuth2, API key middleware, rate limiting (slowapi)
LoanIQ stack: FastAPI + uvicorn + LangGraph compiled graph singleton
Horizontal scaling: Kubernetes, load balanced, stateless API pods
Caching Layers
Semantic cache: Embed query → check Redis/Faiss for similar past queries (cosine >0.97)
Exact cache: MD5 hash of (query+context) → Redis TTL cache
Embedding cache: Store computed embeddings for re-used chunks in Redis
LangChain GPTCache: Drop-in semantic caching layer
Typical savings: 40–80% LLM call reduction on repeated queries
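The exact-cache layer reduces to hashing (query + context) into a key-value store; this toy keeps the store in a dict, where production would use Redis with a TTL:

```python
import hashlib

class ExactCache:
    """Exact-match LLM response cache keyed by MD5(query|context). Toy: no TTL, no eviction."""
    def __init__(self):
        self._store = {}
    def _key(self, query, context):
        return hashlib.md5(f"{query}|{context}".encode("utf-8")).hexdigest()
    def get(self, query, context):
        return self._store.get(self._key(query, context))
    def put(self, query, context, answer):
        self._store[self._key(query, context)] = answer
```

The semantic cache works the same way but looks up by embedding similarity (cosine above a threshold) instead of an exact hash.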
Guardrails + Safety
Input guardrails: Prompt injection detection, topic filter, PII redaction
Output guardrails: Hallucination detector, toxicity filter (Perspective API), citation validator
NeMo Guardrails: Colang policy language, dialog rails, fact-checking rail
Llama Guard: Fine-tuned safety classifier for input/output
PII: Presidio (Microsoft) for detect + anonymize before embedding
19
Monitoring & Evaluation RAGAS · Observability · Feedback · Human-in-Loop
RAGAS Evaluation Metrics
Metric | Measures | How
Faithfulness | Hallucination rate | Claims in answer supported by context? LLM-judged
Answer Relevancy | On-topic quality | Reverse-generate Qs from answer → cosine sim to original Q
Context Precision | Retrieval signal/noise | How much retrieved context is actually relevant?
Context Recall | Coverage | Ground-truth claims found in retrieved context?
Context Relevancy | Retrieval accuracy | Retrieved chunks relevant to the query?
Answer Correctness | Factual accuracy | Semantic similarity + factual overlap vs ground truth
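Set-based toy versions of context precision and recall show the intent of the metrics; RAGAS itself uses LLM judgment to decide relevance and claim support rather than the exact membership / `supports` predicate assumed here:

```python
def context_precision(retrieved, relevant):
    """Share of retrieved chunks that are relevant (signal vs noise)."""
    if not retrieved:
        return 0.0
    return sum(1 for chunk in retrieved if chunk in relevant) / len(retrieved)

def context_recall(retrieved, ground_truth_claims, supports):
    """Share of ground-truth claims supported by some retrieved chunk."""
    if not ground_truth_claims:
        return 0.0
    found = sum(1 for claim in ground_truth_claims
                if any(supports(chunk, claim) for chunk in retrieved))
    return found / len(ground_truth_claims)
```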
Observability Stack
LangSmith: Trace every LangChain/LangGraph run, latency, token cost per node
Arize Phoenix: LLM observability, embedding drift, retrieval quality dashboards
OpenTelemetry: OpenInference semantic conventions for LLM spans
Prometheus + Grafana: p50/p95/p99 latency, token/s, cost/query, error rate
Key metrics: TTFT (time-to-first-token), E2E latency, retrieval hit rate, faithfulness score
Feedback Loops + Human-in-Loop
Thumbs up/down: Log preference signal → fine-tune reward model (RLHF loop)
A/B testing: Route % of traffic to candidate model → compare RAGAS scores
LangGraph HIL: interrupt_before nodes pause for human approval (LoanIQ: underwriting decisions)
Active learning: Flag low-confidence responses for human annotation → training data
20
End-to-End System Diagram — Swimlane User → Backend → Retrieval → LLM → Monitoring
USER
Type Query
HTTPS POST /ask
Stream Response
Rate Answer 👍/👎
BACKEND
FastAPI Auth
Cache Check
Guardrails Input
Query Preprocess
Prompt Assembly
Guardrails Output
Cache Set
RETRIEVAL
Embed Query
ANN Search (pgvector)
BM25 Sparse
RRF Fusion
Cross-Encoder Rerank
Top-K Chunks
LLM
Router: Model Select
OpenAI / Claude API
Streaming Tokens
Tool Calls (optional)
Final Answer + Citations
MONITOR
LangSmith Trace
RAGAS Score
Prometheus Metrics
Grafana Dashboard
Feedback → RLHF
21
Reference — Model + Tool Selection Guide When to use which
Task / Scenario | Recommended Model / Tool | Rationale
Production RAG (cost-sensitive) | GPT-4o-mini + text-embedding-3-small | 80% quality at 10% cost vs GPT-4o
Complex reasoning / agent | Claude 3.5 Sonnet / GPT-4o | Best long-context, tool use, reasoning
Local / private deployment | Llama 3.1 8B / Mistral 7B via Ollama | No data leaves premises, free
Code generation | Claude 3.5 Sonnet / DeepSeek Coder | Top HumanEval scores
Embeddings (best quality) | text-embedding-3-large / E5-Mistral-7B | Highest MTEB BEIR scores
Embeddings (open source) | BGE-M3 / Jina v3 | Multilingual, self-hosted, strong
Re-ranking | Cohere Rerank v3 / BGE-reranker-v2 | Best retrieval precision gain
Orchestration | LangGraph (stateful) / LangChain (chains) | Cycles + state = LangGraph; simple pipelines = LC
Vector DB (managed) | Pinecone / Weaviate | Zero ops, SOC2, good SLAs
Vector DB (open source) | pgvector (existing PG) / Qdrant | Collocate with app DB / rich filtering
Fine-tuning (low resource) | QLoRA with Unsloth | 4-bit + LoRA = ~70% VRAM reduction
Graph RAG | Microsoft GraphRAG / LightRAG | Multi-hop reasoning, thematic summaries
22
Reference — Hyperparameters + Common Pitfalls Chunk Size · Top-K · Temperature · Mistakes
Critical Hyperparameters
Parameter | Recommended | Impact
chunk_size | 512–1024 tokens | Too small = loss of context; too large = noise + lost-in-middle
chunk_overlap | 10–20% of chunk | Prevents answer split across boundaries
top_k retrieval | 20–100 (rerank to 5) | High recall → rerank for precision; too small = missed answers
top_k final | 3–7 chunks | Context window budget; quality vs completeness tradeoff
temperature | 0.0–0.2 (RAG) · 0.7–1.0 (creative) | Low T = deterministic, factual; high T = diverse, creative
top_p (nucleus) | 0.9 | Truncates low-prob tokens; tune either top_p or temperature, not both
similarity threshold | 0.7–0.8 | Filter irrelevant retrieved chunks before LLM
embed batch_size | 64–256 | Throughput vs memory; larger = faster embedding
LoRA rank (r) | 16–64 | Higher r = more capacity, more memory; r=16 usually sufficient
LoRA alpha | 2× rank | Effective adapter scale; alpha/r = scale factor
⚠ Common Pitfalls
Hallucination: LLM answers without grounding → Fix: faithfulness guardrail, lower temp, explicit "only use context" prompt
Chunk too large: Embedding averages out meaning → retrieval misses → Fix: smaller chunks + parent retrieval
No reranking: ANN has recall errors → Fix: always add cross-encoder reranking stage
Stale index: Docs updated but not re-indexed → Fix: delta indexing pipeline with doc hash change detection
Query-doc mismatch: Query is short, doc is long → Fix: HyDE, multi-query, or doc summary embeddings
Ignoring metadata: Not filtering by date/source → Fix: always add metadata filter layer to retrieval
Lost-in-middle: Relevant chunk buried in context → Fix: reorder (best first/last), reduce K
No deduplication: Repeated chunks inflate context, waste tokens → Fix: MD5/cosine dedup in preprocessing
Embedding model mismatch: Different model for indexing vs query → Fix: Always use same model for both
No eval loop: Shipping RAG without RAGAS baseline → Fix: run offline RAGAS eval before every deployment
RAG Quality Checklist
Preprocessing removes noise before chunking
Chunk strategy matches doc structure (code → AST, prose → recursive)
Hybrid retrieval (dense + BM25) with RRF fusion
Cross-encoder reranking applied after ANN
Contextual compression to remove noise from chunks
Lost-in-middle mitigation (best chunks first/last)
Guardrails on input (injection) and output (faithfulness)
RAGAS metrics tracked per deployment (target: F>0.8, AR>0.8)
Semantic cache for repeated queries
Observability: traces, latency, cost per query in Grafana
Human-in-loop for high-stakes decisions (LoanIQ: approval gate)
Regular index refresh + deduplication pipeline
LoanIQ Tech Stack Summary
FastAPI LangGraph pgvector text-embedding-3-small BM25+RRF Cross-Encoder RAGAS LangSmith PostgresSaver NeMo Guardrails PolicyAgent · ComplianceAgent