RAG — Retrieval Augmented Generation

Module 2: Retrieval Pipeline

What is the Retrieval Pipeline? After documents are ingested into a vector store, the retrieval pipeline is responsible for finding the most relevant chunks for a given user query. It's the search engine of your RAG system.

LoanIQ's retrieval pipeline is a 5-stage process (dense and sparse search run in parallel as one stage): Query Rewriting → Dense Search ∥ Sparse Search → RRF Fusion → MMR → Reranking

LoanIQ Advanced Retrieval Pipeline
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
User Query
    │
    ▼
[Stage 1] Query Rewriting
  Multi-query expansion → 3 query variants
    │
    ▼ (run all 3 queries)
[Stage 2a] Dense Retrieval ─────┐
  Embed → pgvector cosine search │
  Returns: Top-15 chunks         ├──► [Stage 3] RRF Fusion
                                 │    Merge + deduplicate
[Stage 2b] Sparse Retrieval ────┘    Returns: Top-10 chunks
  BM25 keyword search                   │
  Returns: Top-15 chunks                ▼
                                 [Stage 4] MMR Filter
                                  Diversity selection
                                  Returns: 7 diverse chunks
                                       │
                                       ▼
                                 [Stage 5] Cross-Encoder Rerank
                                  Deep scoring of each chunk
                                  Returns: Top-5 final chunks
                                       │
                                       ▼
                                  LLM Generation
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Sub-topic 1: Dense Retrieval

What, Why, How

What: Dense retrieval converts the user's query into an embedding vector and finds the K chunks with the highest cosine similarity in the vector database.

Why: It finds semantically similar content even when the exact words don't match. "What's the maximum debt ratio?" will retrieve chunks about "DTI limit" even though the query uses different words.

How:
  1. Embed the query using the same model used during ingestion (critical!)
  2. Run a cosine similarity search in pgvector
  3. Return top-K results sorted by similarity score

In LoanIQ Project

From pdf_ingestion.py → PolicyRetrieverService:

async def retrieve_relevant_policy(self, query, state, loan_type, top_k):
    # Step 1: Embed the query (same provider as ingestion!)
    query_embedding = await router.embed_query(query)

    # Step 2: Vector search with state/loan_type filtering
    results = await self._vector_search(session, query_embedding, state, loan_type, top_k)

SQL using pgvector's <=> cosine distance operator:

SELECT *, 1 - (pc.embedding <=> CAST(:embedding AS vector)) AS similarity_score
FROM policy_chunks pc
JOIN policy_documents pd ON pc.document_id = pd.id
WHERE
    pd.is_active = true
    AND pc.applicable_states @> CAST(:state_filter AS jsonb)
    AND 1 - (pc.embedding <=> CAST(:embedding AS vector)) >= :threshold  -- 0.7
ORDER BY pc.embedding <=> CAST(:embedding AS vector)
LIMIT :top_k

Implementation

from sqlalchemy import text as sql_text

async def dense_search(query: str, top_k: int = 15) -> list[dict]:
    query_embedding = await embed_query(query)

    sql = sql_text("""
        SELECT 
            chunk_id,
            content,
            1 - (embedding <=> CAST(:embedding AS vector)) as similarity
        FROM policy_chunks
        WHERE 1 - (embedding <=> CAST(:embedding AS vector)) >= 0.7  -- LoanIQ's similarity threshold
        ORDER BY embedding <=> CAST(:embedding AS vector)
        LIMIT :top_k
    """)

    result = await session.execute(sql, {
        "embedding": str(query_embedding),
        "top_k": top_k
    })
    # Row objects need ._mapping for dict() conversion (SQLAlchemy 1.4+)
    return [dict(row._mapping) for row in result.fetchall()]

Key Concepts

Bi-encoder architecture: The query and each document are encoded separately by the same model. This is fast because you pre-compute all document embeddings once. But it's less accurate than reading query+document together (that's the reranker's job).

Similarity threshold: LoanIQ uses >= 0.7 — chunks with similarity below 0.7 are filtered out. This prevents retrieving vaguely related content. Too high → misses relevant chunks. Too low → includes noise.

Top-K tradeoff: Higher K = better recall (less likely to miss the answer) but more noise passed downstream, slower reranking, larger LLM context.


Sub-topic 2: Sparse Retrieval — BM25

What, Why, How

What: BM25 (Best Match 25) is a classic keyword-based ranking algorithm. It scores documents based on how often query terms appear in them, with adjustments for document length and term rarity.

Why: Dense retrieval is great for semantic similarity but terrible at exact matches. If a user asks about "Appendix B, Section 4.2.3" or "FICO score 680", the dense retriever might return chunks about general credit requirements. BM25 finds the exact document containing "4.2.3" or "680."

How:

BM25 Score Formula:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
                                tf(t,d) × (k₁ + 1)
score(d,q) = Σ  IDF(t) × ──────────────────────────────────────
            t∈q          tf(t,d) + k₁ × (1 - b + b × |d|/avgdl)

Where:
  tf(t,d)  = frequency of term t in document d
  IDF(t)   = log((N - df(t) + 0.5) / (df(t) + 0.5))  [rare terms score higher]
  k₁=1.5   = TF saturation parameter
  b=0.75   = length normalization
  |d|      = document length
  avgdl    = average document length in corpus
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

TF Saturation: In BM25, TF is saturated — a word appearing 10 times doesn't score 10x more than a word appearing once. This prevents keyword spam. In TF-IDF, 10 occurrences = 10x score (gameable). In BM25, it asymptotically approaches a ceiling.
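The saturation effect is easy to see numerically. A standalone sketch, using the simplified TF component tf / (tf + K) (dropping the constant (k₁+1) numerator factor, which doesn't change the shape) and assuming an average-length document so K = k₁:

```python
# BM25's TF component saturates; raw TF-IDF grows linearly.
# K = k1 * (1 - b + b * |d|/avgdl); for an average-length doc, K = k1.
def bm25_tf_component(tf: float, k1: float = 1.5) -> float:
    return tf / (tf + k1)

for tf in (1, 2, 5, 10, 100):
    print(tf, round(bm25_tf_component(tf), 3))
# 1 → 0.4, 10 → 0.87, 100 → 0.985: the 100th occurrence adds almost
# nothing, while raw TF would score it 100x higher.
```

Note how the gap between tf=5 and tf=10 is already much smaller than the gap between tf=1 and tf=5 — the ceiling is approached quickly.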

When sparse beats dense:
  - Exact numbers: "LTV 97.5%"
  - Names: "Freddie Mac 5702"
  - Codes: "HMDA regulation Z"
  - Acronyms: "CLTV", "HLTV", "PMI"
  - Section references: "Appendix A Section 3"

In LoanIQ Project

BM25 runs over PolicyChunk.content using the rank_bm25 library:

from rank_bm25 import BM25Okapi

class BM25Retriever:
    def __init__(self, chunks: list[str]):
        tokenized = [doc.lower().split() for doc in chunks]
        self.bm25 = BM25Okapi(tokenized)
        self.chunks = chunks

    def search(self, query: str, top_k: int = 15) -> list[tuple[int, float]]:
        tokenized_query = query.lower().split()
        scores = self.bm25.get_scores(tokenized_query)
        top_indices = scores.argsort()[-top_k:][::-1]
        return [(idx, float(scores[idx])) for idx in top_indices]

Limitation

BM25 is in-memory in LoanIQ. The BM25Okapi index is rebuilt on startup from all policy chunks. This is fine for ~500-1000 chunks but would break at 100K+ chunks.

Production alternatives:
  - Elasticsearch/OpenSearch for persistent BM25 at scale
  - PostgreSQL full-text search (tsvector) — built-in, no extra service


Sub-topic 3: RRF Fusion

What, Why, How

What: Reciprocal Rank Fusion (RRF) is an algorithm that merges ranked lists from multiple retrievers into a single unified ranking. It takes the rank (position) of each document in each list and combines them.

Why: Dense retrieval and BM25 each have blind spots. RRF combines their results without needing to know the actual scores (which are on different scales and incomparable — cosine similarity vs BM25 scores can't simply be added).

How:

RRF Formula:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
                     1
RRF_score(d) = Σ  ───────
              r   k + rank_r(d)

Where:
  rank_r(d) = position of document d in ranked list r (1-indexed)
  k = 60    (constant that dampens the effect of top positions)

Example:
  Chunk A: rank 1 in dense, rank 5 in BM25
  → RRF = 1/(60+1) + 1/(60+5) = 0.01639 + 0.01538 = 0.03177

  Chunk B: rank 3 in dense, rank 3 in BM25  
  → RRF = 1/(60+3) + 1/(60+3) = 0.01587 + 0.01587 = 0.03174

  Both score similarly — appeared in both lists = strong signal!
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Why k=60? The constant k prevents the top-1 result from dominating too much. Without it, rank 1 = 1.0 and rank 2 = 0.5 — a huge gap. With k=60, rank 1 = 1/61 = 0.0164 and rank 2 = 1/62 = 0.0161 — much smaller difference. This makes the fusion more robust.
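The arithmetic in the example above, and the "consistency beats a single top rank" effect, can be checked with a few lines of standalone Python:

```python
# RRF score for one document, given its rank in each retriever's list.
def rrf(ranks: list[int], k: int = 60) -> float:
    return sum(1 / (k + r) for r in ranks)

chunk_a = rrf([1, 5])   # rank 1 in dense, rank 5 in BM25
chunk_b = rrf([3, 3])   # rank 3 in both lists
print(round(chunk_a, 4), round(chunk_b, 4))  # 0.0318 0.0317

# Consistency effect: rank 5 in BOTH lists beats rank 1 in only one.
assert rrf([5, 5]) > rrf([1])
```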

In LoanIQ Project

def reciprocal_rank_fusion(
    dense_results: list[dict],    # each: {"chunk_id": ..., ...}, sorted by dense score
    sparse_results: list[dict],   # each: {"chunk_id": ..., ...}, sorted by BM25 score
    k: int = 60,
) -> list[dict]:
    rrf_scores: dict[str, float] = {}
    chunk_map: dict[str, dict] = {}

    # Score from dense ranking
    for rank, chunk in enumerate(dense_results, start=1):
        cid = chunk["chunk_id"]
        rrf_scores[cid] = rrf_scores.get(cid, 0) + 1 / (k + rank)
        chunk_map[cid] = chunk

    # Score from BM25 ranking
    for rank, chunk in enumerate(sparse_results, start=1):
        cid = chunk["chunk_id"]
        rrf_scores[cid] = rrf_scores.get(cid, 0) + 1 / (k + rank)
        if cid not in chunk_map:
            chunk_map[cid] = chunk

    # Sort by RRF score, return merged list
    sorted_ids = sorted(rrf_scores, key=lambda c: rrf_scores[c], reverse=True)
    return [{"chunk": chunk_map[cid], "rrf_score": rrf_scores[cid]} 
            for cid in sorted_ids]

Why RRF over Weighted Sum?

A weighted sum (e.g. 0.6 × cosine + 0.4 × BM25) requires normalizing scores onto a common scale. Cosine similarity lives in [-1, 1] while BM25 scores are unbounded and corpus-dependent, so any normalization (min-max, z-score) is fragile and must be re-tuned whenever the corpus changes. RRF sidesteps the problem by using only ranks, which are always comparable, and needs no tuning beyond the single constant k.

Sub-topic 4: Query Rewriting

What, Why, How

What: Instead of searching with the raw user question, query rewriting generates multiple variations of the query (or transforms it) to improve retrieval recall.

Why: Users phrase questions in unpredictable ways. "How much can I borrow?" might miss chunks about "maximum loan amount" or "loan limits". Generating 3 variations covers more ground.

How (Multi-query expansion):

Original: "What is the maximum LTV for FHA loans?"

Generated variations:
  1. "FHA loan maximum loan-to-value ratio requirement"
  2. "What percentage of home value can I borrow with FHA?"
  3. "FHA LTV limit guideline"

Run all 3 → retrieve 15 chunks each → merge with RRF → deduplicate

Query Rewriting Techniques

Technique             | How                                           | When to Use
----------------------|-----------------------------------------------|-------------------------------------------
Multi-query expansion | LLM generates N variants                      | Default — LoanIQ uses this
HyDE                  | LLM generates hypothetical answer, embed that | When query is very short/vague
Step-back prompting   | LLM abstracts the specific question           | Complex multi-hop questions
Query decomposition   | Break compound question into sub-questions    | "Compare FHA and conventional LTV limits"

In LoanIQ Project

The PolicyRetrievalAgent calls the LLM to generate multi-query variants before searching:

query_expansion_prompt = """
Generate 3 different versions of this mortgage policy question.
Each version should emphasize different aspects.
Return as JSON array of strings.

Question: {query}
"""
# Returns: ["LTV requirement FHA", "maximum loan-to-value FHA loan", "FHA LTV limit"]
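LLM output is not guaranteed to be valid JSON, so the parsing step should fall back gracefully to the original query. A minimal sketch (the function name and fallback policy are illustrative, not LoanIQ project code):

```python
import json

def parse_query_variants(llm_output: str, original_query: str) -> list[str]:
    """Parse the LLM's JSON array of query variants; fall back to just
    the original query if the output isn't a valid JSON list of strings."""
    try:
        variants = json.loads(llm_output)
        if isinstance(variants, list) and all(isinstance(v, str) for v in variants):
            # Always search with the original query too — variants can drift.
            return [original_query] + variants
    except json.JSONDecodeError:
        pass
    return [original_query]

queries = parse_query_variants(
    '["LTV requirement FHA", "maximum loan-to-value FHA loan", "FHA LTV limit"]',
    "What is the maximum LTV for FHA loans?",
)
print(len(queries))  # 4: the original plus 3 variants
```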

HyDE — Hypothetical Document Embedding

HyDE is particularly powerful when queries are very short. Instead of embedding "DTI?", you generate a hypothetical answer like "The maximum debt-to-income ratio for conventional loans is 45%. This includes all monthly debt obligations..." and embed that — which is much richer and closer to the actual policy chunk's semantic space.

# HyDE implementation
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

hyde_prompt = ChatPromptTemplate.from_messages([
    ("system", "Generate a hypothetical policy document section that would answer this question."),
    ("human", "{question}")
])

llm = ChatOpenAI(model="gpt-4o-mini")
hypothetical_doc = await (hyde_prompt | llm).ainvoke({"question": query})
embedding = await embed(hypothetical_doc.content)  # embed the hypothetical answer, not the question

Sub-topic 5: MMR — Maximal Marginal Relevance

What, Why, How

What: MMR is a diversity filter. After RRF fusion gives you a ranked list of relevant chunks, MMR selects a subset that is both relevant AND diverse — preventing you from passing 5 nearly identical chunks to the LLM.

Why: Without MMR, the top-5 chunks might all be slight variations of "LTV must not exceed 80%." This wastes the LLM's context window and provides no additional information. MMR ensures the selected chunks cover different aspects of the topic.

How:

MMR Formula:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
MMR(d) = λ × Sim(d, query) - (1 - λ) × max[Sim(d, selected)]

Where:
  Sim(d, query)        = similarity to the query (relevance)
  Sim(d, selected)     = max similarity to any already-selected chunk
  λ = 0.5              = tradeoff (0.5 = equal relevance/diversity)

Algorithm (greedy):
  selected = []
  while len(selected) < k:
    best = argmax MMR(d) for all remaining d
    selected.append(best)
    remove best from candidates
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

In LoanIQ Project

From app/services/rag/reranker.py → MMRFilter:

import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    # Helper assumed available in the module.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

class MMRFilter:
    def filter(self, chunks, query_embedding, chunk_embeddings, k=5, lambda_=0.5):
        q_vec = np.array(query_embedding)
        selected = []
        selected_vecs = []
        remaining = list(chunks)

        for _ in range(min(k, len(remaining))):
            best_chunk, best_score = None, float("-inf")

            for chunk in remaining:
                c_vec = np.array(chunk_embeddings[chunk.chunk_id])
                relevance = cosine(q_vec, c_vec)

                if selected_vecs:
                    max_sim = max(cosine(c_vec, sv) for sv in selected_vecs)
                else:
                    max_sim = 0.0

                mmr_score = lambda_ * relevance - (1 - lambda_) * max_sim

                if mmr_score > best_score:
                    best_score = mmr_score
                    best_chunk = chunk

            selected.append(best_chunk)
            selected_vecs.append(np.array(chunk_embeddings[best_chunk.chunk_id]))
            remaining.remove(best_chunk)

        return selected

λ tuning:
  - λ=1.0 → pure relevance (same as sorting by similarity)
  - λ=0.5 → LoanIQ default: equal balance
  - λ=0.3 → more diversity (useful for survey-type questions)
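A toy example makes the λ effect concrete. This standalone sketch (not the project's MMRFilter) uses three 2-D vectors: with pure relevance the near-duplicate wins the second slot, while a lower λ swaps in the chunk covering a different aspect:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def mmr_select(query, candidates, k, lam):
    """Greedy MMR: pick k items balancing relevance to the query against
    similarity to already-selected items."""
    selected = []
    remaining = dict(candidates)  # name -> vector
    while remaining and len(selected) < k:
        def score(name):
            rel = cosine(query, remaining[name])
            max_sim = max((cosine(remaining[name], candidates[s]) for s in selected),
                          default=0.0)
            return lam * rel - (1 - lam) * max_sim
        best = max(remaining, key=score)
        selected.append(best)
        del remaining[best]
    return selected

query = (1.0, 0.0)
candidates = {
    "A": (1.0, 0.0),    # on-topic
    "B": (0.9, 0.44),   # near-duplicate of A
    "C": (0.0, 1.0),    # different aspect
}
print(mmr_select(query, candidates, k=2, lam=1.0))  # ['A', 'B'] — pure relevance keeps the near-duplicate
print(mmr_select(query, candidates, k=2, lam=0.3))  # ['A', 'C'] — diversity swaps in the different aspect
```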


Sub-topic 6: Reranking

What, Why, How

What: A reranker is a cross-encoder model that reads the query and each candidate chunk together and produces a precise relevance score. It's a second-stage filter after the initial retrieval.

Why: Embedding similarity is a bi-encoder approach — query and document are encoded separately and compared via dot product. This is fast but approximate. A cross-encoder reads both together with full attention across both texts — much more accurate, but too slow to run on the full corpus.

Bi-Encoder vs Cross-Encoder:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Bi-Encoder (embedding):
  Query  → [ENCODER] → vector_q ─┐
                                  ├── cosine_sim → score
  Doc    → [ENCODER] → vector_d ─┘

  Fast: pre-compute doc vectors, only encode query at runtime
  Approximate: query/doc encoded independently

Cross-Encoder (reranker):
  [Query + Document] → [ENCODER] → single relevance score

  Slow: must encode each (query, doc) pair together
  Accurate: full attention across query and document tokens
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Why only rerank top 5-10? Cross-encoders are slow (~100ms per pair on CPU). Reranking all 1000 chunks would take ~100 seconds. So we use cheap retrieval to get 10-15 candidates, then use the expensive reranker to precisely score just those.

In LoanIQ Project

From app/services/rag/reranker.py → CrossEncoderReranker:

class CrossEncoderReranker:
    def __init__(self):
        self._model = None   # lazy-loaded on first use

    def rerank(self, query: str, chunks: list, top_k: int = 5) -> list:
        model = self._load_model()  # cross-encoder/ms-marco-MiniLM-L-6-v2

        # Build (query, chunk_text) pairs
        pairs = [(query, chunk.content) for chunk in chunks]

        # Run cross-encoder: scores each pair together
        scores = model.predict(pairs, show_progress_bar=False)

        # Attach scores and sort
        for chunk, score in zip(chunks, scores):
            chunk.rerank_score = float(score)

        return sorted(chunks, key=lambda c: c.rerank_score, reverse=True)[:top_k]

Model used: cross-encoder/ms-marco-MiniLM-L-6-v2
  - Free HuggingFace model, ~25MB
  - Trained on MS MARCO (Microsoft question-answering dataset)
  - Runs on CPU in ~100ms per pair
  - Zero API cost

Alternatives

Reranker                | Type               | Cost           | Accuracy
------------------------|--------------------|----------------|----------
ms-marco-MiniLM-L-6-v2  | Local HuggingFace  | Free           | Good
ms-marco-MiniLM-L-12-v2 | Local HuggingFace  | Free           | Better
Cohere Rerank API       | Managed cloud      | ~$1/1K queries | Very high
ColBERT                 | Local, token-level | Free           | High

Sub-topic 7: Advanced Techniques

Contextual Retrieval

Instead of embedding raw chunks, prepend a document-level summary to each chunk before embedding:

Original chunk: "The maximum LTV is 80% for second homes."

With contextual prefix: 
"[This chunk is from Section 4: Property Requirements of the Eligibility Policy.
 Context: This section describes LTV requirements by occupancy type.]
 The maximum LTV is 80% for second homes."

This makes the chunk embedding "aware" of its context in the larger document. Significantly improves retrieval for chunks whose meaning depends on context.
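A minimal sketch of building the contextual prefix before embedding (the helper name and prefix format are illustrative; in practice the section title and summary would come from the ingestion pipeline or an LLM):

```python
def contextualize_chunk(chunk_text: str, section: str, context_summary: str) -> str:
    """Prepend document-level context so the chunk's embedding captures
    where it sits in the larger document. Illustrative helper, not
    LoanIQ project code."""
    prefix = (
        f"[This chunk is from {section}.\n"
        f" Context: {context_summary}]\n"
    )
    return prefix + chunk_text

text = contextualize_chunk(
    "The maximum LTV is 80% for second homes.",
    "Section 4: Property Requirements of the Eligibility Policy",
    "This section describes LTV requirements by occupancy type.",
)
print(text)
```

The contextualized string is what gets embedded and stored; the original chunk text can still be kept separately for display to the user.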

Self-Querying

LLM generates structured metadata filters from the natural language query:

User: "What are the LTV requirements for FHA loans in California?"

LLM extracts:
{
  "query": "LTV requirements",
  "filters": {
    "applicable_loan_types": ["FHA"],
    "applicable_states": ["CA"]
  }
}

LoanIQ already stores applicable_states and applicable_loan_types on each chunk — these become WHERE clause filters in the SQL, dramatically reducing the search space before vector comparison.
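A sketch of turning the extracted filters into SQL fragments and bind parameters, mirroring the JSONB containment pattern from the dense-retrieval query above (the function name and structure are illustrative, not project code):

```python
import json

def build_filters(extracted: dict) -> tuple[str, dict]:
    """Turn LLM-extracted metadata filters into WHERE fragments plus
    bind params. Sketch only — the real query lives in PolicyRetrieverService."""
    clauses, params = [], {}
    filters = extracted.get("filters", {})
    states = filters.get("applicable_states")
    if states:
        clauses.append(
            "(pc.applicable_states @> CAST(:state_filter AS jsonb)"
            " OR pc.applicable_states @> '[\"ALL\"]'::jsonb)"
        )
        params["state_filter"] = json.dumps(states)
    loan_types = filters.get("applicable_loan_types")
    if loan_types:
        clauses.append("pc.applicable_loan_types @> CAST(:loan_type_filter AS jsonb)")
        params["loan_type_filter"] = json.dumps(loan_types)
    return " AND ".join(clauses), params

where, params = build_filters({
    "query": "LTV requirements",
    "filters": {"applicable_loan_types": ["FHA"], "applicable_states": ["CA"]},
})
print(params)  # {'state_filter': '["CA"]', 'loan_type_filter': '["FHA"]'}
```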


Sub-topic 8: Retrieval Metrics

Key Metrics

Metric      | Formula                                   | What it Measures
------------|-------------------------------------------|-------------------------------------------------
Precision@K | relevant in top-K / K                     | Of what you returned, how much was useful?
Recall@K    | relevant in top-K / total relevant        | Did you find everything that matters?
MRR         | mean(1 / rank of first relevant)          | How early does the first relevant result appear?
NDCG@K      | DCG@K / IDCG@K, DCG = Σ relᵢ / log₂(i+1)  | Position-weighted graded relevance
Hit Rate@K  | 1 if any relevant chunk in top-K, else 0  | Did you find at least one relevant chunk?
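The first three metrics are a few lines of code each. A standalone sanity check on a toy ranking:

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    # Of the top-K returned, what fraction is relevant?
    return sum(1 for d in retrieved[:k] if d in relevant) / k

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    # Of all relevant items, what fraction made it into the top-K?
    return sum(1 for d in retrieved[:k] if d in relevant) / len(relevant)

def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    # 1 / rank of the first relevant item (0 if none found).
    for i, d in enumerate(retrieved, start=1):
        if d in relevant:
            return 1 / i
    return 0.0

retrieved = ["c1", "c2", "c3", "c4", "c5"]   # ranked retrieval output
relevant = {"c2", "c4", "c9"}                # ground-truth relevant chunks
print(precision_at_k(retrieved, relevant, 5))  # 0.4   (2 of 5 retrieved are relevant)
print(recall_at_k(retrieved, relevant, 5))     # ≈0.667 (2 of 3 relevant were found)
print(reciprocal_rank(retrieved, relevant))    # 0.5   (first relevant at rank 2)
```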

In LoanIQ Project (from rag_evaluator.py)

def _score_context_precision(self, sample, rag_result) -> float:
    """Precision: what fraction of retrieved chunks contain relevant keywords?"""
    relevant_count = sum(
        1 for chunk in rag_result.source_chunks
        if any(kw.lower() in chunk.content.lower() for kw in sample.context_keywords)
    )
    return relevant_count / len(rag_result.source_chunks)

def _score_context_recall(self, sample, rag_result) -> float:
    """Recall: what fraction of expected keywords appear in ANY retrieved chunk?"""
    all_context = " ".join(c.content.lower() for c in rag_result.source_chunks)
    found = sum(1 for kw in sample.context_keywords if kw.lower() in all_context)
    return found / len(sample.context_keywords)

Interpreting low scores:
  - Low precision → retrieval is bringing in noisy/irrelevant chunks → tighten similarity threshold or improve chunking
  - Low recall → missing relevant content → increase K, improve chunking, check if content was ingested correctly


10 Interview Questions — RAG Retrieval Pipeline

Q1: What is hybrid retrieval and why does LoanIQ use it?

A: Hybrid retrieval combines dense (vector/semantic) search and sparse (BM25/keyword) search to get the best of both worlds.

In LoanIQ, policy documents contain both: general guidelines (better with dense) and specific rule references with exact numbers and acronyms (better with BM25).

Example: Query "What is the maximum DTI for USDA loans?"
  - Dense: Returns chunks about DTI limits broadly (conceptually similar)
  - BM25: Returns chunks that literally contain "USDA" and "DTI" (keyword match)
  - RRF: Chunks appearing in both lists get boosted — highest confidence


Q2: Explain the BM25 formula. Why does it outperform simple TF-IDF?

A: BM25 builds on TF-IDF with two key improvements:

TF saturation: In plain TF-IDF, a word appearing 20 times scores 20× more than a word appearing once. This is gameable and often wrong — the 20th occurrence adds little information. BM25's TF component is: tf / (tf + k₁(1-b+b×|d|/avgdl)), which asymptotically approaches a ceiling. For an average-length document at k₁=1.5, doubling TF from 1 to 2 raises the component from 1/2.5 = 0.40 to 2/3.5 ≈ 0.57, about 43% more, not 2×.

Length normalization: Shorter documents have higher TF naturally (fewer total words). BM25 normalizes by comparing document length to corpus average length (avgdl). Parameter b=0.75 controls this normalization strength.

IDF stays the same: Rare terms (appear in few documents) get higher IDF. "Mortgage" in a mortgage corpus would have low IDF; "HECM" (Home Equity Conversion Mortgage) in a general corpus would have high IDF.


Q3: Walk me through the RRF formula. Why is k=60?

A: RRF score for document d: Σ 1/(k + rank_r(d)) summed over all retrieval lists r.

The role of k=60:
  - Without k: rank 1 → score 1.0, rank 2 → 0.5, rank 3 → 0.33. The gap between rank 1 and 2 (0.5) is huge — rank 1 dominates.
  - With k=60: rank 1 → 1/61 = 0.0164, rank 2 → 1/62 = 0.0161. Tiny gap — rankings are treated more equally.

This means a document that ranks 5th in dense AND 5th in sparse will beat a document that ranks 1st in only one list. "Consistency across retrievers" is rewarded over "best in one retriever."

The constant k=60 was empirically determined to work well across many datasets. You can tune it but 60 is a safe default.


Q4: What is the "lost in the middle" problem and how does MMR help?

A: Research by Liu et al. (2023) found that LLMs perform significantly worse at retrieving information placed in the middle of a long context. They pay attention to the beginning and end of the context window but "lose" information in the middle.

MMR doesn't directly solve "lost in the middle" at the generation stage (that's addressed by careful prompt formatting — putting most important chunks first and last). However, MMR prevents a worse problem: redundant chunks.

Without MMR, if your top-5 chunks all say "LTV must be ≤ 80%" (slightly differently worded), the LLM spends most of its attention window on one fact. With MMR, chunk 1 covers LTV, chunk 2 covers DTI, chunk 3 covers credit score requirements — the context window is used efficiently.


Q5: Explain bi-encoder vs cross-encoder. Why do we use both in LoanIQ?

A:

Bi-encoder: Query encoded independently from document. Result = two vectors. Score = cosine similarity. Speed: one encoder forward pass per query (doc vectors are pre-computed; comparison is a fast vector search). Accuracy: moderate.

Cross-encoder: Query and document encoded together in one forward pass. Full attention across all query-document tokens. Speed: one forward pass per (query, candidate) pair, i.e. O(n) in the number of candidates. Accuracy: high.

LoanIQ uses both in a two-stage pipeline:
  1. Stage 1 (bi-encoder): Dense retrieval finds top-15 candidates fast — eliminates 99.9% of chunks with one query embedding and cheap vector comparisons
  2. Stage 2 (cross-encoder): Reranker reads each of the 15 candidates together with the query — precise scoring of just 15 pairs

If we used only the cross-encoder, scoring 1000 chunks × 100ms = 100 seconds per query. Unacceptable. The bi-encoder pre-filter makes this feasible.


Q6: What does MMR's λ parameter control? What would you change it to for different use cases?

A: λ controls the tradeoff between relevance and diversity:

MMR(d) = λ × Sim(d, query) - (1-λ) × max_Sim(d, selected)

In LoanIQ, the PolicyRetrievalAgent uses λ=0.5 because mortgage policy questions often need multiple policy rules (credit, LTV, DTI, property type) — diversity matters. For a precise single-fact lookup, raise λ toward 1.0 (pure relevance); for broad survey-type questions, lower it toward 0.3 to force wider coverage.


Q7: A user asks "DTI?" and gets no useful results. What's wrong and how would you fix it?

A: "DTI?" is too short and vague for both dense and sparse retrieval:
  - Dense: The embedding of "DTI?" is a low-information vector — embedding three characters produces a noisy, non-specific vector.
  - BM25: "DTI" might match, but without context, it matches every chunk mentioning DTI.

Fixes:
  1. Query expansion/rewriting: Before retrieval, use an LLM to expand "DTI?" to "What is the maximum debt-to-income ratio requirement for mortgage loans?" This gives both retrievers rich, meaningful input.
  2. HyDE: Generate a hypothetical answer and embed that instead. The hypothetical answer will be a rich paragraph about DTI limits — much better for dense retrieval.
  3. Conversation context: If this is part of a multi-turn conversation about FHA loans, prepend the conversation context to the query.
  4. Minimum query length check: Reject queries shorter than N characters and prompt the user to be more specific.


Q8: How does LoanIQ filter chunks by state and loan type during retrieval? Why is this important?

A: Mortgage regulations vary dramatically by state. A California FHA rule may be completely different from a Texas FHA rule. Retrieving the wrong state's policy and generating an answer based on it would be a compliance violation.

In the SQL query:

WHERE pc.applicable_states @> CAST(:state_filter AS jsonb)
   OR pc.applicable_states @> '["ALL"]'::jsonb

@> is PostgreSQL's JSON containment operator: "this JSON column contains this value."

Each PolicyChunk has applicable_states = ["CA"] or ["ALL"]. The query only returns:
  - Chunks tagged for the specific state
  - Chunks tagged as "ALL" (universal rules)

This means the LLM only sees policy content relevant to the borrower's actual situation — no risk of applying California rules to a Texas loan.

This is called pre-retrieval filtering — reducing the search space before vector comparison, which also makes the query faster.


Q9: What is the difference between context precision and context recall in RAGAS?

A:

Context Precision: Of all the chunks you retrieved, what fraction were actually relevant to answering the question? High precision means low noise in your context.

Example: Retrieve 5 chunks, 4 are about the right topic, 1 is about property insurance (off-topic) → Precision = 4/5 = 0.80

Context Recall: Of all the information needed to correctly answer the question, did your retrieval system find it all? High recall means no important information was missed.

Example: The correct answer requires knowing (a) max LTV and (b) PMI requirements. If retrieval found chunks covering LTV but not PMI → Recall ≈ 0.5

Fixing low scores:
  - Low precision → Tighten similarity threshold; improve chunk quality; use better reranking
  - Low recall → Increase top-K; improve chunking (maybe a single chunk was split badly); check if content is in the corpus at all

In LoanIQ's rag_evaluator.py, both are measured by keyword overlap with expected keywords from the evaluation dataset.


Q10: BM25 is in-memory in LoanIQ. What breaks at scale and how would you fix it?

A: BM25Okapi from rank_bm25 is built in-memory at startup from all policy chunks. Problems at scale:

  1. Memory: 100K chunks × average 200 tokens × ~4 bytes/token = ~80MB. At 10M chunks → 8GB RAM just for BM25 index.
  2. Startup time: Rebuilding index from DB on every restart = slow cold start
  3. Updates: New document ingested → entire BM25 index must be rebuilt

Production fixes:
  1. Elasticsearch/OpenSearch: Persistent, distributed BM25. Supports incremental updates, handles billions of documents, has its own scoring API.
  2. PostgreSQL full-text search: tsvector columns with GIN index. Already in the same DB, no extra infrastructure. Less tunable than Elasticsearch but sufficient for most cases.
  3. Hybrid approach: Keep BM25 in-memory for small policy corpus (<10K chunks) since performance is fine. Switch to PostgreSQL FTS or Elasticsearch only when chunks exceed 10K.

In LoanIQ's case (~500-1000 policy chunks), in-memory BM25 is absolutely the right call. Over-engineering here would add complexity with zero benefit.


Next: RAG Generation & RAGAS Evaluation →