RAG — Retrieval Augmented Generation
Module 3: Generation & RAGAS Evaluation
What is Generation in RAG? After retrieval finds the relevant chunks, the Generation stage assembles them into a prompt and passes it to an LLM to produce a grounded, cited answer. The LLM is explicitly instructed to only use the provided context — not its training knowledge.
Sub-topic 1: Prompt Construction
What, Why, How
What: Prompt construction is the art of assembling the system prompt, retrieved chunks, and user query into a single prompt that guides the LLM to generate accurate, grounded, cited answers.
Why: A poorly constructed prompt leads to: - Hallucinations (LLM invents information not in chunks) - Ignored citations (LLM doesn't reference sources) - Lost-in-the-middle failures (LLM misses information in the middle of a long context)
How: Structure the prompt in three parts: 1. System prompt: Ground the LLM — "Answer ONLY from the provided context" 2. Context block: Retrieved chunks, formatted with source labels 3. User question: The actual question to answer
Effective RAG Prompt Structure:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
SYSTEM:
You are a mortgage policy expert. Answer ONLY using the policy
sections provided below. If the answer is not in the context,
say "I cannot find this in the provided policy documents."
Cite sources as [POLICY SOURCE N].
CONTEXT:
[POLICY SOURCE 1] Eligibility Guidelines — Credit Requirements (p.12)
Relevance: 0.94
────────────────────────────────────────────────────────────
The minimum credit score for conventional loans is 620...
[POLICY SOURCE 2] Underwriting Borrowers — DTI Limits (p.8)
Relevance: 0.88
────────────────────────────────────────────────────────────
The maximum back-end DTI ratio is 45% for most loan types...
USER QUESTION:
What is the minimum credit score and DTI limit for a conventional loan?
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
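The three-part structure above can be sketched as a small assembly function. This is illustrative only — the chunk fields and the `build_prompt` helper are assumptions, not LoanIQ's exact code:

```python
# Sketch: assemble system prompt + labeled context + user question.
# The chunk dict keys ('title', 'score', 'text') are illustrative.
SYSTEM_PROMPT = (
    "You are a mortgage policy expert. Answer ONLY using the policy "
    "sections provided below. If the answer is not in the context, "
    'say "I cannot find this in the provided policy documents." '
    "Cite sources as [POLICY SOURCE N]."
)

def build_prompt(chunks: list[dict], question: str) -> str:
    """Combine the system prompt, labeled context block, and user question."""
    context_parts = []
    for i, chunk in enumerate(chunks, 1):
        context_parts.append(
            f"[POLICY SOURCE {i}] {chunk['title']}\n"
            f"Relevance: {chunk['score']:.2f}\n"
            f"{'-' * 60}\n"
            f"{chunk['text']}"
        )
    context_block = "\n\n".join(context_parts)
    return (
        f"SYSTEM:\n{SYSTEM_PROMPT}\n\n"
        f"CONTEXT:\n{context_block}\n\n"
        f"USER QUESTION:\n{question}"
    )
```

Keeping the question last mirrors the template above: the instruction and the question sit at the context boundaries, where LLM attention is strongest.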
In LoanIQ Project
From app/services/rag/context_builder.py, chunks are formatted with source labels and relevance scores before being injected into the prompt:
def _format_context(self, chunks: list[RetrievedChunk]) -> str:
    parts = []
    for i, chunk in enumerate(chunks, 1):
        # Prefer the reranker's score; fall back to the RRF fusion score
        score = chunk.rerank_score if chunk.rerank_score != 0 else chunk.rrf_score
        header = (
            f"[POLICY SOURCE {i}] {chunk.document_title} — {chunk.section_title}\n"
            f"Relevance Score: {score:.3f}\n"
            f"{'─' * 60}"
        )
        parts.append(f"{header}\n{chunk.content.strip()}")
    # Note the f-string prefix: without it the separator would be the
    # literal text "{'═' * 60}" instead of a rule of 60 characters
    return f"\n\n{'═' * 60}\n\n".join(parts)
Lost in the Middle Problem
Research shows LLMs have U-shaped attention over long contexts: strongest at the beginning and end, weakest in the middle. For a 5-chunk context, the LLM may effectively ignore chunks 2, 3, 4.
Mitigation in LoanIQ:
- Keep context concise (max 6000 tokens via context_builder.py)
- Put highest-scoring chunk first and second-highest last
- Use only top-5 chunks after reranking (not 15)
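The "best chunk first, second-best last" ordering can be sketched as follows. This is a hypothetical helper, assuming each chunk carries a `score` key; it is not LoanIQ's actual implementation:

```python
def order_for_attention(chunks: list[dict]) -> list[dict]:
    """Place the best chunk first and the second-best chunk last,
    so the strongest evidence sits at the context boundaries where
    LLM attention is highest. Remaining chunks fill the middle."""
    ranked = sorted(chunks, key=lambda c: c["score"], reverse=True)
    if len(ranked) < 3:
        return ranked
    best, second, *rest = ranked
    return [best] + rest + [second]
```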
Fallback When No Relevant Chunks Found
if not chunks or max(c.rrf_score for c in chunks) < 0.5:
return "I cannot find relevant policy information for this question. Please consult your loan officer."
Never let the LLM hallucinate an answer when there's no supporting context.
Sub-topic 2: Hallucination Prevention
What, Why, How
What: Hallucination occurs when the LLM generates statements not supported by the retrieved context. In mortgage decisioning, a hallucinated "you qualify" response could expose the company to legal liability.
Why: LLMs are trained to be helpful and produce fluent text — they'd rather make something up than say "I don't know." The grounding prompt reduces this but doesn't eliminate it. Post-generation validation is the safety net.
How: After the LLM generates a response, validate it against the retrieved context using a secondary LLM call or NLI (Natural Language Inference) model.
Hallucination Prevention Methods
| Method | How | Speed | Accuracy | LoanIQ? |
|---|---|---|---|---|
| Grounding prompt | "Only answer from context" | No overhead | Moderate | ✅ First line of defense |
| Post-generation NLI | Check entailment of each claim | Fast | Good | Research-grade |
| LLM-as-judge | Secondary LLM validates faithfulness | Slow (extra LLM call) | High | ✅ Used in RAGAS eval |
| Self-consistency | Generate multiple answers, check agreement | Very slow (N × LLM calls) | High | Not used |
| Citation verification | Check cited chunks actually support claims | Fast | Good | ✅ RAGAS faithfulness |
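The citation-verification row in the table can be sketched as a cheap structural check: every `[POLICY SOURCE N]` cited in the answer must refer to a chunk that was actually provided. This is an illustrative helper, not LoanIQ's exact code:

```python
import re

def verify_citations(answer: str, num_sources: int) -> bool:
    """Return True if every [POLICY SOURCE N] citation in the answer
    points to one of the num_sources chunks that were in the prompt.
    (Checking that the cited chunk actually supports the claim needs
    an NLI model or LLM judge on top of this.)"""
    cited = {int(n) for n in re.findall(r"\[POLICY SOURCE (\d+)\]", answer)}
    return all(1 <= n <= num_sources for n in cited)
```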
In LoanIQ Project
The Stage 7 AuditAgent performs post-generation validation. The RAGAS evaluator measures faithfulness:
async def _score_faithfulness(self, rag_result) -> float:
    """LLM-as-judge: Is every claim in the answer supported by the context?"""
    llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
    prompt = ChatPromptTemplate.from_messages([
        ("system", """Rate if the answer is fully supported by the context.
Return ONLY a number between 0.0 and 1.0.
1.0 = every claim is directly supported by context
0.5 = partially supported
0.0 = answer contains claims not in context (hallucination)"""),
        ("human", "CONTEXT:\n{context}\n\nANSWER:\n{answer}\n\nScore:"),
    ])
    chain = prompt | llm | StrOutputParser()
    # context_text and answer are derived from rag_result (elided in this excerpt);
    # inputs are truncated to keep judge calls cheap
    raw = await chain.ainvoke({"context": context_text[:3000], "answer": answer[:1000]})
    try:
        return max(0.0, min(1.0, float(raw.strip())))
    except ValueError:
        # Judge returned non-numeric text; treat as unscorable, not as grounded
        return 0.0
Abstaining Pattern
When context is insufficient, the system should abstain rather than guess:
ABSTAIN_PHRASES = ["not in the provided policy", "cannot find", "not covered"]
def should_abstain(answer: str, chunks: list, faithfulness: float = 1.0) -> bool:
    # Abstain if no chunks were retrieved
    if not chunks:
        return True
    # Abstain if the LLM signals uncertainty
    if any(phrase in answer.lower() for phrase in ABSTAIN_PHRASES):
        return True
    # Abstain if the faithfulness score is too low (threshold illustrative)
    return faithfulness < 0.5
Sub-topic 3: RAGAS — Faithfulness
What, Why, How
What: RAGAS (Retrieval Augmented Generation Assessment) is a framework to evaluate RAG pipelines without requiring human labels for every query. Faithfulness is its most important metric.
Why: A RAG system that produces fluent, confident-sounding answers that aren't grounded in retrieved context is dangerous. Faithfulness measures this risk.
Definition: Faithfulness = (number of claims in answer supported by context) / (total claims in answer)
How the RAGAS faithfulness algorithm works:
Step 1: Decompose answer into atomic claims
Answer: "The max DTI is 45% and PMI is required for LTV > 80%."
Claims: ["max DTI is 45%", "PMI required when LTV > 80%"]
Step 2: For each claim, check if it can be inferred from context
"max DTI is 45%" → present in chunk 2 → SUPPORTED
"PMI required when LTV > 80%" → present in chunk 1 → SUPPORTED
Step 3: Faithfulness = 2/2 = 1.0 (perfect)
If one claim was not in context:
Faithfulness = 1/2 = 0.5 (hallucination detected!)
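The final ratio in Step 3 is trivial once each claim has an entailment verdict. In the sketch below, the claim decomposition (Step 1) and per-claim checks (Step 2) are assumed to have been done by an LLM or NLI model, leaving one boolean per claim:

```python
def faithfulness(claim_supported: list[bool]) -> float:
    """Faithfulness = supported claims / total claims.
    Each boolean is the verdict of an entailment check for one atomic claim."""
    if not claim_supported:
        return 0.0  # no claims extracted: nothing to ground
    return sum(claim_supported) / len(claim_supported)
```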
In LoanIQ Project
In rag_evaluator.py, faithfulness is scored using LLM-as-judge:
# The LLM breaks down the answer and checks each claim
# Returns a float 0.0-1.0
faithfulness = await self._score_faithfulness(rag_result)
What a low faithfulness score means: - LLM is using training knowledge instead of retrieved context - Chunks retrieved don't contain the answer → LLM fills in the gap - Fix: Improve retrieval to find the right chunks; strengthen grounding prompt
Sub-topic 4: RAGAS — Answer Relevancy
What, Why, How
What: Answer Relevancy measures whether the generated answer actually addresses the user's question. A high-faithfulness answer can have low relevancy if it's grounded in context but answers the wrong question.
Algorithm (Reverse Generation):
Step 1: Generate synthetic questions FROM the answer
Answer: "The minimum credit score is 620 for conventional loans."
Generated questions:
- "What is the minimum credit score for conventional loans?"
- "What credit score do I need for a conventional mortgage?"
Step 2: Measure cosine similarity between generated questions and original question
Original: "What is the minimum FICO for a conventional loan?"
Sim("What credit score for conventional loans?", original) = 0.94
Step 3: Answer Relevancy = mean cosine similarity of generated questions
→ 0.94 (high — the answer is on topic)
Why this approach? If the answer is on-topic, questions generated FROM it will be similar to the original question. If the answer is off-topic (e.g., talking about FHA when asked about conventional), the generated questions would be very different.
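The scoring in Steps 2-3 reduces to a mean of cosine similarities. The sketch below uses toy embedding vectors; in practice the vectors would come from an embedding model:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def answer_relevancy(original_emb: list[float],
                     generated_embs: list[list[float]]) -> float:
    """Mean cosine similarity between the original question's embedding
    and the embeddings of questions generated FROM the answer."""
    sims = [cosine(original_emb, g) for g in generated_embs]
    return sum(sims) / len(sims)
```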
In LoanIQ Project
async def _score_answer_relevancy(self, question: str, answer: str) -> float:
    """LLM rates: does the answer actually address the question? 0.0-1.0"""
    llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
    prompt = ChatPromptTemplate.from_messages([
        ("system", "Rate how well the answer addresses the question. Return ONLY a float 0.0-1.0."),
        ("human", "QUESTION: {question}\n\nANSWER: {answer}\n\nScore:"),
    ])
    # Simplified LLM-as-judge version (full RAGAS does reverse generation)
    chain = prompt | llm | StrOutputParser()
    raw = await chain.ainvoke({"question": question, "answer": answer[:800]})
    return max(0.0, min(1.0, float(raw.strip())))
Sub-topic 5: RAGAS — Context Precision & Recall
Context Precision
What: Of all chunks retrieved, what fraction were actually useful for generating the answer?
Why it matters: If you retrieved 10 chunks but only 2 were relevant, the LLM's context is 80% noise. This leads to distracted, inaccurate answers and wasted tokens.
Low precision fix: - Tighten similarity threshold (e.g., from 0.6 to 0.75) - Reduce top-K (retrieve fewer but better chunks) - Improve chunking (finer, more specific chunks) - Better reranking (more aggressive filtering)
Context Recall
What: Of all the information needed to correctly answer the question, did retrieval find all of it?
Requires ground truth: You need a reference answer to know what information was supposed to be retrieved.
Low recall fix: - Increase top-K (retrieve more candidates) - Fix chunking (important facts might be split across chunks) - Improve query expansion (more query variants cover more ground) - Check corpus completeness (is the answer even in the ingested documents?)
In LoanIQ Project
From rag_evaluator.py:
def _score_context_precision(self, sample, rag_result) -> float:
    """What fraction of retrieved chunks contain relevant keywords?"""
    if not rag_result.source_chunks:
        return 0.0  # guard: nothing retrieved
    relevant = sum(
        1 for chunk in rag_result.source_chunks
        if any(kw.lower() in chunk.content.lower() for kw in sample.context_keywords)
    )
    return relevant / len(rag_result.source_chunks)

def _score_context_recall(self, sample, rag_result) -> float:
    """What fraction of expected keywords appear in retrieved chunks?"""
    if not sample.context_keywords:
        return 0.0  # guard: sample has no expected keywords
    all_context = " ".join(c.content.lower() for c in rag_result.source_chunks)
    found = sum(1 for kw in sample.context_keywords if kw.lower() in all_context)
    return found / len(sample.context_keywords)
Sub-topic 6: Evaluation Dataset
What, Why, How
What: An evaluation dataset is a collection of (question, ground_truth_answer, relevant_context) triplets used to objectively measure RAG pipeline quality.
Why: You cannot improve what you don't measure. Without an eval dataset, you're guessing whether your pipeline improvements actually help or hurt.
How to build it: 1. Manual curation: Loan officers write real questions they'd ask; annotators write ground truth answers. 2. Synthetic generation: Use GPT-4 to generate Q&A pairs from policy documents. 3. Mining real queries: Use logs of actual questions asked to the system.
LoanIQ Eval Dataset Format
From app/evaluation/datasets/loan_eval_dataset.json:
[
{
"id": "eval_001",
"question": "What is the maximum LTV for a conventional purchase loan with credit score 680?",
"ground_truth": "The maximum LTV for a conventional purchase loan with a credit score of 680 is 95% for owner-occupied primary residences, subject to PMI requirements above 80% LTV.",
"category": "ltv_requirements",
"loan_type": "CONVENTIONAL",
"state": "CA",
"context_keywords": ["LTV", "conventional", "680", "PMI", "95%", "primary"]
}
]
Evaluation Flow
From rag_evaluator.py:
async def run_evaluation(self, dataset_path, sample_size, run_name):
    # 1. Load dataset
    samples = self._load_dataset(dataset_path, sample_size)
    sample_results = []
    # 2. For each sample, run the RAG pipeline and time it
    for sample in samples:
        start = time.perf_counter()  # requires `import time` at module top
        rag_result = await self.pipeline.run(
            query=sample.question,
            state_filter=sample.state,
            loan_type_filter=sample.loan_type,
        )
        latency_ms = (time.perf_counter() - start) * 1000
        # 3. Score the result
        sample_results.append(await self._score_sample(sample, rag_result, latency_ms))
    # 4. Aggregate and push to LangSmith
    report = self._aggregate(run_name, sample_results)
    await self._push_to_langsmith(report)
Interpreting RAGAS Scores
| Score Range | Meaning | Action |
|---|---|---|
| Faithfulness < 0.7 | High hallucination risk | Strengthen grounding prompt; improve retrieval |
| Faithfulness > 0.9 | Good grounding | Monitor, no action needed |
| Answer Relevancy < 0.7 | Answers off-topic questions | Improve prompt; check query handling |
| Context Precision < 0.6 | Too much noise in context | Tighten similarity threshold; better reranking |
| Context Recall < 0.7 | Missing relevant information | Increase K; fix chunking; add more content |
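The table can be turned into an automated triage check. This is a hypothetical helper; the thresholds and action strings mirror the table above:

```python
def flag_issues(scores: dict[str, float]) -> list[str]:
    """Map aggregate RAGAS scores to recommended actions.
    Missing metrics default to 1.0 (no flag)."""
    issues = []
    if scores.get("faithfulness", 1.0) < 0.7:
        issues.append("high hallucination risk: strengthen grounding prompt, improve retrieval")
    if scores.get("answer_relevancy", 1.0) < 0.7:
        issues.append("off-topic answers: improve prompt, check query handling")
    if scores.get("context_precision", 1.0) < 0.6:
        issues.append("noisy context: tighten similarity threshold, better reranking")
    if scores.get("context_recall", 1.0) < 0.7:
        issues.append("missing information: increase K, fix chunking, add content")
    return issues
```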
10 Interview Questions — Generation & RAGAS Evaluation
Q1: What is RAG hallucination and how does LoanIQ prevent it?
A: RAG hallucination occurs when the LLM generates factual claims that aren't in the retrieved context — it's "making things up" despite being given reference material. In mortgage decisioning, this is a compliance risk: a hallucinated "you qualify" or incorrect LTV limit could expose the bank to legal liability.
LoanIQ's three-layer defense: 1. Prompt grounding: System prompt says "Answer ONLY from the provided policy sections. If not in context, say so." 2. Abstaining fallback: When no relevant chunks are retrieved (similarity < threshold), the system returns "I cannot find this in the policy" instead of letting the LLM guess. 3. Post-generation validation (Stage 7 AuditAgent): A secondary LLM call checks if the generated decision is supported by the retrieved policy context. Evaluated by RAGAS faithfulness score.
If faithfulness drops below 0.75 in production monitoring, it triggers an alert to review the retrieval pipeline.
Q2: Explain RAGAS faithfulness. How is it computed?
A: Faithfulness measures whether every factual claim in the generated answer can be traced back to the retrieved context. It ranges from 0 (pure hallucination) to 1 (fully grounded).
Algorithm: 1. Use an LLM to decompose the generated answer into individual atomic claims: "The max DTI is 45%" and "PMI is required above 80% LTV" are two separate claims. 2. For each claim, ask an LLM (or NLI model): "Can this claim be inferred from the context?" → Yes/No 3. Faithfulness = (claims supported by context) / (total claims)
Example in LoanIQ: - Answer: "The maximum LTV is 80% and credit score must be at least 620." - Claim 1: "max LTV 80%" → in context → ✅ - Claim 2: "credit score ≥ 620" → in context → ✅ - Faithfulness = 2/2 = 1.0
If the LLM adds "and PMI is typically 0.5-1.5%" (not in context) → 2/3 = 0.67
Q3: What is the difference between faithfulness and answer correctness? Can an answer be faithful but incorrect?
A: Yes — a faithful answer can be incorrect if the retrieved context itself contains wrong or outdated information.
- Faithfulness: Does the answer match the retrieved context? (Is the LLM grounded?)
- Answer Correctness: Does the answer match the ground truth? (Is the answer actually right?)
Example: - Ground truth: "Max LTV for FHA is 96.5%" - Retrieved context contains an outdated policy: "Max LTV for FHA is 95%" - LLM answer: "Max LTV for FHA is 95% per the policy." - Faithfulness = 1.0 (perfectly grounded in context) - Answer Correctness = low (wrong answer — old policy was retrieved)
This is why you also need context recall (did you retrieve the current correct policy?) and periodic re-ingestion when policies update.
Q4: How does LoanIQ build its evaluation dataset? What's in each record?
A: The eval dataset in loan_eval_dataset.json is built as (question, ground_truth, metadata) triplets:
{
"id": "eval_001",
"question": "What is the max LTV for conventional loans with credit score 680?",
"ground_truth": "Maximum LTV is 95% for primary residence with credit score 680...",
"category": "ltv_requirements",
"loan_type": "CONVENTIONAL",
"state": "CA",
"context_keywords": ["LTV", "conventional", "680", "95%", "primary"]
}
The dataset has several origins: 1. Manual: Mortgage underwriters wrote real questions they encounter daily 2. Synthetic: GPT-4o generated Q&A pairs from policy PDFs with the prompt "Generate a question and answer that could be asked about this policy section" 3. Edge cases: Questions about borderline scenarios (LTV exactly 80%, DTI exactly 45%) that stress-test boundary conditions
The context_keywords field is used for context precision/recall scoring — the evaluator checks if retrieved chunks contain these keywords.
Q5: Why does LoanIQ use LLM-as-judge for faithfulness instead of traditional NLP metrics like BLEU or ROUGE?
A: BLEU and ROUGE measure n-gram overlap between the generated answer and a reference answer. They were designed for translation and summarization where there's a "correct" target text.
For faithfulness in RAG, we're asking a different question: "Is this claim supported by this context?" — which requires reading comprehension, not string matching.
Example: - Context: "Debt-to-income ratio must not exceed 45%" - Answer claim: "The maximum DTI is 45%" - BLEU/ROUGE: Low score — different words ("must not exceed" vs "maximum is") - LLM-as-judge: ✅ Supported — same meaning, different phrasing
LLM-as-judge uses GPT-4o-mini in LoanIQ's evaluator because it understands semantic equivalence, can handle domain-specific terminology, and produces calibrated 0-1 scores. The downside: adds latency and cost to evaluation. We use gpt-4o-mini (cheaper) instead of gpt-4o because the faithfulness task is relatively simple.
Q6: What does "low context precision" tell you and how do you fix it?
A: Low context precision means your retrieval is returning irrelevant chunks that are polluting the LLM's context.
Example: Query "What is the PMI requirement?" retrieves: - Chunk 1: ✅ "PMI is required when LTV exceeds 80%" - Chunk 2: ✅ "PMI rate is typically 0.5-1.5% annually" - Chunk 3: ❌ "Property insurance requirements for flood zones" (off-topic) - Chunk 4: ❌ "Flood insurance requirements in FEMA Zone A" (off-topic)
Precision = 2/4 = 0.50 — poor. The LLM might confuse property insurance with PMI.
Fixes: 1. Raise similarity threshold: In LoanIQ, increase from 0.7 to 0.75 — more strict 2. Reduce top-K: Retrieve fewer candidates (15 → 10) — less noise 3. Better reranking: The cross-encoder should catch off-topic chunks — check if reranker model needs upgrading 4. Better chunking: If "property insurance" and "PMI" appear in the same chunk due to bad splitting, they'd be inseparable 5. Metadata filtering: Add category tags to chunks ("pmi", "flood_insurance") and filter pre-retrieval
Q7: How would you detect and handle the "lost in the middle" problem in LoanIQ?
A: The "lost in the middle" problem (Liu et al., 2023): LLMs attend strongly to the beginning and end of context but ignore the middle. For a prompt with 5 chunks, chunks 2 and 3 are most likely to be ignored.
Detection in LoanIQ: - RAGAS answer correctness drops as context window grows - Questions whose answers are in middle-positioned chunks score lower
Mitigation strategies: 1. Best chunks at boundaries: Sort so the highest-scoring chunk is first, second-highest is last, rest in middle. 2. Limit context size: LoanIQ's ContextBuilder uses 6000 token budget — don't stuff 20 chunks. Top 5 reranked chunks only. 3. Reduce redundancy with MMR: If all chunks say slightly different versions of the same thing, the LLM wastes middle attention on duplicates. MMR ensures diverse content. 4. Chunk summaries in the prompt header: Before the full chunks, include a 1-sentence summary of each: "Source 1: Credit requirements; Source 2: DTI limits; Source 3: PMI rules" — helps the LLM scan before reading deeply.
Q8: What is the RAGAS "answer correctness" metric? How does it differ from faithfulness?
A: Answer correctness measures how close the generated answer is to the ground truth answer. It uses token-level F1 score (same as SQuAD evaluation):
def _score_answer_correctness(self, ground_truth: str, answer: str) -> float:
    gt_tokens = set(re.findall(r'\b\w+\b', ground_truth.lower()))
    ans_tokens = set(re.findall(r'\b\w+\b', answer.lower()))
    common = gt_tokens & ans_tokens
    if not common:
        return 0.0  # guard: no overlap → F1 is 0 (avoids division by zero)
    precision = len(common) / len(ans_tokens)
    recall = len(common) / len(gt_tokens)
    return 2 * precision * recall / (precision + recall)
vs Faithfulness: - Faithfulness: Does the answer match the retrieved context? (Doesn't require ground truth) - Answer Correctness: Does the answer match the ground truth? (Requires labeled dataset)
You can have high faithfulness and low correctness (retrieved wrong context) or low faithfulness and high correctness (hallucinated the right answer by luck). Both matter.
Q9: How does LoanIQ push evaluation results to LangSmith and why?
A: LangSmith is LangChain's tracing and evaluation platform. LoanIQ pushes evaluation results there for visual analysis:
def _setup_langsmith(self):
    if settings.langchain_tracing_v2 and settings.langchain_api_key:
        os.environ["LANGCHAIN_TRACING_V2"] = "true"
        os.environ["LANGCHAIN_API_KEY"] = settings.langchain_api_key
        os.environ["LANGCHAIN_PROJECT"] = settings.langchain_project
        self.langsmith_client = Client()
Every evaluation run gets pushed as a named run (e.g., loaniq-eval-20240115-143022) with all per-sample scores, latencies, and the final aggregated report.
Why LangSmith: 1. Trend tracking: See if faithfulness improved after a chunking change 2. Failure analysis: Click into a specific sample that scored 0.0 — see exactly which chunk was retrieved and what the LLM generated 3. Dataset management: Store eval datasets centrally, run evals against them consistently 4. Cost tracking: Token counts per run, per agent
Alternative: MLflow for experiment tracking, but LangSmith is purpose-built for LLM evaluation and integrates natively with LangChain/LangGraph.
Q10: Walk me through the complete evaluation flow in LoanIQ from query to score.
A: Complete flow from run_evaluation() in rag_evaluator.py:
1. Load dataset: Read loan_eval_dataset.json — 50 Q&A triplets
2. For each sample:
   - Record start time
   - Call AdvancedRAGPipeline.run(question, state="CA", loan_type="CONVENTIONAL")
   - This runs: query rewriting → dense + sparse retrieval → RRF → MMR → reranking → context building → LLM generation
   - Record end time → latency_ms
3. Score the result:
   - faithfulness: LLM-as-judge (GPT-4o-mini) checks if answer claims are in context
   - answer_relevancy: LLM rates if the answer addresses the question
   - context_precision: keyword overlap — what % of chunks contain expected keywords?
   - context_recall: what % of expected keywords appear in any retrieved chunk?
   - answer_correctness: token F1 between generated answer and ground truth
   - citation_accuracy: do all cited sources exist?
4. Aggregate: mean scores across all 50 samples; breakdown by category (ltv, dti, credit, etc.)
5. Push to LangSmith: named run with full results
6. Print report: console output with pass/fail indicators
If mean faithfulness < 0.75, the evaluation script exits with error code 1 — blocks CI/CD deployment until the issue is fixed.
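That CI gate can be sketched as follows; the 0.75 threshold comes from the text, while the function name is illustrative:

```python
# Block deployment when mean faithfulness falls below the gate.
FAITHFULNESS_GATE = 0.75

def ci_exit_code(mean_faithfulness: float) -> int:
    """Return 1 (fail the build) if the gate is not met, else 0.
    The evaluation script would call sys.exit(ci_exit_code(score))."""
    return 1 if mean_faithfulness < FAITHFULNESS_GATE else 0
```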