| Model | Dims | Notes |
|---|---|---|
| text-embedding-3-small | 1536 | OpenAI, cost-efficient, strong |
| text-embedding-3-large | 3072 | OpenAI, best quality |
| BGE-M3 | 1024 | Open source, multilingual SOTA |
| E5-Mistral-7B | 4096 | LLM-based, top MTEB scores |
| Cohere Embed v3 | 1024 | Optimized for retrieval |
| Jina v3 | 1024 | Task-specific, 8192 ctx |
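Whatever the model, dense retrieval scores candidates the same way: cosine similarity between the query vector and each document vector. A minimal numpy sketch (toy 4-dim vectors stand in for the real 1024–3072-dim embeddings above):

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    # Standard dense-retrieval score: angle between vectors, ignoring magnitude.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dim vectors standing in for real high-dimensional embeddings.
query = np.array([0.1, 0.9, 0.0, 0.2])
doc_a = np.array([0.1, 0.8, 0.1, 0.3])   # similar direction → high score
doc_b = np.array([0.9, 0.0, 0.4, 0.0])   # different direction → low score

print(cosine_sim(query, doc_a) > cosine_sim(query, doc_b))  # True
```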

| DB | Strength | Best For |
|---|---|---|
| Pinecone | Managed, fast | Production SaaS |
| Weaviate | GraphQL API, hybrid | Rich metadata filter |
| FAISS | In-memory, CPU+GPU | Research, prototyping |
| Chroma | Lightweight, local | Dev/local RAG |
| pgvector | PostgreSQL native | Existing PG infra |
| Qdrant | Payload filtering | Metadata-heavy search |
| Milvus | Open, scalable | Billion-scale search |
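Under the hood, every engine above answers the same core query: top-k nearest neighbors by similarity. A brute-force numpy sketch of that operation — roughly what an exact "flat" index does; production DBs layer ANN structures (HNSW, IVF) on top for scale:

```python
import numpy as np

def top_k_search(query: np.ndarray, index: np.ndarray, k: int = 3) -> list[int]:
    # Exact search: score every stored vector by cosine, return top-k indices.
    index_n = index / np.linalg.norm(index, axis=1, keepdims=True)
    scores = index_n @ (query / np.linalg.norm(query))
    return np.argsort(-scores)[:k].tolist()

rng = np.random.default_rng(0)
index = rng.normal(size=(100, 64))              # 100 stored vectors, dim 64
query = index[42] + 0.01 * rng.normal(size=64)  # near-duplicate of vector 42

print(top_k_search(query, index)[0])  # → 42
```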

- `graph.add_node("intake", intake_agent)` — nodes are pure functions: state_in → state_out
- `add_edge`, `add_conditional_edges` for branching
- `LoanApplicationState` — type-safe, immutable per-turn
- `Annotated[list, operator.add]` for parallel-safe list merging
- `config={"configurable": {"thread_id": "loan_123"}}` for per-thread checkpointing
- `interrupt_before=["decision_agent"]` for approval gates

| Method | VRAM | Approach |
|---|---|---|
| Full FT | High | All weights updated, best quality, most data |
| LoRA | Low | Low-rank ΔW=BA (r=8–64) added to frozen weights |
| QLoRA | Lowest | 4-bit NF4 quantized base + bf16 LoRA adapters |
| Prefix/Prompt FT | Minimal | Only prepended soft prompt tokens trained |
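The LoRA savings can be made concrete: with ΔW = BA at rank r, trainable parameters drop from d·k to r·(d+k). A numpy sketch of the bookkeeping — shapes mirror a hypothetical 4096×4096 attention projection; B starts at zero so ΔW = 0 at step 0, as in the LoRA paper:

```python
import numpy as np

d, k, r = 4096, 4096, 16                 # frozen weight shape, LoRA rank
rng = np.random.default_rng(0)
A = rng.normal(scale=0.01, size=(r, k))  # trainable, Gaussian init
B = np.zeros((d, r))                     # trainable, zero init → ΔW = 0 at step 0

delta_W = B @ A                          # low-rank update added to frozen W
full_params = d * k
lora_params = r * (d + k)
print(f"trainable fraction: {lora_params / full_params:.2%}")  # 0.78%
```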

- `merge_and_unload()` merges LoRA adapters into the frozen base weights → a single standalone model for serving

| Metric | Measures | How |
|---|---|---|
| Faithfulness | Hallucination rate | Claims in answer supported by context? LLM-judged |
| Answer Relevancy | On-topic quality | Reverse generate Qs from answer → cosine sim to original Q |
| Context Precision | Retrieval signal/noise | How much retrieved context is actually relevant? |
| Context Recall | Coverage | Ground truth claims found in retrieved context? |
| Context Relevancy | Retrieval accuracy | Retrieved chunks relevant to the query? |
| Answer Correctness | Factual accuracy | Semantic similarity + factual overlap vs ground truth |
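What the context metrics compute can be shown with labeled toy data. This is a simplified sketch: real RAGAS uses an LLM judge to decide relevance and rank-weights precision, but the underlying ratios look like this (chunk IDs and labels are made up):

```python
def context_precision(retrieved: list[str], relevant: set[str]) -> float:
    # Signal/noise: fraction of retrieved chunks that are actually relevant.
    return sum(c in relevant for c in retrieved) / len(retrieved) if retrieved else 0.0

def context_recall(retrieved: list[str], relevant: set[str]) -> float:
    # Coverage: fraction of ground-truth-relevant chunks present in the context.
    return sum(c in relevant for c in set(retrieved)) / len(relevant) if relevant else 1.0

retrieved = ["c1", "c2", "c3", "c4"]   # made-up chunk IDs
relevant = {"c1", "c3", "c5"}          # made-up ground-truth labels
print(context_precision(retrieved, relevant))  # 0.5
print(context_recall(retrieved, relevant))     # 0.6666666666666666
```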

- `interrupt_before` nodes pause for human approval (LoanIQ: underwriting decisions)

| Task / Scenario | Recommended Model / Tool | Rationale |
|---|---|---|
| Production RAG (cost-sensitive) | GPT-4o-mini + text-embedding-3-small | 80% quality at 10% cost vs GPT-4o |
| Complex reasoning / agent | Claude 3.5 Sonnet / GPT-4o | Best long-context, tool use, reasoning |
| Local / private deployment | Llama 3.1 8B / Mistral 7B via Ollama | No data leaves premises, free |
| Code generation | Claude 3.5 Sonnet / DeepSeek Coder | Top HumanEval scores |
| Embeddings (best quality) | text-embedding-3-large / E5-Mistral-7B | Highest MTEB/BEIR retrieval scores |
| Embeddings (open source) | BGE-M3 / Jina v3 | Multilingual, self-hosted, strong |
| Re-ranking | Cohere Rerank v3 / BGE-reranker-v2 | Best retrieval precision gain |
| Orchestration | LangGraph (stateful) / LangChain (chains) | Cycles + state = LangGraph; simple pipelines = LC |
| Vector DB (managed) | Pinecone / Weaviate | Zero ops, SOC2, good SLAs |
| Vector DB (open source) | pgvector (existing PG) / Qdrant | Collocate with app DB / rich filtering |
| Fine-tuning (low resource) | QLoRA with Unsloth | 4-bit + LoRA = 70% VRAM reduction |
| Graph RAG | Microsoft GraphRAG / LightRAG | Multi-hop reasoning, thematic summaries |
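The Orchestration row (and the LangGraph notes earlier in this sheet) boil down to one pattern: nodes as pure state → state functions plus conditional edges. A library-free Python sketch of that pattern with hypothetical loan-flow node names — not LangGraph's actual API:

```python
from typing import Callable

State = dict  # nodes are pure functions State -> State

def intake(state: State) -> State:
    # hypothetical validation step
    return {**state, "validated": state["amount"] > 0}

def decide(state: State) -> State:
    # hypothetical approval rule
    return {**state, "approved": state["validated"] and state["amount"] < 50_000}

def route_after_intake(state: State) -> str:
    # conditional edge: branch on state, like add_conditional_edges
    return "decide" if state["validated"] else "END"

nodes: dict[str, Callable[[State], State]] = {"intake": intake, "decide": decide}
edges: dict[str, Callable[[State], str]] = {
    "intake": route_after_intake,
    "decide": lambda s: "END",
}

def run(state: State, entry: str = "intake") -> State:
    node = entry
    while node != "END":
        state = nodes[node](state)   # each node returns a new state dict
        node = edges[node](state)
    return state

print(run({"amount": 12_000}))  # {'amount': 12000, 'validated': True, 'approved': True}
```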

| Parameter | Recommended | Impact |
|---|---|---|
| chunk_size | 512–1024 tokens | Too small = loss of context; too large = noise + lost-in-middle |
| chunk_overlap | 10–20% of chunk | Prevents answer split across boundaries |
| top_k retrieval | 20–100 (rerank to 5) | High recall → rerank for precision. Too small = miss answers |
| top_k final | 3–7 chunks | Context window budget; quality vs completeness tradeoff |
| temperature | 0.0–0.2 (RAG) · 0.7–1.0 (creative) | Low T = deterministic, factual; High T = diverse, creative |
| top_p (nucleus) | 0.9 | Truncates low-probability tail; tune either top_p or temperature, not both |
| similarity threshold | 0.7–0.8 | Filter irrelevant retrieved chunks before LLM |
| embed batch_size | 64–256 | Throughput vs memory; larger = faster embedding |
| LoRA rank (r) | 16–64 | Higher r = more capacity, more memory; r=16 usually sufficient |
| LoRA alpha | 2× rank | Effective learning rate of adapter; alpha/r = scale factor |
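chunk_size and chunk_overlap combine into a simple sliding window. A token-level sketch (assumes the input is already tokenized into a list; real splitters such as LangChain's recursive splitter also try to respect sentence and paragraph boundaries):

```python
def chunk(tokens: list[str], chunk_size: int = 512, overlap_frac: float = 0.15) -> list[list[str]]:
    # Sliding window: consecutive chunks share `overlap` tokens so an answer
    # that straddles a boundary still appears whole in at least one chunk.
    overlap = int(chunk_size * overlap_frac)   # 15% of 512 = 76 tokens
    step = chunk_size - overlap
    return [tokens[i:i + chunk_size]
            for i in range(0, max(len(tokens) - overlap, 1), step)]

tokens = [f"t{i}" for i in range(1200)]        # stand-in for tokenized text
chunks = chunk(tokens)
print(len(chunks), len(chunks[0]))             # → 3 512
```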