// gen ai engineer · interview prep · 2026

Complete Interview Checklist

Project edition · 4+ yrs IT experience · Python beginner → Gen AI Engineer

9 sections
100+ items
Project hands-on proof
MUST Every interview
GOOD Strong differentiator
NICE Bonus points
Project You built this — own it
01
🏦 Your Project (Most Important)
Elevator Pitch
One-line summary — say it naturally in under 60 seconds
«Agentic RAG system
Tell me about a Gen AI project you built
MUSTProject
Draw the full pipeline from memory without hesitation
Each agent reads previous state, returns only changed fields, LangGraph merges.
Walk me through your pipeline architecture
MUSTProject
Technical Decisions — Defend Every One
Why LangGraph over plain LangChain?
Stateful graph — each agent reads all previous outputs. Conditional edges — compliance FAIL → skip to decline. Auditability — every state change is traceable. Plain LangChain chains are stateless pipelines — not suited to multi-agent workflows with shared state.
MUSTProject
Why hybrid retrieval (Dense + BM25 + RRF)?
Dense catches semantic meaning — «debt ratio» matches «DTI». BM25 catches exact terms — «43%» or «QM rule». Either alone misses things. RRF merges both fairly without tuning weights.
Why not just use vector embeddings for retrieval?
MUSTProject
Why different models per agent?
Cost optimization. Llama free for simple validation (Intake/Ratio). GPT-4o for complex reasoning (Underwriting/Decision). Haiku cheap for PII agents in prod. Sonnet for final decisions. Wrong model assignment = 10x unnecessary cost.
How did you handle LLM costs?
MUSTProject
Why AWS Bedrock for PII agents in production?
Data never leaves AWS VPC — regulatory requirement for financial PII. External API calls (OpenAI) are prohibited for SSN, income, credit data under GLBA. Bedrock = same Claude/Llama models, zero data egress.
How did you handle data privacy compliance?
MUSTProject
Why pgvector instead of Pinecone?
Single DB for vectors + metadata + application data — no extra service. Metadata filtering built-in (loan_type, state). PostgreSQL transactions = ACID compliance. Simpler ops, lower cost, no vendor lock-in.
Why not a dedicated vector database?
MUSTProject
Why fine-tune Llama 3.1 8B for RatioAgent?
$0 marginal cost vs $0.005/call GPT-4o. After 500 examples + 3 epochs QLoRA, narrative quality reaches ~87% of GPT-4o. Knowledge distillation — GPT-4o teacher generates training data for Llama student.
GOODProject
How do you handle LLM failures gracefully?
Every agent has try/except with fallback defaults. IntakeAgent failure → score=7, PROCEED. Pipeline never crashes — errors propagate in state.errors[]. Audit agent logs all failures with full trace.
GOODProject
Numbers You Must Know
Cost per pipeline run, pipeline latency, token usage
Dev: ~$0.03–0.05/run (OpenAI only). Latency: 20–30s end-to-end (7 agents). ~2000 tokens per application. Prod: ~60–70% cheaper via Bedrock vs OpenAI for PII agents.
MUSTProject
RAGAS scores — run evaluation before interview, know your actual numbers
Run: PYTHONPATH=. python scripts/run_evaluation.py — know your faithfulness, context_precision, context_recall, answer_relevancy scores. Real numbers beat estimates every time.
MUSTProject
Fine-tuning specs: model size, training time, data size, hardware
Llama 3.1 8B, 4-bit QLoRA, r=16, 500 examples, 3 epochs, ~3 hours on GTX 1650 (4GB VRAM), adapter weights ~50MB. Cost: $0.50 for training data, $0 for local compute.
MUSTProject
02
🔍 RAG Pipeline — Retrieval Augmented Generation
Core Concepts
What is RAG and why do we need it?
LLMs hallucinate and have knowledge cutoffs. RAG grounds the model in real documents. Instead of relying on training data, the model reads actual policy chunks before answering. Results are verifiable and citable.
What is RAG and when would you use it?
MUST
Chunking: fixed-size, sentence, semantic, recursive — and chunk overlap
Project: 512 words, 50-word overlap. Overlap prevents losing context at boundaries — sentence at the end of chunk 1 also appears at start of chunk 2. Word-based not char-based for natural boundaries.
How do you prepare documents for RAG?
MUSTProject
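The word-based sliding-window chunking described above can be sketched in a few lines. A minimal sketch — the function name `chunk_words` is illustrative, not the project's actual code:

```python
def chunk_words(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
    """Word-based sliding-window chunking: each chunk shares `overlap`
    words with the previous one, so sentences at chunk boundaries
    appear in both neighbouring chunks."""
    words = text.split()
    chunks = []
    start = 0
    while start < len(words):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
        start += chunk_size - overlap  # step forward, keeping the overlap
    return chunks
```

Word-based splitting keeps natural boundaries; the overlap guarantees no sentence is lost at a cut point.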
What are embeddings? What model, cost, dimensions?
1536 floats encoding semantic meaning. Similar texts → similar vectors → small cosine distance. Project: text-embedding-3-small, $0.02/1M tokens, 1536 dims. Fast, cheap, strong for retrieval.
What are vector embeddings and how do you choose a model?
MUSTProject
Metadata preservation — why it matters for citations
Every chunk tagged with doc_name, page_num, section, subsection, char_offset. Extracted BEFORE chunking — each chunk inherits source metadata. Dashboard shows «mortgage_policy.pdf page 4 — Section 2.1» not just «some chunk».
MUSTProject
Retrieval Techniques — The 5-Stage Pipeline
Stage 1: Query rewriting — why expand the query?
One query misses synonyms. «FHA loan LTV» misses «Federal Housing Administration loan-to-value ratio». Generate 3 variations with GPT-4o-mini, cast wider net, dedup before retrieval.
MUSTProject
Stage 2: Dense (pgvector cosine) + BM25 — why both?
Dense: understands meaning, catches synonyms. BM25: catches exact terms — «43%» or «QM rule» — that dense misses. Together they cover semantic AND lexical relevance. Neither alone is sufficient for regulatory documents.
Why hybrid retrieval over just embeddings?
MUSTProject
Stage 3: RRF fusion — formula and why not weighted sum
RRF(chunk) = 1/(k+rank_dense) + 1/(k+rank_bm25). k=60 standard. Chunks ranked consistently high in both lists win. Weighted sum requires tuning per dataset. RRF works out of the box — no hyperparameter search needed.
How do you merge dense and BM25 results?
MUSTProject
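The RRF formula above fits in one small function. A minimal sketch assuming each ranking is a list of chunk IDs, best first (`rrf_fuse` is an illustrative name):

```python
def rrf_fuse(dense_ranking: list[str], bm25_ranking: list[str], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: score(chunk) = sum over lists of 1/(k + rank).
    Chunks ranked consistently high in both lists float to the top;
    no per-dataset weight tuning needed."""
    scores: dict[str, float] = {}
    for ranking in (dense_ranking, bm25_ranking):
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Note how a chunk appearing in both lists ("b" mid-rank in each) beats a chunk that tops only one list — that is the whole point of RRF.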
Stage 4: MMR — relevance vs diversity tradeoff
Without MMR, top 5 chunks repeat same section. MMR score = lambda * relevance - (1-lambda) * max_similarity_to_selected. lambda=0.7 → 70% relevance, 30% diversity. Keeps context window useful.
How do you handle redundant retrieved chunks?
GOODProject
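The MMR scoring rule above, sketched greedily over unit-normalised vectors. Assumptions: cosine similarity reduces to a dot product because vectors are pre-normalised, and `mmr_select` is an illustrative name; the test below uses a toy lambda rather than the project's 0.7:

```python
import numpy as np

def mmr_select(query_vec: np.ndarray, chunk_vecs: np.ndarray,
               k: int = 3, lam: float = 0.7) -> list[int]:
    """Greedy MMR: score = lam * sim(query, c) - (1-lam) * max sim(c, selected).
    Assumes all vectors are unit-normalised (dot product = cosine sim)."""
    sims = chunk_vecs @ query_vec                 # relevance to the query
    selected = [int(np.argmax(sims))]             # most relevant chunk first
    while len(selected) < min(k, len(chunk_vecs)):
        best, best_score = None, -np.inf
        for i in range(len(chunk_vecs)):
            if i in selected:
                continue
            # penalty: similarity to the closest already-selected chunk
            redundancy = max(float(chunk_vecs[i] @ chunk_vecs[j]) for j in selected)
            score = lam * float(sims[i]) - (1 - lam) * redundancy
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
    return selected
```

With a duplicate chunk in the candidates, MMR skips it in favour of a less relevant but diverse one — exactly the "top 5 chunks repeat same section" failure it fixes.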
Stage 5: Cross-encoder reranker — bi-encoder vs cross-encoder
Bi-encoder (embeddings): scores query and chunk separately — fast but approximate. Cross-encoder: reads query AND chunk together — much more accurate. Too slow for 1000s of chunks, perfect for final top-5. Model: ms-marco-MiniLM-L-6-v2.
What is a reranker and when do you use it?
GOODProject
Vector Databases
pgvector — cosine similarity, HNSW vs IVFFlat indexes
pgvector adds vector(1536) column type to Postgres. <=> operator = cosine distance. HNSW: faster queries, more memory. IVFFlat: faster build, approximate. Project: default index, small dataset.
MUSTProject
Pinecone vs Weaviate vs Chroma vs Qdrant vs pgvector — tradeoffs
Pinecone: managed, expensive, no SQL joins. Chroma: local dev only. Qdrant: self-hosted prod. Weaviate: multimodal. pgvector: integrated with Postgres, ACID, free. Choose pgvector when you already have Postgres.
MUST
ANN (Approximate Nearest Neighbor) — why not exact search at scale?
Exact search = O(n) comparisons. 10M vectors × 1536 dims = too slow. ANN trades tiny accuracy loss for 100x speed. HNSW builds a multi-layer graph. Query traverses layers to find approximate nearest neighbors.
GOOD
Agentic RAG & Orchestration
LangGraph — StateGraph, nodes, edges, ainvoke, state merging
StateGraph defines typed state dict. add_node() adds agent functions. add_edge() connects them. Conditional edges route based on state values. ainvoke() runs full pipeline async. Each node returns only changed fields — LangGraph merges.
How does state flow between your agents?
MUSTProject
Multi-agent patterns: sequential, parallel, supervisor, hierarchical
Project: sequential pipeline (each agent waits for previous). Parallel: run compliance + policy simultaneously. Supervisor: orchestrator agent routes to sub-agents. Hierarchical: nested agent graphs.
MUST
Tool use / function calling — structured JSON outputs from LLMs
Project: response_format=json_object for all agents. Prompt specifies exact JSON schema. Parser validates and falls back to defaults on parse error. More reliable than parsing prose.
GOODProject
Graph RAG, RAPTOR, corrective RAG, self-RAG — advanced patterns
NICE
03
📊 RAGAS — Evaluation & Hallucination Detection
Core RAGAS Metrics
Faithfulness — is every claim traceable to retrieved context?
Score 0–1. LLM checks if each claim in the answer can be found in the context chunks. High faithfulness = no hallucination. Project Stage 7 validator does this check per-agent.
What is faithfulness in RAGAS and how do you measure it?
MUSTProject
Context Precision — of retrieved chunks, how many were relevant?
Precision = relevant retrieved / total retrieved. High precision = retrieval is clean, not noisy. Low precision = lots of irrelevant chunks wasting context window. Improved by reranker.
MUSTProject
Context Recall — of all relevant chunks in DB, how many did we retrieve?
Recall = relevant retrieved / total relevant in DB. Low recall = missed important policy sections. Improved by hybrid retrieval + query rewriting. You want BOTH precision and recall high.
MUSTProject
Answer Relevancy — does the answer actually address the question?
A faithful answer can still be irrelevant (answers different question). Measured by asking LLM to generate questions from the answer, then checking similarity to original question. Score 0–1.
MUSTProject
Hallucination detection — how do you catch it?
Low faithfulness = hallucination signal. LLM generates answer then checks each claim against context. Claims not in context = hallucinated. Project: Stage 7 validator + fallback to «insufficient context» rather than hallucinate.
How do you prevent LLM hallucinations in production?
MUSTProject
LLM-as-judge — what it is, pros, cons
Use GPT-4o to score GPT-4o outputs. Pro: cheap, scalable, no human labelers. Con: self-consistency bias, inconsistent on edge cases. Mitigation: multiple judges + average, use stronger judge than judged model.
What is LLM-as-judge? What are the risks?
GOOD
Evaluation Best Practices
BLEU, ROUGE — classic NLP metrics, when to use vs RAGAS
BLEU/ROUGE: n-gram overlap. Fast, no LLM needed. Problem: «The maximum LTV is 97%» and «97% LTV is the max» have low ROUGE but mean the same thing. RAGAS is semantic — better for LLM output evaluation.
GOOD
Evaluation dataset — how to build a ground truth set
Project: 15 Q&A pairs in loan_eval_dataset.json with ground truth answers and reference chunks. Generated with GPT-4o reading actual policy docs. Run periodically to detect RAG quality drift.
GOODProject
LangSmith — tracing, dataset management, prompt versioning
Set LANGCHAIN_TRACING_V2=true → every LangGraph run traced in LangSmith dashboard.
NICEProject
04
🧠 AI / ML Foundations & LLM Theory
Core ML
Supervised vs Unsupervised vs Reinforcement Learning
Supervised: labeled data, predict output. Unsupervised: find patterns without labels. RL: learn from rewards — used in RLHF for LLM alignment. RAG pairs learned retrieval (contrastively trained embeddings) with a generative LLM.
MUST
Backpropagation and chain rule — whiteboard this
Forward pass computes loss. Backward pass computes dL/dW for each layer using chain rule: dL/dW = dL/dy × dy/dW. Gradients flow backward. Optimizer updates W = W - lr × dL/dW.
Explain backpropagation step by step
MUST
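A whiteboard-sized numeric example of the chain rule above — one scalar weight, y = w·x, squared-error loss, with a finite-difference check that the analytic gradient is right. Purely illustrative numbers:

```python
# Chain rule on the smallest possible "network": y = w * x, L = (y - t)^2
# dL/dw = dL/dy * dy/dw = 2 * (y - t) * x

def loss(w: float, x: float, t: float) -> float:
    return (w * x - t) ** 2

def grad(w: float, x: float, t: float) -> float:
    return 2.0 * (w * x - t) * x   # analytic gradient via the chain rule

w, x, t, lr = 0.5, 2.0, 3.0, 0.1

# Sanity check: analytic gradient matches the numeric (finite-difference) one
eps = 1e-6
numeric = (loss(w + eps, x, t) - loss(w - eps, x, t)) / (2 * eps)
assert abs(grad(w, x, t) - numeric) < 1e-4

# One optimizer step: W = W - lr * dL/dW
w_new = w - lr * grad(w, x, t)
```

One step moves w from 0.5 toward the target and the loss drops — the same mechanics scale up to every layer of a transformer.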
Loss functions: Cross-entropy, MSE, BCE, Contrastive
Cross-entropy: classification (LLM next-token prediction). MSE: regression. BCE: binary classification. Contrastive: pull similar embeddings together, push different ones apart — used to train embedding models.
MUST
Gradient Descent: SGD, Adam, AdamW — why AdamW for LLMs?
SGD: fixed learning rate, noisy. Adam: adaptive per-parameter rates, fast convergence. AdamW: Adam + weight decay decoupled — prevents overfitting. Default for all LLM fine-tuning. Project uses adamw_8bit.
MUST
Activation functions: ReLU, GELU, Sigmoid, Softmax
ReLU: max(0,x), fast, dead neuron problem. GELU: smooth version of ReLU — used in GPT/BERT. Sigmoid: 0–1, binary gates. Softmax: converts logits to probabilities (LLM output layer, attention scores).
MUST
Batch Norm vs Layer Norm — why Transformers use LayerNorm
BatchNorm: normalizes across batch dimension — breaks with batch size 1 or variable-length sequences. LayerNorm: normalizes across feature dimension per sample — works for any batch size. All transformers use LayerNorm.
MUST
Transformer Architecture
Self-attention: Q, K, V matrices — the formula
Attention(Q,K,V) = softmax(QKᵀ / √dk) × V. Q=what I'm looking for. K=what I have. V=what I return. Divide by √dk prevents softmax saturation in high dimensions. Must know this cold.
Explain the attention mechanism mathematically
MUST
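The formula above in a few lines of NumPy — a minimal single-head sketch (no masking, no batching):

```python
import numpy as np

def attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                       # query-key similarity
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)        # row-wise softmax
    return weights @ V, weights                           # weighted sum of values
```

Each output row is a probability-weighted mix of the value rows; the √dk divisor keeps the softmax from saturating as dimensions grow.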
Multi-head attention — why multiple heads?
Each head can attend to different aspects simultaneously — syntax, semantics, coreference. 12 heads in GPT-2, 32 in Llama 3.1. Outputs concatenated then projected. Richer representations than single head.
MUST
Causal masking (decoder) vs bidirectional (encoder)
GPT = decoder-only, causal mask (can't look at future tokens). BERT = encoder-only, bidirectional. T5 = encoder-decoder. For generation tasks (Project agents): decoder-only models (GPT-4o, Llama).
MUST
Positional encoding: sinusoidal vs learned vs RoPE
Sinusoidal: fixed formula, original transformer. Learned: trained embeddings. RoPE (Rotary Position Embedding): Llama, GPT-NeoX — rotates Q,K vectors by position angle. Better long-context extrapolation than absolute positional encodings.
MUST
Context window, tokens, tokenization (BPE) — impact on RAG design
~1.3 tokens per word. GPT-4o: 128k context. Llama 3.1: 128k. BPE: byte-pair encoding splits words into subwords. Project: top 3 chunks × 512 words ≈ 2000 tokens — well within limits. Chunk size chosen to respect context budget.
MUSTProject
KV Cache — what it is, why it speeds up inference
During generation, Key and Value matrices are computed once per token and cached. Subsequent tokens reuse the cache instead of recomputing. Reduces O(n²) attention to O(n) per new token. Critical for long outputs.
GOOD
O(n²) attention complexity — why long contexts are expensive
Every token attends to every other token. 1000 tokens = 1M attention scores. 10k tokens = 100M. Directly explains why RAG is better than stuffing full document in context — retrieval is O(log n) with HNSW.
GOODProject
Temperature, Top-K, Top-P — what each controls
Temperature: randomness. 0=deterministic, 1=creative. Top-K: sample from top K tokens only. Top-P: sample from smallest set summing to P probability mass. Project: temperature=0.1 for agents (consistent JSON), 0.3 for training data generation.
How did you set temperature in your system?
MUSTProject
Pre-training vs SFT vs RLHF — what each stage does
Pre-training: predict next token on massive corpus — base model. SFT (Supervised Fine-Tuning): train on instruction-response pairs — follows instructions. RLHF: human preference rankings train reward model → PPO optimizes LLM output quality.
MUST
Residual connections — why critical for deep networks
Skip connections: output = layer(x) + x. Gradients flow directly to earlier layers bypassing intermediate layers — prevents vanishing gradients. Enables training 100+ layer networks. Every transformer block has residual connections.
GOOD
05
⚙️ Fine-Tuning — QLoRA on Llama 3.1
Core Concepts
Full fine-tuning vs LoRA vs QLoRA — key differences
Full: update all 8B params, 80GB+ GPU, days. LoRA: freeze base weights, add trainable A×B matrices (0.1% params), 16GB. QLoRA: LoRA + 4-bit quantization — fits an 8B model on a 4GB consumer GPU with Unsloth. Project: QLoRA on GTX 1650.
What is the difference between LoRA and QLoRA?
MUSTProject
LoRA math — what A and B matrices do
Original W frozen. Add ΔW = A×B where A is d×r, B is r×d. Output = Wx + scale×ABx. Rank r=16 in Project. Total trainable params = 2×d×r per layer ≈ 50MB vs 16GB full model. Scale = lora_alpha/r.
How does LoRA reduce memory requirements?
MUSTProject
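The parameter arithmetic above, worked through. Assumed illustrative dims (a 4096-wide Llama-style projection, r=16 as in the project); the shape check uses tiny matrices:

```python
import numpy as np

d, r = 4096, 16                       # hidden dim and LoRA rank
full_params = d * d                   # one frozen weight matrix W
lora_params = d * r + r * d           # trainable A (d x r) + B (r x d)
ratio = lora_params / full_params     # fraction of params actually trained

# Shape check on a tiny example: delta_W = A @ B matches W's shape,
# so the update can simply be added: output = W x + scale * A B x
A, B = np.zeros((8, 2)), np.zeros((2, 8))
assert (A @ B).shape == (8, 8)
```

131,072 trainable values versus ~16.8M frozen per matrix — under 1% — which is why the adapter weights total ~50MB instead of gigabytes.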
Why fine-tune vs just prompt engineer?
Prompt engineering has limits — can't change model's intrinsic style. Fine-tuning teaches domain vocabulary, output format, tone. Project: underwriter narrative style impossible via prompting alone. $0/call vs $0.005 GPT-4o after fine-tuning.
When would you fine-tune vs prompt engineer?
MUSTProject
Knowledge distillation — GPT-4o teacher → Llama student
Use large model (GPT-4o) to generate training data for smaller model (Llama 3.1 8B). Student learns teacher's style and quality. Project: 500 GPT-4o narratives at $0.50 → fine-tuned Llama achieves 87% of GPT-4o quality at $0/call.
What is knowledge distillation?
MUSTProject
LoRA rank (r), alpha, target_modules — how you chose them
r=16: sweet spot for 8B on consumer GPU. r=8 if OOM, r=32 if 24GB. alpha=16: scale=alpha/r=1.0. target_modules: q,k,v,o projections + gate,up,down FFN layers — all attention + FFN in Llama architecture. lora_dropout=0 with QLoRA (Unsloth recommendation).
GOODProject
Catastrophic forgetting — why LoRA avoids it
Full fine-tuning can overwrite base weights — model forgets general knowledge. LoRA freezes all original weights — additive only. Base capabilities fully preserved. Model stays good at general tasks while gaining domain expertise.
GOOD
Alpaca prompt format, SFTTrainer, gradient accumulation
Alpaca: «### Instruction: {task} ### Response: {output}». SFTTrainer: HuggingFace trainer optimized for SFT. Gradient accumulation: batch_size=1 × steps=8 = effective batch 8 — same as batch_size=8 but fits in 4GB VRAM.
GOODProject
GGUF format, Ollama Modelfile — deploy fine-tuned model locally
GGUF: llama.cpp quantization format. Q4_K_M: 4-bit, good quality/size balance. Modelfile: FROM /path/to/model.gguf → ollama create Project-ratio. Now Ollama serves it like any other model at localhost:11434.
GOODProject
Model merging, speculative decoding, MoE — advanced serving
NICE
06
🏗️ System Design & Architecture
Architecture Decisions
Draw Project full architecture from memory
Client → FastAPI → LangGraph pipeline → 7 agents → PostgreSQL/pgvector. Ollama on laptop (dev) → LAN to Mac Mini. OpenAI API for GPT-4o agents. Bedrock for prod PII agents. RAGAS eval pipeline separate. LangSmith for traces.
Walk me through your system architecture
MUSTProject
Why async/await throughout? What problem does it solve?
FastAPI is async. LLM API calls and DB queries are I/O bound — CPU does nothing while waiting. Without async: server blocks all other requests. With async: handles 100s of concurrent requests while awaiting LLM responses. asyncio = single-threaded cooperative multitasking.
Why did you use async throughout your codebase?
MUST
Dependency injection — FastAPI Depends() pattern
FastAPI calls Depends(get_db) before each endpoint, injects AsyncSession. Handles open/commit/rollback/close automatically. Never leaks connections. Testable — swap real DB for mock in unit tests. Same pattern for config, auth.
How does dependency injection work in FastAPI?
MUST
How would you scale to 10,000 loans/day?
Async FastAPI handles concurrency already. Add Redis+Celery queue for background processing. Horizontal scale API servers behind load balancer. Read replicas for pgvector queries. Cache embeddings in Redis. Bedrock auto-scales. Connection pool tuning.
How would you scale your system?
GOODProject
What breaks first at scale? Monitoring strategy.
LLM API rate limits hit first, then DB connection pool. Monitor: latency per agent, token usage/cost, RAG faithfulness drift (signals retrieval quality degradation), error rate per agent, queue depth. Alert on faithfulness < 0.7.
What are the failure points in your system?
GOODProject
Common System Design Questions
Design a RAG system for a bank from scratch
Ingest: PDF → pypdf → chunker → embedder → pgvector. Retrieval: hybrid dense+BM25 → RRF → MMR → reranker. Generation: top-3 chunks + question → LLM → answer + citations. Evaluation: RAGAS faithfulness + context precision. Monitoring: LangSmith traces.
MUST
Design a multi-agent pipeline for document processing
LangGraph StateGraph. TypedDict state schema. Each agent: receives full state, returns changed fields only. Conditional edges on agent verdicts. Checkpoints for resume on failure. Async throughout. Audit node at end logs everything to DB.
MUST
How would you A/B test two LLM models in production?
Feature flag routes X% traffic to model B. Log all outputs + latency + cost. Evaluate with RAGAS on both. Statistical significance test before full rollout. Shadow mode first — run both, serve model A, compare B offline.
GOOD
What would you improve in Project v4?
Streaming responses to frontend (SSE). Parallel execution of compliance + policy agents (currently sequential). Cache embeddings for repeated queries. HNSW index on pgvector for faster retrieval. Add confidence intervals to RAGAS scores.
What are the limitations of your current design?
GOODProject
07
🐍 Python — What They Actually Test
Python Fundamentals for Gen AI
async/await — read and modify async code confidently
async def = pausable function. await = pause here, let others run. asyncio.gather() = run multiple coroutines concurrently. asyncio.run() = start event loop. Don't need to write from scratch — need to understand and modify.
MUSTProject
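A minimal runnable sketch of the pattern above: `asyncio.sleep` stands in for an I/O-bound LLM call, and `gather` runs both "calls" concurrently (names are illustrative):

```python
import asyncio

async def call_llm(name: str, delay: float) -> str:
    """Stand-in for an I/O-bound LLM call: awaiting frees the event loop
    so other coroutines can run while this one waits."""
    await asyncio.sleep(delay)
    return f"{name}: done"

async def main() -> list[str]:
    # Both coroutines run concurrently: total wall time ~= max(delays), not the sum
    return await asyncio.gather(call_llm("intake", 0.05), call_llm("ratio", 0.05))

results = asyncio.run(main())
```

Swap `asyncio.sleep` for a real `await client.chat.completions.create(...)` and the same structure lets one FastAPI worker serve many requests while waiting on LLM responses.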
Pydantic models — define, validate, nest models
class LoanRequest(BaseModel): credit_score: int = Field(ge=300, le=850). Auto validation, type coercion, JSON serialization. model_config for env file reading. FastAPI uses Pydantic for all request/response bodies.
MUSTProject
TypedDict — LangGraph state schema pattern
from typing import TypedDict. class LoanState(TypedDict): credit_score: int; agent_traces: list. LangGraph uses a TypedDict schema for state. Provides type safety without Pydantic overhead. Each agent returns a partial dict containing only the LoanState keys it changed.
MUSTProject
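The partial-update pattern above, sketched without LangGraph itself — the dict-merge line approximates how updates land on state (LangGraph's actual merge uses reducers; field names are illustrative):

```python
from typing import TypedDict

class LoanState(TypedDict, total=False):
    credit_score: int
    dti_ratio: float
    agent_traces: list

def ratio_agent(state: LoanState) -> dict:
    """Reads the full state, returns ONLY the fields it changed."""
    dti = 0.38  # stand-in for the real ratio computation
    return {"dti_ratio": dti,
            "agent_traces": state["agent_traces"] + ["ratio: dti=0.38"]}

state: LoanState = {"credit_score": 720, "agent_traces": []}
# Merge the partial update: only the returned keys are overwritten
state = {**state, **ratio_agent(state)}
```

Returning only changed fields keeps agents decoupled: no agent can accidentally clobber state it never touched.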
Dict/list operations — comprehensions, sorting, filtering
[x for x in items if x['score'] > 0.5]. sorted(chunks, key=lambda c: c.rerank_score, reverse=True). {k: v for k,v in d.items() if v}. {**dict1, **dict2} to merge. d.get('key', default) for safe access.
MUST
Exception handling — try/except/finally with fallbacks
try: await llm_call(). except Exception as e: logger.error(e); return fallback_defaults. finally: cleanup. In Project every agent has this — pipeline never crashes, errors go to state.errors[].
MUST
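The fallback pattern above as a reusable wrapper — a sketch assuming the agent is a plain function and errors accumulate in `state['errors']` as described (names and fallback values are illustrative):

```python
import logging

logger = logging.getLogger("pipeline")

FALLBACK = {"score": 7, "verdict": "PROCEED"}   # safe defaults on agent failure

def run_agent(agent_fn, state: dict) -> dict:
    """Wrap any agent call: on failure, log it, record the error in
    state['errors'], and return fallback defaults. Pipeline never crashes."""
    try:
        return agent_fn(state)
    except Exception as exc:
        logger.error("agent failed: %s", exc)
        state.setdefault("errors", []).append(str(exc))
        return dict(FALLBACK)   # copy, so callers can't mutate the shared defaults
```

The same wrapper works for async agents by making it `async def` and awaiting `agent_fn`.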
JSON handling — loads, dumps, nested access, safe parsing
json.loads(string)→dict. json.dumps(dict)→string. Nested: data['traces'][0]['reasoning']. Safe: data.get('key', default). Strip markdown before parsing LLM JSON: text.replace('```json','').replace('```','').strip().
MUST
Decorators, context managers, generators
Decorator: @lru_cache on settings singleton. Context manager: async with AsyncSessionLocal() as db — handles cleanup. Generator: yield in FastAPI dependency injection — Depends(get_db) uses yield.
GOOD
AI/ML Libraries
OpenAI SDK — AsyncOpenAI, chat.completions, embeddings
AsyncOpenAI(api_key=key). await client.chat.completions.create(model, messages, temperature, max_tokens, response_format). await client.embeddings.create(model, input). Project uses this for all GPT-4o + embedding calls.
MUSTProject
FastAPI — routers, Depends, async endpoints, startup events
@router.post('/loans'). async def endpoint(request: LoanRequest, db: AsyncSession = Depends(get_db)). @app.on_event('startup'): await create_tables(). APIRouter for modular routing. HTTPException for errors.
MUSTProject
SQLAlchemy async — AsyncSession, select, execute, scalar
async with AsyncSessionLocal() as db. result = await db.execute(select(Loan).where(Loan.id==id)). loan = result.scalar_one_or_none(). db.add(new_obj). await db.commit(). echo=True logs all SQL in debug mode.
MUSTProject
NumPy — arrays, dot product, cosine similarity implementation
import numpy as np. np.dot(a,b)/(np.linalg.norm(a)*np.linalg.norm(b)) = cosine similarity. np.argsort(scores)[::-1][:k] = top-k indices. Broadcasting for batch operations on embeddings.
MUST
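The two one-liners above, spelled out — cosine similarity plus argsort-based top-k (illustrative helper names; a real vector DB does this server-side):

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: dot product over the product of norms."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def top_k(query: np.ndarray, vectors: np.ndarray, k: int) -> np.ndarray:
    """Indices of the k rows most similar to the query, best first."""
    scores = np.array([cosine_sim(query, v) for v in vectors])
    return np.argsort(scores)[::-1][:k]   # ascending sort, reversed, truncated
```

This is exact O(n) search — fine for interview whiteboards and small corpora; pgvector's HNSW index replaces it at scale.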
HuggingFace — sentence-transformers, CrossEncoder, AutoTokenizer
CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2'). model.predict([(query, passage)]) → relevance score. Project reranker uses this. sentence_transformers for bi-encoder embeddings as alternative to OpenAI.
GOODProject
boto3 — Bedrock InvokeModel, credential chain
import boto3. client = boto3.client('bedrock-runtime', region_name='us-east-1'). response = client.converse(modelId='anthropic.claude-3-5-haiku...', messages=[...]). Credentials from ~/.aws/credentials or IAM role.
GOODProject
Unsloth + PEFT + TRL — QLoRA fine-tuning stack
FastLanguageModel.from_pretrained(model, load_in_4bit=True). get_peft_model(r=16, target_modules=[...]). SFTTrainer(model, dataset, args=TrainingArguments(...)). trainer.train(). save_pretrained_gguf() for Ollama.
GOODProject
DSA — What Gen AI Interviews Actually Test
Heaps / Priority Queues — top-K retrieval problems
import heapq. heapq.nlargest(k, items, key=...) for top-k chunks. O(n log k) vs O(n log n) sort. Directly applicable to RAG — finding top-K similar embeddings efficiently.
MUSTProject
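The `heapq.nlargest` pattern above on toy retrieval data (illustrative chunk dicts):

```python
import heapq

chunks = [{"id": "a", "score": 0.91}, {"id": "b", "score": 0.42},
          {"id": "c", "score": 0.77}, {"id": "d", "score": 0.88}]

# O(n log k): a size-k heap keeps only the k best seen so far,
# versus O(n log n) for fully sorting just to take the head
top2 = heapq.nlargest(2, chunks, key=lambda c: c["score"])
```

For k much smaller than n — top-5 chunks out of thousands — the heap wins clearly.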
HashMaps / dicts — O(1) lookup, deduplication patterns
RRF fusion uses dicts: {chunk_id: score}. Dedup retrieved chunks: seen=set(); [x for x in chunks if x.id not in seen and not seen.add(x.id)]. Counter for frequency analysis.
MUST
BFS/DFS on graphs — LangGraph traversal
LangGraph IS a graph. Nodes = agents. Edges = transitions. Conditional edges = branching. BFS would process all agents at same depth level first. DFS would go deep in one branch. LangGraph executes nodes in dependency order — an agent runs only after the agents it depends on complete.
MUST
Sliding window — chunking with overlap implementation
Project chunker IS a sliding window: start=0, while start < len(words): end=start+chunk_size; chunk=words[start:end]; start+=(chunk_size-overlap). O(n) time, O(chunk_size) space per chunk.
GOODProject
Time/space complexity — Big O for every solution
Always state O(n), O(log n) etc. In RAG context: exact search O(n×d), HNSW O(log n), BM25 O(n×q), RRF O(n log n). Interviewers respect candidates who reason about complexity.
GOOD
08
☁️ Docker & Cloud — AWS Bedrock
Docker
docker-compose — Project uses it for postgres+pgvector
docker compose up — one command, postgres+pgvector ready (pgvector/pgvector:pg16 image). services, volumes, networks, depends_on, environment variables. Project: single compose file for full dev stack.
MUSTProject
Dockerfile — FROM, RUN, COPY, ENV, CMD, multi-stage builds
Multi-stage: builder stage installs deps, final stage copies only artifacts. Reduces image size 10x. Layer caching: COPY requirements.txt → RUN pip install → COPY app (requirements change less than code).
MUST
Kubernetes — pods, deployments, HPA for LLM workloads
NICE
AWS for Gen AI
AWS Bedrock — Converse API, model IDs, why it exists
Managed LLM API in AWS. Access Claude, Llama, Titan without managing GPUs. Data stays in your VPC — critical for financial PII. Converse API: unified interface for all models. Model IDs: anthropic.claude-3-5-sonnet-20241022-v2:0.
Why did you use AWS Bedrock instead of OpenAI in production?
MUSTProject
IAM — roles, policies, least privilege for Bedrock
Project IAM policy: bedrock:InvokeModel on specific model ARNs only — not *. Least privilege: service gets only what it needs. IAM role on EC2 instance → no hardcoded keys. SCPs at org level block non-approved models.
MUSTProject
Environment variables and config management — 12-factor app
.env → pydantic-settings reads → settings singleton. Never hardcode secrets. Different .env per environment (dev/prod). Project: APP_ENV=production activates Bedrock routing. Secrets in AWS Secrets Manager for prod.
MUST
Ollama — local LLM serving, LAN setup, Modelfile
OLLAMA_HOST=0.0.0.0:11434 on Windows laptop. Firewall rule for port 11434. Mac Mini .env: OLLAMA_BASE_URL=http://192.168.1.5:11434/v1. Ollama speaks OpenAI-compatible API — same SDK, just different base_url.
GOODProject
EC2 + RDS — deploy FastAPI backend with managed PostgreSQL
GOOD
vLLM, TGI — production LLM serving at scale
NICE
09
🎯 Behavioural — Stories to Prepare
Key Stories
A technical decision you made and defended
«I chose hybrid RAG because in early testing, exact regulatory figures like '43% DTI limit' weren't reliably retrieved by dense search alone. Adding BM25 with RRF fusion improved context recall significantly — I measured it with RAGAS before committing to the architecture.»
Tell me about a technical decision you're proud of
MUSTProject
A hard bug you debugged
«UndefinedTableError on first run. Traced it: PostgreSQL was running, DB existed, but pgvector extension wasn't created. Fixed with CREATE EXTENSION IF NOT EXISTS vector. Also learned to always run PYTHONPATH=. to avoid ModuleNotFoundError. Systematic debugging — environment first, then code.»
Tell me about a hard problem you solved
MUSTProject
How you learn new technologies quickly
«I learn by building. LangGraph, QLoRA, RAGAS — all new to me. I built Project to understand them hands-on. I read the docs, find one working example, then extend it. I also ask specific questions to understand the WHY not just the HOW.»
How do you keep up with the rapidly changing AI landscape?
MUST
How your 4 years IT experience transfers to Gen AI
«Systems thinking, debugging mindset, production constraints, API design, data flows — all transfer directly. Gen AI is still software engineering. My IT background means I think about reliability, monitoring, cost, and scale — not just model accuracy. That's rare in pure ML people.»
Why are you transitioning to Gen AI?
MUST
Trade-off you made between quality and cost
«Used Llama 3.1 8B for Intake and Ratio agents instead of GPT-4o. Saved ~80% cost for those calls. Measured quality with RAGAS — no significant difference for structured data validation tasks. Kept GPT-4o only for complex reasoning agents.»
GOODProject
Something you would do differently
«I'd add streaming responses earlier. Right now the user waits 20–30 seconds for the full pipeline. Server-Sent Events to stream each agent result as it completes would dramatically improve perceived performance.»
What would you improve about your project?
GOODProject