// gen ai engineer · interview prep · 2026

Complete Interview Checklist

Project edition · 4+ yrs IT experience · Python beginner → Gen AI Engineer

9 sections
100+ items
Project hands-on proof
MUST Every interview
GOOD Strong differentiator
NICE Bonus points
Project You built this — own it
01
🏦 Your Project (Most Important)
Elevator Pitch
One-line summary — say it naturally in under 60 seconds
«Agentic RAG system
Tell me about a Gen AI project you built
MUSTProject
Draw the full pipeline from memory without hesitation
Each agent reads previous state, returns only changed fields, LangGraph merges.
Walk me through your pipeline architecture
MUSTProject
Technical Decisions — Defend Every One
Why LangGraph over plain LangChain?
Stateful graph — each agent reads all previous outputs. Conditional edges — compliance FAIL → skip to decline. Auditability — every state change is traceable. Plain LangChain chains are stateless pipelines — not suited to multi-agent workflows with shared state.
MUSTProject
Why hybrid retrieval (Dense + BM25 + RRF)?
Dense catches semantic meaning — «debt ratio» matches «DTI». BM25 catches exact terms — «43%» or «QM rule». Either alone misses things. RRF merges both fairly without tuning weights.
Why not just use vector embeddings for retrieval?
MUSTProject
Why different models per agent?
Cost optimization. Llama free for simple validation (Intake/Ratio). GPT-4o for complex reasoning (Underwriting/Decision). Haiku cheap for PII agents in prod. Sonnet for final decisions. Wrong model assignment = 10x unnecessary cost.
How did you handle LLM costs?
MUSTProject
Why AWS Bedrock for PII agents in production?
Data never leaves AWS VPC — regulatory requirement for financial PII. External API calls (OpenAI) are prohibited for SSN, income, credit data under GLBA. Bedrock = same Claude/Llama models, zero data egress.
How did you handle data privacy compliance?
MUSTProject
Why pgvector instead of Pinecone?
Single DB for vectors + metadata + application data — no extra service. Metadata filtering built-in (loan_type, state). PostgreSQL transactions = ACID compliance. Simpler ops, lower cost, no vendor lock-in.
Why not a dedicated vector database?
MUSTProject
Why fine-tune Llama 3.1 8B for RatioAgent?
$0 marginal cost vs $0.005/call GPT-4o. After 500 examples + 3 epochs QLoRA, narrative quality reaches ~87% of GPT-4o. Knowledge distillation — GPT-4o teacher generates training data for Llama student.
GOODProject
How do you handle LLM failures gracefully?
Every agent has try/except with fallback defaults. IntakeAgent failure → score=7, PROCEED. Pipeline never crashes — errors propagate in state.errors[]. Audit agent logs all failures with full trace.
GOODProject
Numbers You Must Know
Cost per pipeline run, pipeline latency, token usage
Dev: ~$0.03–0.05/run (OpenAI only). Latency: 20–30s end-to-end (7 agents). ~2000 tokens per application. Prod: ~60–70% cheaper via Bedrock vs OpenAI for PII agents.
MUSTProject
RAGAS scores — run evaluation before interview, know your actual numbers
Run: PYTHONPATH=. python scripts/run_evaluation.py — know your faithfulness, context_precision, context_recall, answer_relevancy scores. Real numbers beat estimates every time.
MUSTProject
Fine-tuning specs: model size, training time, data size, hardware
Llama 3.1 8B, 4-bit QLoRA, r=16, 500 examples, 3 epochs, ~3 hours on GTX 1650 (4GB VRAM), adapter weights ~50MB. Cost: $0.50 for training data, $0 for local compute.
MUSTProject
02
🔍 RAG Pipeline — Retrieval Augmented Generation
Core Concepts
What is RAG and why do we need it?
LLMs hallucinate and have knowledge cutoffs. RAG grounds the model in real documents. Instead of relying on training data, the model reads actual policy chunks before answering. Results are verifiable and citable.
What is RAG and when would you use it?
MUST
Chunking: fixed-size, sentence, semantic, recursive — and chunk overlap
Project: 512 words, 50-word overlap. Overlap prevents losing context at boundaries — sentence at the end of chunk 1 also appears at start of chunk 2. Word-based not char-based for natural boundaries.
How do you prepare documents for RAG?
MUSTProject
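The word-based sliding-window chunking described above can be sketched in a few lines. A minimal sketch — the function name `chunk_words` is illustrative, not the project's actual code:

```python
def chunk_words(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
    """Word-based sliding-window chunking: each chunk shares `overlap`
    words with the previous one, so sentences at chunk boundaries
    appear in both neighbouring chunks."""
    words = text.split()
    chunks = []
    start = 0
    while start < len(words):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
        start += chunk_size - overlap  # step forward, keeping the overlap
    return chunks
```

Word-based splitting keeps natural boundaries; the overlap guarantees no sentence is lost at a cut point.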
What are embeddings? What model, cost, dimensions?
1536 floats encoding semantic meaning. Similar texts → similar vectors → small cosine distance. Project: text-embedding-3-small, $0.02/1M tokens, 1536 dims. Fast, cheap, strong for retrieval.
What are vector embeddings and how do you choose a model?
MUSTProject
Metadata preservation — why it matters for citations
Every chunk tagged with doc_name, page_num, section, subsection, char_offset. Extracted BEFORE chunking — each chunk inherits source metadata. Dashboard shows «mortgage_policy.pdf page 4 — Section 2.1» not just «some chunk».
MUSTProject
Retrieval Techniques — The 5-Stage Pipeline
Stage 1: Query rewriting — why expand the query?
One query misses synonyms. «FHA loan LTV» misses «Federal Housing Administration loan-to-value ratio». Generate 3 variations with GPT-4o-mini, cast wider net, dedup before retrieval.
MUSTProject
Stage 2: Dense (pgvector cosine) + BM25 — why both?
Dense: understands meaning, catches synonyms. BM25: catches exact terms — «43%» or «QM rule» — that dense misses. Together they cover semantic AND lexical relevance. Neither alone is sufficient for regulatory documents.
Why hybrid retrieval over just embeddings?
MUSTProject
Stage 3: RRF fusion — formula and why not weighted sum
RRF(chunk) = 1/(k+rank_dense) + 1/(k+rank_bm25). k=60 standard. Chunks ranked consistently high in both lists win. Weighted sum requires tuning per dataset. RRF works out of the box — no hyperparameter search needed.
How do you merge dense and BM25 results?
MUSTProject
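The RRF formula above fits in one small function. A minimal sketch assuming each ranking is a list of chunk IDs, best first (`rrf_fuse` is an illustrative name):

```python
def rrf_fuse(dense_ranking: list[str], bm25_ranking: list[str], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: score(chunk) = sum over lists of 1/(k + rank).
    Chunks ranked consistently high in both lists float to the top;
    no per-dataset weight tuning needed."""
    scores: dict[str, float] = {}
    for ranking in (dense_ranking, bm25_ranking):
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Note how a chunk appearing in both lists ("b" mid-rank in each) beats a chunk that tops only one list — that is the whole point of RRF.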
Stage 4: MMR — relevance vs diversity tradeoff
Without MMR, top 5 chunks repeat same section. MMR score = lambda * relevance - (1-lambda) * max_similarity_to_selected. lambda=0.7 → 70% relevance, 30% diversity. Keeps context window useful.
How do you handle redundant retrieved chunks?
GOODProject
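The MMR scoring rule above, sketched greedily over unit-normalised vectors. Assumptions: cosine similarity reduces to a dot product because vectors are pre-normalised, and `mmr_select` is an illustrative name; the test below uses a toy lambda rather than the project's 0.7:

```python
import numpy as np

def mmr_select(query_vec: np.ndarray, chunk_vecs: np.ndarray,
               k: int = 3, lam: float = 0.7) -> list[int]:
    """Greedy MMR: score = lam * sim(query, c) - (1-lam) * max sim(c, selected).
    Assumes all vectors are unit-normalised (dot product = cosine sim)."""
    sims = chunk_vecs @ query_vec                 # relevance to the query
    selected = [int(np.argmax(sims))]             # most relevant chunk first
    while len(selected) < min(k, len(chunk_vecs)):
        best, best_score = None, -np.inf
        for i in range(len(chunk_vecs)):
            if i in selected:
                continue
            # penalty: similarity to the closest already-selected chunk
            redundancy = max(float(chunk_vecs[i] @ chunk_vecs[j]) for j in selected)
            score = lam * float(sims[i]) - (1 - lam) * redundancy
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
    return selected
```

With a duplicate chunk in the candidates, MMR skips it in favour of a less relevant but diverse one — exactly the "top 5 chunks repeat same section" failure it fixes.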
Stage 5: Cross-encoder reranker — bi-encoder vs cross-encoder
Bi-encoder (embeddings): scores query and chunk separately — fast but approximate. Cross-encoder: reads query AND chunk together — much more accurate. Too slow for 1000s of chunks, perfect for final top-5. Model: ms-marco-MiniLM-L-6-v2.
What is a reranker and when do you use it?
GOODProject
Vector Databases
pgvector — cosine similarity, HNSW vs IVFFlat indexes
pgvector adds vector(1536) column type to Postgres. <=> operator = cosine distance. HNSW: faster queries, more memory. IVFFlat: faster build, approximate. Project: default index, small dataset.
MUSTProject
Pinecone vs Weaviate vs Chroma vs Qdrant vs pgvector — tradeoffs
Pinecone: managed, expensive, no SQL joins. Chroma: local dev only. Qdrant: self-hosted prod. Weaviate: multimodal. pgvector: integrated with Postgres, ACID, free. Choose pgvector when you already have Postgres.
MUST
ANN (Approximate Nearest Neighbor) — why not exact search at scale?
Exact search = O(n) comparisons. 10M vectors × 1536 dims = too slow. ANN trades tiny accuracy loss for 100x speed. HNSW builds a multi-layer graph. Query traverses layers to find approximate nearest neighbors.
GOOD
Agentic RAG & Orchestration
LangGraph — StateGraph, nodes, edges, ainvoke, state merging
StateGraph defines typed state dict. add_node() adds agent functions. add_edge() connects them. Conditional edges route based on state values. ainvoke() runs full pipeline async. Each node returns only changed fields — LangGraph merges.
How does state flow between your agents?
MUSTProject
Multi-agent patterns: sequential, parallel, supervisor, hierarchical
Project: sequential pipeline (each agent waits for previous). Parallel: run compliance + policy simultaneously. Supervisor: orchestrator agent routes to sub-agents. Hierarchical: nested agent graphs.
MUST
Tool use / function calling — structured JSON outputs from LLMs
Project: response_format=json_object for all agents. Prompt specifies exact JSON schema. Parser validates and falls back to defaults on parse error. More reliable than parsing prose.
GOODProject
Graph RAG, RAPTOR, corrective RAG, self-RAG — advanced patterns
NICE
03
📊 RAGAS — Evaluation & Hallucination Detection
Core RAGAS Metrics
Faithfulness — is every claim traceable to retrieved context?
Score 0–1. LLM checks if each claim in the answer can be found in the context chunks. High faithfulness = no hallucination. Project Stage 7 validator does this check per-agent.
What is faithfulness in RAGAS and how do you measure it?
MUSTProject
Context Precision — of retrieved chunks, how many were relevant?
Precision = relevant retrieved / total retrieved. High precision = retrieval is clean, not noisy. Low precision = lots of irrelevant chunks wasting context window. Improved by reranker.
MUSTProject
Context Recall — of all relevant chunks in DB, how many did we retrieve?
Recall = relevant retrieved / total relevant in DB. Low recall = missed important policy sections. Improved by hybrid retrieval + query rewriting. You want BOTH precision and recall high.
MUSTProject
Answer Relevancy — does the answer actually address the question?
A faithful answer can still be irrelevant (answers different question). Measured by asking LLM to generate questions from the answer, then checking similarity to original question. Score 0–1.
MUSTProject
Hallucination detection — how do you catch it?
Low faithfulness = hallucination signal. LLM generates answer then checks each claim against context. Claims not in context = hallucinated. Project: Stage 7 validator + fallback to «insufficient context» rather than hallucinate.
How do you prevent LLM hallucinations in production?
MUSTProject
LLM-as-judge — what it is, pros, cons
Use GPT-4o to score GPT-4o outputs. Pro: cheap, scalable, no human labelers. Con: self-consistency bias, inconsistent on edge cases. Mitigation: multiple judges + average, use stronger judge than judged model.
What is LLM-as-judge? What are the risks?
GOOD
Evaluation Best Practices
BLEU, ROUGE — classic NLP metrics, when to use vs RAGAS
BLEU/ROUGE: n-gram overlap. Fast, no LLM needed. Problem: «The maximum LTV is 97%» and «97% LTV is the max» have low ROUGE but mean the same thing. RAGAS is semantic — better for LLM output evaluation.
GOOD
Evaluation dataset — how to build a ground truth set
Project: 15 Q&A pairs in loan_eval_dataset.json with ground truth answers and reference chunks. Generated with GPT-4o reading actual policy docs. Run periodically to detect RAG quality drift.
GOODProject
LangSmith — tracing, dataset management, prompt versioning
Set LANGCHAIN_TRACING_V2=true → every LangGraph run traced in LangSmith dashboard.
NICEProject
04
🧠 AI / ML Foundations & LLM Theory
Core ML
Supervised vs Unsupervised vs Reinforcement Learning
Supervised: labeled data, predict output. Unsupervised: find patterns without labels. RL: learn from rewards — used in RLHF for LLM alignment. RAG pairs learned retrieval (contrastively trained embeddings) with a generative LLM.
MUST
Backpropagation and chain rule — whiteboard this
Forward pass computes loss. Backward pass computes dL/dW for each layer using chain rule: dL/dW = dL/dy × dy/dW. Gradients flow backward. Optimizer updates W = W - lr × dL/dW.
Explain backpropagation step by step
MUST
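A whiteboard-sized numeric example of the chain rule above — one scalar weight, y = w·x, squared-error loss, with a finite-difference check that the analytic gradient is right. Purely illustrative numbers:

```python
# Chain rule on the smallest possible "network": y = w * x, L = (y - t)^2
# dL/dw = dL/dy * dy/dw = 2 * (y - t) * x

def loss(w: float, x: float, t: float) -> float:
    return (w * x - t) ** 2

def grad(w: float, x: float, t: float) -> float:
    return 2.0 * (w * x - t) * x   # analytic gradient via the chain rule

w, x, t, lr = 0.5, 2.0, 3.0, 0.1

# Sanity check: analytic gradient matches the numeric (finite-difference) one
eps = 1e-6
numeric = (loss(w + eps, x, t) - loss(w - eps, x, t)) / (2 * eps)
assert abs(grad(w, x, t) - numeric) < 1e-4

# One optimizer step: W = W - lr * dL/dW
w_new = w - lr * grad(w, x, t)
```

One step moves w from 0.5 toward the target and the loss drops — the same mechanics scale up to every layer of a transformer.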
Loss functions: Cross-entropy, MSE, BCE, Contrastive
Cross-entropy: classification (LLM next-token prediction). MSE: regression. BCE: binary classification. Contrastive: pull similar embeddings together, push different ones apart — used to train embedding models.
MUST
Gradient Descent: SGD, Adam, AdamW — why AdamW for LLMs?
SGD: fixed learning rate, noisy. Adam: adaptive per-parameter rates, fast convergence. AdamW: Adam + weight decay decoupled — prevents overfitting. Default for all LLM fine-tuning. Project uses adamw_8bit.
MUST
Activation functions: ReLU, GELU, Sigmoid, Softmax
ReLU: max(0,x), fast, dead neuron problem. GELU: smooth version of ReLU — used in GPT/BERT. Sigmoid: 0–1, binary gates. Softmax: converts logits to probabilities (LLM output layer, attention scores).
MUST
Batch Norm vs Layer Norm — why Transformers use LayerNorm
BatchNorm: normalizes across batch dimension — breaks with batch size 1 or variable-length sequences. LayerNorm: normalizes across feature dimension per sample — works for any batch size. All transformers use LayerNorm.
MUST
Transformer Architecture
Self-attention: Q, K, V matrices — the formula
Attention(Q,K,V) = softmax(QKᵀ / √dk) × V. Q=what I'm looking for. K=what I have. V=what I return. Divide by √dk prevents softmax saturation in high dimensions. Must know this cold.
Explain the attention mechanism mathematically
MUST
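The formula above in a few lines of NumPy — a minimal single-head sketch (no masking, no batching):

```python
import numpy as np

def attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                       # query-key similarity
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)        # row-wise softmax
    return weights @ V, weights                           # weighted sum of values
```

Each output row is a probability-weighted mix of the value rows; the √dk divisor keeps the softmax from saturating as dimensions grow.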
Multi-head attention — why multiple heads?
Each head can attend to different aspects simultaneously — syntax, semantics, coreference. 12 heads in GPT-2, 32 in Llama 3.1. Outputs concatenated then projected. Richer representations than single head.
MUST
Causal masking (decoder) vs bidirectional (encoder)
GPT = decoder-only, causal mask (can't look at future tokens). BERT = encoder-only, bidirectional. T5 = encoder-decoder. For generation tasks (Project agents): decoder-only models (GPT-4o, Llama).
MUST
Positional encoding: sinusoidal vs learned vs RoPE
Sinusoidal: fixed formula, original transformer. Learned: trained embeddings. RoPE (Rotary Position Embedding): Llama, GPT-NeoX — rotates Q,K vectors by position angle. Better long-context extrapolation than absolute positional encodings.
MUST
Context window, tokens, tokenization (BPE) — impact on RAG design
~1.3 tokens per word. GPT-4o: 128k context. Llama 3.1: 128k. BPE: byte-pair encoding splits words into subwords. Project: top 3 chunks × 512 words ≈ 2000 tokens — well within limits. Chunk size chosen to respect context budget.
MUSTProject
KV Cache — what it is, why it speeds up inference
During generation, Key and Value matrices are computed once per token and cached. Subsequent tokens reuse the cache instead of recomputing. Reduces O(n²) attention to O(n) per new token. Critical for long outputs.
GOOD
O(n²) attention complexity — why long contexts are expensive
Every token attends to every other token. 1000 tokens = 1M attention scores. 10k tokens = 100M. Directly explains why RAG is better than stuffing full document in context — retrieval is O(log n) with HNSW.
GOODProject
Temperature, Top-K, Top-P — what each controls
Temperature: randomness. 0=deterministic, 1=creative. Top-K: sample from top K tokens only. Top-P: sample from smallest set summing to P probability mass. Project: temperature=0.1 for agents (consistent JSON), 0.3 for training data generation.
How did you set temperature in your system?
MUSTProject
Pre-training vs SFT vs RLHF — what each stage does
Pre-training: predict next token on massive corpus — base model. SFT (Supervised Fine-Tuning): train on instruction-response pairs — follows instructions. RLHF: human preference rankings train reward model → PPO optimizes LLM output quality.
MUST
Residual connections — why critical for deep networks
Skip connections: output = layer(x) + x. Gradients flow directly to earlier layers bypassing intermediate layers — prevents vanishing gradients. Enables training 100+ layer networks. Every transformer block has residual connections.
GOOD
05
⚙️ Fine-Tuning — QLoRA on Llama 3.1
Core Concepts
Full fine-tuning vs LoRA vs QLoRA — key differences
Full: update all 8B params, 80GB+ GPU, days. LoRA: freeze base weights, add trainable A×B matrices (0.1% params), 16GB. QLoRA: LoRA + 4-bit quantization — fits an 8B model on a 4GB consumer GPU with Unsloth. Project: QLoRA on GTX 1650.
What is the difference between LoRA and QLoRA?
MUSTProject
LoRA math — what A and B matrices do
Original W frozen. Add ΔW = A×B where A is d×r, B is r×d. Output = Wx + scale×ABx. Rank r=16 in Project. Total trainable params = 2×d×r per layer ≈ 50MB vs 16GB full model. Scale = lora_alpha/r.
How does LoRA reduce memory requirements?
MUSTProject
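The parameter arithmetic above, worked through. Assumed illustrative dims (a 4096-wide Llama-style projection, r=16 as in the project); the shape check uses tiny matrices:

```python
import numpy as np

d, r = 4096, 16                       # hidden dim and LoRA rank
full_params = d * d                   # one frozen weight matrix W
lora_params = d * r + r * d           # trainable A (d x r) + B (r x d)
ratio = lora_params / full_params     # fraction of params actually trained

# Shape check on a tiny example: delta_W = A @ B matches W's shape,
# so the update can simply be added: output = W x + scale * A B x
A, B = np.zeros((8, 2)), np.zeros((2, 8))
assert (A @ B).shape == (8, 8)
```

131,072 trainable values versus ~16.8M frozen per matrix — under 1% — which is why the adapter weights total ~50MB instead of gigabytes.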
Why fine-tune vs just prompt engineer?
Prompt engineering has limits — can't change model's intrinsic style. Fine-tuning teaches domain vocabulary, output format, tone. Project: underwriter narrative style impossible via prompting alone. $0/call vs $0.005 GPT-4o after fine-tuning.
When would you fine-tune vs prompt engineer?
MUSTProject
Knowledge distillation — GPT-4o teacher → Llama student
Use large model (GPT-4o) to generate training data for smaller model (Llama 3.1 8B). Student learns teacher's style and quality. Project: 500 GPT-4o narratives at $0.50 → fine-tuned Llama achieves 87% of GPT-4o quality at $0/call.
What is knowledge distillation?
MUSTProject
LoRA rank (r), alpha, target_modules — how you chose them
r=16: sweet spot for 8B on consumer GPU. r=8 if OOM, r=32 if 24GB. alpha=16: scale=alpha/r=1.0. target_modules: q,k,v,o projections + gate,up,down FFN layers — all attention + FFN in Llama architecture. lora_dropout=0 with QLoRA (Unsloth recommendation).
GOODProject
Catastrophic forgetting — why LoRA avoids it
Full fine-tuning can overwrite base weights — model forgets general knowledge. LoRA freezes all original weights — additive only. Base capabilities fully preserved. Model stays good at general tasks while gaining domain expertise.
GOOD
Alpaca prompt format, SFTTrainer, gradient accumulation
Alpaca: «### Instruction: {task} ### Response: {output}». SFTTrainer: HuggingFace trainer optimized for SFT. Gradient accumulation: batch_size=1 × steps=8 = effective batch 8 — same as batch_size=8 but fits in 4GB VRAM.
GOODProject
GGUF format, Ollama Modelfile — deploy fine-tuned model locally
GGUF: llama.cpp quantization format. Q4_K_M: 4-bit, good quality/size balance. Modelfile: FROM /path/to/model.gguf → ollama create Project-ratio. Now Ollama serves it like any other model at localhost:11434.
GOODProject
Model merging, speculative decoding, MoE — advanced serving
NICE
06
🏗️ System Design & Architecture
Architecture Decisions
Draw Project full architecture from memory
Client → FastAPI → LangGraph pipeline → 7 agents → PostgreSQL/pgvector. Ollama on laptop (dev) → LAN to Mac Mini. OpenAI API for GPT-4o agents. Bedrock for prod PII agents. RAGAS eval pipeline separate. LangSmith for traces.
Walk me through your system architecture
MUSTProject
Why async/await throughout? What problem does it solve?
FastAPI is async. LLM API calls and DB queries are I/O bound — CPU does nothing while waiting. Without async: server blocks all other requests. With async: handles 100s of concurrent requests while awaiting LLM responses. asyncio = single-threaded cooperative multitasking.
Why did you use async throughout your codebase?
MUST
Dependency injection — FastAPI Depends() pattern
FastAPI calls Depends(get_db) before each endpoint, injects AsyncSession. Handles open/commit/rollback/close automatically. Never leaks connections. Testable — swap real DB for mock in unit tests. Same pattern for config, auth.
How does dependency injection work in FastAPI?
MUST
How would you scale to 10,000 loans/day?
Async FastAPI handles concurrency already. Add Redis+Celery queue for background processing. Horizontal scale API servers behind load balancer. Read replicas for pgvector queries. Cache embeddings in Redis. Bedrock auto-scales. Connection pool tuning.
How would you scale your system?
GOODProject
What breaks first at scale? Monitoring strategy.
LLM API rate limits hit first, then DB connection pool. Monitor: latency per agent, token usage/cost, RAG faithfulness drift (signals retrieval quality degradation), error rate per agent, queue depth. Alert on faithfulness < 0.7.
What are the failure points in your system?
GOODProject
Common System Design Questions
Design a RAG system for a bank from scratch
Ingest: PDF → pypdf → chunker → embedder → pgvector. Retrieval: hybrid dense+BM25 → RRF → MMR → reranker. Generation: top-3 chunks + question → LLM → answer + citations. Evaluation: RAGAS faithfulness + context precision. Monitoring: LangSmith traces.
MUST
Design a multi-agent pipeline for document processing
LangGraph StateGraph. TypedDict state schema. Each agent: receives full state, returns changed fields only. Conditional edges on agent verdicts. Checkpoints for resume on failure. Async throughout. Audit node at end logs everything to DB.
MUST
How would you A/B test two LLM models in production?
Feature flag routes X% traffic to model B. Log all outputs + latency + cost. Evaluate with RAGAS on both. Statistical significance test before full rollout. Shadow mode first — run both, serve model A, compare B offline.
GOOD
What would you improve in Project v4?
Streaming responses to frontend (SSE). Parallel execution of compliance + policy agents (currently sequential). Cache embeddings for repeated queries. HNSW index on pgvector for faster retrieval. Add confidence intervals to RAGAS scores.
What are the limitations of your current design?
GOODProject
07
🐍 Python — What They Actually Test
Python Fundamentals for Gen AI
async/await — read and modify async code confidently
async def = pausable function. await = pause here, let others run. asyncio.gather() = run multiple coroutines concurrently. asyncio.run() = start event loop. Don't need to write from scratch — need to understand and modify.
MUSTProject
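A minimal runnable sketch of the pattern above: `asyncio.sleep` stands in for an I/O-bound LLM call, and `gather` runs both "calls" concurrently (names are illustrative):

```python
import asyncio

async def call_llm(name: str, delay: float) -> str:
    """Stand-in for an I/O-bound LLM call: awaiting frees the event loop
    so other coroutines can run while this one waits."""
    await asyncio.sleep(delay)
    return f"{name}: done"

async def main() -> list[str]:
    # Both coroutines run concurrently: total wall time ~= max(delays), not the sum
    return await asyncio.gather(call_llm("intake", 0.05), call_llm("ratio", 0.05))

results = asyncio.run(main())
```

Swap `asyncio.sleep` for a real `await client.chat.completions.create(...)` and the same structure lets one FastAPI worker serve many requests while waiting on LLM responses.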
Pydantic models — define, validate, nest models
class LoanRequest(BaseModel): credit_score: int = Field(ge=300, le=850). Auto validation, type coercion, JSON serialization. model_config for env file reading. FastAPI uses Pydantic for all request/response bodies.
MUSTProject
TypedDict — LangGraph state schema pattern
from typing import TypedDict. class LoanState(TypedDict): credit_score: int; agent_traces: list. LangGraph uses a TypedDict schema for state. Provides type safety without Pydantic overhead. Each agent returns a partial dict containing only the LoanState keys it changed.
MUSTProject
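The partial-update pattern above, sketched without LangGraph itself — the dict-merge line approximates how updates land on state (LangGraph's actual merge uses reducers; field names are illustrative):

```python
from typing import TypedDict

class LoanState(TypedDict, total=False):
    credit_score: int
    dti_ratio: float
    agent_traces: list

def ratio_agent(state: LoanState) -> dict:
    """Reads the full state, returns ONLY the fields it changed."""
    dti = 0.38  # stand-in for the real ratio computation
    return {"dti_ratio": dti,
            "agent_traces": state["agent_traces"] + ["ratio: dti=0.38"]}

state: LoanState = {"credit_score": 720, "agent_traces": []}
# Merge the partial update: only the returned keys are overwritten
state = {**state, **ratio_agent(state)}
```

Returning only changed fields keeps agents decoupled: no agent can accidentally clobber state it never touched.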
Dict/list operations — comprehensions, sorting, filtering
[x for x in items if x['score'] > 0.5]. sorted(chunks, key=lambda c: c.rerank_score, reverse=True). {k: v for k,v in d.items() if v}. {**dict1, **dict2} to merge. d.get('key', default) for safe access.
MUST
Exception handling — try/except/finally with fallbacks
try: await llm_call(). except Exception as e: logger.error(e); return fallback_defaults. finally: cleanup. In Project every agent has this — pipeline never crashes, errors go to state.errors[].
MUST
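The fallback pattern above as a reusable wrapper — a sketch assuming the agent is a plain function and errors accumulate in `state['errors']` as described (names and fallback values are illustrative):

```python
import logging

logger = logging.getLogger("pipeline")

FALLBACK = {"score": 7, "verdict": "PROCEED"}   # safe defaults on agent failure

def run_agent(agent_fn, state: dict) -> dict:
    """Wrap any agent call: on failure, log it, record the error in
    state['errors'], and return fallback defaults. Pipeline never crashes."""
    try:
        return agent_fn(state)
    except Exception as exc:
        logger.error("agent failed: %s", exc)
        state.setdefault("errors", []).append(str(exc))
        return dict(FALLBACK)   # copy, so callers can't mutate the shared defaults
```

The same wrapper works for async agents by making it `async def` and awaiting `agent_fn`.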
JSON handling — loads, dumps, nested access, safe parsing
json.loads(string)→dict. json.dumps(dict)→string. Nested: data['traces'][0]['reasoning']. Safe: data.get('key', default). Strip markdown before parsing LLM JSON: text.replace('```json','').replace('```','').strip().
MUST
Decorators, context managers, generators
Decorator: @lru_cache on settings singleton. Context manager: async with AsyncSessionLocal() as db — handles cleanup. Generator: yield in FastAPI dependency injection — Depends(get_db) uses yield.
GOOD
AI/ML Libraries
OpenAI SDK — AsyncOpenAI, chat.completions, embeddings
AsyncOpenAI(api_key=key). await client.chat.completions.create(model, messages, temperature, max_tokens, response_format). await client.embeddings.create(model, input). Project uses this for all GPT-4o + embedding calls.
MUSTProject
FastAPI — routers, Depends, async endpoints, startup events
@router.post('/loans'). async def endpoint(request: LoanRequest, db: AsyncSession = Depends(get_db)). @app.on_event('startup'): await create_tables(). APIRouter for modular routing. HTTPException for errors.
MUSTProject
SQLAlchemy async — AsyncSession, select, execute, scalar
async with AsyncSessionLocal() as db. result = await db.execute(select(Loan).where(Loan.id==id)). loan = result.scalar_one_or_none(). db.add(new_obj). await db.commit(). echo=True logs all SQL in debug mode.
MUSTProject
NumPy — arrays, dot product, cosine similarity implementation
import numpy as np. np.dot(a,b)/(np.linalg.norm(a)*np.linalg.norm(b)) = cosine similarity. np.argsort(scores)[::-1][:k] = top-k indices. Broadcasting for batch operations on embeddings.
MUST
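The two one-liners above, spelled out — cosine similarity plus argsort-based top-k (illustrative helper names; a real vector DB does this server-side):

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: dot product over the product of norms."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def top_k(query: np.ndarray, vectors: np.ndarray, k: int) -> np.ndarray:
    """Indices of the k rows most similar to the query, best first."""
    scores = np.array([cosine_sim(query, v) for v in vectors])
    return np.argsort(scores)[::-1][:k]   # ascending sort, reversed, truncated
```

This is exact O(n) search — fine for interview whiteboards and small corpora; pgvector's HNSW index replaces it at scale.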
HuggingFace — sentence-transformers, CrossEncoder, AutoTokenizer
CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2'). model.predict([(query, passage)]) → relevance score. Project reranker uses this. sentence_transformers for bi-encoder embeddings as alternative to OpenAI.
GOODProject
boto3 — Bedrock InvokeModel, credential chain
import boto3. client = boto3.client('bedrock-runtime', region_name='us-east-1'). response = client.converse(modelId='anthropic.claude-3-5-haiku...', messages=[...]). Credentials from ~/.aws/credentials or IAM role.
GOODProject
Unsloth + PEFT + TRL — QLoRA fine-tuning stack
FastLanguageModel.from_pretrained(model, load_in_4bit=True). get_peft_model(r=16, target_modules=[...]). SFTTrainer(model, dataset, args=TrainingArguments(...)). trainer.train(). save_pretrained_gguf() for Ollama.
GOODProject
DSA — What Gen AI Interviews Actually Test
Heaps / Priority Queues — top-K retrieval problems
import heapq. heapq.nlargest(k, items, key=...) for top-k chunks. O(n log k) vs O(n log n) sort. Directly applicable to RAG — finding top-K similar embeddings efficiently.
MUSTProject
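The `heapq.nlargest` pattern above on toy retrieval data (illustrative chunk dicts):

```python
import heapq

chunks = [{"id": "a", "score": 0.91}, {"id": "b", "score": 0.42},
          {"id": "c", "score": 0.77}, {"id": "d", "score": 0.88}]

# O(n log k): a size-k heap keeps only the k best seen so far,
# versus O(n log n) for fully sorting just to take the head
top2 = heapq.nlargest(2, chunks, key=lambda c: c["score"])
```

For k much smaller than n — top-5 chunks out of thousands — the heap wins clearly.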
HashMaps / dicts — O(1) lookup, deduplication patterns
RRF fusion uses dicts: {chunk_id: score}. Dedup retrieved chunks: seen=set(); [x for x in chunks if x.id not in seen and not seen.add(x.id)]. Counter for frequency analysis.
MUST
BFS/DFS on graphs — LangGraph traversal
LangGraph IS a graph. Nodes = agents. Edges = transitions. Conditional edges = branching. BFS would process all agents at same depth level first. DFS would go deep in one branch. LangGraph executes nodes in dependency order — an agent runs only after the agents it depends on complete.
MUST
Sliding window — chunking with overlap implementation
Project chunker IS a sliding window: start=0, while start < len(words): end=start+chunk_size; chunk=words[start:end]; start+=(chunk_size-overlap). O(n) time, O(chunk_size) space per chunk.
GOODProject
Time/space complexity — Big O for every solution
Always state O(n), O(log n) etc. In RAG context: exact search O(n×d), HNSW O(log n), BM25 O(n×q), RRF O(n log n). Interviewers respect candidates who reason about complexity.
GOOD
08
☁️ Docker & Cloud — AWS Bedrock
Docker
docker-compose — Project uses it for postgres+pgvector
docker compose up — one command, postgres+pgvector ready (pgvector/pgvector:pg16 image). services, volumes, networks, depends_on, environment variables. Project: single compose file for full dev stack.
MUSTProject
Dockerfile — FROM, RUN, COPY, ENV, CMD, multi-stage builds
Multi-stage: builder stage installs deps, final stage copies only artifacts. Reduces image size 10x. Layer caching: COPY requirements.txt → RUN pip install → COPY app (requirements change less than code).
MUST
Kubernetes — pods, deployments, HPA for LLM workloads
NICE
AWS for Gen AI
AWS Bedrock — Converse API, model IDs, why it exists
Managed LLM API in AWS. Access Claude, Llama, Titan without managing GPUs. Data stays in your VPC — critical for financial PII. Converse API: unified interface for all models. Model IDs: anthropic.claude-3-5-sonnet-20241022-v2:0.
Why did you use AWS Bedrock instead of OpenAI in production?
MUSTProject
IAM — roles, policies, least privilege for Bedrock
Project IAM policy: bedrock:InvokeModel on specific model ARNs only — not *. Least privilege: service gets only what it needs. IAM role on EC2 instance → no hardcoded keys. SCPs at org level block non-approved models.
MUSTProject
Environment variables and config management — 12-factor app
.env → pydantic-settings reads → settings singleton. Never hardcode secrets. Different .env per environment (dev/prod). Project: APP_ENV=production activates Bedrock routing. Secrets in AWS Secrets Manager for prod.
MUST
Ollama — local LLM serving, LAN setup, Modelfile
OLLAMA_HOST=0.0.0.0:11434 on Windows laptop. Firewall rule for port 11434. Mac Mini .env: OLLAMA_BASE_URL=http://192.168.1.5:11434/v1. Ollama speaks OpenAI-compatible API — same SDK, just different base_url.
GOODProject
EC2 + RDS — deploy FastAPI backend with managed PostgreSQL
GOOD
vLLM, TGI — production LLM serving at scale
NICE
09
🎯 Behavioural — Stories to Prepare
Key Stories
A technical decision you made and defended
«I chose hybrid RAG because in early testing, exact regulatory figures like '43% DTI limit' weren't reliably retrieved by dense search alone. Adding BM25 with RRF fusion improved context recall significantly — I measured it with RAGAS before committing to the architecture.»
Tell me about a technical decision you're proud of
MUSTProject
A hard bug you debugged
«UndefinedTableError on first run. Traced it: PostgreSQL was running, DB existed, but pgvector extension wasn't created. Fixed with CREATE EXTENSION IF NOT EXISTS vector. Also learned to always run PYTHONPATH=. to avoid ModuleNotFoundError. Systematic debugging — environment first, then code.»
Tell me about a hard problem you solved
MUSTProject
How you learn new technologies quickly
«I learn by building. LangGraph, QLoRA, RAGAS — all new to me. I built Project to understand them hands-on. I read the docs, find one working example, then extend it. I also ask specific questions to understand the WHY not just the HOW.»
How do you keep up with the rapidly changing AI landscape?
MUST
How your 4 years IT experience transfers to Gen AI
«Systems thinking, debugging mindset, production constraints, API design, data flows — all transfer directly. Gen AI is still software engineering. My IT background means I think about reliability, monitoring, cost, and scale — not just model accuracy. That's rare in pure ML people.»
Why are you transitioning to Gen AI?
MUST
Trade-off you made between quality and cost
«Used Llama 3.1 8B for Intake and Ratio agents instead of GPT-4o. Saved ~80% cost for those calls. Measured quality with RAGAS — no significant difference for structured data validation tasks. Kept GPT-4o only for complex reasoning agents.»
GOODProject
Something you would do differently
«I'd add streaming responses earlier. Right now the user waits 20–30 seconds for the full pipeline. Server-Sent Events to stream each agent result as it completes would dramatically improve perceived performance.»
What would you improve about your project?
GOODProject