ML Concepts & Transformer Architecture

Complete Study Guide


PART 1: ML FUNDAMENTALS

Sub-topic 1: Training Basics

Loss Function

What: A loss function measures how wrong the model's predictions are. During training, the model tries to minimize this number.

Cross-Entropy Loss (used in LLMs): For a classification task (or next-token prediction), cross-entropy measures the difference between predicted probability distribution and the true label:

Cross-Entropy Loss = -log(P(correct_token))

If model predicts P("mortgage") = 0.9  → loss = -log(0.9) = 0.105 (good!)
If model predicts P("mortgage") = 0.1  → loss = -log(0.1) = 2.303 (bad!)

In LoanIQ fine-tuning: During QLoRA training, training loss starts around 2.5 and should decrease to < 0.5 by epoch 3. A loss of 0.3 means the model assigns the correct next token a probability of exp(-0.3) ≈ 74% on average (a geometric mean, since the loss averages log-probabilities).
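The loss values above can be reproduced in a few lines (a minimal sketch — per-token losses, as in the examples):

```python
import math

def cross_entropy(p_correct: float) -> float:
    """Cross-entropy loss for next-token prediction: -log P(correct token)."""
    return -math.log(p_correct)

# Confident, correct prediction → low loss
print(round(cross_entropy(0.9), 3))   # → 0.105

# Unconfident prediction → high loss
print(round(cross_entropy(0.1), 3))   # → 2.303

# Inverting: a loss of 0.3 corresponds to ~74% probability on the correct token
print(round(math.exp(-0.3), 2))       # → 0.74
```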

Backpropagation

What: After computing the loss, backpropagation uses the chain rule to compute how much each weight contributed to the error, then updates them.

Forward pass:  Input → Layers → Output → Loss
Backward pass: dLoss/dWeights via chain rule → update weights

Chain rule: dLoss/dW₁ = dLoss/dOutput × dOutput/dLayer₂ × dLayer₂/dLayer₁ × dLayer₁/dW₁

In LoRA: During backpropagation, gradients flow to the LoRA A and B matrices only (not the frozen base weights). The frozen weights have their gradients detached — requires_grad=False.
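The chain rule above can be hand-rolled for a toy two-layer scalar "network" (hypothetical example, no autograd framework assumed) and checked against a numerical gradient:

```python
# Toy network: l1 = w1*x, l2 = w2*l1, loss = (l2 - y)^2
def forward(w1, w2, x, y):
    l1 = w1 * x
    l2 = w2 * l1
    loss = (l2 - y) ** 2
    return l1, l2, loss

def backward(w1, w2, x, y):
    """Chain rule: dLoss/dW1 = dLoss/dl2 × dl2/dl1 × dl1/dw1."""
    l1, l2, _ = forward(w1, w2, x, y)
    dloss_dl2 = 2 * (l2 - y)        # derivative of squared error
    dl2_dl1 = w2                    # l2 = w2 * l1
    dl1_dw1 = x                     # l1 = w1 * x
    return dloss_dl2 * dl2_dl1 * dl1_dw1

# Verify against finite differences (numerical gradient)
w1, w2, x, y = 0.5, -1.2, 2.0, 1.0
eps = 1e-6
numeric = (forward(w1 + eps, w2, x, y)[2] - forward(w1 - eps, w2, x, y)[2]) / (2 * eps)
analytic = backward(w1, w2, x, y)
assert abs(numeric - analytic) < 1e-4
```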

Learning Rate

What: The step size for each weight update. W_new = W_old - lr × gradient

LR Schedulers:

# Linear warmup + cosine decay (most common for LLMs)
from transformers import get_cosine_schedule_with_warmup
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=5,      # gradually ramp up LR for first 5 steps
    num_training_steps=total # then cosine decay to near zero
)

Warmup prevents the model from making huge updates at the start of training when gradients are noisy.
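The shape that scheduler produces can be sketched in pure Python (a hypothetical re-implementation of the curve, not the transformers internals):

```python
import math

def lr_at_step(step, base_lr, warmup_steps, total_steps):
    """Linear warmup to base_lr, then cosine decay toward zero."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps                       # linear ramp
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return base_lr * 0.5 * (1 + math.cos(math.pi * progress))      # cosine decay

base_lr, warmup, total = 2e-4, 5, 100
print(lr_at_step(0, base_lr, warmup, total))    # → 0.0 (start of warmup)
print(lr_at_step(5, base_lr, warmup, total))    # → 0.0002 (peak after warmup)
print(lr_at_step(100, base_lr, warmup, total))  # → 0.0 (fully decayed)
```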

Batch Sizes

Type        Description            Memory      Gradient Quality
Batch GD    All data in one step   Very high   Best (low noise)
SGD         One sample per step    Very low    Very noisy
Mini-batch  N samples per step     Moderate    Good balance

LoanIQ uses mini-batch with gradient accumulation (effective batch = 8).
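Gradient accumulation can be sketched as follows (hypothetical trainer logic: micro-batch gradients are averaged before one optimizer step, so 4 micro-batches of 2 behave like one batch of 8):

```python
def accumulated_gradient(micro_batch_grads, accumulation_steps):
    """Average gradients over micro-batches before a single optimizer step."""
    assert len(micro_batch_grads) == accumulation_steps
    return sum(micro_batch_grads) / accumulation_steps

# 4 micro-batches of size 2 → effective batch size 8, one weight update
grads = [1.0, 1.5, 0.5, 1.0]          # per-micro-batch gradient of one weight
step_grad = accumulated_gradient(grads, accumulation_steps=4)
print(step_grad)  # → 1.0 — same update a single size-8 batch would produce
```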


Sub-topic 2: Optimizers

Adam vs AdamW

Adam (Adaptive Moment Estimation): Maintains per-parameter learning rates based on first (mean) and second (variance) moments of gradients. Parameters with large gradient variance get smaller updates.

m_t = β₁ × m_{t-1} + (1-β₁) × g_t       # exponential moving average of gradient
v_t = β₂ × v_{t-1} + (1-β₂) × g_t²      # exponential moving average of gradient squared
m̂_t = m_t / (1-β₁ᵗ),  v̂_t = v_t / (1-β₂ᵗ)  # bias correction (moments start at zero)
θ_t = θ_{t-1} - lr × m̂_t / (√v̂_t + ε)   # update

AdamW: Adam + decoupled weight decay. In standard Adam, weight decay is mixed into the gradient update (incorrect). AdamW applies weight decay directly to weights: θ = θ - lr × weight_decay × θ. This is the correct L2 regularization and trains better in practice.

LoanIQ: Uses adamw_8bit — AdamW with quantized optimizer states (8-bit instead of FP32). Saves ~75% of optimizer VRAM at negligible quality cost.
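A single AdamW step for one scalar parameter, following the equations above (pure-Python sketch; the β, ε, and weight-decay values are common defaults, not anything LoanIQ-specific):

```python
import math

def adamw_step(theta, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8, wd=0.01):
    """One AdamW update: Adam moments + decoupled weight decay."""
    m = b1 * m + (1 - b1) * g            # first moment (mean of gradients)
    v = b2 * v + (1 - b2) * g * g        # second moment (uncentered variance)
    m_hat = m / (1 - b1 ** t)            # bias correction
    v_hat = v / (1 - b2 ** t)
    theta -= lr * m_hat / (math.sqrt(v_hat) + eps)
    theta -= lr * wd * theta             # decoupled weight decay: the "W" in AdamW
    return theta, m, v

theta, m, v = 1.0, 0.0, 0.0
theta, m, v = adamw_step(theta, g=0.5, m=m, v=v, t=1)
print(round(theta, 6))  # slightly below 1.0: gradient step plus weight decay
```

Note that the decay term is applied to the weights directly, not mixed into `g` — that is the decoupling that distinguishes AdamW from Adam-with-L2.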


Sub-topic 3: Overfitting & Regularization

Overfitting

Training loss:   decreasing ↓↓↓
Validation loss: decreasing then increasing ↑↑↑ ← overfitting!

The model has memorized training data and can't generalize.

Regularization Techniques

Technique          How                                               Effect
Dropout            Zero random neurons with probability p (training) Forces redundancy, prevents co-adaptation
Weight decay (L2)  Add λ × ||W||² to loss                            Penalizes large weights, simpler models
Early stopping     Stop when validation loss stops improving         Prevents memorization
Data augmentation  Synthetic training examples                       More diverse training distribution

In QLoRA: lora_dropout=0 is a common and well-performing choice (it is the default in optimized stacks such as Unsloth). With a 4-bit base model, the quantization noise itself already acts as a mild regularizer, and the small LoRA adapters have limited capacity to overfit, so dropout adds little benefit.


Sub-topic 4: Evaluation Metrics

Classification Metrics

Confusion Matrix:
                 Predicted
              Positive  Negative
Actual  Positive   TP       FN
        Negative   FP       TN

Precision = TP / (TP + FP)  ← when you say yes, how often are you right?
Recall    = TP / (TP + FN)  ← of all actual yes, how many did you find?
F1        = 2 × P × R / (P + R)  ← harmonic mean of precision and recall
Accuracy  = (TP + TN) / Total

In mortgage decisioning:
- High precision = low false positive rate (don't approve bad loans)
- High recall = low false negative rate (don't decline good borrowers)
- F1 balances both
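The four formulas above, computed from confusion-matrix counts (minimal sketch; the counts are made-up):

```python
def classification_metrics(tp, fp, fn, tn):
    """Precision, recall, F1, accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f1, accuracy

# Hypothetical batch of 100 loan decisions
p, r, f1, acc = classification_metrics(tp=40, fp=10, fn=20, tn=30)
print(p, round(r, 3), round(f1, 3), acc)  # → 0.8 0.667 0.727 0.7
```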

LLM-Specific Metrics

Perplexity: How surprised the model is by test data. Lower = better. Perplexity = exp(mean cross-entropy loss over test set)

In LoanIQ fine-tuning: training perplexity target ≈ 1.65, matching the loss target of 0.5 (perplexity = exp(0.5) ≈ 1.65).
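Perplexity from per-token losses, per the definition above (sketch; the loss values are illustrative):

```python
import math

def perplexity(token_losses):
    """exp of the mean cross-entropy loss over a test set."""
    return math.exp(sum(token_losses) / len(token_losses))

losses = [0.4, 0.5, 0.6]               # per-token cross-entropy, mean = 0.5
print(round(perplexity(losses), 2))    # → 1.65

print(perplexity([0.0]))               # → 1.0 (perfect prediction)
```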

BLEU (Bilingual Evaluation Understudy): n-gram overlap between generated and reference text. Good for translation; insufficient for open-ended generation.

ROUGE (Recall-Oriented Understudy for Gisting Evaluation): ROUGE-L uses longest common subsequence. Good for summarization evaluation.


Sub-topic 5: Bias & Variance

Model Complexity vs Error:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
  Total error is U-shaped: minimized between the two extremes.

  High Bias ←───── Bias-Variance Tradeoff ─────→ High Variance
(Underfitting)                                    (Overfitting)

  Simple model:                Complex model:
  Train loss: high             Train loss: low
  Test loss: high              Test loss: high
  Prediction: poor             Prediction: memorized training

Bias-variance tradeoff: You can't minimize both simultaneously with a given model capacity and dataset. More model capacity (more parameters) reduces bias but increases variance.

For LLMs: Pre-trained LLMs have enormous capacity (low bias) but may overfit small fine-tuning datasets (high variance). LoRA's small adapter size acts as an implicit regularizer, preventing overfitting.


PART 2: LLM-SPECIFIC ML

Sub-topic 6: Pre-training

Next Token Prediction

Causal Language Modeling (CLM): The pre-training objective for GPT, Llama, and all decoder-only LLMs.

Text: "The maximum DTI for FHA loans is 45%"

Training examples (autoregressive):
  Input: "The"                        → Predict: "maximum"
  Input: "The maximum"                → Predict: "DTI"
  Input: "The maximum DTI"            → Predict: "for"
  Input: "The maximum DTI for FHA"    → Predict: "loans"
  ...

The model is trained on trillions of tokens. After training, it has learned to predict the probability distribution of the next token given any context — which means it has implicitly learned grammar, facts, reasoning, and domain knowledge.
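Building those autoregressive (prefix, next-token) pairs from a token sequence can be sketched in one line:

```python
def make_training_pairs(tokens):
    """Each prefix predicts the next token — one training example per position."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

tokens = ["The", "maximum", "DTI", "for", "FHA", "loans", "is", "45%"]
pairs = make_training_pairs(tokens)
print(pairs[0])  # → (['The'], 'maximum')
print(pairs[2])  # → (['The', 'maximum', 'DTI'], 'for')
```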

Why Decoder-Only for Generation?

LLMs are decoder-only because text generation (predicting the next word) is the fundamental task.


Sub-topic 7: Inference Parameters

Temperature

# Low temperature (0.0-0.3): deterministic, conservative
response = await llm.ainvoke(prompt, temperature=0.1)
# → "The maximum DTI for FHA loans is 45%." (repeatable)

# High temperature (0.7-1.5): creative, varied  
response = await llm.ainvoke(prompt, temperature=1.0)
# → might add non-standard interpretations

In LoanIQ: All decision agents use temperature=0.1 — decisions must be deterministic and reproducible. Fine-tuning inference also uses temperature=0.1.
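Mechanically, temperature divides the logits before the softmax — low temperature sharpens the distribution, high temperature flattens it (pure-Python sketch; the logits are made-up):

```python
import math

def softmax_with_temperature(logits, temperature):
    """Softmax over logits scaled by 1/temperature."""
    scaled = [x / temperature for x in logits]
    exps = [math.exp(x - max(scaled)) for x in scaled]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]

logits = [4.0, 2.0, 1.0]                       # hypothetical scores for 3 tokens
cold = softmax_with_temperature(logits, 0.1)   # near-argmax: top token ≈ 1.0
hot = softmax_with_temperature(logits, 2.0)    # flatter: probability spread out
print(round(cold[0], 4), round(hot[0], 4))
```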

Top-p (Nucleus Sampling)

Sample from the smallest set of tokens whose cumulative probability ≥ p:

Token probabilities: [approve: 0.8, decline: 0.15, review: 0.04, ...]
top_p=0.9: Include [approve, decline] (0.8 + 0.15 = 0.95 ≥ 0.9)
→ Only sample from {approve, decline}, ignore rare tokens

Top-p vs top-k:
- top_k=50: always consider exactly 50 tokens, regardless of the probability distribution
- top_p=0.9: consider fewer tokens when the model is confident, more when uncertain
- top_p is generally preferred for quality
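Nucleus filtering as described above (sketch; the token set and probabilities are hypothetical):

```python
def top_p_filter(token_probs, p):
    """Keep the smallest set of tokens whose cumulative probability >= p."""
    ranked = sorted(token_probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append(token)
        cumulative += prob
        if cumulative >= p:
            break
    return kept

probs = {"approve": 0.8, "decline": 0.15, "review": 0.04, "defer": 0.01}
print(top_p_filter(probs, 0.9))  # → ['approve', 'decline'] (0.8 + 0.15 ≥ 0.9)
print(top_p_filter(probs, 0.5))  # → ['approve'] (model is confident)
```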


Sub-topic 8: KV Cache

What, Why, How

Problem: Autoregressive generation requires processing the full context every time a new token is generated. Generating 500 tokens = 500 full forward passes over the growing context = O(n²) computation.

KV Cache: During each forward pass, the Key and Value matrices for each token are computed once and cached. For subsequent tokens, only the new token's KV matrices are computed; old ones are reused.

Without KV cache:
  Token 1: Process [token_1] → KV₁ → output
  Token 2: Process [token_1, token_2] → KV₁, KV₂ → output  (KV₁ recomputed!)
  Token 3: Process [token_1, token_2, token_3] → KV₁, KV₂, KV₃ (KV₁,KV₂ recomputed!)

With KV cache:
  Token 1: Compute KV₁, cache it → output
  Token 2: Load KV₁ from cache, compute KV₂ → output  (KV₁ not recomputed!)
  Token 3: Load KV₁, KV₂ from cache, compute KV₃ → output

Speed improvement: ~10× faster for long sequences

Memory tradeoff: KV cache grows linearly with context length. For a full 128K-token context on Llama 3.1 8B, the KV cache alone reaches ~16GB in FP16.
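The cache size is a simple product of the model's dimensions (sizing sketch; the dimensions plugged in are Llama 3.1 8B's: 32 layers, 8 KV heads after GQA, 128 dims per head, FP16):

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_val=2):
    """K and V, per layer, per KV head, per position, at the given precision."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_val

# Llama 3.1 8B: 32 layers, 8 KV heads (GQA), 128 dims/head, FP16 (2 bytes)
print(round(kv_cache_bytes(32, 8, 128, seq_len=4096) / 2**20))    # → 512 (MB)
print(round(kv_cache_bytes(32, 8, 128, seq_len=131072) / 2**30))  # → 16 (GB at 128K)
```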


Sub-topic 9: Quantization

Precision Levels

FP32 (32-bit float):  4 bytes/parameter  → Full precision training
FP16 (16-bit float):  2 bytes/parameter  → Mixed precision training
BF16 (bfloat16):      2 bytes/parameter  → Better range than FP16, A100 optimized
INT8 (8-bit int):     1 byte/parameter   → Post-training quantization
NF4  (4-bit normal):  0.5 bytes/param   → QLoRA base model

GGUF and Q4_K_M

GGUF (GPT-Generated Unified Format) is the file format used by llama.cpp and Ollama for quantized models.

Q4_K_M (4-bit, K-quant, Medium): One of the best quality-to-size ratios:
- Q4: 4-bit quantization
- K: K-quant method (uses smaller bits for less important weights)
- M: Medium — balances quality and size

In LoanIQ deployment: Fine-tuned Llama 3.1 8B → merged 16-bit → converted to Q4_K_M GGUF → loaded in Ollama. Size: ~4.5GB. Quality: nearly identical to FP16 for this task.


PART 3: TRANSFORMER ARCHITECTURE

Sub-topic 10: Self-Attention

The Attention Mechanism

Attention — Step by Step:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Input: sequence of token embeddings X (shape: seq_len × d_model)

1. Project to Q, K, V:
   Q = X · W_Q    (query: "what am I looking for?")
   K = X · W_K    (key:   "what do I have to offer?")
   V = X · W_V    (value: "what information do I carry?")

2. Compute attention scores:
   Scores = Q · K^T / √d_k    (dot product, scaled)

3. Apply causal mask (decoder only — future tokens masked):
   Scores[i,j] = -∞ if j > i  (can't attend to future)

4. Softmax → attention weights:
   A = softmax(Scores)   (each row sums to 1)

5. Weighted sum of values:
   Output = A · V    (each position's output = weighted mix of all values)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Why √d_k scaling? For large d_k (dimension), dot products grow large → softmax becomes too peaked (one token gets all attention, rest get ~0). Dividing by √d_k keeps variance stable.
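The five steps above can be run end-to-end in pure Python (toy sketch: 3 tokens, d_k = 2, and the Q/K/V values are already projected — the W_Q/W_K/W_V multiplications are omitted):

```python
import math

def causal_self_attention(Q, K, V):
    """Scaled dot-product attention with a causal mask (lists of lists)."""
    d_k = len(Q[0])
    out = []
    for i in range(len(Q)):                           # each query position i
        scores = []
        for j in range(len(K)):
            if j > i:                                 # causal mask: no future tokens
                scores.append(float("-inf"))
            else:
                dot = sum(q * k for q, k in zip(Q[i], K[j]))
                scores.append(dot / math.sqrt(d_k))   # scaled dot product
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]      # numerically stable softmax
        total = sum(exps)
        weights = [e / total for e in exps]           # row sums to 1
        out.append([sum(w * V[j][d] for j, w in enumerate(weights))
                    for d in range(len(V[0]))])       # weighted sum of values
    return out

Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out = causal_self_attention(Q, K, V)
print(out[0])  # → [1.0, 2.0] — the first token can only attend to itself
```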

Multi-Head Attention

Instead of one attention computation, run H parallel attention heads, each with smaller dimensions:

d_model = 4096 (Llama)
H = 32 heads
d_head = 4096 / 32 = 128 per head

Each head learns different patterns:
  Head 1: syntactic relationships ("the" → noun that follows)
  Head 2: long-range dependencies (pronoun → antecedent)
  Head 3: domain-specific (mortgage term relationships)
  ...

Outputs concatenated → projected back to d_model

GQA (Grouped Query Attention): Used in Llama 3.1 to speed up inference. Instead of H full KV heads, use H/G KV heads shared across G query heads. Reduces KV cache size by G×.


Sub-topic 11: Positional Encoding

Why Needed?

Self-attention has no inherent notion of position — the attention formula is permutation-invariant. "Dog bites man" and "Man bites dog" have the same attention scores without positional encoding.

RoPE (Rotary Positional Embedding)

Used in Llama 3.1. Key insight: encode position by rotating the Query and Key vectors.

RoPE: Rotate Q and K vectors by an angle proportional to position

Q'ₘ = Rotate(Qₘ, θₘ)   # position m
K'ₙ = Rotate(Kₙ, θₙ)   # position n

Attention(m, n) = Q'ₘ · K'ₙ = f(Qₘ · Kₙ, m-n)

The dot product naturally encodes the relative position (m-n). Benefits:
- Position 5 attends to position 3 the same way position 15 attends to position 13 (translation-invariant)
- Can extend to longer contexts by adjusting the angle (RoPE scaling)
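The translation-invariance claim can be checked numerically with 2-D rotations — RoPE applies exactly this rotation to pairs of dimensions (the θ value and vectors here are illustrative):

```python
import math

def rotate(vec, angle):
    """Rotate a 2-D vector by the given angle."""
    x, y = vec
    return (x * math.cos(angle) - y * math.sin(angle),
            x * math.sin(angle) + y * math.cos(angle))

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1]

theta = 0.1                      # per-position rotation angle (illustrative)
q, k = (1.0, 2.0), (3.0, 4.0)

# Rotate q by position m and k by position n: the score depends only on m - n
score_5_3 = dot(rotate(q, 5 * theta), rotate(k, 3 * theta))
score_15_13 = dot(rotate(q, 15 * theta), rotate(k, 13 * theta))
assert abs(score_5_3 - score_15_13) < 1e-9   # same offset (2) → same score
```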


Sub-topic 12: Transformer Block

Architecture

Single Transformer Block (Decoder, Pre-Norm):
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Input: x
  │
  ├──────────────────────────────────────┐
  ▼                                      │
RMSNorm(x)                               │ residual
  │                                      │
Multi-Head Self-Attention                │
  │                                      │
  └──────────── + ───────────────────────┘ (Add & Norm)
                │
                ├──────────────────────────────────────┐
                ▼                                      │
             RMSNorm(x')                               │ residual
                │                                      │
             SwiGLU FFN (expand to 4× d_model,         │
                         then contract back)           │
                │                                      │
                └──────────── + ───────────────────────┘
                              │
                           Output
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Residual connections: Add input to output at each sub-layer. Allows gradients to flow directly to early layers (no vanishing gradient). Makes the network learn residual corrections rather than full transformations.

RMSNorm vs LayerNorm: LayerNorm normalizes by both mean and variance. RMSNorm only divides by RMS (root mean square) — simpler, faster, equally effective. Llama uses RMSNorm.

Pre-Norm vs Post-Norm: Pre-norm (normalize before attention/FFN) trains more stably for deep networks. All modern LLMs use pre-norm.

Feed-Forward Network (SwiGLU)

Standard FFN: Linear → ReLU → Linear
SwiGLU (Llama): ((xW₁) ⊗ SiLU(xW₂)) · W₃ — where ⊗ is elementwise multiplication; W₂ is the gate, W₁ the up-projection, W₃ the down-projection

SwiGLU provides a learned gating mechanism — some dimensions are suppressed, others amplified. Empirically ~10% better than standard FFN.
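RMSNorm and the SwiGLU gate in miniature (pure-Python sketch; the weight matrices are omitted, so the inputs stand in for already-projected pre-activation values):

```python
import math

def rms_norm(x, eps=1e-6):
    """Divide by root-mean-square only — no mean subtraction (unlike LayerNorm)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

def silu(v):
    return v / (1 + math.exp(-v))          # SiLU(v) = v * sigmoid(v)

def swiglu(gate, up):
    """Elementwise SiLU(gate) ⊗ up — the learned gating in Llama's FFN."""
    return [silu(g) * u for g, u in zip(gate, up)]

print(rms_norm([3.0, -4.0]))               # RMS is ~3.536 → roughly [0.85, -1.13]
print(swiglu([10.0, -10.0], [1.0, 1.0]))   # positive gate passes, negative gate ≈ 0
```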


Sub-topic 13: Tokenization

BPE (Byte-Pair Encoding)

BPE Algorithm:
1. Start with character vocabulary: {'m', 'o', 'r', 't', 'g', 'a', 'e', ...}
2. Count most frequent adjacent pair: 'g' 'a' appears 1000 times → merge to 'ga'
3. Count again: 'm' 'o' appears 800 times → merge to 'mo'
4. Repeat until vocabulary reaches target size (e.g., 32K tokens)

Result: "mortgage" → ['mort', 'gage'] (2 tokens)
        "mortgagee" → ['mort', 'gage', 'e'] (3 tokens)
        " LTV" → ['▁LTV'] (1 token — rare finance term kept as unit)

Vocabulary sizes:
- Llama 3.1: 128,256 tokens (very large — covers many languages + special tokens)
- GPT-4: ~100,000 tokens
- Older models (GPT-2): 50,257 tokens

Cost impact (important for LoanIQ): 1 token ≈ 0.75 words in English. API pricing is per token. "Analyze this loan application thoroughly and comprehensively" = ~8 tokens = ~$0.00012 at GPT-4o pricing. Knowing token counts helps budget LLM costs per agent.


Sub-topic 14: Flash Attention

What, Why, How

Problem: Standard attention requires materializing the full O(seq²) attention matrix per head — for seq_len=4096, that's 4096² ≈ 16.8M attention scores × 4 bytes = 64MB per head, per layer. A memory-bound operation.

Flash Attention: Rewrites the attention algorithm to avoid materializing the full attention matrix. Uses CUDA kernel fusion — computes softmax incrementally, keeps partial results in fast SRAM (on-chip memory) rather than writing to slow HBM (GPU memory).

Standard: Q·K^T → write to HBM (16M entries) → softmax → read from HBM → ×V
FlashAttn: Block-wise computation entirely in SRAM → never writes large matrix to HBM

Speed: 2-4× faster. Memory: O(seq) instead of O(seq²). Quality: mathematically identical to standard attention (no approximation).

LoanIQ uses Flash Attention via Unsloth's optimized kernels during fine-tuning.
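The online softmax trick at the heart of Flash Attention can be demonstrated in miniature: process scores block by block, tracking a running max and running sum, and rescale already-accumulated values whenever the max changes — the full score vector never has to exist at once (simplified sketch of a single attention row):

```python
import math

def online_softmax(blocks):
    """Softmax over concatenated blocks, seen one block at a time."""
    run_max, run_sum, out = float("-inf"), 0.0, []
    for block in blocks:
        new_max = max(run_max, max(block))
        scale = math.exp(run_max - new_max)   # rescale when the running max grows
        run_sum *= scale
        out = [v * scale for v in out]
        for s in block:
            e = math.exp(s - new_max)
            run_sum += e
            out.append(e)
        run_max = new_max
    return [v / run_sum for v in out]

# Block-wise result is identical to the ordinary full softmax
scores = [1.0, 3.0, 2.0, 0.5]
exps = [math.exp(s - max(scores)) for s in scores]
full = [v / sum(exps) for v in exps]
tiled = online_softmax([[1.0, 3.0], [2.0, 0.5]])
assert all(abs(a - b) < 1e-12 for a, b in zip(full, tiled))
```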


10 Interview Questions — ML & Transformers

Q1: Explain cross-entropy loss in the context of LLM training. What does a loss of 0.5 mean?

A: Cross-entropy loss for next-token prediction is -log(P(correct_next_token)).

If the model predicts the next token with probability 0.9, loss = -log(0.9) = 0.105. If it predicts probability 0.1, loss = -log(0.1) = 2.303.

Training loss of 0.5 means exp(-0.5) ≈ 0.61: on average (a geometric mean, since the loss averages log-probabilities), the model assigns the correct next token ~61% probability.

For LoanIQ fine-tuning: starting loss ~2.5 (model barely knows mortgage language), target loss ~0.5 (model reliably generates correct mortgage narrative tokens). Lower loss = the model "knows" what word should come next in an underwriter narrative.

Perplexity = exp(loss) = exp(0.5) ≈ 1.65. Perplexity of 1.0 would mean perfect prediction.


Q2: What is the attention mechanism? Explain Q, K, V intuitively.

A: Attention lets each token "look at" all other tokens and decide which ones are relevant.

Intuitive explanation (library analogy):
- Query (Q): "What information am I looking for?" (like a search query)
- Key (K): "What information do I have to offer?" (like a book's title/index)
- Value (V): "What's the actual content I'll share?" (like the book's text)

Each token broadcasts a Query. All tokens respond with their Keys. The similarity between Query and Keys determines the attention weights — higher similarity = more attention. The output is a weighted sum of Values.

In "The maximum DTI is 45%", when processing "45%", the Query would attend heavily to "DTI" and "maximum" (high Key similarity) — gathering context that this number is a DTI limit, not a random percentage.


Q3: Why is the √d_k scaling factor in the attention formula necessary?

A: Without scaling, dot products between Q and K vectors grow in magnitude proportional to √d_k (the square root of the dimension).

For d_k=64: dot products typically in range [-8, 8] → softmax distributes attention across many tokens
For d_k=512: dot products typically in range [-23, 23] → softmax becomes very peaked, most tokens get ~0 attention, one token gets ~1.0

The peaked softmax problem: gradients become tiny for the near-zero attention weights (vanishing gradients in attention). Training becomes unstable.

Dividing by √d_k brings dot products back to O(1) scale regardless of dimension, keeping softmax outputs in a healthy range. This is in the original "Attention is All You Need" paper.


Q4: What is causal masking and why do decoder-only models need it?

A: Causal masking (autoregressive masking) prevents each token from attending to future tokens during training.

During training, we process the entire sequence at once (efficient). But the model should be predicting token N using only tokens 1...N-1. If it could see token N+1, it would "cheat" — trivially predict the next token by copying from the future.

Implementation: Set attention_score[i,j] = -∞ for all j > i (j is after i). After softmax: softmax(-∞) = 0 — zero attention to future tokens.

During inference: the causal mask is still applied when the prompt is processed in a single prefill pass; after that, decoding generates one token at a time against the KV cache, so each new token can only see the past anyway — there is no future to mask.

Contrast with BERT (encoder-only, no causal mask): BERT sees the full sequence bidirectionally during pre-training (MLM = masked language modeling, not next-token prediction). This makes BERT great at understanding but not generating.


Q5: Explain residual connections. Why do all modern LLMs use them?

A: Residual connections add the layer input directly to the sub-layer output: Output = x + Sublayer(Norm(x)) (pre-norm form — the shortcut x bypasses the sub-layer entirely)

The vanishing gradient problem without residuals: In a 96-layer transformer, gradients must pass through 96 layers to reach the early layers. Each layer multiplies the gradient by something — if these multiplications are <1 on average, gradients shrink exponentially → early layers learn nothing.

With residuals: The gradient path includes a direct shortcut from output back to input. Even if the residual branch's gradient is near zero, the identity shortcut provides a gradient highway: ∂Loss/∂x = ∂Loss/∂output × (1 + ∂f/∂x) — the 1 in parentheses ensures at minimum, gradient magnitude is preserved.

Practical effect: Networks can be much deeper (GPT-3: 96 layers, Llama 3.1 70B: 80 layers) without training instability. Also makes optimization easier — layers can "do nothing" initially (output zero → output = residual = input) and learn incrementally.


Q6: What is the KV cache and what is its memory cost for LoanIQ?

A: The KV cache stores the Key and Value matrices computed for each token in past positions, avoiding recomputation.

For Llama 3.1 8B (32 layers, 8 KV heads via GQA, 128 dims/head) with a 4096-token context:
- Per layer: 2 (K+V) × 4096 tokens × 128 dims/head × 8 KV heads × 2 bytes/float = 16MB
- Total (32 layers): 32 × 16MB = 512MB

This is the KV cache overhead for a 4096-token context. It grows linearly: ~4GB at 32K tokens, and ~16GB at Llama 3.1's full 128K context window.

For LoanIQ agents: each agent receives a prompt with ~1000-2000 tokens (policy context + loan data). KV cache ≈ 128-256MB per agent. With 7 agents running sequentially, only one agent's cache is resident at a time — very manageable.

The KV cache is why long-context inference is memory-constrained: a 100K-token context for this model requires ~12GB of KV cache.


Q7: Explain tokenization and why it matters for LoanIQ's cost calculations.

A: Tokenization converts text into integer IDs using a vocabulary of ~32K-128K tokens. The vocabulary is learned using BPE (Byte-Pair Encoding): frequently co-occurring character sequences become single tokens.

For LoanIQ:
- "conventional" → 1 token (common word)
- "debt-to-income" → 3 tokens: "debt", "-to", "-income"
- "CLTV" → 2 tokens: "C", "LTV" (financial acronym, less common)

Why it matters for cost: API pricing is per token, not per character. A policy chunk with 1000 characters ≈ 250 tokens (1 token ≈ 4 chars). At GPT-4o pricing of $0.015/1K tokens, processing 10 chunks of 1000 chars costs $0.038 per query.

For LoanIQ's fine-tuned model at $0: tokenization still matters for latency — each token generated takes a forward pass through all layers. Shorter prompts = faster inference.

LoanIQ's Context Builder enforces a 6000-token budget (≈ 24,000 chars) to control both cost and latency.


Q8: What is the difference between encoder-only, decoder-only, and encoder-decoder transformers? When would you use each?

A:

Encoder-only (BERT, RoBERTa):
- Bidirectional attention — sees full context
- Pre-trained with Masked Language Modeling
- Great for: embeddings, classification, NER, reading comprehension
- LoanIQ uses: embedding model (text-embedding-3-small is encoder-based)

Decoder-only (GPT, Llama, Claude):
- Causal (left-to-right) attention
- Pre-trained with next-token prediction
- Great for: text generation, chat, reasoning, code
- LoanIQ uses: all 7 decision agents, fine-tuned ratio model

Encoder-decoder (T5, BART, mT5):
- Encoder: bidirectional attention on input
- Decoder: causal attention + cross-attention to encoder
- Great for: translation, summarization, question answering with long inputs
- Not used in LoanIQ

Modern trend: decoder-only models have taken over. GPT-4o, Llama 3.1, Claude — all decoder-only. They're more general and scale better with model size and training data.


Q9: Explain dropout. Why is lora_dropout=0 optimal for QLoRA?

A: Dropout randomly zeros neurons with probability p during training:

Without dropout: Hidden = [0.3, -0.5, 0.8, 0.2]
With dropout p=0.3: Hidden = [0.3, 0.0, 0.8, 0.0]  (2 neurons zeroed randomly)

This forces the network not to rely on any single neuron — it must learn redundant representations. This prevents co-adaptation (neuron A only works when neuron B has a specific value) and acts as regularization.
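Inverted dropout (the variant used in practice, which rescales survivors by 1/(1-p) so no scaling is needed at inference) in a few lines — a sketch, with the seed fixed only to make the illustration reproducible:

```python
import random

def dropout(values, p, training=True):
    """Zero each value with probability p; scale survivors by 1/(1-p)."""
    if not training or p == 0:
        return list(values)
    return [0.0 if random.random() < p else v / (1 - p) for v in values]

random.seed(0)
hidden = [0.3, -0.5, 0.8, 0.2]
print(dropout(hidden, p=0.5))        # roughly half zeroed, survivors doubled
print(dropout(hidden, p=0.0))        # p=0 (as in QLoRA): unchanged
```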

Why lora_dropout=0 works with QLoRA: NF4 quantization introduces quantization noise in the base model — the 4-bit values are not exact. This noise already acts as a regularizer, similar to what dropout provides. Adding dropout on top of quantization noise tends not to improve generalization further while injecting extra noise into training.

Additionally, LoRA adapters are already small (0.1% of parameters) — they have low capacity and don't need aggressive regularization to avoid overfitting.

In practice: dropout=0 performs on par with small values like 0.05-0.1 for short fine-tunes such as this one, and it is the default in optimized QLoRA stacks like Unsloth.


Q10: What is Flash Attention and why does LoanIQ's training benefit from it?

A: Standard attention has two performance bottlenecks:

  1. Memory: The Q×K^T attention matrix is O(seq² × layers) — for seq=2048, this is 2048²×32×2bytes = 268MB of intermediate activations that must be written to GPU HBM (slow memory) during forward pass and read back during backward pass.

  2. Bandwidth: GPU computation speed (FLOPS) has improved faster than memory bandwidth. Modern GPUs are "memory-bandwidth bound" for attention — they spend more time reading/writing HBM than computing.

Flash Attention solution: Rewrites the attention computation to use SRAM (fast on-chip memory, ~100× faster than HBM). It processes the attention in tiles that fit in SRAM, computes softmax incrementally using the online softmax trick, and never writes the full attention matrix to HBM.

Result: 2-4× faster training, 5-20× less memory for attention. Mathematically identical output.

In LoanIQ fine-tuning: Unsloth automatically uses Flash Attention 2 kernels, which is why fine-tuning on Colab A100 takes 45 minutes instead of potentially 3-4 hours with standard attention. Also enables longer sequences (up to 8K tokens) in the same VRAM budget.


Next: Docker & AWS Bedrock →