Fine-Tuning LLMs

When to fine-tune · LoRA / QLoRA · PEFT · SFT · RLHF · Fine-tune vs RAG


Fine-Tuning vs RAG — The Core Decision

Dimension           | Fine-Tuning                                    | RAG
Updates knowledge?  | Yes — baked into weights                       | Yes — via retrieval at runtime
Knowledge freshness | Static after training                          | Real-time with DB updates
Cost                | High upfront, cheap inference                  | Low upfront, retrieval cost per query
Data needed         | 100s–10,000s labelled examples                 | Source documents only
Best for            | Style, tone, format, domain-specific behaviour | Knowledge-intensive Q&A, grounded answers
Hallucination risk  | Higher (no grounding)                          | Lower (grounded in retrieved docs)
Explainability      | Low — black box                                | High — can cite sources
Fine-tune to change HOW the model behaves. Use RAG to change WHAT the model knows.

When Fine-Tuning Makes Sense

- You need consistent style, tone, or output format that prompting can't reliably enforce
- Domain-specific behaviour (terminology, conventions) must be baked into the model
- Few-shot examples push the context window too long, or prompting is inconsistent across runs
- High query volume, where a smaller fine-tuned model is cheaper per token than a prompted large one

LoRA — Low-Rank Adaptation

Instead of updating all model weights (billions of parameters), LoRA freezes the base model and adds small trainable low-rank matrices alongside the frozen weight matrices. Only these adapter weights are trained and stored.

# Weight update decomposed into low-rank matrices
W_new = W_frozen + (A @ B)
# where A is (d x r), B is (r x k), r << d, k
# rank r is typically 4, 8, or 16
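The parameter savings are easy to verify with toy matrices. A minimal NumPy sketch (illustrative shapes, not a real model; the `alpha / r` scaling is the standard LoRA convention, matching the `lora_alpha` hyperparameter used later):

```python
import numpy as np

d, k, r = 4096, 4096, 8          # hidden dims and LoRA rank
rng = np.random.default_rng(0)

W_frozen = rng.standard_normal((d, k))   # frozen pretrained weight
A = rng.standard_normal((d, r)) * 0.01   # trainable, (d x r)
B = np.zeros((r, k))                     # trainable, (r x k); zero-init so W_new == W_frozen at step 0
alpha = 16                               # LoRA scaling hyperparameter

W_new = W_frozen + (alpha / r) * (A @ B)

full_params = d * k                      # 16,777,216
lora_params = d * r + r * k              # 65,536
print(full_params // lora_params)        # 256x fewer trainable parameters at r=8
```

Note the zero-initialised B: at the start of training the adapter contributes nothing, so fine-tuning begins exactly from the base model's behaviour.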

Benefits: 100–1000x fewer trainable parameters, fits on a single GPU, swap adapters without reloading the base model, merge back into base weights for zero-latency inference.

QLoRA — Quantised LoRA

QLoRA adds 4-bit quantisation of the base model to LoRA. The base model is loaded in 4-bit (NF4 or FP4), reducing VRAM by ~4x. LoRA adapters train in 16-bit. Enables fine-tuning a 70B model on a single 48GB GPU.

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

model = AutoModelForCausalLM.from_pretrained(base_model, quantization_config=bnb_config)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: 4,194,304 || all params: 6,742,609,920 || trainable%: 0.0622
# (these counts correspond to r=8 on a 7B model; r=16 as configured above doubles the trainable count)

Supervised Fine-Tuning (SFT)

The most common form. You provide (instruction, response) pairs and train the model to predict the response given the instruction. Uses standard next-token prediction loss, computed on the response tokens only (prompt tokens are masked out of the loss).

# Dataset format for instruction fine-tuning
{
  "instruction": "Extract the DTI ratio from this mortgage application",
  "input": "Monthly debt payments: $2,400. Gross monthly income: $8,000.",
  "output": "DTI ratio: 30% (2400/8000 = 0.30)"
}
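The "loss on response tokens only" part is implemented by masking the prompt positions in the labels. A minimal sketch with toy token ids (no real tokenizer; the ids are made up), using the Hugging Face convention that label -100 means "ignore this position in the loss":

```python
prompt_ids   = [101, 2054, 2003, 1996, 102]   # instruction + input tokens (hypothetical ids)
response_ids = [3972, 2232, 1024, 2382, 102]  # output tokens

input_ids = prompt_ids + response_ids
# Prompt positions get -100 so the loss is computed on the response only
labels = [-100] * len(prompt_ids) + response_ids

assert len(input_ids) == len(labels)
```

The model still attends to the prompt tokens; it just isn't penalised for how it would have predicted them.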

RLHF — Reinforcement Learning from Human Feedback

Three stages: (1) SFT on demonstration data, (2) Train a reward model on human preference pairs (response A vs response B — which is better?), (3) Fine-tune the SFT model with PPO to maximise reward model score. Used by OpenAI for InstructGPT / ChatGPT.

DPO (Direct Preference Optimisation) — simpler alternative to RLHF. Skips the reward model entirely; directly optimises on preference pairs with a binary classification loss over the log-probability margin between chosen and rejected responses, measured relative to a frozen reference model. Increasingly preferred for production fine-tuning.
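The DPO loss for a single preference pair can be sketched in a few lines. This assumes you already have the summed log-probabilities of the chosen and rejected responses under both the policy and the frozen reference model (function and variable names here are illustrative, not from a library):

```python
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    # Implicit reward margin: how much more the policy prefers "chosen"
    # over "rejected", relative to the reference model
    margin = (policy_chosen - policy_rejected) - (ref_chosen - ref_rejected)
    return -math.log(1 / (1 + math.exp(-beta * margin)))  # -log sigmoid

# Policy prefers the chosen response more strongly than the reference does,
# so the loss is below log(2) (the value at zero margin)
loss = dpo_loss(policy_chosen=-12.0, policy_rejected=-20.0,
                ref_chosen=-14.0, ref_rejected=-18.0)
```

`beta` controls how far the policy is allowed to drift from the reference; at zero margin the loss is exactly log 2, and it falls as the policy's preference for the chosen response grows relative to the reference.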

Evaluation After Fine-Tuning

- Held-out validation loss on the fine-tuning task
- Task-specific metrics: exact match, accuracy, output-format compliance on a labelled test set
- General benchmarks compared against the base model, to catch catastrophic forgetting
- Spot-check outputs by hand; loss alone misses subtle regressions in style or structure

Common Interview Questions

Q: What rank (r) to use for LoRA?

Start with r=8. Higher rank = more expressive but more parameters. For format/style tasks r=4 or 8 is usually enough. For complex domain adaptation r=16 or 32. Monitor validation loss — if it doesn't converge, increase rank.

Q: How do you prevent catastrophic forgetting?

Use a low learning rate (1e-4 to 1e-5). Keep LoRA rank small. Add a small proportion of general-instruction data to your training mix (~10%). Evaluate on held-out general benchmarks throughout training, not just at the end.
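The "~10% general data" mix can be sketched directly (dataset contents here are placeholders; any list-like datasets work the same way):

```python
import random

random.seed(0)
domain_data = [{"instruction": f"domain example {i}"} for i in range(900)]
general_data = [{"instruction": f"general example {i}"} for i in range(2000)]

# Solve for n so that general examples are ~10% of the final mix
n_general = int(len(domain_data) * 0.10 / 0.90)
mix = domain_data + random.sample(general_data, n_general)
random.shuffle(mix)  # interleave so general examples appear throughout training
```

With 900 domain examples this samples 100 general ones, giving a 1000-example set that is 10% general-instruction data.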

Q: Fine-tuning vs prompting — where's the line?

Try prompting first — it's cheaper and faster to iterate. Fine-tune when: few-shot examples push context too long, prompting is inconsistent across runs, you need guaranteed output structure, or you're running thousands of queries where a smaller fine-tuned model is cheaper per token.