Fine-Tuning LLMs

When to fine-tune · LoRA / QLoRA · PEFT · SFT · RLHF · Fine-tune vs RAG


Fine-Tuning vs RAG — The Core Decision

Dimension           | Fine-Tuning                                    | RAG
Updates knowledge?  | Yes — baked into weights                       | Yes — via retrieval at runtime
Knowledge freshness | Static after training                          | Real-time with DB updates
Cost                | High upfront, cheap inference                  | Low upfront, retrieval cost per query
Data needed         | 100s–10,000s labelled examples                 | Source documents only
Best for            | Style, tone, format, domain-specific behaviour | Knowledge-intensive Q&A, grounded answers
Hallucination risk  | Higher (no grounding)                          | Lower (grounded in retrieved docs)
Explainability      | Low — black box                                | High — can cite sources
Fine-tune to change HOW the model behaves. Use RAG to change WHAT the model knows.

When Fine-Tuning Makes Sense

- You need consistent style, tone, or output format that prompting can't reliably enforce
- Domain-specific behaviour (terminology, conventions) must be baked into the model
- Few-shot examples push the context window too long, or prompting is inconsistent across runs
- High query volume, where a smaller fine-tuned model is cheaper per token than a prompted large one

LoRA — Low-Rank Adaptation

Instead of updating all model weights (billions of parameters), LoRA freezes the base model and adds small trainable low-rank matrices alongside the frozen weight matrices. Only these adapter weights are trained and stored.

# Weight update decomposed into low-rank matrices
W_new = W_frozen + (A @ B)
# where A is (d x r), B is (r x k), r << d, k
# rank r is typically 4, 8, or 16
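The parameter savings are easy to verify with toy matrices. A minimal NumPy sketch (illustrative shapes, not a real model; the `alpha / r` scaling is the standard LoRA convention, matching the `lora_alpha` hyperparameter used later):

```python
import numpy as np

d, k, r = 4096, 4096, 8          # hidden dims and LoRA rank
rng = np.random.default_rng(0)

W_frozen = rng.standard_normal((d, k))   # frozen pretrained weight
A = rng.standard_normal((d, r)) * 0.01   # trainable, (d x r)
B = np.zeros((r, k))                     # trainable, (r x k); zero-init so W_new == W_frozen at step 0
alpha = 16                               # LoRA scaling hyperparameter

W_new = W_frozen + (alpha / r) * (A @ B)

full_params = d * k                      # 16,777,216
lora_params = d * r + r * k              # 65,536
print(full_params // lora_params)        # 256x fewer trainable parameters at r=8
```

Note the zero-initialised B: at the start of training the adapter contributes nothing, so fine-tuning begins exactly from the base model's behaviour.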

Benefits: 100–1000x fewer trainable parameters, fits on a single GPU, swap adapters without reloading the base model, merge back into base weights for zero-latency inference.

QLoRA — Quantised LoRA

QLoRA adds 4-bit quantisation of the base model to LoRA. The base model is loaded in 4-bit (NF4 or FP4), reducing VRAM by ~4x. LoRA adapters train in 16-bit. Enables fine-tuning a 70B model on a single 48GB GPU.

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

model = AutoModelForCausalLM.from_pretrained(base_model, quantization_config=bnb_config)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: 4,194,304 || all params: 6,742,609,920 || trainable%: 0.0622
# (these counts correspond to r=8 on a 7B model; r=16 as configured above doubles the trainable count)

Supervised Fine-Tuning (SFT)

The most common form. You provide (instruction, response) pairs and train the model to predict the response given the instruction. Uses standard next-token prediction loss, computed on the response tokens only (prompt tokens are masked out of the loss).

# Dataset format for instruction fine-tuning
{
  "instruction": "Extract the DTI ratio from this mortgage application",
  "input": "Monthly debt payments: $2,400. Gross monthly income: $8,000.",
  "output": "DTI ratio: 30% (2400/8000 = 0.30)"
}
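The "loss on response tokens only" part is implemented by masking the prompt positions in the labels. A minimal sketch with toy token ids (no real tokenizer; the ids are made up), using the Hugging Face convention that label -100 means "ignore this position in the loss":

```python
prompt_ids   = [101, 2054, 2003, 1996, 102]   # instruction + input tokens (hypothetical ids)
response_ids = [3972, 2232, 1024, 2382, 102]  # output tokens

input_ids = prompt_ids + response_ids
# Prompt positions get -100 so the loss is computed on the response only
labels = [-100] * len(prompt_ids) + response_ids

assert len(input_ids) == len(labels)
```

The model still attends to the prompt tokens; it just isn't penalised for how it would have predicted them.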

RLHF — Reinforcement Learning from Human Feedback

Three stages: (1) SFT on demonstration data, (2) Train a reward model on human preference pairs (response A vs response B — which is better?), (3) Fine-tune the SFT model with PPO to maximise reward model score. Used by OpenAI for InstructGPT / ChatGPT.

DPO (Direct Preference Optimisation) — simpler alternative to RLHF. Skips the reward model entirely; directly optimises on preference pairs with a binary classification loss over the log-probability margin between chosen and rejected responses, measured relative to a frozen reference model. Increasingly preferred for production fine-tuning.
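The DPO loss for a single preference pair can be sketched in a few lines. This assumes you already have the summed log-probabilities of the chosen and rejected responses under both the policy and the frozen reference model (function and variable names here are illustrative, not from a library):

```python
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    # Implicit reward margin: how much more the policy prefers "chosen"
    # over "rejected", relative to the reference model
    margin = (policy_chosen - policy_rejected) - (ref_chosen - ref_rejected)
    return -math.log(1 / (1 + math.exp(-beta * margin)))  # -log sigmoid

# Policy prefers the chosen response more strongly than the reference does,
# so the loss is below log(2) (the value at zero margin)
loss = dpo_loss(policy_chosen=-12.0, policy_rejected=-20.0,
                ref_chosen=-14.0, ref_rejected=-18.0)
```

`beta` controls how far the policy is allowed to drift from the reference; at zero margin the loss is exactly log 2, and it falls as the policy's preference for the chosen response grows relative to the reference.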

Evaluation After Fine-Tuning

- Held-out validation loss on the fine-tuning task
- Task-specific metrics: exact match, accuracy, output-format compliance on a labelled test set
- General benchmarks compared against the base model, to catch catastrophic forgetting
- Spot-check outputs by hand; loss alone misses subtle regressions in style or structure

Common Interview Questions

Q: What rank (r) to use for LoRA?

Start with r=8. Higher rank = more expressive but more parameters. For format/style tasks r=4 or 8 is usually enough. For complex domain adaptation r=16 or 32. Monitor validation loss — if it doesn't converge, increase rank.

Q: How do you prevent catastrophic forgetting?

Use a low learning rate (1e-4 to 1e-5). Keep LoRA rank small. Add a small proportion of general-instruction data to your training mix (~10%). Evaluate on held-out general benchmarks throughout training, not just at the end.
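The "~10% general data" mix can be sketched directly (dataset contents here are placeholders; any list-like datasets work the same way):

```python
import random

random.seed(0)
domain_data = [{"instruction": f"domain example {i}"} for i in range(900)]
general_data = [{"instruction": f"general example {i}"} for i in range(2000)]

# Solve for n so that general examples are ~10% of the final mix
n_general = int(len(domain_data) * 0.10 / 0.90)
mix = domain_data + random.sample(general_data, n_general)
random.shuffle(mix)  # interleave so general examples appear throughout training
```

With 900 domain examples this samples 100 general ones, giving a 1000-example set that is 10% general-instruction data.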

Q: Fine-tuning vs prompting — where's the line?

Try prompting first — it's cheaper and faster to iterate. Fine-tune when: few-shot examples push context too long, prompting is inconsistent across runs, you need guaranteed output structure, or you're running thousands of queries where a smaller fine-tuned model is cheaper per token.