Fine-Tuning LLMs
When to fine-tune · LoRA / QLoRA · PEFT · SFT · RLHF · Fine-tune vs RAG
Fine-Tuning vs RAG — The Core Decision
| Dimension | Fine-Tuning | RAG |
|---|---|---|
| Updates knowledge? | Partially — baked into weights; better at behaviour than facts | Yes — via retrieval at runtime |
| Knowledge freshness | Static after training | Real-time with DB updates |
| Cost | High upfront, cheap inference | Low upfront, retrieval cost per query |
| Data needed | 100s–10,000s labelled examples | Source documents only |
| Best for | Style, tone, format, domain-specific behaviour | Knowledge-intensive Q&A, grounded answers |
| Hallucination risk | Higher (no grounding) | Lower (grounded in retrieved docs) |
| Explainability | Low — black box | High — can cite sources |
Fine-tune to change HOW the model behaves. Use RAG to change WHAT the model knows.
When Fine-Tuning Makes Sense
- You need a specific output format the base model doesn't follow reliably (JSON schema, structured reports)
- Domain-specific vocabulary or style that prompting doesn't capture (legal, medical, financial)
- Latency requirements — fine-tuned smaller model beats large model with long context
- You have proprietary labelled data that gives competitive advantage
- Privacy — can't send data to external API; self-hosted fine-tuned model is the answer
LoRA — Low-Rank Adaptation
Instead of updating all model weights (billions of parameters), LoRA freezes the base model and adds small trainable low-rank matrices alongside the frozen weight matrices. Only these adapter weights are trained and stored.
```python
# Weight update decomposed into low-rank matrices
W_new = W_frozen + (A @ B)
# where A is (d x r), B is (r x k), with r << d, k
# rank r is typically 4, 8, or 16
```
Benefits: 100–1000x fewer trainable parameters, fits on a single GPU, swap adapters without reloading the base model, merge back into base weights for zero-latency inference.
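The decomposition above can be sketched in a few lines of plain Python. Toy dimensions, no training loop; the zero initialisation of B is the standard LoRA trick so the adapter starts as a no-op:

```python
import random

d, k, r = 64, 64, 4  # toy hidden dims and LoRA rank (illustrative)
random.seed(0)

def matmul(X, Y):
    # plain-Python matrix multiply, fine for toy sizes
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

W = [[random.gauss(0, 0.02) for _ in range(k)] for _ in range(d)]  # frozen weight (d x k)
A = [[random.gauss(0, 0.02) for _ in range(r)] for _ in range(d)]  # trainable (d x r)
B = [[0.0] * k for _ in range(r)]  # trainable (r x k), zero-init so A @ B starts as a no-op

delta = matmul(A, B)  # low-rank update, shape (d x k)
W_new = [[w + dw for w, dw in zip(w_row, d_row)] for w_row, d_row in zip(W, delta)]

full_params = d * k         # parameters a full fine-tune would update
lora_params = d * r + r * k # parameters LoRA actually trains
print(full_params, lora_params, full_params // lora_params)  # 4096 512 8
```

Even at these toy sizes the adapter is 8x smaller; at real transformer dimensions (d, k in the thousands, r of 4–16) the ratio reaches the 100–1000x range quoted above.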
QLoRA — Quantised LoRA
QLoRA adds 4-bit quantisation of the base model to LoRA. The base model is loaded in 4-bit (NF4 or FP4), reducing VRAM by ~4x. LoRA adapters train in 16-bit. Enables fine-tuning a 70B model on a single 48GB GPU.
```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

model = AutoModelForCausalLM.from_pretrained(base_model, quantization_config=bnb_config)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: 4,194,304 || all params: 6,742,609,920 || trainable%: 0.0622
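The ~4x VRAM saving is simple arithmetic on bytes per parameter. A rough sketch for a 70B base model, counting weights only (activations, optimiser state, and adapter memory add overhead on top):

```python
params = 70e9  # 70B-parameter base model

bytes_per_param = {"fp16": 2.0, "int8": 1.0, "nf4": 0.5}  # 4-bit = half a byte
for fmt, b in bytes_per_param.items():
    print(f"{fmt}: {params * b / 1e9:.0f} GB")
# fp16: 140 GB, int8: 70 GB, nf4: 35 GB -> 4-bit weights fit under 48 GB
```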
Supervised Fine-Tuning (SFT)
The most common form. You provide (instruction, response) pairs and train the model to predict the response given the instruction. Uses standard next-token prediction loss computed on the response tokens only (prompt tokens are masked out of the loss).
Dataset format for instruction fine-tuning:

```json
{
  "instruction": "Extract the DTI ratio from this mortgage application",
  "input": "Monthly debt payments: $2,400. Gross monthly income: $8,000.",
  "output": "DTI ratio: 30% (2400/8000 = 0.30)"
}
```
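The prompt-masking step can be sketched as below. Whitespace splitting stands in for a real subword tokenizer, and `build_example` is a hypothetical helper, not a library function; -100 is the label value that Hugging Face's cross-entropy loss ignores:

```python
IGNORE_INDEX = -100  # label value skipped by the loss

def build_example(instruction, input_text, output):
    # stand-in tokenizer: whitespace split instead of a real subword tokenizer
    prompt_tokens = f"Instruction: {instruction}\nInput: {input_text}\nResponse:".split()
    output_tokens = output.split()
    input_seq = prompt_tokens + output_tokens
    # loss is computed only where labels != IGNORE_INDEX, i.e. on the response
    labels = [IGNORE_INDEX] * len(prompt_tokens) + output_tokens
    return input_seq, labels

seq, labels = build_example(
    "Extract the DTI ratio from this mortgage application",
    "Monthly debt payments: $2,400. Gross monthly income: $8,000.",
    "DTI ratio: 30% (2400/8000 = 0.30)",
)
```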
RLHF — Reinforcement Learning from Human Feedback
Three stages: (1) SFT on demonstration data, (2) Train a reward model on human preference pairs (response A vs response B — which is better?), (3) Fine-tune the SFT model with PPO to maximise reward model score. Used by OpenAI for InstructGPT / ChatGPT.
DPO (Direct Preference Optimisation) — simpler alternative to RLHF. Skips the reward model entirely; directly optimises on preference pairs using a modified cross-entropy loss. Increasingly preferred for production fine-tuning.
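The DPO objective is compact enough to sketch directly. A minimal pure-Python version, where each argument is assumed to be a response's log-probability summed over its tokens, and beta is the usual KL-strength hyperparameter:

```python
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    # implicit reward margin: how much the policy has shifted toward the chosen
    # response relative to the frozen reference model
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))  # -log sigmoid

# policy identical to reference -> margin 0 -> loss = ln 2 ~ 0.693
print(round(dpo_loss(-10.0, -12.0, -10.0, -12.0), 3))  # 0.693
```

Pushing the chosen response's log-probability up (or the rejected one down) relative to the reference widens the margin and drives the loss toward zero.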
Evaluation After Fine-Tuning
- Held-out task accuracy — performance on format/task the model was trained for
- Perplexity — lower is better; sanity check for training convergence
- Catastrophic forgetting — test general capabilities after fine-tuning; ensure base skills aren't degraded
- MMLU / HellaSwag — standard benchmarks for general reasoning degradation
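Perplexity is just the exponentiated average negative log-likelihood per token, which makes it easy to sanity-check by hand:

```python
import math

def perplexity(token_log_probs):
    # exp of the mean negative log-likelihood: 1.0 means perfect confidence,
    # vocab-size means the model is guessing uniformly
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

# uniform guessing over a 4-word vocabulary -> perplexity 4
print(round(perplexity([math.log(0.25)] * 10), 2))  # 4.0
```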
Common Interview Questions
Q: What rank (r) to use for LoRA?
Start with r=8. Higher rank = more expressive but more parameters. For format/style tasks r=4 or 8 is usually enough. For complex domain adaptation r=16 or 32. Monitor validation loss — if it doesn't converge, increase rank.
Q: How do you prevent catastrophic forgetting?
Use a low learning rate (1e-4 to 1e-5). Keep LoRA rank small. Add a small proportion of general-instruction data to your training mix (~10%). Evaluate on held-out general benchmarks throughout training, not just at the end.
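The ~10% general-data mix above can be sketched as follows; `mix_datasets` is a hypothetical helper, and the fraction and seed are illustrative:

```python
import random

def mix_datasets(domain, general, general_frac=0.10, seed=42):
    # sample enough general examples that they make up ~general_frac of the mix
    n_general = round(len(domain) * general_frac / (1.0 - general_frac))
    rng = random.Random(seed)
    mixed = list(domain) + rng.sample(general, min(n_general, len(general)))
    rng.shuffle(mixed)
    return mixed

mixed = mix_datasets(["domain"] * 900, ["general"] * 500)
print(len(mixed), mixed.count("general"))  # 1000 100
```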
Q: Fine-tuning vs prompting — where's the line?
Try prompting first — it's cheaper and faster to iterate. Fine-tune when: few-shot examples push context too long, prompting is inconsistent across runs, you need guaranteed output structure, or you're running thousands of queries where a smaller fine-tuned model is cheaper per token.