Fine-Tuning Machine Learning and LLM Models
A beginner-to-intermediate handbook focused on applied fine-tuning: what to tune, when to tune, how to structure data, which PEFT method to choose, and how to run a practical Hugging Face training loop without drowning in theory.
Introduction to Fine-Tuning
Fine-tuning means taking a model that already learned general patterns and continuing training on a smaller, task-specific dataset. Instead of teaching the model everything from zero, you adapt what it already knows so it performs better on your domain, format, labels, or response style.
Where pretraining builds broad, general capability from massive corpora, fine-tuning is a smaller follow-up step that adapts a pretrained model to a target task like classification, summarization, chat, ranking, or domain adaptation.
Transfer learning starts from a pretrained model and reuses existing knowledge. Fine-tuning is the most common transfer-learning pattern for modern ML and LLM work.
When Fine-Tuning Is Required vs Not Required
Use it for: domain terminology, consistent output format, label prediction, product-specific assistant behavior, ranking objectives, or local deployment where a smaller model must perform better on a narrow task.
Skip it when: prompt engineering, retrieval-augmented generation (RAG), better evaluation prompts, or simple application logic can already solve the problem. Fine-tuning is slower to iterate and easier to get wrong if your dataset is weak.
Key Parameters in Fine-Tuning
Most failed fine-tuning runs come from a small set of knobs: learning rate, batch size, epochs, regularization, and scheduler behavior. You do not need to memorize every formula, but you should know what each parameter changes so you can debug training runs quickly.
from transformers import TrainingArguments
training_args = TrainingArguments(
learning_rate=2e-5,
per_device_train_batch_size=8,
num_train_epochs=3,
weight_decay=0.01,
warmup_steps=500,
logging_steps=100
)
w_{t+1} = w_t - \eta g_t. The learning rate \eta controls how large each parameter update is.
effective_batch = per_device_batch * grad_accumulation * num_devices. The effective batch size matters more than per-device batch size alone.
lr_t = lr_max * t / warmup_steps during warmup. This prevents unstable early updates.
g_clipped = g * min(1, max_norm / ||g||). Large gradient spikes get scaled down.
| Parameter | What it does | If you increase it | Overfitting | Convergence / Stability |
|---|---|---|---|---|
| Learning rate | Controls update size each step. | Training moves faster, but can overshoot good solutions. | Indirect effect. Too high can look like poor generalization because training becomes noisy. | Most sensitive knob. Too high causes divergence or oscillation. Too low wastes time and may underfit. |
| Batch size | Number of samples used for one gradient estimate. | Gradient becomes smoother and training often allows a slightly higher LR. | Very large batches can reduce helpful noise and sometimes generalize worse. | Usually improves stability, but uses more VRAM. |
| Epochs | How many full passes over the training data. | The model fits the training set more aggressively. | Main direct driver of overfitting if validation stops improving. | Enough epochs are needed to learn; too many memorize noise. |
| Weight decay | Penalizes large weights and acts as regularization. | Can reduce overfitting, but too much hurts fit. | Often helps if the model memorizes small datasets. | Moderate values improve robustness. Too high slows learning. |
| Gradient clipping | Caps very large gradient norms. | More aggressive clipping if the threshold is lower. | Not a direct anti-overfitting tool. | Strongly improves stability on transformer runs that spike in loss. |
| Warmup steps | Starts with a small LR and ramps up. | Safer early training, but too much warmup slows progress. | No major direct effect. | Useful for preventing early instability, especially with AdamW and large models. |
| Scheduler | Changes LR over time. | Depends on schedule type. | Good schedules reduce late-stage overfitting by lowering LR. | Linear and cosine are common because they balance speed and smooth finishing. |
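The batch-size row above refers to the effective batch, which combines per-device batch, gradient accumulation, and device count. A minimal sketch of the arithmetic (function name is illustrative):

```python
def effective_batch_size(per_device_batch: int,
                         grad_accumulation: int,
                         num_devices: int) -> int:
    """Number of samples contributing to each optimizer step."""
    return per_device_batch * grad_accumulation * num_devices

# 8 samples per GPU, 4 accumulation steps, 2 GPUs -> 64 samples per update
print(effective_batch_size(8, 4, 2))  # 64
```

This is why gradient accumulation lets a small GPU imitate a large-batch run: the optimizer only steps after the accumulated gradients are summed.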
Scheduler Types
| Scheduler | How it behaves | When to use it |
|---|---|---|
| Linear decay | LR ramps up, then decays steadily toward zero. | Safe default for many Hugging Face runs and classification tasks. |
| Cosine decay | LR drops slowly at first, then more smoothly near the end. | Common for LLM fine-tuning when you want gentle late-stage updates. |
| Constant with warmup | LR warms up, then stays fixed. | Useful when you want predictable tuning behavior or short runs. |
| Polynomial decay | LR shrinks using a curve you can shape. | More control-heavy setups; less common for beginners. |
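The linear-decay row can be made concrete with a small sketch of the schedule shape. This mirrors what `get_scheduler("linear", ...)` in transformers produces, but it is a hand-rolled illustration, not the library implementation:

```python
def linear_schedule_lr(step: int, max_lr: float,
                       warmup_steps: int, total_steps: int) -> float:
    """Linear warmup to max_lr, then linear decay toward zero."""
    if step < warmup_steps:
        return max_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return max_lr * remaining / max(1, total_steps - warmup_steps)

# LR ramps up over the first 100 steps, peaks at 2e-5, then decays to zero
lrs = [linear_schedule_lr(s, 2e-5, 100, 1000) for s in range(0, 1001, 100)]
```

Plotting `lrs` shows the triangle shape: a ramp during warmup and a straight line down afterward.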
Data Formats for Fine-Tuning
Your dataset format should match the behavior you want the model to learn. Fine-tuning works best when the training examples already look like the prompts and outputs you expect at inference time.
Instruction Tuning Format
Use this when you want the model to follow task instructions and produce a single target answer.
{
"instruction": "Translate English to French",
"input": "Hello",
"output": "Bonjour"
}
Chat Format
Use this when the model should behave like an assistant in multi-turn conversations.
[
{"role": "user", "content": "What is AI?"},
{"role": "assistant", "content": "AI stands for Artificial Intelligence..."}
]
Plain Text Format
Use this for language modeling, continued pretraining, or next-token prediction on raw text. The model learns to continue text rather than answer explicit instruction fields.
Artificial intelligence is a branch of computer science focused on building systems that can perform tasks requiring human-like reasoning.
| Format | Best for | Pros | Cons |
|---|---|---|---|
| Instruction tuning | Single-turn tasks, extraction, summarization, QA, task transfer | Simple to build, easy to inspect, strong for SFT | Can miss conversational context if your real product is multi-turn |
| Chat format | Assistants, copilots, support bots, role-conditioned dialogue | Matches production chat behavior, preserves turns and tone | Formatting mistakes around roles can silently hurt training quality |
| Plain text | Continued pretraining, domain adaptation, language modeling | Easy to collect at scale, good for domain vocabulary | Does not directly teach instruction following or answer structure |
Dataset Preparation Tips
- Keep the output shape consistent. If production answers should be JSON, tables, or short bullets, teach exactly that format during training.
- Deduplicate aggressively. Duplicates make the model memorize frequent samples and distort evaluation.
- Clean bad labels first. A smaller high-quality dataset usually beats a larger noisy one.
- Split train, validation, and test early. Do this before heavy preprocessing to reduce leakage.
- Watch sequence length. Overly long samples waste tokens and can hide the actual signal.
- Preserve edge cases. Include difficult examples so the model learns boundary behavior, not only easy cases.
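The dedup and early-split tips above can be sketched in plain Python. Field names follow the instruction format shown earlier; real pipelines often add near-duplicate detection on top of exact matching:

```python
import random

def dedup_and_split(examples, test_frac=0.1, seed=42):
    """Exact-match dedup, then a shuffled train/test split."""
    seen, unique = set(), []
    for ex in examples:
        key = (ex["instruction"], ex["input"], ex["output"])
        if key not in seen:
            seen.add(key)
            unique.append(ex)
    rng = random.Random(seed)          # fixed seed keeps splits reproducible
    rng.shuffle(unique)
    n_test = max(1, int(len(unique) * test_frac))
    return unique[n_test:], unique[:n_test]

data = [{"instruction": "a", "input": "b", "output": "c"}] * 3 + \
       [{"instruction": "x", "input": "y", "output": "z"}]
train, test = dedup_and_split(data, test_frac=0.5)  # 3 duplicates collapse to 1
```

Splitting immediately after dedup, before any heavier preprocessing, is what keeps near-identical samples from leaking across the train/test boundary.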
Fine-Tuning Approaches
Not all fine-tuning means updating every weight. The main approaches differ in cost, memory, speed, and how much task-specific capacity you gain.
| Method | Memory | Speed | Performance | Use Case |
|---|---|---|---|---|
| Full fine-tuning | Highest | Slowest setup and heaviest checkpoints | Best ceiling when you have strong data and compute | High-value tasks, model ownership, strong hardware budget |
| Feature extraction | Lowest | Fast | Good for simpler supervised tasks, limited for generation | Classification, regression, small labeled datasets |
| PEFT | Low to medium | Fast training with small trainable state | Often close to full tuning on narrow tasks | LLMs on limited GPUs, rapid iteration, multiple adapters |
LoRA, QLoRA, and Other Efficient Methods
Parameter-efficient fine-tuning lets you adapt large models without paying the full memory cost of updating every weight. This is why consumer-GPU fine-tuning became practical.
Concept: instead of rewriting a full weight matrix W, LoRA learns a low-rank update \Delta W = BA and applies W' = W + \Delta W. Only the small adapter matrices A and B are trained.
Why it is efficient: trainable parameter count drops dramatically, optimizer state becomes much smaller, and you can store many task adapters on top of one base model.
When to use it: instruction tuning, domain adaptation, lightweight experiments, or multiple task-specific adapters.
from peft import LoraConfig, get_peft_model
config = LoraConfig(
r=8,
lora_alpha=16,
target_modules=["q_proj", "v_proj"],
lora_dropout=0.1
)
model = get_peft_model(model, config)
Short intuition: r controls adapter capacity, lora_alpha scales the update, and target_modules decides where adapters are inserted.
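To see why the adapter is cheap, compare parameter counts for a full update of a d x d weight against the rank-r factors B (d x r) and A (r x d). A back-of-the-envelope sketch with illustrative sizes:

```python
def lora_param_counts(d: int, r: int):
    """Trainable parameters: full d x d update vs rank-r LoRA factors."""
    full = d * d                 # updating W directly
    lora = d * r + r * d         # B is d x r, A is r x d
    return full, lora

full, lora = lora_param_counts(d=4096, r=8)
print(full, lora, full / lora)  # 16777216 65536 256.0
```

At r=8 on a 4096-wide projection, the adapter trains roughly 1/256 of the parameters a full update would touch, which is where most of the memory savings come from.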
QLoRA concept: load the base model in low precision, typically 4-bit, and train LoRA adapters on top. The frozen base stays quantized while adapter weights remain trainable in higher precision.
Why it is memory efficient: the frozen base weights occupy far less VRAM, which makes 7B to 13B models accessible on much smaller GPUs.
When to prefer it over LoRA: when VRAM is your main constraint, especially for larger models or laptops and single-GPU systems.
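A typical 4-bit loading setup looks like the sketch below, assuming transformers with bitsandbytes support and a CUDA device; the model name is illustrative, and this is a configuration sketch rather than a complete recipe:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize the frozen base weights
    bnb_4bit_quant_type="nf4",              # NormalFloat4, used in the QLoRA paper
    bnb_4bit_compute_dtype=torch.bfloat16,  # higher-precision compute for matmuls
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",  # illustrative base model
    quantization_config=bnb_config,
)
# LoRA adapters are then attached with peft's get_peft_model, as shown earlier.
```

Only the adapter weights receive gradients; the quantized base is read-only during training.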
| Method | Approx VRAM for 7B training | Quality ceiling | Best use |
|---|---|---|---|
| Full fine-tuning | 60 GB or more, depending on precision and optimizer | Highest | High-budget, high-stakes adaptation |
| LoRA | Roughly 16-24 GB | Very strong for many narrow tasks | Single-GPU adaptation with minimal performance loss |
| QLoRA | Roughly 10-16 GB | Slightly lower than LoRA in some setups, often close enough | Tight VRAM budgets and fast experimentation |
Other Techniques
| Technique | Main tradeoff | Good fit |
|---|---|---|
| LoRA | Best balance of simplicity, quality, and memory | Default PEFT choice |
| QLoRA | Lowest memory, slightly more complexity around quantization support | Limited VRAM environments |
| Adapters | More parameters than prompt-based methods | Reusable modular adapters |
| Prefix tuning | Very compact but less expressive for some tasks | Controlling style or task framing |
| Prompt tuning | Smallest footprint, but often weakest adaptation power | Resource-constrained experiments |
Optimizers With Focus on Adam
An optimizer decides how parameter updates are applied. For transformer fine-tuning, Adam and especially AdamW are the common defaults because they adapt learning rates per parameter and handle noisy gradients well.
import torch.optim as optim
# Adam baseline; for transformer fine-tuning, AdamW (optim.AdamW) is usually preferred.
optimizer = optim.Adam(model.parameters(), lr=2e-5)
Simple Explanation of Adam
- Momentum: smooths noisy gradients by remembering recent directions. This helps the optimizer keep moving in a useful direction instead of reacting too hard to every mini-batch.
- RMS scaling: uses a running estimate of squared gradients so parameters with consistently large gradients get smaller updates.
- Bias correction: early moving averages are corrected so the optimizer does not underestimate them at the start.
m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t
v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2
w_{t+1} = w_t - \eta \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon), where \hat{m}_t and \hat{v}_t are the bias-corrected moments.
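The three pieces combine into one update per parameter. A single-parameter sketch with the usual defaults (beta1=0.9, beta2=0.999, eps=1e-8); variable names are illustrative:

```python
import math

def adam_step(w, g, m, v, t, lr=1e-3,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter w with gradient g."""
    m = beta1 * m + (1 - beta1) * g          # momentum (first moment)
    v = beta2 * v + (1 - beta2) * g * g      # RMS scaling (second moment)
    m_hat = m / (1 - beta1 ** t)             # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

# First step: bias correction makes the update magnitude close to lr itself
w, m, v = adam_step(w=0.0, g=2.0, m=0.0, v=0.0, t=1)
```

Note how on step 1 the raw moments are tiny, and the bias correction rescales them so the first update is not artificially small.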
| Choice | Use it when | Why |
|---|---|---|
| Adam | You want a simple adaptive optimizer baseline | Works well on many deep learning tasks, especially transformers |
| SGD | You have a very large dataset, vision-style training, or well-tuned classical pipelines | Can generalize well, but usually needs more careful tuning for LLM fine-tuning |
| AdamW | You are fine-tuning transformers | Decouples weight decay from gradient updates, which usually makes regularization cleaner and more stable |
Ranking and Training Objectives
Your training objective defines what “better” means during optimization. If you choose the wrong objective, the model can improve on paper while failing the real task.
Cross-Entropy Loss
Use this for supervised fine-tuning, text classification, sequence labeling, and next-token prediction. It is usually the first objective to try because it aligns directly with “predict the correct answer.”
Ranking Loss
Use ranking objectives when the order of results matters more than a single label. Common examples include search, recommendation, reranking, and preference learning.
| Type | What it compares | Typical use |
|---|---|---|
| Pairwise | One positive item vs one negative item | Search ranking, reward modeling, binary preference comparison |
| Listwise | A full ranked list | Recommendation systems, search pages, complex ranking pipelines |
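A pairwise objective can be written directly as a margin loss: the positive item should score higher than the negative item by at least a margin, and the loss is zero once it does. This mirrors the idea behind torch.nn.MarginRankingLoss, sketched here in plain Python:

```python
def pairwise_margin_loss(pos_score: float, neg_score: float,
                         margin: float = 1.0) -> float:
    """Zero loss once pos_score beats neg_score by at least the margin."""
    return max(0.0, margin - (pos_score - neg_score))

print(pairwise_margin_loss(3.0, 1.0))  # 0.0  already separated by >= margin
print(pairwise_margin_loss(1.2, 1.0))  # 0.8  not separated enough yet
```

Listwise losses generalize this by scoring an entire ranked list at once instead of one pair at a time.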
RLHF - Reinforcement Learning from Human Feedback
RLHF usually means: supervised fine-tune first, gather human preferences, train a reward model, then optimize the main model so it produces higher-reward responses. In practice, many teams now use simpler preference-optimization methods like DPO, but the high-level idea is the same: learn from what humans prefer, not only from fixed labels.
Pretraining vs Fine-Tuning
Fine-tuning is usually enough, but not always. The harder question is whether your problem needs continued pretraining or even training from scratch before fine-tuning.
Continued pretraining (or pretraining from scratch) tends to help for: very domain-specific data such as medical records, legal corpora, scientific literature, protein sequences, specialized codebases, or any setting where the base model repeatedly fails because the underlying vocabulary and distribution are too different.
Fine-tuning alone is usually enough for: general-purpose chat, classification, summarization, extraction, translation-like formatting, and many enterprise assistants where the base model already has the underlying capabilities.
Decision Checklist
- Does the base model already understand the task? If yes, start with fine-tuning or even prompting.
- Is your data highly domain-specific? If yes, continued pretraining may help before SFT.
- Do you have millions or billions of tokens? If no, full pretraining is probably not worth it.
- Do you mainly need style, format, or label adaptation? Fine-tuning is usually enough.
- Do facts change frequently? Prefer RAG or retrieval over baking transient facts into weights.
Evaluation of Models
Evaluation is critical because training loss alone can lie. A model can memorize training data, look great on training metrics, and still fail on real users, new domains, or edge cases.
Common Evaluation Methods
| Metric | Best for | What to watch |
|---|---|---|
| Accuracy / F1 | Classification, extraction, labeling | F1 is better than raw accuracy when classes are imbalanced |
| Perplexity | Language modeling, continued pretraining, causal LM tuning | Lower is better, but only meaningful when comparing on the same data distribution |
| BLEU / ROUGE | Translation, summarization | Helpful for automation, but can miss semantic quality and factuality |
| Human evaluation | Chat quality, safety, tone, helpfulness | Slow but often necessary for assistant behavior |
| Benchmark datasets | Standardized comparison | Useful for baselines, but make sure they match your real task |
import evaluate
metric = evaluate.load("accuracy")
result = metric.compute(predictions=preds, references=labels)
The example uses the evaluate package, which replaced the now-deprecated datasets.load_metric helper; the core idea is unchanged: compute task metrics from predictions and reference labels.
Offline vs Online Evaluation
| Mode | What it answers | Example |
|---|---|---|
| Offline | Did the model improve on held-out data? | F1 on a validation split, perplexity on dev text, human review set |
| Online | Did the model improve real product outcomes? | CTR, resolution rate, conversion, user preference, latency, cost |
Overfitting Detection
- Training loss keeps dropping while validation loss rises.
- Generated outputs repeat training phrasing too closely.
- Benchmark gains do not transfer to manual test prompts or business data.
- Edge cases degrade even though average metrics improve.
Best Practices and Common Pitfalls
Most practical wins come from disciplined data and monitoring, not from exotic algorithms. These are the habits that consistently improve fine-tuning quality.
- Monitor validation loss. It is your early warning system for overfitting and bad hyperparameters.
- Use early stopping. Stop when validation no longer improves.
- Tune one thing at a time. Change learning rate or epochs first before chasing harder explanations.
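The early-stopping rule amounts to patience over validation loss; transformers provides EarlyStoppingCallback for the same idea inside Trainer, but the logic is simple enough to sketch directly:

```python
def should_stop(val_losses, patience=2, min_delta=0.0):
    """Stop once validation loss fails to improve for `patience` evals in a row."""
    best = float("inf")
    stale = 0
    for loss in val_losses:
        if loss < best - min_delta:
            best, stale = loss, 0   # new best: reset the patience counter
        else:
            stale += 1
            if stale >= patience:
                return True
    return False

print(should_stop([1.0, 0.8, 0.81, 0.82]))  # True: two evals without improvement
print(should_stop([1.0, 0.8, 0.6]))         # False: still improving
```

The `min_delta` threshold guards against counting tiny, noise-level improvements as real progress.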
Practical Tuning Tips
- Start with a strong baseline. Compare prompt-only, RAG, and a simple fine-tune before scaling the project.
- Log everything. Save configs, seeds, metrics, model revisions, and dataset versions.
- Use early stopping and checkpointing. This saves compute and prevents losing a good run.
- Inspect actual outputs every run. Metrics alone do not reveal formatting mistakes, hallucinations, or tone issues.
- Keep a small gold set. Use a hand-curated evaluation set for fast regression checks.
Hardware Considerations
Compute planning matters because the same method can feel easy or impossible depending on model size, precision, and optimization choices.
| Setup | Typical fit | Notes |
|---|---|---|
| Single 12-16 GB GPU | Small models, LoRA, QLoRA, short sequences | Use 4-bit loading, gradient accumulation, and possibly gradient checkpointing |
| Single 24 GB GPU | 7B-scale LoRA or QLoRA, moderate sequence lengths | Good sweet spot for practical experimentation |
| 48 GB+ GPU | Larger LoRA runs, faster batch sizes, some full tuning experiments | Much more room for sequence length and optimizer state |
| Multi-GPU | Large models, distributed training, full fine-tuning | Needed when model states and activations no longer fit on one device |
Use fp16 or bf16 mixed precision when supported. This usually reduces memory use and speeds up training.
Real-World Use Cases
Fine-tuning is most useful when you need repeatable behavior that generic prompting cannot guarantee.
End-to-End Example
The example below shows a compact Hugging Face Trainer workflow for instruction-style fine-tuning on a small causal language model. It covers data loading, prompt formatting, tokenization, training, and evaluation.
import math
from datasets import Dataset
from transformers import (
AutoModelForCausalLM,
AutoTokenizer,
DataCollatorForLanguageModeling,
Trainer,
TrainingArguments,
)
samples = [
{
"instruction": "Summarize the sentence",
"input": "Fine-tuning adapts a pretrained model to a target task.",
"output": "Fine-tuning specializes a pretrained model."
},
{
"instruction": "Answer the question",
"input": "What does GPU stand for?",
"output": "GPU stands for Graphics Processing Unit."
},
{
"instruction": "Rewrite in a formal tone",
"input": "The launch failed because the config was wrong.",
"output": "The launch failed due to an incorrect configuration."
},
{
"instruction": "Extract the sentiment",
"input": "The onboarding experience was smooth and very helpful.",
"output": "positive"
}
]
dataset = Dataset.from_list(samples).train_test_split(test_size=0.25, seed=42)
model_name = "distilgpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)
model.config.pad_token_id = tokenizer.pad_token_id
def format_example(example):
instruction = example["instruction"].strip()
input_text = example["input"].strip()
output_text = example["output"].strip()
return (
f"### Instruction:\n{instruction}\n\n"
f"### Input:\n{input_text}\n\n"
f"### Response:\n{output_text}"
)
def add_text(example):
example["text"] = format_example(example)
return example
dataset = dataset.map(add_text)
def tokenize(example):
tokens = tokenizer(
example["text"],
truncation=True,
max_length=256,
padding="max_length"
)
tokens["labels"] = tokens["input_ids"].copy()
return tokens
tokenized = dataset.map(tokenize)
tokenized = tokenized.remove_columns(["instruction", "input", "output", "text"])
training_args = TrainingArguments(
output_dir="./ft-output",
learning_rate=2e-5,
per_device_train_batch_size=2,
per_device_eval_batch_size=2,
num_train_epochs=3,
weight_decay=0.01,
warmup_steps=20,
logging_steps=10,
evaluation_strategy="epoch",
save_strategy="epoch",
report_to="none"
)
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
trainer = Trainer(
model=model,
args=training_args,
train_dataset=tokenized["train"],
eval_dataset=tokenized["test"],
data_collator=data_collator
)
trainer.train()
metrics = trainer.evaluate()
print(metrics)
if "eval_loss" in metrics:
print("perplexity:", math.exp(metrics["eval_loss"]))
What each step is doing
- Load dataset: build or read examples with instruction, input, and output fields.
- Format text: turn structured fields into the exact prompt style you want the model to learn.
- Tokenize: convert text to token IDs and create labels for next-token learning.
- Train with Trainer: use TrainingArguments to control learning rate, batch size, epochs, warmup, and logging.
- Evaluate: inspect metrics and sample outputs before calling the run successful.
How to convert this into a PEFT run
from peft import LoraConfig, get_peft_model
lora_config = LoraConfig(
r=8,
lora_alpha=16,
lora_dropout=0.1,
target_modules=["c_attn"]
)
model = get_peft_model(model, lora_config)
For larger instruction-tuning projects, many teams switch from raw Trainer to specialized trainers such as SFTTrainer, but the core workflow remains the same.
Reference Links
Core Docs
- Hugging Face Transformers docs
- Trainer API reference
- Hugging Face Datasets docs
- PEFT documentation
- PyTorch docs
Applied Reading
- LoRA paper
- QLoRA paper
- TRL documentation
- Evaluate library docs