Fine-Tuning Machine Learning and LLM Models

A beginner-to-intermediate handbook focused on applied fine-tuning: what to tune, when to tune, how to structure data, which PEFT method to choose, and how to run a practical Hugging Face training loop without drowning in theory.

Beginner to Intermediate · PyTorch + Hugging Face · LoRA / QLoRA / Evaluation · March 2026

Introduction to Fine-Tuning

Fine-tuning means taking a model that already learned general patterns and continuing training on a smaller, task-specific dataset. Instead of teaching the model everything from zero, you adapt what it already knows so it performs better on your domain, format, labels, or response style.

Pretraining on a huge corpus → general language or task skill → fine-tuning on your data → specialized behavior.

Pretraining vs Fine-Tuning
Pretraining learns broad patterns from very large datasets such as web text, books, code, or domain corpora. It is expensive and usually done once.

Fine-tuning is a smaller follow-up step that adapts a pretrained model to a target task like classification, summarization, chat, ranking, or domain adaptation.
Training From Scratch vs Transfer Learning
Training from scratch initializes random weights and requires a lot of data, compute, and time.

Transfer learning starts from a pretrained model and reuses existing knowledge. Fine-tuning is the most common transfer-learning pattern for modern ML and LLM work.

When Fine-Tuning Is Required vs Not Required

Fine-tune when the model must learn stable behavior (good fit)

Use it for: domain terminology, consistent output format, label prediction, product-specific assistant behavior, ranking objectives, or local deployment where a smaller model must perform better on a narrow task.

Examples: domain adaptation, instruction following, classification, reranking.
Do not fine-tune by default (try cheaper options first)

Skip it when: prompt engineering, retrieval-augmented generation (RAG), better evaluation prompts, or simple application logic can already solve the problem. Fine-tuning is slower to iterate and easier to get wrong if your dataset is weak.

Cheaper options: prompting first, RAG for fresh facts, rules for deterministic logic.
Practical rule: if the model already knows the knowledge but does not behave the way you want, fine-tuning can help. If the model lacks the facts and those facts change often, RAG or a search layer is usually a better first move.

Key Parameters in Fine-Tuning

Most failed fine-tuning runs come from a small set of knobs: learning rate, batch size, epochs, regularization, and scheduler behavior. You do not need to memorize every formula, but you should know what each parameter changes so you can debug training runs quickly.

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./output",            # destination for checkpoints and logs
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
    warmup_steps=500,
    logging_steps=100
)
Weight update: w_{t+1} = w_t − η·g_t. The learning rate η controls how large each parameter update is.

Effective batch size: per_device_batch × gradient_accumulation_steps × num_devices. This matters more than per-device batch size alone.

Warmup: lr_t = lr_max · t / warmup_steps during warmup. This prevents unstable early updates.

Gradient clipping: g_clipped = g · min(1, max_norm / ||g||). Large gradient spikes get scaled down.
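These formulas are easy to sanity-check numerically. A minimal pure-Python sketch (the batch, step, and norm values are illustrative, not recommendations):

```python
def warmup_lr(step, lr_max=2e-5, warmup_steps=500):
    # Linear warmup: LR ramps from 0 to lr_max over warmup_steps, then holds.
    if step < warmup_steps:
        return lr_max * step / warmup_steps
    return lr_max

def clip_scale(grad_norm, max_norm=1.0):
    # Factor the gradient is multiplied by under norm clipping.
    return min(1.0, max_norm / grad_norm)

# Effective batch size: per-device batch * gradient accumulation * devices.
effective_batch = 8 * 4 * 2
print(effective_batch)      # 64
print(warmup_lr(250))       # halfway through warmup: 1e-05
print(clip_scale(5.0))      # a gradient norm of 5.0 is scaled by 0.2
```
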
| Parameter | What it does | If you increase it | Overfitting | Convergence / Stability |
| --- | --- | --- | --- | --- |
| Learning rate | Controls update size each step. | Training moves faster, but can overshoot good solutions. | Indirect effect. Too high can look like poor generalization because training becomes noisy. | Most sensitive knob. Too high causes divergence or oscillation; too low wastes time and may underfit. |
| Batch size | Number of samples used for one gradient estimate. | Gradient becomes smoother and training often allows a slightly higher LR. | Very large batches can reduce helpful noise and sometimes generalize worse. | Usually improves stability, but uses more VRAM. |
| Epochs | How many full passes over the training data. | The model fits the training set more aggressively. | Main direct driver of overfitting if validation stops improving. | Enough epochs are needed to learn; too many memorize noise. |
| Weight decay | Penalizes large weights and acts as regularization. | Can reduce overfitting, but too much hurts fit. | Often helps if the model memorizes small datasets. | Moderate values improve robustness; too high slows learning. |
| Gradient clipping | Caps very large gradient norms. | A lower threshold means more aggressive clipping. | Not a direct anti-overfitting tool. | Strongly improves stability on transformer runs that spike in loss. |
| Warmup steps | Starts with a small LR and ramps up. | Safer early training, but too much warmup slows progress. | No major direct effect. | Useful for preventing early instability, especially with AdamW and large models. |
| Scheduler | Changes LR over time. | Depends on schedule type. | Good schedules reduce late-stage overfitting by lowering LR. | Linear and cosine are common because they balance speed and smooth finishing. |

Scheduler Types

| Scheduler | How it behaves | When to use it |
| --- | --- | --- |
| Linear decay | LR ramps up, then decays steadily toward zero. | Safe default for many Hugging Face runs and classification tasks. |
| Cosine decay | LR drops slowly at first, then more smoothly near the end. | Common for LLM fine-tuning when you want gentle late-stage updates. |
| Constant with warmup | LR warms up, then stays fixed. | Useful when you want predictable tuning behavior or short runs. |
| Polynomial decay | LR shrinks using a curve you can shape. | More control-heavy setups; less common for beginners. |
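The linear and cosine shapes can be sketched in a few lines of pure Python (lr_max and warmup values here are illustrative):

```python
import math

def linear_schedule(step, total_steps, lr_max=2e-5, warmup=100):
    # Warm up linearly, then decay linearly to zero.
    if step < warmup:
        return lr_max * step / warmup
    return lr_max * (total_steps - step) / (total_steps - warmup)

def cosine_schedule(step, total_steps, lr_max=2e-5, warmup=100):
    # Warm up linearly, then follow a half cosine down to zero.
    if step < warmup:
        return lr_max * step / warmup
    progress = (step - warmup) / (total_steps - warmup)
    return lr_max * 0.5 * (1 + math.cos(math.pi * progress))

# Both peak at the end of warmup and reach zero at total_steps;
# cosine decays more slowly early on and more gently near the end.
```
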
Debugging shortcut: if training is unstable, first lower learning rate, add or increase warmup, and make sure gradient clipping is on. If validation gets worse while training loss keeps improving, reduce epochs and strengthen regularization.

Data Formats for Fine-Tuning

Your dataset format should match the behavior you want the model to learn. Fine-tuning works best when the training examples already look like the prompts and outputs you expect at inference time.

Instruction Tuning Format

Use this when you want the model to follow task instructions and produce a single target answer.

{
  "instruction": "Translate English to French",
  "input": "Hello",
  "output": "Bonjour"
}

Chat Format

Use this when the model should behave like an assistant in multi-turn conversations.

[
  {"role": "user", "content": "What is AI?"},
  {"role": "assistant", "content": "AI stands for Artificial Intelligence..."}
]

Plain Text Format

Use this for language modeling, continued pretraining, or next-token prediction on raw text. The model learns to continue text rather than answer explicit instruction fields.

Artificial intelligence is a branch of computer science focused on building systems that can perform tasks requiring human-like reasoning.
| Format | Best for | Pros | Cons |
| --- | --- | --- | --- |
| Instruction tuning | Single-turn tasks, extraction, summarization, QA, task transfer | Simple to build, easy to inspect, strong for SFT | Can miss conversational context if your real product is multi-turn |
| Chat format | Assistants, copilots, support bots, role-conditioned dialogue | Matches production chat behavior, preserves turns and tone | Formatting mistakes around roles can silently hurt training quality |
| Plain text | Continued pretraining, domain adaptation, language modeling | Easy to collect at scale, good for domain vocabulary | Does not directly teach instruction following or answer structure |
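If you collect data in instruction format but later need chat format, the conversion is mechanical. A minimal sketch (field names follow the examples above):

```python
def instruction_to_chat(example):
    # Fold instruction + input into one user turn; output becomes the reply.
    user_content = example["instruction"]
    if example.get("input"):
        user_content += "\n\n" + example["input"]
    return [
        {"role": "user", "content": user_content},
        {"role": "assistant", "content": example["output"]},
    ]

messages = instruction_to_chat({
    "instruction": "Translate English to French",
    "input": "Hello",
    "output": "Bonjour",
})
print(messages[0]["role"])     # user
print(messages[1]["content"])  # Bonjour
```
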

Dataset Preparation Tips

Deduplicate examples, keep labels and formatting consistent, hold out a validation split before training, and make sure training examples look like the prompts and outputs you expect at inference time.

Simple heuristic: if your application is a chat assistant, train on chat-shaped data. If it is a single structured task, instruction format is usually simpler and more sample-efficient.

Fine-Tuning Approaches

Not all fine-tuning means updating every weight. The main approaches differ in cost, memory, speed, and how much task-specific capacity you gain.

Full fine-tuning: train all model parameters. Highest flexibility and often the best quality ceiling, but also the most expensive in VRAM, optimizer state, and checkpoint size.

Feature extraction: freeze the base model and train only a small head on top. Common for classification when the pretrained representation is already strong and you only need a task-specific output layer.

PEFT (parameter-efficient fine-tuning): keep most weights frozen and train a small number of new parameters such as LoRA adapters. This is the practical default for many LLM fine-tuning workloads.
| Method | Memory | Speed | Performance | Use case |
| --- | --- | --- | --- | --- |
| Full fine-tuning | Highest | Slowest setup and heaviest checkpoints | Best ceiling when you have strong data and compute | High-value tasks, model ownership, strong hardware budget |
| Feature extraction | Lowest | Fast | Good for simpler supervised tasks, limited for generation | Classification, regression, small labeled datasets |
| PEFT | Low to medium | Fast training with small trainable state | Often close to full tuning on narrow tasks | LLMs on limited GPUs, rapid iteration, multiple adapters |
Default recommendation: if you are fine-tuning a modern LLM and do not already know you need full fine-tuning, start with LoRA or QLoRA. The cost-to-quality ratio is usually much better.

LoRA, QLoRA, and Other Efficient Methods

Parameter-efficient fine-tuning lets you adapt large models without paying the full memory cost of updating every weight. This is why consumer-GPU fine-tuning became practical.

LoRA (Low-Rank Adaptation): the most common PEFT method

Concept: instead of rewriting a full weight matrix W, LoRA learns a low-rank update ΔW = BA and applies W' = W + ΔW. Only the small adapter matrices A and B are trained.

Why it is efficient: trainable parameter count drops dramatically, optimizer state becomes much smaller, and you can store many task adapters on top of one base model.

When to use it: instruction tuning, domain adaptation, lightweight experiments, or multiple task-specific adapters.

from peft import LoraConfig, get_peft_model

config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.1,
    task_type="CAUSAL_LM"
)
model = get_peft_model(model, config)

Short intuition: r controls adapter capacity, lora_alpha scales the update, and target_modules decides where adapters are inserted.
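The efficiency claim is easy to verify: for a d × k weight matrix, LoRA trains B (d × r) and A (r × k) instead of all d·k entries. A quick calculation with hypothetical dimensions typical of a 7B-scale attention projection:

```python
def lora_trainable_params(d, k, r):
    # Full update trains d*k parameters; LoRA trains r*(d + k).
    return d * k, r * (d + k)

full, lora = lora_trainable_params(d=4096, k=4096, r=8)
print(full)   # 16777216 (~16.8M per matrix)
print(lora)   # 65536 (~0.4% of the full count)
```
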

QLoRA (Quantization + LoRA)

Concept: load the base model in low precision, typically 4-bit, and train LoRA adapters on top. The frozen base stays quantized while adapter weights remain trainable in higher precision.

Why it is memory efficient: the frozen base weights occupy far less VRAM, which makes 7B to 13B models accessible on much smaller GPUs.

When to prefer it over LoRA: when VRAM is your main constraint, especially for larger models or laptops and single-GPU systems.

| Method | Approx. VRAM for 7B training | Quality ceiling | Best use |
| --- | --- | --- | --- |
| Full fine-tuning | 60 GB or more, depending on precision and optimizer | Highest | High-budget, high-stakes adaptation |
| LoRA | Roughly 16-24 GB | Very strong for many narrow tasks | Single-GPU adaptation with minimal performance loss |
| QLoRA | Roughly 10-16 GB | Slightly lower than LoRA in some setups, often close enough | Tight VRAM budgets and fast experimentation |
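A back-of-envelope way to see why quantization helps: the frozen base weights alone scale with parameter count times bits per weight. This sketch deliberately ignores activations, adapter gradients, optimizer state, and framework overhead, so real usage is higher than these numbers:

```python
def base_weights_gib(params_billion, bits):
    # Memory occupied by the frozen base weights only.
    return params_billion * 1e9 * bits / 8 / (1024 ** 3)

print(round(base_weights_gib(7, 16), 1))  # ~13.0 GiB in fp16/bf16
print(round(base_weights_gib(7, 4), 1))   # ~3.3 GiB with 4-bit loading
```
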

Other Techniques

Adapters: small trainable layers inserted between existing model layers. More parameters than prompt tuning, often easier to optimize than pure soft prompts.

Prefix tuning: train virtual prefix vectors that steer attention without changing most model weights. Useful when you want a very small trainable footprint.

Prompt tuning: train soft prompt embeddings only. Extremely lightweight, but can be weaker than LoRA for complex behavioral changes.

| Technique | Main tradeoff | Good fit |
| --- | --- | --- |
| LoRA | Best balance of simplicity, quality, and memory | Default PEFT choice |
| QLoRA | Lowest memory, slightly more complexity around quantization support | Limited VRAM environments |
| Adapters | More parameters than prompt-based methods | Reusable modular adapters |
| Prefix tuning | Very compact but less expressive for some tasks | Controlling style or task framing |
| Prompt tuning | Smallest footprint, but often weakest adaptation power | Resource-constrained experiments |

Optimizers With Focus on Adam

An optimizer decides how parameter updates are applied. For transformer fine-tuning, Adam and especially AdamW are the common defaults because they adapt learning rates per parameter and handle noisy gradients well.

Adam combines momentum and RMS scaling. Momentum tracks the average direction of gradients, while RMS scaling shrinks updates for parameters that repeatedly receive large gradients.

Why it works well: instead of one global step size behaving the same everywhere, Adam adapts the update size parameter-by-parameter. That makes it much easier to fine-tune deep transformer stacks than plain SGD.
import torch.optim as optim

optimizer = optim.Adam(model.parameters(), lr=2e-5)

Simple Explanation of Adam

m_t = β1·m_{t−1} + (1 − β1)·g_t        (momentum: running average of gradients)
v_t = β2·v_{t−1} + (1 − β2)·g_t²       (RMS term: running average of squared gradients)
update ≈ lr · m̂_t / (sqrt(v̂_t) + ε), where m̂_t = m_t / (1 − β1^t) and v̂_t = v_t / (1 − β2^t) are bias-corrected estimates; the correction matters most in early steps, when the running averages are still near their zero initialization.
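The update rule can be checked with a tiny pure-Python sketch for a single scalar parameter (the gradient value is hypothetical):

```python
import math

def adam_step(w, g, m, v, t, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    # One Adam update for a single scalar parameter, with bias correction.
    m = beta1 * m + (1 - beta1) * g          # momentum (first moment)
    v = beta2 * v + (1 - beta2) * g * g      # RMS term (second moment)
    m_hat = m / (1 - beta1 ** t)             # bias-corrected estimates
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

# On the first step the bias-corrected update is roughly lr * sign(gradient),
# regardless of the gradient's magnitude: adaptivity in action.
w, m, v = adam_step(w=1.0, g=50.0, m=0.0, v=0.0, t=1)
print(round(1.0 - w, 4))  # 0.01
```
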
| Choice | Use it when | Why |
| --- | --- | --- |
| Adam | You want a simple adaptive optimizer baseline | Works well on many deep learning tasks, especially transformers |
| SGD | You have a very large dataset, vision-style training, or well-tuned classical pipelines | Can generalize well, but usually needs more careful tuning for LLM fine-tuning |
| AdamW | You are fine-tuning transformers | Decouples weight decay from gradient updates, which usually makes regularization cleaner and more stable |
Adam vs AdamW: AdamW is usually the better default because weight decay is handled separately from the adaptive gradient update. In practice, most Hugging Face trainer runs use AdamW unless you override it.

Ranking and Training Objectives

Your training objective defines what “better” means during optimization. If you choose the wrong objective, the model can improve on paper while failing the real task.

Cross-entropy loss: the default objective for classification and next-token prediction. It rewards assigning high probability to the correct label or token.

Ranking loss: optimizes order, not just absolute prediction. Useful when the model must put better results above worse results.

RLHF: uses preference feedback to move the model toward outputs humans prefer. Often used after supervised fine-tuning for assistant behavior.

Cross-Entropy Loss

Use this for supervised fine-tuning, text classification, sequence labeling, and next-token prediction. It is usually the first objective to try because it aligns directly with “predict the correct answer.”

Ranking Loss

Use ranking objectives when the order of results matters more than a single label. Common examples include search, recommendation, reranking, and preference learning.

| Type | What it compares | Typical use |
| --- | --- | --- |
| Pairwise | One positive item vs one negative item | Search ranking, reward modeling, binary preference comparison |
| Listwise | A full ranked list | Recommendation systems, search pages, complex ranking pipelines |
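A pairwise objective can be written in one line: penalize the model whenever the positive item fails to clearly outscore the negative one. A minimal sketch of the logistic pairwise loss (the same form used in RankNet-style ranking and reward modeling):

```python
import math

def pairwise_loss(score_pos, score_neg):
    # -log sigmoid(score_pos - score_neg): small when pos clearly beats neg.
    return math.log(1 + math.exp(-(score_pos - score_neg)))

print(round(pairwise_loss(2.0, 0.0), 3))  # 0.127, good margin
print(round(pairwise_loss(0.0, 0.0), 3))  # 0.693, model cannot tell them apart
print(round(pairwise_loss(0.0, 2.0), 3))  # 2.127, wrong order is penalized
```
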

RLHF - Reinforcement Learning from Human Feedback

RLHF usually means: supervised fine-tune first, gather human preferences, train a reward model, then optimize the main model so it produces higher-reward responses. In practice, many teams now use simpler preference-optimization methods like DPO, but the high-level idea is the same: learn from what humans prefer, not only from fixed labels.

When ranking matters: search, recommendations, ads, product results, candidate reranking, and helpfulness or safety preference learning. If the business outcome depends on order, classification loss alone is often not enough.

Pretraining vs Fine-Tuning

Fine-tuning is usually enough, but not always. The harder question is whether your problem needs continued pretraining or even training from scratch before fine-tuning.

When pretraining is required (use rarely)

Typical cases: very domain-specific data such as medical records, legal corpora, scientific literature, protein sequences, specialized codebases, or any setting where the base model repeatedly fails because the underlying vocabulary and distribution are too different.

When pretraining is not required (common case)

Typical cases: general-purpose chat, classification, summarization, extraction, translation-like formatting, and many enterprise assistants where the base model already has the underlying capabilities.

Decision Checklist

Does the base model repeatedly fail because the domain vocabulary and data distribution are too different? Consider continued pretraining. Does the model have the capability but the wrong behavior or format? Fine-tune. Do the required facts change often? Prefer RAG or a search layer over training.

Applied shortcut: use continued pretraining to teach the model the language of a domain, then use fine-tuning to teach the task behavior inside that domain.

Evaluation of Models

Evaluation is critical because training loss alone can lie. A model can memorize training data, look great on training metrics, and still fail on real users, new domains, or edge cases.

Avoid false confidence: without evaluation, you cannot tell whether the model learned the task, memorized examples, or simply improved on easy cases while getting worse on hard ones.

Check offline and online quality: measure task metrics offline, inspect failures manually, and if the model ships to users, validate business outcomes in production.

Common Evaluation Methods

| Metric | Best for | What to watch |
| --- | --- | --- |
| Accuracy / F1 | Classification, extraction, labeling | F1 is better than raw accuracy when classes are imbalanced |
| Perplexity | Language modeling, continued pretraining, causal LM tuning | Lower is better, but only meaningful when comparing on the same data distribution |
| BLEU / ROUGE | Translation, summarization | Helpful for automation, but can miss semantic quality and factuality |
| Human evaluation | Chat quality, safety, tone, helpfulness | Slow but often necessary for assistant behavior |
| Benchmark datasets | Standardized comparison | Useful for baselines, but make sure they match your real task |
import evaluate

metric = evaluate.load("accuracy")
result = metric.compute(predictions=preds, references=labels)

Older code used datasets.load_metric, which has been removed from recent datasets releases; the standalone evaluate package replaces it, but the core idea is unchanged: compute task metrics from predictions and reference labels.
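Perplexity, listed in the metric table above, is just the exponential of the average per-token loss, so it can be computed from any evaluation loop's output:

```python
import math

def perplexity(token_nlls):
    # token_nlls: negative log-likelihood of each token under the model.
    return math.exp(sum(token_nlls) / len(token_nlls))

# If every token has NLL log(2), the model is as uncertain as a fair coin.
print(round(perplexity([math.log(2)] * 4), 6))  # 2.0
```
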

Offline vs Online Evaluation

| Mode | What it answers | Example |
| --- | --- | --- |
| Offline | Did the model improve on held-out data? | F1 on a validation split, perplexity on dev text, human review set |
| Online | Did the model improve real product outcomes? | CTR, resolution rate, conversion, user preference, latency, cost |

Overfitting Detection
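The classic signal is validation loss rising while training loss keeps falling. One simple detector is patience-based early stopping: track validation losses and stop after a window with no new best. A minimal sketch (transformers users typically reach for EarlyStoppingCallback instead of hand-rolling this):

```python
def should_stop(val_losses, patience=3):
    # Stop when the last `patience` evaluations found no new best loss.
    if len(val_losses) <= patience:
        return False
    best_before = min(val_losses[:-patience])
    return min(val_losses[-patience:]) >= best_before

print(should_stop([1.0, 0.8, 0.7, 0.72, 0.74, 0.75]))  # True: 0.7 was the best
print(should_stop([1.0, 0.9, 0.8, 0.7]))               # False: still improving
```
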

Best Practices and Common Pitfalls

Most practical wins come from disciplined data and monitoring, not from exotic algorithms. These are the habits that consistently improve fine-tuning quality.

Best practices
Data quality > data quantity. Clean labels, consistent formatting, and realistic examples matter more than raw size.

Monitor validation loss. It is your early warning system for overfitting and bad hyperparameters.

Use early stopping. Stop when validation no longer improves.

Tune one thing at a time. Change learning rate or epochs first before chasing harder explanations.
Common pitfalls
Fine-tuning on noisy or duplicated data, skipping a validation set, using too high a learning rate, training for too many epochs, and shipping a model without manual error analysis are the fastest ways to waste compute.

Practical Tuning Tips

Important: if your dataset contains sensitive or regulated information, do not assume fine-tuning is reversible. Model weights can absorb patterns from training data, so governance and privacy review should happen before training begins.

Hardware Considerations

Compute planning matters because the same method can feel easy or impossible depending on model size, precision, and optimization choices.

| Setup | Typical fit | Notes |
| --- | --- | --- |
| Single 12-16 GB GPU | Small models, LoRA, QLoRA, short sequences | Use 4-bit loading, gradient accumulation, and possibly gradient checkpointing |
| Single 24 GB GPU | 7B-scale LoRA or QLoRA, moderate sequence lengths | Good sweet spot for practical experimentation |
| 48 GB+ GPU | Larger LoRA runs, larger batch sizes, some full-tuning experiments | Much more room for sequence length and optimizer state |
| Multi-GPU | Large models, distributed training, full fine-tuning | Needed when model states and activations no longer fit on one device |
Mixed precision: use fp16 or bf16 when supported. This usually reduces memory and speeds up training.

Gradient checkpointing: recomputes activations during the backward pass to save VRAM. Training becomes slower, but much more memory efficient.

Sequence length: longer context quickly increases memory use. If you run out of VRAM, shorten sequence length before anything else.
Memory vs performance tradeoff: QLoRA, gradient checkpointing, and smaller batches make training possible on limited hardware, but they usually slow throughput. Decide whether your bottleneck is VRAM, time, or final model quality.

Real-World Use Cases

Fine-tuning is most useful when you need repeatable behavior that generic prompting cannot guarantee.

Customer support assistants: train on high-quality support conversations to improve tone, resolution steps, and response structure while keeping retrieval for changing policies.

Domain rerankers: use pairwise or listwise ranking losses so relevant results appear higher for legal, medical, or enterprise search.

Classification and extraction: train smaller models for sentiment, routing, document labeling, or entity extraction with lower latency and lower cost than giant general-purpose models.

Specialized copilots: adapt a base model to finance, law, healthcare, or internal company knowledge where vocabulary and expected output style differ from public internet text.

End-to-End Example

The example below shows a compact Hugging Face Trainer workflow for instruction-style fine-tuning on a small causal language model. It covers data loading, prompt formatting, tokenization, training, and evaluation.

Why this example is practical: it is small enough to understand in one sitting, but it uses the same pipeline shape you will use for larger instruction-tuning projects.
import math
from datasets import Dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

samples = [
    {
        "instruction": "Summarize the sentence",
        "input": "Fine-tuning adapts a pretrained model to a target task.",
        "output": "Fine-tuning specializes a pretrained model."
    },
    {
        "instruction": "Answer the question",
        "input": "What does GPU stand for?",
        "output": "GPU stands for Graphics Processing Unit."
    },
    {
        "instruction": "Rewrite in a formal tone",
        "input": "The launch failed because the config was wrong.",
        "output": "The launch failed due to an incorrect configuration."
    },
    {
        "instruction": "Extract the sentiment",
        "input": "The onboarding experience was smooth and very helpful.",
        "output": "positive"
    }
]

dataset = Dataset.from_list(samples).train_test_split(test_size=0.25, seed=42)
model_name = "distilgpt2"

tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(model_name)
model.config.pad_token_id = tokenizer.pad_token_id

def format_example(example):
    instruction = example["instruction"].strip()
    input_text = example["input"].strip()
    output_text = example["output"].strip()
    return (
        f"### Instruction:\n{instruction}\n\n"
        f"### Input:\n{input_text}\n\n"
        f"### Response:\n{output_text}"
    )

def add_text(example):
    example["text"] = format_example(example)
    return example

dataset = dataset.map(add_text)

def tokenize(example):
    tokens = tokenizer(
        example["text"],
        truncation=True,
        max_length=256,
        padding="max_length"
    )
    tokens["labels"] = tokens["input_ids"].copy()
    return tokens

tokenized = dataset.map(tokenize)
tokenized = tokenized.remove_columns(["instruction", "input", "output", "text"])

training_args = TrainingArguments(
    output_dir="./ft-output",
    learning_rate=2e-5,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    num_train_epochs=3,
    weight_decay=0.01,
    warmup_steps=20,
    logging_steps=10,
    eval_strategy="epoch",  # named evaluation_strategy in transformers < 4.41
    save_strategy="epoch",
    report_to="none"
)

data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    data_collator=data_collator
)

trainer.train()

metrics = trainer.evaluate()
print(metrics)

if "eval_loss" in metrics:
    print("perplexity:", math.exp(metrics["eval_loss"]))

What each step is doing

Dataset.from_list plus train_test_split builds a tiny train/validation split. format_example renders each record into a single instruction-response prompt, tokenize converts the text to fixed-length token IDs and copies them as labels for next-token prediction, TrainingArguments and Trainer run the optimization loop, and trainer.evaluate() reports validation loss, which the final lines convert to perplexity.

How to convert this into a PEFT run

from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=["c_attn"],  # fused attention projection in GPT-2-style models
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, lora_config)

For larger instruction-tuning projects, many teams switch from raw Trainer to specialized trainers such as SFTTrainer, but the core workflow remains the same.
