
Hugging Face Fine-Tuning & LLM Handbook

A production-ready guide to modern Hugging Face fine-tuning with transformers, datasets, peft, trl, and bitsandbytes, with a strong focus on QLoRA, SFTTrainer, and deployment-ready adapter workflows.

QLoRA + SFTTrainer · VRAM-Aware Training · PEFT + Deployment (April 2026)
Practical position: for modern LLM fine-tuning, full-parameter training is rarely the right default. QLoRA and PEFT are preferred because they cut VRAM requirements dramatically, train faster, and are usually sufficient for instruction tuning and domain adaptation on 7B to 13B class models.

Module 1: The Foundation (transformers & datasets)

The Hugging Face stack separates concerns cleanly: transformers handles model and tokenizer abstractions, while datasets handles data loading and efficient preprocessing. This matters because LLM training pipelines fail fastest when model formatting and dataset formatting drift apart.

Loading Models & Tokenizers

A causal language model and its tokenizer must stay aligned. For instruct models, setting a valid pad_token is important because the trainer and data collators often need a padding token during batching. If the tokenizer has no pad token, the common safe baseline is to reuse the EOS token.

from __future__ import annotations

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


model_id = "mistralai/Mistral-7B-Instruct-v0.3"

tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)

# Many decoder-only models do not define a pad token by default.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
)

# Keep the model config aligned with tokenizer padding behavior.
model.config.pad_token_id = tokenizer.pad_token_id

print(tokenizer.pad_token, tokenizer.pad_token_id)
Hardware reality: a 7B model in float16 needs roughly 14 GB of VRAM just to hold weights for inference. Training requires much more because activations, gradients, optimizer states, and temporary buffers also consume memory.

Chat Templates

Chat templates define how structured conversational messages become the token sequence the model actually sees. This step is critical for instruct tuning because the model was trained on a specific dialogue serialization format. If you feed the right content with the wrong template, the model can behave like it has forgotten how to chat.

from __future__ import annotations

from transformers import AutoTokenizer


model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [
    {"role": "system", "content": "You are a concise support assistant."},
    {"role": "user", "content": "Summarize the password reset policy."},
]

formatted_text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

tokenized = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
)

print(formatted_text)
print(tokenized.shape)

A good analogy is an API contract. The message roles are the logical schema, but the chat template is the wire format. If the wire format is wrong, the receiving system misinterprets the payload.

Efficient Data Handling

The datasets library is designed for batched, vectorized preprocessing. Using .map(..., batched=True) is important because tokenization is much faster and cleaner when you process batches instead of looping row-by-row in Python.

from __future__ import annotations

from datasets import load_dataset
from transformers import AutoTokenizer


model_id = "mistralai/Mistral-7B-Instruct-v0.3"
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)

if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

dataset = load_dataset("json", data_files="train.jsonl", split="train")


def format_and_tokenize(batch: dict[str, list[str]]) -> dict[str, list[list[int]]]:
    texts: list[str] = []

    for prompt, response in zip(batch["prompt"], batch["response"]):
        messages = [
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": response},
        ]
        text = tokenizer.apply_chat_template(messages, tokenize=False)
        texts.append(text)

    return tokenizer(
        texts,
        truncation=True,
        max_length=2048,
    )


tokenized_dataset = dataset.map(
    format_and_tokenize,
    batched=True,
    remove_columns=dataset.column_names,
)

print(tokenized_dataset[0].keys())

Module 2: Memory & Hardware Optimization (bitsandbytes)

The reason memory optimization matters is simple: base model size scales faster than most teams' GPU budgets. QLoRA and 4-bit loading exist because commodity 24 GB GPUs cannot comfortably full-fine-tune modern 7B+ models.

The VRAM Problem

A rough rule of thumb is that a 7B parameter model in float16 needs about 14 GB of VRAM for inference because each parameter occupies 2 bytes. Training costs more because gradients, optimizer states, and activations also live in memory. That is why naive full fine-tuning often exceeds 40 GB or more even before sequence length and batch size are increased.

7B float16 inference: about 14 GB for weights alone, before KV cache and runtime overhead.
QLoRA on 24 GB GPUs: a common practical fit for 7B instruct tuning with gradient accumulation and checkpointing.
13B and beyond: usually needs stronger memory discipline, smaller batch sizes, and careful sequence-length management even with QLoRA.
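The arithmetic behind these numbers is easy to sanity-check. The sketch below is a back-of-the-envelope estimate (the helper name and the per-parameter byte counts are illustrative, not from any library); note that 14 billion bytes is about 14 GB in decimal units but roughly 13 GiB, which is the figure GPU memory tools report:

```python
def estimate_weight_vram_gb(num_params: float, bytes_per_param: float) -> float:
    """Rough GiB needed just to hold the weights, ignoring KV cache and overhead."""
    return num_params * bytes_per_param / 1024**3


# 7B parameters in float16 (2 bytes each) vs. 4-bit (~0.5 bytes each).
fp16_gb = estimate_weight_vram_gb(7e9, 2.0)
nf4_gb = estimate_weight_vram_gb(7e9, 0.5)

print(f"7B fp16 weights: ~{fp16_gb:.1f} GiB")  # roughly 13 GiB
print(f"7B 4-bit weights: ~{nf4_gb:.1f} GiB")
```

The same back-of-the-envelope math explains why 4-bit loading is what makes a 7B model fit comfortably on a 24 GB card with room left for activations and optimizer state.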

Quantization (QLoRA)

QLoRA loads the base model in 4-bit precision while training low-rank adapters in higher precision. This is why it is preferred: you avoid updating billions of full-precision parameters and instead attach a tiny trainable layer set on top of a frozen quantized backbone.

from __future__ import annotations

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig


model_id = "mistralai/Mistral-7B-Instruct-v0.3"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    torch_dtype=torch.float16,
    device_map="auto",
)

model.config.pad_token_id = tokenizer.pad_token_id

Module 3: Parameter-Efficient Fine-Tuning (peft)

PEFT exists because full-parameter tuning is too expensive for many real workflows. LoRA is the standard approach because it freezes the original weights and learns only a small set of trainable low-rank matrices inserted into key linear layers.

How LoRA Works

LoRA approximates weight updates using a low-rank decomposition. Instead of rewriting a full large matrix, it learns a compact delta. A practical analogy is editing a giant spreadsheet by storing only the changed formulas, not duplicating the whole file.
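To make the decomposition concrete, here is a toy sketch in plain NumPy (illustrative shapes only, not the peft internals) of how a frozen weight W is combined with trainable low-rank factors B and A:

```python
import numpy as np

rng = np.random.default_rng(0)

d_out, d_in, r = 64, 64, 4  # tiny stand-ins for a real projection layer

W = rng.standard_normal((d_out, d_in))     # frozen base weight
A = rng.standard_normal((r, d_in)) * 0.01  # trainable, rank r
B = np.zeros((d_out, r))                   # trainable, initialized to zero

# Effective weight during the forward pass: W stays frozen, only A and B train.
alpha = 8
W_eff = W + (alpha / r) * (B @ A)

full_params = W.size
lora_params = A.size + B.size
print(full_params, lora_params)  # 4096 vs 512
```

Because B starts at zero, the low-rank delta is zero at initialization, so training begins exactly from the base model's behavior and only gradually layers the adaptation on top.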

Why LoRA Is Preferred: Cost, Speed, Portability

LoRA adapters are small, easy to version, and fast to train. For most domain adaptation and instruct-tuning tasks, they offer a much better cost-to-quality ratio than full fine-tuning.

Setting up PEFT

The important LoRA parameters are r for rank, lora_alpha for scaling, and target_modules for which projections to adapt. For decoder-only transformers, q_proj and v_proj are common high-value targets.

from __future__ import annotations

from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training


model = prepare_model_for_kbit_training(model)

peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "v_proj"],
)

model = get_peft_model(model, peft_config)
model.print_trainable_parameters()

Why these values? A rank like 16 is a common starting point because it gives enough adaptation capacity without making the adapter unnecessarily large. A learning rate around 2e-4 is often used for LoRA because only the small adapter layers are being optimized, not the full model.
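As a rough sanity check on adapter size, the arithmetic below estimates the trainable parameter count for the config above. The layer shapes are assumptions about a Mistral-7B-class model (hidden size 4096, 32 layers, smaller v_proj output due to grouped-query attention) and are illustrative rather than authoritative:

```python
# Assumed Mistral-7B-ish shapes; treat these as illustrative, not exact.
r = 16
layers = 32
q_proj = (4096, 4096)  # (d_in, d_out)
v_proj = (4096, 1024)


def lora_param_count(shape: tuple[int, int], rank: int) -> int:
    """A is (rank, d_in) and B is (d_out, rank), so rank * (d_in + d_out)."""
    d_in, d_out = shape
    return rank * (d_in + d_out)


per_layer = lora_param_count(q_proj, r) + lora_param_count(v_proj, r)
total = per_layer * layers
print(f"{total:,} trainable parameters")  # a few million, versus ~7B frozen
```

A few million trainable parameters against roughly seven billion frozen ones is why `print_trainable_parameters()` typically reports well under 1% trainable.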

Module 4: The Training Loop (trl & SFTTrainer)

The modern Hugging Face stack for supervised LLM fine-tuning usually centers on SFTTrainer. The base Trainer is generic. SFTTrainer is purpose-built for supervised fine-tuning of causal LMs and handles common formatting expectations more directly.

Why not vanilla Trainer?

SFTTrainer is preferred because it fits the actual task shape of LLM supervised fine-tuning. It integrates more naturally with conversational data formatting, PEFT workflows, and common packing/tokenization patterns used in instruct tuning.

The Training Script

This is the standard QLoRA + SFTTrainer pattern most teams should start from for 7B-class models on constrained hardware.

from __future__ import annotations

import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TrainingArguments
from trl import SFTTrainer


model_id = "mistralai/Mistral-7B-Instruct-v0.3"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    torch_dtype=torch.float16,
    device_map="auto",
)
model.config.pad_token_id = tokenizer.pad_token_id
model.gradient_checkpointing_enable()  # Saves memory by recomputing activations during backward pass.

dataset = load_dataset("json", data_files="train.jsonl", split="train")


def format_example(example: dict[str, str]) -> dict[str, str]:
    messages = [
        {"role": "user", "content": example["prompt"]},
        {"role": "assistant", "content": example["response"]},
    ]
    return {"text": tokenizer.apply_chat_template(messages, tokenize=False)}


dataset = dataset.map(format_example)

peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "v_proj"],
)

training_args = TrainingArguments(
    output_dir="./outputs",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    learning_rate=2e-4,  # Common LoRA starting point because only adapter weights are trained.
    num_train_epochs=3,
    logging_steps=10,
    save_strategy="epoch",
    bf16=False,
    fp16=True,
    optim="paged_adamw_8bit",
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    report_to="none",
)

# Note: in recent trl releases, dataset_text_field, max_seq_length, and packing
# moved to SFTConfig, and the tokenizer argument became processing_class; the
# keyword signature below targets older trl versions.
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    tokenizer=tokenizer,
    max_seq_length=2048,
    packing=False,
)

trainer.train()

Module 5: Advanced Alignment (DPO)

Supervised fine-tuning teaches the model how to respond in your desired format or domain. Preference tuning teaches the model which responses are better. That distinction matters because a model can be fluent and still be misaligned with quality preferences.

Beyond SFT

SFT is usually the first stage. DPO and similar alignment methods come later when the goal is to push the model toward preferred behavior using paired chosen and rejected completions rather than only imitating demonstrations.

Direct Preference Optimization (DPO)

DPO works on datasets containing prompts plus two candidate responses: one preferred and one rejected. Instead of training the model only to imitate a gold answer, it trains the model to increase preference for better outputs relative to worse ones.
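The objective can be written down compactly. This is a toy, per-example sketch of the DPO loss computed from scalar sequence log-probabilities (not the batched trl implementation): the loss shrinks as the policy's preference margin for the chosen response, relative to the reference model, grows.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """Per-example DPO loss: -log(sigmoid(beta * preference margin))."""
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_ratio - rejected_ratio)
    return -math.log(1.0 / (1.0 + math.exp(-logits)))


# Policy already prefers the chosen response more than the reference does,
# so the loss is below the no-preference baseline of -log(0.5) ~= 0.693.
print(dpo_loss(-10.0, -30.0, -12.0, -25.0))
```

The beta parameter scales how sharply the model is pushed toward the preferred response; 0.1 is a common starting point.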

from __future__ import annotations

from datasets import load_dataset
from transformers import TrainingArguments
from trl import DPOTrainer


preference_dataset = load_dataset("json", data_files="preferences.jsonl", split="train")

# Expected columns typically include: prompt, chosen, rejected.
# model and tokenizer are assumed to be the ones prepared in the SFT stage above.
training_args = TrainingArguments(
    output_dir="./dpo-outputs",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    learning_rate=5e-5,
    num_train_epochs=1,
    logging_steps=10,
    fp16=True,
    report_to="none",
)

# Note: newer trl releases replace TrainingArguments with DPOConfig (which
# carries beta); the keyword signature below targets older trl versions.
dpo_trainer = DPOTrainer(
    model=model,
    ref_model=None,
    args=training_args,
    train_dataset=preference_dataset,
    tokenizer=tokenizer,
    beta=0.1,
)

# dpo_trainer.train()

Module 6: Inference, Merging, & Deployment

Once training finishes, the next decisions are operational: do you serve the base model plus adapter separately, or merge the adapter weights into the base model for a simpler production artifact?

Inference with Adapters

Serving adapters directly is useful when you want one base model with many specialized LoRA variants. This is convenient for experimentation and multi-tenant setups.

from __future__ import annotations

import torch
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer


adapter_path = "./outputs/checkpoint-final"

tokenizer = AutoTokenizer.from_pretrained(adapter_path)
model = AutoPeftModelForCausalLM.from_pretrained(
    adapter_path,
    torch_dtype=torch.float16,
    device_map="auto",
)

messages = [{"role": "user", "content": "Explain LoRA in two short paragraphs."}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=200)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Merging Weights

Deploying adapters separately can be slower and operationally more complex because the base model and adapter must both be loaded and composed. Merging creates a single artifact that is usually simpler for production inference.

from __future__ import annotations

from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer


adapter_path = "./outputs/checkpoint-final"

tokenizer = AutoTokenizer.from_pretrained(adapter_path)
model = AutoPeftModelForCausalLM.from_pretrained(adapter_path, device_map="cpu")
merged_model = model.merge_and_unload()

merged_model.save_pretrained("./merged-model")
tokenizer.save_pretrained("./merged-model")

The Hugging Face Hub

The Hub is the natural distribution target for both adapters and merged models. Once pushed, the model becomes easier to version, document, and share with teammates or deployment pipelines.

huggingface-cli login

python -c "from transformers import AutoTokenizer, AutoModelForCausalLM; \
tokenizer = AutoTokenizer.from_pretrained('./merged-model'); \
model = AutoModelForCausalLM.from_pretrained('./merged-model'); \
tokenizer.push_to_hub('your-org/your-merged-model'); \
model.push_to_hub('your-org/your-merged-model')"

Module 7: Common Pitfalls & Anti-Patterns

Fine-tuning failures usually come from data scope, memory planning, or tokenizer/model mismatches rather than the trainer itself. These are the problems to watch first.

1. Catastrophic Forgetting

If the dataset is too narrow or repetitive, the model can over-specialize and degrade on broader capabilities such as coding, general reasoning, or instruction following. This is why LoRA and moderate training schedules are safer defaults than aggressive full fine-tuning on small bespoke corpora.

Mitigation: mix in representative task distributions, keep training durations modest, and evaluate on both domain-specific and general benchmark prompts before shipping.

2. OOM (Out of Memory) Errors

OOM errors are often solved by reducing instantaneous memory pressure rather than reducing total training signal. Gradient accumulation simulates a larger effective batch size. Gradient checkpointing trades extra compute for lower activation memory.

from transformers import TrainingArguments


training_args = TrainingArguments(
    output_dir="./outputs",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=32,
    learning_rate=2e-4,
    fp16=True,
    optim="paged_adamw_8bit",
)

model.gradient_checkpointing_enable()  # model: the (quantized) base model loaded earlier.

On small GPUs like 16 GB or 24 GB cards, these two settings are often the first levers to pull before reducing sequence length too aggressively.
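The equivalence behind gradient accumulation can be shown with a toy scalar model (plain Python, not Trainer internals): averaging per-micro-batch gradients over the accumulation window matches the gradient of one large batch of the same examples.

```python
def grad_mse(w: float, batch: list[tuple[float, float]]) -> float:
    """Gradient of mean squared error, d/dw mean((w*x - y)^2), over a batch."""
    return sum(2 * x * (w * x - y) for x, y in batch) / len(batch)


data = [(1.0, 2.0), (2.0, 3.0), (3.0, 5.0), (4.0, 9.0)]
w = 0.5

# One big batch of 4 examples ...
full_grad = grad_mse(w, data)

# ... versus 4 micro-batches of 1 with accumulation: average the micro-grads.
accum_steps = 4
accumulated = sum(grad_mse(w, [pair]) for pair in data) / accum_steps

print(full_grad, accumulated)  # identical up to float rounding
```

Only the peak activation memory changes: each micro-batch's activations are freed before the next one runs, which is exactly why accumulation relieves OOM pressure without shrinking the effective batch size.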

3. Mismatched Tokenizers

If you add special tokens to the tokenizer but forget to resize the model embeddings, the model and tokenizer fall out of sync. That can cause unstable or garbage outputs because token ids point to embeddings the model never allocated.

from __future__ import annotations

from transformers import AutoModelForCausalLM, AutoTokenizer


model_id = "mistralai/Mistral-7B-Instruct-v0.3"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="cpu")

num_added_tokens = tokenizer.add_special_tokens(
    {"additional_special_tokens": ["<|domain_policy|>"]}
)

if num_added_tokens > 0:
    # Resize the model embeddings so new tokenizer ids have learned slots.
    model.resize_token_embeddings(len(tokenizer))

Checklist: tokenizer changes, pad token behavior, chat templates, and embedding resize must all stay aligned before you start training.
