Hugging Face Fine-Tuning & LLM Handbook
A production-ready guide to modern Hugging Face fine-tuning with transformers, datasets, peft, trl, and bitsandbytes, focusing on QLoRA, SFTTrainer, and deployment-ready adapter workflows.
Module 1: The Foundation (transformers & datasets)
The Hugging Face stack separates concerns cleanly: transformers handles model and tokenizer abstractions, while datasets handles data loading and efficient preprocessing. This matters because LLM training pipelines fail fastest when model formatting and dataset formatting drift apart.
Loading Models & Tokenizers
A causal language model and its tokenizer must stay aligned. For instruct models, setting a valid pad_token is important because the trainer and data collators often need a padding token during batching. If the tokenizer has no pad token, the common safe baseline is to reuse the EOS token.
from __future__ import annotations
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
model_id = "mistralai/Mistral-7B-Instruct-v0.3"
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
# Many decoder-only models do not define a pad token by default.
if tokenizer.pad_token is None:
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(
model_id,
torch_dtype=torch.float16,
device_map="auto",
)
# Keep the model config aligned with tokenizer padding behavior.
model.config.pad_token_id = tokenizer.pad_token_id
print(tokenizer.pad_token, tokenizer.pad_token_id)
Chat Templates
Chat templates define how structured conversational messages become the token sequence the model actually sees. This step is critical for instruct tuning because the model was trained on a specific dialogue serialization format. If you feed the right content with the wrong template, the model can behave like it has forgotten how to chat.
from __future__ import annotations
from transformers import AutoTokenizer
model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
messages = [
{"role": "system", "content": "You are a concise support assistant."},
{"role": "user", "content": "Summarize the password reset policy."},
]
formatted_text = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True,
)
tokenized = tokenizer.apply_chat_template(
messages,
tokenize=True,
add_generation_prompt=True,
return_tensors="pt",
)
print(formatted_text)
print(tokenized.shape)
A good analogy is an API contract. The message roles are the logical schema, but the chat template is the wire format. If the wire format is wrong, the receiving system misinterprets the payload.
Efficient Data Handling
The datasets library is designed for batched, vectorized preprocessing. Using .map(..., batched=True) is important because tokenization is much faster and cleaner when you process batches instead of looping row-by-row in Python.
from __future__ import annotations
from datasets import load_dataset
from transformers import AutoTokenizer
model_id = "mistralai/Mistral-7B-Instruct-v0.3"
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
if tokenizer.pad_token is None:
tokenizer.pad_token = tokenizer.eos_token
dataset = load_dataset("json", data_files="train.jsonl", split="train")
def format_and_tokenize(batch: dict[str, list[str]]) -> dict[str, list[list[int]]]:
texts: list[str] = []
for prompt, response in zip(batch["prompt"], batch["response"]):
messages = [
{"role": "user", "content": prompt},
{"role": "assistant", "content": response},
]
text = tokenizer.apply_chat_template(messages, tokenize=False)
texts.append(text)
return tokenizer(
texts,
truncation=True,
max_length=2048,
)
tokenized_dataset = dataset.map(
format_and_tokenize,
batched=True,
remove_columns=dataset.column_names,
)
print(tokenized_dataset[0].keys())
Module 2: Memory & Hardware Optimization (bitsandbytes)
The reason memory optimization matters is simple: base model size scales faster than most teams' GPU budgets. QLoRA and 4-bit loading exist because commodity 24 GB GPUs cannot comfortably run full fine-tuning of modern 7B+ models.
The VRAM Problem
A rough rule of thumb is that a 7B parameter model in float16 needs about 14 GB of VRAM for inference because each parameter occupies 2 bytes. Training costs more because gradients, optimizer states, and activations also live in memory. That is why naive full fine-tuning often requires 40 GB or more even before sequence length and batch size are increased.
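The arithmetic above can be sketched as a back-of-envelope estimator. The 16 bytes/parameter figure for Adam-style mixed-precision full fine-tuning (fp16 weights and gradients plus fp32 master weights and two optimizer states) is a common rule of thumb, not an exact accounting:

```python
def vram_estimate_gb(num_params: float, bytes_per_param: float) -> float:
    """Back-of-envelope VRAM estimate for holding model state only."""
    return num_params * bytes_per_param / 1e9

# 7B parameters in float16: 2 bytes each.
inference_fp16 = vram_estimate_gb(7e9, 2)    # ~14 GB for weights alone
# Naive full fine-tuning with Adam: roughly 16 bytes per parameter
# (weights, gradients, master weights, two optimizer states),
# before activations are counted.
training_adam = vram_estimate_gb(7e9, 16)    # ~112 GB before activations
# 4-bit quantized weights: about 0.5 bytes per parameter.
inference_4bit = vram_estimate_gb(7e9, 0.5)  # ~3.5 GB for weights alone
print(f"{inference_fp16:.1f} {training_adam:.1f} {inference_4bit:.1f}")
```

Activations scale with batch size and sequence length on top of these figures, which is why the same model can fit or OOM depending on training configuration.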
Quantization (QLoRA)
QLoRA loads the base model in 4-bit precision while training low-rank adapters in higher precision. This is why it is preferred: you avoid updating billions of full-precision parameters and instead attach a tiny trainable layer set on top of a frozen quantized backbone.
from __future__ import annotations
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
model_id = "mistralai/Mistral-7B-Instruct-v0.3"
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_use_double_quant=True,
bnb_4bit_compute_dtype=torch.float16,
)
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
if tokenizer.pad_token is None:
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(
model_id,
quantization_config=bnb_config,
torch_dtype=torch.float16,
device_map="auto",
)
model.config.pad_token_id = tokenizer.pad_token_id
Module 3: Parameter-Efficient Fine-Tuning (peft)
PEFT exists because full-parameter tuning is too expensive for many real workflows. LoRA is the standard approach because it freezes the original weights and learns only a small set of trainable low-rank matrices inserted into key linear layers.
How LoRA Works
LoRA approximates weight updates using a low-rank decomposition. Instead of rewriting a full large matrix, it learns a compact delta. A practical analogy is editing a giant spreadsheet by storing only the changed formulas, not duplicating the whole file.
LoRA adapters are small, easy to version, and fast to train. For most domain adaptation and instruct-tuning tasks, they offer a much better cost-to-quality ratio than full fine-tuning.
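To make the decomposition concrete, here is a dependency-free toy sketch with made-up numbers: the weight delta is B @ A scaled by alpha / r, so with r = 1 the whole update is rank one and costs r * (d_in + d_out) numbers instead of d_out * d_in:

```python
# Toy LoRA update: a d_out x d_in weight delta expressed as B @ A, where
# B is d_out x r and A is r x d_in. With r=1 every row of the delta is a
# scalar multiple of the same row vector, i.e. the update has rank 1.
d_out, d_in, r = 3, 4, 1
alpha = 2.0
A = [[1.0, 2.0, 0.0, -1.0]]   # r x d_in (trainable)
B = [[0.5], [1.0], [-2.0]]    # d_out x r (trainable)

def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

scale = alpha / r
delta_w = [[scale * v for v in row] for row in matmul(B, A)]
# A full update would need d_out * d_in = 12 numbers; LoRA stores only
# r * (d_in + d_out) = 7 trainable numbers.
print(delta_w)
```

The forward pass then uses W + delta_w; at merge time the same delta is folded into W once, which is why merged models have no inference overhead.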
Setting up PEFT
The important LoRA parameters are r for rank, lora_alpha for scaling, and target_modules for which projections to adapt. For decoder-only transformers, q_proj and v_proj are common high-value targets.
from __future__ import annotations
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
model = prepare_model_for_kbit_training(model)
peft_config = LoraConfig(
r=16,
lora_alpha=32,
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM",
target_modules=["q_proj", "v_proj"],
)
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()
Why these values? A rank like 16 is a common starting point because it gives enough adaptation capacity without making the adapter unnecessarily large. A learning rate around 2e-4 is often used for LoRA because only the small adapter layers are being optimized, not the full model.
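As a rough illustration of why the adapter stays tiny, the count below assumes Mistral-7B-like dimensions (hidden size 4096, 32 layers, v_proj output 1024 under grouped-query attention); check the model's config.json for the real values:

```python
# Illustrative trainable-parameter count for LoRA with r=16 on q_proj
# and v_proj. The dimensions are assumptions for a Mistral-7B-like
# architecture, not read from any checkpoint.
r = 16
hidden, layers = 4096, 32
q_proj = (4096, 4096)  # (in_dim, out_dim)
v_proj = (4096, 1024)  # smaller out_dim under grouped-query attention

def lora_params(in_dim: int, out_dim: int, rank: int) -> int:
    # A is rank x in_dim, B is out_dim x rank.
    return rank * in_dim + out_dim * rank

per_layer = lora_params(*q_proj, r) + lora_params(*v_proj, r)
total = per_layer * layers
print(total, f"{total / 7e9:.4%} of 7B")  # a few million params, well under 1%
```

This is the same order of magnitude that model.print_trainable_parameters() reports, and it is why adapter checkpoints are megabytes rather than gigabytes.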
Module 4: The Training Loop (trl & SFTTrainer)
The modern Hugging Face stack for supervised LLM fine-tuning usually centers on SFTTrainer. The base Trainer is generic. SFTTrainer is purpose-built for supervised fine-tuning of causal LMs and handles common formatting expectations more directly.
Why not vanilla Trainer?
SFTTrainer is preferred because it fits the actual task shape of LLM supervised fine-tuning. It integrates more naturally with conversational data formatting, PEFT workflows, and common packing/tokenization patterns used in instruct tuning.
The Training Script
This is the standard QLoRA + SFTTrainer pattern most teams should start from for 7B-class models on constrained hardware.
from __future__ import annotations
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TrainingArguments
from trl import SFTTrainer
model_id = "mistralai/Mistral-7B-Instruct-v0.3"
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_use_double_quant=True,
bnb_4bit_compute_dtype=torch.float16,
)
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
if tokenizer.pad_token is None:
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(
model_id,
quantization_config=bnb_config,
torch_dtype=torch.float16,
device_map="auto",
)
model.config.pad_token_id = tokenizer.pad_token_id
model.gradient_checkpointing_enable()  # Saves memory by recomputing activations during the backward pass.
model.config.use_cache = False  # The KV cache is incompatible with gradient checkpointing during training.
dataset = load_dataset("json", data_files="train.jsonl", split="train")
def format_example(example: dict[str, str]) -> dict[str, str]:
messages = [
{"role": "user", "content": example["prompt"]},
{"role": "assistant", "content": example["response"]},
]
return {"text": tokenizer.apply_chat_template(messages, tokenize=False)}
dataset = dataset.map(format_example)
peft_config = LoraConfig(
r=16,
lora_alpha=32,
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM",
target_modules=["q_proj", "v_proj"],
)
training_args = TrainingArguments(
output_dir="./outputs",
per_device_train_batch_size=1,
gradient_accumulation_steps=16,
learning_rate=2e-4, # Common LoRA starting point because only adapter weights are trained.
num_train_epochs=3,
logging_steps=10,
save_strategy="epoch",
bf16=False,
fp16=True,
optim="paged_adamw_8bit",
lr_scheduler_type="cosine",
warmup_ratio=0.03,
report_to="none",
)
# Note: newer trl releases move dataset_text_field, max_seq_length, and
# packing into SFTConfig; this signature targets versions where they are
# still SFTTrainer arguments.
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    tokenizer=tokenizer,
    max_seq_length=2048,
    packing=False,
)
trainer.train()
Module 5: Advanced Alignment (DPO)
Supervised fine-tuning teaches the model how to respond in your desired format or domain. Preference tuning teaches the model which responses are better. That distinction matters because a model can be fluent and still be misaligned with quality preferences.
Beyond SFT
SFT is usually the first stage. DPO and similar alignment methods come later when the goal is to push the model toward preferred behavior using paired chosen and rejected completions rather than only imitating demonstrations.
Direct Preference Optimization (DPO)
DPO works on datasets containing prompts plus two candidate responses: one preferred and one rejected. Instead of training the model only to imitate a gold answer, it trains the model to increase preference for better outputs relative to worse ones.
from __future__ import annotations
from datasets import load_dataset
from transformers import TrainingArguments
from trl import DPOTrainer
preference_dataset = load_dataset("json", data_files="preferences.jsonl", split="train")
# Expected columns typically include: prompt, chosen, rejected
training_args = TrainingArguments(
output_dir="./dpo-outputs",
per_device_train_batch_size=1,
gradient_accumulation_steps=16,
learning_rate=5e-5,
num_train_epochs=1,
logging_steps=10,
fp16=True,
report_to="none",
)
# Assumes `model` and `tokenizer` from the SFT stage are still in scope.
# Note: newer trl releases expect a DPOConfig (which carries beta) instead
# of TrainingArguments; this signature targets versions where beta is
# still a trainer argument.
dpo_trainer = DPOTrainer(
    model=model,
    ref_model=None,  # With a PEFT model, trl uses the adapter-disabled base as the reference.
    args=training_args,
    train_dataset=preference_dataset,
    tokenizer=tokenizer,
    beta=0.1,
)
# dpo_trainer.train()
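For reference, the objective DPOTrainer optimizes, with $y_w$ the chosen completion, $y_l$ the rejected one, $\pi_{\mathrm{ref}}$ the frozen reference policy, and $\beta$ the same trade-off coefficient passed as beta above, is:

```latex
\mathcal{L}_{\mathrm{DPO}} = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\left[
  \log \sigma\!\left(
    \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
    - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
  \right)
\right]
```

A larger beta keeps the policy closer to the reference model; 0.1 is a common default starting point.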
Module 6: Inference, Merging, & Deployment
Once training finishes, the next decisions are operational: do you serve the base model plus adapter separately, or merge the adapter weights into the base model for a simpler production artifact?
Inference with Adapters
Serving adapters directly is useful when you want one base model with many specialized LoRA variants. This is convenient for experimentation and multi-tenant setups.
from __future__ import annotations
import torch
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer
adapter_path = "./outputs/checkpoint-final"
tokenizer = AutoTokenizer.from_pretrained(adapter_path)
model = AutoPeftModelForCausalLM.from_pretrained(
adapter_path,
torch_dtype=torch.float16,
device_map="auto",
)
messages = [{"role": "user", "content": "Explain LoRA in two short paragraphs."}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Merging Weights
Deploying adapters separately can be slower and operationally more complex because the base model and adapter must both be loaded and composed. Merging creates a single artifact that is usually simpler for production inference.
from __future__ import annotations
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer
adapter_path = "./outputs/checkpoint-final"
tokenizer = AutoTokenizer.from_pretrained(adapter_path)
model = AutoPeftModelForCausalLM.from_pretrained(adapter_path, device_map="cpu")
merged_model = model.merge_and_unload()
merged_model.save_pretrained("./merged-model")
tokenizer.save_pretrained("./merged-model")
The Hugging Face Hub
The Hub is the natural distribution target for both adapters and merged models. Once pushed, the model becomes easier to version, document, and share with teammates or deployment pipelines.
huggingface-cli login
python -c "from transformers import AutoTokenizer, AutoModelForCausalLM; \
tokenizer = AutoTokenizer.from_pretrained('./merged-model'); \
model = AutoModelForCausalLM.from_pretrained('./merged-model'); \
tokenizer.push_to_hub('your-org/your-merged-model'); \
model.push_to_hub('your-org/your-merged-model')"
Module 7: Common Pitfalls & Anti-Patterns
Fine-tuning failures usually come from data scope, memory planning, or tokenizer/model mismatches rather than the trainer itself. These are the problems to watch first.
1. Catastrophic Forgetting
If the dataset is too narrow or repetitive, the model can over-specialize and degrade on broader capabilities such as coding, general reasoning, or instruction following. This is why LoRA and moderate training schedules are safer defaults than aggressive full fine-tuning on small bespoke corpora.
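One common mitigation is to blend the narrow domain corpus with general instruction data so the model keeps seeing broad examples during training. The sketch below shows the mixing idea with plain lists and a hypothetical mix helper; in practice, datasets.interleave_datasets or a similar utility covers this:

```python
# Deterministic round-robin mixing: after every `domain_per_general`
# domain examples, splice in one general-purpose example.
def mix(domain: list[str], general: list[str], domain_per_general: int) -> list[str]:
    mixed: list[str] = []
    general_iter = iter(general)
    for i, example in enumerate(domain):
        mixed.append(example)
        if (i + 1) % domain_per_general == 0:
            nxt = next(general_iter, None)
            if nxt is not None:
                mixed.append(nxt)
    return mixed

batch = mix(["d1", "d2", "d3", "d4"], ["g1", "g2"], domain_per_general=2)
print(batch)  # ['d1', 'd2', 'g1', 'd3', 'd4', 'g2']
```

The right ratio is task-dependent; the point is that a small, regular dose of general data counteracts over-specialization.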
2. OOM (Out of Memory) Errors
OOM errors are often solved by reducing instantaneous memory pressure rather than reducing total training signal. Gradient accumulation simulates a larger effective batch size. Gradient checkpointing trades extra compute for lower activation memory.
from transformers import TrainingArguments
training_args = TrainingArguments(
output_dir="./outputs",
per_device_train_batch_size=1,
gradient_accumulation_steps=32,
learning_rate=2e-4,
fp16=True,
optim="paged_adamw_8bit",
)
model.gradient_checkpointing_enable()
On small GPUs like 16 GB or 24 GB cards, these two settings are often the first levers to pull before reducing sequence length too aggressively.
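The effective batch size the optimizer sees is the product of these knobs, which is worth computing explicitly when tuning:

```python
# Gradient accumulation updates weights once per accumulated group, so
# the optimizer sees the product of the knobs, not the per-device number.
def effective_batch_size(per_device: int, accumulation_steps: int, num_gpus: int = 1) -> int:
    return per_device * accumulation_steps * num_gpus

print(effective_batch_size(1, 32))     # 32 on a single GPU
print(effective_batch_size(2, 16, 2))  # 64 across two GPUs
```

Keeping the effective batch size constant while lowering per_device_train_batch_size is what makes gradient accumulation a memory lever rather than a training-signal sacrifice.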
3. Mismatched Tokenizers
If you add special tokens to the tokenizer but forget to resize the model embeddings, the model and tokenizer fall out of sync. That can cause unstable or garbage outputs because token ids point to embeddings the model never allocated.
from __future__ import annotations
from transformers import AutoModelForCausalLM, AutoTokenizer
model_id = "mistralai/Mistral-7B-Instruct-v0.3"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="cpu")
num_added_tokens = tokenizer.add_special_tokens(
{"additional_special_tokens": ["<|domain_policy|>"]}
)
if num_added_tokens > 0:
# Resize the model embeddings so new tokenizer ids have learned slots.
model.resize_token_embeddings(len(tokenizer))
Reference Links
- Transformers documentation
- Datasets documentation
- PEFT documentation
- TRL documentation
- bitsandbytes quantization guide
- Hugging Face Hub documentation