Unsloth Fine-Tuning Handbook
A complete reference for fine-tuning, quantizing, and running LLMs locally on consumer GPUs (12–16 GB VRAM). Covers Unsloth, Unsloth Studio, LoRA/QLoRA, GGUF export, and serving with Ollama — optimized for RTX 5070, 5060 Ti, and 4060 Ti.
What is Unsloth?
Unsloth is an open-source library built specifically to make LLM fine-tuning radically faster and more memory-efficient on consumer hardware — without sacrificing accuracy. Created by Daniel and Michael Han, it achieves 2–5× speedups over baseline HuggingFace + PEFT training while using up to 80% less VRAM.
Under the hood, Unsloth reimplements the performance-critical kernels (attention, cross-entropy loss, RoPE embeddings, LoRA weight updates) in OpenAI's Triton language and patches them directly into HuggingFace Transformers, so you keep the familiar Trainer API while getting much better hardware utilization.
Load your model through FastLanguageModel and existing training code runs faster automatically.
Why Unsloth vs Alternatives?
| Feature | Unsloth | HF + PEFT | Axolotl | LLaMA Factory |
|---|---|---|---|---|
| VRAM Efficiency | Best | Baseline | Good | Good |
| Speed | 2–5× | 1× | ~1.2× | ~1.3× |
| Consumer GPU Support | ✅ Excellent | Limited | Good | Good |
| GGUF Export | ✅ Native | Manual | Manual | ✅ |
| Unsloth Studio (no-code) | ✅ | ❌ | ❌ | ❌ |
| DPO / ORPO / GRPO | ✅ | ✅ | ✅ | ✅ |
| Multi-GPU | ⚠️ Limited | ✅ | ✅ | ✅ |
Unsloth is the obvious pick for single-GPU consumer setups (RTX 40/50 series, 12–24 GB VRAM). For multi-GPU cloud training, HF + PEFT or Axolotl remain competitive, but Unsloth's VRAM savings still help.
GGUF Explained
GGUF (GPT-Generated Unified Format) is a binary file format designed by Georgi Gerganov (creator of llama.cpp) as a replacement for the older GGML format. It stores model weights plus all metadata needed to run inference in a single self-contained file — no Python, no HuggingFace Hub required.
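To make the "single self-contained file" point concrete, here is a minimal sketch that reads the fixed-size header at the start of a GGUF file. It assumes the GGUF v2/v3 layout (4-byte magic `GGUF`, then a little-endian uint32 version, a uint64 tensor count, and a uint64 metadata key-value count). `parse_gguf_header` and `read_gguf_header` are hypothetical helpers written for this handbook, not part of llama.cpp or Unsloth.

```python
import struct
from typing import NamedTuple


class GGUFHeader(NamedTuple):
    version: int
    tensor_count: int
    metadata_kv_count: int


# Assumed layout: magic (4 bytes) + uint32 + uint64 + uint64, little-endian.
_HEADER = struct.Struct("<4sIQQ")


def parse_gguf_header(raw: bytes) -> GGUFHeader:
    """Parse the first 24 bytes of a GGUF file."""
    if len(raw) < _HEADER.size:
        raise ValueError(f"need at least {_HEADER.size} bytes, got {len(raw)}")
    magic, version, tensor_count, kv_count = _HEADER.unpack_from(raw)
    if magic != b"GGUF":
        raise ValueError(f"not a GGUF file (magic={magic!r})")
    return GGUFHeader(version, tensor_count, kv_count)


def read_gguf_header(path: str) -> GGUFHeader:
    """Read only the header bytes from disk; the weights are never loaded."""
    with open(path, "rb") as f:
        return parse_gguf_header(f.read(_HEADER.size))
```

Calling `read_gguf_header("model.gguf")` on an exported file is a quick sanity check that the export produced a real GGUF file before handing it to Ollama or llama-server.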
Why GGUF matters
GGUF models run via llama.cpp, which is a pure C++ inference engine with CUDA/Metal acceleration. This means:
- Single binary, no Python dependency
- CPU + GPU hybrid inference (great for VRAM-limited setups)
- Quantized models (Q4_K_M, Q5_K_M, etc.) that are tiny but high quality
- Works with Ollama, LM Studio, llama-server, Jan, AnythingLLM
GGUF vs SafeTensors
| Property | SafeTensors | GGUF |
|---|---|---|
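| Python required? | ✅ Yes (typical HF stack) | ❌ No |
| Quantization | Via external libraries (e.g. bitsandbytes) | Native |
| Metadata | Separate config files | Embedded |
| CPU inference | Slow | Fast |
| Ollama compatible | ⚠️ Needs import/conversion | ✅ |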
The GGUF Quantization Ladder
Quantization compresses model weights to lower bit-widths. GGUF supports many formats — here's what matters for 12–16 GB GPUs:
| Format | Bits | 7B Size | 13B Size | Quality | Recommended? |
|---|---|---|---|---|---|
| Q2_K | 2-bit | ~2.7 GB | ~5.0 GB | ⭐⭐ | Last resort |
| Q4_0 | 4-bit | ~3.8 GB | ~7.3 GB | ⭐⭐⭐ | Fast, okay quality |
| Q4_K_M | 4-bit | ~4.1 GB | ~7.9 GB | ⭐⭐⭐⭐ | Best value |
| Q5_K_M | 5-bit | ~4.8 GB | ~9.1 GB | ⭐⭐⭐⭐½ | Quality sweet spot |
| Q6_K | 6-bit | ~5.6 GB | ~10.7 GB | ⭐⭐⭐⭐⭐ | Near-lossless |
| Q8_0 | 8-bit | ~7.2 GB | ~13.8 GB | ⭐⭐⭐⭐⭐ | Near fp16 quality |
| F16 | 16-bit | ~14.0 GB | ~26 GB | Perfect | 7B only; exceeds 12 GB without CPU offload |
Use Q4_K_M for 13B models (fits with room for KV cache) or Q5_K_M for 7B models when you want higher quality at a modest size cost. For the Arthavidya 4B student model, even Q6_K or Q8_0 fits comfortably.
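For model sizes that are not in the table, a rough estimate can be had by scaling the 7B column linearly with parameter count. The sketch below does exactly that; the linear scaling and the `headroom_gb` reserve for KV cache and runtime buffers are simplifying assumptions, and both function names are hypothetical.

```python
# 7B file sizes in GB, copied from the quantization table above.
SIZE_7B_GB = {
    "Q2_K": 2.7,
    "Q4_0": 3.8,
    "Q4_K_M": 4.1,
    "Q5_K_M": 4.8,
    "Q6_K": 5.6,
    "Q8_0": 7.2,
    "F16": 14.0,
}


def estimate_gguf_gb(params_billion: float, quant: str) -> float:
    """Scale the 7B size linearly with parameter count (rough approximation)."""
    return SIZE_7B_GB[quant] * params_billion / 7.0


def quants_that_fit(params_billion: float, vram_gb: float, headroom_gb: float = 2.0):
    """Formats whose estimated size still leaves `headroom_gb` free on the GPU."""
    budget = vram_gb - headroom_gb
    return [q for q in SIZE_7B_GB if estimate_gguf_gb(params_billion, q) <= budget]


if __name__ == "__main__":
    for quant in ("Q6_K", "Q8_0"):
        print(f"4B {quant}: ~{estimate_gguf_gb(4, quant):.1f} GB")
    print("13B on 12 GB:", quants_that_fit(13, 12))
```

The estimates land slightly below the table's 13B column, so treat them as a lower bound and confirm against the actual file size after export.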
Quantization Formats
Unsloth supports multiple quantization strategies during training — these are distinct from GGUF formats, which apply at inference time.
LoRA & QLoRA
LoRA (Low-Rank Adaptation) freezes the original model weights and injects small trainable matrices into specific layers. Instead of updating billions of parameters, you train only millions — typically 0.1–3% of total parameters.
QLoRA = LoRA applied to a 4-bit quantized base model. The frozen weights live in NF4; the adapters are bf16/fp16. Unsloth's innovation is making this combination dramatically faster than the original bitsandbytes QLoRA implementation.
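The "millions instead of billions" claim is easy to check by hand. A LoRA adapter on a linear layer of shape d_in × d_out adds two matrices, A (r × d_in) and B (d_out × r), for r × (d_in + d_out) trainable weights. The sketch below counts them for an assumed Llama-2-7B-style decoder (hidden size 4096, MLP size 11008, 32 layers, roughly 6.74B base parameters); those shapes are illustrative assumptions, not values from this handbook's Qwen configs.

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable weights added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


# Assumed shapes for illustration: a Llama-2-7B-style decoder.
HIDDEN, MLP, LAYERS = 4096, 11008, 32
BASE_PARAMS = 6.74e9  # approximate base model size

# (d_in, d_out) of every projection targeted per decoder layer.
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, HIDDEN),
    "v_proj": (HIDDEN, HIDDEN),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}


def total_lora_params(r: int) -> int:
    """Adapter weights across all layers when every projection is targeted."""
    per_layer = sum(lora_params(d_in, d_out, r) for d_in, d_out in PROJECTIONS.values())
    return LAYERS * per_layer


if __name__ == "__main__":
    for r in (16, 32, 64):
        n = total_lora_params(r)
        print(f"r={r}: {n:,} trainable ({100 * n / BASE_PARAMS:.2f}% of base)")
```

Because the count is linear in r, doubling the rank doubles both the adapter size and the optimizer state that goes with it.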
Key LoRA Hyperparameters
| Parameter | What it does | 12GB Recommendation |
|---|---|---|
| r (rank) | Size of LoRA adapter matrices. Higher = more capacity. | 16–64 |
| lora_alpha | Scaling factor. Often set to 2× rank. | 32–128 |
| lora_dropout | Regularization dropout on adapters. | 0.05–0.1 |
| target_modules | Which weight matrices to adapt. | All linear layers |
| use_dora | DoRA variant: better quality, slightly more VRAM. | True if fits |
| use_rslora | Rank-stabilized LoRA: better than vanilla for high rank. | True |
For best results, target all attention + MLP projections: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj. Unsloth defaults to this automatically when you set target_modules="all-linear".
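How r, lora_alpha, and use_rslora interact is easiest to see from the multiplier applied to the adapter update. In the standard LoRA formulation that multiplier is lora_alpha / r, and rank-stabilized LoRA replaces it with lora_alpha / sqrt(r). The sketch below is a plain-Python illustration of those two formulas; `lora_scale` is a hypothetical helper, not an Unsloth or PEFT function.

```python
import math


def lora_scale(lora_alpha: float, r: int, use_rslora: bool = False) -> float:
    """Multiplier applied to the low-rank update B @ A."""
    if use_rslora:
        return lora_alpha / math.sqrt(r)  # rank-stabilized scaling
    return lora_alpha / r                 # classic LoRA scaling


if __name__ == "__main__":
    # Follow the "alpha = 2 x rank" rule of thumb from the table above.
    for r in (16, 32, 64):
        alpha = 2 * r
        print(
            f"r={r:>2} alpha={alpha:>3} "
            f"classic={lora_scale(alpha, r):.2f} "
            f"rslora={lora_scale(alpha, r, use_rslora=True):.2f}"
        )
```

With alpha fixed at 2× rank, the classic multiplier stays constant while the rsLoRA multiplier grows with rank, so it can be worth revisiting lora_alpha or the learning rate when switching use_rslora on at high rank.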
GPU Selection Guide
All three target GPUs (RTX 5070, RTX 5060 Ti, and RTX 4060 Ti) are well suited to Unsloth fine-tuning of 7–13B models in QLoRA mode. Here's what you need to know:
VRAM Budget by Model Size
| Model Params | Training (NF4 QLoRA) | Inference (GGUF Q4_K_M) | 12 GB Fits? | 16 GB Fits? |
|---|---|---|---|---|
| 1–3B | ~3–5 GB | ~1.5–2 GB | ✅ Comfortable | ✅ |
| 4–7B | ~6–9 GB | ~3–5 GB | ✅ | ✅ With room |
| 8–13B | ~10–12 GB | ~6–9 GB | ⚠️ Tight (seq≤2048) | ✅ |
| 14–30B | ~16–28 GB | ~10–20 GB | ❌ | ⚠️ 14B only |
| 70B+ | >40 GB | >35 GB | ❌ | ❌ |
Keep max_seq_length ≤ 2048, set gradient_checkpointing=True and per_device_train_batch_size=1, and use optim="adamw_8bit". With Unsloth, these settings allow 13B models to train on 12 GB with reasonable throughput.
Environment Setup
    # Step 1: Create and activate environment
    conda create -n unsloth_env python=3.11 -y
    conda activate unsloth_env

    # Step 2: Install PyTorch 2.3+ with CUDA 12.1
    pip install torch==2.3.1 torchvision torchaudio --index-url \
        https://download.pytorch.org/whl/cu121

    # Step 3: Install Unsloth (stable, CUDA 12.1, PyTorch 2.3)
    pip install "unsloth[cu121-torch230] @ git+https://github.com/unslothai/unsloth.git"

    # Step 4: Install supporting libraries
    pip install transformers datasets trl peft accelerate bitsandbytes
    pip install wandb sentencepiece protobuf

    # RTX 5060 Ti / 5070 need PyTorch nightly + CUDA 12.8
    conda create -n unsloth_env python=3.11 -y
    conda activate unsloth_env

    # Install PyTorch nightly (sm_120 Blackwell support)
    pip install --pre torch torchvision torchaudio --index-url \
        https://download.pytorch.org/whl/nightly/cu128

    # Install Unsloth from source (latest main)
    pip install "unsloth @ git+https://github.com/unslothai/unsloth.git"

    # Verify GPU is detected correctly
    python -c "import torch; print(torch.cuda.get_device_name(0))"

    # Python check: confirm the installed stack and the visible GPU
    import torch
    import unsloth

    print(f"PyTorch: {torch.__version__}")
    print(f"CUDA available: {torch.cuda.is_available()}")
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
    print(f"Unsloth: {unsloth.__version__}")
WSL2 Tips
Training on WSL2 (Windows Subsystem for Linux) works well with Unsloth. A few key configuration points to maximize performance:
    [wsl2]
    # Give WSL2 enough RAM (leave 4GB for Windows)
    memory=28GB
    # CPUs for data loading workers
    processors=12
    # Important: disable page reporting for GPU training stability
    pageReporting=false
    # Use the default kernel for CUDA compatibility: leave the kernel key unset
    # kernel=

    [experimental]
    # Best-effort DNS caching (a networking setting; it does not manage GPU memory)
    bestEffortDnsCaching=true
Store your training datasets on the Linux filesystem (/home/username/) not on the Windows mount (/mnt/c/). I/O to Windows mounts is dramatically slower and will bottleneck your DataLoader, leaving your GPU starved.
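A small guard at the top of a training script can catch this mistake before a long run starts. The sketch below flags paths under `/mnt/<drive-letter>/`, which is where WSL2 mounts Windows drives by default; the automount root is configurable in wsl.conf, so the `/mnt` prefix is an assumption, and `on_windows_mount` is a hypothetical helper.

```python
from pathlib import PurePosixPath


def on_windows_mount(path: str) -> bool:
    """True for WSL2 paths under /mnt/<drive-letter>/, e.g. /mnt/c/data."""
    parts = PurePosixPath(path).parts
    return (
        len(parts) >= 3
        and parts[0] == "/"
        and parts[1] == "mnt"
        and len(parts[2]) == 1      # single drive letter such as "c" or "d"
        and parts[2].isalpha()
    )


if __name__ == "__main__":
    for candidate in ("/mnt/c/Users/me/train.jsonl", "/home/me/data/train.jsonl"):
        status = "slow Windows mount" if on_windows_mount(candidate) else "ok"
        print(f"{candidate}: {status}")
```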
Blackwell GPUs on WSL2 require a recent Windows NVIDIA driver with CUDA 12.8 support (the R570 branch or newer; 560-series drivers predate Blackwell) and the CUDA 12.8 WSL2 package. Ensure nvidia-smi works inside WSL2 before installing PyTorch: nvidia-smi -L from your Linux shell should list the GPU.
Dataset Preparation
TRL's SFTTrainer, which Unsloth patches, trains on text rendered through the model's chat template. The cleanest approach is to convert everything to one conversation format (ShareGPT-style or ChatML-style) and use Unsloth's built-in formatting utilities.
Recommended Format: ShareGPT
{
"conversations": [
{"role": "system", "content": "You are an expert in Indian taxation..."},
{"role": "user", "content": "What is Section 80C deduction limit?"},
{"role": "assistant","content": "Under Section 80C of the Income Tax Act..."}
]
}
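Many public datasets labelled "ShareGPT" use `from`/`value` keys with `human`/`gpt` speakers rather than the `role`/`content` keys shown above. If your source data is in that older layout, a small converter keeps the rest of the pipeline uniform. This is a minimal sketch; `sharegpt_to_messages` and `ROLE_MAP` are hypothetical names.

```python
# Speaker names used by classic ShareGPT dumps, mapped to chat-template roles.
ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant"}


def sharegpt_to_messages(record: dict) -> dict:
    """Convert {"from", "value"} turns to {"role", "content"} turns."""
    turns = []
    for turn in record["conversations"]:
        if "role" in turn:
            # Already in role/content form: copy through unchanged.
            turns.append({"role": turn["role"], "content": turn["content"]})
        else:
            turns.append({"role": ROLE_MAP[turn["from"]], "content": turn["value"]})
    return {"conversations": turns}


if __name__ == "__main__":
    legacy = {
        "conversations": [
            {"from": "human", "value": "What is Section 80C deduction limit?"},
            {"from": "gpt", "value": "Under Section 80C of the Income Tax Act..."},
        ]
    }
    print(sharegpt_to_messages(legacy))
```

An unknown speaker name raises a KeyError, which is deliberate: it is better to stop on an unexpected label than to train on a silently mislabelled turn.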
Loading and Formatting
    from datasets import load_dataset
    from unsloth.chat_templates import get_chat_template

    # `tokenizer` comes from FastLanguageModel.from_pretrained (see Config Reference).
    # The dataset already uses "role"/"content" keys, so no key mapping is passed.
    tokenizer = get_chat_template(
        tokenizer,
        chat_template="qwen-2.5",  # or "llama-3", "mistral", etc.
    )

    def formatting_func(examples):
        """Render each conversation to a single training string."""
        convos = examples["conversations"]
        texts = [
            tokenizer.apply_chat_template(
                convo, tokenize=False, add_generation_prompt=False
            )
            for convo in convos
        ]
        return {"text": texts}

    dataset = load_dataset("json", data_files="data/train.jsonl", split="train")
    dataset = dataset.map(formatting_func, batched=True)

    print(f"Dataset size: {len(dataset):,} examples")
    print(dataset[0]["text"][:500])
For domain adaptation (like Arthavidya's finance focus), 2,000–5,000 high-quality curated examples outperform 50,000 noisy ones. Use a reserved gold eval set (~500–1,000 examples) that never touches training to measure true domain performance.
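One way to guarantee that the gold eval set never leaks into training is to carve it out once, with a fixed seed, before any other processing. The sketch below works on a plain list of records; `split_gold_eval` is a hypothetical helper, and the seed and sizes are placeholders.

```python
import random


def split_gold_eval(records, eval_size, seed=42):
    """Return (train, gold_eval) with no overlap; deterministic for a given seed."""
    if not 0 < eval_size < len(records):
        raise ValueError("eval_size must be between 1 and len(records) - 1")
    indices = list(range(len(records)))
    random.Random(seed).shuffle(indices)      # local RNG: global state untouched
    eval_idx = set(indices[:eval_size])
    train = [rec for i, rec in enumerate(records) if i not in eval_idx]
    gold = [rec for i, rec in enumerate(records) if i in eval_idx]
    return train, gold


if __name__ == "__main__":
    data = [{"id": i} for i in range(3000)]
    train, gold = split_gold_eval(data, eval_size=500)
    print(len(train), len(gold))
```

Writing the two lists to separate files and only ever pointing the trainer at the training file makes the separation hard to break by accident.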
Config Reference
FastLanguageModel — Model Loading
    from unsloth import FastLanguageModel
    import torch

    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name="unsloth/Qwen2.5-7B-Instruct-bnb-4bit",  # Pre-quantized
        max_seq_length=4096,      # Reduce to 2048 if 12 GB is tight
        dtype=None,               # Auto-detect (bf16 on RTX 40/50)
        load_in_4bit=True,        # QLoRA mode: essential for 12 GB
        token=None,               # HF token if using gated models
        device_map="sequential",  # Unsloth-preferred over "auto"
    )

    # Apply LoRA adapters
    model = FastLanguageModel.get_peft_model(
        model,
        r=32,                                  # Rank: 16 for memory-tight, 64 for best
        lora_alpha=64,                         # Usually 2× rank
        lora_dropout=0.05,                     # Small dropout for regularisation
        target_modules="all-linear",           # All q/k/v/o + gate/up/down
        use_gradient_checkpointing="unsloth",  # Unsloth's optimised version
        random_state=42,
        use_rslora=True,                       # Rank-stabilised LoRA
        use_dora=False,                        # Set True if you have spare VRAM
        loftq_config=None,
    )
SFTTrainer Arguments — 12 GB Optimized
    import torch
    from trl import SFTTrainer
    from transformers import TrainingArguments

    trainer = SFTTrainer(
        model=model,
        tokenizer=tokenizer,
        train_dataset=dataset,
        dataset_text_field="text",
        max_seq_length=2048,
        dataset_num_proc=4,
        packing=True,  # Pack short sequences: boosts GPU util
        args=TrainingArguments(
            # Batch + gradient accumulation
            per_device_train_batch_size=2,
            gradient_accumulation_steps=8,  # Effective batch = 16
            # Learning rate schedule
            warmup_steps=50,
            num_train_epochs=3,
            learning_rate=2e-4,
            lr_scheduler_type="cosine",
            # Precision
            fp16=not torch.cuda.is_bf16_supported(),
            bf16=torch.cuda.is_bf16_supported(),  # bf16 on RTX 40/50
            # Memory optimisations
            optim="adamw_8bit",  # 8-bit Adam: saves ~1GB VRAM
            gradient_checkpointing=True,
            # Logging
            logging_steps=10,
            output_dir="./outputs",
            save_strategy="epoch",
            report_to="wandb",  # or "tensorboard" / "none"
            run_name="arthavidya-qwen25-7b-sft",
            # Stability
            weight_decay=0.01,
            max_grad_norm=1.0,
            seed=42,
        ),
    )

    trainer_stats = trainer.train()
Key Hyperparameters Cheatsheet
| Param | Memory-Tight (12GB) | Comfortable (16GB) | Effect |
|---|---|---|---|
| per_device_train_batch_size | 1–2 | 2–4 | VRAM directly |
| gradient_accumulation_steps | 8–16 | 4–8 | Effective batch size |
| max_seq_length | 1024–2048 | 2048–4096 | VRAM quadratically |
| LoRA r (rank) | 16–32 | 32–64 | Adapter capacity |
| learning_rate | 1e-4 to 3e-4 | 1e-4 to 2e-4 | Convergence speed |
| packing | True | True | GPU utilization |
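The batch-size and accumulation rows trade off against each other, so it helps to compute the effective batch and the resulting number of optimizer steps before picking warmup_steps. The sketch below is an approximation (the Trainer's own rounding can differ by a step or two per epoch), and both function names are hypothetical. With packing=True, `num_sequences` is the number of packed sequences, not raw examples.

```python
import math


def effective_batch(per_device_batch: int, grad_accum: int, num_gpus: int = 1) -> int:
    """Sequences contributing to each optimizer step."""
    return per_device_batch * grad_accum * num_gpus


def approx_optimizer_steps(num_sequences: int, per_device_batch: int,
                           grad_accum: int, epochs: int, num_gpus: int = 1) -> int:
    """Approximate total optimizer steps for a full run."""
    per_epoch = math.ceil(num_sequences / effective_batch(per_device_batch, grad_accum, num_gpus))
    return per_epoch * epochs


if __name__ == "__main__":
    steps = approx_optimizer_steps(4000, per_device_batch=2, grad_accum=8, epochs=3)
    print(f"effective batch: {effective_batch(2, 8)}, total steps: ~{steps}")
    print(f"warmup_steps=50 is {100 * 50 / steps:.1f}% of training")
```

For 4,000 sequences with the 12 GB settings above this gives roughly 750 steps, so warmup_steps=50 covers about 7% of training.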
Full Training Loop
A complete, runnable training script combining all the above components. Copy-paste ready for a 7B QLoRA fine-tune on 12 GB VRAM.
    #!/usr/bin/env python3
    """
    Unsloth SFT Training Script
    Optimised for 12–16 GB VRAM (RTX 5070, 5060 Ti, 4060 Ti)
    """
    import torch
    from unsloth import FastLanguageModel
    from unsloth.chat_templates import get_chat_template
    from datasets import load_dataset
    from trl import SFTTrainer
    from transformers import TrainingArguments
    import wandb

    # ─── CONFIG ──────────────────────────────
    MODEL_NAME = "unsloth/Qwen2.5-7B-Instruct-bnb-4bit"
    MAX_SEQ_LEN = 2048
    OUTPUT_DIR = "./outputs/arthavidya-v1"
    DATASET_PATH = "data/train.jsonl"
    WANDB_PROJECT = "arthavidya"

    # ─── LOAD MODEL ──────────────────────────
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name=MODEL_NAME,
        max_seq_length=MAX_SEQ_LEN,
        dtype=None,
        load_in_4bit=True,
    )
    tokenizer = get_chat_template(tokenizer, chat_template="qwen-2.5")

    # ─── APPLY LORA ──────────────────────────
    model = FastLanguageModel.get_peft_model(
        model,
        r=32,
        lora_alpha=64,
        lora_dropout=0.05,
        target_modules="all-linear",
        use_gradient_checkpointing="unsloth",
        use_rslora=True,
        random_state=42,
    )
    # print_trainable_parameters() prints its own summary and returns None
    model.print_trainable_parameters()

    # ─── DATASET ─────────────────────────────
    dataset = load_dataset("json", data_files=DATASET_PATH, split="train")

    def format_examples(examples):
        """Render each conversation through the chat template."""
        return {"text": [
            tokenizer.apply_chat_template(c, tokenize=False, add_generation_prompt=False)
            for c in examples["conversations"]
        ]}

    dataset = dataset.map(format_examples, batched=True)

    # ─── TRAINER ─────────────────────────────
    wandb.init(project=WANDB_PROJECT, name="qwen25-7b-sft-r32")

    trainer = SFTTrainer(
        model=model,
        tokenizer=tokenizer,
        train_dataset=dataset,
        dataset_text_field="text",
        max_seq_length=MAX_SEQ_LEN,
        dataset_num_proc=4,
        packing=True,
        args=TrainingArguments(
            per_device_train_batch_size=2,
            gradient_accumulation_steps=8,
            warmup_steps=50,
            num_train_epochs=3,
            learning_rate=2e-4,
            lr_scheduler_type="cosine",
            fp16=not torch.cuda.is_bf16_supported(),
            bf16=torch.cuda.is_bf16_supported(),
            optim="adamw_8bit",
            weight_decay=0.01,
            max_grad_norm=1.0,
            logging_steps=10,
            save_strategy="epoch",
            output_dir=OUTPUT_DIR,
            report_to="wandb",
            seed=42,
        ),
    )

    # ─── TRAIN ───────────────────────────────
    stats = trainer.train()
    print(f"Training complete. Loss: {stats.training_loss:.4f}")

    # ─── SAVE LORA ADAPTERS ──────────────────
    model.save_pretrained(f"{OUTPUT_DIR}/lora-adapter")
    tokenizer.save_pretrained(f"{OUTPUT_DIR}/lora-adapter")
    print("LoRA adapter saved.")
Curriculum Training
Curriculum training exposes the model to data in a deliberate difficulty progression — starting with easy, well-structured examples and advancing to complex, nuanced ones. This mirrors how humans learn: foundational concepts first, edge cases later.
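    import torch
    from datasets import load_dataset
    from trl import SFTTrainer
    from transformers import TrainingArguments

    # Reuses model, tokenizer, format_examples and MAX_SEQ_LEN from the full training script.
    # tier name -> (data path, epochs, learning rate)
    TIERS = {
        "easy":      ("data/tier1_easy.jsonl",      1, 2e-4),
        "medium":    ("data/tier2_medium.jsonl",    1, 1.5e-4),
        "difficult": ("data/tier3_difficult.jsonl", 2, 1e-4),
        "tricky":    ("data/tier4_tricky.jsonl",    2, 5e-5),
    }

    for tier_name, (data_path, epochs, lr) in TIERS.items():
        print(f"\n{'='*50}")
        print(f"Training Tier: {tier_name.upper()}")
        print(f"LR: {lr}, Epochs: {epochs}")

        dataset = load_dataset("json", data_files=data_path, split="train")
        dataset = dataset.map(format_examples, batched=True)

        # Build a fresh trainer per tier. SFTTrainer tokenizes and packs the dataset
        # in its constructor, and the optimizer reads the learning rate when it is
        # created, so mutating an existing trainer would not reliably apply either.
        trainer = SFTTrainer(
            model=model,  # same LoRA model, so each tier continues from the last
            tokenizer=tokenizer,
            train_dataset=dataset,
            dataset_text_field="text",
            max_seq_length=MAX_SEQ_LEN,
            packing=True,
            args=TrainingArguments(
                per_device_train_batch_size=2,
                gradient_accumulation_steps=8,
                learning_rate=lr,
                num_train_epochs=epochs,
                lr_scheduler_type="cosine",
                fp16=not torch.cuda.is_bf16_supported(),
                bf16=torch.cuda.is_bf16_supported(),
                optim="adamw_8bit",
                output_dir=f"checkpoints/{tier_name}",
            ),
        )
        trainer.train()

        # Save checkpoint after each tier
        model.save_pretrained(f"checkpoints/after_{tier_name}")
        print(f"Tier {tier_name} complete.")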
Knowledge Distillation
Knowledge Distillation (KD) trains a smaller student model to mimic a larger, more capable teacher model. Instead of only training on hard labels (correct answers), the student learns from the teacher's soft output distribution — capturing nuance and uncertainty.
Standard SFT (Hard Labels)
Loss = Cross-entropy(student_logits, ground_truth_token)
Student only sees "right answer" — no information about near-misses.
KD with Soft Labels
Loss = α × T² × KL(softmax(teacher_logits / T) ‖ softmax(student_logits / T)) + (1 − α) × CE(student_logits, ground_truth_token)
Student learns the teacher's confidence distribution — richer signal.
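    import torch
    import torch.nn.functional as F
    from trl import SFTTrainer
    from unsloth import FastLanguageModel

    # Load teacher (a larger checkpoint, or logits served via a llama-server API)
    # Load student (a smaller checkpoint with LoRA)
    # Teacher and student must share a tokenizer so their logits line up.

    TEMPERATURE = 2.0  # Soften teacher distribution
    ALPHA = 0.7        # Weight for KD loss vs CE loss

    def kd_loss(student_logits, teacher_logits, labels):
        # Causal LM: logits at position t predict token t+1, so shift before comparing
        student_logits = student_logits[:, :-1, :]
        teacher_logits = teacher_logits[:, :-1, :]
        labels = labels[:, 1:]
        mask = (labels != -100).float()  # ignore padding / prompt tokens

        # Standard CE loss (hard labels)
        ce_loss = F.cross_entropy(
            student_logits.reshape(-1, student_logits.size(-1)),
            labels.reshape(-1),
            ignore_index=-100,
        )

        # KL divergence with temperature scaling, averaged over unmasked tokens
        student_soft = F.log_softmax(student_logits / TEMPERATURE, dim=-1)
        teacher_soft = F.softmax(teacher_logits / TEMPERATURE, dim=-1)
        kl_per_token = F.kl_div(student_soft, teacher_soft, reduction="none").sum(dim=-1)
        kl_loss = (kl_per_token * mask).sum() / mask.sum().clamp(min=1)
        kl_loss = kl_loss * (TEMPERATURE ** 2)

        return ALPHA * kl_loss + (1 - ALPHA) * ce_loss

    # Use with a custom Trainer that overrides compute_loss()
    class KDTrainer(SFTTrainer):
        def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
            # teacher_model: a frozen model loaded elsewhere, on the same device
            with torch.no_grad():
                teacher_out = teacher_model(**inputs)
            student_out = model(**inputs)
            loss = kd_loss(
                student_out.logits,
                teacher_out.logits,
                inputs["labels"],
            )
            return (loss, student_out) if return_outputs else loss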
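T=1.0 uses the teacher's distribution as-is (the sharpest targets). T=2.0–4.0 gives softer targets that expose more of the teacher's ranking over near-miss tokens and tend to generalise better. For finance domain KD, T=2.0 is a good starting point. Higher T can help when the teacher is very confident (low-entropy output).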
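The effect of temperature is easy to see on a toy example. The sketch below applies a temperature-scaled softmax to three made-up teacher logits using only the standard library; the logit values are illustrative, not taken from a real model.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits / temperature, computed in a numerically stable way."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)                          # subtract the max to avoid overflow
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    teacher_logits = [6.0, 3.0, 1.0]            # made-up values for illustration
    for t in (1.0, 2.0, 4.0):
        probs = softmax_with_temperature(teacher_logits, t)
        print(f"T={t}: {[round(p, 3) for p in probs]}")
```

As T rises, probability mass moves from the top token to the runners-up, which is the extra signal the student receives beyond the hard label.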
Unsloth Studio Overview
Unsloth Studio is a web-based no-code/low-code interface for fine-tuning models using Unsloth, launched in 2025. It runs locally in your browser (backed by a local server) or can connect to a remote GPU instance. Think of it as a visual front-end for everything in this handbook.
    # Install Unsloth Studio (included with unsloth 2025.x)
    pip install "unsloth[studio]"

    # Launch the Studio server on localhost:7860
    unsloth-studio

    # Or specify a port and host
    unsloth-studio --port 7860 --host 0.0.0.0

    # Then open http://localhost:7860 in your browser
Use Studio for exploratory fine-tuning, quick experiments, and GGUF export. Use Python scripts for production pipelines, curriculum training, KD, DVC tracking, and W&B integration. Studio's "Export Script" button gives you the Python equivalent of any Studio config.
Studio Workflow
Export to GGUF
Unsloth has a native GGUF export pipeline — no manual llama.cpp compilation required. It merges your LoRA adapter into the base model, converts to GGUF, and applies your chosen quantization, all in one call.
    # After training is complete, export to GGUF

    # Option 1: Save as GGUF directly (multiple quant types at once)
    model.save_pretrained_gguf(
        "arthavidya-qwen25-7b",  # Output folder name
        tokenizer,
        quantization_method=[
            "q4_k_m",  # Best balance: use for Ollama
            "q5_k_m",  # Higher quality
            "q8_0",    # Near-lossless: for eval
        ],
    )
    # Produces: arthavidya-qwen25-7b-Q4_K_M.gguf, etc.

    # Option 2: Push merged model to HuggingFace Hub
    model.push_to_hub_gguf(
        "vivek-doshi/arthavidya-qwen25-7b-gguf",
        tokenizer,
        quantization_method="q4_k_m",
        token="hf_...",
    )

    # Option 3: Save merged safetensors (for HF Hub without GGUF)
    model.save_pretrained_merged(
        "arthavidya-merged",
        tokenizer,
        save_method="merged_16bit",  # or "merged_4bit", "lora"
    )
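Each GGUF export writes one .gguf file per quantization method, with weights, tokenizer vocabulary, and metadata embedded in that single file, plus a Modelfile template for Ollama. The Modelfile includes the chat template automatically; check the system prompt before you publish the model.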
Local Inference
Option A: llama-server (Direct, No Overhead)
llama-server is the HTTP API mode of llama.cpp. It provides an OpenAI-compatible REST endpoint — the same API format as the OpenAI SDK, making it a drop-in for your applications.
    # Download llama.cpp release binary (or build from source)
    wget https://github.com/ggerganov/llama.cpp/releases/latest/download/llama-b3xxx-bin-ubuntu-x64.zip

    # Launch server with CUDA offload. Flags used below:
    #   -ngl 40       GPU layers; set to 99 to fully GPU-offload
    #   -c 4096       Context window
    #   -np 4         Parallel slots (concurrent requests)
    #   --flash-attn  Enable Flash Attention: faster, less VRAM
    # (Comments cannot follow a trailing backslash, so they are listed here.)
    ./llama-server \
        -m ./arthavidya-qwen25-7b-Q4_K_M.gguf \
        -ngl 40 \
        -c 4096 \
        --host 0.0.0.0 \
        --port 8080 \
        -np 4 \
        --flash-attn

    # Server now running at http://localhost:8080
    # OpenAI-compatible endpoint: http://localhost:8080/v1/chat/completions
    from openai import OpenAI

    # Point the OpenAI SDK at the local llama-server; the API key is a placeholder
    client = OpenAI(base_url="http://localhost:8080/v1", api_key="local")

    response = client.chat.completions.create(
        model="arthavidya",
        messages=[
            {"role": "system", "content": "You are Arthavidya, an Indian finance expert."},
            {"role": "user", "content": "Explain ELSS tax saving funds."},
        ],
        temperature=0.7,
        max_tokens=512,
    )
    print(response.choices[0].message.content)
GPU Layer Recommendations for 12–16 GB
| Model + Quant | GPU | -ngl (GPU layers) | RAM fallback |
|---|---|---|---|
| 7B Q4_K_M | 12 GB | -ngl 99 | None (fully GPU) |
| 7B Q8_0 | 12 GB | -ngl 32 | ~2 GB RAM |
| 13B Q4_K_M | 12 GB | -ngl 30 | ~3 GB RAM |
| 13B Q4_K_M | 16 GB | -ngl 99 | None |
| 4B Q6_K | 12 GB | -ngl 99 | None (much room) |
Ollama Integration
Ollama is the easiest way to manage and run GGUF models locally. It handles model storage, automatic GPU offloading, and exposes the same OpenAI-compatible API. Ideal for running your Arthavidya model alongside other models with hot-swap.
    # Create a Modelfile for your fine-tuned model
    FROM ./arthavidya-qwen25-7b-Q4_K_M.gguf

    # Set system prompt
    SYSTEM """
    You are Arthavidya, a knowledgeable Indian financial assistant.
    You provide accurate, clear guidance on Indian taxation, mutual funds,
    insurance, and personal finance. Always cite relevant sections
    (e.g. Section 80C, SEBI regulations) where applicable.
    """

    # Parameters (num_gpu 99 = offload as many layers as possible to the GPU)
    PARAMETER temperature 0.7
    PARAMETER top_p 0.9
    PARAMETER num_ctx 4096
    PARAMETER num_gpu 99
    PARAMETER num_thread 8

    # Install Ollama
    curl -fsSL https://ollama.com/install.sh | sh

    # Create your custom model from Modelfile
    ollama create arthavidya -f ./Modelfile

    # Run interactively
    ollama run arthavidya

    # List all local models
    ollama list

    # Pull a base model (for comparison)
    ollama pull qwen2.5:7b-instruct-q4_K_M

    # Query via REST API (OpenAI-compatible)
    curl http://localhost:11434/v1/chat/completions \
        -H "Content-Type: application/json" \
        -d '{
            "model": "arthavidya",
            "messages": [{"role":"user","content":"What is a SIP?"}]
        }'
llama-server vs Ollama — When to Use Which
Troubleshooting
Common Errors & Fixes
CUDA Out of Memory (OOM)
Try these in order until the OOM resolves:
1. Reduce max_seq_length: 4096 → 2048 → 1024
2. Reduce batch size: per_device_train_batch_size=1
3. Increase grad accum: gradient_accumulation_steps=16
4. Use 8-bit Adam: optim="adamw_8bit"
5. Enable grad checkpointing: use_gradient_checkpointing="unsloth"
6. Reduce LoRA rank: r=32 → r=16
7. Disable packing temporarily and check GPU util with nvidia-smi
Triton / CUDA Kernel Errors on RTX 50 Series
    # Blackwell (sm_120) needs a CUDA 12.8 build of PyTorch and a Blackwell-aware Triton
    pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu128
    pip install -U "triton>=3.3.1"  # Triton 3.2.0 predates sm_120 support

    # Also ensure your NVIDIA driver supports CUDA 12.8 (R570 branch or newer)
    nvidia-smi  # Check driver version
Slow Training — Tokens/sec Too Low
    # Check GPU utilization during training
    watch -n 1 nvidia-smi

    # If GPU util < 80%, your DataLoader is bottlenecking:
    # - Increase dataset_num_proc (e.g., 4 → 8)
    # - Enable packing=True to densify batches
    # - Move dataset to Linux FS (not /mnt/c/ in WSL2)
    # - Pre-tokenize and cache dataset before training
NaN / Inf Loss
NaN loss usually means the LR is too high or the data is bad:
1. Lower LR: 2e-4 → 5e-5
2. Add max_grad_norm=1.0 to clip gradients
3. Inspect dataset for empty or malformed examples
4. Try bf16=False, fp16=True (or vice versa)
5. Increase warmup_steps to 100+
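Inspecting the dataset for empty or malformed examples can be automated with a short pass over the raw records before training. The sketch below assumes the `conversations` / `role` / `content` layout used in this handbook; `find_bad_examples` and the specific checks are illustrative, not an exhaustive validator.

```python
VALID_ROLES = {"system", "user", "assistant"}


def find_bad_examples(records):
    """Return (index, reason) pairs for records that are empty or malformed."""
    problems = []
    for i, rec in enumerate(records):
        turns = rec.get("conversations") if isinstance(rec, dict) else None
        if not isinstance(turns, list) or not turns:
            problems.append((i, "missing or empty 'conversations'"))
            continue
        for turn in turns:
            if not isinstance(turn, dict) or turn.get("role") not in VALID_ROLES:
                problems.append((i, "unknown role"))
                break
            content = turn.get("content")
            if not isinstance(content, str) or not content.strip():
                problems.append((i, "empty content"))
                break
        else:
            # Every turn was well-formed; make sure there is something to learn from.
            if not any(turn["role"] == "assistant" for turn in turns):
                problems.append((i, "no assistant turn"))
    return problems


if __name__ == "__main__":
    import json
    with open("data/train.jsonl", encoding="utf-8") as f:
        rows = [json.loads(line) for line in f if line.strip()]
    for index, reason in find_bad_examples(rows):
        print(f"line {index + 1}: {reason}")
```

Dropping or repairing the flagged rows before training is cheaper than diagnosing a diverged run afterwards.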
Unsloth vs PEFT Version Mismatch
    # Always pin compatible versions together
    pip install "peft==0.11.1" "trl==0.9.4" "transformers==4.44.0"
    pip install "unsloth @ git+https://github.com/unslothai/unsloth.git"

    # Check what's installed
    pip show unsloth peft trl transformers | grep -E "Name|Version"
Quick Cheat Sheet
Model Names (Unsloth Hub)
Key CLI Commands
    # Monitor VRAM during training
    watch -n 1 nvidia-smi --query-gpu=memory.used,memory.free,utilization.gpu --format=csv

    # Rough dataset size: counts whitespace-separated words, not tokenizer tokens,
    # and expects a file that already has a "text" column
    python -c "from datasets import load_dataset; d=load_dataset('json',data_files='train.jsonl',split='train'); print(sum(len(x.split()) for x in d['text']))"

    # Test GGUF model with llama.cpp CLI
    ./llama-cli -m model.gguf -p "What is ELSS?" --temp 0.7 -n 256

    # Check Ollama GPU usage
    ollama ps

    # Export LoRA adapter to merged safetensors
    python -c "
    from unsloth import FastLanguageModel
    m,t = FastLanguageModel.from_pretrained('./lora-adapter')
    m.save_pretrained_merged('./merged', t, save_method='merged_16bit')
    "
Hyperparameter Quick-Reference
| Goal | Key Levers |
|---|---|
| Reduce VRAM | ↓ seq_len, ↓ batch_size, ↑ grad_accum, ↓ rank, enable grad_checkpoint |
| Faster training | ↑ batch_size, packing=True, ↑ dataset_num_proc, bf16=True |
| Better quality | ↑ rank, use_rslora, ↑ epochs, better data curation |
| Prevent overfitting | ↑ dropout, ↓ epochs, weight_decay=0.01, eval set monitoring |
| Stable training | ↓ LR, warmup_steps=100+, max_grad_norm=1.0, cosine schedule |