Unsloth Fine-Tuning Handbook

Introduction

What is Unsloth?

Unsloth is an open-source library built specifically to make LLM fine-tuning radically faster and more memory-efficient on consumer hardware — without sacrificing accuracy. Created by Daniel and Michael Han, it achieves 2–5× speedups over baseline HuggingFace + PEFT training while using up to 80% less VRAM.

Under the hood, Unsloth rewrites the performance-critical kernels (attention, cross-entropy loss, RoPE embeddings, weight updates) in Triton and patches them directly into HuggingFace Transformers, so you keep the familiar Trainer API while getting dramatically better hardware utilization.

🚀

2–5× Faster Training

Custom Triton kernels replace bottleneck ops. Llama 3 fine-tuning can drop from hours to 30–40 minutes on a single RTX 4060 Ti.

💾

80% Less VRAM

QLoRA integration + gradient checkpointing rewriting means 7–13B models fit comfortably in 12 GB VRAM during training.

🎯

Zero Accuracy Loss

Math is numerically equivalent to standard training. No approximations that affect gradient quality or final model performance.

🔌

HuggingFace Native

Works with Trainer, TRL's SFTTrainer, DPO, ORPO, GRPO. Swap in FastLanguageModel and existing code runs faster automatically.

Speed vs Baseline: 2–5×
VRAM Reduction: 80%
Accuracy Loss: 0%
Max on 12GB VRAM: 13B
Supported Models: 70+

Why Unsloth vs Alternatives?

Feature	Unsloth	HF + PEFT	Axolotl	LLaMA Factory
VRAM Efficiency	Best	Baseline	Good	Good
Speed	2–5×	1×	~1.2×	~1.3×
Consumer GPU Support	✅ Excellent	Limited	Good	Good
GGUF Export	✅ Native	Manual	Manual	✅
Unsloth Studio (no-code)	✅	❌	❌	❌
DPO / ORPO / GRPO	✅	✅	✅	✅
Multi-GPU	⚠️ Limited	✅	✅	✅

💡 When to choose Unsloth

Unsloth is the obvious pick for single-GPU consumer setups (RTX 40/50 series, 12–24 GB VRAM). For multi-GPU cloud training, HF + PEFT or Axolotl remain competitive, but Unsloth's VRAM savings still help.

Foundations

GGUF Explained

GGUF (GPT-Generated Unified Format) is a binary file format introduced by the llama.cpp project (created by Georgi Gerganov) as a replacement for the older GGML format. It stores the model weights plus all the metadata needed to run inference in a single self-contained file, with no Python or HuggingFace Hub required.

Why GGUF matters

GGUF models run via llama.cpp, which is a pure C++ inference engine with CUDA/Metal acceleration. This means:

Single binary, no Python dependency
CPU + GPU hybrid inference (great for VRAM-limited setups)
Quantized models (Q4_K_M, Q5_K_M, etc.) that are tiny but high quality
Works with Ollama, LM Studio, llama-server, Jan, AnythingLLM
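
To make the "single self-contained file" idea concrete, here is a minimal sketch that parses the fixed-size header at the start of a GGUF file. It assumes the GGUF version 2+ layout (little-endian: 4-byte magic `GGUF`, uint32 version, uint64 tensor count, uint64 metadata key/value count). `parse_gguf_header` is a hypothetical helper name, not part of llama.cpp or Unsloth.

```python
import struct

GGUF_HEADER_SIZE = 24  # 4 (magic) + 4 (version) + 8 (tensor count) + 8 (metadata count)

def parse_gguf_header(data):
    """Return (version, tensor_count, metadata_kv_count) from the first 24 bytes.

    Assumes the GGUF v2+ layout with little-endian fields.
    """
    if len(data) < GGUF_HEADER_SIZE or data[:4] != b"GGUF":
        raise ValueError("not a GGUF header")
    version, tensor_count, kv_count = struct.unpack("<IQQ", data[4:GGUF_HEADER_SIZE])
    return version, tensor_count, kv_count
```

To inspect a real file, pass the first 24 bytes: `parse_gguf_header(open("model.gguf", "rb").read(24))`. Everything after this header (metadata, tokenizer, tensor data) lives in the same file.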

GGUF vs SafeTensors

Property	SafeTensors	GGUF
Python req?	✅ Yes	❌ No
Quantization	Via bitsandbytes / GPTQ / AWQ	Native
Metadata	Separate	Embedded
CPU inference	Slow	Fast
Ollama compat	❌	✅

The GGUF Quantization Ladder

Quantization compresses model weights to lower bit-widths. GGUF supports many formats — here's what matters for 12–16 GB GPUs:

Format	Bits	7B Size	13B Size	Quality	Recommended?
`Q2_K`	2-bit	~2.7 GB	~5.0 GB	⭐⭐	Last resort
`Q4_0`	4-bit	~3.8 GB	~7.3 GB	⭐⭐⭐	Fast, okay quality
`Q4_K_M`	4-bit	~4.1 GB	~7.9 GB	⭐⭐⭐⭐	Best Value
`Q5_K_M`	5-bit	~4.8 GB	~9.1 GB	⭐⭐⭐⭐½	Quality sweet spot
`Q6_K`	6-bit	~5.6 GB	~10.7 GB	⭐⭐⭐⭐⭐	Near-lossless
`Q8_0`	8-bit	~7.2 GB	~13.8 GB	⭐⭐⭐⭐⭐	Near fp16 quality
`F16`	16-bit	~14.0 GB	~26 GB	Perfect	Too large for 12 GB, even at 7B

⚡ For 12 GB VRAM

Use Q4_K_M for 13B models (about 7.9 GB, which leaves room for the KV cache) or Q5_K_M for 7B models (about 4.8 GB) when you want higher quality. For the Arthavidya 4B student model, even Q6_K or Q8_0 fits comfortably.
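
The selection logic can be sketched as a lookup over the table above. This is a rough illustration, not a sizing tool: the sizes are copied from the table, and `headroom_gb` (3 GB here) is an assumed allowance for KV cache and CUDA overhead. `best_quant` is a hypothetical helper name.

```python
# Approximate file sizes in GB, copied from the quantization ladder table.
GGUF_SIZES_GB = {
    "7B":  {"Q2_K": 2.7, "Q4_0": 3.8, "Q4_K_M": 4.1, "Q5_K_M": 4.8,
            "Q6_K": 5.6, "Q8_0": 7.2, "F16": 14.0},
    "13B": {"Q2_K": 5.0, "Q4_0": 7.3, "Q4_K_M": 7.9, "Q5_K_M": 9.1,
            "Q6_K": 10.7, "Q8_0": 13.8, "F16": 26.0},
}

def best_quant(model_size, vram_gb, headroom_gb=3.0):
    """Pick the largest format that still leaves `headroom_gb` of VRAM free.

    Returns None when no format fits entirely on the GPU.
    """
    budget = vram_gb - headroom_gb
    fitting = {q: s for q, s in GGUF_SIZES_GB[model_size].items() if s <= budget}
    return max(fitting, key=fitting.get) if fitting else None
```

With these assumptions `best_quant("13B", 12)` returns `"Q4_K_M"`. The function reports the largest format that fits, so treat its answer as an upper bound; long contexts need more headroom than the default.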

Quantization Formats

Unsloth supports multiple quantization strategies during training — these are distinct from GGUF formats, which apply at inference time.

4-bit NF4 (QLoRA)

NormalFloat4 — stores base model weights in 4-bit with a Normal distribution mapping. The LoRA adapters themselves remain in float16/bfloat16. Best VRAM/quality ratio for training.

4-bit Int4

Integer 4-bit quantization. Slightly more VRAM-efficient than NF4 but marginally lower quality. Useful for very tight VRAM budgets.

8-bit (LLM.int8)

Mixed-precision 8-bit. Good quality but uses more VRAM than NF4. Rarely needed with Unsloth — NF4 is usually better.

bfloat16 (bf16)

16-bit training without weight quantization, supported on Ampere+ GPUs (RTX 30/40/50). Best quality, but the weights alone take about 4× the memory of NF4 (16 bits vs 4 bits per parameter). Use for small models (≤4B) that fit.

LoRA & QLoRA

LoRA (Low-Rank Adaptation) freezes the original model weights and injects small trainable matrices into specific layers. Instead of updating billions of parameters, you train only millions — typically 0.1–3% of total parameters.

QLoRA = LoRA applied to a 4-bit quantized base model. The frozen weights live in NF4; the adapters are bf16/fp16. Unsloth's innovation is making this combination dramatically faster than the original bitsandbytes QLoRA implementation.
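
The "millions instead of billions" claim follows from simple arithmetic. As a minimal sketch: for a frozen weight matrix of shape d_in × d_out, LoRA trains two small matrices of shapes r × d_in and d_out × r. The helper names and the 4096 × 4096 example shape below are illustrative assumptions, not values taken from a specific model.

```python
def lora_param_count(d_in, d_out, r):
    """Trainable parameters LoRA adds to one d_in x d_out weight matrix.

    LoRA learns A (r x d_in) and B (d_out x r); the frozen weight is untouched.
    """
    return r * (d_in + d_out)

def lora_fraction(layer_shapes, r):
    """Trainable LoRA parameters as a fraction of the frozen weights they adapt."""
    frozen = sum(d_in * d_out for d_in, d_out in layer_shapes)
    trainable = sum(lora_param_count(d_in, d_out, r) for d_in, d_out in layer_shapes)
    return trainable / frozen
```

For a single 4096 × 4096 projection at rank 16, `lora_param_count(4096, 4096, 16)` is 131,072 against 16,777,216 frozen weights, or about 0.78%. Doubling the rank doubles the adapter size.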

Key LoRA Hyperparameters

Parameter	What it does	12GB Recommendation
`r` (rank)	Size of LoRA adapter matrices. Higher = more capacity.	16–64
`lora_alpha`	Scaling factor. Often set to 2× rank.	32–128
`lora_dropout`	Regularization dropout on adapters.	0.05–0.1
`target_modules`	Which weight matrices to adapt.	All linear layers
`use_dora`	DoRA variant — better quality, slightly more VRAM.	True if fits
`use_rslora`	Rank-stabilized LoRA — better than vanilla for high rank.	True
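
How `r`, `lora_alpha`, and `use_rslora` interact can be sketched with the scaling factor applied to the adapter output: standard LoRA multiplies by lora_alpha / r, while rank-stabilized LoRA multiplies by lora_alpha / sqrt(r). `lora_scaling` is a hypothetical helper that only illustrates the arithmetic.

```python
import math

def lora_scaling(lora_alpha, r, use_rslora=False):
    """Multiplier applied to the adapter update before it is added to the layer output."""
    if use_rslora:
        return lora_alpha / math.sqrt(r)  # rank-stabilized LoRA
    return lora_alpha / r                 # standard LoRA
```

With the "alpha = 2× rank" rule, standard LoRA always scales by 2.0 regardless of rank. With `use_rslora=True` the same rule gives a scale that grows with rank (for example 16.0 at r=16, alpha=64), so it may be worth revisiting `lora_alpha` or the learning rate when switching rsLoRA on.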

🔬 Target Modules — What to Include

For best results, target all attention + MLP projections: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj. Unsloth defaults to this automatically when you set target_modules="all-linear".

Local GPU Setup

GPU Selection Guide

All three GPUs below are excellent for Unsloth fine-tuning of 7–13B models in QLoRA mode. Here's what you need to know:

RTX 5070 (Recommended)
VRAM: 12 GB GDDR7
Arch: Blackwell (GB205)
CUDA Cores: 6,144
FP16 TFLOPS: ~120
Bandwidth: 672 GB/s
Best for: 7–13B QLoRA

RTX 5060 Ti (Best VRAM)
VRAM: 16 GB GDDR7
Arch: Blackwell (GB206)
CUDA Cores: 4,608
FP16 TFLOPS: ~89
Bandwidth: 448 GB/s
Best for: 13B QLoRA, 7B bf16

RTX 4060 Ti (Still Excellent)
VRAM: 16 GB GDDR6
Arch: Ada Lovelace (AD106)
CUDA Cores: 4,352
FP16 TFLOPS: ~79
Bandwidth: 288 GB/s
Best for: 13B QLoRA, 7B bf16

VRAM Budget by Model Size

Model Params	Training (NF4 QLoRA)	Inference (GGUF Q4_K_M)	12 GB Fits?	16 GB Fits?
1–3B	~3–5 GB	~1.5–2 GB	✅ Comfortable	✅
4–7B	~6–9 GB	~3–5 GB	✅	✅ With room
8–13B	~10–12 GB	~6–9 GB	⚠️ Tight (seq≤2048)	✅
14–30B	~16–28 GB	~10–20 GB	❌	⚠️ 14B only
70B+	>40 GB	>35 GB	❌	❌

⚠️ 12 GB VRAM Tips for 13B Models

Keep max_seq_length ≤ 2048, keep gradient checkpointing on (use_gradient_checkpointing="unsloth"), set per_device_train_batch_size=1, and use optim="adamw_8bit". With Unsloth, these settings allow 13B models to train on 12 GB with reasonable throughput.

Environment Setup

1. Install CUDA Toolkit (12.1+ recommended)
Unsloth requires CUDA 12.1+. For RTX 50 series, use CUDA 12.8 for Blackwell support.

2. Create a Python 3.10/3.11 environment
Unsloth works best with Python 3.10 or 3.11. Use conda or venv.

3. Install PyTorch with CUDA
Install the nightly build for RTX 50 series, or stable 2.3+ for RTX 40 series.

4. Install Unsloth
Use the pip install command with the correct CUDA suffix for your PyTorch version.

bash setup.sh — RTX 40 Series (CUDA 12.1)

# Step 1: Create and activate environment
conda create -n unsloth_env python=3.11 -y
conda activate unsloth_env

# Step 2: Install PyTorch 2.3+ with CUDA 12.1
pip install torch==2.3.1 torchvision torchaudio --index-url \
  https://download.pytorch.org/whl/cu121

# Step 3: Install Unsloth (stable, CUDA 12.1, PyTorch 2.3)
pip install "unsloth[cu121-torch230] @ git+https://github.com/unslothai/unsloth.git"

# Step 4: Install supporting libraries
pip install transformers datasets trl peft accelerate bitsandbytes
pip install wandb sentencepiece protobuf

bash setup.sh — RTX 50 Series (CUDA 12.8, Blackwell)

# RTX 5060 Ti / 5070 need PyTorch nightly + CUDA 12.8
conda create -n unsloth_env python=3.11 -y
conda activate unsloth_env

# Install PyTorch nightly (sm_120 Blackwell support)
pip install --pre torch torchvision torchaudio --index-url \
  https://download.pytorch.org/whl/nightly/cu128

# Install Unsloth from source (latest main)
pip install "unsloth @ git+https://github.com/unslothai/unsloth.git"

# Verify GPU is detected correctly
python -c "import torch; print(torch.cuda.get_device_name(0))"

python verify_install.py

import torch
import unsloth

print(f"PyTorch: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"GPU: {torch.cuda.get_device_name(0)}")
print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
print(f"Unsloth: {unsloth.__version__}")

WSL2 Tips

Training on WSL2 (Windows Subsystem for Linux) works well with Unsloth. A few key configuration points to maximize performance:

ini %USERPROFILE%\.wslconfig

[wsl2]
# Give WSL2 most of the host RAM (example for a 32 GB machine; leave ~4GB for Windows)
memory=28GB

# CPUs for data loading workers
processors=12

# Important: disable page reporting for GPU training stability
pageReporting=false

# Leave the "kernel" key unset to use the default WSL2 kernel (best CUDA compatibility)

[experimental]
# Release cached memory back to Windows gradually when WSL2 is idle
autoMemoryReclaim=gradual

💡 WSL2 Dataset Storage

Store your training datasets on the Linux filesystem (/home/username/) not on the Windows mount (/mnt/c/). I/O to Windows mounts is dramatically slower and will bottleneck your DataLoader, leaving your GPU starved.

🔥 RTX 5070 on WSL2

Blackwell GPUs on WSL2 require a recent NVIDIA Windows driver (570 branch or newer, which CUDA 12.8 depends on) and the CUDA 12.8 WSL2 package. Ensure the GPU is visible inside WSL2 before installing PyTorch: run nvidia-smi -L from your Linux shell and confirm the card is listed.

Fine-Tuning

Dataset Preparation

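TRL's SFTTrainer, which Unsloth patches, trains on text rendered through a chat template. The cleanest approach is to store every example as a list of conversation turns (ShareGPT or ChatML style) and let Unsloth's built-in formatting utilities apply the template. The sample below uses role/content keys; classic ShareGPT exports use from/value keys with human/gpt speakers instead. In the actual .jsonl file each record must sit on a single line (it is pretty-printed here for readability).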

Recommended Format: ShareGPT

json dataset_sample.jsonl

{
  "conversations": [
    {"role": "system",  "content": "You are an expert in Indian taxation..."},
    {"role": "user",    "content": "What is Section 80C deduction limit?"},
    {"role": "assistant","content": "Under Section 80C of the Income Tax Act..."}
  ]
}
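
If your source data is in the classic ShareGPT layout (`from`/`value` keys with `human`/`gpt` speakers), a small conversion step brings it into the role/content layout shown above. The following is a minimal sketch under that assumption; `sharegpt_to_messages` and `ROLE_MAP` are hypothetical names, and the validation rules (reject unknown speakers and empty turns) are illustrative choices.

```python
# Classic ShareGPT speaker names mapped to chat-template roles.
ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant"}

def sharegpt_to_messages(record):
    """Convert {"conversations": [{"from": ..., "value": ...}]} to role/content turns."""
    messages = []
    for turn in record["conversations"]:
        role = ROLE_MAP.get(turn["from"])
        if role is None:
            raise ValueError(f"unknown speaker: {turn['from']!r}")
        if not turn["value"].strip():
            raise ValueError("empty turn")
        messages.append({"role": role, "content": turn["value"]})
    return {"conversations": messages}
```

Running this over every record before training also catches empty or malformed examples early, which is cheaper than discovering them as a NaN loss mid-run.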

Loading and Formatting

python prepare_dataset.py

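from datasets import load_dataset
from unsloth.chat_templates import get_chat_template

# `tokenizer` is the one returned by FastLanguageModel.from_pretrained (see load_model.py)
# Apply chat template to tokenizer
tokenizer = get_chat_template(
    tokenizer,
    chat_template="qwen-2.5",  # or "llama-3", "mistral", etc.
    # The data already uses role/content keys, so the default mapping applies.
    # For from/value data pass: mapping={"role": "from", "content": "value",
    #                                    "user": "human", "assistant": "gpt"}
)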

def formatting_func(examples):
    convos = examples["conversations"]
    texts = [
        tokenizer.apply_chat_template(
            convo, tokenize=False, add_generation_prompt=False
        )
        for convo in convos
    ]
    return {"text": texts}

dataset = load_dataset("json", data_files="data/train.jsonl", split="train")
dataset = dataset.map(formatting_func, batched=True)

print(f"Dataset size: {len(dataset):,} examples")
print(dataset[0]["text"][:500])

🗂️ Data Quality > Data Quantity

For domain adaptation (like Arthavidya's finance focus), 2,000–5,000 high-quality curated examples often outperform 50,000 noisy ones. Use a reserved gold eval set (~500–1,000 examples) that never touches training to measure true domain performance.
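
One way to guarantee the gold set never leaks into training is to carve it out once, deterministically, before any other processing. The sketch below is an illustration using only the standard library; `split_gold_eval` is a hypothetical helper and the seed value is arbitrary.

```python
import random

def split_gold_eval(examples, eval_size, seed=42):
    """Shuffle deterministically and reserve `eval_size` examples for evaluation.

    Returns (train_examples, gold_eval_examples); the two never overlap.
    """
    if eval_size >= len(examples):
        raise ValueError("eval_size must be smaller than the dataset")
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    return shuffled[eval_size:], shuffled[:eval_size]
```

Because the shuffle is seeded, re-running the split yields the same gold set, so results stay comparable across training runs. Writing the gold split to its own file right away makes accidental reuse less likely.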

Config Reference

FastLanguageModel — Model Loading

python load_model.py

from unsloth import FastLanguageModel
import torch

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen2.5-7B-Instruct-bnb-4bit",  # Pre-quantized
    max_seq_length=4096,          # Reduce to 2048 for 12GB tight
    dtype=None,                    # Auto-detect (bf16 on RTX 40/50)
    load_in_4bit=True,             # QLoRA mode — essential for 12GB
    token=None,                    # HF token if using gated models
    device_map="sequential",       # Unsloth-preferred over "auto"
)

# Apply LoRA adapters
model = FastLanguageModel.get_peft_model(
    model,
    r=32,                        # Rank — 16 for memory-tight, 64 for best
    lora_alpha=64,               # Usually 2× rank
    lora_dropout=0.05,           # Small dropout for regularisation; 0 keeps Unsloth's fastest code path
    target_modules="all-linear",  # All q/k/v/o + gate/up/down
    use_gradient_checkpointing="unsloth",  # Unsloth's optimised version
    random_state=42,
    use_rslora=True,             # Rank-stabilised LoRA
    use_dora=False,              # Set True if you have spare VRAM
    loftq_config=None,
)

SFTTrainer Arguments — 12 GB Optimized

python trainer_config.py

import torch
from trl import SFTTrainer
from transformers import TrainingArguments

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    dataset_num_proc=4,
    packing=True,                 # Pack short sequences — boosts GPU util

    args=TrainingArguments(
        # Batch + gradient accumulation
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,  # Effective batch = 16

        # Learning rate schedule
        warmup_steps=50,
        num_train_epochs=3,
        learning_rate=2e-4,
        lr_scheduler_type="cosine",

        # Precision
        fp16=not torch.cuda.is_bf16_supported(),
        bf16=torch.cuda.is_bf16_supported(),     # bf16 on RTX 40/50

        # Memory optimisations
        optim="adamw_8bit",         # 8-bit Adam — saves ~1GB VRAM
        gradient_checkpointing=True,

        # Logging
        logging_steps=10,
        output_dir="./outputs",
        save_strategy="epoch",
        report_to="wandb",          # or "tensorboard" / "none"
        run_name="arthavidya-qwen25-7b-sft",

        # Stability
        weight_decay=0.01,
        max_grad_norm=1.0,
        seed=42,
    ),
)

trainer_stats = trainer.train()

Key Hyperparameters Cheatsheet

Param	Memory-Tight (12GB)	Comfortable (16GB)	Effect
`per_device_train_batch_size`	1–2	2–4	VRAM directly
`gradient_accumulation_steps`	8–16	4–8	Effective batch size
`max_seq_length`	1024–2048	2048–4096	VRAM (activations grow with sequence length)
`LoRA r (rank)`	16–32	32–64	Adapter capacity
`learning_rate`	1e-4 to 3e-4	1e-4 to 2e-4	Convergence speed
`packing`	True	True	GPU utilization
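
The relationship between batch size, gradient accumulation, and the number of optimizer steps can be sketched as below. This is an approximation for planning purposes: the exact step count depends on the Trainer version and on how many sequences remain after packing. `training_steps` is a hypothetical helper name.

```python
import math

def training_steps(num_examples, per_device_batch, grad_accum, epochs, num_gpus=1):
    """Return (effective_batch_size, approximate_total_optimizer_steps)."""
    effective_batch = per_device_batch * grad_accum * num_gpus
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return effective_batch, steps_per_epoch * epochs
```

For example, 4,000 examples with batch size 2, accumulation 8, and 3 epochs gives an effective batch of 16 and about 750 optimizer steps, so warmup_steps=50 covers under 7% of the run. Halving the batch size while doubling accumulation keeps both numbers unchanged and lowers peak VRAM.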

Full Training Loop

A complete, runnable training script combining all the above components. Copy-paste ready for a 7B QLoRA fine-tune on 12 GB VRAM.

python train.py — Complete SFT Script

#!/usr/bin/env python3
"""
Unsloth SFT Training Script
Optimised for 12–16 GB VRAM (RTX 5070, 5060 Ti, 4060 Ti)
"""

import torch
from unsloth import FastLanguageModel
from unsloth.chat_templates import get_chat_template
from datasets import load_dataset
from trl import SFTTrainer
from transformers import TrainingArguments
import wandb

# ─── CONFIG ──────────────────────────────
MODEL_NAME    = "unsloth/Qwen2.5-7B-Instruct-bnb-4bit"
MAX_SEQ_LEN   = 2048
OUTPUT_DIR    = "./outputs/arthavidya-v1"
DATASET_PATH  = "data/train.jsonl"
WANDB_PROJECT = "arthavidya"

# ─── LOAD MODEL ──────────────────────────
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=MODEL_NAME,
    max_seq_length=MAX_SEQ_LEN,
    dtype=None,
    load_in_4bit=True,
)

tokenizer = get_chat_template(tokenizer, chat_template="qwen-2.5")

# ─── APPLY LORA ──────────────────────────
model = FastLanguageModel.get_peft_model(
    model,
    r=32,
    lora_alpha=64,
    lora_dropout=0.05,
    target_modules="all-linear",
    use_gradient_checkpointing="unsloth",
    use_rslora=True,
    random_state=42,
)

model.print_trainable_parameters()  # prints the summary itself; it returns None

# ─── DATASET ─────────────────────────────
dataset = load_dataset("json", data_files=DATASET_PATH, split="train")

def format_examples(examples):
    return {"text": [
        tokenizer.apply_chat_template(c, tokenize=False, add_generation_prompt=False)
        for c in examples["conversations"]
    ]}

dataset = dataset.map(format_examples, batched=True)

# ─── TRAINER ─────────────────────────────
wandb.init(project=WANDB_PROJECT, name="qwen25-7b-sft-r32")

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=MAX_SEQ_LEN,
    dataset_num_proc=4,
    packing=True,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,
        warmup_steps=50,
        num_train_epochs=3,
        learning_rate=2e-4,
        lr_scheduler_type="cosine",
        fp16=not torch.cuda.is_bf16_supported(),
        bf16=torch.cuda.is_bf16_supported(),
        optim="adamw_8bit",
        weight_decay=0.01,
        max_grad_norm=1.0,
        logging_steps=10,
        save_strategy="epoch",
        output_dir=OUTPUT_DIR,
        report_to="wandb",
        seed=42,
    ),
)

# ─── TRAIN ───────────────────────────────
stats = trainer.train()
print(f"Training complete. Loss: {stats.training_loss:.4f}")

# ─── SAVE LORA ADAPTERS ──────────────────
model.save_pretrained(f"{OUTPUT_DIR}/lora-adapter")
tokenizer.save_pretrained(f"{OUTPUT_DIR}/lora-adapter")
print("LoRA adapter saved.")

Curriculum Training

Curriculum training exposes the model to data in a deliberate difficulty progression — starting with easy, well-structured examples and advancing to complex, nuanced ones. This mirrors how humans learn: foundational concepts first, edge cases later.

🌱

Tier 1 — Easy

Clear factual Q&A. Short, unambiguous responses. High-confidence data. Example: basic tax definitions, fund categories.

📚

Tier 2 — Medium

Multi-step reasoning. Moderate context. Example: comparing mutual fund categories, SEBI regulation summaries.

🧠

Tier 3 — Difficult

Complex financial scenarios with caveats and calculations. Nuanced regulatory interpretations. Requires domain expertise.

🎯

Tier 4 — Tricky

Edge cases, adversarial questions, ambiguous situations requiring careful hedging. The reserved gold eval set should sample this difficulty level too, but those eval examples must stay out of the training file.

python curriculum_train.py

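import torch
from datasets import load_dataset
from trl import SFTTrainer
from transformers import TrainingArguments

# Reuses model, tokenizer, format_examples and MAX_SEQ_LEN from train.py

TIERS = {
    "easy":      ("data/tier1_easy.jsonl",      1, 2e-4),
    "medium":    ("data/tier2_medium.jsonl",    1, 1.5e-4),
    "difficult": ("data/tier3_difficult.jsonl", 2, 1e-4),
    "tricky":    ("data/tier4_tricky.jsonl",    2, 5e-5),
}

for tier_name, (data_path, epochs, lr) in TIERS.items():
    print(f"\n{'='*50}")
    print(f"Training Tier: {tier_name.upper()}")
    print(f"LR: {lr}, Epochs: {epochs}")

    dataset = load_dataset("json", data_files=data_path, split="train")
    dataset = dataset.map(format_examples, batched=True)

    # Build a fresh trainer per tier: SFTTrainer tokenizes and packs the dataset
    # at construction time, and a new optimizer/scheduler picks up this tier's LR.
    trainer = SFTTrainer(
        model=model,
        tokenizer=tokenizer,
        train_dataset=dataset,
        dataset_text_field="text",
        max_seq_length=MAX_SEQ_LEN,
        packing=True,
        args=TrainingArguments(
            per_device_train_batch_size=2,
            gradient_accumulation_steps=8,
            num_train_epochs=epochs,
            learning_rate=lr,
            lr_scheduler_type="cosine",
            fp16=not torch.cuda.is_bf16_supported(),
            bf16=torch.cuda.is_bf16_supported(),
            optim="adamw_8bit",
            output_dir=f"checkpoints/{tier_name}",
            report_to="none",
            seed=42,
        ),
    )
    trainer.train()

    # Save checkpoint after each tier
    model.save_pretrained(f"checkpoints/after_{tier_name}")
    print(f"Tier {tier_name} complete.")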

Knowledge Distillation

Knowledge Distillation (KD) trains a smaller student model to mimic a larger, more capable teacher model. Instead of only training on hard labels (correct answers), the student learns from the teacher's soft output distribution — capturing nuance and uncertainty.

Standard SFT (Hard Labels)

Loss = Cross-entropy(student_logits, ground_truth_token)
Student only sees "right answer" — no information about near-misses.

KD with Soft Labels

Loss = α × T² × KL(softmax(teacher_logits / T) ‖ softmax(student_logits / T)) + (1−α) × CE(student_logits, ground_truth_token)
Student learns the teacher's confidence distribution — richer signal.

python kd_training.py — Soft-Label Distillation with Unsloth

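from unsloth import FastLanguageModel
import torch
import torch.nn.functional as F
from trl import SFTTrainer

# Load teacher_model: a larger model that shares the student's tokenizer and
# vocabulary, so the two logit tensors line up position by position.
# Load the student: a smaller model with LoRA adapters, via FastLanguageModel.
# Note: some Unsloth versions return empty logits during training unless the
# environment variable UNSLOTH_RETURN_LOGITS="1" is set before importing unsloth.

TEMPERATURE = 2.0     # Soften teacher distribution
ALPHA       = 0.7     # Weight for KD loss vs CE loss

def kd_loss(student_logits, teacher_logits, labels):
    # Causal LM: logits at position t predict token t+1, so shift before comparing
    student_logits = student_logits[:, :-1, :].float()
    teacher_logits = teacher_logits[:, :-1, :].float()
    labels = labels[:, 1:]
    mask = labels != -100  # skip prompt and padding positions

    # Standard CE loss (hard labels)
    ce_loss = F.cross_entropy(
        student_logits.reshape(-1, student_logits.size(-1)),
        labels.reshape(-1),
        ignore_index=-100,
    )

    # KL divergence with temperature scaling, averaged over supervised tokens
    student_soft = F.log_softmax(student_logits / TEMPERATURE, dim=-1)
    teacher_soft = F.softmax(teacher_logits / TEMPERATURE, dim=-1)
    token_kl = F.kl_div(student_soft, teacher_soft, reduction="none").sum(-1)
    kl_loss = (token_kl * mask).sum() / mask.sum().clamp(min=1)
    kl_loss = kl_loss * (TEMPERATURE ** 2)

    return ALPHA * kl_loss + (1 - ALPHA) * ce_loss

# Use with a custom Trainer that overrides compute_loss()
class KDTrainer(SFTTrainer):
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        # **kwargs absorbs extras such as num_items_in_batch in newer transformers
        with torch.no_grad():
            teacher_out = teacher_model(**inputs)
        student_out = model(**inputs)
        loss = kd_loss(
            student_out.logits,
            teacher_out.logits,
            inputs["labels"]
        )
        return (loss, student_out) if return_outputs else loss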

🌡️ Temperature Tuning

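T=1.0 uses the teacher's distribution as-is (the sharpest targets). T=2.0–4.0 softens it, exposing more of the teacher's ranking over near-miss tokens, which tends to generalise better. For finance domain KD, T=2.0 is a good starting point. Higher T can help when the teacher is very confident (low entropy output).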
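
The effect of temperature is easy to see on a toy example. The sketch below uses plain Python rather than PyTorch so it runs anywhere; the three logit values are made up for illustration.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits / temperature (temperature > 1 flattens the distribution)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

sharp = softmax_with_temperature([4.0, 2.0, 0.0], temperature=1.0)
soft = softmax_with_temperature([4.0, 2.0, 0.0], temperature=2.0)
```

At T=1 the top token takes roughly 0.87 of the probability mass; at T=2 it drops to roughly 0.67 and the runner-up tokens become visible to the student. The T² factor in the KD loss compensates for the smaller gradients that the softened distributions produce.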

Unsloth Studio

Unsloth Studio Overview

Unsloth Studio is a web-based no-code/low-code interface for fine-tuning models using Unsloth, launched in 2025. It runs locally in your browser (backed by a local server) or can connect to a remote GPU instance. Think of it as a visual front-end for everything in this handbook.

🖥️

Visual Config Builder

Build your LoRA config, training arguments, and dataset pipeline through a form UI. Generates the equivalent Python script for reproducibility.

📊

Live Training Dashboard

Real-time loss curves, VRAM usage, tokens/sec, GPU temperature — all in the browser without needing Weights & Biases.

🗃️

Dataset Manager

Upload JSONL files, preview formatting, validate conversation structure, and split train/eval without code.

📦

One-Click GGUF Export

Select quantization format (Q4_K_M, Q5_K_M, etc.) and export your merged model to GGUF directly from the Studio UI.

bash Launch Unsloth Studio

# Install Unsloth Studio (included with unsloth 2025.x)
pip install "unsloth[studio]"

# Launch the Studio server on localhost:7860
unsloth-studio

# Or specify a port and host
unsloth-studio --port 7860 --host 0.0.0.0

# Then open http://localhost:7860 in your browser

💡 Studio vs Script — When to use which

Use Studio for exploratory fine-tuning, quick experiments, and GGUF export. Use Python scripts for production pipelines, curriculum training, KD, DVC tracking, and W&B integration. Studio's "Export Script" button gives you the Python equivalent of any Studio config.

Studio Workflow

1. Select Model
Choose from the built-in model catalogue (Llama 3.2, Qwen 2.5, Mistral, Phi-4, Gemma 3, etc.) or paste any HuggingFace model ID. Studio auto-selects the correct chat template.

2. Upload Dataset
Drag-drop your JSONL file or paste a HuggingFace dataset path. Studio validates the format and shows a live preview of formatted training examples.

3. Configure LoRA & Training
Set rank, alpha, target modules via sliders. Set learning rate, epochs, batch size. Studio estimates VRAM usage dynamically as you adjust settings, which is great for 12 GB budgets.

4. Start Training
Click Train. Watch loss curves, VRAM, and speed in real-time. Training is interruptible: checkpoints are saved per epoch so you can resume.

5. Chat & Evaluate
After training, test your model directly in Studio's built-in chat interface. Compare responses side-by-side with the base model.

6. Export to GGUF
Select quantization type, click Export. Studio merges LoRA into base, converts to GGUF, and saves locally. Ready for Ollama or llama-server.

Export & Serve

Export to GGUF

Unsloth has a native GGUF export pipeline — no manual llama.cpp compilation required. It merges your LoRA adapter into the base model, converts to GGUF, and applies your chosen quantization, all in one call.

python export_gguf.py

# After training is complete, export to GGUF

# Option 1: Save as GGUF directly (multiple quant types at once)
model.save_pretrained_gguf(
    "arthavidya-qwen25-7b",     # Output folder name
    tokenizer,
    quantization_method=[
        "q4_k_m",               # Best balance — use for Ollama
        "q5_k_m",               # Higher quality
        "q8_0",                 # Near-lossless — for eval
    ]
)
# Produces: arthavidya-qwen25-7b-Q4_K_M.gguf, etc.

# Option 2: Push merged model to HuggingFace Hub
model.push_to_hub_gguf(
    "vivek-doshi/arthavidya-qwen25-7b-gguf",
    tokenizer,
    quantization_method="q4_k_m",
    token="hf_...",
)

# Option 3: Save merged safetensors (for HF Hub without GGUF)
model.save_pretrained_merged(
    "arthavidya-merged",
    tokenizer,
    save_method="merged_16bit",  # or "merged_4bit", "lora"
)

📁 What Gets Saved

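Each GGUF export writes one .gguf file per quantization type; the weights, tokenizer vocabulary, and metadata are all embedded in that single file. Depending on the Unsloth version, the export may also write a Modelfile template for Ollama that carries the chat template. Check the output folder, since file names vary between versions.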

Local Inference

Option A: llama-server (Direct, No Overhead)

llama-server is the HTTP API mode of llama.cpp. It provides an OpenAI-compatible REST endpoint — the same API format as the OpenAI SDK, making it a drop-in for your applications.

bash llama-server launch

# Download llama.cpp release binary (or build from source)
wget https://github.com/ggerganov/llama.cpp/releases/latest/download/llama-b3xxx-bin-ubuntu-x64.zip

# Launch server with CUDA offload
#   -ngl          GPU layers (set to 99 to fully GPU-offload)
#   -c            Context window
#   -np           Parallel slots (concurrent requests)
#   --flash-attn  Enable Flash Attention (faster, less VRAM)
# Comments are kept above the command: a comment after a trailing backslash breaks the line continuation.
./llama-server \
  -m ./arthavidya-qwen25-7b-Q4_K_M.gguf \
  -ngl 40 \
  -c 4096 \
  --host 0.0.0.0 \
  --port 8080 \
  -np 4 \
  --flash-attn

# Server now running at http://localhost:8080
# OpenAI-compatible endpoint: http://localhost:8080/v1/chat/completions

python query_llama_server.py

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="local")

response = client.chat.completions.create(
    model="arthavidya",
    messages=[
        {"role": "system",  "content": "You are Arthavidya, an Indian finance expert."},
        {"role": "user",    "content": "Explain ELSS tax saving funds."},
    ],
    temperature=0.7,
    max_tokens=512,
)
print(response.choices[0].message.content)

GPU Layer Recommendations for 12–16 GB

Model + Quant	GPU	-ngl (GPU layers)	RAM fallback
7B Q4_K_M	12 GB	-ngl 99	None (fully GPU)
7B Q8_0	12 GB	-ngl 32	~2 GB RAM
13B Q4_K_M	12 GB	-ngl 30	~3 GB RAM
13B Q4_K_M	16 GB	-ngl 99	None
4B Q6_K	12 GB	-ngl 99	None (much room)
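
For combinations not listed in the table above, a first guess for -ngl can be derived from the model file size. The sketch below assumes all layers are the same size and reserves a fixed amount of VRAM for the KV cache and CUDA buffers; both are simplifications, and `estimate_ngl`, the 2 GB reserve, and the layer counts in the examples are assumptions rather than llama.cpp behaviour.

```python
def estimate_ngl(model_file_gb, n_layers, vram_gb, reserve_gb=2.0):
    """Estimate how many transformer layers fit on the GPU.

    Treats every layer as model_file_gb / n_layers and keeps reserve_gb free
    for KV cache, CUDA context, and scratch buffers.
    """
    per_layer_gb = model_file_gb / n_layers
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    return min(n_layers, int(usable_gb // per_layer_gb))
```

For a 7.2 GB 7B Q8_0 file with 32 layers on an 8 GB card, this suggests about 26 layers. Treat the result as a starting point: begin there, watch nvidia-smi, and lower the value if the server runs out of memory at your chosen context size. Passing -ngl 99 simply means "offload every layer".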

Ollama Integration

Ollama is the easiest way to manage and run GGUF models locally. It handles model storage, automatic GPU offloading, and exposes the same OpenAI-compatible API. Ideal for running your Arthavidya model alongside other models with hot-swap.

bash Modelfile — Custom Ollama Model

# Create a Modelfile for your fine-tuned model
FROM ./arthavidya-qwen25-7b-Q4_K_M.gguf

# Set system prompt
SYSTEM """
You are Arthavidya, a knowledgeable Indian financial assistant.
You provide accurate, clear guidance on Indian taxation, mutual funds,
insurance, and personal finance. Always cite relevant sections
(e.g. Section 80C, SEBI regulations) where applicable.
"""

# Parameters (num_gpu 99 offloads as many layers as possible to the GPU)
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER num_ctx 4096
PARAMETER num_gpu 99
PARAMETER num_thread 8

bash ollama_commands.sh

# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Create your custom model from Modelfile
ollama create arthavidya -f ./Modelfile

# Run interactively
ollama run arthavidya

# List all local models
ollama list

# Pull a base model (for comparison)
ollama pull qwen2.5:7b-instruct-q4_K_M

# Query via REST API (OpenAI-compatible)
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "arthavidya",
    "messages": [{"role":"user","content":"What is a SIP?"}]
  }'

llama-server vs Ollama — When to Use Which

🔧 llama-server

Use when: maximum control over context, batch size, threading; running in production pipeline; need consistent version pinning; want to manage manually (as in Arthavidya pipeline).

🐳 Ollama

Use when: managing multiple models, hot-swapping between base and fine-tuned, easy integration with LangChain/LlamaIndex, or sharing with teammates via simple API.

Reference

Troubleshooting

Common Errors & Fixes

CUDA Out of Memory (OOM)

fix

# Try these in order until OOM resolves:
1. Reduce max_seq_length:    4096 → 2048 → 1024
2. Reduce batch size:        per_device_train_batch_size=1
3. Increase grad accum:     gradient_accumulation_steps=16
4. Use 8-bit Adam:           optim="adamw_8bit"
5. Enable grad checkpointing: use_gradient_checkpointing="unsloth"
6. Reduce LoRA rank:         r=32 → r=16
7. Disable packing temporarily and check GPU util with nvidia-smi

Triton / CUDA Kernel Errors on RTX 50 Series

fix

# Blackwell (sm_120) needs a CUDA 12.8 build of PyTorch and a Triton release with Blackwell support
pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu128
pip install -U "triton>=3.3.0"  # or latest nightly

# Also ensure your NVIDIA driver is from the 570 branch or newer
nvidia-smi  # Check driver version

Slow Training — Tokens/sec Too Low

diagnose

# Check GPU utilization during training
watch -n 1 nvidia-smi

# If GPU util < 80%, your DataLoader is bottlenecking:
# - Increase dataset_num_proc (e.g., 4 → 8)
# - Enable packing=True to densify batches
# - Move dataset to Linux FS (not /mnt/c/ in WSL2)
# - Pre-tokenize and cache dataset before training

NaN / Inf Loss

fix

# NaN loss usually means LR too high or bad data
1. Lower LR: 2e-4 → 5e-5
2. Add max_grad_norm=1.0 to clip gradients
3. Inspect dataset for empty or malformed examples
4. Try bf16=False, fp16=True (or vice versa)
5. Increase warmup_steps to 100+

Unsloth vs PEFT Version Mismatch

bash

# Always pin compatible versions together
pip install "peft==0.11.1" "trl==0.9.4" "transformers==4.44.0"
pip install "unsloth @ git+https://github.com/unslothai/unsloth.git"

# Check what's installed
pip show unsloth peft trl transformers | grep -E "Name|Version"

Quick Cheat Sheet

Model Names (Unsloth Hub)

unsloth/Qwen2.5-7B-Instruct-bnb-4bit
unsloth/Qwen2.5-14B-Instruct-bnb-4bit
unsloth/Llama-3.2-3B-Instruct-bnb-4bit
unsloth/Llama-3.1-8B-Instruct-bnb-4bit
unsloth/mistral-7b-instruct-v0.3-bnb-4bit
unsloth/Phi-4-bnb-4bit
unsloth/gemma-3-12b-it-bnb-4bit

Key CLI Commands

bash

# Monitor VRAM during training
watch -n 1 nvidia-smi --query-gpu=memory.used,memory.free,utilization.gpu --format=csv

# Rough size check: whitespace word count of the formatted "text" column (use the tokenizer for exact token counts)
python -c "from datasets import load_dataset; d=load_dataset('json',data_files='train.jsonl',split='train'); print(sum(len(x.split()) for x in d['text']))"

# Test GGUF model with llama.cpp CLI
./llama-cli -m model.gguf -p "What is ELSS?" --temp 0.7 -n 256

# Check Ollama GPU usage
ollama ps

# Export LoRA adapter to merged safetensors
python -c "
from unsloth import FastLanguageModel
m,t = FastLanguageModel.from_pretrained('./lora-adapter')
m.save_pretrained_merged('./merged', t, save_method='merged_16bit')
"

Hyperparameter Quick-Reference

Goal	Key Levers
Reduce VRAM	↓ seq_len, ↓ batch_size, ↑ grad_accum, ↓ rank, enable grad_checkpoint
Faster training	↑ batch_size, packing=True, ↑ dataset_num_proc, bf16=True
Better quality	↑ rank, use_rslora, ↑ epochs, better data curation
Prevent overfitting	↑ dropout, ↓ epochs, weight_decay=0.01, eval set monitoring
Stable training	↓ LR, warmup_steps=100+, max_grad_norm=1.0, cosine schedule

Useful Links

github.com/unslothai/unsloth
huggingface.co/unsloth
github.com/ggerganov/llama.cpp
ollama.com
docs.unsloth.ai
wandb.ai
