Unsloth Fine-Tuning Handbook
V1
🦥 The Practitioner's Guide

Unsloth Fine-Tuning
Handbook

A complete reference for fine-tuning, quantizing, and running LLMs locally on consumer GPUs (12–16 GB VRAM). Covers Unsloth, Unsloth Studio, LoRA/QLoRA, GGUF export, and serving with Ollama — optimized for RTX 5070, 5060 Ti, and 4060 Ti.

Unsloth 2025 · GGUF / llama.cpp · QLoRA / LoRA · RTX 50 / 40 Series · Ollama + llama-server

What is Unsloth?

Unsloth is an open-source library built specifically to make LLM fine-tuning radically faster and more memory-efficient on consumer hardware — without sacrificing accuracy. Created by Daniel and Michael Han, it achieves 2–5× speedups over baseline HuggingFace + PEFT training while using up to 80% less VRAM.

Under the hood, Unsloth rewrites the performance-critical operations (attention, cross-entropy loss, RoPE embeddings, LoRA weight updates) as custom kernels in OpenAI's Triton language with hand-derived backward passes, and patches them directly into HuggingFace Transformers, so you keep the familiar Trainer API while getting dramatically better hardware utilization.

🚀
2–5× Faster Training
Custom Triton kernels replace bottleneck ops. Llama 3 fine-tuning can drop from hours to 30–40 minutes on a single RTX 4060 Ti.
💾
80% Less VRAM
QLoRA integration plus rewritten gradient checkpointing means 7B models fit comfortably in 12 GB VRAM during training, and 13B models fit with tight settings.
🎯
Zero Accuracy Loss
Math is numerically equivalent to standard training. No approximations that affect gradient quality or final model performance.
🔌
HuggingFace Native
Works with Trainer, TRL's SFTTrainer, DPO, ORPO, GRPO. Swap in FastLanguageModel and existing code runs faster automatically.
2–5× Speed vs Baseline
80% VRAM Reduction
0% Accuracy Loss
13B Max on 12GB VRAM
70+ Supported Models

Why Unsloth vs Alternatives?

Feature | Unsloth | HF + PEFT | Axolotl | LLaMA Factory
VRAM Efficiency | Best | Baseline | Good | Good
Speed2–5×~1.2×~1.3×
Consumer GPU Support | ✅ Excellent | Limited | Good | Good
GGUF Export✅ NativeManualManual
Unsloth Studio (no-code)
DPO / ORPO / GRPO
Multi-GPU⚠️ Limited
💡 When to choose Unsloth

Unsloth is the obvious pick for single-GPU consumer setups (RTX 40/50 series, 12–24 GB VRAM). For multi-GPU cloud training, HF + PEFT or Axolotl remain competitive, but Unsloth's VRAM savings still help.

GGUF Explained

GGUF (GPT-Generated Unified Format) is a binary file format designed by Georgi Gerganov (creator of llama.cpp) as a replacement for the older GGML format. It stores model weights plus all metadata needed to run inference in a single self-contained file — no Python, no HuggingFace Hub required.
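
To make the "single self-contained file" idea concrete, here is a minimal sketch that reads the fixed header at the start of a GGUF file. It assumes the version 2+ layout (4-byte magic, little-endian uint32 version, uint64 tensor count, uint64 metadata entry count); the function name and the synthetic header values are illustrative only, not taken from a real model.

```python
import io
import struct

GGUF_MAGIC = b"GGUF"

def read_gguf_header(stream):
    """Parse the fixed-size header at the start of a GGUF (v2+) file."""
    magic = stream.read(4)
    if magic != GGUF_MAGIC:
        raise ValueError(f"not a GGUF file (magic={magic!r})")
    # Little-endian: uint32 version, uint64 tensor count, uint64 metadata count
    version, n_tensors, n_kv = struct.unpack("<IQQ", stream.read(20))
    return {"version": version, "tensors": n_tensors, "metadata_entries": n_kv}

# Synthetic header for demonstration; a real file continues with
# the metadata key/value pairs and the tensor data.
fake_file = io.BytesIO(GGUF_MAGIC + struct.pack("<IQQ", 3, 291, 24))
header = read_gguf_header(fake_file)
print(header)
```

With a real model, the same function can be called on `open("model.gguf", "rb")` as a quick sanity check before handing the file to Ollama or llama-server.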

Why GGUF matters

GGUF models run via llama.cpp, which is a pure C++ inference engine with CUDA/Metal acceleration. This means:

  • Single binary, no Python dependency
  • CPU + GPU hybrid inference (great for VRAM-limited setups)
  • Quantized models (Q4_K_M, Q5_K_M, etc.) that are tiny but high quality
  • Works with Ollama, LM Studio, llama-server, Jan, AnythingLLM

GGUF vs SafeTensors

Property | SafeTensors | GGUF
Python req? | ✅ Yes | ❌ No
Quantization | External (e.g. bitsandbytes) | Native
Metadata | Separate | Embedded
CPU inference | Slow | Fast
Ollama compat

The GGUF Quantization Ladder

Quantization compresses model weights to lower bit-widths. GGUF supports many formats — here's what matters for 12–16 GB GPUs:

Format | Bits | 7B Size | 13B Size | Quality | Recommended?
Q2_K | 2-bit | ~2.7 GB | ~5.0 GB | ⭐⭐ | Last resort
Q4_0 | 4-bit | ~3.8 GB | ~7.3 GB | ⭐⭐⭐ | Fast, okay quality
Q4_K_M | 4-bit | ~4.1 GB | ~7.9 GB | ⭐⭐⭐⭐ | Best Value
Q5_K_M | 5-bit | ~4.8 GB | ~9.1 GB | ⭐⭐⭐⭐½ | Quality sweet spot
Q6_K | 6-bit | ~5.6 GB | ~10.7 GB | ⭐⭐⭐⭐⭐ | Near-lossless
Q8_0 | 8-bit | ~7.2 GB | ~13.8 GB | ⭐⭐⭐⭐⭐ | Near fp16 quality
F16 | 16-bit | ~14.0 GB | ~26 GB | Perfect | 12 GB: 7B only, with partial CPU offload
⚡ For 12 GB VRAM

Use Q4_K_M for 13B models (fits with room for KV cache) or Q5_K_M for 7B models when you want higher quality. For the Arthavidya 4B student model, even Q6_K or Q8_0 fits comfortably.
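
The sizing advice can be sanity-checked with a little arithmetic. The sketch below scales the 7B sizes from the quantization table linearly with parameter count, which is only a rough approximation; `reserve_gb` is a hypothetical allowance for KV cache and runtime overhead, not a measured value.

```python
# Approximate 7B file sizes in GB, copied from the quantization table
SIZE_7B_GB = {
    "Q2_K": 2.7, "Q4_0": 3.8, "Q4_K_M": 4.1, "Q5_K_M": 4.8,
    "Q6_K": 5.6, "Q8_0": 7.2, "F16": 14.0,
}

def estimate_gguf_gb(params_billion, quant):
    """Scale the 7B table entry linearly with parameter count."""
    return SIZE_7B_GB[quant] * params_billion / 7.0

def fits_in_vram(params_billion, quant, vram_gb, reserve_gb=2.0):
    """True if the estimated file plus a reserve fits in VRAM."""
    return estimate_gguf_gb(params_billion, quant) + reserve_gb <= vram_gb

# A 4B model on a 12 GB card at three quantization levels
for quant in ("Q4_K_M", "Q6_K", "Q8_0"):
    size = round(estimate_gguf_gb(4, quant), 2)
    print(quant, size, "GB, fits:", fits_in_vram(4, quant, 12))
```

Linear scaling slightly underestimates the 13B rows of the table, so treat the result as a first filter and confirm with the real file size.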

Quantization Formats

Unsloth supports multiple quantization strategies during training — these are distinct from GGUF formats, which apply at inference time.

4-bit NF4 (QLoRA)
NormalFloat4 — stores base model weights in 4-bit with a Normal distribution mapping. The LoRA adapters themselves remain in float16/bfloat16. Best VRAM/quality ratio for training.
4-bit Int4
Integer 4-bit quantization. Slightly more VRAM-efficient than NF4 but marginally lower quality. Useful for very tight VRAM budgets.
8-bit (LLM.int8)
Mixed-precision 8-bit. Good quality but uses more VRAM than NF4. Rarely needed with Unsloth — NF4 is usually better.
bfloat16 (bf16)
16-bit training without quantization on Ampere+ GPUs (RTX 30/40/50). Best quality, but the base weights alone take about 4× the memory of NF4. Use for small models (≤4B) that fit.

LoRA & QLoRA

LoRA (Low-Rank Adaptation) freezes the original model weights and injects small trainable matrices into specific layers. Instead of updating billions of parameters, you train only millions — typically 0.1–3% of total parameters.

QLoRA = LoRA applied to a 4-bit quantized base model. The frozen weights live in NF4; the adapters are bf16/fp16. Unsloth's innovation is making this combination dramatically faster than the original bitsandbytes QLoRA implementation.
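
To see where the "millions instead of billions" figure comes from, here is a small sketch that counts LoRA parameters. A rank-r adapter on a d_out × d_in weight adds r × (d_in + d_out) parameters. The layer shapes and the base parameter count below are hypothetical 7B-class values chosen for illustration, and grouped-query attention is ignored.

```python
def lora_params(d_in, d_out, r):
    """A rank-r adapter adds A (r x d_in) and B (d_out x r)."""
    return r * (d_in + d_out)

# Hypothetical 7B-class shapes: hidden size 4096, MLP size 11008, 32 layers
HIDDEN, MLP, LAYERS = 4096, 11008, 32
BASE_PARAMS = 6.7e9  # hypothetical base model size

# (d_in, d_out) for each adapted projection in one transformer layer
SHAPES = {
    "q_proj": (HIDDEN, HIDDEN), "k_proj": (HIDDEN, HIDDEN),
    "v_proj": (HIDDEN, HIDDEN), "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP), "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}

def total_lora_params(r):
    per_layer = sum(lora_params(d_in, d_out, r) for d_in, d_out in SHAPES.values())
    return per_layer * LAYERS

for rank in (16, 32, 64):
    n = total_lora_params(rank)
    print(f"r={rank}: {n:,} trainable params ({100 * n / BASE_PARAMS:.2f}% of base)")
```

Because the count is linear in r, doubling the rank doubles the adapter size, which is why rank is one of the first levers to pull when VRAM is tight.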

Key LoRA Hyperparameters

Parameter | What it does | 12GB Recommendation
r (rank) | Size of LoRA adapter matrices. Higher = more capacity. | 16–64
lora_alpha | Scaling factor. Often set to 2× rank. | 32–128
lora_dropout | Regularization dropout on adapters. | 0.05–0.1
target_modules | Which weight matrices to adapt. | All linear layers
use_dora | DoRA variant: better quality, slightly more VRAM. | True if fits
use_rslora | Rank-stabilized LoRA: better than vanilla for high rank. | True
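
The interaction between r, lora_alpha, and use_rslora is easier to see numerically. In standard LoRA the adapter output is scaled by lora_alpha / r, while rank-stabilized LoRA scales by lora_alpha / sqrt(r). The sketch below applies the "alpha = 2× rank" heuristic to both; the function name is illustrative.

```python
import math

def lora_scale(lora_alpha, r, use_rslora=False):
    """Multiplier applied to the adapter output B @ A @ x."""
    if use_rslora:
        return lora_alpha / math.sqrt(r)
    return lora_alpha / r

for r in (16, 32, 64):
    alpha = 2 * r
    standard = lora_scale(alpha, r)
    stabilized = lora_scale(alpha, r, use_rslora=True)
    print(f"r={r} alpha={alpha} standard={standard} rslora={stabilized:.2f}")
```

With the 2× heuristic the standard scale stays constant across ranks, whereas the rsLoRA scale grows with sqrt(r). If training becomes unstable after enabling use_rslora, a lower lora_alpha or learning rate may be worth trying.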
🔬 Target Modules — What to Include

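For best results, target all attention + MLP projections: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj. These seven modules are also Unsloth's default target list. Passing them as an explicit list is the most portable choice, since PEFT's "all-linear" shortcut may not be accepted by every Unsloth release.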

GPU Selection Guide

All three GPUs below are excellent for Unsloth fine-tuning of 7–13B models in QLoRA mode. Here's what you need to know:

RTX 5070 (Recommended)
VRAM: 12 GB GDDR7
Arch: Blackwell (GB205)
CUDA Cores: 6,144
FP16 TFLOPS: ~120
Bandwidth: 672 GB/s
Best for: 7–13B QLoRA

RTX 5060 Ti (Best VRAM)
VRAM: 16 GB GDDR7
Arch: Blackwell (GB206)
CUDA Cores: 4,608
FP16 TFLOPS: ~89
Bandwidth: 448 GB/s
Best for: 13B QLoRA, 7B bf16

RTX 4060 Ti (Still Excellent)
VRAM: 16 GB GDDR6
Arch: Ada Lovelace (AD106)
CUDA Cores: 4,352
FP16 TFLOPS: ~79
Bandwidth: 288 GB/s
Best for: 13B QLoRA, 7B bf16

VRAM Budget by Model Size

Model Params | Training (NF4 QLoRA) | Inference (GGUF Q4_K_M) | 12 GB Fits? | 16 GB Fits?
1–3B | ~3–5 GB | ~1.5–2 GB | ✅ Comfortable |
4–7B | ~6–9 GB | ~3–5 GB | ✅ With room |
8–13B | ~10–12 GB | ~6–9 GB | ⚠️ Tight (seq≤2048) |
14–30B | ~16–28 GB | ~10–20 GB | | ⚠️ 14B only
70B+ | >40 GB | >35 GB | |
⚠️ 12 GB VRAM Tips for 13B Models

Keep max_seq_length ≤ 2048, enable gradient checkpointing (use_gradient_checkpointing="unsloth"), set per_device_train_batch_size=1, and use optim="adamw_8bit". With Unsloth these settings allow 13B models to train on 12 GB with reasonable throughput.
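
A rough way to see why these settings matter is to separate the memory that is fixed from the memory that grows with batch size and sequence length. The sketch below estimates only the fixed part, assuming NF4 base weights (0.5 bytes each), 16-bit adapters and gradients, and two 1-byte states per trainable parameter for 8-bit Adam. It ignores quantization constants, activations, and the CUDA context, and the example sizes are hypothetical.

```python
def static_training_gb(base_params, lora_params):
    """Memory independent of batch size and sequence length, in GB."""
    base = base_params * 0.5      # NF4: 4 bits = 0.5 bytes per frozen weight
    adapters = lora_params * 2    # 16-bit adapter weights
    grads = lora_params * 2       # 16-bit gradients, adapters only
    adam_8bit = lora_params * 2   # two 1-byte optimizer states per weight
    return (base + adapters + grads + adam_8bit) / 1e9

# Hypothetical sizes: a 13B base with about 60M trainable LoRA parameters
fixed = static_training_gb(13e9, 60e6)
print(f"fixed cost: {fixed:.2f} GB, left on a 12 GB card: {12 - fixed:.2f} GB")
```

Whatever remains after the fixed cost has to hold activations, which is the part that max_seq_length, batch size, and gradient checkpointing control.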

Environment Setup

1. Install CUDA Toolkit (12.1+ recommended)
   Unsloth works best with CUDA 12.1 or newer. For RTX 50 series, use CUDA 12.8 for Blackwell support.
2. Create a Python 3.10/3.11 environment
   Unsloth works best with Python 3.10 or 3.11. Use conda or venv.
3. Install PyTorch with CUDA
   Install a CUDA 12.8 (cu128) build for RTX 50 series (nightly, or stable 2.7+), or stable 2.3+ for RTX 40 series.
4. Install Unsloth
   Use the pip install command with the extras tag that matches your CUDA and PyTorch versions (for example cu121-torch230).
bash setup.sh — RTX 40 Series (CUDA 12.1)
# Step 1: Create and activate environment
conda create -n unsloth_env python=3.11 -y
conda activate unsloth_env

# Step 2: Install PyTorch 2.3+ with CUDA 12.1
pip install torch==2.3.1 torchvision torchaudio --index-url \
  https://download.pytorch.org/whl/cu121

# Step 3: Install Unsloth (stable, CUDA 12.1, PyTorch 2.3)
pip install "unsloth[cu121-torch230] @ git+https://github.com/unslothai/unsloth.git"

# Step 4: Install supporting libraries
pip install transformers datasets trl peft accelerate bitsandbytes
pip install wandb sentencepiece protobuf
bash setup.sh — RTX 50 Series (CUDA 12.8, Blackwell)
# RTX 5060 Ti / 5070 need PyTorch nightly + CUDA 12.8
conda create -n unsloth_env python=3.11 -y
conda activate unsloth_env

# Install PyTorch nightly (sm_120 Blackwell support)
pip install --pre torch torchvision torchaudio --index-url \
  https://download.pytorch.org/whl/nightly/cu128

# Install Unsloth from source (latest main)
pip install "unsloth @ git+https://github.com/unslothai/unsloth.git"

# Verify GPU is detected correctly
python -c "import torch; print(torch.cuda.get_device_name(0))"
python verify_install.py
import torch
import unsloth

print(f"PyTorch: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"GPU: {torch.cuda.get_device_name(0)}")
print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
print(f"Unsloth: {unsloth.__version__}")

WSL2 Tips

Training on WSL2 (Windows Subsystem for Linux) works well with Unsloth. A few key configuration points to maximize performance:

ini %USERPROFILE%\.wslconfig
[wsl2]
# Give WSL2 enough RAM (example for a 32 GB host; leave ~4 GB for Windows)
memory=28GB

# CPUs for data loading workers (set to your core count or lower)
processors=12

# Important: disable page reporting for GPU training stability
pageReporting=false

# Leave the kernel setting unset to use the default WSL2 kernel
# kernel=

[experimental]
# Best-effort DNS caching (a networking tweak; it does not affect GPU memory)
bestEffortDnsCaching=true
💡 WSL2 Dataset Storage

Store your training datasets on the Linux filesystem (/home/username/) not on the Windows mount (/mnt/c/). I/O to Windows mounts is dramatically slower and will bottleneck your DataLoader, leaving your GPU starved.
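
A small guard at the top of a training script can catch this before a long run starts. This is a minimal sketch that only inspects the path string, assuming the default WSL2 layout where Windows drives are mounted at /mnt/<letter>/; the function name is illustrative, and relative paths or symlinks are not resolved.

```python
import re
from pathlib import PurePosixPath

# Matches /mnt/c, /mnt/c/..., /mnt/d/... but not /mnt/cache/...
_WINDOWS_MOUNT = re.compile(r"^/mnt/[a-z](/|$)")

def on_windows_mount(path):
    """True if a WSL2 path points at a Windows drive mount."""
    return bool(_WINDOWS_MOUNT.match(str(PurePosixPath(path))))

for candidate in ("/mnt/c/Users/me/train.jsonl", "/home/me/data/train.jsonl"):
    status = "slow Windows mount" if on_windows_mount(candidate) else "Linux filesystem"
    print(candidate, "->", status)
```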

🔥 RTX 5070 on WSL2

Blackwell GPUs on WSL2 require a recent Windows NVIDIA driver (570 series or newer, which CUDA 12.8 needs) and the CUDA 12.8 WSL2 package. Ensure nvidia-smi works inside WSL2 before installing PyTorch. Check with nvidia-smi -L from your Linux shell.

Dataset Preparation

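TRL's SFTTrainer, which Unsloth patches, trains on chat-formatted text. The cleanest approach is to convert everything to one conversation format and apply the model's chat template with Unsloth's built-in formatting utilities.

Recommended Format: role/content Conversations

The sample below stores ChatML-style role/content messages under a conversations key. Classic ShareGPT exports use from/value keys instead and need the mapping argument of get_chat_template. The record is pretty-printed here for readability; in a .jsonl file each record sits on a single line.

json dataset_sample.jsonl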
{
  "conversations": [
    {"role": "system",  "content": "You are an expert in Indian taxation..."},
    {"role": "user",    "content": "What is Section 80C deduction limit?"},
    {"role": "assistant","content": "Under Section 80C of the Income Tax Act..."}
  ]
}
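
Before training, it can help to validate every line of the JSONL file against this layout. The following is a minimal sketch, assuming the conversations / role / content keys shown in the sample; the function name and the rule that the final turn must come from the assistant are illustrative choices rather than Unsloth requirements.

```python
import json

VALID_ROLES = {"system", "user", "assistant"}

def validate_record(line):
    """Return a list of problems found in one JSONL line (empty means OK)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    turns = record.get("conversations") if isinstance(record, dict) else None
    if not isinstance(turns, list) or not turns:
        return ["missing or empty 'conversations' list"]
    problems = []
    for i, turn in enumerate(turns):
        if not isinstance(turn, dict):
            problems.append(f"turn {i}: not an object")
            continue
        if turn.get("role") not in VALID_ROLES:
            problems.append(f"turn {i}: unknown role {turn.get('role')!r}")
        if not str(turn.get("content") or "").strip():
            problems.append(f"turn {i}: empty content")
    last = turns[-1]
    if not (isinstance(last, dict) and last.get("role") == "assistant"):
        problems.append("last turn is not from the assistant")
    return problems

good = '{"conversations": [{"role": "user", "content": "Hi"}, {"role": "assistant", "content": "Hello"}]}'
bad = '{"conversations": [{"role": "human", "content": ""}]}'
print(validate_record(good))
print(validate_record(bad))
```

Running this over each line of the file and dropping or fixing flagged records also addresses the "empty or malformed examples" cause of NaN loss.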

Loading and Formatting

python prepare_dataset.py
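from datasets import load_dataset
from transformers import AutoTokenizer
from unsloth.chat_templates import get_chat_template

# Reuse the tokenizer returned by FastLanguageModel.from_pretrained if the
# model is already loaded; otherwise load the tokenizer on its own.
tokenizer = AutoTokenizer.from_pretrained("unsloth/Qwen2.5-7B-Instruct-bnb-4bit")

# Apply chat template to tokenizer. No mapping is needed because the data
# already uses role/content keys; pass mapping= only for from/value data.
tokenizer = get_chat_template(
    tokenizer,
    chat_template="qwen-2.5",  # or "llama-3", "mistral", etc.
)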

def formatting_func(examples):
    convos = examples["conversations"]
    texts = [
        tokenizer.apply_chat_template(
            convo, tokenize=False, add_generation_prompt=False
        )
        for convo in convos
    ]
    return {"text": texts}

dataset = load_dataset("json", data_files="data/train.jsonl", split="train")
dataset = dataset.map(formatting_func, batched=True)

print(f"Dataset size: {len(dataset):,} examples")
print(dataset[0]["text"][:500])
🗂️ Data Quality > Data Quantity

For domain adaptation (like Arthavidya's finance focus), 2,000–5,000 high-quality curated examples can outperform 50,000 noisy ones. Use a reserved gold eval set (~500–1,000 examples) that never touches training to measure true domain performance.
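
One way to guarantee that the gold eval set never leaks into training is to assign each example to a split deterministically from a stable ID, so the assignment survives re-shuffling and dataset growth. This is a minimal sketch; the ID format and the 10% eval fraction are hypothetical.

```python
import hashlib

def assign_split(example_id, eval_fraction=0.1):
    """Deterministically map an example ID to 'eval' or 'train'."""
    digest = hashlib.sha256(example_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return "eval" if bucket < eval_fraction else "train"

ids = [f"example-{i}" for i in range(1000)]
splits = [assign_split(example_id) for example_id in ids]
print("eval:", splits.count("eval"), "train:", splits.count("train"))
```

Because the split depends only on the ID, an example that lands in the eval set stays there across every future version of the dataset.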

Config Reference

FastLanguageModel — Model Loading

python load_model.py
from unsloth import FastLanguageModel
import torch

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen2.5-7B-Instruct-bnb-4bit",  # Pre-quantized
    max_seq_length=4096,          # Reduce to 2048 for 12GB tight
    dtype=None,                    # Auto-detect (bf16 on RTX 40/50)
    load_in_4bit=True,             # QLoRA mode — essential for 12GB
    token=None,                    # HF token if using gated models
    device_map="sequential",       # Unsloth-preferred over "auto"
)

# Apply LoRA adapters
model = FastLanguageModel.get_peft_model(
    model,
    r=32,                        # Rank — 16 for memory-tight, 64 for best
    lora_alpha=64,               # Usually 2× rank
    lora_dropout=0.05,           # Small dropout; Unsloth's fastest path expects 0
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],  # All attention + MLP projections
    use_gradient_checkpointing="unsloth",  # Unsloth's optimised version
    random_state=42,
    use_rslora=True,             # Rank-stabilised LoRA
    use_dora=False,              # Set True if you have spare VRAM
    loftq_config=None,
)

SFTTrainer Arguments — 12 GB Optimized

python trainer_config.py
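import torch
from trl import SFTTrainer
from transformers import TrainingArguments

# Argument names follow TRL 0.9.x; newer TRL releases move the dataset
# options (dataset_text_field, max_seq_length, packing) into SFTConfig.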

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    dataset_num_proc=4,
    packing=True,                 # Pack short sequences — boosts GPU util

    args=TrainingArguments(
        # Batch + gradient accumulation
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,  # Effective batch = 16

        # Learning rate schedule
        warmup_steps=50,
        num_train_epochs=3,
        learning_rate=2e-4,
        lr_scheduler_type="cosine",

        # Precision
        fp16=not torch.cuda.is_bf16_supported(),
        bf16=torch.cuda.is_bf16_supported(),     # bf16 on RTX 40/50

        # Memory optimisations
        optim="adamw_8bit",         # 8-bit Adam — saves ~1GB VRAM
        gradient_checkpointing=True,

        # Logging
        logging_steps=10,
        output_dir="./outputs",
        save_strategy="epoch",
        report_to="wandb",          # or "tensorboard" / "none"
        run_name="arthavidya-qwen25-7b-sft",

        # Stability
        weight_decay=0.01,
        max_grad_norm=1.0,
        seed=42,
    ),
)

trainer_stats = trainer.train()
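
The configuration above combines warmup_steps=50 with lr_scheduler_type="cosine". The sketch below reproduces that schedule shape in plain Python (linear warmup to the base rate, then a half-cosine decay to zero) so you can inspect the learning rate at any step. The total step count is a hypothetical value; the real one depends on dataset size and packing.

```python
import math

def lr_at_step(step, total_steps, base_lr=2e-4, warmup_steps=50):
    """Linear warmup followed by cosine decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))

TOTAL_STEPS = 1000  # hypothetical
for step in (0, 25, 50, 525, 1000):
    print(step, f"{lr_at_step(step, TOTAL_STEPS):.2e}")
```

If warmup takes up a large share of a short run, reducing warmup_steps keeps more of the training at a useful learning rate.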

Key Hyperparameters Cheatsheet

Param | Memory-Tight (12GB) | Comfortable (16GB) | Effect
per_device_train_batch_size | 1–2 | 2–4 | VRAM directly
gradient_accumulation_steps | 8–16 | 4–8 | Effective batch size
max_seq_length | 1024–2048 | 2048–4096 | VRAM quadratically
LoRA r (rank) | 16–32 | 32–64 | Adapter capacity
learning_rate | 1e-4 to 3e-4 | 1e-4 to 2e-4 | Convergence speed
packing | True | True | GPU utilization
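
Batch size and gradient accumulation together determine how many optimizer steps a run takes, which in turn decides how sensible a fixed warmup_steps value is. Here is a minimal single-GPU sketch; the sequence count is hypothetical (with packing=True it is the number of packed sequences, not raw examples), and the Trainer's own rounding may differ by a step per epoch.

```python
import math

def training_steps(num_sequences, per_device_batch, grad_accum, epochs):
    """Return (effective batch size, approximate optimizer steps)."""
    effective_batch = per_device_batch * grad_accum
    steps_per_epoch = math.ceil(num_sequences / effective_batch)
    return effective_batch, steps_per_epoch * epochs

effective, steps = training_steps(5000, per_device_batch=2, grad_accum=8, epochs=3)
print("effective batch:", effective, "optimizer steps:", steps)
print(f"warmup share with warmup_steps=50: {50 / steps:.1%}")
```

Halving the micro-batch while doubling gradient accumulation leaves both numbers unchanged, which is why that trade is a safe first response to memory pressure.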

Full Training Loop

A complete, runnable training script combining all the above components. Copy-paste ready for a 7B QLoRA fine-tune on 12 GB VRAM.

python train.py — Complete SFT Script
#!/usr/bin/env python3
"""
Unsloth SFT Training Script
Optimised for 12–16 GB VRAM (RTX 5070, 5060 Ti, 4060 Ti)
"""

import torch
from unsloth import FastLanguageModel
from unsloth.chat_templates import get_chat_template
from datasets import load_dataset
from trl import SFTTrainer
from transformers import TrainingArguments
import wandb

# ─── CONFIG ──────────────────────────────
MODEL_NAME    = "unsloth/Qwen2.5-7B-Instruct-bnb-4bit"
MAX_SEQ_LEN   = 2048
OUTPUT_DIR    = "./outputs/arthavidya-v1"
DATASET_PATH  = "data/train.jsonl"
WANDB_PROJECT = "arthavidya"

# ─── LOAD MODEL ──────────────────────────
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=MODEL_NAME,
    max_seq_length=MAX_SEQ_LEN,
    dtype=None,
    load_in_4bit=True,
)

tokenizer = get_chat_template(tokenizer, chat_template="qwen-2.5")

# ─── APPLY LORA ──────────────────────────
model = FastLanguageModel.get_peft_model(
    model,
    r=32,
    lora_alpha=64,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing="unsloth",
    use_rslora=True,
    random_state=42,
)

model.print_trainable_parameters()  # Prints the summary itself and returns None

# ─── DATASET ─────────────────────────────
dataset = load_dataset("json", data_files=DATASET_PATH, split="train")

def format_examples(examples):
    return {"text": [
        tokenizer.apply_chat_template(c, tokenize=False, add_generation_prompt=False)
        for c in examples["conversations"]
    ]}

dataset = dataset.map(format_examples, batched=True)

# ─── TRAINER ─────────────────────────────
wandb.init(project=WANDB_PROJECT, name="qwen25-7b-sft-r32")

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=MAX_SEQ_LEN,
    dataset_num_proc=4,
    packing=True,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,
        warmup_steps=50,
        num_train_epochs=3,
        learning_rate=2e-4,
        lr_scheduler_type="cosine",
        fp16=not torch.cuda.is_bf16_supported(),
        bf16=torch.cuda.is_bf16_supported(),
        optim="adamw_8bit",
        weight_decay=0.01,
        max_grad_norm=1.0,
        logging_steps=10,
        save_strategy="epoch",
        output_dir=OUTPUT_DIR,
        report_to="wandb",
        seed=42,
    ),
)

# ─── TRAIN ───────────────────────────────
stats = trainer.train()
print(f"Training complete. Loss: {stats.training_loss:.4f}")

# ─── SAVE LORA ADAPTERS ──────────────────
model.save_pretrained(f"{OUTPUT_DIR}/lora-adapter")
tokenizer.save_pretrained(f"{OUTPUT_DIR}/lora-adapter")
print("LoRA adapter saved.")

Curriculum Training

Curriculum training exposes the model to data in a deliberate difficulty progression — starting with easy, well-structured examples and advancing to complex, nuanced ones. This mirrors how humans learn: foundational concepts first, edge cases later.

🌱
Tier 1 — Easy
Clear factual Q&A. Short, unambiguous responses. High-confidence data. Example: basic tax definitions, fund categories.
📚
Tier 2 — Medium
Multi-step reasoning. Moderate context. Example: comparing mutual fund categories, SEBI regulation summaries.
🧠
Tier 3 — Difficult
Complex financial scenarios with caveats and calculations. Nuanced regulatory interpretations. Requires domain expertise.
🎯
Tier 4 — Tricky
Edge cases, adversarial questions, ambiguous situations requiring careful hedging. Reserved gold eval examples.
python curriculum_train.py
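import torch
from datasets import load_dataset
from transformers import TrainingArguments
from trl import SFTTrainer

# Reuses model, tokenizer, format_examples and MAX_SEQ_LEN from train.py

# Tier name -> (data path, epochs, learning rate)
TIERS = {
    "easy":      ("data/tier1_easy.jsonl",      1, 2e-4),
    "medium":    ("data/tier2_medium.jsonl",    1, 1.5e-4),
    "difficult": ("data/tier3_difficult.jsonl", 2, 1e-4),
    "tricky":    ("data/tier4_tricky.jsonl",    2, 5e-5),
}

for tier_name, (data_path, epochs, lr) in TIERS.items():
    print(f"\n{'='*50}")
    print(f"Training Tier: {tier_name.upper()}")
    print(f"LR: {lr}, Epochs: {epochs}")

    dataset = load_dataset("json", data_files=data_path, split="train")
    dataset = dataset.map(format_examples, batched=True)

    # Build a fresh trainer per tier: SFTTrainer prepares its dataset in
    # __init__, and a reused optimizer would keep the previous tier's LR.
    trainer = SFTTrainer(
        model=model, tokenizer=tokenizer, train_dataset=dataset,
        dataset_text_field="text", max_seq_length=MAX_SEQ_LEN, packing=True,
        args=TrainingArguments(
            per_device_train_batch_size=2, gradient_accumulation_steps=8,
            learning_rate=lr, num_train_epochs=epochs,
            lr_scheduler_type="cosine", optim="adamw_8bit",
            fp16=not torch.cuda.is_bf16_supported(),
            bf16=torch.cuda.is_bf16_supported(),
            output_dir=f"checkpoints/{tier_name}", seed=42,
        ),
    )
    trainer.train()

    # Save checkpoint after each tier
    model.save_pretrained(f"checkpoints/after_{tier_name}")
    print(f"Tier {tier_name} complete.")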

Knowledge Distillation

Knowledge Distillation (KD) trains a smaller student model to mimic a larger, more capable teacher model. Instead of only training on hard labels (correct answers), the student learns from the teacher's soft output distribution — capturing nuance and uncertainty.

Standard SFT (Hard Labels)

Loss = Cross-entropy(student_logits, ground_truth_token)
Student only sees "right answer" — no information about near-misses.

KD with Soft Labels

Loss = α × T² × KL(softmax(teacher_logits / T) ‖ softmax(student_logits / T)) + (1-α) × CE(student_logits, ground_truth_token)
Student learns the teacher's confidence distribution — richer signal.

python kd_training.py — Soft-Label Distillation with Unsloth
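import torch
import torch.nn.functional as F
from trl import SFTTrainer

# teacher_model: a larger frozen model (e.g. Qwen2.5-14B) that shares the
# student's tokenizer and vocabulary; load it before training.
# Student: a smaller model (e.g. Qwen2.5-3B with LoRA) passed to the trainer.
# Recent Unsloth releases may skip materializing logits during training;
# setting UNSLOTH_RETURN_LOGITS=1 before importing Unsloth restores them.

TEMPERATURE = 2.0     # Soften teacher distribution
ALPHA       = 0.7     # Weight for KD loss vs CE loss

def kd_loss(student_logits, teacher_logits, labels):
    # Causal LM shift: the logits at position t predict the token at t+1
    s = student_logits[:, :-1, :].float()
    t = teacher_logits[:, :-1, :].float()
    y = labels[:, 1:]
    mask = (y != -100)  # Skip prompt and padding positions

    # Standard CE loss (hard labels)
    ce_loss = F.cross_entropy(
        s.reshape(-1, s.size(-1)),
        y.reshape(-1),
        ignore_index=-100
    )

    # Per-token KL(teacher || student) with temperature scaling
    student_soft = F.log_softmax(s / TEMPERATURE, dim=-1)
    teacher_soft = F.softmax(t / TEMPERATURE, dim=-1)
    kl = F.kl_div(student_soft, teacher_soft, reduction="none").sum(-1)
    kl_loss = (kl * mask).sum() / mask.sum().clamp(min=1) * (TEMPERATURE ** 2)

    return ALPHA * kl_loss + (1 - ALPHA) * ce_loss

# Use with a custom Trainer that overrides compute_loss()
class KDTrainer(SFTTrainer):
    # **kwargs absorbs extra arguments (such as num_items_in_batch) that
    # newer transformers releases pass to compute_loss
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        with torch.no_grad():
            teacher_out = teacher_model(**inputs)
        student_out = model(**inputs)
        loss = kd_loss(
            student_out.logits,
            teacher_out.logits,
            inputs["labels"]
        )
        return (loss, student_out) if return_outputs else loss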
🌡️ Temperature Tuning

T=1.0 uses the teacher's distribution unchanged (its sharpest form). T=2.0–4.0 gives a softer, more generalisable target. For finance domain KD, T=2.0 is a good starting point. Higher T can help when the teacher is very confident (low entropy output).
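
The effect of temperature is easy to see on a toy example. This sketch applies a temperature-scaled softmax to three hypothetical teacher logits and measures how far the softened distributions drift from the T=1 distribution using KL divergence; it uses only the standard library and is independent of the training code.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax with max-subtraction for stability."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same tokens."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

teacher_logits = [4.0, 2.0, 0.5]  # hypothetical scores for three tokens
reference = softmax(teacher_logits, 1.0)
for t in (1.0, 2.0, 4.0):
    probs = softmax(teacher_logits, t)
    print(t, [round(p, 3) for p in probs], round(kl_divergence(reference, probs), 4))
```

As T rises, probability mass moves from the top token to the runners-up, which is the extra signal the student receives about near-miss answers.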

Unsloth Studio Overview

Unsloth Studio is a web-based no-code/low-code interface for fine-tuning models using Unsloth, launched in 2025. It runs locally in your browser (backed by a local server) or can connect to a remote GPU instance. Think of it as a visual front-end for everything in this handbook.

🖥️
Visual Config Builder
Build your LoRA config, training arguments, and dataset pipeline through a form UI. Generates the equivalent Python script for reproducibility.
📊
Live Training Dashboard
Real-time loss curves, VRAM usage, tokens/sec, GPU temperature — all in the browser without needing Weights & Biases.
🗃️
Dataset Manager
Upload JSONL files, preview formatting, validate conversation structure, and split train/eval without code.
📦
One-Click GGUF Export
Select quantization format (Q4_K_M, Q5_K_M, etc.) and export your merged model to GGUF directly from the Studio UI.
bash Launch Unsloth Studio
# Install Unsloth Studio (included with unsloth 2025.x)
pip install "unsloth[studio]"

# Launch the Studio server on localhost:7860
unsloth-studio

# Or specify a port and host
unsloth-studio --port 7860 --host 0.0.0.0

# Then open http://localhost:7860 in your browser
💡 Studio vs Script — When to use which

Use Studio for exploratory fine-tuning, quick experiments, and GGUF export. Use Python scripts for production pipelines, curriculum training, KD, DVC tracking, and W&B integration. Studio's "Export Script" button gives you the Python equivalent of any Studio config.

Studio Workflow

1. Select Model
   Choose from the built-in model catalogue (Llama 3.2, Qwen 2.5, Mistral, Phi-4, Gemma 3, etc.) or paste any HuggingFace model ID. Studio auto-selects the correct chat template.
2. Upload Dataset
   Drag-drop your JSONL file or paste a HuggingFace dataset path. Studio validates the format and shows a live preview of formatted training examples.
3. Configure LoRA & Training
   Set rank, alpha, target modules via sliders. Set learning rate, epochs, batch size. Studio estimates VRAM usage dynamically as you adjust settings, which is great for 12 GB budgets.
4. Start Training
   Click Train. Watch loss curves, VRAM, and speed in real-time. Training is interruptible: checkpoints are saved per epoch so you can resume.
5. Chat & Evaluate
   After training, test your model directly in Studio's built-in chat interface. Compare responses side-by-side with the base model.
6. Export to GGUF
   Select quantization type, click Export. Studio merges LoRA into base, converts to GGUF, and saves locally. Ready for Ollama or llama-server.

Export to GGUF

Unsloth has a native GGUF export pipeline — no manual llama.cpp compilation required. It merges your LoRA adapter into the base model, converts to GGUF, and applies your chosen quantization, all in one call.

python export_gguf.py
# After training is complete, export to GGUF

# Option 1: Save as GGUF directly (multiple quant types at once)
model.save_pretrained_gguf(
    "arthavidya-qwen25-7b",     # Output folder name
    tokenizer,
    quantization_method=[
        "q4_k_m",               # Best balance — use for Ollama
        "q5_k_m",               # Higher quality
        "q8_0",                 # Near-lossless — for eval
    ]
)
# Produces one .gguf file per quantization type, e.g. arthavidya-qwen25-7b-Q4_K_M.gguf
# (exact file names vary by Unsloth version)

# Option 2: Push the GGUF export to HuggingFace Hub
model.push_to_hub_gguf(
    "vivek-doshi/arthavidya-qwen25-7b-gguf",
    tokenizer,
    quantization_method="q4_k_m",
    token="hf_...",
)

# Option 3: Save merged safetensors (for HF Hub without GGUF)
model.save_pretrained_merged(
    "arthavidya-merged",
    tokenizer,
    save_method="merged_16bit",  # or "merged_4bit", "lora"
)
📁 What Gets Saved

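Each GGUF export writes one .gguf file per quantization type, with the tokenizer vocabulary and chat template embedded in the file, in line with GGUF's single-file design. Depending on the Unsloth version, an Ollama Modelfile template may also be generated; check the output folder and add your system prompt to it if needed.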

Local Inference

Option A: llama-server (Direct, No Overhead)

llama-server is the HTTP API mode of llama.cpp. It provides an OpenAI-compatible REST endpoint — the same API format as the OpenAI SDK, making it a drop-in for your applications.

bash llama-server launch
# Download llama.cpp release binary (or build from source)
wget https://github.com/ggerganov/llama.cpp/releases/latest/download/llama-b3xxx-bin-ubuntu-x64.zip

# Launch server with CUDA offload
#   -ngl          GPU layers (set to 99 to fully GPU-offload)
#   -c            Context window (shared across the parallel slots)
#   -np           Parallel slots (concurrent requests)
#   --flash-attn  Enable Flash Attention (faster, less VRAM)
./llama-server \
  -m ./arthavidya-qwen25-7b-Q4_K_M.gguf \
  -ngl 40 \
  -c 4096 \
  --host 0.0.0.0 \
  --port 8080 \
  -np 4 \
  --flash-attn

# Server now running at http://localhost:8080
# OpenAI-compatible endpoint: http://localhost:8080/v1/chat/completions
python query_llama_server.py
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="local")

response = client.chat.completions.create(
    model="arthavidya",
    messages=[
        {"role": "system",  "content": "You are Arthavidya, an Indian finance expert."},
        {"role": "user",    "content": "Explain ELSS tax saving funds."},
    ],
    temperature=0.7,
    max_tokens=512,
)
print(response.choices[0].message.content)

GPU Layer Recommendations for 12–16 GB

Model + Quant | GPU | -ngl (GPU layers) | RAM fallback
7B Q4_K_M | 12 GB | -ngl 99 | None (fully GPU)
7B Q8_0 | 12 GB | -ngl 32 | ~2 GB RAM
13B Q4_K_M | 12 GB | -ngl 30 | ~3 GB RAM
13B Q4_K_M | 16 GB | -ngl 99 | None
4B Q6_K | 12 GB | -ngl 99 | None (much room)
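
Two pieces of arithmetic sit behind recommendations like these: the size of the KV cache, and how many layers fit in the VRAM that remains. The sketch below is a rough model under stated assumptions: an f16 cache, equally sized layers, and hypothetical layer counts, head shapes, and reserve values that are not taken from any specific model.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context, bytes_per_value=2):
    """K and V tensors for every layer and position (f16 by default)."""
    return 2 * n_layers * n_kv_heads * head_dim * context * bytes_per_value

def max_gpu_layers(file_gb, n_layers, vram_gb, reserve_gb):
    """Layers that fit after reserving room for cache and buffers."""
    per_layer_gb = file_gb / n_layers
    fit = int((vram_gb - reserve_gb) / per_layer_gb)
    return max(0, min(n_layers, fit))

# Hypothetical shapes: 32 layers, 8 KV heads of dim 128, 4096-token context
print(kv_cache_bytes(32, 8, 128, 4096) / 2**20, "MiB of KV cache")

# Hypothetical 40-layer model stored as a 13.8 GB file on a 16 GB card
print("suggested -ngl:", max_gpu_layers(13.8, 40, vram_gb=16, reserve_gb=3))
```

Treat the result as a starting value for -ngl and adjust it while watching nvidia-smi, since real layers are not equally sized and the runtime needs extra scratch buffers.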

Ollama Integration

Ollama is the easiest way to manage and run GGUF models locally. It handles model storage, automatic GPU offloading, and exposes the same OpenAI-compatible API. Ideal for running your Arthavidya model alongside other models with hot-swap.

bash Modelfile — Custom Ollama Model
# Create a Modelfile for your fine-tuned model
FROM ./arthavidya-qwen25-7b-Q4_K_M.gguf

# Set system prompt
SYSTEM """
You are Arthavidya, a knowledgeable Indian financial assistant.
You provide accurate, clear guidance on Indian taxation, mutual funds,
insurance, and personal finance. Always cite relevant sections
(e.g. Section 80C, SEBI regulations) where applicable.
"""

# Parameters
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER num_ctx 4096
# Max GPU layers
PARAMETER num_gpu 99
PARAMETER num_thread 8
bash ollama_commands.sh
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Create your custom model from Modelfile
ollama create arthavidya -f ./Modelfile

# Run interactively
ollama run arthavidya

# List all local models
ollama list

# Pull a base model (for comparison)
ollama pull qwen2.5:7b-instruct-q4_K_M

# Query via REST API (OpenAI-compatible)
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "arthavidya",
    "messages": [{"role":"user","content":"What is a SIP?"}]
  }'

llama-server vs Ollama — When to Use Which

🔧 llama-server
Use when: maximum control over context, batch size, threading; running in production pipeline; need consistent version pinning; want to manage manually (as in Arthavidya pipeline).
🐳 Ollama
Use when: managing multiple models, hot-swapping between base and fine-tuned, easy integration with LangChain/LlamaIndex, or sharing with teammates via simple API.

Troubleshooting

Common Errors & Fixes

CUDA Out of Memory (OOM)

fix
# Try these in order until OOM resolves:
1. Reduce max_seq_length:    4096 → 2048 → 1024
2. Reduce batch size:        per_device_train_batch_size=1
3. Increase grad accum:     gradient_accumulation_steps=16
4. Use 8-bit Adam:           optim="adamw_8bit"
5. Enable grad checkpointing: use_gradient_checkpointing="unsloth"
6. Reduce LoRA rank:         r=32 → r=16
7. Disable packing temporarily and check GPU util with nvidia-smi
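
The checklist can also be expressed as a retry ladder that a launcher script walks through after each OOM. This is a minimal sketch covering the sequence length, batch size, and rank steps; the function name and starting values are illustrative, and the batch step folds checklist items 2 and 3 together so the effective batch size stays constant.

```python
def oom_fallbacks(config):
    """Yield progressively smaller configs, following the checklist order."""
    cfg = dict(config)
    # Step 1: halve the sequence length down to 1024
    while cfg["max_seq_length"] > 1024:
        cfg["max_seq_length"] //= 2
        yield dict(cfg)
    # Steps 2 and 3: micro-batch of 1, compensated by more accumulation
    if cfg["per_device_train_batch_size"] > 1:
        cfg["gradient_accumulation_steps"] *= cfg["per_device_train_batch_size"]
        cfg["per_device_train_batch_size"] = 1
        yield dict(cfg)
    # Step 6: drop the LoRA rank
    if cfg["r"] > 16:
        cfg["r"] = 16
        yield dict(cfg)

start = {
    "max_seq_length": 4096,
    "per_device_train_batch_size": 2,
    "gradient_accumulation_steps": 8,
    "r": 32,
}
for candidate in oom_fallbacks(start):
    print(candidate)
```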

Triton / CUDA Kernel Errors on RTX 50 Series

fix
# Blackwell (sm_120) needs a CUDA 12.8 build of PyTorch and a Triton release with Blackwell support
pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu128
pip install -U "triton>=3.3.0"  # Blackwell support arrived in Triton 3.3

# Also ensure your NVIDIA driver is from the 570 series or newer (required by CUDA 12.8)
nvidia-smi  # Check driver version

Slow Training — Tokens/sec Too Low

diagnose
# Check GPU utilization during training
watch -n 1 nvidia-smi

# If GPU util < 80%, your DataLoader is bottlenecking:
# - Increase dataset_num_proc (e.g., 4 → 8)
# - Enable packing=True to densify batches
# - Move dataset to Linux FS (not /mnt/c/ in WSL2)
# - Pre-tokenize and cache dataset before training

NaN / Inf Loss

fix
# NaN loss usually means LR too high or bad data
1. Lower LR: 2e-4 → 5e-5
2. Add max_grad_norm=1.0 to clip gradients
3. Inspect dataset for empty or malformed examples
4. Try bf16=False, fp16=True (or vice versa)
5. Increase warmup_steps to 100+

Unsloth vs PEFT Version Mismatch

bash
# Always pin compatible versions together
pip install "peft==0.11.1" "trl==0.9.4" "transformers==4.44.0"
pip install "unsloth @ git+https://github.com/unslothai/unsloth.git"

# Check what's installed
pip show unsloth peft trl transformers | grep -E "Name|Version"

Quick Cheat Sheet

Model Names (Unsloth Hub)

unsloth/Qwen2.5-7B-Instruct-bnb-4bit
unsloth/Qwen2.5-14B-Instruct-bnb-4bit
unsloth/Llama-3.2-3B-Instruct-bnb-4bit
unsloth/Llama-3.1-8B-Instruct-bnb-4bit
unsloth/mistral-7b-instruct-v0.3-bnb-4bit
unsloth/Phi-4-bnb-4bit
unsloth/gemma-3-12b-it-bnb-4bit

Key CLI Commands

bash
# Monitor VRAM during training
watch -n 1 nvidia-smi --query-gpu=memory.used,memory.free,utilization.gpu --format=csv

# Rough dataset size in whitespace-separated words (a proxy for tokens, not an exact count)
python -c "from datasets import load_dataset; d=load_dataset('json',data_files='train.jsonl',split='train'); print(sum(len(m['content'].split()) for c in d['conversations'] for m in c))"

# Test GGUF model with llama.cpp CLI
./llama-cli -m model.gguf -p "What is ELSS?" --temp 0.7 -n 256

# Check Ollama GPU usage
ollama ps

# Export LoRA adapter to merged safetensors
python -c "
from unsloth import FastLanguageModel
m,t = FastLanguageModel.from_pretrained('./lora-adapter')
m.save_pretrained_merged('./merged', t, save_method='merged_16bit')
"

Hyperparameter Quick-Reference

Goal | Key Levers
Reduce VRAM | ↓ seq_len, ↓ batch_size, ↑ grad_accum, ↓ rank, enable grad_checkpoint
Faster training | ↑ batch_size, packing=True, ↑ dataset_num_proc, bf16=True
Better quality | ↑ rank, use_rslora, ↑ epochs, better data curation
Prevent overfitting | ↑ dropout, ↓ epochs, weight_decay=0.01, eval set monitoring
Stable training | ↓ LR, warmup_steps=100+, max_grad_norm=1.0, cosine schedule

Useful Links

github.com/unslothai/unsloth
huggingface.co/unsloth
github.com/ggerganov/llama.cpp
ollama.com
docs.unsloth.ai
wandb.ai