LLAMA.CPP
HANDBOOK
Everything you need to configure, tune, and run GGUF models locally — from GPU/CPU hybrid offloading to sampling parameters, server mode, Jinja templates, and KV cache optimization. Built for RTX 40/50 series and 12–16 GB VRAM setups.
What is llama.cpp?
llama.cpp is a high-performance LLM inference engine written in pure C/C++ by Georgi Gerganov. It runs GGUF-format models with CPU-only, GPU-only, or hybrid GPU+CPU execution — no Python, no CUDA runtime dependency for CPU mode, no HuggingFace stack required.
Its defining capability is GPU layer offloading: you load exactly as many transformer layers as fit in your GPU VRAM, with the remainder computed on CPU RAM. This lets you run models larger than your VRAM by trading speed for capacity.
Ollama and LM Studio are frontends that wrap llama.cpp. If you use Ollama, most of its options correspond to llama.cpp parameters (often under different names), and Ollama hands them to the underlying llama.cpp process. Knowing llama.cpp parameters means you understand what these llama.cpp-based tools are doing under the hood.
Build & Install
# Clone latest
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
# Build with CUDA: -DGGML_CUDA=ON enables GPU support
# CMAKE_CUDA_ARCHITECTURES: 89 = Ada (40xx), 120 = Blackwell (50xx)
cmake -B build \
  -DGGML_CUDA=ON \
  -DCMAKE_CUDA_ARCHITECTURES="89;120" \
  -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j$(nproc)
# Binaries land in ./build/bin/
ls build/bin/
# → llama-server llama-cli llama-bench llama-quantize
# Download latest pre-built binary from GitHub Releases
RELEASE=$(curl -s https://api.github.com/repos/ggerganov/llama.cpp/releases/latest | grep tag_name | cut -d'"' -f4)
# Linux x64 build (asset names vary by release; check the release page for a GPU-enabled variant)
wget "https://github.com/ggerganov/llama.cpp/releases/download/${RELEASE}/llama-${RELEASE}-bin-ubuntu-x64.zip"
unzip llama-*.zip -d llama-cpp
# Adjust the path if the archive unpacks into a subdirectory
export PATH="$PWD/llama-cpp:$PATH"
# Verify the binary runs; a CUDA-enabled build lists the detected CUDA devices in its startup log
./llama-cpp/llama-server --version
For RTX 5060 Ti / 5070, set -DCMAKE_CUDA_ARCHITECTURES="120". Pre-built binaries as of early 2025 may not include sm_120, so build from source for best performance on Blackwell. Also ensure CUDA 12.8+ and a driver that supports it (the 570 series or newer).
Key Binaries
| Binary | Purpose | When to Use |
|---|---|---|
| llama-server | HTTP API server (OpenAI-compatible) | App integration, Claude Code, LangChain, anything that needs an API |
| llama-cli | Interactive CLI / single-shot inference | Quick testing, scripting, benchmarking prompts |
| llama-bench | Benchmark prompt & generation throughput | Finding optimal -ngl, -b, -t values for your GPU |
| llama-quantize | Convert/re-quantize GGUF models | Changing quant level (F16→Q4_K_M) locally |
| llama-perplexity | Evaluate model quality (PPL score) | Comparing quant levels for quality loss measurement |
Model Loading
-m ./models/qwen2.5-7b-instruct-Q4_K_M.gguf
-mu "https://huggingface.co/bartowski/Qwen2.5-7B-Instruct-GGUF/resolve/main/Qwen2.5-7B-Instruct-Q4_K_M.gguf" # -mu (--model-url) downloads the file; -m expects a local path
-a (--alias) sets the model name returned by the /v1/models endpoint and used in chat completions. Clients like the OpenAI SDK should send this name in the model field of their request.
-a "qwen2.5-7b" # Client calls: {"model": "qwen2.5-7b", ...}
GPU Layers -ngl
-ngl 99 or -ngl 9999 offloads all layers (safe: it clamps to the model's actual layer count).
-ngl 99 # Fully GPU (all layers fit in VRAM)
-ngl 20 # 20 layers GPU, rest CPU (hybrid for VRAM-tight setups)
-ngl 0 # Pure CPU inference
Layer Count by Model
| Model | Layers | 12 GB -ngl | 16 GB -ngl | Notes |
|---|---|---|---|---|
| Qwen2.5-7B Q4_K_M | 28 | -ngl 28 ✅ | -ngl 28 ✅ | Full GPU, ~4.1 GB VRAM |
| Qwen2.5-7B Q8_0 | 28 | -ngl 20 | -ngl 28 ✅ | Q8 = ~7.7 GB, tight on 12 GB |
| Llama-3.1-8B Q4_K_M | 32 | -ngl 32 ✅ | -ngl 32 ✅ | ~4.6 GB VRAM |
| Qwen2.5-14B Q4_K_M | 48 | -ngl 24 | -ngl 35 | ~8 GB for 24 layers |
| Mistral-7B Q5_K_M | 32 | -ngl 32 ✅ | -ngl 32 ✅ | ~5.1 GB VRAM |
| Phi-4 14B Q4_K_M | 40 | -ngl 20 | -ngl 30 | ~9.0 GB for 30 layers |
| DeepSeek-R1 7B Q4_K_M | 28 | -ngl 28 ✅ | -ngl 28 ✅ | Same as Qwen-7B class |
Context Size -c
-c 2048 # Conservative — saves VRAM, good for chat
-c 8192 # Balanced — suits most tasks
-c 32768 # Long context; Qwen2.5 supports up to 128K
Context vs VRAM at Different Sizes (7B Q4_K_M)
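As a rough guide to how context size drives KV cache memory, the sketch below estimates the f16 cache at the three context sizes listed above. The model shape (28 layers, 4 KV heads, head dimension 128) is an assumption chosen to resemble a Qwen2.5-7B-class model; read the real values from your GGUF metadata before relying on the numbers.

```python
def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim, bytes_per_value=2.0):
    """Estimate KV cache size: one K and one V vector per layer and position."""
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_value

# Assumed shape for a 7B-class model with grouped-query attention (hypothetical values).
SHAPE = dict(n_layers=28, n_kv_heads=4, head_dim=128)

for n_ctx in (2048, 8192, 32768):
    mib = kv_cache_bytes(n_ctx, **SHAPE) / 2**20
    print(f"-c {n_ctx:>6}: ~{mib:,.0f} MiB of f16 KV cache")
```

The estimate grows linearly with -c, so quadrupling the context quadruples the cache. Models with more KV heads or layers scale the same way from a larger base.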
Threads -t / -tb
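-t sets the CPU threads used for decode. With full GPU offload (-ngl 99), this has minimal effect. Match it to your physical core count, for example -t 12. Using too many threads causes context-switching overhead that slows inference.
-t 12 # 12 physical cores for CPU decode layers
-tb sets the CPU threads used for prefill and can be higher than -t since prefill is more parallelizable. It is separate from -t to let you tune each phase independently.
-t 8 -tb 16 # 8 threads decode, 16 threads prefill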
Batch Size -b / -ub
On VRAM-tight setups, lower the logical batch size to -b 512 to free VRAM during prompt processing. The speed difference for short prompts is negligible.
-b 512 # Reduced for VRAM-tight setups
-b 4096 # Aggressive — maximize prompt throughput if VRAM allows
-ub sets the physical micro-batch size and should not exceed -b. Smaller = less VRAM peak usage. Larger = more GPU parallelism. Tune this if you're getting OOM during long-prompt processing.
-b 2048 -ub 256 # Large logical batch, small physical batch (OOM prevention)
Temperature --temp
--temp 0.1 # Code generation / factual Q&A
--temp 0.7 # General chat (sweet spot)
--temp 1.2 # Creative writing / brainstorming
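To make the effect of these three values concrete, here is a minimal sketch of temperature scaling: logits are divided by the temperature before the softmax. The logits are made-up numbers for four candidate tokens, and the function name is illustrative rather than part of llama.cpp.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0; treat 0 as greedy argmax instead")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical scores for four tokens
for temp in (0.1, 0.7, 1.2):
    probs = softmax_with_temperature(logits, temp)
    print(f"--temp {temp}: {[round(p, 3) for p in probs]}")
```

At 0.1 nearly all probability lands on the top token, while at 1.2 the distribution flattens and lower-ranked tokens get a realistic chance of being sampled.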
Top-P --top-p (Nucleus Sampling)
Keeps the smallest set of top tokens whose cumulative probability reaches p. At 0.95, only the top-probability tokens summing to 95% of the distribution are candidates. This eliminates low-probability "tail" tokens that cause incoherence.
--top-p 0.9 # Standard: good quality/diversity balance
--top-p 1.0 # Disabled — full distribution (rely on temp alone)
--top-p 0.5 # Very conservative — only high-confidence tokens
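The nucleus cutoff can be sketched in a few lines. This is an illustrative implementation of the idea, not llama.cpp's code, and the probabilities are invented for the example.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of highest-probability tokens whose mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}  # renormalized survivors

probs = [0.5, 0.3, 0.15, 0.04, 0.01]  # hypothetical token probabilities
for p in (0.9, 0.5, 1.0):
    print(f"--top-p {p}: keeps token ids {sorted(top_p_filter(probs, p))}")
```

With these numbers, 0.9 keeps the three most likely tokens, 0.5 keeps only the top one, and 1.0 keeps everything.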
Top-K --top-k
Keeps the K highest-probability tokens and zeros out all others before sampling. Simpler than Top-P. Setting --top-k 0 disables it (no hard cutoff). Applied before Top-P in the sampling chain.
--top-k 0 # Disabled, recommended when using top-p
--top-k 40 # Classic default
--top-k 1 # Greedy (same as --temp 0)
Min-P --min-p
Discards tokens whose probability is below min_p × (probability of top token). Scales dynamically with confidence: when the model is very confident, fewer tokens pass; when uncertain, more pass. Often superior to Top-P for modern GGUF models.
--top-p 1.0 --min-p 0.05 --temp 0.8 # Modern recommended combo
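The dynamic threshold is easiest to see side by side on a confident and an uncertain distribution. The sketch below is an illustration of the min_p × top-probability rule with invented probabilities, not llama.cpp's implementation.

```python
def min_p_filter(probs, min_p):
    """Drop tokens whose probability is below min_p times the top probability."""
    threshold = min_p * max(probs)
    kept = {i: p for i, p in enumerate(probs) if p >= threshold}
    total = sum(kept.values())
    return {i: p / total for i, p in kept.items()}  # renormalized survivors

confident = [0.90, 0.06, 0.03, 0.01]  # hypothetical: model is sure of token 0
uncertain = [0.30, 0.25, 0.25, 0.20]  # hypothetical: several plausible tokens

for name, probs in (("confident", confident), ("uncertain", uncertain)):
    survivors = min_p_filter(probs, 0.05)
    print(f"{name}: {len(survivors)} of {len(probs)} tokens pass --min-p 0.05")
```

The same setting keeps two tokens in the confident case and all four in the uncertain case, which is the adaptive behaviour a fixed Top-P cutoff lacks.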
Repeat Penalty --repeat-penalty
--presence-penalty P — OpenAI-style: flat penalty per token present in context
--frequency-penalty P — OpenAI-style: penalty proportional to how often token appeared
--repeat-penalty 1.1 --repeat-last-n 128 # Gentle anti-repetition
--repeat-penalty 1.0 # Disabled (default)
--presence-penalty 0.2 --frequency-penalty 0.1 # OpenAI-style
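The three penalties act on logits in different ways. The sketch below follows the commonly described scheme: the repeat penalty divides positive logits and multiplies negative ones, the presence penalty subtracts a flat amount, and the frequency penalty subtracts an amount per occurrence. Treat the exact arithmetic as an assumption about the sampler rather than a copy of llama.cpp's code; the caller is expected to pass only the last --repeat-last-n tokens as recent_tokens.

```python
from collections import Counter

def apply_penalties(logits, recent_tokens, repeat_penalty=1.0,
                    presence_penalty=0.0, frequency_penalty=0.0):
    """Return a penalized copy of logits for tokens seen in recent_tokens."""
    counts = Counter(recent_tokens)
    out = list(logits)
    for tok, count in counts.items():
        # Repeat penalty pushes the logit toward "less likely" whatever its sign.
        if out[tok] > 0:
            out[tok] /= repeat_penalty
        else:
            out[tok] *= repeat_penalty
        # OpenAI-style terms: flat for presence, proportional for frequency.
        out[tok] -= presence_penalty + frequency_penalty * count
    return out

logits = [2.0, -1.0, 0.5]  # hypothetical logits for token ids 0, 1, 2
recent = [0, 0, 1]         # token 0 appeared twice, token 1 once, token 2 never

print(apply_penalties(logits, recent, repeat_penalty=1.1))
print(apply_penalties(logits, recent, presence_penalty=0.2, frequency_penalty=0.1))
```

Token 2 never appeared, so its logit is unchanged in both runs; token 0 is penalized more by the frequency term because it appeared twice.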
Max Tokens -n
-n sets the maximum number of tokens to generate. The default -1 means generate until the model produces an EOS token or the context is full. In server mode, clients can override this per-request via max_tokens in the API payload.
-n 512 # Cap at 512 output tokens
-n -1 # Let model decide when to stop
-n -2 # Generate until context full (useful for completion tasks)
Flash Attention -fa
• Enables larger contexts without OOM
• Often 20–40% faster on GPU (RTX 40/50)
• Required for long context (32K+) on 12 GB
• Incompatible with some older GGUF quant formats
• Must pair with compatible --cache-type-k/v settings
• Best on Ampere+ (RTX 30xx and newer)
-fa # Enable Flash Attention
-fa -c 32768 # Flash Attention + long context
-fa --cache-type-k q8_0 # Flash Attn + quantized KV cache
Cache Type K/V --cache-type-k / --cache-type-v
A quantized KV cache (q8_0, q4_0, q5_0) requires -fa (Flash Attention) to be enabled. Without -fa, only f16 and f32 work.
Available Cache Types
| Type | Bits | VRAM (vs f16) | Quality | Requires -fa? |
|---|---|---|---|---|
| f32 | 32-bit | 2× (more) | Perfect | No |
| f16 | 16-bit | 1× (baseline) | Excellent | No |
| q8_0 | 8-bit | ~0.5× | Near-lossless | Yes |
| q5_0 | 5-bit | ~0.35× | Very good | Yes |
| q4_0 | 4-bit | ~0.25× | Good | Yes |
-fa --cache-type-k q8_0 --cache-type-v q8_0 # Best quality + VRAM savings
-fa --cache-type-k q4_0 --cache-type-v q4_0 # Maximum VRAM savings (long ctx)
--cache-type-k f16 --cache-type-v f16 # Default — no Flash Attn needed
For a 7B Q4_K_M model with 32K context on 12 GB: use -fa --cache-type-k q8_0 --cache-type-v q8_0. This keeps KV cache at ~2 GB instead of ~4 GB at f16, leaving room for weights + activations. For 13B models at 8K context: -fa --cache-type-k q4_0 --cache-type-v q4_0.
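To see how the cache types scale a known f16 figure, the sketch below converts the ~4 GB f16 cache mentioned above to each type. The bits-per-value numbers assume ggml's block layout, where 32 quantized values share one f16 scale (so q8_0 costs 8.5 bits per value rather than 8); that layout is an assumption about the format, not a figure taken from the table.

```python
# Effective bits per cached value, assuming 32-value blocks with a shared f16 scale.
BITS_PER_VALUE = {"f32": 32.0, "f16": 16.0, "q8_0": 8.5, "q5_0": 5.5, "q4_0": 4.5}

def scaled_cache_gb(f16_gb, cache_type):
    """Scale a KV cache size measured at f16 to another cache type."""
    return f16_gb * BITS_PER_VALUE[cache_type] / 16.0

for cache_type in BITS_PER_VALUE:
    print(f"{cache_type:>5}: ~{scaled_cache_gb(4.0, cache_type):.2f} GB")
```

Under these assumptions the ratios land slightly above the rounded values in the table (about 0.53× for q8_0 and 0.28× for q4_0), which is worth remembering when the budget is within a few hundred megabytes.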
Tensor Override -ot
-ot takes rules of the form PATTERN=DEVICE, where PATTERN is a regex matched against tensor names and DEVICE is CPU or a GPU buffer name (the exact names depend on the backend; CUDA builds typically expose CUDA0, CUDA1, and so on). This is the most surgical tool for fitting models into tight VRAM budgets.
-ngl is a blunt instrument: it offloads N complete layers to GPU. -ot is surgical: you can keep specific tensor types on CPU even within layers that -ngl assigned to GPU. Very useful for keeping embedding tables on CPU RAM (they're large and rarely the bottleneck).
Common Tensor Name Patterns
| Pattern | Matches | VRAM Impact |
|---|---|---|
| blk\.\d+\.attn | All attention weights | ~40% of layer VRAM |
| blk\.\d+\.ffn | All FFN/MLP weights | ~60% of layer VRAM |
| token_embd | Embedding table | Large (vocab × dim) |
| output | LM head / output weights | Same as embeddings |
| blk\.[2-9][0-9] | Layers 20–99 | Selective layer control |
-ot "token_embd=CPU" # Keep embedding on CPU RAM (saves ~1-2GB VRAM)
-ot "output=CPU" # Keep LM head on CPU
-ot "blk\.3[2-9]\.=CPU" # Layers 32-39 to CPU, rest GPU
-ot "token_embd=CPU" -ot "output=CPU" -ngl 99 # Hybrid: all compute layers GPU, I/O on CPU
For Mixture-of-Experts models (DeepSeek, Mixtral), the FFN expert tensors are huge but only a few activate per token. Use -ot "blk\.\d+\.ffn_gate_exps=CPU" to keep the gate expert weights in CPU RAM, where the CPU computes them; a broader pattern such as blk\.\d+\.ffn_.*_exps also covers the up and down expert tensors. This gives massive VRAM savings at a modest speed cost.
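Before launching with an -ot rule, it can help to dry-run the pattern against a list of tensor names. The sketch below does that with Python's re.search, assuming the patterns are applied as unanchored regex searches; the tensor names are hypothetical examples following the blk.N.part naming used in the table above, not a dump from a real model.

```python
import re

# Hypothetical tensor names for illustration only.
TENSORS = [
    "token_embd.weight",
    "output.weight",
    "blk.0.attn_q.weight",
    "blk.0.attn_output.weight",
    "blk.0.ffn_down.weight",
    "blk.35.attn_k.weight",
    "blk.35.ffn_gate_exps.weight",
    "blk.35.ffn_up_exps.weight",
]

def matching(pattern):
    """Return the tensor names an unanchored regex search would select."""
    rx = re.compile(pattern)
    return [name for name in TENSORS if rx.search(name)]

for pattern in (r"token_embd", r"output", r"^output", r"blk\.3[2-9]\.",
                r"ffn_gate_exps", r"ffn_.*_exps"):
    print(f"{pattern!r:18} -> {matching(pattern)}")
```

Under that unanchored-search assumption, a bare output pattern also selects per-layer attn_output tensors, so anchoring it as ^output is the safer way to target only the LM head.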
Memory Map --mmap / --mlock / --no-mmap
--no-mmap # Load fully into RAM — best for repeated inference, no page faults
--mmap --mlock # Map + lock in RAM — fast and no swap risk
KV Cache Defrag --defrag-thold
--defrag-thold sets the fragmentation level that triggers KV cache defragmentation: 0.1 means "defrag when 10%+ of cache is fragmented".
--defrag-thold 0.1 # Defrag when 10% fragmentation detected
Host & Port
127.0.0.1 means localhost-only (safe). Use 0.0.0.0 to expose the server on all network interfaces; do so with caution and add --api-key if exposed externally.
--host 127.0.0.1 --port 8080 # Localhost only (default)
--host 0.0.0.0 --port 8080 # All interfaces — accessible on LAN
Jinja Templates --jinja
--jinja uses the chat_template from the GGUF metadata to format messages. This is critical for correct instruction-following behavior.
--chat-template overrides the template manually with one of these presets:
| Template Preset | Use For |
|---|---|
| qwen2 | Qwen 2.x / 2.5 models |
| llama3 | Llama 3.x, Meta models |
| mistral | Mistral / Mixtral |
| chatml | ChatML format (many fine-tunes) |
| gemma | Google Gemma |
| phi3 | Microsoft Phi-3/4 |
| deepseek2 | DeepSeek V2/R1 |
--jinja # Use template from GGUF metadata
--chat-template qwen2 # Force Qwen2 template
--chat-template-file my_template.j2 # Load custom Jinja2 file
Parallel Slots -np
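-np sets the number of concurrent slots. total_kv_cache = per_slot_ctx × n_parallel. llama-server shares -c across the slots, so each slot gets ctx_size / n_parallel tokens; to keep the same context per user, raise -c in proportion.
-np 1 # Single user, best latency
-np 4 # 4 concurrent users; needs 4× the KV VRAM to keep the same per-user context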
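The relationship between slots and context is simple arithmetic, sketched below. It assumes -c is a total that the server splits evenly between slots, which matches the llama-server behaviour described in its documentation at the time of writing; verify against your build, since the function names here are illustrative only.

```python
def per_slot_context(total_ctx: int, n_parallel: int) -> int:
    """Tokens each slot gets when the total context is split evenly."""
    return total_ctx // n_parallel

def total_context_for(per_user_ctx: int, n_parallel: int) -> int:
    """Value of -c needed so that every slot keeps per_user_ctx tokens."""
    return per_user_ctx * n_parallel

for slots in (1, 2, 4):
    print(f"-np {slots}: -c 8192 gives {per_slot_context(8192, slots)} tokens per slot; "
          f"keeping 8192 per user needs -c {total_context_for(8192, slots)}")
```

Since KV cache memory is proportional to the total context, the second number is the one to plug into a VRAM estimate.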
API Key --api-key
Clients send the key as a Bearer token in the Authorization header. Must match exactly. When set, unauthenticated requests return HTTP 401. Use when exposing llama-server beyond localhost.
--api-key "my-secret-key-here"
--api-key-file /run/secrets/llama_api_key # Load from file (safer)
Continuous Batching --cont-batching
--cont-batching # Default — good for multi-user serving
--no-cont-batching # Disable — strict sequential processing
Hybrid Overview
llama.cpp's killer feature is partial GPU offloading: you split the model's transformer layers between GPU VRAM and CPU RAM. Layers on GPU run fast (CUDA/cuBLAS). Layers on CPU run slower but allow running models far larger than your VRAM.
The model has N total layers. You set -ngl K. llama.cpp places the last K layers (N-K to N-1) in GPU VRAM and keeps the first N-K layers in CPU RAM with CPU compute. The placement of the embedding table and LM head can be controlled separately with -ot. Activations cross the PCIe bus at the split, which adds some overhead.
Speed Impact of Hybrid vs Full GPU
| Config | Tokens/sec (7B Q4_K_M, RTX 5070) | VRAM Used |
|---|---|---|
| Full GPU (-ngl 99) | ~80–110 tok/s | ~4.5 GB |
| Hybrid 20/28 layers GPU | ~35–55 tok/s | ~3.0 GB |
| Hybrid 10/28 layers GPU | ~20–30 tok/s | ~1.5 GB |
| Full CPU (-ngl 0) | ~5–12 tok/s | 0 GB (RAM only) |
VRAM Budgeting
Use this formula to estimate -ngl for your VRAM budget:
# VRAM per layer ≈ model_file_size / total_layers
# Total VRAM ≈ (layers_on_gpu × vram_per_layer) + kv_cache + overhead
# Example: Qwen2.5-14B Q4_K_M (8.9 GB file, 48 layers)
vram_per_layer = 8900 MB / 48 ≈ 185 MB / layer
# KV cache = 2 (K and V) × layers × kv_heads × head_dim × ctx × bytes_per_value
# Use the KV head count (8 with grouped-query attention), not the 40 attention heads
kv_cache_f16 = 2 × 48 × 8 × 128 × 8192 × 2 bytes ≈ 1600 MB (8K ctx)
overhead = ~500 MB (CUDA context, activations)
# For 12 GB VRAM:
available = 12000 - 1600 - 500 = 9900 MB for weights
max_layers = 9900 / 185 ≈ 53 → more than the model's 48 layers, so use -ngl 48
# With quantized KV cache (q8_0):
kv_cache_q8 ≈ 850 MB → available = 10650 MB → extra headroom for a longer context
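The budgeting steps above can be wrapped in a small helper. This is a sketch of the same approximation (weights split evenly across layers, fixed overhead), so it inherits the formula's roughness; the function name and the sweep of KV cache sizes are illustrative, and the result is clamped to the model's layer count.

```python
def estimate_ngl(vram_mb, file_mb, n_layers, kv_cache_mb, overhead_mb=500):
    """Estimate how many layers fit in VRAM after KV cache and overhead."""
    per_layer_mb = file_mb / n_layers
    available_mb = vram_mb - kv_cache_mb - overhead_mb
    if available_mb <= 0:
        return 0
    return min(n_layers, int(available_mb // per_layer_mb))

# 14B Q4_K_M example from the text: 8.9 GB file, 48 layers, 12 GB card.
for kv_cache_mb in (800, 1600, 3200, 6400):
    ngl = estimate_ngl(12000, 8900, 48, kv_cache_mb)
    print(f"KV cache {kv_cache_mb:>5} MB -> -ngl {ngl}")
```

Sweeping the KV cache size shows where the cache starts pushing layers off the GPU, which is the point at which a quantized cache or a smaller -c pays off. Confirm the final value with llama-bench, since real per-layer sizes are not perfectly uniform.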
Pre-calculated -ngl Values for Common Setups
| Model + Quant | GPU VRAM | ctx=2048 | ctx=8192 | ctx=32768 |
|---|---|---|---|---|
| 7B Q4_K_M | 12 GB | -ngl 99 ✅ | -ngl 99 ✅ | -ngl 99 (-fa -ctk q8_0) |
| 7B Q8_0 | 12 GB | -ngl 99 ✅ | -ngl 26 | -ngl 14 |
| 13B Q4_K_M | 12 GB | -ngl 40 ✅ | -ngl 32 | -ngl 20 |
| 14B Q4_K_M | 16 GB | -ngl 48 ✅ | -ngl 44 | -ngl 36 |
| 4B Q8_0 | 12 GB | -ngl 99 ✅ | -ngl 99 ✅ | -ngl 99 ✅ |
| 70B Q4_K_M | 12 GB | -ngl 14 | -ngl 10 | -ngl 6 |
Offload Recipes
# 14B hybrid: 30/48 layers on GPU, smaller context and batch to cut KV cache and peak VRAM,
# flash attention with q8_0 KV cache (halves KV VRAM), embedding (saves ~1.2 GB) and LM head on CPU,
# 8 threads for the CPU layers
./llama-server \
  -m ./Qwen2.5-14B-Instruct-Q4_K_M.gguf \
  -ngl 30 \
  -c 4096 \
  -b 512 \
  -fa \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  -ot "token_embd=CPU" \
  -ot "output=CPU" \
  -t 8 \
  --host 127.0.0.1 --port 8080 \
  --jinja -a "qwen14b"
# 70B mostly on CPU: only 10 of 80 layers on GPU, small context, 16 threads for decode and prefill,
# --no-mmap loads fully into RAM to avoid page faults, q4_0 KV cache for maximum compression
# Expect ~3-6 tok/s: PCIe transfers and CPU compute limit throughput
./llama-server \
  -m ./DeepSeek-R1-70B-Q4_K_M.gguf \
  -ngl 10 \
  -c 2048 \
  -t 16 \
  -tb 16 \
  -b 512 \
  --no-mmap \
  -fa \
  --cache-type-k q4_0 \
  --cache-type-v q4_0 \
  --host 127.0.0.1 --port 8080 \
  --jinja -a "deepseek-70b"
# 7B full GPU: all layers on GPU, 2 parallel slots for multiple users, auto-defrag of the KV cache
./llama-server \
  -m ./Qwen2.5-7B-Instruct-Q4_K_M.gguf \
  -ngl 99 \
  -c 8192 \
  -fa \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  -b 2048 \
  --cont-batching \
  -np 2 \
  --host 127.0.0.1 --port 8080 \
  --jinja \
  -a "qwen7b" \
  --defrag-thold 0.1
CLI Examples
# Interactive chat (GPU, Qwen2.5); -i = interactive, -if = interactive first
./llama-cli \
  -m ./Qwen2.5-7B-Instruct-Q4_K_M.gguf \
  -ngl 99 \
  -c 4096 \
  --temp 0.7 \
  --top-p 1.0 \
  --min-p 0.05 \
  --repeat-penalty 1.05 \
  --jinja \
  -i -if
# One-shot generation (pipe to output); -s = seed (reproducible output)
./llama-cli \
  -m ./model.gguf \
  -ngl 99 \
  -p "Explain ELSS mutual funds in 3 bullet points" \
  -n 256 \
  --temp 0.3 \
  --no-display-prompt \
  -s 42
# Benchmark to find optimal -ngl: tests multiple values, 3 repetitions per config
./llama-bench \
  -m ./Qwen2.5-7B-Instruct-Q4_K_M.gguf \
  -ngl 0,10,20,28 \
  -p 512 -n 128 \
  -r 3
Server Launch
#!/bin/bash
# launch_server.sh: parameterized llama-server launch
# Usage:
#   ./launch_server.sh
#   MODEL=./14B.gguf NGL=30 CTX=4096 ./launch_server.sh
MODEL="${MODEL:-./models/Qwen2.5-7B-Instruct-Q4_K_M.gguf}"
PORT="${PORT:-8080}"
CTX="${CTX:-8192}"
NGL="${NGL:-99}"
THREADS="${THREADS:-12}"
ALIAS="${ALIAS:-local-model}"
# --log-disable suppresses verbose logs; remaining output is copied to llama-server.log
./llama-server \
  -m "$MODEL" \
  -a "$ALIAS" \
  -ngl "$NGL" \
  -c "$CTX" \
  -t "$THREADS" \
  -tb "$THREADS" \
  -b 2048 \
  -fa \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --cont-batching \
  --defrag-thold 0.1 \
  --jinja \
  --host 127.0.0.1 \
  --port "$PORT" \
  --log-disable 2>&1 | tee llama-server.log
# Health check
curl http://localhost:8080/health
# List models
curl http://localhost:8080/v1/models | python3 -m json.tool
# Test completion
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "local-model",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is Section 80C?"}
    ],
    "temperature": 0.7,
    "max_tokens": 256
  }' | python3 -m json.tool
Python Client
from openai import OpenAI

# llama-server is OpenAI API-compatible
client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed",  # Required by SDK but ignored if no --api-key set
)

# ─── Non-streaming completion ─────────
response = client.chat.completions.create(
    model="local-model",
    messages=[
        {"role": "system", "content": "You are Arthavidya, an Indian finance expert."},
        {"role": "user", "content": "Explain ELSS tax saving mutual funds."},
    ],
    temperature=0.7,
    max_tokens=512,
    top_p=0.9,
    frequency_penalty=0.1,
    presence_penalty=0.05,
)
print(response.choices[0].message.content)

# ─── Streaming response ───────────────
stream = client.chat.completions.create(
    model="local-model",
    messages=[{"role": "user", "content": "Write a haiku about investing."}],
    stream=True,
    temperature=1.0,
    max_tokens=100,
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
import requests

def chat(messages: list, temperature: float = 0.7, max_tokens: int = 512) -> str:
    response = requests.post(
        "http://localhost:8080/v1/chat/completions",
        json={
            "model": "local-model",
            "messages": messages,
            "temperature": temperature,
            "max_tokens": max_tokens,
            "top_p": 0.9,
            "repeat_penalty": 1.05,  # llama.cpp native field
            "min_p": 0.05,           # llama.cpp extension
        },
        timeout=120,
    )
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"]

# ─── Example usage ────────────────────
answer = chat([
    {"role": "system", "content": "You are a helpful finance assistant."},
    {"role": "user", "content": "What is Section 80D deduction?"},
])
print(answer)

# ─── Check server info ────────────────
info = requests.get("http://localhost:8080/v1/models").json()
print(f"Loaded model: {info['data'][0]['id']}")
from langchain_openai import ChatOpenAI
from langchain_core.messages import SystemMessage, HumanMessage

# Point LangChain at local llama-server
llm = ChatOpenAI(
    model="local-model",
    base_url="http://localhost:8080/v1",
    api_key="not-needed",
    temperature=0.7,
    max_tokens=1024,
)
messages = [
    SystemMessage(content="You are an expert in Indian taxation."),
    HumanMessage(content="What is the difference between ELSS and PPF?"),
]
response = llm.invoke(messages)
print(response.content)
Config Files
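Instead of long command-line flags, llama-server can read its settings from a JSON config file via --config, which is easier to version-control and manage across environments. Confirm with llama-server --help that your build provides --config before relying on it, since not every release does.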
{
"model": "./models/Qwen2.5-7B-Instruct-Q4_K_M.gguf",
"alias": "qwen7b",
"n_gpu_layers": 99,
"ctx_size": 8192,
"batch_size": 2048,
"ubatch_size": 512,
"threads": 12,
"threads_batch": 12,
"flash_attn": true,
"cache_type_k": "q8_0",
"cache_type_v": "q8_0",
"cont_batching": true,
"defrag_thold": 0.1,
"jinja": true,
"host": "127.0.0.1",
"port": 8080,
"n_parallel": 1,
"temperature": 0.7,
"top_p": 1.0,
"min_p": 0.05,
"repeat_penalty": 1.05
}
{
"model": "./models/Qwen2.5-14B-Instruct-Q4_K_M.gguf",
"alias": "qwen14b",
"n_gpu_layers": 32,
"ctx_size": 4096,
"batch_size": 512,
"ubatch_size": 256,
"threads": 12,
"threads_batch": 16,
"flash_attn": true,
"cache_type_k": "q4_0",
"cache_type_v": "q4_0",
"tensor_split": "",
"override_tensor": "token_embd=CPU,output=CPU",
"no_mmap": false,
"jinja": true,
"host": "127.0.0.1",
"port": 8080
}
./llama-server --config config_7b_12gb.json
# Override specific values from config via CLI flags
./llama-server --config config_7b_12gb.json --port 8081 -c 16384
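If your llama-server build does not accept --config, the same JSON file can be expanded into ordinary flags by a small wrapper. The sketch below is one way to do that; the key-to-flag mapping is an assumption that pairs the JSON keys shown above with the flags from this handbook (plus -ts for tensor_split), and config_to_args is a hypothetical helper, not part of llama.cpp.

```python
import json
import shlex

# Keys that take a value, mapped to the CLI flag that carries it (assumed mapping).
FLAG_FOR_KEY = {
    "model": "-m", "alias": "-a", "n_gpu_layers": "-ngl", "ctx_size": "-c",
    "batch_size": "-b", "ubatch_size": "-ub", "threads": "-t", "threads_batch": "-tb",
    "cache_type_k": "--cache-type-k", "cache_type_v": "--cache-type-v",
    "defrag_thold": "--defrag-thold", "host": "--host", "port": "--port",
    "n_parallel": "-np", "override_tensor": "-ot", "tensor_split": "-ts",
    "temperature": "--temp", "top_p": "--top-p", "min_p": "--min-p",
    "repeat_penalty": "--repeat-penalty",
}
# Boolean keys that become a bare switch when true and are omitted when false.
SWITCH_FOR_KEY = {
    "flash_attn": "-fa", "cont_batching": "--cont-batching",
    "jinja": "--jinja", "no_mmap": "--no-mmap",
}

def config_to_args(config):
    """Translate a config dict into a list of llama-server arguments."""
    args = []
    for key, value in config.items():
        if key in SWITCH_FOR_KEY:
            if value:
                args.append(SWITCH_FOR_KEY[key])
        elif key in FLAG_FOR_KEY:
            if value != "":  # skip empty placeholders such as "tensor_split": ""
                args += [FLAG_FOR_KEY[key], str(value)]
        else:
            raise KeyError(f"no flag mapping for config key {key!r}")
    return args

config = json.loads("""{
  "model": "./models/Qwen2.5-14B-Instruct-Q4_K_M.gguf",
  "n_gpu_layers": 32, "ctx_size": 4096, "flash_attn": true,
  "cache_type_k": "q4_0", "tensor_split": "", "no_mmap": false, "port": 8080
}""")
print(shlex.join(["./llama-server", *config_to_args(config)]))
```

The printed command can be pasted into a shell or passed to subprocess.run as a list. Raising on unknown keys is deliberate, so a typo in the config fails loudly instead of being silently dropped.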
Cheat Sheet
All Major Parameters — Quick Reference
| Flag | Long Form | Type | Default | One-Line Summary |
|---|---|---|---|---|
| -m | --model | str | — | Path to GGUF file ← required |
| -a | --alias | str | — | Model name returned by API |
| -ngl | --n-gpu-layers | int | 0 | Layers offloaded to GPU VRAM |
| -c | --ctx-size | int | model | Context window in tokens |
| -t | --threads | int | cores | CPU threads for decode |
| -tb | --threads-batch | int | = -t | CPU threads for prefill |
| -b | --batch-size | int | 2048 | Logical batch tokens (prefill) |
| -ub | --ubatch-size | int | 512 | Physical micro-batch (CUDA) |
| -n | --predict | int | -1 | Max output tokens (-1=unlimited) |
| --temp | --temperature | float | 0.8 | Sampling randomness (0=greedy) |
| --top-p | — | float | 0.95 | Nucleus sampling cutoff |
| --top-k | — | int | 40 | Top-K token limit (0=off) |
| --min-p | — | float | 0.05 | Min probability relative to top |
| --repeat-penalty | — | float | 1.0 | Penalty for repeating tokens |
| --repeat-last-n | — | int | 64 | Context window for repeat check |
| -fa | --flash-attn | bool | off | Enable Flash Attention v2 |
| --cache-type-k | — | enum | f16 | K-cache dtype (f16/q8_0/q4_0) |
| --cache-type-v | — | enum | f16 | V-cache dtype (f16/q8_0/q4_0) |
| -ot | --override-tensor | str | — | Regex=device tensor placement |
| --mmap | --no-mmap | bool | on | Memory-mapped model loading |
| --mlock | — | bool | off | Lock model pages in RAM |
| --defrag-thold | — | float | -1 | KV cache defrag threshold |
| --host | — | str | 127.0.0.1 | Server bind address |
| --port | — | int | 8080 | Server port |
| --jinja | — | bool | off | Enable Jinja2 chat templates |
| --chat-template | — | str | — | Override chat template preset |
| -np | --parallel | int | 1 | Concurrent inference slots |
| --api-key | — | str | — | Bearer token for auth |
| --cont-batching | — | bool | on | Continuous batching (server) |
| -s | --seed | int | -1 | RNG seed (-1=random) |
Sampling Stack — Execution Order
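Raw logits from model
↓
Repeat penalty (--repeat-penalty)
↓
Top-K filtering (--top-k 0 to disable)
↓
Top-P / nucleus (--top-p)
↓
Min-P filtering (--min-p)
↓
Temperature scaling (--temp)
↓
Sample final token
This is the default order in llama.cpp, where penalties run first and temperature last; --samplers changes it.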
Quick Decision Guide
| Goal | Settings |
|---|---|
| Code / factual answers | --temp 0.1 --top-p 0.9 --top-k 0 --repeat-penalty 1.0 |
| General chat | --temp 0.7 --top-p 1.0 --min-p 0.05 --repeat-penalty 1.05 |
| Creative writing | --temp 1.1 --top-p 0.95 --min-p 0.02 --repeat-penalty 1.1 |
| Reproduce output | --temp 0 -s 42 (greedy + fixed seed) |
| Max VRAM savings | -fa --cache-type-k q4_0 --cache-type-v q4_0 -ot "token_embd=CPU,output=CPU" |
| Max speed (full GPU) | -ngl 99 -fa -b 4096 -ub 1024 --cont-batching |