llama.cpp Handbook

Getting Started

What is llama.cpp?

llama.cpp is a high-performance LLM inference engine written in pure C/C++, originally created by Georgi Gerganov and now maintained by the ggml-org community on GitHub. It runs GGUF-format models with CPU-only, GPU-only, or hybrid GPU+CPU execution — no Python, no CUDA runtime dependency for CPU mode, no HuggingFace stack required.

Its defining capability is GPU layer offloading: you load as many transformer layers as fit in GPU VRAM, and the remaining layers stay in system RAM and run on the CPU. This lets you run models larger than your VRAM by trading speed for capacity. As of mid-2026, the project no longer relies purely on manual layer math: the built-in --fit memory planner can do this automatically.

Core language: C++
Model format: GGUF
GitHub stars: 118K+
Architectures supported: 70+
Latest build (Jul 2026): b9976

🦎 vs Ollama / LM Studio

Ollama and LM Studio are frontends that wrap llama.cpp's ggml engine. If you use Ollama, llama.cpp's parameters map directly: Ollama passes them through to the underlying llama.cpp process. Knowing llama.cpp parameters therefore helps you understand what these tools are doing under the hood. All three (llama.cpp, Ollama, and LM Studio) now speak the Anthropic Messages API natively, making them drop-in local backends for Claude Code.

📛 Repository Moved: ggerganov/llama.cpp → ggml-org/llama.cpp

The canonical repository now lives at github.com/ggml-org/llama.cpp. Bookmarks, clone URLs, and Docker image tags (ghcr.io/ggml-org/llama.cpp) should be updated — the old ggerganov/llama.cpp path redirects but new releases, discussions, and issues live under the new org. Georgi Gerganov remains the original author and a core maintainer; the rename reflects the project's growth into a community-governed effort alongside the GGML tensor library.

Build & Install

bash: CUDA build (RTX 40/50, recommended)

# Clone latest (the repo now lives under the ggml-org GitHub org)
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp

# Build with CUDA: GGML_CUDA=ON enables GPU support
# Architectures: 89 = Ada (RTX 40xx), 120 = Blackwell (RTX 50xx)
cmake -B build \
  -DGGML_CUDA=ON \
  -DCMAKE_CUDA_ARCHITECTURES="89;120" \
  -DCMAKE_BUILD_TYPE=Release

cmake --build build --config Release -j$(nproc)

# Binaries land in ./build/bin/
ls build/bin/
# → llama-server  llama-cli  llama-bench  llama-quantize

bash: Pre-built release (fastest path)

# Download the latest pre-built binary from GitHub Releases
# Releases use a "b" build-number scheme (e.g. b9976), not semver
RELEASE=$(curl -s https://api.github.com/repos/ggml-org/llama.cpp/releases/latest | grep tag_name | cut -d'"' -f4)

# Linux x64 asset; asset names vary by backend, so check the release page for the CUDA variant you need
wget "https://github.com/ggml-org/llama.cpp/releases/download/${RELEASE}/llama-${RELEASE}-bin-ubuntu-x64.zip"
unzip llama-*.zip -d llama-cpp
export PATH="$PWD/llama-cpp:$PATH"

# Verify the binary runs
./llama-cpp/llama-server --version
# A CUDA-enabled build should report the detected GPU(s) in its startup output

bash: Docker (no local build required)

# Official images published under ghcr.io/ggml-org
docker run --gpus all -v ~/models:/models -p 8080:8080 \
  ghcr.io/ggml-org/llama.cpp:server-cuda \
  -hf ggml-org/gemma-3-4b-it-GGUF --fit -c 8192

# CPU-only tag if no GPU passthrough available
docker run -v ~/models:/models -p 8080:8080 \
  ghcr.io/ggml-org/llama.cpp:server -hf ggml-org/Qwen3-4B-GGUF

⚡ RTX 50 Series (Blackwell)

For RTX 5060 Ti / 5070 / 5090, set -DCMAKE_CUDA_ARCHITECTURES="120". Recent pre-built binaries (b9900+) ship with sm_120 kernels, so you generally no longer need to build from source purely for Blackwell support; build from source only if you need bleeding-edge kernel fusion or a backend combo not in the release matrix. Blackwell needs CUDA 12.8+ and a matching 570-series or newer NVIDIA driver.

Key Binaries

Binary	Purpose	When to Use
`llama-server`	HTTP API server (OpenAI- and Anthropic-compatible)	App integration, Claude Code, LangChain, anything that needs an API
`llama-cli`	Interactive CLI / single-shot inference	Quick testing, scripting, benchmarking prompts
`llama-bench`	Benchmark prompt & generation throughput	Finding optimal -ngl, -b, -t values for your GPU
`llama-quantize`	Convert/re-quantize GGUF models	Changing quant level (F16→Q4_K_M) locally
`llama-perplexity`	Evaluate model quality (PPL score)	Comparing quant levels for quality loss measurement
`llama-mtmd-cli`	Multimodal (vision/audio) CLI	Testing Qwen3-VL / Gemma-3-VL style models with mmproj

Pull Directly from Hugging Face

You no longer need to manually download GGUF files. The -hf (alias --hf-repo) flag resolves a Hugging Face repo, downloads the right quant, caches it locally, and — for vision models — auto-downloads the matching mmproj projector file. This is now the default way most people bootstrap a model with llama.cpp in 2026.

bash: Pulling models by repo id

# Pull a specific quant (recommended — be explicit)
llama-server -hf unsloth/Qwen3-4B-Instruct-2507-GGUF:Q4_K_M --fit

# Omit the quant tag and llama.cpp picks a sensible default (usually Q4_K_M)
llama-server -hf ggml-org/gemma-3-4b-it-GGUF --fit

# Vision model — mmproj file is auto-resolved and downloaded alongside
llama-server -hf ggml-org/Qwen3-VL-8B-Instruct-GGUF --fit

# Files land in the standard HF cache (~/.cache/llama.cpp or $HF_HOME)
# so re-runs are instant — no re-download

💡 Combine with --fit for a zero-math launch

The single most common 2026 launch command is now llama-server -hf <repo> --fit — pull the model, let the auto-fit planner figure out GPU layers and context size for your hardware, and start serving. See Auto-Fit Memory below.

Core Parameters

Model Loading

-m --model

alias: --model-path

string default: none server + cli

Path to the GGUF model file. Required parameter — the most basic thing you must specify. Can be a local file path or a URL (llama.cpp will download it).

-m ./models/qwen2.5-7b-instruct-Q4_K_M.gguf
-m "https://huggingface.co/bartowski/Qwen2.5-7B-Instruct-GGUF/resolve/main/Qwen2.5-7B-Instruct-Q4_K_M.gguf"

-a --alias

string server

Sets the model name returned by the API's /v1/models endpoint and used in chat completions. Clients like OpenAI SDK must specify this name in the model field of their request.

-a "qwen2.5-7b" # Client calls: {"model": "qwen2.5-7b", ...}

GPU Layers -ngl

-ngl --gpu-layers --n-gpu-layers

integer default: 0 (CPU only) server + cli

Number of transformer layers to load into GPU VRAM. Each layer occupies VRAM; layers not on GPU are computed on CPU. Setting -ngl 99 or -ngl 9999 offloads all layers (safe — clamps to model's actual layer count).

💡 Rule of Thumb

Most 7B models have 32 layers. 13B models have 40 layers. Qwen2.5-7B has 28 layers. If you set -ngl 20 on a 28-layer model, 20 go to GPU and 8 run on CPU — you get hybrid inference.

-ngl 99   # Fully GPU (all layers fit in VRAM)
-ngl 20   # 20 layers GPU, rest CPU (hybrid for VRAM-tight setups)
-ngl 0    # Pure CPU inference

Layer Count by Model

Model	Layers	12 GB -ngl	16 GB -ngl	Notes
Qwen2.5-7B Q4_K_M	28	-ngl 28 ✅	-ngl 28 ✅	Full GPU, ~4.1 GB VRAM
Qwen2.5-7B Q8_0	28	-ngl 20	-ngl 28 ✅	Q8 = ~7.7 GB, tight on 12 GB
Llama-3.1-8B Q4_K_M	32	-ngl 32 ✅	-ngl 32 ✅	~4.6 GB VRAM
Qwen2.5-14B Q4_K_M	48	-ngl 24	-ngl 35	~8 GB for 24 layers
Mistral-7B Q5_K_M	32	-ngl 32 ✅	-ngl 32 ✅	~5.1 GB VRAM
Phi-4 14B Q4_K_M	40	-ngl 20	-ngl 30	~9.0 GB for 30 layers
DeepSeek-R1 7B Q4_K_M	28	-ngl 28 ✅	-ngl 28 ✅	Same as Qwen-7B class

Context Size -c

-c --ctx-size --context-size

integer default: model default (often 4096) server + cli

Sets the context window size in tokens, which also determines the size of the KV cache. This is the total token budget for prompt + generation. Larger contexts consume more VRAM: the KV cache grows linearly with context length, while attention compute grows roughly quadratically with it.

⚠️ VRAM Cost of Context

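KV cache VRAM ≈ 2 × num_layers × num_kv_heads × head_dim × ctx_size × dtype_bytes, where num_kv_heads equals num_heads only for models without grouped-query attention (GQA). The cost depends on the cache type, not on the weight quantization. For a typical 7B-class model with an f16 cache, adding 4096 ctx costs roughly 0.5–1 GB, and 32768 ctx roughly 4–6 GB extra; models with fewer KV heads need proportionally less. Use the minimum context you need.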

-c 2048    # Conservative: saves VRAM, good for chat
-c 8192    # Balanced: suits most tasks
-c 32768   # Full long-context: Qwen2.5 supports up to 128K
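
As a rough sketch of the arithmetic behind the VRAM cost formula, the helper below estimates KV cache size in MiB. The function name kv_cache_mib and the two model shapes are assumptions chosen for illustration (a 28-layer model with 4 KV heads, and a 32-layer model with 32 KV heads); they are not read from any GGUF file, and real usage also depends on the cache type and backend overhead.

```shell
#!/usr/bin/env bash
# kv_cache_mib LAYERS KV_HEADS HEAD_DIM CTX BYTES_PER_ELEMENT
# Estimate: 2 (K and V) x layers x kv_heads x head_dim x ctx x bytes, reported in MiB.
kv_cache_mib() {
  local layers=$1 kv_heads=$2 head_dim=$3 ctx=$4 bytes=$5
  echo $(( 2 * layers * kv_heads * head_dim * ctx * bytes / 1024 / 1024 ))
}

# Hypothetical shape A: 28 layers, 4 KV heads (GQA), head_dim 128, f16 cache (2 bytes)
for ctx in 2048 8192 32768; do
  echo "GQA shape, ctx=${ctx}: $(kv_cache_mib 28 4 128 "$ctx" 2) MiB"
done

# Hypothetical shape B: 32 layers, 32 KV heads (no GQA), head_dim 128, f16 cache
echo "Non-GQA shape, ctx=8192: $(kv_cache_mib 32 32 128 8192 2) MiB"
```

Models with grouped-query attention keep far fewer KV heads than attention heads, which is why the first shape needs roughly a ninth of the memory of the second at the same context size.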

Context vs VRAM at Different Sizes (7B Q4_K_M)

Component	Extra VRAM	Total of 12 GB	Notes
Model weights	~4.1 GB (fixed)	34%	Always loaded; constant regardless of context
KV cache @ ctx=2048	~0.5 GB	38%	Safe for 12 GB
KV cache @ ctx=8192	~2 GB	~51%	Comfortable
KV cache @ ctx=32768	~7 GB	~92%	Very tight on 12 GB; reduce -ngl or use Q4 cache

Threads -t / -tb

-t --threads

CPU threads for generation (token-by-token decode)

integer default: physical core count server + cli

Number of CPU threads used for the decode phase (generating tokens one by one). Only matters for layers running on CPU. If all layers are on GPU (-ngl 99), this has minimal effect.

💡 Best Practice

Set to physical cores (not hyperthreads). E.g., for a 12-core/24-thread CPU, use -t 12. Using too many threads causes context-switching overhead that slows inference.

-t 12 # 12 physical cores for CPU decode layers

-tb --threads-batch

CPU threads for prompt processing (prefill phase)

integer default: same as -t server + cli

Threads for the prefill/prompt-processing phase (processing the full input prompt in parallel). Can often be set higher than -t since prefill is more parallelizable. Separate from -t to let you tune each phase independently.

-t 8 -tb 16 # 8 threads decode, 16 threads prefill

Batch Size -b / -ub

-b --batch-size

integer default: 2048 server + cli

Maximum tokens processed in a single batch during the prompt ingestion (prefill) phase. Larger batch = faster prompt processing but more VRAM. For short prompts, this barely matters.

Memory Tip

On 12 GB VRAM with a 13B model, reduce to -b 512 to free VRAM during prompt processing. The speed difference for short prompts is negligible.

-b 512    # Reduced for VRAM-tight setups
-b 4096   # Aggressive: maximize prompt throughput if VRAM allows

-ub --ubatch-size

integer default: 512 server + cli

Physical micro-batch size — how many tokens are actually computed in one CUDA kernel call during prefill. Must be ≤ -b. Smaller = less VRAM peak usage. Larger = more GPU parallelism. Tune this if you're getting OOM during long-prompt processing.

-b 2048 -ub 256 # Large logical batch, small physical batch (OOM prevention)

Sampling Parameters

Temperature --temp

--temp --temperature

float default: 0.8 server + cli

Divides the logits before softmax, controlling randomness. Lower = more deterministic, higher = more creative/varied. Temperature=0 is greedy decoding (always picks highest probability token).

📊 Quick Reference

0.0 = deterministic · 0.1–0.3 = factual/code · 0.7–0.9 = balanced chat · 1.0–1.3 = creative writing · 2.0+ = chaotic

--temp 0.1   # Code generation / factual Q&A
--temp 0.7   # General chat (sweet spot)
--temp 1.2   # Creative writing / brainstorming
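
To make the effect concrete, here is a small sketch that divides a made-up set of three logits by a temperature and prints the resulting softmax probabilities. The helper name softmax_temp and the logit values are illustrative assumptions; llama.cpp's own sampler is implemented in C++ and is not shown here.

```shell
#!/usr/bin/env bash
# softmax_temp TEMP LOGIT...
# Divides each logit by TEMP (must be > 0), then applies a numerically stable softmax.
softmax_temp() {
  local temp=$1; shift
  printf '%s\n' "$@" | awk -v t="$temp" '
    { z[NR] = $1 / t; if (NR == 1 || z[NR] > max) max = z[NR] }
    END {
      for (i = 1; i <= NR; i++) { e[i] = exp(z[i] - max); sum += e[i] }
      for (i = 1; i <= NR; i++) printf "%.3f%s", e[i] / sum, (i < NR ? " " : "\n")
    }'
}

softmax_temp 1.0 2 1 0   # baseline distribution
softmax_temp 0.5 2 1 0   # lower temperature: the top token gains probability mass
```

With these sample logits, the top token's probability rises from about 0.67 at temperature 1.0 to about 0.87 at 0.5, which is the "more deterministic" behavior described above.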

Top-P --top-p (Nucleus Sampling)

--top-p

nucleus sampling — cumulative probability cutoff

float [0.0–1.0] default: 0.95 server + cli

Restricts sampling to the smallest set of tokens whose cumulative probability exceeds p. At 0.95, only the top-probability tokens summing to 95% of the distribution are candidates. Eliminates low-probability "tail" tokens that cause incoherence.

💡 Interaction with Temperature

Top-P and Temperature work together. Temperature reshapes the distribution, then Top-P samples from it. For factual tasks: low temp + low top-p. For creative: high temp + high top-p or use Min-P instead (often better).

--top-p 0.9   # Standard: good quality/diversity balance
--top-p 1.0   # Disabled: full distribution (rely on temp alone)
--top-p 0.5   # Very conservative: only high-confidence tokens
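
The nucleus cutoff can be sketched in a few lines of shell. The helper top_p_filter and the probability values are hypothetical; the sketch sorts the candidates from most to least likely and keeps tokens until their cumulative probability reaches p, which is a simplified reading of the rule rather than llama.cpp's actual code.

```shell
#!/usr/bin/env bash
# top_p_filter P PROB...
# Keeps the most likely tokens until their cumulative probability reaches P.
top_p_filter() {
  local p=$1; shift
  printf '%s\n' "$@" | sort -rn | awk -v p="$p" '
    stop == 0 {
      kept = kept (kept == "" ? "" : " ") $1
      cum += $1
      if (cum >= p) stop = 1
    }
    END { print kept }'
}

top_p_filter 0.9 0.5 0.3 0.15 0.05   # the 0.05 tail token is dropped
top_p_filter 0.5 0.5 0.3 0.15 0.05   # only the top token survives
```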

Top-K --top-k

--top-k

keep only top K tokens before sampling

integer default: 40 server + cli

Keeps only the top K highest-probability tokens and zeros out all others before sampling. Simpler than Top-P. Setting --top-k 0 disables it (no hard cutoff). Applied before Top-P in the sampling chain.

When to use Top-K

Top-K is a blunter instrument than Top-P — it doesn't care about probability mass, just rank. Generally, use Top-P (or Min-P) instead of Top-K for modern models. If you use both, K is applied first.

--top-k 0    # Disabled: recommended when using top-p
--top-k 40   # Classic default
--top-k 1    # Greedy (same as --temp 0)
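
For comparison, a rank-based cutoff looks like this in shell. The helper top_k_filter and the sample probabilities are assumptions for illustration only; a K of 0 is treated as "no cutoff", mirroring the flag's documented behavior.

```shell
#!/usr/bin/env bash
# top_k_filter K PROB...
# Keeps the K most likely tokens by rank alone; K <= 0 keeps everything.
top_k_filter() {
  local k=$1; shift
  if [ "$k" -le 0 ]; then
    printf '%s\n' "$@" | sort -rn | paste -sd' ' -
  else
    printf '%s\n' "$@" | sort -rn | head -n "$k" | paste -sd' ' -
  fi
}

top_k_filter 2 0.5 0.3 0.15 0.05   # keeps the two highest-ranked tokens
top_k_filter 0 0.5 0.3 0.15 0.05   # disabled: all four remain
```

Because the cutoff ignores probability mass, K=2 keeps two tokens whether the distribution is sharply peaked or nearly flat.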

Min-P --min-p

--min-p

minimum probability relative to top token

float [0.0–1.0] default: 0.05 server + cli

Keeps tokens whose probability is at least min_p × (probability of top token). Scales dynamically with confidence: when the model is very confident, fewer tokens pass; when uncertain, more pass. Often superior to Top-P for modern GGUF models.

🌟 Recommended Modern Sampler

Many llama.cpp users replace Top-P entirely with Min-P (set --top-p 1.0, --min-p 0.05). Min-P naturally adapts to model confidence, producing more coherent high-confidence output and appropriately creative low-confidence output.

--top-p 1.0 --min-p 0.05 --temp 0.8 # Modern recommended combo
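
The way Min-P adapts to model confidence is easiest to see on two toy distributions. In this sketch the helper min_p_filter and both probability lists are made up for illustration; the rule implemented is the one stated above, keeping tokens with probability at least min_p times the top token's probability.

```shell
#!/usr/bin/env bash
# min_p_filter MIN_P PROB...
# Keeps tokens whose probability is >= MIN_P x (probability of the top token).
min_p_filter() {
  local min_p=$1; shift
  printf '%s\n' "$@" | sort -rn | awk -v m="$min_p" '
    NR == 1 { top = $1 }
    $1 >= m * top { kept = kept (kept == "" ? "" : " ") $1 }
    END { print kept }'
}

min_p_filter 0.05 0.90 0.05 0.03 0.02   # confident: threshold is 0.045, two tokens pass
min_p_filter 0.05 0.30 0.25 0.20 0.02   # uncertain: threshold is 0.015, all four pass
```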

Repeat Penalty --repeat-penalty

--repeat-penalty

also: --repeat-last-n --presence-penalty --frequency-penalty

float default: 1.0 (disabled) server + cli

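Penalizes tokens that have appeared in the recent context. Values >1.0 discourage repetition; <1.0 encourages it. The penalty is applied to the logits of those tokens: positive logits are divided by the penalty and negative logits are multiplied by it, so with a penalty above 1.0 both move toward lower probability. This helps prevent the model from looping over the same phrases.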

Related Parameters

--repeat-last-n N — how many previous tokens to check for repeats (default 64, -1 = full context)
--presence-penalty P — OpenAI-style: flat penalty per token present in context
--frequency-penalty P — OpenAI-style: penalty proportional to how often token appeared

--repeat-penalty 1.1 --repeat-last-n 128         # Gentle anti-repetition
--repeat-penalty 1.0                             # Disabled (default)
--presence-penalty 0.2 --frequency-penalty 0.1   # OpenAI-style
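
The divide-or-multiply rule is worth seeing on actual numbers. This sketch applies a penalty to a hypothetical pair of logits belonging to tokens that already appeared in the penalty window; the helper name apply_repeat_penalty and the values are illustrative assumptions.

```shell
#!/usr/bin/env bash
# apply_repeat_penalty PENALTY LOGIT...
# Positive logits are divided by PENALTY, non-positive logits are multiplied by it.
apply_repeat_penalty() {
  local penalty=$1; shift
  printf '%s\n' "$@" \
    | awk -v p="$penalty" '{ v = ($1 > 0) ? $1 / p : $1 * p; printf "%.2f\n", v }' \
    | paste -sd' ' -
}

apply_repeat_penalty 1.1 2.2 -1.0   # both logits end up lower than before
apply_repeat_penalty 1.0 2.2 -1.0   # penalty 1.0 leaves logits unchanged
```

Simply dividing every logit would make a negative logit less negative, which would reward the repeated token; treating the two signs differently avoids that.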

Max Tokens -n

-n --predict --n-predict

integer default: -1 (unlimited) server + cli

Maximum number of tokens to generate. -1 means generate until the model produces an EOS token or the context is full. In server mode, clients can override this per-request via max_tokens in the API payload.

-n 512   # Cap at 512 output tokens
-n -1    # Let model decide when to stop
-n -2    # Generate until context full (useful for completion tasks)

DRY / XTC / Top-N-Sigma

The default sampler chain shipped by llama.cpp evolved considerably through 2025–2026. Temperature/top-p/top-k/min-p alone struggle with two specific failure modes: repetition loops in long generations, and low-probability "creative" tokens getting starved by aggressive truncation. Three newer samplers address this and are now part of the standard chain, although XTC and Top-N-Sigma stay inactive until you set their flags.

Sampler	Flag	Default	What It Does
DRY Don't Repeat Yourself	`--dry-multiplier`	0.8	Detects repeated n-gram sequences in the recent context and applies an escalating penalty, breaking loops that fixed repeat-penalty can't catch (e.g. repeating whole sentences, not just tokens).
XTC Exclude Top Choices	`--xtc-probability`	0 (off unless set)	Probabilistically removes the single most-likely token (above a threshold) to force the model off the most predictable/boring continuation — improves creative writing diversity without going fully random.
Top-N-Sigma	`--top-nsigma`	-1 (off)	Statistical filter that keeps only tokens within N standard deviations of the top logit — more robust than top-p at very high temperatures, prevents "temperature melting" incoherence.

text: Default sampler chain order (b9900+)

penalties → dry → top_n_sigma → top_k → typ_p → top_p → min_p → xtc → temperature

# Order matters: penalties and DRY run first (on the raw logit set),
# truncation samplers narrow the pool, temperature is applied last.
# Override the whole chain with --samplers "top_k;top_p;temp" if you want
# the old-style simple pipeline.

✅ Recommended starting point (2026)

--temp 0.7 --top-k 40 --top-p 0.9 --min-p 0.05 --dry-multiplier 0.8 --dry-base 1.75 --dry-allowed-length 2 — DRY at its default multiplier is safe to leave on for almost any workload. Only enable XTC (--xtc-probability 0.15 --xtc-threshold 0.1) for creative writing/roleplay; leave it off for code and factual Q&A, where you want the single best token to actually win.

Performance

Flash Attention -fa

-fa --flash-attn [on|off|auto]

enum default: auto (was: off) server + cli

Enables Flash Attention v2 — a memory-efficient attention algorithm that avoids materializing the full N×N attention matrix. Instead it computes attention in tiles, dramatically reducing VRAM usage during attention computation. As of the 2026 releases, --flash-attn takes an explicit on|off|auto value and now defaults to auto — llama.cpp enables it automatically whenever the backend and model support it, rather than requiring you to opt in.

✅ Benefits

• Reduces VRAM by 30–60% for attention computation
• Enables larger contexts without OOM
• Often 20–40% faster on GPU (RTX 40/50)
• Required for long context (32K+) on 12 GB

⚠️ Caveats

• auto can still silently fall back to off on unsupported backend/quant combos — check llama-server --verbose startup log
• Must pair with compatible --cache-type-k/v settings
• Best on Ampere+ (RTX 30xx and newer)

-fa on                       # Force-enable Flash Attention
-fa auto -c 32768            # Default: auto-detect + long context
-fa on --cache-type-k q8_0   # Flash Attn + quantized KV cache

Cache Type K/V --cache-type-k / --cache-type-v

--cache-type-k --cache-type-v

quantize the KV cache to reduce VRAM usage

enum default: f16 server + cli

Controls the data type of the Key and Value caches in the attention mechanism. The KV cache grows linearly with context length and can consume significant VRAM. Quantizing it to Q8 or Q4 can save 50–75% of KV cache VRAM with minimal quality loss.

⚠️ Requires Flash Attention

KV cache quantization (q8_0, q4_0, q5_0) requires Flash Attention to be active (-fa on, or -fa auto on a backend that supports it). Without Flash Attention, only f16 and f32 work.

Available Cache Types

Type	Bits	VRAM (vs f16)	Quality	Requires -fa?
`f32`	32-bit	2× (more)	Perfect	No
`f16`	16-bit	1× (baseline)	Excellent	No
`q8_0`	8-bit	~0.5×	Near-lossless	Yes
`q5_0`	5-bit	~0.35×	Very good	Yes
`q4_0`	4-bit	~0.25×	Good	Yes

-fa on --cache-type-k q8_0 --cache-type-v q8_0   # Best quality + VRAM savings
-fa on --cache-type-k q4_0 --cache-type-v q4_0   # Maximum VRAM savings (long ctx)
--cache-type-k f16 --cache-type-v f16            # Default: no Flash Attn needed

🎯 12 GB VRAM — Recommended KV Cache Config

For a 7B Q4_K_M model with 32K context on 12 GB: use -fa on --cache-type-k q8_0 --cache-type-v q8_0. This keeps KV cache at ~2 GB instead of ~4 GB at f16, leaving room for weights + activations. For 13B models at 8K context: -fa on --cache-type-k q4_0 --cache-type-v q4_0.

Tensor Override -ot

-ot --override-tensor

fine-grained control: which tensors go to GPU vs CPU

string (regex=device) default: none server + cli

Overrides where specific tensors are placed, using a regex pattern matched against tensor names. Format: PATTERN=DEVICE, where the device is a backend buffer type name such as CPU or a per-GPU name (for example CUDA0, CUDA1 on CUDA builds; the exact names depend on the backend). This is the most surgical tool for fitting models into tight VRAM budgets.

How it interacts with -ngl

-ngl is a blunt instrument — it offloads N complete layers to GPU. -ot is surgical — you can keep specific tensor types on CPU even within layers that -ngl assigned to GPU. Very useful for keeping embedding tables on CPU RAM (they're large and rarely the bottleneck).

Common Tensor Name Patterns

Pattern	Matches	VRAM Impact
`blk\.\d+\.attn`	All attention weights	~40% of layer VRAM
`blk\.\d+\.ffn`	All FFN/MLP weights	~60% of layer VRAM
`token_embd`	Embedding table	Large (vocab × dim)
`output`	LM head / output weights	Same as embeddings
`blk\.[2-9][0-9]`	Layers 20–99	Selective layer control

-ot "token_embd=CPU"                            # Keep embedding on CPU RAM (saves ~1-2GB VRAM)
-ot "output=CPU"                                # Keep LM head on CPU
-ot "blk\.3[2-9]\.=CPU"                         # Layers 32-39 to CPU, rest GPU
-ot "token_embd=CPU" -ot "output=CPU" -ngl 99   # Hybrid: all compute layers GPU, I/O on CPU
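
Before committing to an -ot pattern, it can help to dry-run the regex against tensor names. The sketch below uses grep -E on a hand-written sample list that follows the usual GGUF naming scheme. The list, the helper name match_tensors, and the assumption that patterns are applied as unanchored searches (as grep does) are all illustrative. Note that grep -E does not understand \d, so [0-9] is used in its place.

```shell
#!/usr/bin/env bash
# Hypothetical tensor names in GGUF style (not read from a real model file).
tensors='token_embd.weight
blk.0.attn_q.weight
blk.0.attn_output.weight
blk.0.ffn_up.weight
blk.12.ffn_gate_exps.weight
blk.35.ffn_down.weight
output_norm.weight
output.weight'

# match_tensors PATTERN: print the sample names an unanchored regex search would hit.
match_tensors() {
  printf '%s\n' "$tensors" | grep -E -- "$1"
}

echo "--- output";              match_tensors 'output'
echo "--- ^output\\.";          match_tensors '^output\.'
echo "--- blk\\.3[2-9]\\.";     match_tensors 'blk\.3[2-9]\.'
echo "--- expert tensors";      match_tensors 'ffn_(up|down|gate)_exps'
```

In this sample, a bare output pattern also catches the attn_output and output_norm tensors, so anchoring it (for example ^output\.) may be closer to what you intend.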

MoE Expert Offloading

Sparse Mixture-of-Experts (MoE) models — Qwen3-Coder, GLM-4.7-Air, DeepSeek-V3-class, gpt-oss — dominated the 2025–2026 open-weight landscape. They have huge total parameter counts but only activate a small fraction per token. The FFN "expert" tensors are by far the largest part of the file, yet they're exactly the tensors you can most safely leave on CPU RAM, because each token only touches a couple of experts per layer regardless of where they live.

bash: Standard MoE offload pattern (2026 idiom)

# Keep attention + shared/dense layers on GPU (always active, latency-critical)
# Push the sparse FFN expert tensors to CPU RAM (rarely all active at once)
llama-server -hf unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF:Q4_K_XL \
  -ngl 99 \
  --override-tensor "\.ffn_(up|down|gate)_exps\.=CPU" \
  -c 32768 -fa on

# Shorthand: many recipes now write this as --override-tensor 'exps=CPU'
# or the newer --cpu-moe / --n-cpu-moe N convenience flags that do the
# same regex under the hood without hand-writing it
llama-server -hf ggml-org/gpt-oss-20b-GGUF -ngl 99 --n-cpu-moe 999 --fit

Why this works so well

A dense 30B model needs all 30B parameters resident wherever compute happens. A 30B-A3B MoE model only activates ~3B parameters per token — so CPU RAM bandwidth (even DDR5 dual-channel) is enough to stream the rarely-touched expert weights, while the GPU handles attention and the always-on shared experts at full speed. This is why a 12–16 GB GPU can now serve MoE models that would be impossible to fit as a dense checkpoint.

📐 Tuning the split further

Use --n-cpu-moe N to keep only the expert tensors of the first N layers on CPU rather than all of them, a middle ground when you have some spare VRAM. Combine with llama-bench to sweep values and find the throughput/VRAM sweet spot for your card. On multi-GPU rigs, pair with --tensor-split to spread the GPU-resident portion across cards.

Auto-Fit Memory --fit

--fit [on|off] --fit-margin -ngl -1

enum default: on for -hf pulls server + cli

--fit is the 2026-era answer to "how many layers do I offload?" It inspects your available VRAM (and system RAM) at startup and automatically computes GPU layer count, batch size, and context size that will fit without OOM — using -ngl -1 as a sentinel meaning "let the fit planner decide" rather than "offload everything." It's the single biggest quality-of-life change to hit llama.cpp's memory story since -ot was introduced.

✅ When to use it

• First launch of any new model/quant combo
• Shared/multi-tenant boxes where free VRAM varies
• Router mode, where multiple models compete for the same GPU
• You just want it to work without spreadsheet math

⚠️ When to go manual

• Squeezing the absolute last MB out of a fixed, known box
• Benchmarking — you want deterministic, repeatable layer counts
• Combined with hand-tuned -ot MoE expert patterns (set those explicitly; --fit won't override them)

llama-server -hf ggml-org/gemma-3-12b-it-GGUF --fit         # Zero-math launch
llama-server -m model.gguf -ngl -1 --fit --fit-margin 512   # Leave 512MB headroom
llama-server -m model.gguf --fit off -ngl 35                # Disable, go fully manual

🧮 See it work: memory breakdown at startup

Launch with --verbose and llama.cpp prints a llama_memory_breakdown_print table showing exactly how VRAM/RAM was allocated per tensor category (weights, KV cache, compute buffer, overhead) — invaluable for understanding what --fit actually decided, and for hand-tuning from there if you need to go further.

Memory Map --mmap / --mlock / --no-mmap

--mmap --no-mmap --mlock

flags default: --mmap enabled server + cli

--mmap (default): model file is memory-mapped — pages are loaded from disk on demand. Fast startup, but first inference may be slower as pages fault in. --no-mmap: entire model loaded into RAM eagerly at startup. Slower launch, but consistent inference speed. --mlock: pins RAM pages so they can't be swapped to disk — prevents latency spikes but requires OS permissions.

--no-mmap        # Load fully into RAM: best for repeated inference, no page faults
--mmap --mlock   # Map + lock in RAM: fast and no swap risk

KV Cache Defrag --defrag-thold

--defrag-thold

float [0.0–1.0] default: -1.0 (disabled) server

When the KV cache fragmentation ratio exceeds this threshold, llama-server automatically defragments it. Fragmentation occurs in server mode when multiple concurrent requests start and end at different times, leaving gaps in the KV cache. Setting 0.1 means "defrag when 10%+ of cache is fragmented".

💡 When to Enable

Enable in production server deployments with many concurrent users. For single-user local use, leave disabled. A threshold of 0.1–0.2 is a good starting point.

--defrag-thold 0.1 # Defrag when 10% fragmentation detected

Server Mode

Host & Port

--host --port

string / integer default: 127.0.0.1 / 8080 server only

Network interface and port the HTTP server binds to. Default 127.0.0.1 means localhost-only (safe). Use 0.0.0.0 to expose to all network interfaces — use with caution and add --api-key if exposed externally.

--host 127.0.0.1 --port 8080   # Localhost only (default)
--host 0.0.0.0 --port 8080     # All interfaces: accessible on LAN

Jinja Templates --jinja

--jinja

also: --chat-template --chat-template-file

flag / string default: disabled server

--jinja enables Jinja2-based chat template processing — the same system HuggingFace uses for chat formatting. When enabled, llama-server applies the model's built-in chat_template from the GGUF metadata to format messages. This is critical for correct instruction-following behavior.

Why this matters

Each model family (Qwen, Llama, Mistral, Phi) uses a different special token format for system/user/assistant turns. Without the correct template, the model receives malformed input and produces poor outputs. --jinja applies the template automatically from model metadata.

--chat-template — Override template manually

Override the built-in chat template with a named preset or custom Jinja2 string. Useful when GGUF metadata has wrong/missing template.

Template Preset	Use For
`qwen2`	Qwen 2.x / 2.5 models
`llama3`	Llama 3.x, Meta models
`mistral`	Mistral / Mixtral
`chatml`	ChatML format (many fine-tunes)
`gemma`	Google Gemma
`phi3`	Microsoft Phi-3/4
`deepseek2`	DeepSeek V2/R1

--jinja                               # Use template from GGUF metadata
--chat-template qwen2                 # Force Qwen2 template
--chat-template-file my_template.j2   # Load custom Jinja2 file

Parallel Slots -np

-np --parallel --n-parallel

integer default: 1 server

Number of concurrent inference slots — how many requests can be processed simultaneously. Each slot reserves a portion of the KV cache. VRAM cost scales linearly: total_kv_cache = ctx_size × n_parallel.

💡 For Local Single-User Use

Keep -np 1 (default) for best single-request latency. Increase only for multi-user serving. With -np 4 and ctx=4096, you need 4× the KV cache VRAM.

-np 1   # Single user: best latency
-np 4   # 4 concurrent users: 4× KV VRAM cost

API Key --api-key

--api-key --api-key-file

string default: none (open access) server

Requires clients to pass a Bearer token in the Authorization header. Must match exactly. When set, unauthenticated requests return HTTP 401. Use when exposing llama-server beyond localhost.

--api-key "my-secret-key-here"
--api-key-file /run/secrets/llama_api_key   # Load from file (safer)

Continuous Batching --cont-batching

--cont-batching --no-cont-batching

flag default: enabled server

Enables continuous batching — new requests are inserted into the inference pipeline mid-generation without waiting for current requests to finish. Dramatically improves GPU utilization under concurrent load. Enabled by default in llama-server. Disable only for debugging or strict FIFO order requirements.

--cont-batching      # Default: good for multi-user serving
--no-cont-batching   # Disable: strict sequential processing

Anthropic Messages API /v1/messages

/v1/messages /v1/messages/count_tokens

endpoint shipped ~Jan 2026 server

llama-server now speaks Anthropic's Messages API alongside the long-standing OpenAI-compatible /v1/chat/completions route. This means you can point Claude Code or any Anthropic-SDK-based client straight at a local llama.cpp instance by overriding the base URL — no proxy or translation shim required.

bash: Point Claude Code at a local model

# Start the server with an Anthropic-capable chat model
llama-server -hf unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF --fit \
  -c 65536 --jinja --host 0.0.0.0 --port 8080

# Point any Anthropic-SDK client (incl. Claude Code) at it
export ANTHROPIC_BASE_URL="http://localhost:8080"
export ANTHROPIC_API_KEY="not-needed-unless---api-key-set"

# Token counting also works pre-flight, same as api.anthropic.com
curl http://localhost:8080/v1/messages/count_tokens \
  -H "content-type: application/json" \
  -d '{"model":"local","messages":[{"role":"user","content":"hi"}]}'

# Send a message through the Anthropic-style route
curl http://localhost:8080/v1/messages \
  -H "content-type: application/json" \
  -d '{"model":"local","max_tokens":256,"messages":[{"role":"user","content":"Explain -ngl"}]}'

🔌 Works alongside, not instead of, OpenAI compatibility

The OpenAI-style /v1/chat/completions, /v1/completions, and /v1/embeddings routes remain fully supported — the Anthropic route is additive. Tool/function calling, reasoning-format handling (--reasoning-format), and streaming work through both API shapes on the same running server.

Router Mode --models-dir

--models-dir --models-max --models-preset

path / int default: single-model mode server

A single llama-server process can now host multiple models and load/unload them on demand, instead of one process per model. Point it at a directory of GGUF files (or a presets file), cap how many stay resident at once with --models-max, and the server evicts the least-recently-used model automatically when a request for a new one arrives and the cap is hit.

llama-server --models-dir ~/models --models-max 2 --fit --port 8080
curl http://localhost:8080/models   # List available + currently loaded models
curl localhost:8080/v1/chat/completions -d '{"model":"qwen3-coder-30b", ...}'   # Triggers load-on-demand

🖥️ Why this matters for local dev boxes

Before router mode, running a coding model and an embedding model simultaneously meant two llama-server processes fighting over the same GPU with fixed memory reservations. Router mode lets one process time-share the GPU across several models, freeing VRAM from whichever model was used least recently. Pairs naturally with --sleep-idle-seconds below.

Idle Sleep --sleep-idle-seconds

--sleep-idle-seconds N

integer (seconds) default: disabled server

Frees GPU/CPU memory for a loaded model after it has been idle for N seconds, without killing the server process — the next request simply pays a reload cost instead of holding VRAM hostage 24/7. Health and metadata probes (/health, /props) do not reset the idle timer, so monitoring traffic won't keep a model artificially awake.

llama-server -hf ggml-org/gemma-3-12b-it-GGUF --fit --sleep-idle-seconds 600   # Sleep after 10 min idle
llama-server --models-dir ~/models --models-max 3 --sleep-idle-seconds 300     # Router mode + auto-sleep

⚠️ Trade-off

The first request after a model sleeps pays the full model-load latency again (seconds, depending on size and --mmap/disk speed). Use this for dev boxes and shared/multi-user setups where VRAM is contended, not for latency-sensitive production endpoints expecting a hot model at all times.

GPU/CPU Hybrid

Hybrid Overview

llama.cpp's killer feature is partial GPU offloading: you split the model's transformer layers between GPU VRAM and CPU RAM. Layers on GPU run fast (CUDA/cuBLAS). Layers on CPU run slower but allow running models far larger than your VRAM.

⚡ How Hybrid Works

The model has N total layers and you set -ngl K: layers 0 to K-1 go to GPU VRAM, and layers K to N-1 stay in CPU RAM and run on the CPU. Placement of the embedding table and LM head is controlled separately with -ot. Activations cross the PCIe bus wherever consecutive layers sit on different devices, which adds some overhead.

Speed Impact of Hybrid vs Full GPU

Config	Tokens/sec (7B Q4_K_M, RTX 5070)	VRAM Used
Full GPU (-ngl 99)	~80–110 tok/s	~4.5 GB
Hybrid 20/28 layers GPU	~35–55 tok/s	~3.0 GB
Hybrid 10/28 layers GPU	~20–30 tok/s	~1.5 GB
Full CPU (-ngl 0)	~5–12 tok/s	0 GB (RAM only)

VRAM Budgeting

Use this formula to estimate -ngl for your VRAM budget:

formulaVRAM estimation

# VRAM per layer ≈ model_file_size / total_layers
# Total VRAM ≈ (layers_on_gpu × vram_per_layer) + kv_cache + overhead
# KV cache   ≈ 2 (K and V) × layers × kv_heads × head_dim × ctx × bytes_per_element

# Example: Qwen2.5-14B Q4_K_M (8.9 GB file, 48 layers, 8 KV heads via GQA, head_dim 128)
vram_per_layer = 8900 MB / 48 ≈ 185 MB / layer
kv_cache_f16   = 2 × 48 × 8 × 128 × 8192 × 2 bytes ≈ 1610 MB (8K ctx)
overhead       = ~500 MB (CUDA context, activations)

# For 12 GB VRAM:
available  = 12000 - 1610 - 500 = 9890 MB for weights
max_layers = 9890 / 185 ≈ 53 → more than the model's 48 layers, so everything fits: use -ngl 99

# With quantized KV cache (q8_0, about 8.5 bits per element):
kv_cache_q8 ≈ 860 MB → available = 10640 MB → extra headroom for a longer context
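
The budgeting arithmetic can be wrapped in two small functions. This is a rough sketch under the same simplifications as the formula (weights spread evenly across layers, a flat overhead allowance), not a replacement for --fit. The function names are hypothetical, and the value of 8 KV heads for Qwen2.5-14B is an assumption taken from that model's published grouped-query-attention configuration: the cache stores K and V per KV head, not per attention head.

```python
import math


def kv_cache_mb(n_layers, n_kv_heads, head_dim, ctx, bytes_per_elem=2.0):
    """K and V caches: 2 tensors x layers x KV heads x head_dim x context length."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per_elem / 1e6


def estimate_ngl(file_mb, n_layers, vram_mb, kv_mb, overhead_mb=500):
    """Largest layer count whose weights fit next to the KV cache and overhead."""
    per_layer_mb = file_mb / n_layers
    available_mb = vram_mb - kv_mb - overhead_mb
    if available_mb <= 0:
        return 0
    return min(n_layers, math.floor(available_mb / per_layer_mb))


# Qwen2.5-14B Q4_K_M: 8900 MB file, 48 layers, assumed 8 KV heads of dimension 128
kv = kv_cache_mb(n_layers=48, n_kv_heads=8, head_dim=128, ctx=8192)
for vram_mb in (12000, 8000):
    print(vram_mb, "MB VRAM ->", estimate_ngl(8900, 48, vram_mb, kv), "layers")
```

Treat the result as a starting point and confirm it with llama-bench, since real per-layer sizes and runtime overhead vary.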

Pre-calculated -ngl Values for Common Setups

Model + Quant	GPU VRAM	ctx=2048	ctx=8192	ctx=32768
7B Q4_K_M	12 GB	-ngl 99 ✅	-ngl 99 ✅	-ngl 99 (-fa -ctk q8_0)
7B Q8_0	12 GB	-ngl 99 ✅	-ngl 26	-ngl 14
13B Q4_K_M	12 GB	-ngl 40 ✅	-ngl 32	-ngl 20
14B Q4_K_M	16 GB	-ngl 48 ✅	-ngl 44	-ngl 36
4B Q8_0	12 GB	-ngl 99 ✅	-ngl 99 ✅	-ngl 99 ✅
70B Q4_K_M	12 GB	-ngl 14	-ngl 10	-ngl 6

Offload Recipes

Recipe 1 — VRAM Tight (12 GB, 14B model)

bash14B on 12 GB VRAM — aggressive offloading

# Flags are collected in an array so each one can carry its own comment
# (a comment placed after a trailing backslash would cut the command short)
args=(
  -m ./Qwen3-14B-Instruct-Q4_K_M.gguf
  -ngl 30                          # 30/48 layers on GPU
  -c 4096                          # Smaller context to save KV cache VRAM
  -b 512                           # Smaller batch to reduce peak VRAM
  -fa on                           # Flash attention: saves attention VRAM
  --cache-type-k q8_0              # Quantize K cache: roughly halves its VRAM vs f16
  --cache-type-v q8_0
  -ot "token_embd=CPU"             # Embedding to CPU (saves ~1.2 GB)
  -ot "output=CPU"                 # LM head to CPU
  -t 8                             # 8 threads for CPU layers
  --host 127.0.0.1 --port 8080
  --jinja -a "qwen14b"
)
./llama-server "${args[@]}"

Recipe 2 — Large Dense Model, Mostly CPU

bash70B Q4_K_M on 12 GB VRAM + 64 GB RAM

# Flags are collected in an array so each one can carry its own comment
args=(
  -m ./Llama-3.3-70B-Instruct-Q4_K_M.gguf
  -ngl 10                          # Only 10 layers on GPU (70B has 80 layers)
  -c 2048                          # Keep context small
  -t 16                            # 16 CPU threads for the 70 CPU layers
  -tb 16                           # 16 threads for prefill
  -b 512
  --no-mmap                        # Load fully into RAM to avoid page faults
  -fa on
  --cache-type-k q4_0              # Maximum KV compression
  --cache-type-v q4_0
  --host 127.0.0.1 --port 8080
  --jinja -a "llama3-70b"
)
./llama-server "${args[@]}"
# Expect ~3-6 tok/s: PCIe transfers and CPU compute limit throughput

Recipe 3 — Full GPU (7B, maximized)

bash7B fully on GPU — maximum speed

# Flags are collected in an array so each one can carry its own comment
args=(
  -m ./Qwen3-8B-Instruct-Q4_K_M.gguf
  -ngl 99                          # All layers to GPU
  -c 8192
  -fa on
  --cache-type-k q8_0
  --cache-type-v q8_0
  -b 2048
  --cont-batching
  -np 2                            # 2 parallel slots for multiple users
  --host 127.0.0.1 --port 8080
  --jinja
  -a "qwen8b"
  --defrag-thold 0.1               # Auto-defrag KV cache
)
./llama-server "${args[@]}"

Recipe 4 — MoE Model + Auto-Fit (2026 idiom)

bashQwen3-Coder-30B-A3B on 16 GB VRAM — sparse experts on CPU

# Flags are collected in an array so each one can carry its own comment
args=(
  -hf unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF:Q4_K_XL
  -ngl 99                          # Attention + shared layers on GPU
  --n-cpu-moe 999                  # Sparse expert tensors → CPU RAM
  --fit                            # Auto-tune context/batch to remaining VRAM
  -fa on
  --host 127.0.0.1 --port 8080
  --jinja -a "qwen3-coder-30b"
)
./llama-server "${args[@]}"
# Only ~3B active params per token → near-dense-7B speed
# despite a 30B total parameter checkpoint
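
To make the split in this recipe concrete, the sketch below classifies tensors the way the combination of -ngl 99 and --n-cpu-moe N is described: expert tensors of the first N layers go to CPU, everything else stays on GPU. It is an illustration only. The function `device_for` is hypothetical, and the tensor-name pattern (`blk.<layer>.ffn_{gate,up,down}_exps`) is an assumption about GGUF naming that you should check against your own model's tensor list.

```python
import re

# Assumed GGUF naming for per-layer expert tensors, e.g. "blk.12.ffn_up_exps.weight"
EXPERT_TENSOR = re.compile(r"^blk\.(\d+)\.ffn_(?:gate|up|down)_exps\.")


def device_for(tensor_name: str, n_cpu_moe: int) -> str:
    """Expert tensors of the first n_cpu_moe layers go to CPU; the rest stay on GPU."""
    match = EXPERT_TENSOR.match(tensor_name)
    if match and int(match.group(1)) < n_cpu_moe:
        return "CPU"
    return "GPU"


for name in ["blk.0.attn_q.weight", "blk.0.ffn_up_exps.weight", "blk.47.ffn_down_exps.weight"]:
    print(name, "->", device_for(name, n_cpu_moe=999))
```

A value such as 999 simply exceeds any real layer count, so every expert tensor lands on CPU; a smaller N keeps the later layers' experts on the GPU when VRAM allows.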

Sample Code

CLI Examples

bashllama-cli — single-shot inference

# Interactive chat (GPU, Qwen2.5)
./llama-cli \
  -m ./Qwen2.5-7B-Instruct-Q4_K_M.gguf \
  -ngl 99 \
  -c 4096 \
  --temp 0.7 \
  --top-p 1.0 \
  --min-p 0.05 \
  --repeat-penalty 1.05 \
  --jinja \
  -i -if           # -i = interactive, -if = interactive first

# One-shot generation (pipe to output)
./llama-cli \
  -m ./model.gguf \
  -ngl 99 \
  -p "Explain ELSS mutual funds in 3 bullet points" \
  -n 256 \
  --temp 0.3 \
  --no-display-prompt \
  -s 42            # -s = seed (reproducible output)

# Benchmark to find optimal -ngl
# -ngl takes a comma-separated list to test multiple values; -r 3 runs 3 repetitions per config
./llama-bench \
  -m ./Qwen2.5-7B-Instruct-Q4_K_M.gguf \
  -ngl 0,10,20,28 \
  -p 512 -n 128 \
  -r 3

Server Launch

bashllama-server — production launch script

#!/bin/bash
# launch_server.sh — parameterized llama-server launch

MODEL="${MODEL:-./models/Qwen2.5-7B-Instruct-Q4_K_M.gguf}"
PORT="${PORT:-8080}"
CTX="${CTX:-8192}"
NGL="${NGL:-99}"
THREADS="${THREADS:-12}"
ALIAS="${ALIAS:-local-model}"

# --log-disable suppresses verbose logs; anything still printed is copied to llama-server.log
./llama-server \
  -m  "$MODEL" \
  -a  "$ALIAS" \
  -ngl "$NGL" \
  -c  "$CTX" \
  -t  "$THREADS" \
  -tb "$THREADS" \
  -b  2048 \
  -fa on \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --cont-batching \
  --defrag-thold 0.1 \
  --jinja \
  --host 127.0.0.1 \
  --port "$PORT" \
  --log-disable \
  2>&1 | tee llama-server.log

# Usage:
# ./launch_server.sh
# MODEL=./14B.gguf NGL=30 CTX=4096 ./launch_server.sh

bashVerify server is running

# Health check
curl http://localhost:8080/health

# List models
curl http://localhost:8080/v1/models | python3 -m json.tool

# Test completion
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "local-model",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user",   "content": "What is Section 80C?"}
    ],
    "temperature": 0.7,
    "max_tokens": 256
  }' | python3 -m json.tool
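
When a script has to wait for the server before sending requests, the health check can be polled instead of run by hand. The following is a minimal sketch, assuming the /health endpoint answers with HTTP 200 once the model is ready; the helper names `wait_until_healthy` and `http_probe` are hypothetical, and the probe is injectable so the waiting logic can be exercised without a running server.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def wait_until_healthy(probe: Callable[[], bool], timeout_s: float = 60.0,
                       interval_s: float = 1.0,
                       sleep: Callable[[float], None] = time.sleep) -> bool:
    """Call probe() until it returns True or timeout_s seconds of waiting elapse."""
    waited = 0.0
    while True:
        if probe():
            return True
        if waited >= timeout_s:
            return False
        sleep(interval_s)
        waited += interval_s


def http_probe(url: str = "http://localhost:8080/health") -> bool:
    """True when GET url answers with HTTP 200; errors count as 'not ready yet'."""
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False
```

Typical use is `wait_until_healthy(http_probe, timeout_s=120)` right after starting the launch script, giving a large model time to load before the first completion request.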

Python Client

pythonclient_openai.py — OpenAI SDK (recommended)

from openai import OpenAI

# llama-server is OpenAI API-compatible
client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed",   # Required by SDK but ignored if no --api-key set
)

# ─── Non-streaming completion ─────────
response = client.chat.completions.create(
    model="local-model",
    messages=[
        {"role": "system",  "content": "You are Arthavidya, an Indian finance expert."},
        {"role": "user",    "content": "Explain ELSS tax saving mutual funds."},
    ],
    temperature=0.7,
    max_tokens=512,
    top_p=0.9,
    frequency_penalty=0.1,
    presence_penalty=0.05,
)
print(response.choices[0].message.content)

# ─── Streaming response ───────────────
stream = client.chat.completions.create(
    model="local-model",
    messages=[{"role": "user", "content": "Write a haiku about investing."}],
    stream=True,
    temperature=1.0,
    max_tokens=100,
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()

pythonclient_requests.py — Direct HTTP (no OpenAI SDK)

import requests

def chat(messages: list, temperature: float = 0.7, max_tokens: int = 512) -> str:
    response = requests.post(
        "http://localhost:8080/v1/chat/completions",
        json={
            "model": "local-model",
            "messages": messages,
            "temperature": temperature,
            "max_tokens": max_tokens,
            "top_p": 0.9,
            "repeat_penalty": 1.05,   # llama.cpp native field
            "min_p": 0.05,              # llama.cpp extension
        },
        timeout=120,
    )
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"]

# ─── Example usage ────────────────────
answer = chat([
    {"role": "system", "content": "You are a helpful finance assistant."},
    {"role": "user",   "content": "What is Section 80D deduction?"},
])
print(answer)

# ─── Check server info ────────────────
info = requests.get("http://localhost:8080/v1/models").json()
print(f"Loaded model: {info['data'][0]['id']}")

pythonclient_langchain.py — LangChain integration

from langchain_openai import ChatOpenAI
from langchain_core.messages import SystemMessage, HumanMessage

# Point LangChain at local llama-server
llm = ChatOpenAI(
    model="local-model",
    base_url="http://localhost:8080/v1",
    api_key="not-needed",
    temperature=0.7,
    max_tokens=1024,
)

messages = [
    SystemMessage(content="You are an expert in Indian taxation."),
    HumanMessage(content="What is the difference between ELSS and PPF?"),
]

response = llm.invoke(messages)
print(response.content)

Config Files

Instead of long command-line flags, llama-server accepts a JSON config file via --config. Easier to version-control and manage across environments.

jsonconfig_7b_12gb.json — 7B full GPU config

{
  "model": "./models/Qwen2.5-7B-Instruct-Q4_K_M.gguf",
  "alias": "qwen7b",
  "n_gpu_layers": 99,
  "ctx_size": 8192,
  "batch_size": 2048,
  "ubatch_size": 512,
  "threads": 12,
  "threads_batch": 12,
  "flash_attn": true,
  "cache_type_k": "q8_0",
  "cache_type_v": "q8_0",
  "cont_batching": true,
  "defrag_thold": 0.1,
  "jinja": true,
  "host": "127.0.0.1",
  "port": 8080,
  "n_parallel": 1,
  "temperature": 0.7,
  "top_p": 1.0,
  "min_p": 0.05,
  "repeat_penalty": 1.05
}

jsonconfig_14b_12gb.json — 14B hybrid GPU+CPU

{
  "model": "./models/Qwen2.5-14B-Instruct-Q4_K_M.gguf",
  "alias": "qwen14b",
  "n_gpu_layers": 32,
  "ctx_size": 4096,
  "batch_size": 512,
  "ubatch_size": 256,
  "threads": 12,
  "threads_batch": 16,
  "flash_attn": true,
  "cache_type_k": "q4_0",
  "cache_type_v": "q4_0",
  "tensor_split": "",
  "override_tensor": "token_embd=CPU,output=CPU",
  "no_mmap": false,
  "jinja": true,
  "host": "127.0.0.1",
  "port": 8080
}

bashLaunch with config file

./llama-server --config config_7b_12gb.json

# Override specific values from config via CLI flags
./llama-server --config config_7b_12gb.json --port 8081 -c 16384
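
The precedence in the second command (file values first, CLI flags on top) amounts to a dictionary merge. The sketch below only illustrates that ordering and does not reproduce llama-server's argument parsing: the function `effective_settings` is hypothetical, and the mapping of `--port` and `-c` to the `port` and `ctx_size` keys is an assumption based on the key names in the JSON examples above.

```python
import json


def effective_settings(config_text: str, cli_overrides: dict) -> dict:
    """Start from the JSON config, then let values given on the command line win."""
    settings = json.loads(config_text)
    settings.update(cli_overrides)
    return settings


config_text = '{"port": 8080, "ctx_size": 8192, "n_gpu_layers": 99}'
# Equivalent of: --config <file> --port 8081 -c 16384
print(effective_settings(config_text, {"port": 8081, "ctx_size": 16384}))
```

Keys that the command line does not mention, such as `n_gpu_layers` here, keep their value from the file.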

Reference

Cheat Sheet

All Major Parameters — Quick Reference

Flag	Long Form	Type	Default	One-Line Summary
`-m`	`--model`	str	—	Path to GGUF file ← required
`-a`	`--alias`	str	—	Model name returned by API
`-ngl`	`--n-gpu-layers`	int	0	Layers offloaded to GPU VRAM
`-c`	`--ctx-size`	int	model	Context window in tokens
`-t`	`--threads`	int	cores	CPU threads for decode
`-tb`	`--threads-batch`	int	= -t	CPU threads for prefill
`-b`	`--batch-size`	int	2048	Logical batch tokens (prefill)
`-ub`	`--ubatch-size`	int	512	Physical micro-batch (CUDA)
`-n`	`--predict`	int	-1	Max output tokens (-1=unlimited)
`--temp`	`--temperature`	float	0.8	Sampling randomness (0=greedy)
`--top-p`	—	float	0.95	Nucleus sampling cutoff
`--top-k`	—	int	40	Top-K token limit (0=off)
`--min-p`	—	float	0.05	Min probability relative to top
`--repeat-penalty`	—	float	1.0	Penalty for repeating tokens
`--repeat-last-n`	—	int	64	Context window for repeat check
`-fa`	`--flash-attn`	enum	auto	Flash Attention v2 (on/off/auto)
`--cache-type-k`	—	enum	f16	K-cache dtype (f16/q8_0/q4_0)
`--cache-type-v`	—	enum	f16	V-cache dtype (f16/q8_0/q4_0)
`-ot`	`--override-tensor`	str	—	Regex=device tensor placement
—	`--n-cpu-moe`	int	0	Send N layers' MoE experts to CPU
—	`--fit`	bool	on (-hf)	Auto-fit -ngl/-c to free memory
`-hf`	`--hf-repo`	str	—	Pull GGUF straight from Hugging Face
—	`--models-dir`	path	—	Router mode: multi-model directory
—	`--sleep-idle-seconds`	int	off	Free memory after N sec idle
—	`--dry-multiplier`	float	0 (off)	DRY repetition-loop penalty strength (0.8 is a typical setting)
—	`--xtc-probability`	float	0	Exclude-top-choices sampler (creative)
—	`--top-nsigma`	float	-1	Statistical logit filter (off by default)
`--mmap`	`--no-mmap`	bool	on	Memory-mapped model loading
`--mlock`	—	bool	off	Lock model pages in RAM
`--defrag-thold`	—	float	-1	KV cache defrag threshold
`--host`	—	str	127.0.0.1	Server bind address
`--port`	—	int	8080	Server port
`--jinja`	—	bool	off	Enable Jinja2 chat templates
`--chat-template`	—	str	—	Override chat template preset
`-np`	`--parallel`	int	1	Concurrent inference slots
`--api-key`	—	str	—	Bearer token for auth
`--cont-batching`	—	bool	on	Continuous batching (server)
`-md`	`--model-draft`	str	—	Draft model for speculative decoding
`-s`	`--seed`	int	-1	RNG seed (-1=random)

Sampling Stack — Execution Order (b9900+)

textHow samplers chain together

Raw logits from model
    ↓
Penalties               (--repeat-penalty / --presence-penalty / --frequency-penalty)
    ↓
DRY                     (--dry-multiplier — breaks phrase-level repetition loops)
    ↓
Top-N-Sigma             (--top-nsigma — off by default)
    ↓
Top-K filtering         (--top-k 0 to disable)
    ↓
Typical-P               (--typical-p — off by default)
    ↓
Top-P / nucleus         (--top-p)
    ↓
Min-P filtering         (--min-p)
    ↓
XTC                     (--xtc-probability — off by default, creative writing only)
    ↓
Temperature scaling     (--temp — applied last)
    ↓
Sample final token

# Override the full chain order with --samplers "top_k;top_p;temp" etc.
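
A compact way to see how the filters in this chain interact is to run them on a toy distribution. The sketch below covers only Top-K, Top-P, Min-P, and temperature in the order listed; penalties, DRY, Top-N-Sigma, Typical-P, and XTC are left out for brevity. It is a simplified illustration in pure Python, not llama.cpp's sampler code, and all function names are hypothetical.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a dict of token -> logit."""
    peak = max(logits.values())
    exps = {tok: math.exp(v - peak) for tok, v in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


def top_k(logits, k):
    """Keep the k highest-logit tokens; k <= 0 disables the filter."""
    if k <= 0 or k >= len(logits):
        return dict(logits)
    keep = sorted(logits, key=logits.get, reverse=True)[:k]
    return {tok: logits[tok] for tok in keep}


def top_p(logits, p):
    """Keep the smallest set of most likely tokens whose probability mass reaches p."""
    probs = softmax(logits)
    kept, mass = {}, 0.0
    for tok in sorted(probs, key=probs.get, reverse=True):
        kept[tok] = logits[tok]
        mass += probs[tok]
        if mass >= p:
            break
    return kept


def min_p(logits, ratio):
    """Drop tokens whose probability is below ratio x the top token's probability."""
    probs = softmax(logits)
    floor = ratio * max(probs.values())
    return {tok: v for tok, v in logits.items() if probs[tok] >= floor}


def sample(logits, k=40, p=0.95, ratio=0.05, temp=0.8, rng=random):
    """Top-K -> Top-P -> Min-P -> temperature -> draw one token."""
    logits = min_p(top_p(top_k(logits, k), p), ratio)
    if temp <= 0:  # temperature 0 means greedy decoding
        return max(logits, key=logits.get)
    probs = softmax({tok: v / temp for tok, v in logits.items()})
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens])[0]


toy_logits = {"a": 5.0, "b": 4.0, "c": 1.0, "d": -3.0}
print(sorted(top_p(toy_logits, 0.9)), sample(toy_logits, temp=0))
```

Because temperature is applied after the filters, raising it reshuffles probability only among the tokens that survived, which is why a high --temp combined with a sensible --min-p stays coherent.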

Quick Decision Guide

Goal	Settings
Code / factual answers	`--temp 0.1 --top-p 0.9 --top-k 0 --repeat-penalty 1.0 --dry-multiplier 0.8`
General chat	`--temp 0.7 --top-p 1.0 --min-p 0.05 --repeat-penalty 1.05 --dry-multiplier 0.8`
Creative writing	`--temp 1.1 --top-p 0.95 --min-p 0.02 --repeat-penalty 1.1 --xtc-probability 0.15 --xtc-threshold 0.1`
Reproduce output	`--temp 0 -s 42` (greedy + fixed seed)
Zero-math first launch	`llama-server -hf <repo> --fit`
MoE model, tight VRAM	`-ngl 99 --n-cpu-moe 999 --fit -fa on`
Max VRAM savings (dense)	`-fa on --cache-type-k q4_0 --cache-type-v q4_0 -ot "token_embd=CPU,output=CPU"`
Max speed (full GPU)	`-ngl 99 -fa on -b 4096 -ub 1024 --cont-batching`
Claude Code / Anthropic client	`--jinja` then set `ANTHROPIC_BASE_URL=http://host:port`

Official Docs & Links

📦 llama.cpp Repository

github.com/ggml-org/llama.cpp

Main repository (moved from ggerganov/llama.cpp to the ggml-org organization). Source code, releases, build instructions, and issue tracker.

🖥️ llama-server Docs

github.com/ggml-org/llama.cpp/blob/master/docs/server.md

Complete llama-server parameter reference, API endpoints, OpenAI + Anthropic compatibility notes.

🔨 Build Guide

github.com/ggml-org/llama.cpp/blob/master/docs/build.md

Platform-specific build instructions: CUDA, Metal, Vulkan, SYCL, WebGPU, CPU backends.

⚡ Performance Tips

docs/development/token-generation-performance-tips.md

Official performance tuning guide: batch size, threading, GPU offload, and throughput optimization.

✨ Flash Attention Docs

docs/flash-attention.md

Detailed explanation of FA support, compatible quant types, and how KV cache quantization interacts.

📄 GGUF Format Spec

github.com/ggml-org/ggml/blob/master/docs/gguf.md

GGUF file format specification — metadata fields, tensor layout, and quantization types.

💬 GitHub Discussions

github.com/ggml-org/llama.cpp/discussions

Community Q&A, model compatibility reports, performance benchmarks, and config sharing.

🤗 bartowski GGUF Hub

huggingface.co/bartowski

High-quality GGUF quantizations of the latest models — Q4_K_M, Q5_K_M, Q8_0, and more.

🔌 Server API Reference

tools/server/README.md

All REST endpoints: /v1/chat/completions, /v1/messages (Anthropic), /v1/completions, /tokenize, /detokenize, /slots, /health, /models.

🤗 ggml-org GGUF Hub

huggingface.co/ggml-org

Official ggml-org quantizations, ready for direct -hf pulls — vision/mmproj pairs included for multimodal models.

🚀 Releases & Changelogs

github.com/ggml-org/llama.cpp/releases

Pre-built binaries for Linux, macOS, Windows. CUDA, Metal, Vulkan, SYCL, WebGPU, and CPU variants — tagged by build number (bNNNN), not semver.
