b4500 · 2025
llama-server llama-cli GPU+CPU Hybrid GGUF RTX 40/50
$ ./llama-server --version

LLAMA.CPP
HANDBOOK

// The Complete Parameter Reference for Local LLM Inference

Everything you need to configure, tune, and run GGUF models locally — from GPU/CPU hybrid offloading to sampling parameters, server mode, Jinja templates, and KV cache optimization. Built for RTX 40/50 series and 12–16 GB VRAM setups.


What is llama.cpp?

llama.cpp is a high-performance LLM inference engine written in pure C/C++ by Georgi Gerganov. It runs GGUF-format models with CPU-only, GPU-only, or hybrid GPU+CPU execution — no Python, no CUDA runtime dependency for CPU mode, no HuggingFace stack required.

Its defining capability is GPU layer offloading: you load as many transformer layers as fit in GPU VRAM, and the CPU computes the remaining layers from system RAM. This lets you run models larger than your VRAM by trading speed for capacity.

Core Language: C++
Model Format: GGUF
Python Deps (CLI): 0
Architectures Supported: 70+
OpenAI-Compatible API: v1
🦎 vs Ollama / LM Studio

Ollama and LM Studio are frontends built on llama.cpp. If you use either, most of their settings (GPU layers, context size, sampling options) correspond to the llama.cpp parameters described here, though each tool exposes them under its own names. Knowing llama.cpp's parameters helps you understand what these tools are doing under the hood.

Build & Install

bash: CUDA build (RTX 40/50, recommended)
# Clone latest
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

# Build with CUDA: -DGGML_CUDA=ON enables GPU support
# CUDA architectures: 89 = Ada (RTX 40xx), 120 = Blackwell (RTX 50xx)
cmake -B build \
  -DGGML_CUDA=ON \
  -DCMAKE_CUDA_ARCHITECTURES="89;120" \
  -DCMAKE_BUILD_TYPE=Release

cmake --build build --config Release -j$(nproc)

# Binaries land in ./build/bin/
ls build/bin/
# → llama-server  llama-cli  llama-bench  llama-quantize
bash: Pre-built release (fastest path)
# Look up the latest release tag on GitHub Releases
RELEASE=$(curl -s https://api.github.com/repos/ggerganov/llama.cpp/releases/latest | grep tag_name | cut -d'"' -f4)

# Linux x64 archive. Check the Releases page for the asset that matches your GPU backend;
# this file name may be a CPU-only build.
wget "https://github.com/ggerganov/llama.cpp/releases/download/${RELEASE}/llama-${RELEASE}-bin-ubuntu-x64.zip"
unzip llama-*.zip -d llama-cpp
export PATH="$PWD/llama-cpp:$PATH"

# Verify the binary runs and prints its build info
./llama-cpp/llama-server --version
# With a CUDA build, the startup log of a real run should list the detected GPU(s)
⚡ RTX 50 Series (Blackwell)

For RTX 5060 Ti / 5070, set -DCMAKE_CUDA_ARCHITECTURES="120". Pre-built binaries as of early 2025 may not include sm_120, so build from source for best performance on Blackwell. This requires CUDA 12.8 or newer, together with a driver new enough for that toolkit (the 570 series or later).

Key Binaries

Binary | Purpose | When to Use
llama-server | HTTP API server (OpenAI-compatible) | App integration, Claude Code, LangChain, anything that needs an API
llama-cli | Interactive CLI / single-shot inference | Quick testing, scripting, benchmarking prompts
llama-bench | Benchmark prompt & generation throughput | Finding optimal -ngl, -b, -t values for your GPU
llama-quantize | Convert/re-quantize GGUF models | Changing quant level (F16→Q4_K_M) locally
llama-perplexity | Evaluate model quality (PPL score) | Comparing quant levels for quality loss measurement

Model Loading

-m   --model
related: -mu / --model-url, -hf / --hf-repo
string default: none server + cli
Path to the GGUF model file. This is the one parameter you must always specify. -m takes a local path; to have llama.cpp download the file, pass the URL with -mu / --model-url (or a Hugging Face repo with -hf) on builds compiled with download support.
-m ./models/qwen2.5-7b-instruct-Q4_K_M.gguf
-mu "https://huggingface.co/bartowski/Qwen2.5-7B-Instruct-GGUF/resolve/main/Qwen2.5-7B-Instruct-Q4_K_M.gguf"
-a   --alias
string server
Sets the model name returned by the API's /v1/models endpoint and used in chat completions. Clients like OpenAI SDK must specify this name in the model field of their request.
-a "qwen2.5-7b"   # Client calls: {"model": "qwen2.5-7b", ...}

GPU Layers -ngl

-ngl   --gpu-layers   --n-gpu-layers
integer default: 0 (CPU only) server + cli
Number of transformer layers to load into GPU VRAM. Each layer occupies VRAM; layers not on GPU are computed on CPU. Setting -ngl 99 or -ngl 9999 offloads all layers (safe — clamps to model's actual layer count).
💡 Rule of Thumb
Most 7B models have 32 layers. 13B models have 40 layers. Qwen2.5-7B has 28 layers. If you set -ngl 20 on a 28-layer model, 20 go to GPU and 8 run on CPU — you get hybrid inference.
-ngl 99   # Fully GPU (all layers fit in VRAM)
-ngl 20   # 20 layers GPU, rest CPU (hybrid for VRAM-tight setups)
-ngl 0    # Pure CPU inference

Layer Count by Model

Model | Layers | 12 GB -ngl | 16 GB -ngl | Notes
Qwen2.5-7B Q4_K_M | 28 | -ngl 28 ✅ | -ngl 28 ✅ | Full GPU, ~4.1 GB VRAM
Qwen2.5-7B Q8_0 | 28 | -ngl 20 | -ngl 28 ✅ | Q8 = ~7.7 GB, tight on 12 GB
Llama-3.1-8B Q4_K_M | 32 | -ngl 32 ✅ | -ngl 32 ✅ | ~4.6 GB VRAM
Qwen2.5-14B Q4_K_M | 48 | -ngl 24 | -ngl 35 | ~8 GB for 24 layers
Mistral-7B Q5_K_M | 32 | -ngl 32 ✅ | -ngl 32 ✅ | ~5.1 GB VRAM
Phi-4 14B Q4_K_M | 40 | -ngl 20 | -ngl 30 | ~9.0 GB for 30 layers
DeepSeek-R1 7B Q4_K_M | 28 | -ngl 28 ✅ | -ngl 28 ✅ | Same as Qwen-7B class

Context Size -c

-c   --ctx-size
integer default: 4096 (0 = use the model's trained context length) server + cli
Sets the context window size in tokens, which is also the size of the KV cache. This is the total token budget for prompt + generation. Larger contexts consume more VRAM: the KV cache grows linearly with context length, and attention compute during prompt processing grows roughly quadratically.
⚠️ VRAM Cost of Context
KV cache VRAM ≈ 2 × num_layers × num_kv_heads × head_dim × ctx_size × dtype_bytes. Use the number of KV heads, which is smaller than the number of attention heads on grouped-query-attention models. Weight quantization does not change this cost. For an 8B-class model with 32 layers, 8 KV heads, and an f16 cache, 4096 tokens of context costs about 0.5 GB and 32768 tokens about 4 GB, before compute buffers. Use the minimum context you need.
-c 2048    # Conservative: saves VRAM, good for chat
-c 8192    # Balanced: suits most tasks
-c 32768   # Full long-context: Qwen2.5 supports up to 128K
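The KV cache formula above is easy to turn into a quick estimator. The following is a minimal sketch, not llama.cpp code: the function name is made up for illustration, it assumes the head count passed in is the number of KV heads, and the example shape (32 layers, 8 KV heads, head_dim 128) is the Llama-3.1-8B shape with an f16 cache (2 bytes per value).

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_size, bytes_per_value=2.0):
    """Estimate KV cache size: K and V (the factor 2) per layer, per KV head, per token."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_size * bytes_per_value


GIB = 1024 ** 3

# Illustrative shape: 32 layers, 8 KV heads, head_dim 128, f16 cache
for ctx in (2048, 8192, 32768):
    size = kv_cache_bytes(32, 8, 128, ctx)
    print(f"ctx={ctx:>6}: {size / GIB:.2f} GiB")
```

Real usage will be somewhat higher than this estimate because llama.cpp also allocates compute buffers alongside the cache.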

Context vs VRAM at Different Sizes (7B Q4_K_M)

Component | Extra VRAM | Share of 12 GB
Model weights (always loaded) | ~4.1 GB fixed | 34%, constant regardless of context
KV cache @ ctx=2048 | ~0.5 GB | 38% total, safe for 12 GB
KV cache @ ctx=8192 | ~2 GB | ~51% total, comfortable
KV cache @ ctx=32768 | ~7 GB | ~92% total, very tight on 12 GB; reduce -ngl or use a q4_0 KV cache

Threads -t / -tb

-t   --threads
CPU threads for generation (token-by-token decode)
integer default: physical core count server + cli
Number of CPU threads used for the decode phase (generating tokens one by one). Only matters for layers running on CPU. If all layers are on GPU (-ngl 99), this has minimal effect.
💡 Best Practice
Set to physical cores (not hyperthreads). E.g., for a 12-core/24-thread CPU, use -t 12. Using too many threads causes context-switching overhead that slows inference.
-t 12 # 12 physical cores for CPU decode layers
-tb   --threads-batch
CPU threads for prompt processing (prefill phase)
integer default: same as -t server + cli
Threads for the prefill/prompt-processing phase (processing the full input prompt in parallel). Can often be set higher than -t since prefill is more parallelizable. Separate from -t to let you tune each phase independently.
-t 8 -tb 16 # 8 threads decode, 16 threads prefill

Batch Size -b / -ub

-b   --batch-size
integer default: 2048 server + cli
Maximum tokens processed in a single batch during the prompt ingestion (prefill) phase. Larger batch = faster prompt processing but more VRAM. For short prompts, this barely matters.
Memory Tip
On 12 GB VRAM with a 13B model, reduce to -b 512 to free VRAM during prompt processing. The speed difference for short prompts is negligible.
-b 512    # Reduced for VRAM-tight setups
-b 4096   # Aggressive: maximize prompt throughput if VRAM allows
-ub   --ubatch-size
integer default: 512 server + cli
Physical micro-batch size — how many tokens are actually computed in one CUDA kernel call during prefill. Must be ≤ -b. Smaller = less VRAM peak usage. Larger = more GPU parallelism. Tune this if you're getting OOM during long-prompt processing.
-b 2048 -ub 256 # Large logical batch, small physical batch (OOM prevention)

Temperature --temp

--temp   --temperature
float default: 0.8 server + cli
Divides the logits before softmax, controlling randomness. Lower = more deterministic, higher = more creative/varied. Temperature=0 is greedy decoding (always picks highest probability token).
📊 Quick Reference
0.0 = deterministic · 0.1–0.3 = factual/code · 0.7–0.9 = balanced chat · 1.0–1.3 = creative writing · 2.0+ = chaotic
--temp 0.1   # Code generation / factual Q&A
--temp 0.7   # General chat (sweet spot)
--temp 1.2   # Creative writing / brainstorming
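To see what dividing the logits by the temperature does to the token distribution, here is a small self-contained sketch. It is an illustration of the math only, not llama.cpp's sampler code; the function name and the example logits are made up.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities after dividing them by the temperature."""
    if temperature <= 0:
        # Greedy decoding: all probability mass on the highest logit
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # hypothetical logits for three tokens
for temp in (0.2, 0.8, 1.5):
    probs = softmax_with_temperature(logits, temp)
    print(temp, [round(p, 3) for p in probs])
```

Lower temperatures concentrate probability on the top token, while higher ones flatten the distribution toward uniform.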

Top-P --top-p (Nucleus Sampling)

--top-p
nucleus sampling — cumulative probability cutoff
float [0.0–1.0] default: 0.95 server + cli
Restricts sampling to the smallest set of tokens whose cumulative probability exceeds p. At 0.95, only the top-probability tokens summing to 95% of the distribution are candidates. Eliminates low-probability "tail" tokens that cause incoherence.
💡 Interaction with Temperature
Top-P and Temperature work together: Top-P trims the low-probability tail and Temperature reshapes the probabilities of what remains. In llama.cpp's default sampler order, the truncation samplers (Top-K, Top-P, Min-P) run before Temperature; the order can be changed with --samplers. For factual tasks: low temp + low top-p. For creative: high temp + high top-p, or use Min-P instead (often better).
--top-p 0.9   # Standard: good quality/diversity balance
--top-p 1.0   # Disabled: full distribution (rely on temp alone)
--top-p 0.5   # Very conservative: only high-confidence tokens
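The nucleus cutoff described above can be sketched in a few lines. This is a simplified illustration with a made-up function name and example probabilities, not the llama.cpp implementation: it keeps the most likely tokens until their cumulative probability reaches p, then renormalizes.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in kept)
    # Renormalize so the surviving candidates sum to 1
    return {i: probs[i] / total for i in kept}


probs = [0.5, 0.3, 0.15, 0.05]  # hypothetical token probabilities
print(sorted(top_p_filter(probs, 0.9)))  # token ids that survive at p=0.9
print(sorted(top_p_filter(probs, 0.5)))  # token ids that survive at p=0.5
```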

Top-K --top-k

--top-k
keep only top K tokens before sampling
integer default: 40 server + cli
Keeps only the top K highest-probability tokens and zeros out all others before sampling. Simpler than Top-P. Setting --top-k 0 disables it (no hard cutoff). Applied before Top-P in the sampling chain.
When to use Top-K
Top-K is a blunter instrument than Top-P — it doesn't care about probability mass, just rank. Generally, use Top-P (or Min-P) instead of Top-K for modern models. If you use both, K is applied first.
--top-k 0    # Disabled: recommended when using top-p
--top-k 40   # Classic default
--top-k 1    # Greedy (same as --temp 0)

Min-P --min-p

--min-p
minimum probability relative to top token
float [0.0–1.0] default: 0.05 server + cli
Keeps tokens whose probability is at least min_p × (probability of top token). Scales dynamically with confidence: when the model is very confident, fewer tokens pass; when uncertain, more pass. Often superior to Top-P for modern GGUF models.
🌟 Recommended Modern Sampler
Many llama.cpp users replace Top-P entirely with Min-P (set --top-p 1.0, --min-p 0.05). Min-P naturally adapts to model confidence, producing more coherent high-confidence output and appropriately creative low-confidence output.
--top-p 1.0 --min-p 0.05 --temp 0.8 # Modern recommended combo
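The way Min-P adapts to model confidence is easiest to see on two contrasting distributions. A minimal sketch, with an illustrative function name and made-up probabilities (not llama.cpp's code):

```python
def min_p_filter(probs, min_p):
    """Keep tokens whose probability is at least min_p times the top probability."""
    threshold = min_p * max(probs)
    kept = {i: pr for i, pr in enumerate(probs) if pr >= threshold}
    total = sum(kept.values())
    return {i: pr / total for i, pr in kept.items()}


confident = [0.90, 0.06, 0.03, 0.01]  # model is sure: threshold is 0.045
uncertain = [0.30, 0.25, 0.25, 0.20]  # model is unsure: threshold is 0.015

print(sorted(min_p_filter(confident, 0.05)))  # few candidates survive
print(sorted(min_p_filter(uncertain, 0.05)))  # all candidates survive
```

With the same --min-p value, the confident distribution keeps two candidates while the uncertain one keeps all four, which is the adaptive behavior described above.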

Repeat Penalty --repeat-penalty

--repeat-penalty
also: --repeat-last-n  --presence-penalty  --frequency-penalty
float default: 1.0 (disabled) server + cli
Penalizes tokens that have appeared in the recent context. Values >1.0 discourage repetition; <1.0 encourages it. The penalty is applied to the logits: a positive logit is divided by the penalty and a negative logit is multiplied by it, so the token becomes less likely either way. This helps prevent the model from looping the same phrases.
Related Parameters
--repeat-last-n N — how many previous tokens to check for repeats (default 64, -1 = full context)
--presence-penalty P — OpenAI-style: flat penalty per token present in context
--frequency-penalty P — OpenAI-style: penalty proportional to how often token appeared
--repeat-penalty 1.1 --repeat-last-n 128         # Gentle anti-repetition
--repeat-penalty 1.0                             # Disabled (default)
--presence-penalty 0.2 --frequency-penalty 0.1   # OpenAI-style
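The divide-or-multiply rule for the repeat penalty looks like this in code. This is a sketch of the rule only, with a made-up function name and toy logits; the window of recent tokens stands in for what --repeat-last-n selects.

```python
def apply_repeat_penalty(logits, recent_tokens, penalty):
    """Make every token id seen in recent_tokens less likely."""
    out = list(logits)
    for tok in set(recent_tokens):
        if out[tok] > 0:
            out[tok] /= penalty  # positive logit shrinks toward zero
        else:
            out[tok] *= penalty  # negative logit moves further below zero
    return out


logits = [2.0, -1.0, 0.5]   # hypothetical logits for token ids 0, 1, 2
recent = [0, 1, 1]          # ids that appeared in the recent window
print(apply_repeat_penalty(logits, recent, 1.1))
```

Token 2 is untouched because it did not appear in the window, and a token that repeats several times is still penalized only once by this rule.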

Max Tokens -n

-n   --predict   --n-predict
integer default: -1 (unlimited) server + cli
Maximum number of tokens to generate. -1 means generate until the model produces an EOS token or the context is full. In server mode, clients can override this per-request via max_tokens in the API payload.
-n 512   # Cap at 512 output tokens
-n -1    # Let model decide when to stop
-n -2    # Generate until context full (useful for completion tasks)

Flash Attention -fa

-fa   --flash-attn
flag default: disabled server + cli
Enables Flash Attention, a memory-efficient attention algorithm that avoids materializing the full N×N attention matrix. Instead it computes attention in tiles, substantially reducing VRAM usage during attention computation.
✅ Benefits
• Reduces VRAM by 30–60% for attention computation
• Enables larger contexts without OOM
• Often 20–40% faster on GPU (RTX 40/50)
• Required for long context (32K+) on 12 GB
⚠️ Caveats
• Requires CUDA build (not CPU-only)
• Incompatible with some older GGUF quant formats
• Must pair with compatible --cache-type-k/v settings
• Best on Ampere+ (RTX 30xx and newer)
-fa                        # Enable Flash Attention
-fa -c 32768               # Flash Attention + long context
-fa --cache-type-k q8_0    # Flash Attn + quantized KV cache
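The idea of computing attention in tiles without ever holding the full score matrix can be shown for a single query vector. The sketch below is a pure-Python illustration of the online-softmax trick that Flash Attention builds on; it is not llama.cpp's kernel, and the function names, tile size, and toy vectors are assumptions for the example.

```python
import math


def scores_for(q, keys):
    """Scaled dot-product scores of one query against a list of keys."""
    scale = 1.0 / math.sqrt(len(q))
    return [scale * sum(a * b for a, b in zip(q, k)) for k in keys]


def attention_full(q, keys, values):
    """Reference: materialize every score, then softmax and weight the values."""
    s = scores_for(q, keys)
    peak = max(s)
    w = [math.exp(x - peak) for x in s]
    z = sum(w)
    return [sum(wi * v[d] for wi, v in zip(w, values)) / z for d in range(len(values[0]))]


def attention_tiled(q, keys, values, tile=2):
    """Same result, but only `tile` scores exist at any time (online softmax)."""
    running_max = float("-inf")
    denom = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), tile):
        s = scores_for(q, keys[start:start + tile])
        new_max = max(running_max, max(s))
        rescale = math.exp(running_max - new_max)  # 0.0 on the first tile
        denom *= rescale
        acc = [a * rescale for a in acc]
        for score, v in zip(s, values[start:start + tile]):
            w = math.exp(score - new_max)
            denom += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / denom for a in acc]


q = [0.5, -1.0, 2.0]
keys = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0], [2.0, 1.0, -1.0], [0.5, 0.5, 0.5], [-1.0, 2.0, 0.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
print(attention_full(q, keys, values))
print(attention_tiled(q, keys, values, tile=2))
```

Both functions produce the same output up to floating-point rounding; the tiled version simply rescales its running sums whenever a later tile contains a larger score.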

Cache Type K/V --cache-type-k / --cache-type-v

--cache-type-k   --cache-type-v
quantize the KV cache to reduce VRAM usage
enum default: f16 server + cli
Controls the data type of the Key and Value caches in the attention mechanism. The KV cache grows linearly with context length and can consume significant VRAM. Quantizing it to Q8 or Q4 can save 50–75% of KV cache VRAM with minimal quality loss.
⚠️ Requires Flash Attention
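Quantizing the V cache (--cache-type-v q8_0, q4_0, q5_0) requires -fa (Flash Attention) to be enabled. The K cache can be quantized without it, but the two are normally set together, so in practice enable -fa whenever you quantize the KV cache. Without -fa, keep the V cache at f16 or f32.

Available Cache Types

Type | Bits | VRAM (vs f16) | Quality | Requires -fa?
f32 | 32-bit | 2× (more) | Perfect | No
f16 | 16-bit | 1× (baseline) | Excellent | No
q8_0 | 8-bit | ~0.5× | Near-lossless | Yes (for the V cache)
q5_0 | 5-bit | ~0.35× | Very good | Yes (for the V cache)
q4_0 | 4-bit | ~0.25× | Good | Yes (for the V cache)
-fa --cache-type-k q8_0 --cache-type-v q8_0   # Best quality + VRAM savings
-fa --cache-type-k q4_0 --cache-type-v q4_0   # Maximum VRAM savings (long ctx)
--cache-type-k f16 --cache-type-v f16         # Default: no Flash Attn needed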
🎯 12 GB VRAM — Recommended KV Cache Config

For a 7B Q4_K_M model with 32K context on 12 GB: use -fa --cache-type-k q8_0 --cache-type-v q8_0. This keeps KV cache at ~2 GB instead of ~4 GB at f16, leaving room for weights + activations. For 13B models at 8K context: -fa --cache-type-k q4_0 --cache-type-v q4_0.
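The savings ratios for each cache type follow from how the quantized formats pack values into blocks. The sketch below assumes the usual ggml block layouts (32 values per block, each block carrying a 2-byte scale, plus 4 bytes of high bits for q5_0); treat those byte counts as an assumption to verify against your build. The function names are illustrative.

```python
# (values per block, bytes per block); assumed ggml block layouts
BLOCK_LAYOUT = {
    "f32": (1, 4),
    "f16": (1, 2),
    "q8_0": (32, 34),  # 32 int8 values + 2-byte scale
    "q5_0": (32, 22),  # 16 bytes of low nibbles + 4 bytes of high bits + 2-byte scale
    "q4_0": (32, 18),  # 16 bytes of nibbles + 2-byte scale
}


def bits_per_value(cache_type):
    """Effective storage cost of one cached value, in bits."""
    n_values, n_bytes = BLOCK_LAYOUT[cache_type]
    return n_bytes * 8 / n_values


def ratio_vs_f16(cache_type):
    """Cache size relative to the f16 default."""
    return bits_per_value(cache_type) / 16


for name in BLOCK_LAYOUT:
    print(f"{name:>5}: {bits_per_value(name):4.1f} bits/value, {ratio_vs_f16(name):.2f}x of f16")
```

The per-block scale is why q8_0 lands slightly above half of f16 and q4_0 slightly above a quarter.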

Tensor Override -ot

-ot   --override-tensor
fine-grained control: which tensors go to GPU vs CPU
string (regex=device) default: none server + cli
Overrides where specific tensors are placed, using a regex pattern matched against tensor names. Format: PATTERN=DEVICE, where DEVICE is a backend buffer type name such as CPU or, on CUDA builds, CUDA0, CUDA1, and so on. This is the most surgical tool for fitting models into tight VRAM budgets.
How it interacts with -ngl
-ngl is a blunt instrument — it offloads N complete layers to GPU. -ot is surgical — you can keep specific tensor types on CPU even within layers that -ngl assigned to GPU. Very useful for keeping embedding tables on CPU RAM (they're large and rarely the bottleneck).

Common Tensor Name Patterns

Pattern | Matches | VRAM Impact
blk\.\d+\.attn | All attention weights | ~40% of layer VRAM
blk\.\d+\.ffn | All FFN/MLP weights | ~60% of layer VRAM
token_embd | Embedding table | Large (vocab × dim)
^output\. | LM head / output weights (anchored, because a bare "output" also matches every attn_output tensor) | Same as embeddings
blk\.[2-9][0-9] | Layers 20–99 | Selective layer control
-ot "token_embd=CPU"                               # Keep embedding on CPU RAM (saves ~1-2GB VRAM)
-ot "^output\.=CPU"                                # Keep LM head on CPU
-ot "blk\.3[2-9]\.=CPU"                            # Layers 32-39 to CPU, rest GPU
-ot "token_embd=CPU" -ot "^output\.=CPU" -ngl 99   # Hybrid: all compute layers GPU, I/O on CPU
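Before committing to an -ot pattern, it helps to check which tensor names it actually catches. The sketch below uses Python's re.search as a stand-in for the regex search llama.cpp performs (an assumption; the C++ regex dialect can differ in edge cases), against a short hand-written list of typical GGUF tensor names.

```python
import re

# A few representative GGUF tensor names (not a complete model)
TENSORS = [
    "token_embd.weight",
    "output_norm.weight",
    "output.weight",
    "blk.0.attn_q.weight",
    "blk.0.attn_output.weight",
    "blk.0.ffn_down.weight",
    "blk.32.ffn_up.weight",
]


def matched(pattern):
    """Return the tensor names a pattern would select (substring search, not full match)."""
    return [name for name in TENSORS if re.search(pattern, name)]


print(matched("output"))            # broader than it looks: also hits attn_output
print(matched(r"^output\."))        # only the LM head
print(matched(r"blk\.3[2-9]\."))    # layers 32-39
print(matched(r"blk\.\d+\.ffn"))    # all FFN tensors
```

Because the search is unanchored, a bare "output" also selects the per-layer attention output projections, which is rarely the intent.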
📐 Advanced: -ot for MOE Models

For Mixture-of-Experts models (DeepSeek, Mixtral), the FFN expert tensors are huge but only a few experts activate per token. Use -ot "blk\.\d+\.ffn_gate_exps=CPU" to keep those expert weights in CPU RAM, where the CPU computes them; widen the pattern (for example ffn_.*_exps) to cover the up and down expert tensors as well. This gives large VRAM savings at a modest speed cost.

Memory Map --mmap / --mlock / --no-mmap

--no-mmap   --mlock
flags default: mmap enabled server + cli
Default (mmap): the model file is memory-mapped and pages are loaded from disk on demand. Startup is fast, but first inference may be slower as pages fault in. Memory mapping is on by default, so no flag is needed to enable it. --no-mmap: the entire model is loaded into RAM eagerly at startup. Launch is slower, but inference speed is consistent. --mlock: pins RAM pages so they can't be swapped to disk, which prevents latency spikes but requires OS permissions.
--no-mmap   # Load fully into RAM: best for repeated inference, no page faults
--mlock     # Keep the default mmap and lock pages in RAM: fast and no swap risk

KV Cache Defrag --defrag-thold

--defrag-thold
float [0.0–1.0] default: -1.0 (disabled) server
When the KV cache fragmentation ratio exceeds this threshold, llama-server automatically defragments it. Fragmentation occurs in server mode when multiple concurrent requests start and end at different times, leaving gaps in the KV cache. Setting 0.1 means "defrag when 10%+ of cache is fragmented".
💡 When to Enable
Enable in production server deployments with many concurrent users. For single-user local use, leave disabled. A threshold of 0.1–0.2 is a good starting point.
--defrag-thold 0.1 # Defrag when 10% fragmentation detected

Host & Port

--host   --port
string / integer default: 127.0.0.1 / 8080 server only
Network interface and port the HTTP server binds to. Default 127.0.0.1 means localhost-only (safe). Use 0.0.0.0 to expose to all network interfaces — use with caution and add --api-key if exposed externally.
--host 127.0.0.1 --port 8080   # Localhost only (default)
--host 0.0.0.0 --port 8080     # All interfaces: accessible on LAN

Jinja Templates --jinja

--jinja
also: --chat-template   --chat-template-file
flag / string default: disabled server
--jinja enables Jinja2-based chat template processing — the same system HuggingFace uses for chat formatting. When enabled, llama-server applies the model's built-in chat_template from the GGUF metadata to format messages. This is critical for correct instruction-following behavior.
Why this matters
Each model family (Qwen, Llama, Mistral, Phi) uses a different special token format for system/user/assistant turns. Without the correct template, the model receives malformed input and produces poor outputs. --jinja applies the template automatically from model metadata.

--chat-template — Override template manually

Override the built-in chat template with a named preset or custom Jinja2 string. Useful when GGUF metadata has a wrong or missing template. Preset names vary between builds; run llama-server --help to see the list your build accepts.
Template Preset | Use For
chatml | ChatML format (Qwen 2.x / 2.5 and many fine-tunes)
llama3 | Llama 3.x, Meta models
mistral-v1, mistral-v3, mistral-v7 | Mistral / Mixtral (pick the variant that matches the model)
gemma | Google Gemma
phi3 | Microsoft Phi-3/4
deepseek2 | DeepSeek V2/R1
--jinja                               # Use template from GGUF metadata
--chat-template chatml                # Force the ChatML template (used by Qwen)
--chat-template-file my_template.j2   # Load custom Jinja2 file

Parallel Slots -np

-np   --parallel
integer default: 1 server
Number of concurrent inference slots: how many requests can be processed simultaneously. The slots share the context set by -c, so each slot gets ctx_size / n_parallel tokens. To keep a full context per slot you must raise -c accordingly, and KV cache VRAM grows with it.
💡 For Local Single-User Use
Keep -np 1 (default) for best single-request latency. Increase only for multi-user serving. With -np 4, giving every slot 4096 tokens means setting -c 16384, which needs 4× the KV cache VRAM of a single 4096-token slot.
-np 1            # Single user: best latency
-np 4 -c 16384   # 4 concurrent users with 4096 tokens each: 4× KV VRAM cost

API Key --api-key

--api-key   --api-key-file
string default: none (open access) server
Requires clients to pass a Bearer token in the Authorization header. Must match exactly. When set, unauthenticated requests return HTTP 401. Use when exposing llama-server beyond localhost.
--api-key "my-secret-key-here"
--api-key-file /run/secrets/llama_api_key   # Load from file (safer)

Continuous Batching --cont-batching

--cont-batching   --no-cont-batching
flag default: enabled server
Enables continuous batching — new requests are inserted into the inference pipeline mid-generation without waiting for current requests to finish. Dramatically improves GPU utilization under concurrent load. Enabled by default in llama-server. Disable only for debugging or strict FIFO order requirements.
--cont-batching      # Default: good for multi-user serving
--no-cont-batching   # Disable: strict sequential processing

Hybrid Overview

llama.cpp's killer feature is partial GPU offloading: you split the model's transformer layers between GPU VRAM and CPU RAM. Layers on GPU run fast (CUDA/cuBLAS). Layers on CPU run slower but allow running models far larger than your VRAM.

⚡ How Hybrid Works

The model has N total layers. You set -ngl K. K layers go to GPU VRAM; the remaining N-K layers stay in CPU RAM and run on the CPU. llama.cpp fills the GPU from the end of the model, so the last K layers are the ones offloaded. Placement of the embedding table and LM head can be adjusted separately with -ot. Activations cross the PCIe bus at the split between CPU and GPU layers, which adds some overhead.

Speed Impact of Hybrid vs Full GPU

Config | Tokens/sec (7B Q4_K_M, RTX 5070) | VRAM Used
Full GPU (-ngl 99) | ~80–110 tok/s | ~4.5 GB
Hybrid 20/28 layers GPU | ~35–55 tok/s | ~3.0 GB
Hybrid 10/28 layers GPU | ~20–30 tok/s | ~1.5 GB
Full CPU (-ngl 0) | ~5–12 tok/s | 0 GB (RAM only)

VRAM Budgeting

Use this formula to estimate -ngl for your VRAM budget:

formula: VRAM estimation
# VRAM per layer ≈ model_file_size / total_layers
# Total VRAM ≈ (layers_on_gpu × vram_per_layer) + kv_cache + overhead
# KV cache ≈ 2 × layers × kv_heads × head_dim × ctx × bytes (use KV heads, not attention heads)

# Example: Qwen2.5-14B Q4_K_M (8.9 GB file, 48 layers, 8 KV heads, head_dim 128)
vram_per_layer = 8900 MB / 48 ≈ 185 MB / layer
kv_cache_f16   = 2 × 48 × 8 × 128 × 8192 × 2 bytes ≈ 1600 MB (8K ctx)
overhead       = ~500 MB (CUDA context, activations)

# For 12 GB VRAM:
available  = 12000 - 1600 - 500 = 9900 MB for weights
max_layers = 9900 / 185 ≈ 53 → more than the model's 48, so -ngl 48 fits on paper

# With quantized KV cache (q8_0):
kv_cache_q8 ≈ 850 MB → available ≈ 10650 MB → extra headroom for a larger -c
# Treat these as upper bounds: the desktop and other apps also use VRAM, so start a few layers lower
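The budgeting steps above can be wrapped in a small helper for trying different cards and KV cache sizes. This is a rough sketch under the same simplifying assumptions as the formula (every layer costs file_size / layers, fixed overhead); the function name and the two KV cache figures passed in below are illustrative inputs, not measurements.

```python
def max_gpu_layers(file_mb, n_layers, vram_mb, kv_cache_mb, overhead_mb=500):
    """Estimate the largest -ngl that fits: weights share what is left after KV cache and overhead."""
    per_layer_mb = file_mb / n_layers
    available_mb = vram_mb - kv_cache_mb - overhead_mb
    if available_mb <= 0:
        return 0
    # Never report more layers than the model has
    return min(n_layers, int(available_mb // per_layer_mb))


# 8.9 GB file with 48 layers on a 12 GB card, for two illustrative KV cache sizes
for kv_mb in (3200, 1600):
    print(f"kv_cache={kv_mb} MB -> -ngl {max_gpu_layers(8900, 48, 12000, kv_mb)}")
```

The result is an upper bound, so in practice it is worth confirming with llama-bench and backing off a few layers if loading fails.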

Pre-calculated -ngl Values for Common Setups

Model + Quant | GPU VRAM | ctx=2048 | ctx=8192 | ctx=32768
7B Q4_K_M | 12 GB | -ngl 99 ✅ | -ngl 99 ✅ | -ngl 99 (-fa -ctk q8_0)
7B Q8_0 | 12 GB | -ngl 99 ✅ | -ngl 26 | -ngl 14
13B Q4_K_M | 12 GB | -ngl 40 ✅ | -ngl 32 | -ngl 20
14B Q4_K_M | 16 GB | -ngl 48 ✅ | -ngl 44 | -ngl 36
4B Q8_0 | 12 GB | -ngl 99 ✅ | -ngl 99 ✅ | -ngl 99 ✅
70B Q4_K_M | 12 GB | -ngl 14 | -ngl 10 | -ngl 6

Offload Recipes

Recipe 1: VRAM Tight (12 GB, 13B/14B-class model)
bash: 14B on 12 GB VRAM, aggressive offloading
# Flags live in an array so each one can carry an inline comment
# (a comment after a trailing backslash would break the command)
args=(
  -m ./Qwen2.5-14B-Instruct-Q4_K_M.gguf
  -ngl 30                   # 30/48 layers on GPU
  -c 4096                   # Smaller context to save KV cache VRAM
  -b 512                    # Smaller batch to reduce peak VRAM
  -fa                       # Flash attention: saves attention VRAM
  --cache-type-k q8_0       # Quantize K cache: halves KV VRAM
  --cache-type-v q8_0
  -ot "token_embd=CPU"      # Embedding to CPU (saves ~1.2 GB)
  -ot "^output\.=CPU"       # LM head to CPU (anchored so attn_output is not matched)
  -t 8                      # 8 threads for CPU layers
  --host 127.0.0.1 --port 8080
  --jinja -a "qwen14b"
)
./llama-server "${args[@]}"
Recipe 2: Large Model, Mostly CPU
bash: 70B Q4_K_M on 12 GB VRAM + 64 GB RAM
# Flags live in an array so each one can carry an inline comment
args=(
  -m ./DeepSeek-R1-70B-Q4_K_M.gguf
  -ngl 10                   # Only 10 layers on GPU (70B has 80 layers)
  -c 2048                   # Keep context small
  -t 16                     # 16 CPU threads for the 70 CPU layers
  -tb 16                    # 16 threads for prefill
  -b 512
  --no-mmap                 # Load fully into RAM: avoids page faults
  -fa
  --cache-type-k q4_0       # Maximum KV compression
  --cache-type-v q4_0
  --host 127.0.0.1 --port 8080
  --jinja -a "deepseek-70b"
)
./llama-server "${args[@]}"
# Expect ~3-6 tok/s: PCIe transfers and CPU compute limit throughput
Recipe 3: Full GPU (7B, maximized)
bash: 7B fully on GPU, maximum speed
# Flags live in an array so each one can carry an inline comment
args=(
  -m ./Qwen2.5-7B-Instruct-Q4_K_M.gguf
  -ngl 99                   # All layers to GPU
  -c 8192
  -fa
  --cache-type-k q8_0
  --cache-type-v q8_0
  -b 2048
  --cont-batching
  -np 2                     # 2 parallel slots for multiple users
  --host 127.0.0.1 --port 8080
  --jinja
  -a "qwen7b"
  --defrag-thold 0.1        # Auto-defrag KV cache
)
./llama-server "${args[@]}"

CLI Examples

bash: llama-cli, interactive and single-shot inference
# Interactive chat (GPU, Qwen2.5)
./llama-cli \
  -m ./Qwen2.5-7B-Instruct-Q4_K_M.gguf \
  -ngl 99 \
  -c 4096 \
  --temp 0.7 \
  --top-p 1.0 \
  --min-p 0.05 \
  --repeat-penalty 1.05 \
  --jinja \
  -i -if           # -i = interactive, -if = interactive first

# One-shot generation (pipe to output)
./llama-cli \
  -m ./model.gguf \
  -ngl 99 \
  -p "Explain ELSS mutual funds in 3 bullet points" \
  -n 256 \
  --temp 0.3 \
  --no-display-prompt \
  -s 42            # -s = seed (reproducible output)

# Benchmark to find optimal -ngl (the comma-separated list tests multiple values)
./llama-bench \
  -m ./Qwen2.5-7B-Instruct-Q4_K_M.gguf \
  -ngl 0,10,20,28 \
  -p 512 -n 128 \
  -r 3                 # 3 repetitions per config

Server Launch

bash: llama-server production launch script
#!/bin/bash
# launch_server.sh — parameterized llama-server launch

MODEL="${MODEL:-./models/Qwen2.5-7B-Instruct-Q4_K_M.gguf}"
PORT="${PORT:-8080}"
CTX="${CTX:-8192}"
NGL="${NGL:-99}"
THREADS="${THREADS:-12}"
ALIAS="${ALIAS:-local-model}"

# --log-disable suppresses verbose logs; drop it if you want llama-server.log to contain them
./llama-server \
  -m  "$MODEL" \
  -a  "$ALIAS" \
  -ngl "$NGL" \
  -c  "$CTX" \
  -t  "$THREADS" \
  -tb "$THREADS" \
  -b  2048 \
  -fa \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --cont-batching \
  --defrag-thold 0.1 \
  --jinja \
  --host 127.0.0.1 \
  --port "$PORT" \
  --log-disable \
  2>&1 | tee llama-server.log

# Usage:
# ./launch_server.sh
# MODEL=./14B.gguf NGL=30 CTX=4096 ./launch_server.sh
bash: Verify server is running
# Health check
curl http://localhost:8080/health

# List models
curl http://localhost:8080/v1/models | python3 -m json.tool

# Test completion
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "local-model",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user",   "content": "What is Section 80C?"}
    ],
    "temperature": 0.7,
    "max_tokens": 256
  }' | python3 -m json.tool

Python Client

python: client_openai.py, OpenAI SDK (recommended)
from openai import OpenAI

# llama-server is OpenAI API-compatible
client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed",   # Required by SDK but ignored if no --api-key set
)

# ─── Non-streaming completion ─────────
response = client.chat.completions.create(
    model="local-model",
    messages=[
        {"role": "system",  "content": "You are Arthavidya, an Indian finance expert."},
        {"role": "user",    "content": "Explain ELSS tax saving mutual funds."},
    ],
    temperature=0.7,
    max_tokens=512,
    top_p=0.9,
    frequency_penalty=0.1,
    presence_penalty=0.05,
)
print(response.choices[0].message.content)

# ─── Streaming response ───────────────
stream = client.chat.completions.create(
    model="local-model",
    messages=[{"role": "user", "content": "Write a haiku about investing."}],
    stream=True,
    temperature=1.0,
    max_tokens=100,
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
python: client_requests.py, direct HTTP (no OpenAI SDK)
import requests

def chat(messages: list, temperature: float = 0.7, max_tokens: int = 512) -> str:
    response = requests.post(
        "http://localhost:8080/v1/chat/completions",
        json={
            "model": "local-model",
            "messages": messages,
            "temperature": temperature,
            "max_tokens": max_tokens,
            "top_p": 0.9,
            "repeat_penalty": 1.05,   # llama.cpp native field
            "min_p": 0.05,              # llama.cpp extension
        },
        timeout=120,
    )
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"]

# ─── Example usage ────────────────────
answer = chat([
    {"role": "system", "content": "You are a helpful finance assistant."},
    {"role": "user",   "content": "What is Section 80D deduction?"},
])
print(answer)

# ─── Check server info ────────────────
info = requests.get("http://localhost:8080/v1/models").json()
print(f"Loaded model: {info['data'][0]['id']}")
python: client_langchain.py, LangChain integration
from langchain_openai import ChatOpenAI
from langchain_core.messages import SystemMessage, HumanMessage

# Point LangChain at local llama-server
llm = ChatOpenAI(
    model="local-model",
    base_url="http://localhost:8080/v1",
    api_key="not-needed",
    temperature=0.7,
    max_tokens=1024,
)

messages = [
    SystemMessage(content="You are an expert in Indian taxation."),
    HumanMessage(content="What is the difference between ELSS and PPF?"),
]

response = llm.invoke(messages)
print(response.content)

Config Files

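Instead of long command-line flags, settings can be kept in a JSON config file and passed via --config, which is easier to version-control and manage across environments. Check llama-server --help before relying on this: not every build accepts --config, and the key names below are assumed to mirror the long flag names. llama-server also reads LLAMA_ARG_* environment variables (for example LLAMA_ARG_CTX_SIZE), which serve the same purpose.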

json: config_7b_12gb.json, 7B full GPU config
{
  "model": "./models/Qwen2.5-7B-Instruct-Q4_K_M.gguf",
  "alias": "qwen7b",
  "n_gpu_layers": 99,
  "ctx_size": 8192,
  "batch_size": 2048,
  "ubatch_size": 512,
  "threads": 12,
  "threads_batch": 12,
  "flash_attn": true,
  "cache_type_k": "q8_0",
  "cache_type_v": "q8_0",
  "cont_batching": true,
  "defrag_thold": 0.1,
  "jinja": true,
  "host": "127.0.0.1",
  "port": 8080,
  "n_parallel": 1,
  "temperature": 0.7,
  "top_p": 1.0,
  "min_p": 0.05,
  "repeat_penalty": 1.05
}
json: config_13b_12gb.json, 13B/14B hybrid GPU+CPU
{
  "model": "./models/Qwen2.5-14B-Instruct-Q4_K_M.gguf",
  "alias": "qwen14b",
  "n_gpu_layers": 32,
  "ctx_size": 4096,
  "batch_size": 512,
  "ubatch_size": 256,
  "threads": 12,
  "threads_batch": 16,
  "flash_attn": true,
  "cache_type_k": "q4_0",
  "cache_type_v": "q4_0",
  "tensor_split": "",
  "override_tensor": "token_embd=CPU,output=CPU",
  "no_mmap": false,
  "jinja": true,
  "host": "127.0.0.1",
  "port": 8080
}
bash: Launch with config file
./llama-server --config config_7b_12gb.json

# Override specific values from config via CLI flags
./llama-server --config config_7b_12gb.json --port 8081 -c 16384
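If your build does not accept a config file, a JSON file like the ones above can still drive the launch by translating it into ordinary flags. The sketch below is one hypothetical way to do that: the function name is made up, it assumes each key maps to a long flag by swapping underscores for hyphens, and the RENAMES table covers the two keys whose flag names are assumed to differ (--parallel and --temp). Verify the resulting flags against llama-server --help.

```python
import json

# Keys whose long flag name is assumed to differ from the JSON key
RENAMES = {"n_parallel": "parallel", "temperature": "temp"}


def config_to_args(config):
    """Translate a flat config dict into a list of llama-server command-line arguments."""
    args = []
    for key, value in config.items():
        flag = "--" + RENAMES.get(key, key).replace("_", "-")
        if isinstance(value, bool):
            if value:
                args.append(flag)  # boolean switches take no value; false means omit
        elif value not in ("", None):
            args.extend([flag, str(value)])
    return args


config = json.loads("""
{
  "model": "./models/Qwen2.5-7B-Instruct-Q4_K_M.gguf",
  "n_gpu_layers": 99,
  "ctx_size": 8192,
  "flash_attn": true,
  "cache_type_k": "q8_0",
  "no_mmap": false,
  "tensor_split": "",
  "n_parallel": 1,
  "temperature": 0.7
}
""")
print(config_to_args(config))

# To launch: subprocess.run(["./llama-server", *config_to_args(config)], check=True)
```

Building an argument list rather than a shell string avoids quoting problems with values such as -ot patterns.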

Cheat Sheet

All Major Parameters — Quick Reference

Flag | Long Form | Type | Default | One-Line Summary
-m | --model | str | - | Path to GGUF file ← required
-a | --alias | str | - | Model name returned by API
-ngl | --n-gpu-layers | int | 0 | Layers offloaded to GPU VRAM
-c | --ctx-size | int | model | Context window in tokens
-t | --threads | int | cores | CPU threads for decode
-tb | --threads-batch | int | = -t | CPU threads for prefill
-b | --batch-size | int | 2048 | Logical batch tokens (prefill)
-ub | --ubatch-size | int | 512 | Physical micro-batch (CUDA)
-n | --predict | int | -1 | Max output tokens (-1=unlimited)
--temp | --temperature | float | 0.8 | Sampling randomness (0=greedy)
- | --top-p | float | 0.95 | Nucleus sampling cutoff
- | --top-k | int | 40 | Top-K token limit (0=off)
- | --min-p | float | 0.05 | Min probability relative to top
- | --repeat-penalty | float | 1.0 | Penalty for repeating tokens
- | --repeat-last-n | int | 64 | Context window for repeat check
-fa | --flash-attn | bool | off | Enable Flash Attention v2
- | --cache-type-k | enum | f16 | K-cache dtype (f16/q8_0/q4_0)
- | --cache-type-v | enum | f16 | V-cache dtype (f16/q8_0/q4_0)
-ot | --override-tensor | str | - | Regex=device tensor placement
--mmap | --no-mmap | bool | on | Memory-mapped model loading
- | --mlock | bool | off | Lock model pages in RAM
- | --defrag-thold | float | -1 | KV cache defrag threshold
- | --host | str | 127.0.0.1 | Server bind address
- | --port | int | 8080 | Server port
- | --jinja | bool | off | Enable Jinja2 chat templates
- | --chat-template | str | - | Override chat template preset
-np | --parallel | int | 1 | Concurrent inference slots
- | --api-key | str | - | Bearer token for auth
- | --cont-batching | bool | on | Continuous batching (server)
-s | --seed | int | -1 | RNG seed (-1=random)

Sampling Stack — Execution Order

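text: How samplers chain together (llama.cpp default order)
Raw logits from model
    ↓
Repeat / presence / frequency penalties   (--repeat-penalty)
    ↓
Top-K filtering         (--top-k 0 to disable)
    ↓
Top-P / nucleus         (--top-p)
    ↓
Min-P filtering         (--min-p)
    ↓
Temperature scaling     (--temp)
    ↓
Sample final token
# Other samplers (DRY, typical-P, XTC) sit in the same chain but are off by default; reorder with --samplers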

Quick Decision Guide

Goal | Settings
Code / factual answers | --temp 0.1 --top-p 0.9 --top-k 0 --repeat-penalty 1.0
General chat | --temp 0.7 --top-p 1.0 --min-p 0.05 --repeat-penalty 1.05
Creative writing | --temp 1.1 --top-p 0.95 --min-p 0.02 --repeat-penalty 1.1
Reproduce output | --temp 0 -s 42 (greedy + fixed seed)
Max VRAM savings | -fa --cache-type-k q4_0 --cache-type-v q4_0 -ot "token_embd=CPU,output=CPU"
Max speed (full GPU) | -ngl 99 -fa -b 4096 -ub 1024 --cont-batching

Official Docs & Links

📦 llama.cpp Repository
github.com/ggerganov/llama.cpp
Main repository. Source code, releases, build instructions, and issue tracker.
🖥️ llama-server Docs
github.com/ggerganov/llama.cpp/blob/master/docs/server.md
Complete llama-server parameter reference, API endpoints, OpenAI compatibility notes.
🔨 Build Guide
github.com/ggerganov/llama.cpp/blob/master/docs/build.md
Platform-specific build instructions: CUDA, Metal, Vulkan, OpenCL, CPU backends.
⚡ Performance Tips
docs/development/token-generation-performance-tips.md
Official performance tuning guide: batch size, threading, GPU offload, and throughput optimization.
✨ Flash Attention Docs
docs/flash-attention.md
Detailed explanation of FA support, compatible quant types, and how KV cache quantization interacts.
📄 GGUF Format Spec
github.com/ggerganov/ggml/blob/master/docs/gguf.md
GGUF file format specification — metadata fields, tensor layout, and quantization types.
💬 GitHub Discussions
github.com/ggerganov/llama.cpp/discussions
Community Q&A, model compatibility reports, performance benchmarks, and config sharing.
🤗 bartowski GGUF Hub
huggingface.co/bartowski
High-quality GGUF quantizations of the latest models — Q4_K_M, Q5_K_M, Q8_0, and more.
🔌 Server API Reference
tools/server/README.md
All REST endpoints: /v1/chat/completions, /v1/completions, /tokenize, /detokenize, /slots, /health.
🚀 Releases & Changelogs
github.com/ggerganov/llama.cpp/releases
Pre-built binaries for Linux, macOS, Windows. CUDA, Metal, and CPU variants available.