mirror of
https://github.com/ikawrakow/ik_llama.cpp.git
synced 2026-06-28 04:30:15 -05:00
* Update README.md Models, MTP, fit * Update parameters.md Disclaimer, terms, new flags, graph split list.
560 lines
47 KiB
Markdown
# Parameters Documentation
|
||
|
||
An overview of the most common command-line parameters in `ik_llama.cpp`, with notes on how to use them. It is not exhaustive and may omit some available options.
|
||
|
||
## Table of Contents

- [Some jargon](#llm-jargon)
- [General Parameters](#general-parameters)
- [Speculative Decoding](#speculative-decoding)
- [Cache Prompt to Host Memory](#cache-prompt-to-host-memory)
- [Sampling](#sampling)
- [Template](#prompt-template)
- [Context Hacking](#context-hacking)
- [Parallel Processing](#parallel-processing)
- [Multi-modality](#multi-modality)
- [GPU Offload](#gpu-offload)
- [Model Options](#model-options)
- [Server Options](#server-options)
- [Other Tools](#other-tools)
- [Unique parameters](#unique-parameters)
- [Graph parallel models](#graph-parallel-models)

## LLM Jargon

Some often used terms.

| Term | Meaning |
| - | - |
| LLM/model | Large Language Model: a language model trained with machine learning on a vast amount of text. |
| Tensors | The foundational part of a model: multi-dimensional arrays of numbers (scalar, vector, matrix, higher dimensions). |
| Layers | Modular units that perform specific computations on the tensors. A neural network is essentially a stack of layers, each transforming the data in some way. |
| Weights | Numerical values associated with the connections between tensors in the layers. |
| Activations | Output of a layer after it has performed its computations. |
| FA | Flash Attention, a method to improve the efficiency of transformer models. See [Dao-AILab/flash-attention](https://github.com/Dao-AILab/flash-attention). |
| VRAM | Dedicated memory on the GPU. |
| Inference | Running a model to generate responses. |
| GGUF | The file format used by ik_llama.cpp and llama.cpp. |
| quants | The "compressed" format of the model. |
| BPW | Bits per weight; measures the "compression". |
| imatrix | Importance matrix, generated by a model from calibration text. Tweaks the "compression" to reduce loss. |
| model splits | A GGUF file can be split into multiple parts to simplify upload/download. When using such a model, specify only the first part. |
| PP | Prompt processing. |
| TG | Token generation. |
| t/s | Tokens per second; measures PP and TG speed. |
| full gpu | All processing offloaded to the GPU. |
| hybrid cpu/gpu | Partial offload to the GPU. |
| RAG | Retrieval Augmented Generation. Provide external documents to the LLM for information lookup. |
| MCP | Model Context Protocol, an [open standard](https://en.wikipedia.org/wiki/Model_Context_Protocol) for the way artificial intelligence (AI) systems like large language models (LLMs) integrate and share data with external tools, systems, and data sources. |
| AI agent | A tool/program that uses an LLM to achieve a goal/task via a series of planning/steps/actions/tool-calling/etc. `Coding agents` are specialized in software goals. |
| Agent harness | The tools and the infrastructure around the LLM in an AI agent. `AI Agent = LLM + Agent harness` |

## General Parameters
|
||
|
||
| Parameter | Description | Default | Notes/Examples |
|
||
| - | - | - | - |
|
||
| `-h, --help, --usage` | Print usage and exit | - | - |
|
||
| `--fit` | Automatically fit to available VRAM | off | Loads as many tensors to the GPU(s) as available VRAM will permit. [PR 1501](https://github.com/ikawrakow/ik_llama.cpp/pull/1501) [PR 1504](https://github.com/ikawrakow/ik_llama.cpp/pull/1504) |
|
||
| `--fit-margin N` | Safety VRAM margin in MiB when using `--fit` | 1024 | Increase this value in case of CUDA OOM when loading the model. Decrease to less than 1024 if the model loads successfully and you feel that too much VRAM has been left unused |
|
||
| `--gpu-fit-margin GPU1,M1,...` | Per GPU fit margin | - | Set the fit margin per GPU when auto-fitting the model. [PR 1872](https://github.com/ikawrakow/ik_llama.cpp/pull/1872) |
|
||
| `-wgt, --worst-graph-tokens N` | Number of tokens to use for worst-case graph | - | Control compute buffer sizes for large batches. Provided "as is" for users that understand the limitations, please don't open issues when using this. [PR 1560](https://github.com/ikawrakow/ik_llama.cpp/pull/1560) |
|
||
| `-t, --threads N` | Number of threads to use during generation | 4 | Try to match the number of physical CPU cores. Avoid odd numbers (e.g. 1,3,...). |
|
||
| `-tb, --threads-batch N` | Number of threads to use during batch and prompt processing | Same as `--threads` | Same guidance as `--threads`. When doing full GPU offload, use a lower number (e.g. 2). |
|
||
| `-tm, --threads-mtmd N` | Number of threads to use during multimodal image processing | Same as `--threads-batch` | Control CPU thread count used during multimodal image/audio processing (mmproj encoding), separate from the main LLM thread count. |
|
||
| `-c, --ctx-size N` | Size of the prompt context | 0 (loaded from model) | Determines the size of the KV cache (memory), so start with a value that fits your system and increase as needed (2048, 4096, …). If you use parallel slots, this context size is split across the slots. |
|
||
| `-n, --predict N` | Number of tokens to predict | -1 (infinity) | -1 (infinity), -2 (until context filled). Safe to leave default. |
|
||
| `-b, --batch-size N` | Logical maximum batch size | 2048 | Safe to leave default. Higher values may give better t/s especially on GPU, while using more memory. |
|
||
| `-ub, --ubatch-size N` | Physical maximum batch size | 512 | Safe to leave default. Similar to `--batch-size N` |
|
||
| `--keep N` | Number of tokens to keep from the initial prompt | 0 | -1 = all |
|
||
| `--chunks N` | Max number of chunks to process | -1 (all) | |
|
||
| `-dr, --dry-run` | Skip loading tensors in the files | - | Skips loading the tensor data, yet still reports OOM errors and prints memory usage correctly, which is helpful when manually tuning very large models. |
|
||
| `--minilog` | Print important information | - | For `llama-server`, log request message for completions/response/anthropic and response. The prompt in the json format and the text response are saved in the log file and printed to the console. [PR 1477](https://github.com/ikawrakow/ik_llama.cpp/pull/1477) |
|
||
| `-fa, --flash-attn` | Enables Flash Attention | on | auto / on / off. Improves t/s and reduces memory usage. |
|
||
| `--no-fa, --no-flash-attn` | Disable Flash Attention | | Alternative parameter to turn off FA. See `--flash-attn` |
|
||
| `-mla, --mla-use` | Enable MLA | 3 | 0 / 1 / 2 / 3 For DeepSeek models, and other recent models that are using MLA. [PR 188](https://github.com/ikawrakow/ik_llama.cpp/pull/188) [PR 205](https://github.com/ikawrakow/ik_llama.cpp/pull/205) [PR 235](https://github.com/ikawrakow/ik_llama.cpp/pull/235) [PR 243](https://github.com/ikawrakow/ik_llama.cpp/pull/243) [PR 252](https://github.com/ikawrakow/ik_llama.cpp/pull/252) [PR 253](https://github.com/ikawrakow/ik_llama.cpp/pull/253) [PR 273](https://github.com/ikawrakow/ik_llama.cpp/pull/273) [PR 386](https://github.com/ikawrakow/ik_llama.cpp/pull/386) [PR 497](https://github.com/ikawrakow/ik_llama.cpp/pull/497) [PR 943](https://github.com/ikawrakow/ik_llama.cpp/pull/943) [PR 1821](https://github.com/ikawrakow/ik_llama.cpp/pull/1821) |
|
||
| `-amb, --attention-max-batch` | Max batch size for attention computations | 0 | Specifies the maximum K*Q size in MB we want to tolerate. [PR 237](https://github.com/ikawrakow/ik_llama.cpp/pull/237) |
|
||
| `-fmoe, --fused-moe` | Fused MoE ffn_up and ffn_gate | - | Speedup for MoE models. [PR 229](https://github.com/ikawrakow/ik_llama.cpp/pull/229) |
|
||
| `--no-fmoe, --no-fused-moe` | Disable fused MoE | Enabled | See `--fused-moe` |
|
||
| `-ger, --grouped-expert-routing` | Enable grouped expert routing | Disabled | For BailingMoeV2 architecture (Ling/Ring models). [PR 836](https://github.com/ikawrakow/ik_llama.cpp/pull/836) [PR 838](https://github.com/ikawrakow/ik_llama.cpp/pull/838) |
|
||
| `--no-fug, --no-fused-up-gate` | Disable fused up-gate | Enabled | Turn off the speedup for dense models. [PR 741](https://github.com/ikawrakow/ik_llama.cpp/pull/741) |
|
||
| `--no-mmad, --no-fused-mul-multiadd` | Disable fused mul-multi_add | Enabled | [PR 858](https://github.com/ikawrakow/ik_llama.cpp/pull/858) |
|
||
| `-gr, --graph-reuse` | Enable graph reuse | Enabled | For models with fast TG inference (100+ t/s). [PR 947](https://github.com/ikawrakow/ik_llama.cpp/pull/947) |
|
||
| `--no-gr, --no-graph-reuse` | Disable graph reuse | Disabled | Option to turn off graph reuse. [PR 1094](https://github.com/ikawrakow/ik_llama.cpp/pull/1094) |
|
||
| `-ser, --smart-expert-reduction` | Experts reduction Kmin,t | -1, 0 | Use a custom number of active experts. Powerful, basically REAP from just command line. If we set t = 1, we use a fixed number of experts K_min (`-ser 1,6` will use 6 experts instead of the model default). [PR 239](https://github.com/ikawrakow/ik_llama.cpp/pull/239) |
|
||
| `-mqkv, --merge-qkv` | Merge Q,K,V | 0 | Downside: mmap cannot be used. [PR 878](https://github.com/ikawrakow/ik_llama.cpp/pull/878) [PR 892](https://github.com/ikawrakow/ik_llama.cpp/pull/892) |
|
||
| `-muge, --merge-up-gate-experts` | Merge ffn_up/gate_exps | 0 | Speed up on some models. [PR 1137](https://github.com/ikawrakow/ik_llama.cpp/pull/1137) [PR 1139](https://github.com/ikawrakow/ik_llama.cpp/pull/1139) [PR 1403](https://github.com/ikawrakow/ik_llama.cpp/pull/1403) [PR 1413](https://github.com/ikawrakow/ik_llama.cpp/pull/1413)|
|
||
| `-khad, --k-cache-hadamard` | Use Hadamard transform for K-cache | 0 | May improve KV quality when heavily quantized. [PR 1033](https://github.com/ikawrakow/ik_llama.cpp/pull/1033) [PR 1034](https://github.com/ikawrakow/ik_llama.cpp/pull/1034) |
|
||
| `-vhad, --v-cache-hadamard` | Use Hadamard transform for V-cache | 0 | May improve KV quality when heavily quantized. [PR 1527](https://github.com/ikawrakow/ik_llama.cpp/pull/1527) |
|
||
| `-sas, --scheduler_async` | Async evaluation of compute graphs | 0 | [PR 1089](https://github.com/ikawrakow/ik_llama.cpp/pull/1089) |
|
||
| `-vq, --validate-quants` | Validate quantized data while loading the model | 0 | If there are NaNs in the model, you will get info about the tensors containing NaNs. [PR 977](https://github.com/ikawrakow/ik_llama.cpp/pull/977) |
|
||
| `-sp, --special` | Special tokens output enabled | false | |
|
||
| `--no-warmup` | Skip warming up the model with an empty run | - | |
|
||
| `--mlock` | Force system to keep model in RAM rather than swapping or compressing | - | |
|
||
| `--no-mmap` | Do not memory-map model (slower load but may reduce pageouts) | - | |
|
||
| `--ui-mcp-proxy, --webui-mcp-proxy` | Experimental: whether to enable MCP CORS proxy - do not enable in untrusted environments | disabled | Support CORS Proxy on llama-server backend side. It is required to make external mcp server work on llamacpp webui. [PR 1904](https://github.com/ikawrakow/ik_llama.cpp/pull/1904) |
|
||
| `--defer-experts` | Defer expert mmap residency on Linux to reduce model load time | false | Using this flag, expert tensor pages are faulted in on demand rather than being eagerly loaded during initialization. This allows us to reduce cold-start latency, thus improving the load time of MoE models, particularly on systems where users are running models off of storage. [PR 1634](https://github.com/ikawrakow/ik_llama.cpp/pull/1634) |
|
||
| `-rtr, --run-time-repack` | Repack tensors if interleaved variant is available | - | May improve performance on some systems. [PR 147](https://github.com/ikawrakow/ik_llama.cpp/pull/147) |
|
||
| `--ctx-checkpoints` | Set the number of checkpoints per slot | - | Enables checkpoints for the recurrent models Qwen3-Next and Qwen3.5-MoE. [PR 1310](https://github.com/ikawrakow/ik_llama.cpp/pull/1310) |
|
||
| `--ctx-checkpoints-interval` | Minimum number of tokens between context checkpoints | - | Set a small value to create checkpoints more frequently. If set to a positive number, checkpoints are saved during TG at this interval. During PP, a checkpoint can only be saved once per batch, so the value acts as the minimum number of tokens between checkpoints. [PR 1310](https://github.com/ikawrakow/ik_llama.cpp/pull/1310) |
|
||
|
||
## Speculative Decoding
|
||
|
||
A technique that can significantly accelerate token generation by predicting multiple tokens ahead of the main model.
|
||
|
||
Check the details [here](./speculative.md).
|
||
|
||
| Parameter | Description | Default | Notes/Examples |
|
||
| - | - | - | - |
|
||
| `-td, --threads-draft N` | Number of threads to use during generation | Same as `--threads` | |
|
||
| `-tbd, --threads-batch-draft N` | Number of threads to use during batch and prompt processing | Same as `--threads-draft` | |
|
||
| `-ps, --p-split N` | Speculative decoding split probability | 0.1 | |
|
||
| `-cd, --ctx-size-draft N` | Size of the prompt context for the draft model | 0 (loaded from model) | Similar to `--ctx-size` but applied to the draft model, if used. |
|
||
| `-ctkd, --cache-type-k-draft TYPE` | KV cache data type for K for the draft model | - | For draft model, see: `-ctk` |
|
||
| `-ctvd, --cache-type-v-draft TYPE` | KV cache data type for V for the draft model | - | For draft model, see: `-ctk` |
|
||
| `-draft, --draft-params` | Comma-separated list of draft model parameters | - | |
|
||
| `--spec-type SPEC[:k=v,...]` | Canonical speculative stage entry; repeat to configure the supported two-stage chain | - | Types: `none`, `draft`, `mtp`, `ngram-cache`, `ngram-simple`, `ngram-map-k`, `ngram-map-k4v`, `ngram-mod`, `suffix`. Canonical keys: `n_max`, `n_min`, `p_min`, `ngram_size_n`, `ngram_size_m`, `ngram_min_hits`, `suffix_min_match_len`, `suffix_max_depth`, `suffix_corpus`. String values may escape commas as `\,` or quote the value inside the stage payload. Example: `--spec-type ngram-mod:n_max=64,n_min=2,ngram_size_n=8 --spec-type mtp:n_max=1,p_min=0.0` |
|
||
| `--spec-autotune` | Automatically tune speculative params to maximize tokens/sec | - | Automatically determines the near-optimal arguments for the type of speculation being performed [PR 1595](https://github.com/ikawrakow/ik_llama.cpp/pull/1595) |
|
||
| `--recurrent-ckpt-mode MODE` | Checkpoint strategy for recurrent/hybrid speculative decoding | auto | One of: - `auto` auto-select: per-step if CUDA full-GPU, gpu-fallback otherwise - `per-step` save SSM state per draft step in VRAM; no re-decode on rejection - `gpu-fallback` copy state to GPU buffer; re-decode on rejection - `cpu` serialise state via llama_state_seq; re-decode on rejection [PR 1669](https://github.com/ikawrakow/ik_llama.cpp/pull/1669) [PR 1774](https://github.com/ikawrakow/ik_llama.cpp/pull/1774) |
|
||
|
||
Notes:
|
||
|
||
- Legacy `--spec-stage`, `--draft-*`, `--spec-ngram-*`, `--suffix-*`, and `-mtp` flags are rejected with replacement guidance.
|
||
- Explicit stage chains currently support at most two stages.
|
||
- Supported self-spec stage names are `ngram-cache`, `ngram-simple`, `ngram-map-k`, `ngram-map-k4v`, `ngram-mod`, and `suffix`.
|
||
- Composite stage chains disable speculative autotune.
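
To make the `SPEC[:k=v,...]` syntax concrete, here is a minimal Python sketch that splits a `--spec-type` value into its stage name and key/value pairs. It only illustrates the format described in the table; `parse_spec_type` is a hypothetical helper, not part of `ik_llama.cpp`, and it deliberately ignores the `\,` escaping and quoting rules.

```python
def parse_spec_type(value: str) -> tuple[str, dict[str, str]]:
    """Split "TYPE:k=v,k=v" into (TYPE, {k: v}).

    Simplification: escaped commas (\\,) and quoted values are not handled.
    """
    spec, _, payload = value.partition(":")
    params = dict(item.split("=", 1) for item in payload.split(",")) if payload else {}
    return spec, params


# The two stages from the example in the table above.
stages = [
    parse_spec_type("ngram-mod:n_max=64,n_min=2,ngram_size_n=8"),
    parse_spec_type("mtp:n_max=1,p_min=0.0"),
]

# Explicit stage chains currently support at most two stages.
assert len(stages) <= 2

for spec, params in stages:
    print(spec, params)
```

Values are kept as strings here; converting them to numbers is left out because the accepted ranges are not specified in this document.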
|
||
|
||
## Cache Prompt to Host Memory

When the user starts a new conversation, the old conversation's KV cache is saved in RAM and can be retrieved later. This greatly reduces prompt processing time when switching between conversations, and you can keep as many conversations as your RAM allows.

Note: When the available memory is very limited, turn this option off (`-cram 0`) to avoid memory swapping.

| Parameter | Description | Default | Notes/Examples |
| - | - | - | - |
| `-cram, --cache-ram N` | Set the maximum cache size in MiB | 8192 | -1 = no limit, 0 = disable. Very useful when variations of the same prompt are re-sent to the model (coding agents, etc.). [PR 954](https://github.com/ikawrakow/ik_llama.cpp/pull/954) |
| `-crs, --cache-ram-similarity N` | Max similarity of prompt tokens to cache tokens that triggers prompt cache | 0.50 | |
| `-cram-n-min, --cache-ram-n-min N` | Minimum number of cached tokens that triggers prompt cache | 0 | |

## Sampling

Sampling refers to the techniques used to generate text by selecting the next word or token based on probabilities.

A good overview is available in [kalomaze/llm_samplers_explained.md](https://gist.github.com/kalomaze/4473f3f975ff5e5fade06e632498f73e).

| Parameter | Description | Default | Notes/Examples |
| - | - | - | - |
| `--samplers SAMPLERS` | Samplers used for generation in order, separated by `;` | dry;top_k;tfs_z;typical_p;top_p;min_p;xtc;top_n_sigma;temperature;adaptive_p | Powerful option to customize samplers. Try to keep the default order, otherwise the effects will be minimized. Example to use only min_p and temperature: `--samplers min_p;temperature` |
| `--sampling-seq SEQUENCE` | Simplified sequence for samplers | dkfypmxntw | Same as `--samplers`, just a shorter format. |
| `--banned-string-file` | Path to a file listing the banned strings, one per line | | |
| `--banned-n` | Number of tokens banned in the phrase during rewind. | -1 | -1 means all tokens [PR 1185](https://github.com/ikawrakow/ik_llama.cpp/pull/1185) |
| `--expiring-logit-bias-file FILENAME` | Load bias states from a custom file format | - | [PR 1731](https://github.com/ikawrakow/ik_llama.cpp/pull/1731) [PR 1770](https://github.com/ikawrakow/ik_llama.cpp/pull/1770) |
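
The two defaults in the table have the same number of entries, which suggests that each letter of `--sampling-seq` stands for the sampler at the same position in `--samplers`. The Python sketch below pairs them up on that assumption; the mapping is inferred from the defaults listed here, not taken from the source code, and `to_sampling_seq` is a hypothetical helper.

```python
# Defaults copied from the table above.
DEFAULT_SAMPLERS = "dry;top_k;tfs_z;typical_p;top_p;min_p;xtc;top_n_sigma;temperature;adaptive_p"
DEFAULT_SEQ = "dkfypmxntw"

# Assumption: the i-th letter of the short form names the i-th sampler.
NAME_TO_LETTER = dict(zip(DEFAULT_SAMPLERS.split(";"), DEFAULT_SEQ))


def to_sampling_seq(samplers: str) -> str:
    """Convert a --samplers value such as "min_p;temperature" to the short form."""
    return "".join(NAME_TO_LETTER[name] for name in samplers.split(";"))


print(to_sampling_seq("min_p;temperature"))
```

When passing `--samplers` on a POSIX shell, quote the value (`--samplers "min_p;temperature"`), because an unquoted `;` ends the command.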
## Prompt Template
|
||
|
||
An incorrect prompt template, or an incorrect template format, may break the model output.
|
||
|
||
| Parameter | Description | Default | Notes/Examples |
|
||
| - | - | - | - |
|
||
| `--jinja` | Set custom jinja chat template | Template taken from model's metadata | Mandatory for Tool Calling. |
|
||
| `--chat-template JINJA_TEMPLATE` | Use jinja template for chat | Disabled | If there is no official `tool_use` Jinja template, you may want to set `--chat-template chatml` to use a default that works with many models |
|
||
| `--chat-template-file file_with_JINJA_TEMPLATE` | Load jinja template for chat from the file | - | Sometimes the model producer or the community fixes the template after the GGUF files are released, so the file's metadata contains a buggy version. To avoid re-downloading the entire model file, download only the `.jinja` file and use it (`--chat-template-file /models/Qwen_Qwen3-Coder-30B-A3B-Instruct-fixed.jinja`). |
|
||
| `--reasoning-format FORMAT` | Controls whether thought tags are allowed and/or extracted from the response | none | One of: - `none` leaves thoughts unparsed in `message.content` - `deepseek` puts thoughts in `message.reasoning_content` (except in streaming mode, which behaves as `none`) - `deepseek-legacy` keeps `<think>` tags in `message.content` while also populating `message.reasoning_content`. This is useful when the frontend (including agents) is hardcoded to use just a specific format. |
|
||
| `--chat-template-kwargs JSON` | Sets additional params for the json template parser | - | Example for gpt-oss: `--chat-template-kwargs '{"reasoning_effort": "medium"}'` |
|
||
| `--reasoning-budget N` | Controls the amount of thinking allowed | -1 (unrestricted) | 0 (disable thinking) |
|
||
| `--reasoning-tokens FORMAT` | Exclude reasoning tokens to select the slot more accurately | auto | |
|
||
| `--reasoning` | Control reasoning on and off | - | on / off / auto [PR 1376](https://github.com/ikawrakow/ik_llama.cpp/pull/1376) |
|
||
| `--reasoning-budget` | Token budget for thinking | -1 | -1 for unrestricted, 0 for immediate end, N>0 for token budget [PR 1376](https://github.com/ikawrakow/ik_llama.cpp/pull/1376) |
|
||
| `--reasoning-budget-message` | Message injected before the end-of-thinking tag when reasoning budget is exhausted | none | [PR 1376](https://github.com/ikawrakow/ik_llama.cpp/pull/1376) |
|
||
| `--parallel-tool-calls` | enable parallel tool calls | - | [PR 1376](https://github.com/ikawrakow/ik_llama.cpp/pull/1376) |
|
||
| `--skip-chat-parsing` | force a pure content parser, even if a Jinja template is specified; model will output everything | - | [PR 1376](https://github.com/ikawrakow/ik_llama.cpp/pull/1376) |
|
||
| `--peg` | Use peg parser for qwen3.5 models. | - | Forces Qwen3.5 models to use the peg parser to process tool calls, which fixes the crash when the model calls a non-existent function. [PR 1490](https://github.com/ikawrakow/ik_llama.cpp/pull/1490) |
|
||
|
||
## Context Hacking

KV cache improves speed and efficiency, especially at long context, by reusing [past calculations](https://huggingface.co/blog/not-lain/kv-caching).

SWA keeps a sliding window of the prompt when the prompt is longer than the context size and shifts the KV cache accordingly to avoid reprocessing the whole prompt.

The context (a.k.a. KV cache) is stored on the device where the associated attention tensors are.

MLA models already have a compressed cache, so it doesn't really make sense to compress it further with the available parameters.

| Parameter | Description | Default | Notes/Examples |
| - | - | - | - |
| `-dkvc, --dump-kv-cache` | Verbose print of the KV cache | - | |
| `-nkvo, --no-kv-offload` | Disable KV offload | - | Keep KV on CPU. |
| `-ctk, --cache-type-k TYPE` | KV cache data type for K | f16 | A smaller type reduces the size of K in the KV cache, which improves speed and reduces memory requirements, but may reduce output quality. |
| `-ctv, --cache-type-v TYPE` | KV cache data type for V | f16 | See: `-ctk` |
| `-mtprot, --mtp-requantize-output-tensor type` | Use output requantized to type for MTP | - | Improves TG performance when using MTP. It requantizes the tensor on the fly while loading the model; see [PR 1809](https://github.com/ikawrakow/ik_llama.cpp/pull/1809) for details and [PR 1810](https://github.com/ikawrakow/ik_llama.cpp/pull/1810) (`--extra-output-tensor`) for an offline requantization alternative. |
| `--mtmd-kq-type type` | Define the type used for the `K*Q` matrix multiplication | - | Use one of `f16`/`bf16` instead of `f32` to speed up multimodal processing. |
| `--no-context-shift` | Disable context-shift | - | |
| `--context-shift` | Set context-shift | on | auto / on / off / 0 / 1. [PR 973](https://github.com/ikawrakow/ik_llama.cpp/pull/973) |

## Parallel Processing

Serve multiple users/frontends in parallel.

Some frontends, like the included WebUI, can use this feature to let the user start a new chat while another one is still generating.

| Parameter | Description | Default | Notes/Examples |
| - | - | - | - |
| `-np, --parallel N` | Number of parallel sequences to decode | 1 | Useful when the frontend supports it. See `--ctx-size` |
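
Because `--ctx-size` is split across the slots, the context available to each conversation shrinks as `--parallel` grows. A minimal Python sketch of that arithmetic, assuming an even split (the exact rounding used by the server is not stated in this document):

```python
def context_per_slot(ctx_size: int, n_parallel: int) -> int:
    """Tokens available to each slot, assuming --ctx-size is split evenly."""
    return ctx_size // n_parallel


# Example values chosen for illustration only.
print(context_per_slot(32768, 4))  # 8192
```

To keep a given per-conversation context, multiply it by the number of slots when choosing `--ctx-size`.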
## Multi-modality

Use multimodal models.

| Parameter | Description | Default | Notes/Examples |
| - | - | - | - |
| `--mmproj FILE` | Path to a multimodal projector file | - | Usually separate `.gguf` files are available for download, e.g. [mmproj-Qwen_Qwen3.6-35B-A3B-f16.gguf](https://huggingface.co/bartowski/Qwen_Qwen3.6-35B-A3B-GGUF/blob/main/mmproj-Qwen_Qwen3.6-35B-A3B-f16.gguf) for Qwen3.6-35B-A3B |
| `--image-min-tokens N` | Minimum number of tokens each image can take, only used by vision models with dynamic resolution | read from model | - |
| `--image-max-tokens N` | Maximum number of tokens each image can take, only used by vision models with dynamic resolution | read from model | - |
| `--no-mmproj-offload` | Disable GPU offloading for multimodal projector | enabled | See `--threads-mtmd` |

## GPU Offload
|
||
|
||
ik_llama.cpp, like [llama.cpp](https://github.com/ggml-org/llama.cpp), uses CPU as a base for processing.
|
||
|
||
Therefore, the "offloading" term is used when sending some processing to another device (like GPU).
|
||
|
||
As the GPUs (including their VRAM) are more powerful for LLM specific processing than CPU+RAM, the aim is to offload as much as possible to the GPU.
|
||
|
||
Besides the improved quants (better quality and performance at the same size; usable low BPW) and superior performance (faster PP and TG), ik_llama.cpp really shines at providing:
|
||
- Detailed output log which e.g. includes layers and buffers sizes to support offload calculations.
|
||
- A big collection of parameters to tweak offloading (what/where runs: processing, tensors, KV cache, operations, etc.).
|
||
- Split mode `graph` when multiple GPUs are available, including mixes of different GPU types, various VRAM sizes.
|
||
- Many KV cache options, including Hadamard, which allows squeezing every GB of memory.
|
||
- A highly optimized algorithm to automatically load as many tensors as possible to the GPU(s) (`--fit`).
|
||
|
||
A. **Find the model size** in GB
|
||
|
||
Ideally, it should fit entirely in VRAM (`-ngl 999`). That requires room for the model file, plus the KV cache (whose size depends on the context length, e.g. `--ctx-size 4096`), plus some buffers.
|
||
|
||
Note that the model size influences the speed as well, with smaller sizes being faster (less data to move around and calculate).
|
||
|
||
```
llama-server --model /my_local_files/gguf/Qwen_Qwen3-0.6B-IQ4_NL.gguf --ctx-size 4096 -ngl 999
```
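
As a rough cross-check before downloading a file, the model size can be estimated from the parameter count and the bits per weight (BPW). The Python sketch below is an approximation under stated assumptions: the parameter count is a rounded, illustrative figure, the BPW values are the nominal per-block figures for each type, and real GGUF files mix several tensor types, so actual sizes will differ.

```python
def estimate_model_size_gib(n_params: float, bpw: float) -> float:
    """Approximate size in GiB: parameters * bits-per-weight / 8 bytes."""
    return n_params * bpw / 8 / 1024**3


# Assumed, rounded parameter count for a "0.6B" model (illustrative only).
N_PARAMS = 0.6e9

# Nominal bits per weight (assumption: the whole model uses a single type).
NOMINAL_BPW = {"BF16": 16.0, "Q8_0": 8.5, "IQ4_NL": 4.5}

for name, bpw in NOMINAL_BPW.items():
    print(f"{name}: ~{estimate_model_size_gib(N_PARAMS, bpw):.2f} GiB")
```

Remember to add the KV cache and compute buffers on top of this figure when checking whether the model fits in VRAM.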
|
||
|
||
B. When the **model size is too large** to fit in VRAM
|
||
|
||
Some tradeoffs are required.
|
||
|
||
1. Choose a lower quant. This varies by models, generally:
|
||
|
||
- BF16 is too big; it doesn't really make sense to use it for inference.
|
||
- Q8_0 has almost the same quality as BF16 while being half the size.
|
||
- Q6_0 has almost the same quality as Q8_0. For quants under Q6_0, using an imatrix is recommended. In the model metadata, look for `quantize.imatrix.*` fields to see whether the file was made with one.
|
||
- IQ5_K is close to the Q8_0 while being smaller.
|
||
- IQ4_XS `i` and `iqk` have minimal loss.
|
||
- IQ4_KS
|
||
- IQ4_KSS
|
||
- IQ3_K: from here down, `iqk` makes it possible for the model to remain usable.
|
||
- IQ2_K
|
||
- IQ2_KS
|
||
- IQ2_XXS
|
||
|
||
Notes:
|
||
- The `i` quants are a category; they are not related to imatrix. Use of an imatrix is optional (but generally recommended) and is supported by *all quant types* (`legacy`, `k`, `i`, `iqk`) except bitnet.
|
||
- Look in the logs to see the quant types used by the loaded model:
|
||
```
llama_model_loader: - type f32: 113 tensors
llama_model_loader: - type q6_K: 198 tensors
```
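
If you save the server output to a file, the quant types can be pulled out of it programmatically. The Python sketch below parses lines shaped like the two shown above; the regular expression is written against that sample only, so other log layouts may need adjusting, and `count_tensor_types` is a hypothetical helper.

```python
import re

# Matches lines such as "llama_model_loader: - type q6_K: 198 tensors".
TYPE_LINE = re.compile(r"llama_model_loader:\s*-\s*type\s+(\S+):\s*(\d+)\s+tensors")


def count_tensor_types(log_text: str) -> dict[str, int]:
    """Return {quant_type: tensor_count} for every matching line in the log."""
    return {m.group(1): int(m.group(2)) for m in TYPE_LINE.finditer(log_text)}


# Sample copied from the log excerpt above.
sample_log = """\
llama_model_loader: - type f32: 113 tensors
llama_model_loader: - type q6_K: 198 tensors
"""

print(count_tensor_types(sample_log))
```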
|
||
|
||
2. Quantize the KV cache. By default, `f16` is used. As with model quantization, this varies by model: some are sensitive, while others work with very low quants.
|
||
|
||
- Look in the logs for KV details:
|
||
```
./llama-server -m /models/Qwen_Qwen3-0.6B-Q6_K.gguf -c 1024
[...]
llama_kv_cache_init: CPU KV buffer size = 3584.00 MiB
llama_new_context_with_model: KV self size = 3584.00 MiB, K (f16): 1792.00 MiB, V (f16): 1792.00 MiB
```

```
./llama-server -m /models/Qwen_Qwen3-0.6B-Q6_K.gguf -c 1024 --cache-type-k q8_0 --cache-type-v q8_0
[...]
llama_kv_cache_init: CPU KV buffer size = 59.50 MiB
llama_new_context_with_model: KV self size = 59.50 MiB, K (q8_0): 29.75 MiB, V (q8_0): 29.75 MiB
```
|
||
|
||
- To have access to more quant types, build with `GGML_IQK_FA_ALL_QUANTS=ON`; otherwise only the `F16`, `Q8_0`, `Q6_0`, and (if the CPU provides native `BF16` support) `BF16` FA kernels are included. After [PR 1549](https://github.com/ikawrakow/ik_llama.cpp/pull/1549), `Q4_1`, `IQ4_NL`, and `Q4_0` are also enabled by default on **CPU** so that people can experiment; use `GGML_IQK_FA_ALL_QUANTS=OFF` to reduce build time if those quants are not needed.
|
||
- The K-cache may need a better quant than the V-cache to limit quality loss; the two can be specified separately (`--cache-type-k q8_0 --cache-type-v q8_0`).
|
||
- KV cache quantization needs FA (the `--flash-attn` flag), which is already turned on by default.
|
||
- Fast quant type Q8_KV `-ctk q8_KV` [PR 208](https://github.com/ikawrakow/ik_llama.cpp/pull/208)
|
||
- Using `--k-cache-hadamard` on quants lower than `Q6_0` may give better results. Additionally, `ik_llama.cpp` provides `--v-cache-hadamard` for the V-cache. Example: `--cache-type-k q6_0 --k-cache-hadamard --cache-type-v q6_0 --v-cache-hadamard`
|
||
- `Q4_0` achieves low perplexity even without Hadamard.
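
The KV sizes in the logs above can be estimated before loading anything. The Python sketch below uses the usual formula for standard (non-MLA) attention, with one K and one V vector per layer, per KV head, per token. The Qwen3-0.6B dimensions are assumptions that do not come from this document, and `kv_cache_mib` is a hypothetical helper.

```python
# Bytes per element for a few cache types (q8_0 packs 32 values into 34 bytes).
BYTES_PER_ELEMENT = {"f16": 2.0, "bf16": 2.0, "q8_0": 34 / 32}


def kv_cache_mib(n_layers, n_kv_heads, head_dim, n_ctx, type_k="f16", type_v="f16"):
    """Estimate the KV cache size in MiB for a standard (non-MLA) attention model."""
    elements = n_layers * n_kv_heads * head_dim * n_ctx  # count for K, same for V
    total_bytes = elements * (BYTES_PER_ELEMENT[type_k] + BYTES_PER_ELEMENT[type_v])
    return total_bytes / 1024**2


# Assumed Qwen3-0.6B dimensions: 28 layers, 8 KV heads, head size 128.
print(kv_cache_mib(28, 8, 128, 1024))                  # f16 K and V
print(kv_cache_mib(28, 8, 128, 1024, "q8_0", "q8_0"))  # q8_0 K and V
```

Under these assumptions the `q8_0` estimate at 1024 tokens is 59.5 MiB, which agrees with the second log above. The `f16` estimate at 1024 tokens is 112 MiB; the 3584 MiB shown in the first log would correspond to a 32768-token context with the same dimensions.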
|
||
|
||
3. Offload less to the GPU. Try to find a mix of parameters that suits your system better than the defaults.
|
||
|
||
- Try `--fit`. `ik_llama.cpp` automatically determines which tensors to offload to the GPUs based on the available VRAM.
|
||
|
||
- Use `--no-kv-offload` to keep the KV cache on the CPU. This is provided for flexibility but is rarely desirable in practice, as it reduces the prompt processing speed.
|
||
|
||
- Identify the tensors and the number of layers (also shapes and other metadata) by opening the GGUF model file in a web browser, e.g. [bartowski/Qwen_Qwen3-0.6B-IQ4_NL.gguf](https://huggingface.co/bartowski/Qwen_Qwen3-0.6B-GGUF/blob/main/Qwen_Qwen3-0.6B-IQ4_NL.gguf), then scrolling down to the Tensors table. For split models, look at each file part.
|
||
|
||
Or, if you already have the quant locally you can just run `gguf_dump.py`:
|
||
```
python3 gguf-py/scripts/gguf_dump.py /models/Qwen_Qwen3-0.6B-IQ4_NL.gguf
```
|
||
|
||
- Use `--dry-run` to observe the memory usage.
|
||
|
||
- `-ngl`, `-ot`, `--cpu-moe`, `--n-cpu-moe N`
|
||
- For MoE models, use a number greater than the number of model layers with `-ngl`. If unsure, use a large number like `-ngl 999`.
|
||
- It's good to explicitly put up/down/gate onto the GPU for speedups.
|
||
- Up/Gate shouldn't be on separate GPU devices because it might cause a bit of a deadlock.
|
||
- For models with shared experts (like GPT-OSS), they should end up on GPU.
|
||
- In some quants the layers aren't uniform so it can be better to skip larger layers if more smaller blocks will fit without empty space where nothing fits.
|
||
- Put anything that says "exps" in your slowest memory, and anything else in your fastest memory (VRAM). Those ffn "exps" are the sparse expert tensors, the ones that actually get used only 2-5% of the time (depending on the model). If you then have extra VRAM to spare, start putting some of the exps into VRAM too, for some improvement.
|
||
- Some layers (layers are called `blk.n` in gguf), are different in some models. For example [GLM5](https://huggingface.co/ubergarm/GLM-5-GGUF/blob/main/IQ3_KS/GLM-5-IQ3_KS-00002-of-00008.gguf) the first three layers are different (blk.0(14), blk.1(14), blk.2(14) vs. blk.10(19), blk.11(19),...), they don't have exps, they have dense ffn, so they should all go in VRAM. Dense layers are very good to speed up mixed inference systems, as a much larger share of active parameters is fixed, and hence you know which to put in faster VRAM. Also the layers from the 4th onwards have shared exps, "shexp", those too go to VRAM as they are always active.
|
||
- For MoE models you can play with `--cpu-moe`, `--n-cpu-moe N`, `-ooae`/`-no-ooae` before moving to `-ot`.
|
||
- In general, in a single GPU + CPU system, you just do something like this:
|
||
|
||
`-ngl 999` To put all layers in VRAM by default
|
||
|
||
`-ot "blk.(?:[0-9]|[1-7][0-9]|[8][0-7]).ffn_.*_exps.=CPU"` creates exceptions and puts back in RAM anything that has "ffn" and "_exps" in its name and sits in layers called "blk.n", where "n" (the layer number) is 0 to 9, or 1 to 7 followed by 0 to 9 (i.e. a number between 10 and 79), or 8 followed by 0 to 7 (i.e. a number between 80 and 87).
|
||
Basically a complicated way of saying put all experts from layer 0 to 87 in ram. Experts from layer 88 to 93 (there's 93 layers in qwen3vl 235b) can sit in VRAM still. (Thats all I can load on a 5090).
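Before loading a large model, it can help to check which layers such a pattern really selects. The following is a minimal sketch, not part of `ik_llama.cpp`: it assumes expert tensors are named like `blk.N.ffn_up_exps.weight`, and it rewrites the layer group in POSIX ERE syntax (plain `( )` instead of `(?: )`, dots escaped) so that `grep -E` can evaluate it.

```shell
# ERE version of the layer selection used in the -ot example above.
pattern='^blk\.([0-9]|[1-7][0-9]|8[0-7])\.ffn.*_exps\.'

cpu=0   # layers whose experts the pattern sends to the CPU
gpu=0   # layers whose experts stay in VRAM
i=0
while [ "$i" -le 93 ]; do
  # Hypothetical tensor name for layer i.
  name="blk.$i.ffn_up_exps.weight"
  if printf '%s\n' "$name" | grep -Eq "$pattern"; then
    cpu=$((cpu + 1))
  else
    gpu=$((gpu + 1))
  fi
  i=$((i + 1))
done

echo "experts in RAM: $cpu layers, experts in VRAM: $gpu layers"
```

For layers 0 to 93 this reports 88 layers matched (0 to 87) and 6 left in VRAM (88 to 93). Changing the upper bound of the last alternative (`8[0-7]`) is the usual knob for moving more or fewer expert layers into VRAM.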
C. Other tips

- Ensure that you use a CUDA version that supports your GPU(s).
- Check for PCIe errors with `lspci -vvv | grep -F 'at lane'`.
- Multiple GPUs
  - Check peer-to-peer read capability between GPUs with `nvidia-smi topo -p2p r`.
  - Change the order of GPUs with `CUDA_VISIBLE_DEVICES=...` until the best GPU is used appropriately, especially when the GPUs are different (type, capability, slot speed, etc.).
  - If you are not happy with the allocations done by `--fit` across GPUs, use `-ts` to tweak them manually.
  - Look for `ReBAR`/`Resizable BAR` support in your motherboard, CPU, BIOS/UEFI and GPU. Then look for the "patched driver" for your GPUs to enable direct GPU-to-GPU communication.
### Common GPU configurations and popular models

WIP
| Parameter | Description | Default | Notes/Examples |
| - | - | - | - |
| `-ngl, --gpu-layers N` | Number of layers to store in VRAM | - | For better speed, aim to offload the entire model into GPU memory. To find the number of layers (as well as shapes and other metadata), open the GGUF model file in the web browser, e.g. [bartowski/Qwen_Qwen3-0.6B-IQ4_NL.gguf](https://huggingface.co/bartowski/Qwen_Qwen3-0.6B-GGUF/blob/main/Qwen_Qwen3-0.6B-IQ4_NL.gguf), then scroll down to the Tensors table. Use a number higher than the number of model layers to fully offload (`--gpu-layers 99` for a model with fewer than 99 layers). See `--ctx-size` and reduce it to the minimum needed. If the model fails to load due to insufficient GPU memory, reduce the number of layers (`--gpu-layers 20` for a model with 40 layers offloads only the first 20 layers). |
| `-ngld, --gpu-layers-draft N` | Number of layers to store in VRAM for the draft model | - | For the draft model, see `--gpu-layers`. |
| `--cpu-moe` | Keep all MoE weights in CPU memory | - | Simple offload mode for MoE. [PR 841](https://github.com/ikawrakow/ik_llama.cpp/pull/841) |
| `--n-cpu-moe N` | Keep MoE weights of the first N layers in CPU memory | - | Similar to `--cpu-moe`, but for when some GPU memory is available to store some of the layers. |
| `-sm, --split-mode SPLIT_MODE` | How to split the model across multiple GPUs | none | With more than one GPU, how to split the model across them. One of: `none` (use one GPU only); `graph` (split model tensors and computation graph across GPUs; `graph` is exclusive to `ik_llama.cpp` and extremely effective for dense and MoE models, [PR 1080](https://github.com/ikawrakow/ik_llama.cpp/pull/1080)); `layer` (split layers and KV across GPUs). Example: `-sm graph` |
| `-ts, --tensor-split SPLIT` | Fraction of the model to offload to each GPU (comma-separated) | - | Powerful for tweaking. Example: `-ts 3,1` |
| `-dev, --device dev1,dev2` | Comma-separated list of devices to use for offloading | none | Use when many GPUs are available on the system and only selected ones should be used. Example: `-dev CUDA0,CUDA1` |
| `-devd, --device-draft dev1,dev2` | Comma-separated list of devices for the draft model | none | For the draft model, see `--device`. |
| `-mg, --main-gpu i` | The GPU to use for the model (with split-mode = none) | - | |
| `-cuda fa-offset=value` | FP16 precision offset for FA calculation | 0 | Rarely, fp16 precision is inadequate, at least for some models, when computing FA for very long contexts. The value must be a valid floating point number in the interval [0...3] (this is checked, and a value outside this interval is ignored). By default the offset is zero. If you find that a model works up to a given context length but then starts producing gibberish, incoherent output or endless repetitions, it is very likely due to f16 overflow in the FA calculation, and using this command line option is likely to solve it. [PR 1198](https://github.com/ikawrakow/ik_llama.cpp/pull/1198) |
| `-ot or --override-tensor` | Override where model weights are stored | - | Override where model weights are stored using regular expressions. This allows, for example, keeping the MoE experts on the CPU and offloading only the attention and non-repeating layers to the GPU. Example: `\.ffn_.*_exps\.=CPU` [PR 232](https://github.com/ikawrakow/ik_llama.cpp/pull/232) |
| `-op or --offload-policy a,b` | Manually define the offload policy | - | `a` and `b` are integers. Multiple pairs may follow the `-op` or `--offload-policy` argument (i.e., `-op a1,b1,a2,b2,a3,b3...`). The first integer defines the op: it is simply the value in the `ggml_op` enum. The second integer is 0 or 1 and defines whether the op should be offloaded (1) or not offloaded (0) to the GPU. If the op is set to -1, offload is enabled or disabled for all ops. Examples: `-op -1,0` disables all offload to the GPU; `-op 26,0` disables offload of matrix multiplications to the GPU; `-op 27,0` disables offload of indirect matrix multiplications to the GPU (used for the experts in a MoE model); `-op 29,0` disables offload of the fused up-gate-unary op to the GPU (applied to MoE models with `-fmoe`). [PR 405](https://github.com/ikawrakow/ik_llama.cpp/pull/405) |
| `--offload-only-active-experts or -ooae` | For MoE, offload only the active experts | ON | `-ooae` is not related to where the model weights get stored. Instead, once some MoE tensors (`ffn_(up\|gate\|down)_exps.weight`) are on the CPU and, during batch processing, the scheduler decides to copy them to a GPU to perform the corresponding matrix multiplications, `-ooae` tells the scheduler to offload only the activated experts. This is useful for models where the number of activated experts is often much smaller than the total number of experts, so `-ooae` reduces the amount of `RAM -> VRAM` data transfer. A model where this makes a significant difference for hybrid CPU/GPU inference is GPT-OSS-120B. For many MoE models and large batches basically all experts are activated, so this option makes no difference (or can even slightly lower performance, because it costs some time to determine which experts are active, and if all experts turn out to be active, this time was spent for nothing). [PR 698](https://github.com/ikawrakow/ik_llama.cpp/pull/698) |
| `-no-ooae` | Disable offloading only the active experts | - | See `-ooae`. |
| `-smf16, --split-mode-f16` | Use f16 for data exchange between GPUs | 1 | [PR 1087](https://github.com/ikawrakow/ik_llama.cpp/pull/1087) |
| `-smf32, --split-mode-f32` | Use f32 for data exchange between GPUs | 0 | [PR 1087](https://github.com/ikawrakow/ik_llama.cpp/pull/1087) |
| `-grt, --graph-reduce-type` | Type for data exchange between GPUs | f32 | One of q8_0 / bf16 / f16 / f32. Reduces the data transferred between GPUs. [PR 1154](https://github.com/ikawrakow/ik_llama.cpp/pull/1154) |
| `-smgs, --split-mode-graph-scheduling` | Force split mode graph scheduling | 0 | [PR 1068](https://github.com/ikawrakow/ik_llama.cpp/pull/1068) |
| `--max-gpu N` | Define (and use) a maximum number of GPUs per layer with split mode "graph" | - | Of interest when more than 2 GPUs are available, but using all of them leads to lower performance than using just 2 (or using the default split mode "layer"). [PR 1051](https://github.com/ikawrakow/ik_llama.cpp/pull/1051) |
| `-cuda, --cuda-params` | Comma-separated list of CUDA parameters | - | Powerful way to tweak fusion, the GPU offload threshold, and the MMQ-ID threshold. [PR 910](https://github.com/ikawrakow/ik_llama.cpp/pull/910) [PR 1813](https://github.com/ikawrakow/ik_llama.cpp/pull/1813) |
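As a starting point for `-ts`, one option is to make the split proportional to the memory of each GPU. The sketch below only does the arithmetic; the memory sizes are hypothetical, and proportional-to-VRAM is an assumption rather than a rule, since the KV cache and compute buffers also take space and the best split usually needs manual tuning.

```shell
# Hypothetical usable memory per GPU in MiB, in device order (CUDA0, CUDA1).
vram_mib="24576 8192"

# Greatest common divisor, used to reduce the proportions to small integers.
gcd() {
  a=$1; b=$2
  while [ "$b" -ne 0 ]; do
    t=$((a % b)); a=$b; b=$t
  done
  echo "$a"
}

# GCD over all memory sizes.
g=0
for v in $vram_mib; do
  g=$(gcd "$v" "$g")
done

# Build the comma-separated proportions.
ts=""
for v in $vram_mib; do
  ts="${ts:+$ts,}$((v / g))"
done

echo "-ts $ts"
```

With 24 GiB and 8 GiB this prints `-ts 3,1`, the same form as the example in the table.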
## Model Options

| Parameter | Description | Default | Notes/Examples |
| - | - | - | - |
| `--check-tensors` | Check model tensor data for invalid values | false | |
| `--override-kv KEY=TYPE:VALUE` | Override model metadata by key | - | Advanced option to override model metadata by key. May be specified multiple times. Types: int, float, bool, str. Example: `--override-kv tokenizer.ggml.add_bos_token=bool:false` |
| `-m, --model FNAME` | Model path | models/$filename | Mandatory, the GGUF model file to be served. |
| `-md, --model-draft FNAME` | Draft model for speculative decoding | unused | Required when an explicit `draft` stage is used. |
| `--spec-type SPEC[:k=v,...]` | Canonical speculative stage entry; repeat for the supported two-stage chain | none | Use stage-local keys like `n_max`, `n_min`, `p_min`, `ngram_size_n`, `ngram_size_m`, `ngram_min_hits`, `suffix_min_match_len`, `suffix_max_depth`, and `suffix_corpus`. |
### Request-Level Speculative Overrides

When the server is started with speculative decoding enabled, request JSON may override:

- `speculative.n_max`
- `speculative.n_min`
- `speculative.p_min`
- `speculative.stages`

Request-level `speculative.stages` is constrained:

- The number of stages must match the stage chain configured at server startup.
- Each request stage must keep the same `type` as the corresponding startup stage.
- Only `type`, `n_max`, `n_min`, and `p_min` are accepted per request.
- Structural stage parameters such as ngram sizes, ngram hit thresholds, and suffix depth remain startup-only.
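A request with such overrides might look like the sketch below. Several things here are assumptions, not taken from this document: that the overrides are passed as flat dotted keys in the request body, that the request goes to the `/completion` endpoint, and the particular values. The shape of `speculative.stages` is not shown because it is not specified here.

```shell
# Request body: ordinary generation fields plus per-request speculative overrides.
# The flat "speculative.*" key layout is an assumption.
payload='{
  "prompt": "Write a haiku about GPUs.",
  "n_predict": 64,
  "speculative.n_max": 8,
  "speculative.n_min": 1,
  "speculative.p_min": 0.75
}'

# Print the body for inspection.
printf '%s\n' "$payload"

# To send it to a server running on the default host and port, uncomment:
# curl -s http://127.0.0.1:8080/completion \
#   -H 'Content-Type: application/json' -d "$payload"
```

Requests that leave these keys out keep the values configured at server startup.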
## Server Options

| Parameter | Description | Default | Notes/Examples |
| - | - | - | - |
| `--host HOST` | IP address to listen on | 127.0.0.1 | Change to 0.0.0.0 when the endpoint will be accessed from another computer. Never expose the server to the Internet. |
| `--port PORT` | Port to listen on | 8080 | |
| `--webui NAME` | Controls which webui to serve | auto | Choose among the integrated WebUIs: `none` (disable the webui), `auto` (default webui), `llamacpp` (llama.cpp webui). |
| `--api-key KEY` | API key to use for authentication | none | Add a custom API key. Clients will need to specify it when connecting. |
| `-a, --alias` | Set an alias for the model name (to be used by the REST API) | none | Configure the server to serve and reply with a specific model name. |
## Other Tools

### sweep_bench

Benchmark utility that performs a series of prompt processing (PP) batches followed by token generation (TG).
The KV cache is not cleared, so the N_KV column tells you how many tokens were in the KV cache when the PP/TG was processed.

`llama-sweep-bench` understands all parameters that one would use in `llama-server` or `llama-cli` (but obviously not all get used, only those that are related to loading the model, setting up the context parameters, and running the benchmark).

```
llama-sweep-bench -m /models/model.gguf -c 12288 -ub 512 -rtr -fa -ctk q8_0 -ctv q8_0
```

| Parameter | Description | Default | Notes/Examples |
| - | - | - | - |
| `-nrep N, --n-repetitions N` | Define the number of repetitions used at zero context | - | [PR 1176](https://github.com/ikawrakow/ik_llama.cpp/pull/1176) |
| `-n` | Specifies the number of TG tokens | - | If not specified, it is set to u-batch/4. [PR 897](https://github.com/ikawrakow/ik_llama.cpp/pull/897) |
| `--minilog` | Reduce the verbosity | - | [PR 1468](https://github.com/ikawrakow/ik_llama.cpp/pull/1468) |
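For the example command above, the default TG length follows directly from the `-n` rule in the table. The sketch below works it out; the step count is an assumption (one PP batch of u-batch size per step until the context is full), so treat it as an estimate of how many result rows to expect.

```shell
ctx=12288      # value passed to -c
ubatch=512     # value passed to -ub

# Default number of TG tokens when -n is not given: u-batch/4.
tg=$((ubatch / 4))

# Assumption: one PP batch of u-batch tokens per step until the context is full.
steps=$((ctx / ubatch))

echo "TG tokens per step: $tg"
echo "approximate number of steps: $steps"
```

Here that gives 128 TG tokens per step and roughly 24 steps.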
### llama-bench

Benchmark utility.

```
llama-bench -tgb 4,16 -p 512 -n 128 other_arguments
```

| Parameter | Description | Default | Notes/Examples |
| - | - | - | - |
| `-tgb (or --threads-gen-batch)` | Use different numbers of threads for generation and batch processing | - | [PR 284](https://github.com/ikawrakow/ik_llama.cpp/pull/284) |
| `--fit` | Automatically fit to available VRAM | 0 | 0 / 1 [PR 1542](https://github.com/ikawrakow/ik_llama.cpp/pull/1542) |
| `--fit-margin N` | Safety VRAM margin in MiB when using `--fit` | 1024 | |
### Imatrix

Create an imatrix from a calibration dataset.

```
llama-imatrix -m /models/model-bf16.gguf -f /models/calibration_data_v5_rc.txt -o /models/model.imatrix
```

| Parameter | Description | Default | Notes/Examples |
| - | - | - | - |
| `--layer-similarity or -lsim` | Collect statistics about the change in activations caused by a layer, using cosine similarity | - | [PR 328](https://github.com/ikawrakow/ik_llama.cpp/pull/328) |
| `--hide-imatrix` | Store "top_secret" in the imatrix data file name | - | Also stores "top_secret" in the calibration dataset fields, and zeros in the batch size and number of chunks used to compute the imatrix. [PR 329](https://github.com/ikawrakow/ik_llama.cpp/pull/329) |
| `--output-draft FNAME` | Paired draft output file | derived from `--output` | [PR 1803](https://github.com/ikawrakow/ik_llama.cpp/pull/1803) |

Notes:

- Use `convert_imatrix_gguf_to_dat.py` to convert the "new" GGUF imatrix files to the format supported here. [PR 1405](https://github.com/ikawrakow/ik_llama.cpp/pull/1405)
- imatrix calculation for models with merged ffn_up/gate_exps tensors is supported, see [PR 1418](https://github.com/ikawrakow/ik_llama.cpp/pull/1418) [PR 1419](https://github.com/ikawrakow/ik_llama.cpp/pull/1419)
### Quantization

Quantize models to reduce size and improve speed.

For very large models, it is good practice to split the model and keep only the metadata in the first split (this avoids re-downloading everything when some info, like the jinja template, needs a fix).

```
llama-quantize --imatrix /models/model.imatrix /models/model-bf16.gguf /models/model-IQ4_NL.gguf IQ4_NL
```

```
llama-gguf-split --split --split-max-size 1G --no-tensor-first-split /models/model-IQ4_NL.gguf /models/parts/model-IQ4_NL.gguf
```

| Parameter | Description | Default | Notes/Examples |
| - | - | - | - |
| `--custom-q` | Custom quantization rules with regular expressions | - | Example: `llama-quantize --imatrix some_imatrix --custom-q "regex1=type1,regex2=type2..." some_model some_output_file some_base_quant` [PR 244](https://github.com/ikawrakow/ik_llama.cpp/pull/244) |
| `--dry-run` | Print the tensor types and resulting tensor sizes without running the quantization, so it is very fast | - | Useful for experimenting with `--custom-q` before running the actual quantization. [PR 1309](https://github.com/ikawrakow/ik_llama.cpp/pull/1309) |
| `--partial-requant` | Quantize only the split files that are missing in the destination directory of the split quantized .gguf | - | - |
| `--symmetric-q40` | Use the [-7:7] range for Q4_0 quantization (turns off imatrix) | - | Useful for some models that have been trained to int4 using this specific quantization range (e.g., Kimi-2.6). [PR 1677](https://github.com/ikawrakow/ik_llama.cpp/pull/1677) |
| `--slow-iq2ks` | Use the original, very slow IQ2_KS quantization method | - | Alternative to the compile-time option. [PR 1677](https://github.com/ikawrakow/ik_llama.cpp/pull/1677) |
| `--extra-output-tensor ggml_type` | Requantize and add an output tensor of that type | - | [PR 1810](https://github.com/ikawrakow/ik_llama.cpp/pull/1810) See `--mtp-requantize-output-tensor type` for an on-the-fly alternative. |
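Long `--custom-q` strings are easier to maintain when written one rule per line and joined afterwards. The sketch below does only that and prints a `--dry-run` command instead of running it. The rules, quant types and output file name are illustrative assumptions, not a recommended recipe.

```shell
# Illustrative rules, one "regex=type" per line.
# Comment lines and blank lines are dropped before joining.
rules='
# attention tensors
blk\..*\.attn_.*=q8_0

# routed experts
blk\..*\.ffn_down_exps\.weight=iq5_ks
blk\..*\.ffn_(up|gate)_exps\.weight=iq4_ks
'

# Join the remaining lines with commas, the form --custom-q expects.
custom_q=$(printf '%s\n' "$rules" | grep -v '^#' | grep -v '^$' | paste -sd, -)

# Print the command for review; --dry-run reports types and sizes only.
printf '%s\n' "llama-quantize --dry-run --imatrix /models/model.imatrix --custom-q \"$custom_q\" /models/model-bf16.gguf /models/model-custom.gguf IQ4_NL"
```

Once the reported types and sizes look right, the same command without `--dry-run` performs the actual quantization.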
### Build Arguments

Build with `cmake`.

In general, use as few build flags as possible.
The build process automatically detects the available hardware features and enables them.
Also, `ik_llama.cpp` has safe default options.

```
cmake -B build -DGGML_NATIVE=ON

cmake --build build --config Release -j$(nproc)
```

| Argument | Notes/Examples |
| - | - |
| `-DGGML_ARCH_FLAGS="-march=armv8.2-a+dotprod+fp16"` | Direct access to ARCH options. |
| `-DGGML_CUDA=ON` | Build with CUDA support. By default it builds for the native CUDA architecture. |
| `-DCMAKE_CUDA_ARCHITECTURES=86` | Build for a specific CUDA GPU compute capability, e.g. 8.6 for RTX 30x0. |
| `-DGGML_RPC=ON` | Build the RPC backend. |
| `-DGGML_IQK_FA_ALL_QUANTS=ON` | More KV quantization types. [PR 197](https://github.com/ikawrakow/ik_llama.cpp/pull/197) |
| `-DIQK_SLOW_IQ2KS_QUANTIZE=1` | Disables the default, new and faster IQ2_KS quantization. See `--slow-iq2ks` for a better alternative. [PR 1672](https://github.com/ikawrakow/ik_llama.cpp/pull/1672) |
| `-DLLAMA_SERVER_SQLITE3=ON` | Sqlite3 for mikupad. |
| `-DCMAKE_TOOLCHAIN_FILE=[...]` | Example: on Windows, tells `cmake` where `sqlite3` is. |
| `-DGGML_NATIVE=ON` | Turn it off when cross-compiling. |
| `-DGGML_NCCL=OFF` | Disable the usage of NCCL. |
| `-DGGML_MAX_CONTEXTS=2048` | Only needed if you are planning to use quants generated with the Thireus quantization suite. |
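The value for `-DCMAKE_CUDA_ARCHITECTURES` is the compute capability with the dot removed. The sketch below derives it and prints the resulting build commands without running them. The compute capability is a hypothetical input; on reasonably recent drivers it can usually be read with `nvidia-smi --query-gpu=compute_cap --format=csv,noheader`.

```shell
# Hypothetical compute capability of the target GPU (8.6 = RTX 30x0).
compute_cap="8.6"

# CMake expects the capability without the dot: 8.6 -> 86.
cuda_arch=$(printf '%s' "$compute_cap" | tr -d '.')

# Print the commands for review instead of running them.
printf '%s\n' "cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=$cuda_arch"
printf '%s\n' 'cmake --build build --config Release -j$(nproc)'
```

Setting the architecture explicitly mainly matters when building on a machine other than the one that will run the binary. Otherwise the native default is usually enough.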
### Environment variables

Use them on the command line.

```
CUDA_VISIBLE_DEVICES=0,2 llama-server -m /models/model-bf16.gguf
```

| Name | Notes/Examples |
| - | - |
| CUDA_VISIBLE_DEVICES | Use only the specified GPUs. Example: use the first and third GPU with `CUDA_VISIBLE_DEVICES=0,2` |
| GGML_CUDA_NO_PINNED | Do not use pinned memory. |
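`CUDA_VISIBLE_DEVICES` also reorders devices: the process sees the listed GPUs renumbered from 0 in the order given. The sketch below prints that mapping for a hypothetical two-GPU selection, using the `CUDA0`, `CUDA1` naming from the `-dev` examples.

```shell
# Hypothetical selection: physical GPU 2 first, then physical GPU 0.
visible="2,0"

# Inside the process, devices are renumbered from 0 in the listed order.
i=0
mapping=""
for phys in $(printf '%s' "$visible" | tr ',' ' '); do
  mapping="${mapping:+$mapping }CUDA$i=GPU$phys"
  i=$((i + 1))
done

echo "$mapping"
```

This prints `CUDA0=GPU2 CUDA1=GPU0`. Note that the physical indices used by CUDA may not match those shown by `nvidia-smi` unless `CUDA_DEVICE_ORDER=PCI_BUS_ID` is also set.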
## Unique parameters

WIP

| | `ik_llama.cpp` exclusive | Not available on `ik_llama.cpp` |
| - | - | - |
| Parameter | `-rtr` | |
## Graph parallel models

Model architectures [supported](https://github.com/ikawrakow/ik_llama.cpp/blob/022bd00aab9ec8428c4811275de89796c677d278/src/llama.cpp#L3056) by `--split-mode graph`:

```
LLM_ARCH_LLAMA,
LLM_ARCH_QWEN3MOE,
LLM_ARCH_GLM4_MOE,
LLM_ARCH_MISTRAL3,
LLM_ARCH_COMMAND_R,
LLM_ARCH_COHERE2,
LLM_ARCH_MIMO2,
LLM_ARCH_QWEN3,
LLM_ARCH_QWEN3VL,
LLM_ARCH_HUNYUAN_MOE,
LLM_ARCH_OPENAI_MOE,
LLM_ARCH_ERNIE4_5_MOE,
LLM_ARCH_MINIMAX_M2,
LLM_ARCH_SEED_OSS,
LLM_ARCH_STEP35,
LLM_ARCH_QWEN35,
LLM_ARCH_QWEN35MOE,
LLM_ARCH_GEMMA4,
LLM_ARCH_DEEPSEEK2,
LLM_ARCH_GLM_DSA,
LLM_ARCH_MISTRAL4,
LLM_ARCH_MELLUM,
LLM_ARCH_LAGUNA,
```
|