* grammar: Fix grammar root symbol check (#19761)
* grammar: fix bad check for root symbol, correct error logging
* add tests to demonstrate root symbol check failure
* common/grammar: fix grammar parsing issues to prevent stack overflow and hangs (#18604)
* grammar: add test case for nullable symbol loop
Reproduce stack overflow (or OOM) with ( [x]* )* found while adding
GBNF support to ripgrep-edit.
llama-server reproducer:
curl \
-X POST \
-d '{
"messages": [{ "role": "user", "content": "write yes" }],
"grammar": "root ::= ( [x]* )*"
}' \
-H "Content-Type: application/json" \
http://localhost:8811/v1/chat/completions
* grammar: prevent stack overflow with nullable symbol loop
Fix a potential stack overflow in llama_grammar_advance_stack that
could occur when processing grammars with nullable symbols that lead
to infinite derivations of empty strings. The fix introduces cycle
detection by tracking visited stacks to prevent infinite recursion.
rg-edit regexp: llama_grammar_advance_stack
rg-edit extra-args: -A20
rg-edit directive: """Rewrite: fix the following segfault:
[..]
⚫ Testing segfault. Grammar:
root ::= ( [x]* )*
root ::= ( [x]* )*
Segmentation fault build/bin/test-grammar-integration"""
gptel-context:
(("~/llama.cpp/src/llama-grammar.cpp")
("~/llama.cpp/tests/test-grammar-integration.cpp")
("~/llama.cpp/grammars/./list.gbnf")
("~/llama.cpp/grammars/./json_arr.gbnf")
("~/llama.cpp/grammars/./json.gbnf")
("~/llama.cpp/grammars/./japanese.gbnf")
("~/llama.cpp/grammars/./english.gbnf")
("~/llama.cpp/grammars/./chess.gbnf")
("~/llama.cpp/grammars/./c.gbnf")
("~/llama.cpp/grammars/./arithmetic.gbnf")
("~/llama.cpp/grammars/./README.md"))
* grammar: convert recursive llama_grammar_advance_stack to iterative
This change converts the function to an iterative approach using
explicit stacks, which prevents deep recursion and eliminates the risk
of stack overflow.
rg-edit regexp: llama_grammar_advance_stack
rg-edit extra-args: -A30
rg-edit directive: """Rewrite: fix the following segfault:
[..]
⚫ Testing segfault. Grammar:
root ::= ( [x]* )*
root ::= ( [x]* )*
Segmentation fault build/bin/test-grammar-integration
convert from recursive to interactive"""
gptel-context:
(("~/llama.cpp/src/llama-grammar.cpp")
("~/llama.cpp/tests/test-grammar-integration.cpp")
("~/llama.cpp/grammars/./list.gbnf")
("~/llama.cpp/grammars/./json_arr.gbnf")
("~/llama.cpp/grammars/./json.gbnf")
("~/llama.cpp/grammars/./japanese.gbnf")
("~/llama.cpp/grammars/./english.gbnf")
("~/llama.cpp/grammars/./chess.gbnf")
("~/llama.cpp/grammars/./c.gbnf")
("~/llama.cpp/grammars/./arithmetic.gbnf")
("~/llama.cpp/grammars/./README.md"))
v2: Added a `std::set` to perform tree-based lookups, O(log N) each and
O(N log N) over N stacks. Testing with a parallel run of `test-grammar-integration`
shows a double-digit percentage increase in runtime. An
`unordered_set` with O(1) hashing was also evaluated, but the overhead
of constructing hash keys from pointers made it significantly slower
than the rbtree implementation that only requires an ordering
operator. The performance regression in the test suite appears
justified by the overall reduction in algorithmic complexity.
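The iterative expansion with a visited set can be sketched roughly as
follows. This is a minimal illustration, not the llama.cpp code: the
`Element`, `Rule`, `Stack` and `StackLess` types, `advance_all`, and the
rule layout in `nullable_loop_grammar` are all hypothetical stand-ins.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <set>
#include <utility>
#include <vector>

// Hypothetical, simplified grammar model (not the llama.cpp types).
enum class ElemType { End, Alt, RuleRef, Char };
struct Element { ElemType type; int value; };   // value: rule index or code point
using Rule  = std::vector<Element>;             // alternates split by Alt, closed by End
using Rules = std::vector<Rule>;
using Stack = std::vector<const Element *>;     // pending positions, top at back()

// Ordering only; no hash key has to be built from the pointers.
struct StackLess {
    bool operator()(const Stack & a, const Stack & b) const {
        return std::lexicographical_compare(a.begin(), a.end(), b.begin(), b.end(),
                                            std::less<const Element *>());
    }
};

static bool closes_alternate(const Element * e) {
    return e->type == ElemType::End || e->type == ElemType::Alt;
}

// Expand `start` until every resulting stack is empty or has a Char on top.
std::vector<Stack> advance_all(const Rules & rules, const Stack & start) {
    std::vector<Stack> todo{start};      // explicit work list instead of recursion
    std::vector<Stack> done;
    std::set<Stack, StackLess> seen;     // stacks that were already expanded
    while (!todo.empty()) {
        Stack cur = std::move(todo.back());
        todo.pop_back();
        if (!seen.insert(cur).second) {
            continue;                    // a nullable loop led back here
        }
        if (cur.empty() || cur.back()->type == ElemType::Char) {
            done.push_back(std::move(cur));
            continue;
        }
        const Element * pos = cur.back();
        assert(pos->type == ElemType::RuleRef);
        const Element * sub = rules[pos->value].data();
        for (;;) {
            Stack next(cur.begin(), cur.end() - 1);
            if (!closes_alternate(pos + 1)) next.push_back(pos + 1);  // resume after the reference
            if (!closes_alternate(sub))     next.push_back(sub);      // enter this alternate
            todo.push_back(std::move(next));
            while (!closes_alternate(sub)) ++sub;                     // skip to the alternate's end
            if (sub->type != ElemType::Alt) break;
            ++sub;                                                    // move to the next alternate
        }
    }
    return done;
}

// root ::= ( [x]* )* written out as helper rules (layout is illustrative).
Rules nullable_loop_grammar() {
    using T = ElemType;
    return {
        {{T::RuleRef, 2}, {T::End, 0}},                                // 0: root ::= r2
        {{T::RuleRef, 3}, {T::End, 0}},                                // 1: r1 ::= r3
        {{T::RuleRef, 1}, {T::RuleRef, 2}, {T::Alt, 0}, {T::End, 0}},  // 2: r2 ::= r1 r2 |
        {{T::Char, 'x'}, {T::RuleRef, 3}, {T::Alt, 0}, {T::End, 0}},   // 3: r3 ::= [x] r3 |
    };
}
```

In this sketch, without the `seen` check the stack holding `r2[0]` is
reached again through the empty alternate of `r3`, so the work list
never drains. With it, expansion stops once a stack repeats.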
Co-developed-by: Piotr Wilkin (ilintar) <piotr.wilkin@syndatis.com>
* grammar: add test case for hang in repetition grammar processing
This commit adds a new test case to the grammar integration tests that
specifically targets a hang scenario in the repetition grammar parser
found while adding GBNF support to ripgrep-edit.
llama-server reproducer:
curl \
-X POST \
-d '{
"messages": [{ "role": "user", "content": "write yes" }],
"grammar": "root ::= (([^x]*){0,99}){0,99}"
}' \
-H "Content-Type: application/json" \
http://localhost:8811/v1/chat/completions
* grammar: add repetition threshold check
The change introduces a maximum repetition threshold to avoid
excessive rule expansion during grammar parsing. When parsing
repetition patterns like {m,n}, the parser now calculates the
potential number of rules that would be generated and throws an error
if the product of previous rules and new rules exceeds the threshold.
A test case was added to verify the threshold is properly enforced for
deeply nested repetition patterns that would otherwise cause hangs.
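A minimal sketch of such a product check, assuming a hypothetical limit
and helper name (`kMaxRepetitionRules`, `checked_expansion`); the actual
constant and where the parser performs the check are not stated here.

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical limit; the commit message does not state the real value.
constexpr uint64_t kMaxRepetitionRules = 2000;

// Returns the number of rules a {m,n} repetition would expand to, or throws
// if prev_rules * repetitions would exceed the limit.
uint64_t checked_expansion(uint64_t prev_rules, uint64_t repetitions,
                           uint64_t limit = kMaxRepetitionRules) {
    // Compare by division so the product itself can never overflow.
    if (repetitions != 0 && prev_rules > limit / repetitions) {
        throw std::runtime_error("repetition expands to more than " +
                                 std::to_string(limit) + " rules");
    }
    return prev_rules * repetitions;
}
```

With these illustrative numbers, a 10-rule group repeated 99 times is
accepted (990 rules), while repeating that result another 99 times is
rejected before any rules are generated.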
---------
Co-authored-by: Asbjørn Olling <asbjornolling@gmail.com>
Co-authored-by: Andrea Arcangeli <aarcange@redhat.com>
* MLA tensor parallelism under -sm graph (DEEPSEEK2/GLM_DSA/MISTRAL4)
Extends -sm graph (split-mode graph) to MLA-style attention across the
DEEPSEEK2, GLM_DSA, and MISTRAL4 architectures. Previously these archs
fell back to -sm layer regardless of the user's flag.
Implementation:
- Per-rank attention build in build_deepseek2_tp_attention with
view-sliced FlashAttention, split-buffer output projection, and
ggml_reduce across devices
- wk_b / wv_b absorbed weights replicated per device via materialize()
in llm_prepare_mla (these can't live in a split buffer)
- KV cache replication path (replicated_k_l) for graph-mode TP
- distribute_mla_tensors_for_split_mode_graph routes attention/norm
tensors into ctx_split; expert tensors stay per-layer
- Implements ggml_backend_cuda_split_buffer_get_tensor for the
replicated / row-split / col-split inverse paths
- Early-reject guard in src/llama.cpp that auto-downgrades -sm graph
to -sm layer (with a warning) when incompatible loader flags are set:
-ncmoe, -cmoe, -ot, -rtr, -muge
New CLI flag:
- -gap | --graph-attn-precision <f16|f32> (default f16)
See the PR description for the full validation matrix (3 archs x 2/4/8
GPU counts), perf numbers, VRAM accounting, and known limitations.
* Some tweaks
* materialize lambda: per-head split for graph-mode tp_replicate
7dd19e19 changed wk_b/wv_b distribution from mirror to per-head split
(split_dim=2) via prepare_split_tensors. That path only fires when
wk_b/wv_b are loaded from GGUF.
Models that store only wkv_b in GGUF derive wk_b/wv_b at load via
llm_prepare_mla, going through the materialize lambda, which was
untouched and still produced mirror replicas (split_dim=-1, full n_head
per device).
build_deepseek2_tp_attention now does mul_mat(wk_b_local, q_nope_perm)
without the prior view_3d slice, so a mirror replica passes an n_head
tensor where the kernel expects n_head_local. Result: silent SIGSEGV
right after model load.
Mirror logic in materialize is replaced with the same per-head split as
prepare_split_tensors: head_offsets derived from wo split, each rank
gets a tensor with ne[2]=n_head_local, data copied from the appropriate
source byte slice. Singular `computed` tensor keeps full metadata for
tensors_by_name lookups.
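The per-head split can be pictured with the sketch below. It assumes a
contiguous float tensor whose outermost dimension (ne[2]) is the head
index and takes the per-rank head counts as input; `HeadSlice` and
`split_per_head` are hypothetical names, and the real code works on ggml
tensors and derives the counts from the wo split.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct HeadSlice {
    std::size_t head_offset;   // first head owned by this rank
    std::size_t n_head_local;  // becomes ne[2] of the rank's tensor
    std::vector<float> data;   // ne0 * ne1 * n_head_local values
};

// Cut a [ne0, ne1, n_head] tensor into one slice per rank along dim 2.
std::vector<HeadSlice> split_per_head(const std::vector<float> & src,
                                      std::size_t ne0, std::size_t ne1,
                                      const std::vector<std::size_t> & heads_per_rank) {
    const std::size_t head_elems = ne0 * ne1;  // one head is one contiguous plane
    std::vector<HeadSlice> out;
    std::size_t offset = 0;                    // running head offset
    for (std::size_t n_local : heads_per_rank) {
        const std::size_t first = offset * head_elems;
        const std::size_t last  = first + n_local * head_elems;
        assert(last <= src.size());
        out.push_back({offset, n_local,
                       std::vector<float>(src.begin() + first, src.begin() + last)});
        offset += n_local;
    }
    return out;
}
```

Because the head index is the outermost dimension in this layout, each
rank's share is a single contiguous range, which is why a plain slice
copy is enough.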
Tested: 8x3090, -sm graph -mla 3 -fa on now boots cleanly and
sweep-benches without crash. Log confirms new path: "Computed
blk.X.attn_k_b.weight ... split across N devices on dim=2".
* cleanup: indent fix + remove dead view_3d slicing and debug printf
- build_deepseek2.cpp: re-indent the self_attention block in
build_deepseek2_layer_attention (lines 253-670). Block was at column 0
inside a function body; now at the expected 4/8-space indent.
- build_deepseek2.cpp: drop the commented-out view_3d slicing and debug
printfs left over after 7dd19e19's switch to direct mul_mat on
per-rank wk_b_local / wv_b_local. Update the stale 'wk_b is
replicated (split_dim=-1)' comment to match the new split_dim=2
reality.
- ggml-cuda.cu: remove the leftover debug printf in
ggml_backend_cuda_split_buffer_get_tensor.
No behavior change. Verified with a clean rebuild and DSV2.5 +
GLM-4.7-Flash sweep-bench runs.
* llm_load_tensors: gate incompatible-flag warning to MLA archs
The -ncmoe / -rtr / -muge / -ot warning under -sm graph currently fires
for all archs that support graph mode. That's an over-reach: the
incompatibility is specific to the MLA TP paths (DEEPSEEK2, GLM_DSA,
MISTRAL4) — Gemma4 graph mode existed pre-PR and works with those flags.
Gate the warning to MLA archs only.
Also refreshes two stale comments left over from the wk_b/wv_b
mirror -> per-head-split rewrite:
- src/llama.cpp llm_prepare_mla: "Replicate wk_b/wv_b ..." now reads
"Per-head split wk_b/wv_b ..." to match what the materialize lambda
actually does post-823a39e2.
- src/llama-load-tensors.cpp distribute_mla_tensors_for_split_mode_graph:
drop the wkv_b row-split mention (wkv_b is no longer created under
graph mode after 7dd19e19) and correct the wk_b/wv_b distribution
description (per-head split, not per-device replicated).
---------
Co-authored-by: Kawrakow <iwankawrakow@gmail.com>
common/chat, server: refactor, move all conversion functions to common, add tests (#20690)
jinja : remove unused header (#22310)
common : fix jinja warnings with clang 21 (#22313)
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
chat: fix handling of space in reasoning markers (#22353)
* chat: fix handling of space in reasoning markers
common : re-arm reasoning budget after DONE on new <think> (#22323)
common : determine generation prompt using longest common prefix (#22657)
common/autoparser: fixes for newline handling / forced tool calls (#22654)
* chat/autoparser: the fixes
* Move optspace() to chat-peg-parser, comment out server tests invalidated due to content now allowed with forced tool calls.
* Trim whitespace on apply instead
common/chat : preserve media markers for typed-content templates (#22634)
common : revert reasoning budget +inf logit bias (#22740)
common : do not wrap raw strings in schema parser for tagged parsers (#22827)
common : enable streaming JSON argument values (#23173)
* common : remove atomic from json arguments
* common : remove parsing logic on JSON arguments
common : do not pass prompt tokens to reasoning budget sampler (#22488)
reasoning-budget: clone should do a deep-copy (#23095)
Co-authored-by: Piotr Wilkin (ilintar) <piotr.wilkin@syndatis.com>
If a slot is reused for a standard completion (`/v1/completions`) after
being used for a chat completion (`/v1/chat/completions`), the previous
chat's PEG parser would remain active in the slot's parameters. This
caused standard text completions to throw on the raw text.
Add `-tm` / `--threads-mtmd` to control CPU thread count used during
multimodal image/audio processing (mmproj encoding), separate from the
main LLM thread count.
This allows running the LLM on GPU with minimal CPU threads (e.g. `-t 1`)
to reduce sync overhead, while using many threads (e.g. `-tm 16`) for
CPU-bound mmproj encoding with `--no-mmproj-offload`.
Fallback chain when `-tm` is not specified:
1. `--threads-batch` (-tb) — multimodal encoding is a batch/prefill-like
operation, so it makes sense to track with batch thread count
2. `--threads` (-t) — final default
Works with both mtmd-cli and llama-server.
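The fallback chain amounts to something like the sketch below. The
struct and field names are hypothetical, and treating a value <= 0 as
"not specified" is an assumption of this example.

```cpp
// Hypothetical parameter struct; a value <= 0 stands for "not specified".
struct ThreadParams {
    int n_threads;        // -t
    int n_threads_batch;  // -tb
    int n_threads_mtmd;   // -tm
};

// Pick the thread count for multimodal encoding: -tm, then -tb, then -t.
int resolve_mtmd_threads(const ThreadParams & p) {
    if (p.n_threads_mtmd  > 0) return p.n_threads_mtmd;
    if (p.n_threads_batch > 0) return p.n_threads_batch;
    return p.n_threads;
}
```

For example, `-t 1 -tm 16` resolves to 16 threads for mmproj encoding,
while `-t 1` alone resolves to 1.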
AI: ubergarm/Qwen3.6-27B-GGUF MTP IQ4_KS 15.113 GiB (4.752 BPW) + pi.dev
The get_batch_ubatch() function unconditionally inflated n_batch and
n_ubatch whenever --mmproj was specified, regardless of whether the
mmproj model actually ran on the GPU. This boosted batch size applies
to both the main context and the MTP draft context, since
params_base.speculative.cparams_dft is derived from
common_context_params_to_llama(params_base).
When mmproj runs on CPU (--no-mmproj-offload), this batch inflation
is unnecessary for mmproj itself (CPU compute is sized by image
dimensions independently), but it still inflates the MTP compute buffer
proportionally. For large images (e.g. --image-max-tokens 4096), the
MTP compute buffer ballooned to ~2020 MiB and triggered an OOM even
though the mmproj model was fully on CPU and should have saved VRAM.
Restrict the batch inflation to !params.mmproj.path.empty() &&
params.mmproj_use_gpu so it only triggers when mmproj actually occupies
GPU memory. When mmproj runs on CPU, the existing per-chunk decode
splitting in mtmd_helper_decode_image_chunk_impl handles large images
correctly with the default batch size.
AI: ubergarm/Qwen3.6-27B-GGUF MTP IQ4_KS 15.113 GiB (4.752 BPW) + pi.dev
* Avoid copying the per-step SSM state (CUDA)
* Avoid copying the per-step SSM state (CPU)
* Allocate only what is necessary for per-step SSM state
* Cleanup
* Use the AVX-VNNI version of the intrinsic when AVX512VNNI is not available.
* remove changes under HAVE_FANCY_SIMD
---------
Co-authored-by: XZiar <xziar@xziar.xziar>
This patch addresses two latent bugs in examples/speculative/speculative.cpp
that prevent llama-speculative.exe from running on greedy sampling
(temp=0) or producing rejection-sampling output (temp>0):
1. Line 191: `params.sparams.grammar = { COMMON_GRAMMAR_TYPE_NONE, "" };`
invokes `common_grammar(type, grammar)` which asserts
`type != NONE || !grammar.empty()`. Both conditions fail with the
intended-to-be-empty grammar, so every speculative run hits a hard
`GGML_ASSERT` in common/sampling.h:63 immediately after model load.
Fix: default-construct via `common_grammar{}` to bypass the
field-init constructor.
2. Lines 293-294: `GGML_ASSERT(dist_tgt.sorted)` and
`GGML_ASSERT(dist_dft.sorted)` fire whenever the draft sampler does
not set the .sorted flag (which is most modern sampler paths).
Comment them out — the next ~10 lines re-sort both distributions
by id explicitly, so the assertion is incorrect anyway.
Fix: replace the asserts with an explanatory comment.
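To illustrate why the assertion in item 2 is unnecessary, here is a
minimal sketch of an explicit re-sort by id. `TokenProb`, `Distribution`
and `sort_by_id` are hypothetical stand-ins for the real types, and the
reading of `sorted` as "ordered by probability" is an assumption.

```cpp
#include <algorithm>
#include <vector>

struct TokenProb { int id; float p; };

struct Distribution {
    std::vector<TokenProb> data;
    bool sorted;  // assumed meaning: data is ordered by probability
};

// Order candidates by token id. The result does not depend on the input
// order, so the `sorted` flag need not hold beforehand.
void sort_by_id(Distribution & dist) {
    std::sort(dist.data.begin(), dist.data.end(),
              [](const TokenProb & a, const TokenProb & b) { return a.id < b.id; });
    dist.sorted = false;  // sketch choice: data is no longer in probability order
}
```

Once both distributions are ordered by id, entries at the same index
refer to the same token whether or not the sampler had sorted them.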
After both fixes, `llama-speculative.exe` runs to completion. The
acceptance-rate measurement at temp=0 still looks suspicious (0%
across same-family draft/target pairs), but that is a different
issue out of scope for this PR.
Tested on Qwen3-0.6B-IQ4_XS drafting Qwen3-1.7B-IQ4_XS, both base
models from `bartowski/Qwen_Qwen3-*-GGUF` on Windows + ik_llama.cpp
build at HEAD of windows-mingw-default-win10 (which is itself a
follow-up to PR #1755).
The default GGML_WIN_VER of 0x602 (Windows 8) causes a build failure on any toolchain
where _WIN32_WINNT propagates into vendored cpp-httplib (notably MinGW with
the bundled w64devkit GCC). cpp-httplib's httplib.h has, for some time
now, contained:
#ifdef _WIN32
#if defined(_WIN32_WINNT) && _WIN32_WINNT < 0x0A00
#error "cpp-httplib doesn't support Windows 8 or lower. Please use Windows 10 or later."
#endif
#endif
so the entire llama-server target fails to compile on Windows + MinGW
unless the user passes -DGGML_WIN_VER=0x0A00 manually.
Bumping the default to 0x0A00 (Windows 10) keeps Windows 8 reachable for
anyone who explicitly requests it (-DGGML_WIN_VER=0x602) while letting the
default Windows + MinGW build succeed end-to-end. Windows 8.1 reached end of
support in January 2023 (Windows 8 itself in January 2016), and Windows 10 is a strict superset of the
Win8 surface used elsewhere (PrefetchVirtualMemory etc.), so this is
strictly additive on the API side.
Verified by building with w64devkit 2.8.0 (gcc 16.1.0) on Windows 11
without any -DGGML_WIN_VER override: all 266 ninja targets link cleanly,
including bin/llama-server.exe, and llama-cli runs Qwen3-4B-Thinking-2507
IQ4_XS at ~6.2 tok/s with q8_0 KV at 4096 context.
* Add Turing and Ampere (A100) GGML to docker build file
Currently the Dockerfile for image builds does not build for CUDA architectures below 8.6, although ik_llama.cpp specifies support for Turing and above. This PR extends the CUDA architecture list to include Turing (7.5) and A100 (8.0).
* Remove 80: few people have A100s, and building for many CUDA architectures seems to cause build issues
* switch to 86-real and 89-real with 75, 80, 90 using virtual ptx jit
* nvm, even adding 90-virtual causes linker error
---------
Co-authored-by: Codex <codex@local>