4541 Commits

Author SHA1 Message Date
Samuel Oliveira Alves
9f7ba245ab
Update autofix and presets (#1867)
* Add configuration files for format, presets and examples

* add clang in pre-commit config

* remove clang configurations

* Refactor .gitignore for consistency in formatting
2026-05-24 07:30:44 +03:00
Kawrakow
0c45696db4
Minor logging cleanup (#1873) 2026-05-24 07:29:32 +03:00
Kawrakow
809a63bbb7
Fix MLA models with ngl < n_layer (#1870)
* Fix split mode graph with ngl < n_layer (MLA models)

* It is actually not related to split mode graph
2026-05-24 07:29:17 +03:00
dungquixote42
642c038ccd
Extend expiring logit bias to other sampling parameters (#1770)
* initial commit

* fix underflow bug, add debug prints, update macro/variable names

* fix phrases-sharing-1-flag bug, replace macros with struct member function

* cleanup

* fix file parsing

* string_split_open_close() -> string_extract(), improve escape handling

* support multiple nested entries

* make persistent entries global, simplify file parsing

* cosmetic changes

* add support for jumping to exitword

* update variable names

* fix bad search bug

* better debug prints, reorg

* replace lambda with string_is_found(), add string_unescape() for debug

* add support for inline comments

* add missing debug print macro

* fix type promotion bug

* actually fix type promotion bug
2026-05-23 19:19:12 +03:00
Justin Martin
40d8cb196a
llama-quantize: enable --extra-output-tensor with COPY (#1871) 2026-05-23 13:52:34 +03:00
Kawrakow
a6bb509305
Fix split mode graph with ngl < n_layer (#1869) 2026-05-23 12:58:09 +03:00
Kawrakow
3f45ba9387
MTP tweaks 3 (#1862) 2026-05-23 07:23:20 +03:00
Samuel Oliveira Alves
19e09e81d4
Change MTP graph input preparation with additional parameters and validation checks (#1866) 2026-05-23 07:22:04 +03:00
Kawrakow
b3d39cff8b
Fix split mode graph for Qwen35-MoE + MTP (#1861) 2026-05-22 09:23:53 +03:00
thad0ctor
b26521b9ef
Fix raw-vs-local device id confusion under -dev/-devd subsets (#1826)
llm_load_tensors stores `default_layer_device[i]` as a local index into
`model.devices` (consistent with `device_mem[]`, `model.splits[]`, and
all graph-building consumers), but the four
`llama_default_buffer_type_offload(model, default_layer_device[i])`
callsites passed it through as if it were a raw post-CVD device id.
Under `-dev`/`-devd` subsets where `model.devices != {0..N-1}`, this
selected the wrong buffer type. Wrap with `model.devices[...]` to match
the existing `model.devices[main_gpu]` pattern on the adjacent lines.

llama_init_from_model has the same bug for `main_gpu`: every consumer
(auto-fit override at line 3428, MTP clamp, the `model.devices[main_gpu]`
translations at lines 3678/3682, and graph-building `splits[main_gpu]`)
treats it as a local index, but the five single-GPU backend init paths
(CUDA, Vulkan, SYCL, Kompute, CANN) pass `model->main_gpu` straight to
the backend init, which expects a raw device id. e.g. `-dev CUDA1` with
default `--main-gpu 0` and `split_mode=NONE` called
`ggml_backend_cuda_init(0)` instead of `cuda_init(1)`. Compute
`main_gpu_id` once and use it for all five paths.
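
The local-vs-raw distinction can be illustrated with a minimal sketch. `model_sketch` and `raw_device_id` are hypothetical names used only for illustration, not the actual llama.cpp types; the assumption is that `devices` lists raw backend ids in local order.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the loaded model: `devices` holds the raw
// backend device ids selected with -dev, in local order.
struct model_sketch {
    std::vector<int> devices;  // e.g. {1} for `-dev CUDA1`
    int main_gpu = 0;          // local index into `devices`
};

// Translate a local index into the raw device id a backend init expects.
inline int raw_device_id(const model_sketch & model, int local_index) {
    assert(local_index >= 0 && (size_t) local_index < model.devices.size());
    return model.devices[(size_t) local_index];
}
```

With `devices = {1}` and `main_gpu = 0`, the translation yields raw id 1, which is the value the `-dev CUDA1` example above expects to reach the backend init.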
2026-05-22 08:32:52 +03:00
Samuel Oliveira Alves
d51036a0c4
fix: reset KV cache and prompt state in server_slot and server_context (#1860) 2026-05-22 08:14:47 +03:00
Kawrakow
48a55f74e4
Disable split mode graph for Qwen35-MoE when MTP is enabled (#1858) 2026-05-21 16:29:35 +03:00
Kawrakow
4b73de246b
Fix crash with split mode graph and partial offload (#1857) 2026-05-21 13:36:01 +03:00
Kawrakow
c5dc847d0a
Fix Gemma4-E4B compute graph (#1855) 2026-05-21 12:46:28 +03:00
Kawrakow
3dd282358b Fix compiler warnings 2026-05-21 05:40:08 +00:00
Samuel Oliveira Alves
7b73f45541
Add adaptive sampling clone and free functions to manage memory (#1851) 2026-05-21 08:11:17 +03:00
David Young
aefb8bdd99
MLA TP -khad: ggml_dequant_hadamard fused op + wv_b/wk_b_pp Hadamard fold (#1852)
* ggml: ggml_dequant_hadamard fused op for MLA -khad path

Adds a new ggml op that fuses (ggml_cast -> F32) + (ggml_hadamard) into a
single kernel. Reads a quantized (or F16/F32) source and produces a per-
Hadamard-block F32 chunk with the inverse transform applied, without
materializing a full-size F32 intermediate buffer.

Motivation: the MLA pp_opt path in build_deepseek2.cpp un-encodes the
H-applied cache_nope view at every PP call. Today that runs as a cast
(quant -> F32) followed by a separate ggml_hadamard kernel, costing two
full-size F32 passes per layer per rank per call. Fusing them halves
the bandwidth on the un-encode and removes one kernel launch.

CUDA kernels in dequant_hadamard.cu lift the Walsh-Hadamard butterfly
from hadamard.cu and dequant helpers from dequantize.cuh:

  * qr=1 layout (q8_0): consecutive dequant pair, stage 1 fused with load
  * qr=2 layout (q4_0 / q4_1 / q5_0 / q5_1 / q6_0 / iq4_nl): dequant pair
    at stride qk/2, explicit stage 1 after sync
  * F16 has a dedicated kernel
  * F32 source falls back to the standalone Hadamard op

CPU impl in iqk_cpu_ops.cpp composes the existing type_traits.to_float
dequant with fast_ht for graph completeness. nh in {64, 128, 256, 512}.

* MLA-TP: Hadamard pretransform of wv_b/wk_b_pp for -khad

Fold the 64-block orthonormal Hadamard into wv_b and wk_b_pp once at
context init so the pp_opt mul_mats consume the K cache in its on-disk
encoded basis. The per-PP-call cache_nope un-Hadamard is then skipped
(rope half still un-applied — it goes to FA via concat, no wk_b multiply).

Math is identity by H^T H = I: mul_mat(H@wv_b, H@cache) = wv_b^T @ cache.
For mla=2/3 absorb, composes correctly with the existing post-FA
ggml_hadamard(kqv_compressed, 64).
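
The identity the fold relies on can be checked numerically with a small sketch. This is not the ggml kernel: `hadamard_blocks` and `dot` are hypothetical helpers, and the transform is assumed to be the orthonormal (1/sqrt(n)-scaled) Walsh-Hadamard transform applied per block.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// In-place orthonormal Walsh-Hadamard transform over consecutive blocks of
// `block` elements (block must be a power of two that divides x.size()).
inline void hadamard_blocks(std::vector<float> & x, size_t block) {
    assert(block > 0 && (block & (block - 1)) == 0 && x.size() % block == 0);
    const float scale = 1.0f / std::sqrt((float) block);
    for (size_t base = 0; base < x.size(); base += block) {
        // Butterfly stages: combine pairs at distance h = 1, 2, 4, ...
        for (size_t h = 1; h < block; h *= 2) {
            for (size_t i = 0; i < block; i += 2 * h) {
                for (size_t j = i; j < i + h; ++j) {
                    const float a = x[base + j];
                    const float b = x[base + j + h];
                    x[base + j]     = a + b;
                    x[base + j + h] = a - b;
                }
            }
        }
        // Scaling makes the transform orthonormal, so H^T H = I.
        for (size_t j = 0; j < block; ++j) {
            x[base + j] *= scale;
        }
    }
}

// Plain dot product, standing in for one output element of a mul_mat.
inline float dot(const std::vector<float> & a, const std::vector<float> & b) {
    assert(a.size() == b.size());
    float s = 0.0f;
    for (size_t i = 0; i < a.size(); ++i) {
        s += a[i] * b[i];
    }
    return s;
}
```

Transforming both a weight row and a cache row with `hadamard_blocks(..., 64)` leaves their dot product unchanged up to rounding, which is why the weights can be pre-transformed once instead of un-transforming the cache on every call.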

All-or-nothing across layers under a castable type-allowlist (excludes
1-3 bpw IQ types whose requant blows up beyond PPL noise). Models with
ineligible weights fall back to the runtime un-Hadamard path unchanged.

Composes with the fused ggml_dequant_hadamard op (prior commit): with the
fold active only the rope half still runs the runtime transform, via the
fused kernel.

* MLA-TP: fix TG with -khad after wv_b/wk_b_pp fold

The absorb branch of build_deepseek2_tp_attention applies
ggml_hadamard to kqv_compressed after FA, then multiplies by
wv_b. Pre-fold this was needed because wv_b was un-encoded; with
the wv_b fold (prior commit) the mul_mat already expects
H-encoded kqv_compressed:

  mul_mat(H @ wv_b, kqv_encoded) = wv_b^T @ H @ H @ kqv_unencoded
                                 = wv_b^T @ kqv_unencoded   (H @ H = I)

Skip the post-FA hadamard when model.khad_pretransformed is set
so the two H applications cancel instead of double-applying.

Affects the absorb branch: TG (n_tokens=1), short-context PP
(n_kv < 1024), and models without wk_b_pp. Long-context PP goes
through the pp_opt branch and is unrelated/unchanged.

Reported by @ikawrakow on PR 1852. Verified across mla={1,2,3} x
khad={on,off} x -ctk={q8_0,q4_0} on GLM-4.7-Flash IQ5_K and the
unsloth IQ4_XS variant ik used to reproduce.

* ggml_hadamard: accept F16 and quant sources; drop GGML_OP_DEQUANT_HADAMARD

Per @ikawrakow review on PR 1852: subsume the per-source-type dispatch
into the existing GGML_OP_HADAMARD instead of carrying a separate enum
entry, op constructor, and standalone files.

ggml_hadamard's API is unchanged from the call-site perspective. The
constructor's F32-only assertion is dropped; ggml_cuda_op_hadamard and
iqk_hadamard now dispatch internally:

  - F32 source: existing F32 butterfly (unchanged)
  - F16 source: dedicated kernel
  - q8_0 / q4_0 / q4_1 / q5_0 / q5_1 / q6_0 / iq4_nl: fused dequant +
    butterfly kernel (lifted from the deleted dequant_hadamard.cu)
  - CPU side composes traits.to_float with fast_ht

Net diff: -80 lines. Removes dequant_hadamard.{cu,cuh}, the enum entry,
op table rows, ggml_dequant_hadamard constructor, dispatch cases, and
the DEQUANT_HADAMARD supports_op block.

Verified clean build + TG smoke (mla=3 +khad q8 on GLM-4.7-Flash-IQ4_XS,
same coherent output as prior commit on feat/dequant-hadamard).
2026-05-21 07:29:15 +03:00
Samuel Oliveira Alves
11a1fea9e2
Move embedding management to speculative (#1825)
* refactor speculative decoding with companion context and draft result structures

* feat: add common speculative feature handling in server context

* refactor: move embeddings outside server

* feat: harden draft input hidden state in llama context

* remove unused functions

* refactor: streamline speculative feature handling and remove unused code

* remove redundant code

* remove more unused variables

* refactor: implement speculative feature handling
2026-05-20 17:42:48 +03:00
David Young
dd67a9fb24
MLA TP prompt processing optimisation (#1841)
* MLA TP prompt processing optimisation

Adds a per-rank prompt-processing path to build_deepseek2_tp_attention
that materialises K/V from the compressed latent cache and runs a
standard flash_attn instead of the FlashMLA-3 absorb kernel the TP
attention currently uses for all batch sizes. Affects MLA archs under
-sm graph (DEEPSEEK2, GLM_DSA, MISTRAL4).

Gated on n_tokens >= 128 (set by caller) AND n_kv >= 1024. Below
either threshold the absorb path runs unchanged. Token generation
takes the absorb path; only prompt processing at non-trivial context
materialises.
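
The gate described above reduces to a two-condition predicate. The sketch below uses the thresholds quoted in this message; the function name is hypothetical and the real check lives inside the graph builder.

```cpp
#include <cassert>

// Thresholds as quoted in the commit message; the function name is hypothetical.
inline bool takes_materialise_path(int n_tokens, int n_kv) {
    assert(n_tokens >= 0 && n_kv >= 0);
    return n_tokens >= 128 && n_kv >= 1024;
}
```

Token generation (`n_tokens = 1`) fails the first condition at any context length, so it stays on the absorb path.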

A second piece pre-computes wk_b in a pp_opt-favouring orientation
(wk_b_pp: [kv_lora_rank, qk_nope, n_head]) at llm_prepare_mla time,
so the per-PP-call materialise can mul_mat against the latent cache
directly without an F16 cast + permute + ggml_cont on wk_b each call.
Path A (wkv_b in GGUF) and Path B (only wk_b/wv_b in GGUF) both
populate wk_b_pp through the standard per-rank replica setup.

Measured on 8x RTX 3090, -sm graph -mla 2 -fa on:

  DSV2.5 IQ2_XS         c=8k  ub=2048   PP +51% to +60%
  GLM-4.7-Flash IQ4_XS  c=32k ub=2048   PP -6% (PP@0) to +77% (PP@30720)
  GLM-5.1 IQ1_S q4_0    c=16k ub=2048   PP +5% to +9%

PPL parity within +/-0.2 noise (DSV2.5 bit-identical 5.3917, GLM-4.7
8.83 vs 8.96, GLM-5.1 6.96 vs 7.00). Token-generation throughput
unchanged within noise.

Compute buffer at init:
  DSV2.5         -54 MiB total       (allocator noise)
  GLM-4.7-Flash  +1042 MiB total     (~+173 MiB per non-output device)
  GLM-5.1        0                   (MoE intermediates dominate)

* MLA TP: respect mla=1 vs mla=3 distinction, rename attn_k_b_pp -> attn_kv_b

ikawrakow/ik_llama.cpp#1841 review feedback: the pp_opt path lost the
intended trade-off where mla=1 forgoes pp_opt to save VRAM and mla=3 pays
the wk_b_pp tensor cost for faster long-context PP.

- llm_prepare_mla second pass: gate wk_b_pp synthesis on mla > 1.
  Models that ship wk_b in their GGUF (mainline format) no longer
  allocate the pp_opt-favoring K weight under mla=1.
- llm_prepare_mla first pass (wk_b synthesis from wkv_b): keep
  unconditional under -sm graph. The wk_b_pp materialization here
  shares the wk_b_f32 intermediate with the wk_b synthesis above, and
  isolating just the wk_b_pp branch leaves the synthesized wk_b in a
  state that makes the absorb path produce inf on some quant combos
  (DSV2.5 IQ2_XS). Trade: the synthesized-wkv_b path still pays the
  wk_b_pp allocation under mla=1, but the bigger compute-buffer
  saving (no pp_opt branch at runtime) still applies.
- build_deepseek2 outer pp_opt: include cparams.mla_attn > 1 in the
  pp_opt definition itself, so mla=1 is bypassed throughout (TP and
  non-TP attention paths).
- build_deepseek2 tp pp_opt: require wk_b_pp present. Drop the dead
  runtime wk_b transpose fallback (unreachable now that wk_b_pp is
  guaranteed when tp_pp_opt fires).
- llama_kv_cache_init: have_wkv_b probe now treats wk_b_pp (attn_kv_b)
  as equivalent to wkv_b for the purposes of allowing mla>1 to stay
  put. Without this, -sm graph models that have wk_b/wv_b separately
  in the GGUF (no combined wkv_b) would silently downgrade to mla=1.
- Rename the synthesized tensor "attn_k_b_pp.weight" -> "attn_kv_b.weight"
  to match the mainline naming ik uses.

GLM-5.1 in particular benefits: its mla=3 PP improvement over mla=1 is
negligible on this arch (~0.4% in our sweeps), so users save the
runtime cost by sticking to mla=1.
2026-05-20 17:03:05 +03:00
Kawrakow
40254a51da
Fix MTP when -no-gr is used (#1848) 2026-05-20 13:38:33 +03:00
Kawrakow
eb597df91f Update AUTHORS 2026-05-20 06:38:39 +00:00
Kawrakow
290935be79
Remove Makefile (#1847) 2026-05-20 09:14:28 +03:00
Kawrakow
6bb3ee3a32
Enable split mode graph for MLA models and partial offload (#1835) 2026-05-20 07:13:55 +03:00
firecoperana
9ae0fb7b2f
Remove reasoning budget logs (#1846)
Co-authored-by: firecoperana <firecoperana>
2026-05-20 07:12:02 +03:00
Samuel Oliveira Alves
77413bc900
Add Hadamard parameters to draft model loading (#1840) 2026-05-19 18:30:41 +03:00
Kawrakow
997c587a6c
Fix #1837 (#1838) 2026-05-19 17:56:21 +03:00
Kawrakow
27d7a74389 Compiler warnings 2026-05-19 05:51:27 +00:00
firecoperana
9ad8b8c6db
common/grammar: fix grammar parsing issues to prevent stack overflow and hangs (#1822)
* grammar: Fix grammar root symbol check (#19761)

* grammar: fix bad check for root symbol, correct error logging

* add tests to demonstrate root symbol check failure

* common/grammar: fix grammar parsing issues to prevent stack overflow and hangs (#18604)

* grammar: add test case for nullable symbol loop

Reproduce stack overflow (or OOM) with ( [x]* )* found while adding
GBNF support to ripgrep-edit.

llama-server reproducer:

curl \
  -X POST \
  -d '{
    "messages": [{ "role": "user", "content": "write yes" }],
    "grammar": "root ::= ( [x]* )*"
  }' \
  -H "Content-Type: application/json" \
  http://localhost:8811/v1/chat/completions

* grammar: prevent stack overflow with nullable symbol loop

Fix a potential stack overflow in llama_grammar_advance_stack that
could occur when processing grammars with nullable symbols that lead
to infinite derivations of empty strings. The fix introduces cycle
detection by tracking visited stacks to prevent infinite recursion.

rg-edit regexp: llama_grammar_advance_stack
rg-edit extra-args: -A20
rg-edit directive: """Rewrite: fix the following segfault:

[..]
 Testing segfault. Grammar:
            root ::= ( [x]* )*

            root ::= ( [x]* )*

Segmentation fault         build/bin/test-grammar-integration"""

gptel-context:
(("~/llama.cpp/src/llama-grammar.cpp")
 ("~/llama.cpp/tests/test-grammar-integration.cpp")
 ("~/llama.cpp/grammars/./list.gbnf")
 ("~/llama.cpp/grammars/./json_arr.gbnf")
 ("~/llama.cpp/grammars/./json.gbnf")
 ("~/llama.cpp/grammars/./japanese.gbnf")
 ("~/llama.cpp/grammars/./english.gbnf")
 ("~/llama.cpp/grammars/./chess.gbnf")
 ("~/llama.cpp/grammars/./c.gbnf")
 ("~/llama.cpp/grammars/./arithmetic.gbnf")
 ("~/llama.cpp/grammars/./README.md"))

* grammar: convert recursive llama_grammar_advance_stack to iterative

This change converts the function to an iterative approach using
explicit stacks, which prevents deep recursion and eliminates the risk
of stack overflow.

rg-edit regexp: llama_grammar_advance_stack
rg-edit extra-args: -A30
rg-edit directive: """Rewrite: fix the following segfault:

[..]
 Testing segfault. Grammar:
            root ::= ( [x]* )*

            root ::= ( [x]* )*

Segmentation fault         build/bin/test-grammar-integration

convert from recursive to iterative"""

gptel-context:
(("~/llama.cpp/src/llama-grammar.cpp")
 ("~/llama.cpp/tests/test-grammar-integration.cpp")
 ("~/llama.cpp/grammars/./list.gbnf")
 ("~/llama.cpp/grammars/./json_arr.gbnf")
 ("~/llama.cpp/grammars/./json.gbnf")
 ("~/llama.cpp/grammars/./japanese.gbnf")
 ("~/llama.cpp/grammars/./english.gbnf")
 ("~/llama.cpp/grammars/./chess.gbnf")
 ("~/llama.cpp/grammars/./c.gbnf")
 ("~/llama.cpp/grammars/./arithmetic.gbnf")
 ("~/llama.cpp/grammars/./README.md"))

v2: Added a `std::set` to perform tree-based lookups with O(N log N)
complexity. Testing with a parallel run of `test-grammar-integration`
shows a double-digit percentage increase in runtime. An
`unordered_set` with O(1) hashing was also evaluated, but the overhead
of constructing hash keys from pointers made it significantly slower
than the rbtree implementation that only requires an ordering
operator. The performance regression in the test suite appears
justified by the overall reduction in algorithmic complexity.
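
The combination of an explicit worklist and an ordered set of visited states can be sketched as follows. This is a toy model, not the code in `llama_grammar_advance_stack`: `state`, `reachable_states`, and the `successors` callback are hypothetical, with a `std::vector<int>` standing in for a grammar stack.

```cpp
#include <cassert>
#include <functional>
#include <set>
#include <utility>
#include <vector>

using state = std::vector<int>;  // stand-in for a grammar stack

// Collect every state reachable from `start`, using an explicit worklist
// instead of recursion and a std::set to skip states seen before.
inline std::set<state> reachable_states(
        const state & start,
        const std::function<std::vector<state>(const state &)> & successors) {
    std::set<state>    visited;
    std::vector<state> todo{start};
    while (!todo.empty()) {
        state cur = std::move(todo.back());
        todo.pop_back();
        if (!visited.insert(cur).second) {
            continue;  // already expanded: this is what breaks the cycle
        }
        for (state & next : successors(cur)) {
            todo.push_back(std::move(next));
        }
    }
    return visited;
}
```

Because `std::set` only needs `operator<`, which `std::vector<int>` already provides lexicographically, no hash function has to be built for the key. A successor function that loops back to an earlier state terminates here, where naive recursion would not.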

Co-developed-by: Piotr Wilkin (ilintar) <piotr.wilkin@syndatis.com>

* grammar: add test case for hang in repetition grammar processing

This commit adds a new test case to the grammar integration tests that
specifically targets a hang scenario in the repetition grammar parser
found while adding GBNF support to ripgrep-edit.

llama-server reproducer:

curl \
  -X POST \
  -d '{
    "messages": [{ "role": "user", "content": "write yes" }],
    "grammar": "root ::= (([^x]*){0,99}){0,99}"
  }' \
  -H "Content-Type: application/json" \
  http://localhost:8811/v1/chat/completions

* grammar: add repetition threshold check

The change introduces a maximum repetition threshold to avoid
excessive rule expansion during grammar parsing. When parsing
repetition patterns like {m,n}, the parser now calculates the
potential number of rules that would be generated and throws an error
if the product of previous rules and new rules exceeds the threshold.

A test case was added to verify the threshold is properly enforced for
deeply nested repetition patterns that would otherwise cause hangs.
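
A minimal sketch of such a product check is shown below. The function name, its signature, and the `limit` parameter are assumptions for illustration; the actual threshold value and bookkeeping in the parser are not stated in this message.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>

// Returns the updated rule count, or throws if expanding a {m,n} repetition
// with upper bound `max_times` on top of `prev_rules` already-generated rules
// would exceed `limit`. All names here are hypothetical.
inline uint64_t check_repetition(uint64_t prev_rules, uint64_t max_times, uint64_t limit) {
    assert(limit > 0);
    const uint64_t base      = prev_rules == 0 ? 1 : prev_rules;
    const uint64_t new_rules = max_times  == 0 ? 1 : max_times;
    // Compare against limit / base instead of multiplying first, so the
    // check itself cannot overflow.
    if (new_rules > limit / base) {
        throw std::runtime_error("number of rules from repetition exceeds the threshold");
    }
    return base * new_rules;
}
```

For the nested `{0,99}` reproducer above, the inner repetition passes a limit of, say, 2000, while the outer one would multiply to 9801 and is rejected before any rules are generated.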

---------

Co-authored-by: Asbjørn Olling <asbjornolling@gmail.com>
Co-authored-by: Andrea Arcangeli <aarcange@redhat.com>
2026-05-19 08:36:49 +03:00
David Young
c07a052315
MLA tensor parallelism under -sm graph (DEEPSEEK2/GLM_DSA/MISTRAL4) (#1821)
* MLA tensor parallelism under -sm graph (DEEPSEEK2/GLM_DSA/MISTRAL4)

Extends -sm graph (split-mode graph) to MLA-style attention across the
DEEPSEEK2, GLM_DSA, and MISTRAL4 architectures. Previously these archs
fell back to -sm layer regardless of the user's flag.

Implementation:
- Per-rank attention build in build_deepseek2_tp_attention with
  view-sliced FlashAttention, split-buffer output projection, and
  ggml_reduce across devices
- wk_b / wv_b absorbed weights replicated per device via materialize()
  in llm_prepare_mla (these can't live in a split buffer)
- KV cache replication path (replicated_k_l) for graph-mode TP
- distribute_mla_tensors_for_split_mode_graph routes attention/norm
  tensors into ctx_split; expert tensors stay per-layer
- Implements ggml_backend_cuda_split_buffer_get_tensor for the
  replicated / row-split / col-split inverse paths
- Early-reject guard in src/llama.cpp that auto-downgrades -sm graph
  to -sm layer (with a warning) when incompatible loader flags are set:
  -ncmoe, -cmoe, -ot, -rtr, -muge

New CLI flag:
- -gap | --graph-attn-precision <f16|f32>  (default f16)

See the PR description for the full validation matrix (3 archs x 2/4/8
GPU counts), perf numbers, VRAM accounting, and known limitations.

* Some tweaks

* materialize lambda: per-head split for graph-mode tp_replicate

7dd19e19 changed wk_b/wv_b distribution from mirror to per-head split
(split_dim=2) via prepare_split_tensors. That path only fires when
wk_b/wv_b are loaded from GGUF.

Models that store only wkv_b in GGUF derive wk_b/wv_b at load via
llm_prepare_mla, going through the materialize lambda, which was
untouched and still produced mirror replicas (split_dim=-1, full n_head
per device).

build_deepseek2_tp_attention now does mul_mat(wk_b_local, q_nope_perm)
without the prior view_3d slice, so a mirror replica passes an n_head
tensor where the kernel expects n_head_local. Result: silent SIGSEGV
right after model load.

Mirror logic in materialize is replaced with the same per-head split as
prepare_split_tensors: head_offsets derived from wo split, each rank
gets a tensor with ne[2]=n_head_local, data copied from the appropriate
source byte slice. Singular `computed` tensor keeps full metadata for
tensors_by_name lookups.
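
The per-head split can be sketched on a flat buffer. This is an illustration only: `split_per_head` is a hypothetical helper, the source is assumed to be stored head-major (heads on the slowest dimension), and `heads_per_rank` stands in for the head counts derived from the wo split.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Split a tensor stored head-major (n_head slices of `head_size` floats each)
// into one contiguous chunk per rank.
inline std::vector<std::vector<float>> split_per_head(
        const std::vector<float> & src, size_t head_size,
        const std::vector<size_t> & heads_per_rank) {
    std::vector<std::vector<float>> out;
    size_t head_offset = 0;  // first head owned by the current rank
    for (size_t n_head_local : heads_per_rank) {
        const size_t first = head_offset * head_size;
        const size_t last  = first + n_head_local * head_size;
        assert(last <= src.size());
        out.emplace_back(src.begin() + (std::ptrdiff_t) first,
                         src.begin() + (std::ptrdiff_t) last);
        head_offset += n_head_local;
    }
    assert(head_offset * head_size == src.size());  // every head assigned once
    return out;
}
```

Each rank then holds only its `n_head_local` heads, which matches what a kernel expecting `n_head_local` on dim 2 needs, instead of a full mirror.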

Tested: 8x3090, -sm graph -mla 3 -fa on now boots cleanly and
sweep-benches without crash. Log confirms new path: "Computed
blk.X.attn_k_b.weight ... split across N devices on dim=2".

* cleanup: indent fix + remove dead view_3d slicing and debug printf

- build_deepseek2.cpp: re-indent the self_attention block in
  build_deepseek2_layer_attention (lines 253-670). Block was at column 0
  inside a function body; now at the expected 4/8-space indent.
- build_deepseek2.cpp: drop the commented-out view_3d slicing and debug
  printfs left over after 7dd19e19's switch to direct mul_mat on
  per-rank wk_b_local / wv_b_local. Update the stale 'wk_b is
  replicated (split_dim=-1)' comment to match the new split_dim=2
  reality.
- ggml-cuda.cu: remove the leftover debug printf in
  ggml_backend_cuda_split_buffer_get_tensor.

No behavior change. Verified with a clean rebuild and DSV2.5 +
GLM-4.7-Flash sweep-bench runs.

* llm_load_tensors: gate incompatible-flag warning to MLA archs

The -ncmoe / -rtr / -muge / -ot warning under -sm graph currently fires
for all archs that support graph mode. That's an over-reach: the
incompatibility is specific to the MLA TP paths (DEEPSEEK2, GLM_DSA,
MISTRAL4) — Gemma4 graph mode existed pre-PR and works with those flags.
Gate the warning to MLA archs only.

Also refreshes two stale comments left over from the wk_b/wv_b
mirror -> per-head-split rewrite:
- src/llama.cpp llm_prepare_mla: "Replicate wk_b/wv_b ..." now reads
  "Per-head split wk_b/wv_b ..." to match what the materialize lambda
  actually does post-823a39e2.
- src/llama-load-tensors.cpp distribute_mla_tensors_for_split_mode_graph:
  drop the wkv_b row-split mention (wkv_b is no longer created under
  graph mode after 7dd19e19) and correct the wk_b/wv_b distribution
  description (per-head split, not per-device replicated).

---------

Co-authored-by: Kawrakow <iwankawrakow@gmail.com>
2026-05-19 08:36:17 +03:00
firecoperana
104846ddee
spec : discard last drafted token with low prob (#1820)
* spec : discard last drafted token with low prob

* Apply suggestion from @ikawrakow

Co-authored-by: Kawrakow <iwankawrakow@gmail.com>

---------

Co-authored-by: firecoperana <firecoperana>
Co-authored-by: Kawrakow <iwankawrakow@gmail.com>
2026-05-19 08:35:35 +03:00
Joel Farthing
f43a9f1cf6
Add per-byte CUDA MoE offload threshold (#1813)
Co-authored-by: Joel Farthing <262452229+joelfarthing@users.noreply.github.com>
2026-05-19 08:35:05 +03:00
firecoperana
f645ed1e2d
AutoParser: improve reasoning budget and handling of space/newline in tool calls (#1819)
common/chat, server: refactor, move all conversion functions to common, add tests (#20690)

jinja : remove unused header (#22310)

common : fix jinja warnings with clang 21 (#22313)

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

chat: fix handling of space in reasoning markers (#22353)

* chat: fix handling of space in reasoning markers

common : re-arm reasoning budget after DONE on new <think> (#22323)

common : determine generation prompt using longest common prefix (#22657)

common/autoparser: fixes for newline handling / forced tool calls (#22654)

* chat/autoparser: the fixes

* Move optspace() to chat-peg-parser, comment out server tests invalidated due to content now allowed with forced tool calls.

* Trim whitespace on apply instead

common/chat : preserve media markers for typed-content templates (#22634)

common : revert reasoning budget +inf logit bias (#22740)

common : do not wrap raw strings in schema parser for tagged parsers (#22827)

common : enable streaming JSON argument values (#23173)

* common : remove atomic from json arguments

* common : remove parsing logic on JSON arguments

common : do not pass prompt tokens to reasoning budget sampler (#22488)

reasoning-budget: clone should do a deep-copy (#23095)

Co-authored-by: Piotr Wilkin (ilintar) <piotr.wilkin@syndatis.com>
2026-05-19 08:34:19 +03:00
Kawrakow
40aae0b6d8
Check for output_extra.weight when loading Gemma4 assistant models (#1817) 2026-05-18 08:17:05 +03:00
Kawrakow
a407b9ca3d
Fix Qwen3.6-MoE low MTP acceptance rate (#1815)
* Fix Qwen3.6-MoE low MTP acceptance rate

* Fix Gemma4 MTP
2026-05-18 07:26:17 +03:00
gapeleon
c35189d83c
fix(server): reset chat parser on slot reuse to prevent crash (#1763) (#1794)
If a slot is reused for a standard completion (`/v1/completions`) after
being used for a chat completion (`/v1/chat/completions`), the previous
chat's PEG parser would remain active in the slot's parameters. This
caused standard text completions to throw on the raw text.
2026-05-17 18:26:45 +03:00
Kawrakow
0ab9bdf793
Fix Qwen3.5/3.6 MTP and -muge (#1816) 2026-05-17 17:14:47 +03:00
Kawrakow
1f8c603d9c
Quantize: add extra output tensor for MTP (#1810)
* Quantize: add extra output tensor for MTP

* Consistently use --mtp-requantize-output-tensor
2026-05-17 13:59:56 +03:00
Kawrakow
3e573cfea6
MTP: option to use re-quantized output tensor for better TG performance (#1809)
* Option to use re-quantized output tensor for MTP

* Remove quantize extra output option

* Handle interleaved types
2026-05-16 14:40:18 +03:00
Kawrakow
5cc0d86c76
imatrix: use data for ffn_up when data for ffn_gate is missing (#1806) 2026-05-15 14:38:16 +03:00
Samuel Oliveira Alves
f4f4b3ff26
Allow dual speculative decoding (#1789)
* wip: test logic to use multiple specs

* feat: introduce composite speculative decoding stages

* handle MTP context and draft invalidation

* fix: allow gemma mtp for speculative stages

* fix: normalize spec stage keys

* refactor: remove enable_mtp flag and improve speculative stage handling

* fix: update cached text tokens handling for stage chains

* feat: implement sync for external MTP after non-MTP accept
2026-05-15 10:10:40 +03:00
Jun Yamog
53cd4d0ff0
fix: use mmq for volta quantized matmuls (#1785) 2026-05-15 08:11:49 +03:00
Samuel Oliveira Alves
40b65d8f54
feat: add support for draft imatrix output file (#1803) 2026-05-15 08:10:58 +03:00
Kawrakow
4e1851b01a
imatrix: use data for ffn_up when data for ffn_gate is missing (#1805) 2026-05-15 07:28:34 +03:00
Kawrakow
ba72890076
Faster imatrix (#1801)
* Faster imatrix on AVX2

* Slightly better
2026-05-15 07:15:16 +03:00
Samuel Oliveira Alves
35fbe08d6e
disable MTP for parallel slots (#1804) 2026-05-15 07:11:04 +03:00
Samuel Oliveira Alves
0fcffdb64d
feat: map Gemma 4 tensor and support with imatrix (#1796) 2026-05-14 09:01:24 +03:00
Marian M.
b2e7f7f6cd
Update docs (#1800)
* Update README.md

- New model
- New features

* Update parameters.md

- Recent new parameters
2026-05-14 08:44:58 +03:00
Kawrakow
949bb8f1d6
More MTP tweaks (#1792) 2026-05-13 17:55:43 +03:00
ubergarm
ca52a825db
feat: add --threads-mtmd for independent multimodal thread count (#1797)
Add `-tm` / `--threads-mtmd` to control CPU thread count used during
multimodal image/audio processing (mmproj encoding), separate from the
main LLM thread count.

This allows running the LLM on GPU with minimal CPU threads (e.g. `-t 1`)
to reduce sync overhead, while using many threads (e.g. `-tm 16`) for
CPU-bound mmproj encoding with `--no-mmproj-offload`.

Fallback chain when `-tm` is not specified:
 1. `--threads-batch` (-tb) — multimodal encoding is a batch/prefill-like
    operation, so it makes sense to track with batch thread count
 2. `--threads` (-t) — final default
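
The fallback chain can be sketched as a small resolver. The struct and field names are hypothetical, and the convention that a value <= 0 means "not specified" is an assumption made for this sketch.

```cpp
#include <cassert>

// Hypothetical parameter bundle; a value <= 0 means "not specified".
struct thread_params_sketch {
    int n_threads       = 4;   // -t
    int n_threads_batch = -1;  // -tb
    int n_threads_mtmd  = -1;  // -tm
};

// Pick the thread count for multimodal encoding: -tm, then -tb, then -t.
inline int resolve_mtmd_threads(const thread_params_sketch & p) {
    assert(p.n_threads > 0);
    if (p.n_threads_mtmd  > 0) return p.n_threads_mtmd;
    if (p.n_threads_batch > 0) return p.n_threads_batch;
    return p.n_threads;
}
```

With `-t 1 -tm 16` the resolver returns 16 for mmproj encoding while the main LLM thread count stays at 1.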

Works with both mtmd-cli and llama-server.

AI: ubergarm/Qwen3.6-27B-GGUF MTP IQ4_KS 15.113 GiB (4.752 BPW) + pi.dev
2026-05-13 17:44:43 +03:00
Forkoz
8a0f912cb2
Remove outdated asserts from mmproj (#1795) 2026-05-13 17:40:11 +03:00