451 Commits

Author SHA1 Message Date
Farmadupe
af62a37acd
Prune examples/llava. Dead code. (#2025)
examples/llava was replaced by mtmd in late 2025 and has been out of the
ik_llama.cpp build since examples/CMakeLists.txt removed it in #798.

Repointed descriptions from llava to mtmd where they remained.
2026-06-26 08:48:48 +02:00
Joel Farthing
8686ea708b
chat: Cohere2MoE/North Code: parse unopened thinking under --reasoning off (follow-up to #1968) (#2012)
* Handle Cohere2MoE unopened thinking before tools

* Cohere2MoE: route unopened thinking to reasoning_content; test in active target

Follow-up to #1968. Gate extract_reasoning on reasoning_format only (drop the
"&& enable_thinking" addition) so the unopened-thinking handling does not also
change where an opened thinking block is routed. Under --reasoning off
(enable_thinking=false, reasoning_format defaults to DEEPSEEK) an orphaned
thinking block is now quarantined in reasoning_content with clean content and a
native tool call, instead of leaking the thinking prose into the user-facing
answer.

Move the Cohere2MoE end-to-end parser cases into tests/test-chat-auto-parser.cpp,
which CMake actually builds. tests/test-chat.cpp has been disabled in
tests/CMakeLists.txt since #723, so cohere coverage added there never ran in CI;
revert the local band-aids to that file.

* Cohere2MoE: harden parser from NMC eval findings

---------

Co-authored-by: Joel Farthing <262452229+joelfarthing@users.noreply.github.com>
2026-06-24 09:04:41 +02:00
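
A minimal sketch of the routing described in the commit above, assuming an orphaned thinking block is text that ends with a closing tag and has no opening tag before it. The tag strings and the function name are hypothetical placeholders, not the Cohere2MoE template's real markers or the parser's actual API.

```python
def split_unopened_thinking(text, open_tag="<think>", close_tag="</think>"):
    """Return (reasoning_content, content) for a raw model response.

    Only the unopened case is handled here: a closing tag with no
    opening tag before it. Opened blocks are deliberately left alone.
    """
    close_at = text.find(close_tag)
    if close_at == -1:
        return "", text  # no thinking block at all
    open_at = text.find(open_tag)
    if open_at != -1 and open_at < close_at:
        return "", text  # properly opened block: not this function's job
    # Orphaned thinking prose: quarantine it instead of leaking it to the user.
    reasoning = text[:close_at].strip()
    content = text[close_at + len(close_tag):].strip()
    return reasoning, content


reasoning, content = split_unopened_thinking(
    "I should call the tool.</think>Calling it now."
)
```

With this split, the thinking prose lands in the first element and the user-facing answer in the second, which is the separation the commit message describes.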
Yap Sok Ann
997b289d93
jinja: give each for-loop iteration a fresh scope (#2018)
`{% set %}` of a non-loop variable inside a `{% for %}` body leaks
across iterations when the assignment is conditionally skipped. Each
iteration should start with a clean scope, matching standard Jinja2
semantics.

This fixes the issue with GLM-5.2 chat template when:
* turn 1 is a tool call with reasoning
* turn 2 is a tool call without reasoning

In this case, the reasoning content for turn 1 would be wrongly
duplicated to turn 2, resulting in degraded model performance.
2026-06-24 08:58:36 +02:00
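
The scoping problem in the commit above can be illustrated with a toy model. This is a sketch in plain Python, not the project's jinja implementation; the function names and the message layout are hypothetical.

```python
def render_leaky(messages):
    """Toy model of the bug: one scope shared by all loop iterations."""
    scope = {}
    out = []
    for msg in messages:
        # Mirrors a conditional `{% set reasoning = ... %}` in the loop body.
        if msg.get("reasoning"):
            scope["reasoning"] = msg["reasoning"]
        out.append(scope.get("reasoning", ""))
    return out


def render_fresh(messages):
    """Toy model of the fix: every iteration starts from a clean scope."""
    out = []
    for msg in messages:
        scope = {}  # fresh scope per iteration
        if msg.get("reasoning"):
            scope["reasoning"] = msg["reasoning"]
        out.append(scope.get("reasoning", ""))
    return out


turns = [
    {"tool_call": "search", "reasoning": "look it up"},  # turn 1: with reasoning
    {"tool_call": "open"},                               # turn 2: without reasoning
]
leaky = render_leaky(turns)
fresh = render_fresh(turns)
```

In the shared-scope version the second turn inherits the first turn's reasoning because the assignment was skipped; with a fresh scope the second turn renders with no reasoning.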
firecoperana
befbc0945b
server: variance based checkpoint eviction (#2020)
Co-authored-by: firecoperana <firecoperana>
2026-06-24 08:54:07 +02:00
Jun Yamog
69a8336d08
Add native MiniMax-M3 tool call parser (#2008) 2026-06-23 09:36:02 +02:00
Kawrakow
0d59973e4a
Fix MTP warmup for GLM models (#1992) 2026-06-19 08:59:55 +02:00
Kawrakow
f5e5753c32
Fix Qwen35 mtp warmup (#1987)
* Use hidden state from prev token from qwen mtp

* Fix Qwen35 MTP warmup

* Cleanup + remove unnecessary crippling performance by not using accept to sample draft token

* Provide API to get the model arch string

---------

Co-authored-by: SamuelOliveirads <samueloliveira32df@gmail.com>
2026-06-18 09:03:40 +02:00
Joel Farthing
d37d92b54c
chat: add Cohere2MoE North Code parser (#1968)
Co-authored-by: Joel Farthing <262452229+joelfarthing@users.noreply.github.com>
2026-06-16 15:27:30 +02:00
SamuelOliveirads
6cae8c7ba2 clean logs 2026-06-14 21:07:57 -03:00
SamuelOliveirads
0d75eee35a remove duplicated code and unnecessary refactor 2026-06-14 16:02:02 -03:00
SamuelOliveirads
3b1a0f88d5 Add logging for DFlash statistics and clean up workspace handling 2026-06-13 20:14:08 -03:00
SamuelOliveirads
3a1d46c4d1 Merge remote-tracking branch 'origin/main' into feat/dflash-implementation
# Conflicts:
#	common/common.cpp
#	common/speculative.cpp
#	convert_hf_to_gguf.py
#	examples/server/server-context.cpp
#	examples/server/server-context.h
#	src/llama-arch.cpp
#	src/llama-arch.h
#	src/llama-model.cpp
#	src/llama.cpp
2026-06-13 17:27:52 -03:00
Kawrakow
5f917a64b3
Merge pull request #1958 from ikawrakow/ik/handle_think_no_space 2026-06-12 21:27:23 +02:00
Samuel Oliveira Alves
8a38025174
Refactor: Move spec outside server (#1949)
* Refactor speculative decoding: move logic outside of server

* remove duplicated tokens in mtp kv cache

* narrow to only discard draft cells in MTP

* revert mtp_speculative_gen_draft
2026-06-12 18:12:39 +02:00
Kawrakow
175819b4fb Style 2026-06-12 06:19:06 +00:00
Kawrakow
3dbc3241b9 Handle forced-open reasoning tag without trailing whitespace 2026-06-12 05:43:11 +00:00
Joel Farthing
8d91d3c3d9
common: gate empty-start reasoning extraction (#1955)
Co-authored-by: Joel Farthing <262452229+joelfarthing@users.noreply.github.com>
2026-06-12 07:16:24 +02:00
firecoperana
ca0c1c5f85
fix Qwen3.6 outputs blank <think></think> in response when thinking is off (#1951)
Co-authored-by: firecoperana <firecoperana>
2026-06-11 07:26:07 +02:00
firecoperana
2a1148384c
server: fix double submits of infill (#1944)
Co-authored-by: firecoperana <firecoperana>
2026-06-10 07:48:15 +02:00
Joel Farthing
71d5aa21f7
common: handle Laguna chat delimiters (#1943)
* common: handle Laguna chat delimiters

* common: limit tool parser changes to end-delimited content

---------

Co-authored-by: Joel Farthing <262452229+joelfarthing@users.noreply.github.com>
2026-06-10 07:46:19 +02:00
Kawrakow
366e478cb6
Bug fixes (#1940)
* Bug fixes

* More
2026-06-10 07:45:49 +02:00
Farmadupe
6b9de3dbaa
Fix mrope application across chunk boundaries (Fixes #993 and #1902 -- part 2) (#1918)
* (qwen3vl) Correct calculation for injection point of deepstack image embeddings

The injection point for deepstack embeddings used the hyperparameter n_embd_inp(), which caused the hidden state to be double accounted for, causing an OOB array access. The correct accessor is n_embd().

* Fix m-rope when pipeline parallelism is enabled
2026-06-05 17:10:02 +02:00
SamuelOliveirads
08e4590dcb implement gpu argmax 2026-06-04 20:45:12 -03:00
Samuel Oliveira Alves
007d640098
Standardize speculative decoding arguments on the server (#1908)
* refactor spec args

* add shell-safe quoting of string-valued stage keys in speculative decoding
2026-06-04 15:44:57 +02:00
firecoperana
6c0180d702
server: enable mcp proxy (#1904)
* update http lib

* Add cors proxy

---------

Co-authored-by: firecoperana <firecoperana>
2026-06-04 15:43:07 +02:00
SamuelOliveirads
dc43cdf06b move dflash to its own file 2026-06-02 10:22:13 -03:00
SamuelOliveirads
3d73312d9d apply workspace support for KV cache 2026-06-01 09:55:34 -03:00
SamuelOliveirads
ed403dca27 Use windows update in kv cache 2026-05-31 14:51:21 -03:00
SamuelOliveirads
1369e68471 fix graph mask, swa layers and tokens positions 2026-05-31 11:12:03 -03:00
SamuelOliveirads
532499836e improve DFlash caching and profiling capabilities 2026-05-30 21:36:10 -03:00
SamuelOliveirads
9f5f70cf7e implement target position tracking and context management 2026-05-29 23:11:38 -03:00
Kawrakow
8960c5ba5e
Add extra nodes when dealing with MLA and amb (#1899) 2026-05-29 15:17:24 +03:00
SamuelOliveirads
82cff238fe Initial dflash implementation 2026-05-28 18:57:58 -03:00
Kawrakow
b4e1d916c5
Per GPU fit margin (#1872) 2026-05-25 08:16:45 +03:00
dungquixote42
642c038ccd
Extend expiring logit bias to other sampling parameters (#1770)
* initial commit

* fix underflow bug, add debug prints, update macro/variable names

* fix phrases-sharing-1-flag bug, replace macros with struct member function

* cleanup

* fix file parsing

* string_split_open_close() -> string_extract(), improve escape handling

* support multiple nested entries

* make persistent entries global, simplify file parsing

* cosmetic changes

* add support for jumping to exitword

* update variable names

* fix bad search bug

* better debug prints, reorg

* replace lambda with string_is_found(), add string_unescape() for debug

* add support for inline comments

* add missing debug print macro

* fix type promotion bug

* actually fix type promotion bug
2026-05-23 19:19:12 +03:00
Kawrakow
3f45ba9387
MTP tweaks 3 (#1862) 2026-05-23 07:23:20 +03:00
Samuel Oliveira Alves
7b73f45541
Add adaptive sampling clone and free functions to manage memory (#1851) 2026-05-21 08:11:17 +03:00
Samuel Oliveira Alves
11a1fea9e2
Move embedding management to speculative (#1825)
* refactor speculative decoding with companion context and draft result structures

* feat: add common speculative feature handling in server context

* refactor: move embeddings outside server

* feat: harden draft input hidden state in llama context

* remove unused functions

* refactor: streamline speculative feature handling and remove unused code

* remove redundant code

* remove more unused variables

* refactor: implement speculative feature handling
2026-05-20 17:42:48 +03:00
firecoperana
9ae0fb7b2f
Remove reasoning budget logs (#1846)
Co-authored-by: firecoperana <firecoperana>
2026-05-20 07:12:02 +03:00
David Young
c07a052315
MLA tensor parallelism under -sm graph (DEEPSEEK2/GLM_DSA/MISTRAL4) (#1821)
* MLA tensor parallelism under -sm graph (DEEPSEEK2/GLM_DSA/MISTRAL4)

Extends -sm graph (split-mode graph) to MLA-style attention across the
DEEPSEEK2, GLM_DSA, and MISTRAL4 architectures. Previously these archs
fell back to -sm layer regardless of the user's flag.

Implementation:
- Per-rank attention build in build_deepseek2_tp_attention with
  view-sliced FlashAttention, split-buffer output projection, and
  ggml_reduce across devices
- wk_b / wv_b absorbed weights replicated per device via materialize()
  in llm_prepare_mla (these can't live in a split buffer)
- KV cache replication path (replicated_k_l) for graph-mode TP
- distribute_mla_tensors_for_split_mode_graph routes attention/norm
  tensors into ctx_split; expert tensors stay per-layer
- Implements ggml_backend_cuda_split_buffer_get_tensor for the
  replicated / row-split / col-split inverse paths
- Early-reject guard in src/llama.cpp that auto-downgrades -sm graph
  to -sm layer (with a warning) when incompatible loader flags are set:
  -ncmoe, -cmoe, -ot, -rtr, -muge

New CLI flag:
- -gap | --graph-attn-precision <f16|f32>  (default f16)

See the PR description for the full validation matrix (3 archs x 2/4/8
GPU counts), perf numbers, VRAM accounting, and known limitations.

* Some tweaks

* materialize lambda: per-head split for graph-mode tp_replicate

7dd19e19 changed wk_b/wv_b distribution from mirror to per-head split
(split_dim=2) via prepare_split_tensors. That path only fires when
wk_b/wv_b are loaded from GGUF.

Models that store only wkv_b in GGUF derive wk_b/wv_b at load via
llm_prepare_mla, going through the materialize lambda, which was
untouched and still produced mirror replicas (split_dim=-1, full n_head
per device).

build_deepseek2_tp_attention now does mul_mat(wk_b_local, q_nope_perm)
without the prior view_3d slice, so a mirror replica passes an n_head
tensor where the kernel expects n_head_local. Result: silent SIGSEGV
right after model load.

Mirror logic in materialize is replaced with the same per-head split as
prepare_split_tensors: head_offsets derived from wo split, each rank
gets a tensor with ne[2]=n_head_local, data copied from the appropriate
source byte slice. Singular `computed` tensor keeps full metadata for
tensors_by_name lookups.

Tested: 8x3090, -sm graph -mla 3 -fa on now boots cleanly and
sweep-benches without crash. Log confirms new path: "Computed
blk.X.attn_k_b.weight ... split across N devices on dim=2".

* cleanup: indent fix + remove dead view_3d slicing and debug printf

- build_deepseek2.cpp: re-indent the self_attention block in
  build_deepseek2_layer_attention (lines 253-670). Block was at column 0
  inside a function body; now at the expected 4/8-space indent.
- build_deepseek2.cpp: drop the commented-out view_3d slicing and debug
  printfs left over after 7dd19e19's switch to direct mul_mat on
  per-rank wk_b_local / wv_b_local. Update the stale 'wk_b is
  replicated (split_dim=-1)' comment to match the new split_dim=2
  reality.
- ggml-cuda.cu: remove the leftover debug printf in
  ggml_backend_cuda_split_buffer_get_tensor.

No behavior change. Verified with a clean rebuild and DSV2.5 +
GLM-4.7-Flash sweep-bench runs.

* llm_load_tensors: gate incompatible-flag warning to MLA archs

The -ncmoe / -rtr / -muge / -ot warning under -sm graph currently fires
for all archs that support graph mode. That's an over-reach: the
incompatibility is specific to the MLA TP paths (DEEPSEEK2, GLM_DSA,
MISTRAL4) — Gemma4 graph mode existed pre-PR and works with those flags.
Gate the warning to MLA archs only.

Also refreshes two stale comments left over from the wk_b/wv_b
mirror -> per-head-split rewrite:
- src/llama.cpp llm_prepare_mla: "Replicate wk_b/wv_b ..." now reads
  "Per-head split wk_b/wv_b ..." to match what the materialize lambda
  actually does post-823a39e2.
- src/llama-load-tensors.cpp distribute_mla_tensors_for_split_mode_graph:
  drop the wkv_b row-split mention (wkv_b is no longer created under
  graph mode after 7dd19e19) and correct the wk_b/wv_b distribution
  description (per-head split, not per-device replicated).

---------

Co-authored-by: Kawrakow <iwankawrakow@gmail.com>
2026-05-19 08:36:17 +03:00
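
The difference between the old mirror replicas and the per-head split described in the commit above can be sketched as follows. This is an illustration only: heads are modelled as list items rather than tensor slices, and the per-device head counts are passed in directly as an assumption (the commit derives them from the wo split). All function names are hypothetical.

```python
from itertools import accumulate


def head_offsets(heads_per_device):
    """Start index of each device's head range (exclusive prefix sum)."""
    return [0] + list(accumulate(heads_per_device))[:-1]


def split_per_head(heads, heads_per_device):
    """Give each device only its own contiguous slice of heads."""
    if sum(heads_per_device) != len(heads):
        raise ValueError("per-device head counts must add up to n_head")
    offsets = head_offsets(heads_per_device)
    return [heads[o:o + n] for o, n in zip(offsets, heads_per_device)]


def mirror(heads, n_devices):
    """Old behaviour in this toy model: every device holds all heads."""
    return [list(heads) for _ in range(n_devices)]


heads = [f"head{i}" for i in range(8)]
shards = split_per_head(heads, [2, 2, 2, 2])
replicas = mirror(heads, 4)
```

Each shard holds n_head_local entries, while each mirror replica holds the full n_head; passing the latter where the former is expected is the size mismatch the commit message blames for the crash.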
firecoperana
104846ddee
spec : discard last drafted token with low prob (#1820)
* spec : discard last drafted token with low prob

* Apply suggestion from @ikawrakow

Co-authored-by: Kawrakow <iwankawrakow@gmail.com>

---------

Co-authored-by: firecoperana <firecoperana>
Co-authored-by: Kawrakow <iwankawrakow@gmail.com>
2026-05-19 08:35:35 +03:00
firecoperana
f645ed1e2d
AutoParser: improve reasoning budget and handling of space/newline in tool calls (#1819)
common/chat, server: refactor, move all conversion functions to common, add tests (#20690)

jinja : remove unused header (#22310)

common : fix jinja warnings with clang 21 (#22313)

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

chat: fix handling of space in reasoning markers (#22353)

* chat: fix handling of space in reasoning markers

common : re-arm reasoning budget after DONE on new <think> (#22323)

common : determine generation prompt using longest common prefix (#22657)

common/autoparser: fixes for newline handling / forced tool calls (#22654)

* chat/autoparser: the fixes

* Move optspace() to chat-peg-parser, comment out server tests invalidated due to content now allowed with forced tool calls.

* Trim whitespace on apply instead

common/chat : preserve media markers for typed-content templates (#22634)

common : revert reasoning budget +inf logit bias (#22740)

common : do not wrap raw strings in schema parser for tagged parsers (#22827)

common : enable streaming JSON argument values (#23173)

* common : remove atomic from json arguments

* common : remove parsing logic on JSON arguments

common : do not pass prompt tokens to reasoning budget sampler (#22488)

reasoning-budget: clone should do a deep-copy (#23095)

Co-authored-by: Piotr Wilkin (ilintar) <piotr.wilkin@syndatis.com>
2026-05-19 08:34:19 +03:00
Kawrakow
1f8c603d9c
Quantize: add extra output tensor for MTP (#1810)
* Quantize: add extra output tensor for MTP

* Consistently use --mtp-requantize-output-tensor
2026-05-17 13:59:56 +03:00
Kawrakow
3e573cfea6
MTP: option to use re-quantized output tensor for better TG performance (#1809)
* Option to use re-quantized output tensor for MTP

* Remove quantize extra output option

* Handle interleaved types
2026-05-16 14:40:18 +03:00
Samuel Oliveira Alves
f4f4b3ff26
Allow dual speculative decoding (#1789)
* wip: test logic to use multiple specs

* feat: introduce composite speculative decoding stages

* handle MTP context and draft invalidation

* fix: allow gemma mtp for speculative stages

* fix: normalize spec stage keys

* refactor: remove enable_mtp flag and improve speculative stage handling

* fix: update cached text tokens handling for stage chains

* feat: implement sync for external MTP after non-MTP accept
2026-05-15 10:10:40 +03:00
Samuel Oliveira Alves
40b65d8f54
feat: add support for draft imatrix output file (#1803) 2026-05-15 08:10:58 +03:00
Kawrakow
949bb8f1d6
More MTP tweaks (#1792) 2026-05-13 17:55:43 +03:00
ubergarm
ca52a825db
feat: add --threads-mtmd for independent multimodal thread count (#1797)
Add `-tm` / `--threads-mtmd` to control CPU thread count used during
multimodal image/audio processing (mmproj encoding), separate from the
main LLM thread count.

This allows running the LLM on GPU with minimal CPU threads (e.g. `-t 1`)
to reduce sync overhead, while using many threads (e.g. `-tm 16`) for
CPU-bound mmproj encoding with `--no-mmproj-offload`.

Fallback chain when `-tm` is not specified:
 1. `--threads-batch` (-tb) — multimodal encoding is a batch/prefill-like
    operation, so it makes sense to track with batch thread count
 2. `--threads` (-t) — final default

Works with both mtmd-cli and llama-server.

AI: ubergarm/Qwen3.6-27B-GGUF MTP IQ4_KS 15.113 GiB (4.752 BPW) + pi.dev
2026-05-13 17:44:43 +03:00
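
The fallback chain in the commit above can be sketched as a small resolver. This is a minimal sketch assuming that an unspecified option is represented as None; the real argument parser may use a different sentinel, and the function name is hypothetical.

```python
from typing import Optional


def resolve_mtmd_threads(threads: int,
                         threads_batch: Optional[int] = None,
                         threads_mtmd: Optional[int] = None) -> int:
    """Pick the thread count for multimodal (mmproj) encoding."""
    if threads_mtmd is not None:   # explicit -tm / --threads-mtmd wins
        return threads_mtmd
    if threads_batch is not None:  # fallback 1: -tb / --threads-batch
        return threads_batch
    return threads                 # fallback 2: -t / --threads


gpu_llm_cpu_mmproj = resolve_mtmd_threads(threads=1, threads_mtmd=16)
batch_fallback = resolve_mtmd_threads(threads=1, threads_batch=8)
final_default = resolve_mtmd_threads(threads=4)
```

The first call corresponds to the `-t 1 -tm 16` scenario from the commit message: the LLM keeps a single CPU thread while mmproj encoding uses sixteen.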
ubergarm
f478a3ec0b
fix: only inflate n_batch for GPU-offloaded mmproj, not CPU (#1788)
The get_batch_ubatch() function unconditionally inflated n_batch and
n_ubatch whenever --mmproj was specified, regardless of whether the
mmproj model actually ran on the GPU. This boosted batch size applies
to both the main context and the MTP draft context, since
params_base.speculative.cparams_dft is derived from
common_context_params_to_llama(params_base).

When mmproj runs on CPU (--no-mmproj-offload), this batch inflation
is unnecessary for mmproj itself (CPU compute is sized by image
dimensions independently), but it still inflates the MTP compute buffer
proportionally. For large images (e.g. --image-max-tokens 4096), the
MTP compute buffer ballooned to ~2020 MiB and triggered an OOM even
though the mmproj model was fully on CPU and should have saved VRAM.

Restrict the batch inflation to !params.mmproj.path.empty() &&
params.mmproj_use_gpu so it only triggers when mmproj actually occupies
GPU memory. When mmproj runs on CPU, the existing per-chunk decode
splitting in mtmd_helper_decode_image_chunk_impl handles large images
correctly with the default batch size.

AI: ubergarm/Qwen3.6-27B-GGUF MTP IQ4_KS 15.113 GiB (4.752 BPW) + pi.dev
2026-05-13 09:08:42 +03:00
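
The change to the inflation condition in the commit above amounts to adding one conjunct. The sketch below models only that predicate; the function names are hypothetical, and the parameters stand in for params.mmproj.path and params.mmproj_use_gpu.

```python
def inflate_batch_before(mmproj_path: str) -> bool:
    """Old condition: inflate whenever an mmproj file is given."""
    return bool(mmproj_path)


def inflate_batch_after(mmproj_path: str, mmproj_use_gpu: bool) -> bool:
    """New condition: inflate only if the mmproj actually runs on the GPU."""
    return bool(mmproj_path) and mmproj_use_gpu


# CPU-only mmproj (--no-mmproj-offload): previously inflated, now left alone.
cpu_case_before = inflate_batch_before("mmproj.gguf")
cpu_case_after = inflate_batch_after("mmproj.gguf", mmproj_use_gpu=False)
gpu_case_after = inflate_batch_after("mmproj.gguf", mmproj_use_gpu=True)
```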
Samuel Oliveira Alves
be8435793e
Pre-allocate buffers for hybrid model checkpoints (#1774)
* hybrid-spec: improve recurrent checkpoint handling in speculative decoding

* change per-step save to support scheduling and asynchronous tensor operations

* remove redundant backend tensor fallback

* improve recurrent tensor handling for split graph
2026-05-12 07:21:25 +03:00