4666 Commits

Author SHA1 Message Date
Kawrakow
f81673c7db
Merge pull request #1972 from ikawrakow/ik/minimaxm3_smgraph
Split mode graph for MiniMax-M3
2026-06-15 13:44:19 +02:00
Kawrakow
e927adc4ad
Merge pull request #1969 from Farmadupe/resize_algo_fix
Correct image resize algorithm for all qwens after qwen2vl and gemma4
2026-06-15 13:39:11 +02:00
Kawrakow
00d96744de
Merge pull request #1967 from Farmadupe/stb_image_resize2
Replace image resizers with avx2/neon simd impls from stb_img_resize2.h
2026-06-15 13:38:31 +02:00
Kawrakow
1dc4ea938a
Merge pull request #1962 from ikawrakow/ik/fix_1961
Fix #1961
2026-06-15 13:00:27 +02:00
Kawrakow
c24d50dd88 Split mode graph for MiniMax-M3 2026-06-15 08:41:34 +00:00
Kawrakow
567854aeab
Merge pull request #1963 from jkyamog/minimax-m3-support
Add preliminary MiniMax-M3 support
2026-06-15 10:16:10 +02:00
Jun Yamog
c08d194edd Use standard graph helpers for MiniMax-M3 2026-06-15 02:00:46 +00:00
Jun Yamog
c538210e6d Add MiniMax-M3 chat template 2026-06-15 01:29:13 +00:00
SamuelOliveirads
6cae8c7ba2 clean logs 2026-06-14 21:07:57 -03:00
Thomas Green
19f08160ad Correct image resize algorithm for all qwens after qwen2vl and gemma4 2026-06-14 21:57:11 +01:00
Thomas Green
574f22b3c7 Replace image resizers with avx2/neon simd impls from stb_img_resize2.h 2026-06-14 20:28:08 +01:00
SamuelOliveirads
0d75eee35a remove duplicated code and unnecessary refactor 2026-06-14 16:02:02 -03:00
Kawrakow
4f1ec69ae5
Merge pull request #1965 from Nexesenex/fix_q8_0_graph_reduce_type
CUDA: Fix Q8_0 graph reduce type
2026-06-14 16:32:48 +02:00
Nexesenex
0fdac83272 Fix Q8_0 graph reduce type
Analogous to the BF16 fix in eea6a82b25, this adds proper Q8_0
type handling in ggml_cuda_op_add:

- Add k_add_q8_0_f32 kernel: dequantize Q8_0, add F32, store F32
- Add k_add_q8_0_q8_0_f32 kernel: dequantize two Q8_0, add, store F32
- Add Q8_0+Q8_0/Q8_0+F32/F32+Q8_0 branches in the F32 dst (else) block,
  preventing Q8_0 data from falling through to the incorrect half cast
- Expand Q8_0 dst branch to handle F32+Q8_0->Q8_0 (swapped args), not
  just Q8_0+F32->Q8_0
2026-06-14 16:13:17 +02:00
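
The Q8_0 handling described in the commit above can be illustrated with a CPU-side sketch. This is not the CUDA code from the commit. `BlockQ8_0`, `dequant_q8_0`, and the two `add_*` functions are hypothetical names chosen to mirror the kernel names in the message, and the scale is held as a `float` for brevity (an assumption; ggml stores the Q8_0 scale as fp16).

```cpp
#include <cassert>
#include <cstdint>

// Simplified Q8_0 block: 32 signed 8-bit quants sharing one scale.
// Assumption: a float scale replaces the fp16 scale used by ggml.
constexpr int kBlockSize = 32;

struct BlockQ8_0 {
    float  d;               // per-block scale
    int8_t qs[kBlockSize];  // quantized values
};

// Dequantize element i of a Q8_0 row.
inline float dequant_q8_0(const BlockQ8_0* x, int i) {
    const BlockQ8_0& b = x[i / kBlockSize];
    return b.d * static_cast<float>(b.qs[i % kBlockSize]);
}

// dst = dequant(x) + y, stored as F32 (stand-in for a Q8_0 + F32 kernel).
void add_q8_0_f32(const BlockQ8_0* x, const float* y, float* dst, int n) {
    assert(n % kBlockSize == 0);
    for (int i = 0; i < n; ++i) {
        dst[i] = dequant_q8_0(x, i) + y[i];
    }
}

// dst = dequant(x) + dequant(y), stored as F32 (stand-in for a Q8_0 + Q8_0 kernel).
void add_q8_0_q8_0_f32(const BlockQ8_0* x, const BlockQ8_0* y, float* dst, int n) {
    assert(n % kBlockSize == 0);
    for (int i = 0; i < n; ++i) {
        dst[i] = dequant_q8_0(x, i) + dequant_q8_0(y, i);
    }
}
```

The point of the sketch is that each Q8_0 operand is dequantized through its own block scale before the addition, rather than being reinterpreted as half-precision data.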
Jun Yamog
0df00b3b94 Add preliminary MiniMax-M3 support 2026-06-14 12:23:20 +00:00
Kawrakow
c73bfbe9ce Fix #1961 2026-06-14 07:42:39 +00:00
Kawrakow
670a3f6f5b
Merge pull request #1960 from BeccaLabs/fix/rpc-device-init
fix: initialize rpc_device endpoint and device index before parsing
2026-06-14 08:14:07 +02:00
SamuelOliveirads
3b1a0f88d5 Add logging for DFlash statistics and clean up workspace handling 2026-06-13 20:14:08 -03:00
BECCA-Labs
053202f97a fix: initialize rpc_device endpoint and device index before parsing 2026-06-13 16:13:44 -05:00
SamuelOliveirads
3a1d46c4d1 Merge remote-tracking branch 'origin/main' into feat/dflash-implementation
# Conflicts:
#	common/common.cpp
#	common/speculative.cpp
#	convert_hf_to_gguf.py
#	examples/server/server-context.cpp
#	examples/server/server-context.h
#	src/llama-arch.cpp
#	src/llama-arch.h
#	src/llama-model.cpp
#	src/llama.cpp
2026-06-13 17:27:52 -03:00
Kawrakow
5f917a64b3
Merge pull request #1958 from ikawrakow/ik/handle_think_no_space 2026-06-12 21:27:23 +02:00
Samuel Oliveira Alves
8a38025174
Refactor: Move spec outside server (#1949)
* Refactor speculative decoding: move logic outside of server

* remove duplicated tokens in mtp kv cache

* narrow to only discard draft cells in MTP

* revert mtp_speculative_gen_draft
2026-06-12 18:12:39 +02:00
Farmadupe
d1339249d7
Cleanup: Unify location of m-rope repacking for token and embd (#1924)
* unify location of rope-position-array rewriting prior to ubatching

* Reorder terms.
2026-06-12 08:27:50 +02:00
Simon Lundell
b1eb8bb0a1
server: gate llama_decode_stop() to the active decode (fix queued-cancel cascade) (#1941)
With --parallel 1, a client disconnect/timeout on a *queued* request aborts the
*active* decode of a different client (llama_decode: failed to decode, ret = -3 /
"Decode process is cancelled by user"), releasing the slot with the request
unfinished. To the active client the stream silently stalls and never returns,
while the server reports healthy — easy to misdiagnose as a network/proxy wedge.

Root cause: llama_decode_stop() signals a process-global stop flag that the
active decode loop polls. examples/server/server.cpp calls it *ungated* from the
request reader's connection-closed paths, so any reader closing (including a
queued, not-yet-running task's) trips the global flag against whatever decode is
currently active. Adjacent to #1576/#1673 ("clear sticky stop flag" +
hybrid/recurrent ret=-3), which did not gate these call sites against non-active
readers, so the queued-cancel-kills-active cascade still fires on current main.

Fix (minimal gate): add server_response_reader::any_task_on_slot() and gate the
three llama_decode_stop() sites on it, so the global stop is signalled only when
one of THIS reader's tasks is on a slot (the active decode). A queued task's
disconnect then only drops that queued task. Verified in production under heavy
concurrent, frequently-cancelled load (hundreds of queued-task cancels, zero
active-decode kills). Stdlib-only reproducer in the PR description.

Caveat: any_task_on_slot() reads the slots vector from the reader thread — the
same race class as the existing process-global flag; can be tightened to a
per-context/per-task cancellation if preferred.
2026-06-12 08:25:44 +02:00
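
The gate described in the commit above can be sketched roughly as follows. Only the name `any_task_on_slot()` comes from the commit message. `Slot`, `Reader`, `g_stop_decode`, and `on_connection_closed` are hypothetical stand-ins for the server's real types and for the process-global flag behind `llama_decode_stop()`, and the sketch does no synchronization (the commit itself notes that the real check races with the slots vector).

```cpp
#include <atomic>
#include <set>
#include <vector>

// Hypothetical slot: holds the id of the task being decoded, or -1 when idle.
struct Slot {
    int task_id = -1;
};

// Hypothetical reader: one per client request stream.
struct Reader {
    std::set<int> task_ids;  // tasks whose results this reader waits for

    // True if one of this reader's tasks currently occupies a slot.
    bool any_task_on_slot(const std::vector<Slot>& slots) const {
        for (const Slot& s : slots) {
            if (s.task_id >= 0 && task_ids.count(s.task_id) > 0) {
                return true;
            }
        }
        return false;
    }
};

// Stand-in for the process-global stop flag polled by the decode loop.
std::atomic<bool> g_stop_decode{false};

// Called when a client connection closes. The global stop is signalled
// only if the closing reader owns the task that is actually decoding.
void on_connection_closed(const Reader& reader, const std::vector<Slot>& slots) {
    if (reader.any_task_on_slot(slots)) {
        g_stop_decode.store(true);
    }
}
```

With this shape, a disconnect from a reader whose task is still queued leaves the flag untouched, so the active decode of another client keeps running.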
Marian M.
5fb707d19b
Update docs (#1956)
* Update README.md

Models, MTP, fit

* Update parameters.md

Disclaimer, terms, new flags, graph split list.
2026-06-12 08:24:22 +02:00
Kawrakow
175819b4fb Style 2026-06-12 06:19:06 +00:00
Kawrakow
3dbc3241b9 Handle forced-open reasoning tag without trailing whitespace 2026-06-12 05:43:11 +00:00
Joel Farthing
8d91d3c3d9
common: gate empty-start reasoning extraction (#1955)
Co-authored-by: Joel Farthing <262452229+joelfarthing@users.noreply.github.com>
2026-06-12 07:16:24 +02:00
Kawrakow
022bd00aab
Optimize Cohere2-MoE graph parallel (#1948)
* Optimize Cohere2-MoE graph parallel

* Minor
2026-06-11 07:26:42 +02:00
firecoperana
ca0c1c5f85
fix Qwen3.6 outputs blank <think></think> in response when thinking is off (#1951)
Co-authored-by: firecoperana <firecoperana>
2026-06-11 07:26:07 +02:00
Kawrakow
c0d25e8fa1
Gemma4 E2B/E4B tweaks (#1947)
* Gemma4 E2B/E4B tweaks

* A few more named nodes
2026-06-10 15:28:54 +02:00
Joel Farthing
4a1e2eaa69
model: add Cohere2-MoE North Mini Code support (#1945)
* Add Cohere2 MoE North Mini Code support

* Fix Cohere2 MoE expert tensor emission

* Enhance Cohere2-MoE support by modifying tensor handling and configuration logic

* Fix Cohere2-MoE graph split reduce handling

---------

Co-authored-by: Joel Farthing <262452229+joelfarthing@users.noreply.github.com>
2026-06-10 15:28:27 +02:00
Kawrakow
e6f8112f3b
Adjust CUDA FA kernel parameters for head size 512 on Turing (#1942) 2026-06-10 07:49:21 +02:00
firecoperana
2a1148384c
server: fix double submits of infill (#1944)
Co-authored-by: firecoperana <firecoperana>
2026-06-10 07:48:15 +02:00
Joel Farthing
71d5aa21f7
common: handle Laguna chat delimiters (#1943)
* common: handle Laguna chat delimiters

* common: limit tool parser changes to end-delimited content

---------

Co-authored-by: Joel Farthing <262452229+joelfarthing@users.noreply.github.com>
2026-06-10 07:46:19 +02:00
Kawrakow
366e478cb6
Bug fixes (#1940)
* Bug fixes

* More
2026-06-10 07:45:49 +02:00
Kawrakow
2768b62515
Split mode graph for Laguna (#1939) 2026-06-09 10:13:30 +02:00
Kawrakow
11c3546235
Support for alternative Gemma4 assistant (#1937) 2026-06-09 09:30:12 +02:00
Kawrakow
a38d29232d
CPU FA: disable mask optimization (#1935) 2026-06-09 09:13:19 +02:00
Joel Farthing
bbe1a511ee
model: add Poolside Laguna XS.2 support (#1911)
* llama: register Laguna architecture

* llama: add Laguna graph support

* llama: place Laguna MoE tensors for cpu-moe

* gguf: add Laguna metadata and tokenizer ids

* convert: support Poolside Laguna XS.2

* model: align Laguna RoPE and graph semantics

* model: align Laguna partial offload with review feedback

* model: localize Laguna SWA YaRN defaults

* model: localize Laguna SWA RoPE constants

---------

Co-authored-by: Joel Farthing <262452229+joelfarthing@users.noreply.github.com>
2026-06-08 18:33:12 +02:00
Kawrakow
eea6a82b25
Fix bf16 graph reduce type (#1938) 2026-06-08 16:51:05 +02:00
Kawrakow
1660459db5
CUDA FA: cover Gemma4-4B/2B assistant (#1934) 2026-06-08 08:18:26 +02:00
Kawrakow
b50b0919d5
CPU FA: Check for empty attention mask (#1923) 2026-06-08 07:54:57 +02:00
Joel Farthing
2f2ca7adb1
convert: support Gemma4UnifiedAssistantForCausalLM (#1925)
Co-authored-by: Joel Farthing <262452229+joelfarthing@users.noreply.github.com>
2026-06-08 07:43:43 +02:00
Joel Farthing
3c0f7b2f47
Gemma4: allow missing shared-KV edge tensors (#1927)
Co-authored-by: Joel Farthing <262452229+joelfarthing@users.noreply.github.com>
2026-06-08 07:25:19 +02:00
Farmadupe
6b9de3dbaa
Fix mrope application across chunk boundaries (Fixes #993 and #1902 -- part 2) (#1918)
* (qwen3vl) Correct calculation for injection point of deepstack image embeddings

The injection point for deepstack embeddings used the hyperparameter accessor n_embd_inp(), which double-counted the hidden state and caused an out-of-bounds array access. The correct accessor is n_embd().

* Fix m-rope when pipeline parallelism is enabled
2026-06-05 17:10:02 +02:00
Kawrakow
1b53a58bf9
Enable split mode graph for Gemma4-12B (#1922) 2026-06-05 10:59:22 +02:00
SamuelOliveirads
08e4590dcb implement gpu argmax 2026-06-04 20:45:12 -03:00
Farmadupe
1520eda980
prompt cache: Fix assertion that prompt cache does not rewind to middle of image (#1913) 2026-06-04 17:53:06 +02:00
Chip Bradford
19dcc1f7d1
CUDA : support head_dim 512 with gqa_ratio % 8 (unblocks Gemma 4 12B) (#1921)
The MMA flash-attention dispatcher only instantiated ncols2 = 8 and 4 for
head_dim 512, so any other GQA ratio hit GGML_ABORT. Gemma 4 12B's global
attention layers use head_dim 512 with a 16:1 GQA ratio (16 query heads /
1 KV head), which aborts at load. Because MTP speculative decoding requires
flash attention, this also blocks the Gemma 4 12B MTP drafter entirely.

Instantiating ncols2 = 16 there is not viable: it exceeds the maximum dynamic
shared memory on Ada (cudaFuncSetAttribute returns invalid argument). Instead,
route gqa_ratio % 8 == 0 (covering 8 and 16) through the existing ncols2 = 8
kernel, which already iterates over Q-head groups (iter_z = ceil(gqa_ratio /
ncols2)). gqa_ratio 8 and 4 behavior is unchanged; this mirrors the divisor
dispatch already used for the 576x512 case below.

Verified on RTX 4070 Ti SUPER (Ada, cc 8.9): Gemma 4 12B + MTP drafter now
runs with flash attention; draft acceptance 43-95% by workload, 1.5-2.2x
end-to-end speedup. The 26B-A4B drafter (gqa_ratio 8) is unaffected.

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-04 17:36:10 +02:00
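
The dispatch rule described in the commit above can be sketched as plain host-side arithmetic. `pick_ncols2_hd512` and `iter_z` are hypothetical helper names, and returning 0 for an unsupported ratio stands in for the abort in the real dispatcher. Treating a ratio of 4 as an exact match is an assumption, since the message only says that behavior for 8 and 4 is unchanged.

```cpp
// Choose the ncols2 kernel variant for head_dim 512.
// Returns 0 when no variant applies (the real dispatcher aborts instead).
int pick_ncols2_hd512(int gqa_ratio) {
    if (gqa_ratio > 0 && gqa_ratio % 8 == 0) {
        return 8;  // covers ratios 8 and 16 with the existing ncols2 = 8 kernel
    }
    if (gqa_ratio == 4) {
        return 4;  // assumption: exact match, as before the change
    }
    return 0;
}

// Number of Q-head groups the kernel iterates over: ceil(gqa_ratio / ncols2).
int iter_z(int gqa_ratio, int ncols2) {
    return (gqa_ratio + ncols2 - 1) / ncols2;
}
```

For the 16:1 ratio mentioned in the commit, this selects ncols2 = 8 and two Q-head group iterations, which is how the existing kernel covers the larger ratio without a wider instantiation.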