4603 Commits

Author SHA1 Message Date
Nexesenex
3c9680fd3c Fix Minimax M3 crash when -muge merges up/gate experts
The graph builder for Minimax M3 (build_minimaxm3.cpp) was not passing
model.layers[il].ffn_up_gate_exps to llm_build_std_moe_ffn, unlike
Minimax M2 and all other MoE model graph builders.

When -muge (merge_up_gate_experts) is enabled, the merge creates a single
ffn_up_gate_exps tensor with ffn_up_exps and ffn_gate_exps as views.
Only the parent merged tensor gets the split 'extra' pointer set.
Without passing it as up_gate_exps parameter, the function sees null
split pointers for up/gate (the views) while split_down_exps is valid,
causing the assertion at llama-build-context.cpp:1453 to fail.
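
The mismatch described above can be sketched in isolation. This is only an illustration under stated assumptions: `Tensor` and `split_state_consistent` are hypothetical stand-ins, not the real ggml or llama.cpp types, and the consistency rule is inferred from the message rather than copied from llama-build-context.cpp.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for a tensor; not the real ggml_tensor layout.
struct Tensor {
    std::string    name;
    const Tensor * view_src = nullptr; // parent tensor when this one is a view
    const void   * extra    = nullptr; // split metadata, set on parent tensors only
};

// Returns true when up/gate and down agree on whether split metadata exists.
// `up_gate` is the merged parent tensor; pass nullptr to mimic a graph builder
// that does not forward it (the situation described in the commit message).
inline bool split_state_consistent(const Tensor * up_gate,
                                   const Tensor & up,
                                   const Tensor & gate,
                                   const Tensor & down) {
    // With a merged parent, the split metadata lives on the parent, not the views.
    const void * split_up   = up_gate ? up_gate->extra : up.extra;
    const void * split_gate = up_gate ? up_gate->extra : gate.extra;
    const bool has_up_gate  = split_up != nullptr && split_gate != nullptr;
    const bool has_down     = down.extra != nullptr;
    return has_up_gate == has_down;
}
```

With views whose `extra` is null and a `down` tensor that carries split metadata, the check fails unless the merged parent is passed in, which is the shape of the fix.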
2026-06-15 15:00:32 +02:00
Kawrakow
f81673c7db
Merge pull request #1972 from ikawrakow/ik/minimaxm3_smgraph
Split mode graph for MiniMax-M3
2026-06-15 13:44:19 +02:00
Kawrakow
e927adc4ad
Merge pull request #1969 from Farmadupe/resize_algo_fix
Correct image resize algorithm for all qwens after qwen2vl and gemma4
2026-06-15 13:39:11 +02:00
Kawrakow
00d96744de
Merge pull request #1967 from Farmadupe/stb_image_resize2
Replace image resizers with avx2/neon simd impls from stb_img_resize2.h
2026-06-15 13:38:31 +02:00
Kawrakow
1dc4ea938a
Merge pull request #1962 from ikawrakow/ik/fix_1961
Fix #1961
2026-06-15 13:00:27 +02:00
Kawrakow
c24d50dd88 Split mode graph for MiniMax-M3 2026-06-15 08:41:34 +00:00
Kawrakow
567854aeab
Merge pull request #1963 from jkyamog/minimax-m3-support
Add preliminary MiniMax-M3 support
2026-06-15 10:16:10 +02:00
Jun Yamog
c08d194edd Use standard graph helpers for MiniMax-M3 2026-06-15 02:00:46 +00:00
Jun Yamog
c538210e6d Add MiniMax-M3 chat template 2026-06-15 01:29:13 +00:00
Thomas Green
19f08160ad Correct image resize algorithm for all qwens after qwen2vl and gemma4 2026-06-14 21:57:11 +01:00
Thomas Green
574f22b3c7 Replace image resizers with avx2/neon simd impls from stb_img_resize2.h 2026-06-14 20:28:08 +01:00
Kawrakow
4f1ec69ae5
Merge pull request #1965 from Nexesenex/fix_q8_0_graph_reduce_type
CUDA: Fix Q8_0 graph reduce type
2026-06-14 16:32:48 +02:00
Nexesenex
0fdac83272 Fix Q8_0 graph reduce type
Analogous to the BF16 fix in eea6a82b25, this adds proper Q8_0
type handling in ggml_cuda_op_add:

- Add k_add_q8_0_f32 kernel: dequantize Q8_0, add F32, store F32
- Add k_add_q8_0_q8_0_f32 kernel: dequantize two Q8_0, add, store F32
- Add Q8_0+Q8_0/Q8_0+F32/F32+Q8_0 branches in the F32 dst (else) block,
  preventing Q8_0 data from falling through to the incorrect half cast
- Expand Q8_0 dst branch to handle F32+Q8_0->Q8_0 (swapped args), not
  just Q8_0+F32->Q8_0
2026-06-14 16:13:17 +02:00
Jun Yamog
0df00b3b94 Add preliminary MiniMax-M3 support 2026-06-14 12:23:20 +00:00
Kawrakow
c73bfbe9ce Fix #1961 2026-06-14 07:42:39 +00:00
Kawrakow
670a3f6f5b
Merge pull request #1960 from BeccaLabs/fix/rpc-device-init
fix: initialize rpc_device endpoint and device index before parsing
2026-06-14 08:14:07 +02:00
BECCA-Labs
053202f97a fix: initialize rpc_device endpoint and device index before parsing 2026-06-13 16:13:44 -05:00
Kawrakow
5f917a64b3
Merge pull request #1958 from ikawrakow/ik/handle_think_no_space 2026-06-12 21:27:23 +02:00
Samuel Oliveira Alves
8a38025174
Refactor: Move spec outside server (#1949)
* Refactor speculative decoding: move logic outside of server

* remove duplicated tokens in mtp kv cache

* narrow to only discard draft cells in MTP

* revert mtp_speculative_gen_draft
2026-06-12 18:12:39 +02:00
Farmadupe
d1339249d7
Cleanup: Unify location of m-rope repacking for token and embd (#1924)
* unify location of rope-position-array rewriting prior to ubatching

* Reorder terms.
2026-06-12 08:27:50 +02:00
Simon Lundell
b1eb8bb0a1
server: gate llama_decode_stop() to the active decode (fix queued-cancel cascade) (#1941)
With --parallel 1, a client disconnect/timeout on a *queued* request aborts the
*active* decode of a different client (llama_decode: failed to decode, ret = -3 /
"Decode process is cancelled by user"), releasing the slot with the request
unfinished. To the active client the stream silently stalls and never returns,
while the server reports healthy — easy to misdiagnose as a network/proxy wedge.

Root cause: llama_decode_stop() signals a process-global stop flag that the
active decode loop polls. examples/server/server.cpp calls it *ungated* from the
request reader's connection-closed paths, so any reader closing (including a
queued, not-yet-running task's) trips the global flag against whatever decode is
currently active. Adjacent to #1576/#1673 ("clear sticky stop flag" +
hybrid/recurrent ret=-3), which did not gate these call sites against non-active
readers, so the queued-cancel-kills-active cascade still fires on current main.

Fix (minimal gate): add server_response_reader::any_task_on_slot() and gate the
three llama_decode_stop() sites on it, so the global stop is signalled only when
one of THIS reader's tasks is on a slot (the active decode). A queued task's
disconnect then only drops that queued task. Verified in production under heavy
concurrent, frequently-cancelled load (hundreds of queued-task cancels, zero
active-decode kills). Stdlib-only reproducer in the PR description.

Caveat: any_task_on_slot() reads the slots vector from the reader thread — the
same race class as the existing process-global flag; can be tightened to a
per-context/per-task cancellation if preferred.
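
The gate can be sketched as follows. Only the names `any_task_on_slot` and `llama_decode_stop` come from the message; `Slot`, `Reader`, `g_stop` and `on_connection_closed` are hypothetical stand-ins, and the sketch ignores the threading caveat mentioned above.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <unordered_set>
#include <vector>

struct Slot { int id_task = -1; };   // -1 means the slot is idle

std::atomic<bool> g_stop{false};     // stand-in for the process-global stop flag

// Stand-in for a response reader that owns a set of task ids.
struct Reader {
    std::unordered_set<int> task_ids;

    // True only when one of THIS reader's tasks is currently on a slot.
    bool any_task_on_slot(const std::vector<Slot> & slots) const {
        return std::any_of(slots.begin(), slots.end(), [&](const Slot & s) {
            return s.id_task >= 0 && task_ids.count(s.id_task) > 0;
        });
    }

    // Called when the client connection closes. Returns true if the
    // global stop was signalled, false if only queued work was dropped.
    bool on_connection_closed(const std::vector<Slot> & slots) {
        if (!any_task_on_slot(slots)) {
            return false;            // queued task: leave the active decode alone
        }
        g_stop.store(true);          // the real code would call llama_decode_stop()
        return true;
    }
};
```

A reader whose task is still queued leaves the flag untouched, so another client's active decode keeps running.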
2026-06-12 08:25:44 +02:00
Marian M.
5fb707d19b
Update docs (#1956)
* Update README.md

Models, MTP, fit

* Update parameters.md

Disclaimer, terms, new flags, graph split list.
2026-06-12 08:24:22 +02:00
Kawrakow
175819b4fb Style 2026-06-12 06:19:06 +00:00
Kawrakow
3dbc3241b9 Handle forced-open reasoning tag without trailing whitespace 2026-06-12 05:43:11 +00:00
Joel Farthing
8d91d3c3d9
common: gate empty-start reasoning extraction (#1955)
Co-authored-by: Joel Farthing <262452229+joelfarthing@users.noreply.github.com>
2026-06-12 07:16:24 +02:00
Kawrakow
022bd00aab
Optimize Cohere2-MoE graph parallel (#1948)
* Optimize Cohere2-MoE graph parallel

* Minor
2026-06-11 07:26:42 +02:00
firecoperana
ca0c1c5f85
fix Qwen3.6 outputs blank <think></think> in response when thinking is off (#1951)
Co-authored-by: firecoperana <firecoperana>
2026-06-11 07:26:07 +02:00
Kawrakow
c0d25e8fa1
Gemma4 E2B/E4B tweaks (#1947)
* Gemma4 E2B/E4B tweaks

* A few more named nodes
2026-06-10 15:28:54 +02:00
Joel Farthing
4a1e2eaa69
model: add Cohere2-MoE North Mini Code support (#1945)
* Add Cohere2 MoE North Mini Code support

* Fix Cohere2 MoE expert tensor emission

* Enhance Cohere2-MoE support by modifying tensor handling and configuration logic

* Fix Cohere2-MoE graph split reduce handling

---------

Co-authored-by: Joel Farthing <262452229+joelfarthing@users.noreply.github.com>
2026-06-10 15:28:27 +02:00
Kawrakow
e6f8112f3b
Adjust CUDA FA kernel parameters for head size 512 on Turing (#1942) 2026-06-10 07:49:21 +02:00
firecoperana
2a1148384c
server: fix double submits of infill (#1944)
Co-authored-by: firecoperana <firecoperana>
2026-06-10 07:48:15 +02:00
Joel Farthing
71d5aa21f7
common: handle Laguna chat delimiters (#1943)
* common: handle Laguna chat delimiters

* common: limit tool parser changes to end-delimited content

---------

Co-authored-by: Joel Farthing <262452229+joelfarthing@users.noreply.github.com>
2026-06-10 07:46:19 +02:00
Kawrakow
366e478cb6
Bug fixes (#1940)
* Bug fixes

* More
2026-06-10 07:45:49 +02:00
Kawrakow
2768b62515
Split mode graph for Laguna (#1939) 2026-06-09 10:13:30 +02:00
Kawrakow
11c3546235
Support for alternative Gemma4 assistant (#1937) 2026-06-09 09:30:12 +02:00
Kawrakow
a38d29232d
CPU FA: disable mask optimization (#1935) 2026-06-09 09:13:19 +02:00
Joel Farthing
bbe1a511ee
model: add Poolside Laguna XS.2 support (#1911)
* llama: register Laguna architecture

* llama: add Laguna graph support

* llama: place Laguna MoE tensors for cpu-moe

* gguf: add Laguna metadata and tokenizer ids

* convert: support Poolside Laguna XS.2

* model: align Laguna RoPE and graph semantics

* model: align Laguna partial offload with review feedback

* model: localize Laguna SWA YaRN defaults

* model: localize Laguna SWA RoPE constants

---------

Co-authored-by: Joel Farthing <262452229+joelfarthing@users.noreply.github.com>
2026-06-08 18:33:12 +02:00
Kawrakow
eea6a82b25
Fix bf16 graph reduce type (#1938) 2026-06-08 16:51:05 +02:00
Kawrakow
1660459db5
CUDA FA: cover Gemma4-4B/2B assistant (#1934) 2026-06-08 08:18:26 +02:00
Kawrakow
b50b0919d5
CPU FA: Check for empty attention mask (#1923) 2026-06-08 07:54:57 +02:00
Joel Farthing
2f2ca7adb1
convert: support Gemma4UnifiedAssistantForCausalLM (#1925)
Co-authored-by: Joel Farthing <262452229+joelfarthing@users.noreply.github.com>
2026-06-08 07:43:43 +02:00
Joel Farthing
3c0f7b2f47
Gemma4: allow missing shared-KV edge tensors (#1927)
Co-authored-by: Joel Farthing <262452229+joelfarthing@users.noreply.github.com>
2026-06-08 07:25:19 +02:00
Farmadupe
6b9de3dbaa
Fix mrope application across chunk boundaries (Fixes #993 and #1902 -- part 2) (#1918)
* (qwen3vl) Correct calculation for injection point of deepstack image embeddings

The injection point for deepstack embeddings used the hyperparameter accessor n_embd_inp(), which double-counted the hidden state and caused an OOB array access. The correct accessor is n_embd().

* Fix m-rope when pipeline parallelism is enabled
2026-06-05 17:10:02 +02:00
Kawrakow
1b53a58bf9
Enable split mode graph for Gemma4-12B (#1922) 2026-06-05 10:59:22 +02:00
Farmadupe
1520eda980
prompt cache: Fix assertion that prompt cache does not rewind to middle of image (#1913) 2026-06-04 17:53:06 +02:00
Chip Bradford
19dcc1f7d1
CUDA : support head_dim 512 with gqa_ratio % 8 (unblocks Gemma 4 12B) (#1921)
The MMA flash-attention dispatcher only instantiated ncols2 = 8 and 4 for
head_dim 512, so any other GQA ratio hit GGML_ABORT. Gemma 4 12B's global
attention layers use head_dim 512 with a 16:1 GQA ratio (16 query heads /
1 KV head), which aborts at load. Because MTP speculative decoding requires
flash attention, this also blocks the Gemma 4 12B MTP drafter entirely.

Instantiating ncols2 = 16 there is not viable: it exceeds the maximum dynamic
shared memory on Ada (cudaFuncSetAttribute returns invalid argument). Instead,
route gqa_ratio % 8 == 0 (covering 8 and 16) through the existing ncols2 = 8
kernel, which already iterates over Q-head groups (iter_z = ceil(gqa_ratio /
ncols2)). gqa_ratio 8 and 4 behavior is unchanged; this mirrors the divisor
dispatch already used for the 576x512 case below.
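
The divisor dispatch for head_dim 512 can be sketched as a small host-side helper. `Dispatch` and `pick_ncols2_hd512` are hypothetical names, and the handling of a GQA ratio of exactly 4 is an assumption based on "gqa_ratio 8 and 4 behavior is unchanged"; the real dispatcher may differ.

```cpp
#include <cassert>

// ncols2 == 0 marks an unsupported ratio (the real code aborts instead).
struct Dispatch {
    int ncols2;
    int iter_z;
};

// Sketch for head_dim 512 only.
inline Dispatch pick_ncols2_hd512(int gqa_ratio) {
    int ncols2 = 0;
    if (gqa_ratio > 0 && gqa_ratio % 8 == 0) {
        ncols2 = 8;   // covers 8 and 16 with the existing ncols2 = 8 kernel
    } else if (gqa_ratio == 4) {
        ncols2 = 4;   // assumption: the pre-existing ncols2 = 4 path
    }
    if (ncols2 == 0) {
        return {0, 0};
    }
    // iter_z = ceil(gqa_ratio / ncols2): number of Q-head groups the kernel walks.
    return {ncols2, (gqa_ratio + ncols2 - 1) / ncols2};
}
```

For the 16:1 ratio this yields ncols2 = 8 with two Q-head groups, so no ncols2 = 16 instantiation is needed.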

Verified on RTX 4070 Ti SUPER (Ada, cc 8.9): Gemma 4 12B + MTP drafter now
runs with flash attention; draft acceptance 43-95% by workload, 1.5-2.2x
end-to-end speedup. The 26B-A4B drafter (gqa_ratio 8) is unaffected.

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-04 17:36:10 +02:00
Samuel Oliveira Alves
007d640098
Standardize speculative decoding arguments on the server (#1908)
* refactor spec args

* add shell-safe quoting of string-valued stage keys in speculative decoding
2026-06-04 15:44:57 +02:00
firecoperana
6c0180d702
server: enable mcp proxy (#1904)
* update http lib

* Add cors proxy

---------

Co-authored-by: firecoperana <firecoperana>
2026-06-04 15:43:07 +02:00
firecoperana
074fc7dafd
webui: update llamacpp webui (#1903)
update config

ui: fix audio and video modality detection (#23756)

When model props are fetched asynchronously from the server,
modelPropsVersion is incremented to trigger reactivity, but
only the vision effect was listening to it.

webui: update ignore files

ui: handle audio/vnd.wave as audio WAV file (#23754)

Firefox on Linux uses this MIME type

ui: exclude generated build dirs from prettier and eslint so lint errors stop being masked (#23910)

webui: add custom CSS injection via config (#23904)

* webui: add custom CSS injection via config

register a customCSS setting in the Developer section under Custom JSON,
syncable so it rides the existing ui-config pass through. inject the value
into a single style element in the head, reactive on the setting. lets an
operator theme a prebuilt binary through --ui-config without rebuilding,
and lets a user set it from the settings panel.

move the textContent write into a use: action on the head style node.
the action is the idiomatic way to touch a node, so the no-dom-manipulating
lint rule is satisfied without a disable. value stays text through
textContent, never parsed as HTML.

* Update tools/ui/src/lib/constants/settings-keys.ts

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>

* ui: address review from @allozaur, rename custom config key to customJson with migration

rename the custom config key to customJson across the type, the chat
request builder, the settings save check and the custom tools reader,
keeping the custom API param name unchanged. add a non destructive
migration that copies the legacy custom key to customJson at startup.
only render the head style tag when custom CSS is set.

---------

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>

server: real-time reasoning interruption via control endpoint (#23971)

Builds on the manual reasoning budget trigger from #23949. Adds a
CONTROL task that mirrors the CANCEL path on the live slot and calls
common_sampler_reasoning_budget_force to end thinking mid-generation.
POST /v1/chat/completions/control with { id_slot, action }, opt-in
reasoning_control arms the budget sampler on demand. Router and single
model. Minimal WebUI button as a skeleton for further UI work.

* ui: track reasoning phase via explicit streaming state

Add isReasoning to the chat store, mirroring the isLoading pattern:
per conversation map, private setter, public accessor and reactive
export. Set from the stream callbacks, true on reasoning chunks, false
on the first content chunk, reset on stream end and resynced on
conversation switch. The skip button now keys off isReasoning so it
shows only during the thinking phase, not the whole generation.

* ui: extract control endpoint and action into constants

Move the chat completion routes, the slots route and the reasoning
control action out of chat.service into api-endpoints and a dedicated
control-actions module. No behavior change, drops the magic strings so
the control protocol has a single source of truth.

* server: target reasoning control by completion id

Address @ngxson review on the control endpoint.

Switch from id_slot to the chat completion id to avoid a TOCTOU: the
slot can be reassigned between the lookup and the control request, so
matching the live completion (oaicompat_cmpl_id) is safe and a finished
one simply matches nothing. Rename the action to reasoning_end, guard
it on the reasoning_control flag of the target slot, and reduce the
response to {success} with an optional message.
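
A minimal sketch of matching on the completion id instead of the slot index, assuming a simplified slot record. `Slot`, its fields and the free function `reasoning_end` are hypothetical; only the ideas (match the live completion id, guard on the reasoning_control flag, report success) come from the message.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical, simplified slot state.
struct Slot {
    std::string cmpl_id;                  // id of the completion the slot is serving
    bool running           = false;
    bool reasoning_control = false;       // opt-in flag set by the request
    bool reasoning_ended   = false;
};

// Returns the {success} value for a reasoning_end control request.
inline bool reasoning_end(std::vector<Slot> & slots, const std::string & cmpl_id) {
    for (Slot & s : slots) {
        if (s.running && s.cmpl_id == cmpl_id) {
            if (!s.reasoning_control) {
                return false;             // target did not opt in
            }
            s.reasoning_ended = true;     // the real code forces the budget sampler
            return true;
        }
    }
    return false;                         // finished or unknown completion: matches nothing
}
```

Because the lookup key is the completion id, a slot that has been reassigned to a different request simply does not match, which removes the TOCTOU window that a slot index would leave open.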

* ui: target reasoning control by completion id

Keep the streamed completion id on the message and post it back to the
control endpoint instead of probing /slots. Drops the slot discovery
and the TOCTOU that came with it. Action renamed to reasoning_end,
response read as {success}.

* server: address review from @ngxson

Move the control fields into task_params and drop the redundant
comments on the control path.

* server: document the reasoning control endpoint

* Update tools/ui/src/lib/types/database.d.ts

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>

* ui: rename cmplId to completionId

Per @allozaur review, clearer name for the streamed completion id.

* ui: wire completion id capture through the agentic flow

The webui streams through the agentic flow, which relayed onModel but
not onCompletionId, so the completion id never reached the message and
the control request was never sent. Relay it through the flow and its
callbacks type, declare id on the chunk type, and log an explicit error
when the button fires without a usable id.

* ui: target reasoning control model from the message

The model is a property of the completion, so read it from the streaming
message like the id, not from the model dropdown which is unrelated UI
state. Makes the request self-consistent by construction instead of just
unlikely to drift.

---------

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>

ui: Add Thinking mode toggle with reasoning effort levels + improvements for Chat Form Add Action UI (#23434)

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>

* fix: Model tags

ui: simplify network error handling (#23431)

Previously, error-to-string conversion was split across two files: one
converted errors into strings, and another function analyzed those
strings to generate yet another string.

Now the error handling for network fetches has been centralised and
uses HTTP error codes directly wherever possible to generate the
human-readable error strings.

It also fixes an issue where all JSON errors reported from the backend,
such as "Invalid API key", were incorrectly turned into
"Failed to connect to server" due to poor matching logic in the
now-gone getErrorMessage function.

update html

ui: Mermaid Diagrams in chat + interactive preview (#24032)

webui: fix tool selector toggle/counter, key tools by stable identity (#24065)

* webui: fix tool selector toggle/counter, key tools by stable identity

Key the disabled set, counts and toggles by a stable per-tool key
instead of bare function name, deduped from one canonical list. Per-tool
checkboxes become presentational (single row handler, no nested button),
category checkboxes drop the tristate (n/total carries partial). One
getEnabledToolsForLLM keeps normalized MCP schemas and dedupes by name.

* ui: use SvelteSet and SvelteMap for local tool collections to satisfy svelte/prefer-svelte-reactivity

Co-authored-by: firecoperana <firecoperana>
2026-06-04 15:41:23 +02:00
Kawrakow
4406e637b5
Split mode graph for Mellum (#1920) 2026-06-04 15:20:41 +02:00