4666 Commits

Author SHA1 Message Date
Samuel Oliveira Alves
007d640098
Standardize speculative decoding arguments on the server (#1908)
* refactor spec args

* add shell-safe quoting of string-valued stage keys in speculative decoding
2026-06-04 15:44:57 +02:00
firecoperana
6c0180d702
server: enable mcp proxy (#1904)
* update http lib

* Add cors proxy

---------

Co-authored-by: firecoperana <firecoperana>
2026-06-04 15:43:07 +02:00
firecoperana
074fc7dafd
webui: update llamacpp webui (#1903)
update config

ui: fix audio and video modality detection (#23756)

When model props are fetched asynchronously from the server,
modelPropsVersion is incremented to trigger reactivity, but
only the vision effect was listening to it.

webui: update ignore files

ui: handle audio/vnd.wave as audio WAV file (#23754)

Firefox on Linux uses this MIME type

ui: exclude generated build dirs from prettier and eslint so lint errors stop being masked (#23910)

webui: add custom CSS injection via config (#23904)

* webui: add custom CSS injection via config

register a customCSS setting in the Developer section under Custom JSON,
syncable so it rides the existing ui-config pass through. inject the value
into a single style element in the head, reactive on the setting. lets an
operator theme a prebuilt binary through --ui-config without rebuilding,
and lets a user set it from the settings panel.

move the textContent write into a use: action on the head style node.
the action is the idiomatic way to touch a node, so the no-dom-manipulating
lint rule is satisfied without a disable. value stays text through
textContent, never parsed as HTML.

* Update tools/ui/src/lib/constants/settings-keys.ts

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>

* ui: address review from @allozaur, rename custom config key to customJson with migration

rename the custom config key to customJson across the type, the chat
request builder, the settings save check and the custom tools reader,
keeping the custom API param name unchanged. add a non destructive
migration that copies the legacy custom key to customJson at startup.
only render the head style tag when custom CSS is set.

---------

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>

server: real-time reasoning interruption via control endpoint (#23971)

Builds on the manual reasoning budget trigger from #23949. Adds a
CONTROL task that mirrors the CANCEL path on the live slot and calls
common_sampler_reasoning_budget_force to end thinking mid-generation.
POST /v1/chat/completions/control with { id_slot, action }, opt-in
reasoning_control arms the budget sampler on demand. Router and single
model. Minimal WebUI button as a skeleton for further UI work.

* ui: track reasoning phase via explicit streaming state

Add isReasoning to the chat store, mirroring the isLoading pattern:
per conversation map, private setter, public accessor and reactive
export. Set from the stream callbacks, true on reasoning chunks, false
on the first content chunk, reset on stream end and resynced on
conversation switch. The skip button now keys off isReasoning so it
shows only during the thinking phase, not the whole generation.

* ui: extract control endpoint and action into constants

Move the chat completion routes, the slots route and the reasoning
control action out of chat.service into api-endpoints and a dedicated
control-actions module. No behavior change, drops the magic strings so
the control protocol has a single source of truth.

* server: target reasoning control by completion id

Address @ngxson review on the control endpoint.

Switch from id_slot to the chat completion id to avoid a TOCTOU: the
slot can be reassigned between the lookup and the control request, so
matching the live completion (oaicompat_cmpl_id) is safe and a finished
one simply matches nothing. Rename the action to reasoning_end, guard
it on the reasoning_control flag of the target slot, and reduce the
response to {success} with an optional message.
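
A minimal sketch of why matching on the completion id sidesteps the race, assuming a simplified slot record. `slot_sketch` and `find_control_target` are hypothetical names for illustration, not the server's actual types:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical, simplified view of a server slot (not the real server_slot).
struct slot_sketch {
    int         id;                 // slot index, may be reused by a later request
    bool        busy;               // true while a completion is being generated
    bool        reasoning_control;  // opt-in flag that arms the control path
    std::string completion_id;      // id of the completion currently on this slot
};

// Returns the live slot that owns `completion_id` and has opted in,
// or nullptr. A finished or reassigned completion matches nothing.
const slot_sketch * find_control_target(const std::vector<slot_sketch> & slots,
                                        const std::string & completion_id) {
    assert(!completion_id.empty());
    for (const auto & s : slots) {
        if (s.busy && s.completion_id == completion_id) {
            return s.reasoning_control ? &s : nullptr;
        }
    }
    return nullptr;
}
```

With a slot index as the key, a request sent after the slot was reassigned would hit whatever completion now occupies that slot. With the completion id as the key, the same late request finds no match.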

* ui: target reasoning control by completion id

Keep the streamed completion id on the message and post it back to the
control endpoint instead of probing /slots. Drops the slot discovery
and the TOCTOU that came with it. Action renamed to reasoning_end,
response read as {success}.

* server: address review from @ngxson

Move the control fields into task_params and drop the redundant
comments on the control path.

* server: document the reasoning control endpoint

* Update tools/ui/src/lib/types/database.d.ts

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>

* ui: rename cmplId to completionId

Per @allozaur review, clearer name for the streamed completion id.

* ui: wire completion id capture through the agentic flow

The webui streams through the agentic flow, which relayed onModel but
not onCompletionId, so the completion id never reached the message and
the control request was never sent. Relay it through the flow and its
callbacks type, declare id on the chunk type, and log an explicit error
when the button fires without a usable id.

* ui: target reasoning control model from the message

The model is a property of the completion, so read it from the streaming
message like the id, not from the model dropdown which is unrelated UI
state. Makes the request self-consistent by construction instead of just
unlikely to drift.

---------

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>

ui: Add Thinking mode toggle with reasoning effort levels + improvements for Chat Form Add Action UI (#23434)

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>

* fix: Model tags

ui: simplify network error handling (#23431)

Previously error to string conversion was split in two different files,
with one converting errors into strings, and another function analyzing
those strings to generate yet another string.

Now the error handling for network fetches has been centralised and
uses HTTP error codes directly wherever possible to generate the
human-readable error strings.

It also fixes an issue where all JSON errors reported from the backend,
such as "Invalid API key", would incorrectly get turned into
"Failed to connect to server" due to poor matching logic in the
now-gone getErrorMessage function.

update html

ui: Mermaid Diagrams in chat + interactive preview (#24032)

webui: fix tool selector toggle/counter, key tools by stable identity (#24065)

* webui: fix tool selector toggle/counter, key tools by stable identity

Key the disabled set, counts and toggles by a stable per-tool key
instead of bare function name, deduped from one canonical list. Per-tool
checkboxes become presentational (single row handler, no nested button),
category checkboxes drop the tristate (n/total carries partial). One
getEnabledToolsForLLM keeps normalized MCP schemas and dedupes by name.

* ui: use SvelteSet and SvelteMap for local tool collections to satisfy svelte/prefer-svelte-reactivity

Co-authored-by: firecoperana <firecoperana>
2026-06-04 15:41:23 +02:00
Kawrakow
4406e637b5
Split mode graph for Mellum (#1920) 2026-06-04 15:20:41 +02:00
Joel Farthing
dc51c6f9b2
Add Mellum2 architecture support (#1919)
Co-authored-by: Joel Farthing <262452229+joelfarthing@users.noreply.github.com>
2026-06-04 14:28:02 +02:00
Farmadupe
e08ad51f15
Insert image pad markers for kimi K2.5 and K2.6 (#1912) 2026-06-04 09:27:28 +02:00
SamuelOliveirads
dc43cdf06b move dflash to its own file 2026-06-02 10:22:13 -03:00
SamuelOliveirads
1250f522ed add qwen, gemma and kimi dflash support 2026-06-01 17:14:25 -03:00
SamuelOliveirads
3d73312d9d apply workspace support for KV cache 2026-06-01 09:55:34 -03:00
SamuelOliveirads
ed403dca27 Use windows update in kv cache 2026-05-31 14:51:21 -03:00
SamuelOliveirads
1369e68471 fix graph mask, swa layers and tokens positions 2026-05-31 11:12:03 -03:00
SamuelOliveirads
532499836e improve DFlash caching and profiling capabilities 2026-05-30 21:36:10 -03:00
Samuel Oliveira Alves
3f40e73c36
expand np guardrail for all mtp types (#1901) 2026-05-30 16:19:53 +03:00
SamuelOliveirads
9f5f70cf7e implement target position tracking and context management 2026-05-29 23:11:38 -03:00
Kawrakow
8960c5ba5e
Add extra nodes when dealing with MLA and amb (#1899) 2026-05-29 15:17:24 +03:00
Kawrakow
e75337fec3
quantize: add exception for Gemma4 (#1897) 2026-05-29 10:54:21 +03:00
SamuelOliveirads
82cff238fe Initial dflash implementation 2026-05-28 18:57:58 -03:00
Kawrakow
6eff055a0c
GLM-5 MTP (again) (#1890)
* wip: port MTP architecture

Ports the Multi-Token Prediction (MTP) architecture to the older `llama.cpp` codebase used by `ik_llama.cpp`.

Changes include:
- Updating `llama_batch` to support `mtp_params`.
- Modifying `llama_decode_internal` (and `encode`) to handle MTP operations (Warmup, Update, Draft).
- Adding public APIs for MTP state management (`llama_set_draft_input_hidden_state`).
- Adapting the embedding extraction logic to skip MTP update passes.

* Refactors `server_slot` to support generic speculative decoding (MTP or Draft Model).

* core: enable hybrid outputs (logits + embeddings) for MTP support

* fix(mtp): correct KV-cache slot finding for updates

* fix(mtp): persist hidden states to prevent context corruption during drafting

* refactor(mtp): clean unused code

* fix(mtp): update server to new function names

* fix(mtp): fix graph and save hidden state

* mtp: refactor integration, context params and kv cache search

* mtp: fix hidden state extraction and speculative acceptance flow

* server: fix MTP warmup for long prompts and reset token buffer

* llama: refactor MTP operation state to context parameters

* server: fix n_past calculation in MTP acceptance

* llama: fix mtp enable flags

* speculative: refactor MTP to use common_speculative interface

* context: remove unused signatures

* clip: fix deprecated enum-enum conversion warning

* common: fix format string crash in help message

* context: fix mtp activation logic

* llama: always use the extracted embedding

* llama: get all embeddings to kv cache

* llama: revert logit to not run mtp for not supported arch

* llama: allocate all the n_outputs for MTP

* wip

* server-context: get only the last embedding for hidden state

* ggml-backend: fix array out of bounds in debug build

* server-context: run mtp kv update for each prompt batch

* revert segmentation fault fixes

* glm-mtp(feat): optimize graph embedding and recursive drafting

* glm5-mtp(feat): add glm 5 mtp logic

* glm-mtp: standardize the MTP graph

* glm 5 mtp: apply post-layer cvec

* glm 5 mtp: mark head as mandatory

* get normed embeddings for glm 5

* Fix GLM5 MTP

* GLM5 MTP: just reuse the layer attention implementation

* Make MTP work with split mode graph

---------

Co-authored-by: samuel <samueloliveira32df@gmail.com>
2026-05-28 18:14:12 +03:00
Kawrakow
6648aa2e6e Fix Gemma4 vision 2026-05-28 15:08:46 +00:00
Kawrakow
3bf7e836c2
Allow Hadamard transform for head sizes that are not power of 2 (#1883)
* Disable K Hadamard transform if K-head size is not a power of 2

* Allow Hadamard transform for head sizes that are not power of 2

* Give more details why Hadamard is not possible

* Arghh
2026-05-27 18:29:32 +03:00
Kawrakow
d503b046f7
Fix GLM MTP with split mode graph (#1887)
* Fix crash with GLM and MTP

* Fix GLM MTP with split mode graph
2026-05-27 07:24:28 +03:00
Kawrakow
1f66f9912f
Fix crash with GLM and MTP (#1885) 2026-05-27 07:24:05 +03:00
Kawrakow
d2da6da05c
Fix cache loading/saving for MLA models and split mode graph (#1884) 2026-05-26 17:07:40 +03:00
Gearstickle
4fbd0c441b
fa: preserve early-termination, fix multi-slot correctness via union of masks (#1880)
* fa: fix FlashQKV early-termination causing S=0 assertion with --parallel N>1

The backward-scan optimization in compute_helper/compute_helper_q checks
only one mask position per k_step block on the last query row (q_step-1)
to find where valid KV entries end. When q_step > 1 and different query
rows have non-overlapping valid KV regions (multi-slot / --parallel N>1),
the scan on the last row's mask can miss blocks that contain valid entries
for earlier rows. This causes those rows to accumulate S=0, triggering
the GGML_ASSERT(S > 0) in normalize_and_store_1row.

Fix: remove the early-termination scan at all 4 sites and iterate all
nk1/k_step blocks unconditionally. The mask already handles correctness:
fully-masked blocks produce smax=-inf and skip V accumulation, so the
performance cost is minimal for TG (small nq1) and acceptable for PP.

Fixes #809

* fa: refactor multi-slot mask fix into mask_effective_nk1() helper

Replace 4× inlined early-termination scans with a shared helper that
computes the effective K boundary by scanning ALL query mask rows
(union-of-masks). This is the minimal fix for multi-slot parallel
inference where different slots have different sequence lengths.

The helper returns the k_step-aligned boundary covering the longest
active sequence across all rows, preserving single-slot performance
(single row = same boundary as before).
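
The union-of-masks idea can be sketched as follows, assuming a row-major float mask in which -inf marks a masked position. `effective_nk1` is a hypothetical stand-in for illustration; the real helper's signature and mask layout are not shown in this log:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical sketch: mask is row-major [nq][nk], -INFINITY = masked out.
// Returns the k_step-aligned K boundary that covers the last visible
// entry of ANY query row (not just the last row).
int effective_nk1(const std::vector<float> & mask, int nq, int nk, int k_step) {
    assert(nq > 0 && nk > 0 && k_step > 0);
    assert(mask.size() == (size_t) nq * (size_t) nk);
    int last_valid = -1;  // highest K index visible to at least one row
    for (int q = 0; q < nq; ++q) {
        const float * row = mask.data() + (size_t) q * nk;
        // Backward scan; only positions beyond the current maximum matter.
        for (int k = nk - 1; k > last_valid; --k) {
            if (row[k] > -INFINITY) { last_valid = k; break; }
        }
    }
    if (last_valid < 0) return 0;  // everything is masked
    const int bound = (last_valid / k_step + 1) * k_step;  // round up to a block
    return bound < nk ? bound : nk;
}
```

For a single row this reduces to the old last-row scan. With two rows whose visible ranges do not overlap, the boundary follows whichever row reaches furthest, so no row is left with an empty sum.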

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Turbomen008 <Turbomen008@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-05-26 16:16:49 +03:00
Kawrakow
b4e1d916c5
Per GPU fit margin (#1872) 2026-05-25 08:16:45 +03:00
Samuel Oliveira Alves
9f7ba245ab
Update autofix and presets (#1867)
* Add configuration files for format, presets and examples

* add clang in pre-commit config

* remove clang configurations

* Refactor .gitignore for consistency in formatting
2026-05-24 07:30:44 +03:00
Kawrakow
0c45696db4
Minor logging cleanup (#1873) 2026-05-24 07:29:32 +03:00
Kawrakow
809a63bbb7
Fix MLA models with ngl < n_layer (#1870)
* Fix split mode graph with ngl < n_layer (MLA models)

* It is actually not related to split mode graph
2026-05-24 07:29:17 +03:00
dungquixote42
642c038ccd
Extend expiring logit bias to other sampling parameters (#1770)
* initial commit

* fix underflow bug, add debug prints, update macro/variable names

* fix phrases-sharing-1-flag bug, replace macros with struct member function

* cleanup

* fix file parsing

* string_split_open_close() -> string_extract(), improve escape handling

* support multiple nested entries

* make persistent entries global, simplify file parsing

* cosmetic changes

* add support for jumping to exitword

* update variable names

* fix bad search bug

* better debug prints, reorg

* replace lambda with string_is_found(), add string_unescape() for debug

* add support for inline comments

* add missing debug print macro

* fix type promotion bug

* actually fix type promotion bug
2026-05-23 19:19:12 +03:00
Justin Martin
40d8cb196a
llama-quantize: enable --extra-output-tensor with COPY (#1871) 2026-05-23 13:52:34 +03:00
Kawrakow
a6bb509305
Fix split mode graph with ngl < n_layer (#1869) 2026-05-23 12:58:09 +03:00
Kawrakow
3f45ba9387
MTP tweaks 3 (#1862) 2026-05-23 07:23:20 +03:00
Samuel Oliveira Alves
19e09e81d4
Change MTP graph input preparation with additional parameters and validation checks (#1866) 2026-05-23 07:22:04 +03:00
Kawrakow
b3d39cff8b
Fix split mode graph for Qwen35-MoE + MTP (#1861) 2026-05-22 09:23:53 +03:00
thad0ctor
b26521b9ef
Fix raw-vs-local device id confusion under -dev/-devd subsets (#1826)
llm_load_tensors stores `default_layer_device[i]` as a local index into
`model.devices` (consistent with `device_mem[]`, `model.splits[]`, and
all graph-building consumers), but the four
`llama_default_buffer_type_offload(model, default_layer_device[i])`
callsites passed it through as if it were a raw post-CVD device id.
Under `-dev`/`-devd` subsets where `model.devices != {0..N-1}`, this
selected the wrong buffer type. Wrap with `model.devices[...]` to match
the existing `model.devices[main_gpu]` pattern on the adjacent lines.

llama_init_from_model has the same bug for `main_gpu`: every consumer
(auto-fit override at line 3428, MTP clamp, the `model.devices[main_gpu]`
translations at lines 3678/3682, and graph-building `splits[main_gpu]`)
treats it as a local index, but the five single-GPU backend init paths
(CUDA, Vulkan, SYCL, Kompute, CANN) pass `model->main_gpu` straight to
the backend init, which expects a raw device id. e.g. `-dev CUDA1` with
default `--main-gpu 0` and `split_mode=NONE` called
`ggml_backend_cuda_init(0)` instead of `cuda_init(1)`. Compute
`main_gpu_id` once and use it for all five paths.
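
A minimal sketch of the local-to-raw translation described above, assuming the selected devices are held as a list of raw ids. `model_devices_sketch` and `local_to_raw` are hypothetical names, not the loader's actual code:

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for the part of the model that records which raw
// backend devices were selected with -dev.
struct model_devices_sketch {
    std::vector<int> devices;  // devices[local index] = raw device id
};

// Translate a local index (what main_gpu and default_layer_device hold)
// into the raw id that a backend init function expects.
int local_to_raw(const model_devices_sketch & m, int local_idx) {
    assert(local_idx >= 0 && local_idx < (int) m.devices.size());
    return m.devices[local_idx];
}
```

With `-dev CUDA1` the list is `{1}`, so the default `--main-gpu 0` translates to raw device 1, which is the id the single-GPU init path should receive.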
2026-05-22 08:32:52 +03:00
Samuel Oliveira Alves
d51036a0c4
fix: reset KV cache and prompt state in server_slot and server_context (#1860) 2026-05-22 08:14:47 +03:00
Kawrakow
48a55f74e4
Disable split mode graph for Qwen35-MoE when MTP is enabled (#1858) 2026-05-21 16:29:35 +03:00
Kawrakow
4b73de246b
Fix crash with split mode graph and partial offload (#1857) 2026-05-21 13:36:01 +03:00
Kawrakow
c5dc847d0a
Fix Gemma4-E4B compute graph (#1855) 2026-05-21 12:46:28 +03:00
Kawrakow
3dd282358b Fix compiler warnings 2026-05-21 05:40:08 +00:00
Samuel Oliveira Alves
7b73f45541
Add adaptive sampling clone and free functions to manage memory (#1851) 2026-05-21 08:11:17 +03:00
David Young
aefb8bdd99
MLA TP -khad: ggml_dequant_hadamard fused op + wv_b/wk_b_pp Hadamard fold (#1852)
* ggml: ggml_dequant_hadamard fused op for MLA -khad path

Adds a new ggml op that fuses (ggml_cast -> F32) + (ggml_hadamard) into a
single kernel. Reads a quantized (or F16/F32) source and produces a per-
Hadamard-block F32 chunk with the inverse transform applied, without
materializing a full-size F32 intermediate buffer.

Motivation: the MLA pp_opt path in build_deepseek2.cpp un-encodes the
H-applied cache_nope view at every PP call. Today that runs as a cast
(quant -> F32) followed by a separate ggml_hadamard kernel, costing two
full-size F32 passes per layer per rank per call. Fusing them halves
the bandwidth on the un-encode and removes one kernel launch.

CUDA kernels in dequant_hadamard.cu lift the Walsh-Hadamard butterfly
from hadamard.cu and dequant helpers from dequantize.cuh:

  * qr=1 layout (q8_0): consecutive dequant pair, stage 1 fused with load
  * qr=2 layout (q4_0 / q4_1 / q5_0 / q5_1 / q6_0 / iq4_nl): dequant pair
    at stride qk/2, explicit stage 1 after sync
  * F16 has a dedicated kernel
  * F32 source falls back to the standalone Hadamard op

CPU impl in iqk_cpu_ops.cpp composes the existing type_traits.to_float
dequant with fast_ht for graph completeness. nh in {64, 128, 256, 512}.

* MLA-TP: Hadamard pretransform of wv_b/wk_b_pp for -khad

Fold the 64-block orthonormal Hadamard into wv_b and wk_b_pp once at
context init so the pp_opt mul_mats consume the K cache in its on-disk
encoded basis. The per-PP-call cache_nope un-Hadamard is then skipped
(rope half still un-applied — it goes to FA via concat, no wk_b multiply).

Math is identity by H^T H = I: mul_mat(H@wv_b, H@cache) = wv_b^T @ cache.
For mla=2/3 absorb, composes correctly with the existing post-FA
ggml_hadamard(kqv_compressed, 64).
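
The identity can be checked numerically with a small sketch. It assumes an orthonormal (1/sqrt(n)-scaled) Walsh-Hadamard transform over one 64-element block; `fwht` and `dot` are illustrative helpers, not the ggml kernels:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// In-place orthonormal Walsh-Hadamard transform; size must be a power of two.
void fwht(std::vector<float> & v) {
    const size_t n = v.size();
    assert(n > 0 && (n & (n - 1)) == 0);
    for (size_t h = 1; h < n; h <<= 1) {           // butterfly stages
        for (size_t i = 0; i < n; i += 2 * h) {
            for (size_t j = i; j < i + h; ++j) {
                const float a = v[j], b = v[j + h];
                v[j]     = a + b;
                v[j + h] = a - b;
            }
        }
    }
    const float scale = 1.0f / std::sqrt((float) n);  // makes H orthonormal
    for (float & x : v) x *= scale;
}

// Plain dot product, standing in for one output element of a mul_mat.
float dot(const std::vector<float> & a, const std::vector<float> & b) {
    assert(a.size() == b.size());
    float s = 0.0f;
    for (size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}
```

Transforming both a weight row and a cache row leaves their dot product unchanged up to rounding, and applying `fwht` twice returns the original vector, which is the cancellation the fold relies on.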

All-or-nothing across layers under a castable type-allowlist (excludes
1-3 bpw IQ types whose requant blows up beyond PPL noise). Models with
ineligible weights fall back to the runtime un-Hadamard path unchanged.

Composes with the fused ggml_dequant_hadamard op (prior commit): with the
fold active only the rope half still runs the runtime transform, via the
fused kernel.

* MLA-TP: fix TG with -khad after wv_b/wk_b_pp fold

The absorb branch of build_deepseek2_tp_attention applies
ggml_hadamard to kqv_compressed after FA, then multiplies by
wv_b. Pre-fold this was needed because wv_b was un-encoded; with
the wv_b fold (prior commit) the mul_mat already expects
H-encoded kqv_compressed:

  mul_mat(H @ wv_b, kqv_encoded) = wv_b^T @ H @ H @ kqv_unencoded
                                 = wv_b^T @ kqv_unencoded   (H @ H = I)

Skip the post-FA hadamard when model.khad_pretransformed is set
so the two H applications cancel instead of double-applying.

Affects the absorb branch: TG (n_tokens=1), short-context PP
(n_kv < 1024), and models without wk_b_pp. Long-context PP goes
through the pp_opt branch and is unrelated/unchanged.

Reported by @ikawrakow on PR 1852. Verified across mla={1,2,3} x
khad={on,off} x -ctk={q8_0,q4_0} on GLM-4.7-Flash IQ5_K and the
unsloth IQ4_XS variant ik used to reproduce.

* ggml_hadamard: accept F16 and quant sources; drop GGML_OP_DEQUANT_HADAMARD

Per @ikawrakow review on PR 1852: subsume the per-source-type dispatch
into the existing GGML_OP_HADAMARD instead of carrying a separate enum
entry, op constructor, and standalone files.

ggml_hadamard's API is unchanged from the call-site perspective. The
constructor's F32-only assertion is dropped; ggml_cuda_op_hadamard and
iqk_hadamard now dispatch internally:

  - F32 source: existing F32 butterfly (unchanged)
  - F16 source: dedicated kernel
  - q8_0 / q4_0 / q4_1 / q5_0 / q5_1 / q6_0 / iq4_nl: fused dequant +
    butterfly kernel (lifted from the deleted dequant_hadamard.cu)
  - CPU side composes traits.to_float with fast_ht

Net diff: -80 lines. Removes dequant_hadamard.{cu,cuh}, the enum entry,
op table rows, ggml_dequant_hadamard constructor, dispatch cases, and
the DEQUANT_HADAMARD supports_op block.

Verified clean build + TG smoke (mla=3 +khad q8 on GLM-4.7-Flash-IQ4_XS,
same coherent output as prior commit on feat/dequant-hadamard).
2026-05-21 07:29:15 +03:00
Samuel Oliveira Alves
11a1fea9e2
Move embedding management to speculative (#1825)
* refactor speculative decoding with companion context and draft result structures

* feat: add common speculative feature handling in server context

* refactor: move embeddings outside server

* feat: harden draft input hidden state in llama context

* remove unused functions

* refactor: streamline speculative feature handling and remove unused code

* remove redundant code

* remove more unused variables

* refactor: implement speculative feature handling
2026-05-20 17:42:48 +03:00
David Young
dd67a9fb24
MLA TP prompt processing optimisation (#1841)
* MLA TP prompt processing optimisation

Adds a per-rank prompt-processing path to build_deepseek2_tp_attention
that materialises K/V from the compressed latent cache and runs a
standard flash_attn instead of the FlashMLA-3 absorb kernel the TP
attention currently uses for all batch sizes. Affects MLA archs under
-sm graph (DEEPSEEK2, GLM_DSA, MISTRAL4).

Gated on n_tokens >= 128 (set by caller) AND n_kv >= 1024. Below
either threshold the absorb path runs unchanged. Token generation
takes the absorb path; only prompt processing at non-trivial context
materialises.
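
The gate reads as a simple predicate. This is a sketch under the thresholds quoted above; `use_materialised_pp` and its parameter names are hypothetical, and the actual check lives in the graph-building code:

```cpp
#include <cassert>

// Hypothetical sketch of the gate: take the materialise + flash_attn path
// only for large enough batches at non-trivial context; otherwise keep
// the absorb path. min_tokens is supplied by the caller per the commit.
bool use_materialised_pp(int n_tokens, int n_kv,
                         int min_tokens = 128, int min_kv = 1024) {
    assert(n_tokens >= 0 && n_kv >= 0);
    return n_tokens >= min_tokens && n_kv >= min_kv;
}
```

Token generation (`n_tokens = 1`) therefore always stays on the absorb path, as does prompt processing while the cache holds fewer than 1024 entries.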

A second piece pre-computes wk_b in a pp_opt-favouring orientation
(wk_b_pp: [kv_lora_rank, qk_nope, n_head]) at llm_prepare_mla time,
so the per-PP-call materialise can mul_mat against the latent cache
directly without an F16 cast + permute + ggml_cont on wk_b each call.
Path A (wkv_b in GGUF) and Path B (only wk_b/wv_b in GGUF) both
populate wk_b_pp through the standard per-rank replica setup.

Measured on 8x RTX 3090, -sm graph -mla 2 -fa on:

  DSV2.5 IQ2_XS         c=8k  ub=2048   PP +51% to +60%
  GLM-4.7-Flash IQ4_XS  c=32k ub=2048   PP -6% (PP@0) to +77% (PP@30720)
  GLM-5.1 IQ1_S q4_0    c=16k ub=2048   PP +5% to +9%

PPL parity within +/-0.2 noise (DSV2.5 bit-identical 5.3917, GLM-4.7
8.83 vs 8.96, GLM-5.1 6.96 vs 7.00). Token-generation throughput
unchanged within noise.

Compute buffer at init:
  DSV2.5         -54 MiB total       (allocator noise)
  GLM-4.7-Flash  +1042 MiB total     (~+173 MiB per non-output device)
  GLM-5.1        0                   (MoE intermediates dominate)

* MLA TP: respect mla=1 vs mla=3 distinction, rename attn_k_b_pp -> attn_kv_b

ikawrakow/ik_llama.cpp#1841 review feedback: the pp_opt path lost the
intended trade-off where mla=1 forgoes pp_opt to save VRAM and mla=3 pays
the wk_b_pp tensor cost for faster long-context PP.

- llm_prepare_mla second pass: gate wk_b_pp synthesis on mla > 1.
  Models that ship wk_b in their GGUF (mainline format) no longer
  allocate the pp_opt-favoring K weight under mla=1.
- llm_prepare_mla first pass (wk_b synthesis from wkv_b): keep
  unconditional under -sm graph. The wk_b_pp materialization here
  shares the wk_b_f32 intermediate with the wk_b synthesis above, and
  isolating just the wk_b_pp branch leaves the synthesized wk_b in a
  state that makes the absorb path produce inf on some quant combos
  (DSV2.5 IQ2_XS). Trade: the synthesized-wkv_b path still pays the
  wk_b_pp allocation under mla=1, but the bigger compute-buffer
  saving (no pp_opt branch at runtime) still applies.
- build_deepseek2 outer pp_opt: include cparams.mla_attn > 1 in the
  pp_opt definition itself, so mla=1 is bypassed throughout (TP and
  non-TP attention paths).
- build_deepseek2 tp pp_opt: require wk_b_pp present. Drop the dead
  runtime wk_b transpose fallback (unreachable now that wk_b_pp is
  guaranteed when tp_pp_opt fires).
- llama_kv_cache_init: have_wkv_b probe now treats wk_b_pp (attn_kv_b)
  as equivalent to wkv_b for the purposes of allowing mla>1 to stay
  put. Without this, -sm graph models that have wk_b/wv_b separately
  in the GGUF (no combined wkv_b) would silently downgrade to mla=1.
- Rename the synthesized tensor "attn_k_b_pp.weight" -> "attn_kv_b.weight"
  to match the mainline naming ik uses.

GLM-5.1 in particular benefits: its mla=3 PP improvement over mla=1 is
negligible on this arch (~0.4% in our sweeps), so users save the
runtime cost by sticking to mla=1.
2026-05-20 17:03:05 +03:00
Kawrakow
40254a51da
Fix MTP when -no-gr is used (#1848) 2026-05-20 13:38:33 +03:00
Kawrakow
eb597df91f Update AUTHORS 2026-05-20 06:38:39 +00:00
Kawrakow
290935be79
Remove Makefile (#1847) 2026-05-20 09:14:28 +03:00
Kawrakow
6bb3ee3a32
Enable split mode graph for MLA models and partial offload (#1835) 2026-05-20 07:13:55 +03:00
firecoperana
9ae0fb7b2f
Remove reasoning budget logs (#1846)
Co-authored-by: firecoperana <firecoperana>
2026-05-20 07:12:02 +03:00
Samuel Oliveira Alves
77413bc900
Add Hadamard parameters to draft model loading (#1840) 2026-05-19 18:30:41 +03:00