403 Commits

Author SHA1 Message Date
Kawrakow
3cf0f5468f Also these 2026-06-19 15:24:24 +00:00
gapeleon
4f220159b8
Fix (Gemma-4 Vision): Correct KQ mask fill for causal models in non-causal flash-attn mode (#1985)
When llama_set_causal_attn(false) is called on a causal model (e.g.
Gemma-4 during vision image decode), llama_set_inputs took the non-causal
else-branch (designed for pure embedding models).

That path wrote the F16 mask with stride n_tokens instead of n_kv, and iterated batch
indices rather than KV cache cells.

The result was that every image query row beyond the first was
written at the wrong offset, leaving stale -inf values from
previous decodes visible to the GPU kernel. Any conversation
that had built up prior KV mask data would produce all-inf attention scores
for most image tokens, collapsing softmax to NaN and aborting at sampling.
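
The stride mismatch can be illustrated with a toy flat mask buffer. This is a minimal sketch, not the actual `llama_set_inputs` code: the helpers `fill_mask` and `row`, the tiny sizes, and the "everything visible" rule are assumptions made for illustration only.

```python
NEG_INF = float("-inf")

def fill_mask(buf, n_rows, stride):
    """Mark `stride` cells as visible (0.0) for each query row, using `stride` as row pitch."""
    for i in range(n_rows):        # query rows
        for j in range(stride):    # cells written per row
            buf[i * stride + j] = 0.0
    return buf

n_tokens, n_kv = 3, 8
# Stale buffer left over from a previous decode: every cell masked out.
stale = [NEG_INF] * (n_tokens * n_kv)

buggy = fill_mask(list(stale), n_tokens, stride=n_tokens)  # row pitch n_tokens (wrong)
fixed = fill_mask(list(stale), n_tokens, stride=n_kv)      # row pitch n_kv (what the kernel reads)

def row(buf, i):
    # The attention kernel always reads query row i at offset i * n_kv.
    return buf[i * n_kv:(i + 1) * n_kv]

for i in range(n_tokens):
    print(i, row(buggy, i), row(fixed, i))
```

With the wrong pitch, rows after the first still contain stale `-inf` cells when read at the `n_kv` pitch, which is the failure mode described above.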

Resolves #1984
2026-06-17 16:52:45 +02:00
Kawrakow
f9078e169b
Merge pull request #1970 from SamuelOliveirads/feat/dflash-implementation
Add DFlash support
2026-06-16 15:07:55 +02:00
SamuelOliveirads
6cae8c7ba2 clean logs 2026-06-14 21:07:57 -03:00
SamuelOliveirads
0d75eee35a remove duplicated code and unnecessary refactor 2026-06-14 16:02:02 -03:00
Jun Yamog
0df00b3b94 Add preliminary MiniMax-M3 support 2026-06-14 12:23:20 +00:00
BECCA-Labs
053202f97a fix: initialize rpc_device endpoint and device index before parsing 2026-06-13 16:13:44 -05:00
SamuelOliveirads
3a1d46c4d1 Merge remote-tracking branch 'origin/main' into feat/dflash-implementation
# Conflicts:
#	common/common.cpp
#	common/speculative.cpp
#	convert_hf_to_gguf.py
#	examples/server/server-context.cpp
#	examples/server/server-context.h
#	src/llama-arch.cpp
#	src/llama-arch.h
#	src/llama-model.cpp
#	src/llama.cpp
2026-06-13 17:27:52 -03:00
Farmadupe
d1339249d7
Cleanup: Unify location of m-rope repacking for token and embd (#1924)
* unify location of rope-position-array rewriting prior to ubatching

* Reorder terms.
2026-06-12 08:27:50 +02:00
Joel Farthing
4a1e2eaa69
model: add Cohere2-MoE North Mini Code support (#1945)
* Add Cohere2 MoE North Mini Code support

* Fix Cohere2 MoE expert tensor emission

* Enhance Cohere2-MoE support by modifying tensor handling and configuration logic

* Fix Cohere2-MoE graph split reduce handling

---------

Co-authored-by: Joel Farthing <262452229+joelfarthing@users.noreply.github.com>
2026-06-10 15:28:27 +02:00
Kawrakow
366e478cb6
Bug fixes (#1940)
* Bug fixes

* More
2026-06-10 07:45:49 +02:00
Kawrakow
2768b62515
Split mode graph for Laguna (#1939) 2026-06-09 10:13:30 +02:00
Kawrakow
11c3546235
Support for alternative Gemma4 assistant (#1937) 2026-06-09 09:30:12 +02:00
Joel Farthing
bbe1a511ee
model: add Poolside Laguna XS.2 support (#1911)
* llama: register Laguna architecture

* llama: add Laguna graph support

* llama: place Laguna MoE tensors for cpu-moe

* gguf: add Laguna metadata and tokenizer ids

* convert: support Poolside Laguna XS.2

* model: align Laguna RoPE and graph semantics

* model: align Laguna partial offload with review feedback

* model: localize Laguna SWA YaRN defaults

* model: localize Laguna SWA RoPE constants

---------

Co-authored-by: Joel Farthing <262452229+joelfarthing@users.noreply.github.com>
2026-06-08 18:33:12 +02:00
Farmadupe
6b9de3dbaa
Fix mrope application across chunk boundaries (Fixes #993 and #1902 -- part 2) (#1918)
* (qwen3vl) Correct calculation for injection point of deepstack image embeddings

The injection point for deepstack embeddings used the hyperparameter accessor n_embd_inp(), which counted the hidden state twice and caused an out-of-bounds array access. The correct accessor is n_embd().

* Fix m-rope when pipeline parallelism is enabled
2026-06-05 17:10:02 +02:00
SamuelOliveirads
08e4590dcb implement gpu argmax 2026-06-04 20:45:12 -03:00
Kawrakow
4406e637b5
Split mode graph for Mellum (#1920) 2026-06-04 15:20:41 +02:00
Joel Farthing
dc51c6f9b2
Add Mellum2 architecture support (#1919)
Co-authored-by: Joel Farthing <262452229+joelfarthing@users.noreply.github.com>
2026-06-04 14:28:02 +02:00
SamuelOliveirads
dc43cdf06b move dflash to its own file 2026-06-02 10:22:13 -03:00
SamuelOliveirads
3d73312d9d apply workspace support for KV cache 2026-06-01 09:55:34 -03:00
SamuelOliveirads
ed403dca27 Use windows update in kv cache 2026-05-31 14:51:21 -03:00
SamuelOliveirads
1369e68471 fix graph mask, swa layers and tokens positions 2026-05-31 11:12:03 -03:00
SamuelOliveirads
532499836e improve DFlash caching and profiling capabilities 2026-05-30 21:36:10 -03:00
SamuelOliveirads
9f5f70cf7e implement target position tracking and context management 2026-05-29 23:11:38 -03:00
Kawrakow
8960c5ba5e
Add extra nodes when dealing with MLA and amb (#1899) 2026-05-29 15:17:24 +03:00
SamuelOliveirads
82cff238fe Initial dflash implementation 2026-05-28 18:57:58 -03:00
Kawrakow
6eff055a0c
GLM-5 MTP (again) (#1890)
* wip: port MTP architecture

Ports the Multi-Token Prediction (MTP) architecture to the older `llama.cpp` codebase used by `ikllama`.

Changes include:
- Updating `llama_batch` to support `mtp_params`.
- Modifying `llama_decode_internal` (and `encode`) to handle MTP operations (Warmup, Update, Draft).
- Adding public APIs for MTP state management (`llama_set_draft_input_hidden_state`).
- Adapting the embedding extraction logic to skip MTP update passes.

* Refactors `server_slot` to support generic speculative decoding (MTP or Draft Model).

* core: enable hybrid outputs (logits + embeddings) for MTP support

* fix(mtp): correct KV-cache slot finding for updates

* fix(mtp): persist hidden states to prevent context corruption during drafting

* refactor(mtp): clean unused code

* fix(mtp): update server to new functions name

* fix(mtp): fix graph and save hidden state

* mtp: refactor integration, context params and kv cache search

* mtp: fix hidden state extraction and speculative acceptance flow

* server: fix MTP warmup for long prompts and reset token buffer

* llama: refactor MTP operation state to context parameters

* server: fix n_past calculation in MTP acceptance

* llama: fix mtp enable flags

* speculative: refactor MTP to use common_speculative interface

* context: remove unused signatures

* clip: fix deprecated enum-enum conversion warning

* common: fix format string crash in help message

* context: fix mtp activation logic

* llama: always use the extracted embedding

* llama: get all embeddings to kv cache

* llama: revert logit to not run mtp for not supported arch

* llama: allocate all the n_outputs for MTP

* wip

* server-context: get only the last embedding for hidden state

* ggml-backend: fix array out of bounds in debug build

* server-context: run mtp kv update for each prompt batch

* revert segmentation fault fixes

* glm-mtp(feat): optimize graph embedding and recursive drafting

* glm5-mtp(feat): add glm 5 mtp logic

* glm-mtp: standardize the MTP graph

* glm 5 mtp: apply post-layer cvec

* glm 5 mtp: mark head as mandatory

* get normed embeddings for glm 5

* Fix GLM5 MTP

* GLM5 MTP: just reuse the layer attention implementation

* Make MTP work with split mode graph

---------

Co-authored-by: samuel <samueloliveira32df@gmail.com>
2026-05-28 18:14:12 +03:00
Kawrakow
3bf7e836c2
Allow Hadamard transform for head sizes that are not power of 2 (#1883)
* Disable K Hadamard transform if K-head size is not a power of 2

* Allow Hadamard transform for head sizes that are not power of 2

* Give more details why Hadamard is not possible

* Arghh
2026-05-27 18:29:32 +03:00
Kawrakow
d2da6da05c
Fix cache loading/saving for MLA models and split mode graph (#1884) 2026-05-26 17:07:40 +03:00
Kawrakow
b4e1d916c5
Per GPU fit margin (#1872) 2026-05-25 08:16:45 +03:00
Kawrakow
0c45696db4
Minor logging cleanup (#1873) 2026-05-24 07:29:32 +03:00
Kawrakow
809a63bbb7
Fix MLA models with ngl < n_layer (#1870)
* Fix split mode graph with ngl < n_layer (MLA models)

* It is actually not related to split mode graph
2026-05-24 07:29:17 +03:00
Kawrakow
a6bb509305
Fix split mode graph with ngl < n_layer (#1869) 2026-05-23 12:58:09 +03:00
Kawrakow
3f45ba9387
MTP tweaks 3 (#1862) 2026-05-23 07:23:20 +03:00
Samuel Oliveira Alves
19e09e81d4
Change MTP graph input preparation with additional parameters and validation checks (#1866) 2026-05-23 07:22:04 +03:00
Kawrakow
b3d39cff8b
Fix split mode graph for Qwen35-MoE + MTP (#1861) 2026-05-22 09:23:53 +03:00
thad0ctor
b26521b9ef
Fix raw-vs-local device id confusion under -dev/-devd subsets (#1826)
llm_load_tensors stores `default_layer_device[i]` as a local index into
`model.devices` (consistent with `device_mem[]`, `model.splits[]`, and
all graph-building consumers), but the four
`llama_default_buffer_type_offload(model, default_layer_device[i])`
callsites passed it through as if it were a raw post-CVD device id.
Under `-dev`/`-devd` subsets where `model.devices != {0..N-1}`, this
selected the wrong buffer type. Wrap with `model.devices[...]` to match
the existing `model.devices[main_gpu]` pattern on the adjacent lines.

llama_init_from_model has the same bug for `main_gpu`: every consumer
(auto-fit override at line 3428, MTP clamp, the `model.devices[main_gpu]`
translations at lines 3678/3682, and graph-building `splits[main_gpu]`)
treats it as a local index, but the five single-GPU backend init paths
(CUDA, Vulkan, SYCL, Kompute, CANN) pass `model->main_gpu` straight to
the backend init, which expects a raw device id. e.g. `-dev CUDA1` with
default `--main-gpu 0` and `split_mode=NONE` called
`ggml_backend_cuda_init(0)` instead of `cuda_init(1)`. Compute
`main_gpu_id` once and use it for all five paths.
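
The local-versus-raw distinction can be sketched as a plain index translation. This is an illustrative sketch only: `raw_device_id` is a hypothetical helper, and the `devices` list stands in for `model.devices` under an assumed `-dev CUDA1` selection.

```python
def raw_device_id(devices, local_index):
    """Translate a local index into `devices` to the raw backend device id."""
    return devices[local_index]

# Assumed subset selection: only raw device 1 is in use, so the list has one entry.
devices = [1]
main_gpu = 0  # default --main-gpu, a local index into `devices`

buggy_id = main_gpu                          # local index passed through as a raw id
fixed_id = raw_device_id(devices, main_gpu)  # translated before backend init

print(buggy_id, fixed_id)
```

Computing the translated id once and reusing it keeps every backend init path consistent with the consumers that treat `main_gpu` as a local index.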
2026-05-22 08:32:52 +03:00
Kawrakow
48a55f74e4
Disable split mode graph for Qwen35-MoE when MTP is enabled (#1858) 2026-05-21 16:29:35 +03:00
Samuel Oliveira Alves
7b73f45541
Add adaptive sampling clone and free functions to manage memory (#1851) 2026-05-21 08:11:17 +03:00
David Young
aefb8bdd99
MLA TP -khad: ggml_dequant_hadamard fused op + wv_b/wk_b_pp Hadamard fold (#1852)
* ggml: ggml_dequant_hadamard fused op for MLA -khad path

Adds a new ggml op that fuses (ggml_cast -> F32) + (ggml_hadamard) into a
single kernel. Reads a quantized (or F16/F32) source and produces a per-
Hadamard-block F32 chunk with the inverse transform applied, without
materializing a full-size F32 intermediate buffer.

Motivation: the MLA pp_opt path in build_deepseek2.cpp un-encodes the
H-applied cache_nope view at every PP call. Today that runs as a cast
(quant -> F32) followed by a separate ggml_hadamard kernel, costing two
full-size F32 passes per layer per rank per call. Fusing them halves
the bandwidth on the un-encode and removes one kernel launch.

CUDA kernels in dequant_hadamard.cu lift the Walsh-Hadamard butterfly
from hadamard.cu and dequant helpers from dequantize.cuh:

  * qr=1 layout (q8_0): consecutive dequant pair, stage 1 fused with load
  * qr=2 layout (q4_0 / q4_1 / q5_0 / q5_1 / q6_0 / iq4_nl): dequant pair
    at stride qk/2, explicit stage 1 after sync
  * F16 has a dedicated kernel
  * F32 source falls back to the standalone Hadamard op

CPU impl in iqk_cpu_ops.cpp composes the existing type_traits.to_float
dequant with fast_ht for graph completeness. nh in {64, 128, 256, 512}.

* MLA-TP: Hadamard pretransform of wv_b/wk_b_pp for -khad

Fold the 64-block orthonormal Hadamard into wv_b and wk_b_pp once at
context init so the pp_opt mul_mats consume the K cache in its on-disk
encoded basis. The per-PP-call cache_nope un-Hadamard is then skipped
(rope half still un-applied — it goes to FA via concat, no wk_b multiply).

Math is identity by H^T H = I: mul_mat(H@wv_b, H@cache) = wv_b^T @ cache.
For mla=2/3 absorb, composes correctly with the existing post-FA
ggml_hadamard(kqv_compressed, 64).
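
The identity behind the fold can be checked numerically on a single 64-element block. This is a minimal sketch in pure Python, not the ggml kernels: `hadamard` and `dot` are hypothetical helpers, and the random vectors merely stand in for one row of `wv_b` and one row of the cache.

```python
import math
import random

def hadamard(x):
    """Orthonormal Walsh-Hadamard transform of a list whose length is a power of 2."""
    y = list(x)
    n = len(y)
    h = 1
    while h < n:
        # Butterfly stage: combine elements that are h apart.
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = y[j], y[j + h]
                y[j], y[j + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)  # makes the transform orthonormal, so H applied twice is identity
    return [v * scale for v in y]

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

random.seed(0)
w = [random.uniform(-1, 1) for _ in range(64)]  # stand-in for one weight row
c = [random.uniform(-1, 1) for _ in range(64)]  # stand-in for one cache row

plain = dot(w, c)                        # product in the original basis
folded = dot(hadamard(w), hadamard(c))   # both operands in the Hadamard basis

print(abs(plain - folded))
```

Because the transform is orthonormal, transforming both operands leaves the dot product unchanged up to floating-point rounding, which is why the cache can stay in its encoded basis once the weights are pre-transformed.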

All-or-nothing across layers under a castable type-allowlist (excludes
1-3 bpw IQ types whose requant blows up beyond PPL noise). Models with
ineligible weights fall back to the runtime un-Hadamard path unchanged.

Composes with the fused ggml_dequant_hadamard op (prior commit): with the
fold active only the rope half still runs the runtime transform, via the
fused kernel.

* MLA-TP: fix TG with -khad after wv_b/wk_b_pp fold

The absorb branch of build_deepseek2_tp_attention applies
ggml_hadamard to kqv_compressed after FA, then multiplies by
wv_b. Pre-fold this was needed because wv_b was un-encoded; with
the wv_b fold (prior commit) the mul_mat already expects
H-encoded kqv_compressed:

  mul_mat(H @ wv_b, kqv_encoded) = wv_b^T @ H @ H @ kqv_unencoded
                                 = wv_b^T @ kqv_unencoded   (H @ H = I)

Skip the post-FA hadamard when model.khad_pretransformed is set
so the two H applications cancel instead of double-applying.

Affects the absorb branch: TG (n_tokens=1), short-context PP
(n_kv < 1024), and models without wk_b_pp. Long-context PP goes
through the pp_opt branch and is unrelated/unchanged.

Reported by @ikawrakow on PR 1852. Verified across mla={1,2,3} x
khad={on,off} x -ctk={q8_0,q4_0} on GLM-4.7-Flash IQ5_K and the
unsloth IQ4_XS variant ik used to reproduce.

* ggml_hadamard: accept F16 and quant sources; drop GGML_OP_DEQUANT_HADAMARD

Per @ikawrakow review on PR 1852: subsume the per-source-type dispatch
into the existing GGML_OP_HADAMARD instead of carrying a separate enum
entry, op constructor, and standalone files.

ggml_hadamard's API is unchanged from the call-site perspective. The
constructor's F32-only assertion is dropped; ggml_cuda_op_hadamard and
iqk_hadamard now dispatch internally:

  - F32 source: existing F32 butterfly (unchanged)
  - F16 source: dedicated kernel
  - q8_0 / q4_0 / q4_1 / q5_0 / q5_1 / q6_0 / iq4_nl: fused dequant +
    butterfly kernel (lifted from the deleted dequant_hadamard.cu)
  - CPU side composes traits.to_float with fast_ht

Net diff: -80 lines. Removes dequant_hadamard.{cu,cuh}, the enum entry,
op table rows, ggml_dequant_hadamard constructor, dispatch cases, and
the DEQUANT_HADAMARD supports_op block.

Verified clean build + TG smoke (mla=3 +khad q8 on GLM-4.7-Flash-IQ4_XS,
same coherent output as prior commit on feat/dequant-hadamard).
2026-05-21 07:29:15 +03:00
Samuel Oliveira Alves
11a1fea9e2
Move embedding management to speculative (#1825)
* refactor speculative decoding with companion context and draft result structures

* feat: add common speculative feature handling in server context

* refactor: move embeddings outside server

* feat: harden draft input hidden state in llama context

* remove unused functions

* refactor: streamline speculative feature handling and remove unused code

* remove redundant code

* remove more unused variables

* refactor: implement speculative feature handling
2026-05-20 17:42:48 +03:00
David Young
dd67a9fb24
MLA TP prompt processing optimisation (#1841)
* MLA TP prompt processing optimisation

Adds a per-rank prompt-processing path to build_deepseek2_tp_attention
that materialises K/V from the compressed latent cache and runs a
standard flash_attn instead of the FlashMLA-3 absorb kernel the TP
attention currently uses for all batch sizes. Affects MLA archs under
-sm graph (DEEPSEEK2, GLM_DSA, MISTRAL4).

Gated on n_tokens >= 128 (set by caller) AND n_kv >= 1024. Below
either threshold the absorb path runs unchanged. Token generation
takes the absorb path; only prompt processing at non-trivial context
materialises.
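
The gating rule can be written as a small predicate. This is a sketch under stated assumptions: `use_materialized_path` and the constant names are hypothetical, and only the two thresholds quoted above are taken from the commit message.

```python
PP_MIN_TOKENS = 128  # batch-size threshold quoted above (set by the caller)
PP_MIN_KV = 1024     # context threshold quoted above

def use_materialized_path(n_tokens, n_kv):
    """True when K/V would be materialised; False means the absorb path runs unchanged."""
    return n_tokens >= PP_MIN_TOKENS and n_kv >= PP_MIN_KV

print(use_materialized_path(1, 8192))     # token generation
print(use_materialized_path(2048, 512))   # large batch, short context
print(use_materialized_path(2048, 8192))  # prompt processing at long context
```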

A second piece pre-computes wk_b in a pp_opt-favouring orientation
(wk_b_pp: [kv_lora_rank, qk_nope, n_head]) at llm_prepare_mla time,
so the per-PP-call materialise can mul_mat against the latent cache
directly without an F16 cast + permute + ggml_cont on wk_b each call.
Path A (wkv_b in GGUF) and Path B (only wk_b/wv_b in GGUF) both
populate wk_b_pp through the standard per-rank replica setup.

Measured on 8x RTX 3090, -sm graph -mla 2 -fa on:

  DSV2.5 IQ2_XS         c=8k  ub=2048   PP +51% to +60%
  GLM-4.7-Flash IQ4_XS  c=32k ub=2048   PP -6% (PP@0) to +77% (PP@30720)
  GLM-5.1 IQ1_S q4_0    c=16k ub=2048   PP +5% to +9%

PPL parity within +/-0.2 noise (DSV2.5 bit-identical 5.3917, GLM-4.7
8.83 vs 8.96, GLM-5.1 6.96 vs 7.00). Token-generation throughput
unchanged within noise.

Compute buffer at init:
  DSV2.5         -54 MiB total       (allocator noise)
  GLM-4.7-Flash  +1042 MiB total     (~+173 MiB per non-output device)
  GLM-5.1        0                   (MoE intermediates dominate)

* MLA TP: respect mla=1 vs mla=3 distinction, rename attn_k_b_pp -> attn_kv_b

ikawrakow/ik_llama.cpp#1841 review feedback: the pp_opt path lost the
intended trade-off where mla=1 forgoes pp_opt to save VRAM and mla=3 pays
the wk_b_pp tensor cost for faster long-context PP.

- llm_prepare_mla second pass: gate wk_b_pp synthesis on mla > 1.
  Models that ship wk_b in their GGUF (mainline format) no longer
  allocate the pp_opt-favoring K weight under mla=1.
- llm_prepare_mla first pass (wk_b synthesis from wkv_b): keep
  unconditional under -sm graph. The wk_b_pp materialization here
  shares the wk_b_f32 intermediate with the wk_b synthesis above, and
  isolating just the wk_b_pp branch leaves the synthesized wk_b in a
  state that makes the absorb path produce inf on some quant combos
  (DSV2.5 IQ2_XS). Trade: the synthesized-wkv_b path still pays the
  wk_b_pp allocation under mla=1, but the bigger compute-buffer
  saving (no pp_opt branch at runtime) still applies.
- build_deepseek2 outer pp_opt: include cparams.mla_attn > 1 in the
  pp_opt definition itself, so mla=1 is bypassed throughout (TP and
  non-TP attention paths).
- build_deepseek2 tp pp_opt: require wk_b_pp present. Drop the dead
  runtime wk_b transpose fallback (unreachable now that wk_b_pp is
  guaranteed when tp_pp_opt fires).
- llama_kv_cache_init: have_wkv_b probe now treats wk_b_pp (attn_kv_b)
  as equivalent to wkv_b for the purposes of allowing mla>1 to stay
  put. Without this, -sm graph models that have wk_b/wv_b separately
  in the GGUF (no combined wkv_b) would silently downgrade to mla=1.
- Rename the synthesized tensor "attn_k_b_pp.weight" -> "attn_kv_b.weight"
  to match the mainline naming ik uses.

GLM-5.1 in particular benefits: its mla=3 PP improvement over mla=1 is
negligible on this arch (~0.4% in our sweeps), so users save the
runtime cost by sticking to mla=1.
2026-05-20 17:03:05 +03:00
Kawrakow
6bb3ee3a32
Enable split mode graph for MLA models and partial offload (#1835) 2026-05-20 07:13:55 +03:00
Kawrakow
997c587a6c
Fix #1837 (#1838) 2026-05-19 17:56:21 +03:00
David Young
c07a052315
MLA tensor parallelism under -sm graph (DEEPSEEK2/GLM_DSA/MISTRAL4) (#1821)
* MLA tensor parallelism under -sm graph (DEEPSEEK2/GLM_DSA/MISTRAL4)

Extends -sm graph (split-mode graph) to MLA-style attention across the
DEEPSEEK2, GLM_DSA, and MISTRAL4 architectures. Previously these archs
fell back to -sm layer regardless of the user's flag.

Implementation:
- Per-rank attention build in build_deepseek2_tp_attention with
  view-sliced FlashAttention, split-buffer output projection, and
  ggml_reduce across devices
- wk_b / wv_b absorbed weights replicated per device via materialize()
  in llm_prepare_mla (these can't live in a split buffer)
- KV cache replication path (replicated_k_l) for graph-mode TP
- distribute_mla_tensors_for_split_mode_graph routes attention/norm
  tensors into ctx_split; expert tensors stay per-layer
- Implements ggml_backend_cuda_split_buffer_get_tensor for the
  replicated / row-split / col-split inverse paths
- Early-reject guard in src/llama.cpp that auto-downgrades -sm graph
  to -sm layer (with a warning) when incompatible loader flags are set:
  -ncmoe, -cmoe, -ot, -rtr, -muge
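
The early-reject guard in the last bullet can be sketched as follows. This is not the code in src/llama.cpp: `resolve_split_mode`, the string split-mode names, and the warning text are assumptions for illustration; only the flag list comes from the commit message.

```python
# Flag list as given in the commit message.
INCOMPATIBLE_FLAGS = {"-ncmoe", "-cmoe", "-ot", "-rtr", "-muge"}

def resolve_split_mode(requested, flags):
    """Return (effective split mode, warning text or None)."""
    clash = sorted(INCOMPATIBLE_FLAGS & set(flags))
    if requested == "graph" and clash:
        # Downgrade instead of failing, and tell the user why.
        return "layer", "-sm graph is incompatible with " + ", ".join(clash) + "; using -sm layer"
    return requested, None

print(resolve_split_mode("graph", ["-rtr", "-fa"]))
print(resolve_split_mode("graph", ["-fa"]))
```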

New CLI flag:
- -gap | --graph-attn-precision <f16|f32>  (default f16)

See the PR description for the full validation matrix (3 archs x 2/4/8
GPU counts), perf numbers, VRAM accounting, and known limitations.

* Some tweaks

* materialize lambda: per-head split for graph-mode tp_replicate

7dd19e19 changed wk_b/wv_b distribution from mirror to per-head split
(split_dim=2) via prepare_split_tensors. That path only fires when
wk_b/wv_b are loaded from GGUF.

Models that store only wkv_b in GGUF derive wk_b/wv_b at load via
llm_prepare_mla, going through the materialize lambda, which was
untouched and still produced mirror replicas (split_dim=-1, full n_head
per device).

build_deepseek2_tp_attention now does mul_mat(wk_b_local, q_nope_perm)
without the prior view_3d slice, so a mirror replica passes an n_head
tensor where the kernel expects n_head_local. Result: silent SIGSEGV
right after model load.

Mirror logic in materialize is replaced with the same per-head split as
prepare_split_tensors: head_offsets derived from wo split, each rank
gets a tensor with ne[2]=n_head_local, data copied from the appropriate
source byte slice. Singular `computed` tensor keeps full metadata for
tensors_by_name lookups.
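
The per-head split can be sketched with a list standing in for the head dimension. This is an illustrative sketch, not the materialize lambda: `head_offsets` and `split_heads` are hypothetical helpers, and the per-rank head counts are simply given here, whereas the commit derives them from the wo split.

```python
def head_offsets(heads_per_rank):
    """Starting head index for each rank, given how many heads each rank owns."""
    offsets, start = [], 0
    for n in heads_per_rank:
        offsets.append(start)
        start += n
    return offsets

def split_heads(tensor_by_head, heads_per_rank):
    """Give each rank its contiguous slice of heads (n_head_local entries)."""
    parts = []
    for off, n in zip(head_offsets(heads_per_rank), heads_per_rank):
        parts.append(tensor_by_head[off:off + n])
    return parts

heads = ["head%d" % i for i in range(8)]   # stand-in for a tensor with ne[2] = n_head = 8
parts = split_heads(heads, [3, 3, 2])      # assumed uneven split across 3 ranks

for rank, part in enumerate(parts):
    print(rank, part)
```

A mirror replica would instead hand every rank all 8 heads, which is the shape mismatch the commit describes.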

Tested: 8x3090, -sm graph -mla 3 -fa on now boots cleanly and
sweep-benches without crash. Log confirms new path: "Computed
blk.X.attn_k_b.weight ... split across N devices on dim=2".

* cleanup: indent fix + remove dead view_3d slicing and debug printf

- build_deepseek2.cpp: re-indent the self_attention block in
  build_deepseek2_layer_attention (lines 253-670). Block was at column 0
  inside a function body; now at the expected 4/8-space indent.
- build_deepseek2.cpp: drop the commented-out view_3d slicing and debug
  printfs left over after 7dd19e19's switch to direct mul_mat on
  per-rank wk_b_local / wv_b_local. Update the stale 'wk_b is
  replicated (split_dim=-1)' comment to match the new split_dim=2
  reality.
- ggml-cuda.cu: remove the leftover debug printf in
  ggml_backend_cuda_split_buffer_get_tensor.

No behavior change. Verified with a clean rebuild and DSV2.5 +
GLM-4.7-Flash sweep-bench runs.

* llm_load_tensors: gate incompatible-flag warning to MLA archs

The -ncmoe / -rtr / -muge / -ot warning under -sm graph currently fires
for all archs that support graph mode. That's an over-reach: the
incompatibility is specific to the MLA TP paths (DEEPSEEK2, GLM_DSA,
MISTRAL4) — Gemma4 graph mode existed pre-PR and works with those flags.
Gate the warning to MLA archs only.

Also refreshes two stale comments left over from the wk_b/wv_b
mirror -> per-head-split rewrite:
- src/llama.cpp llm_prepare_mla: "Replicate wk_b/wv_b ..." now reads
  "Per-head split wk_b/wv_b ..." to match what the materialize lambda
  actually does post-823a39e2.
- src/llama-load-tensors.cpp distribute_mla_tensors_for_split_mode_graph:
  drop the wkv_b row-split mention (wkv_b is no longer created under
  graph mode after 7dd19e19) and correct the wk_b/wv_b distribution
  description (per-head split, not per-device replicated).

---------

Co-authored-by: Kawrakow <iwankawrakow@gmail.com>
2026-05-19 08:36:17 +03:00
Kawrakow
a407b9ca3d
Fix Qwen3.6-MoE low MTP acceptance rate (#1815)
* Fix Qwen3.6-MoE low MTP acceptance rate

* Fix Gemma4 MTP
2026-05-18 07:26:17 +03:00
Kawrakow
0ab9bdf793
Fix Qwen3.5/3.6 MTP and -muge (#1816) 2026-05-17 17:14:47 +03:00
Kawrakow
1f8c603d9c
Quantize: add extra output tensor for MTP (#1810)
* Quantize: add extra output tensor for MTP

* Consistently use --mtp-requantize-output-tensor
2026-05-17 13:59:56 +03:00
Kawrakow
3e573cfea6
MTP: option to use re-quantized output tensor for better TG performance (#1809)
* Option to use re-quantized output tensor for MTP

* Remove quantize extra output option

* Handle interleaved types
2026-05-16 14:40:18 +03:00
Samuel Oliveira Alves
0fcffdb64d
feat: map Gemma 4 tensor and support with imatrix (#1796) 2026-05-14 09:01:24 +03:00