David Young c07a052315
MLA tensor parallelism under -sm graph (DEEPSEEK2/GLM_DSA/MISTRAL4) (#1821)
* MLA tensor parallelism under -sm graph (DEEPSEEK2/GLM_DSA/MISTRAL4)

Extends -sm graph (split-mode graph) to MLA-style attention across the
DEEPSEEK2, GLM_DSA, and MISTRAL4 architectures. Previously these archs
fell back to -sm layer regardless of the user's flag.

Implementation:
- Per-rank attention build in build_deepseek2_tp_attention with
  view-sliced FlashAttention, split-buffer output projection, and
  ggml_reduce across devices
- wk_b / wv_b absorbed weights replicated per device via materialize()
  in llm_prepare_mla (these can't live in a split buffer)
- KV cache replication path (replicated_k_l) for graph-mode TP
- distribute_mla_tensors_for_split_mode_graph routes attention/norm
  tensors into ctx_split; expert tensors stay per-layer
- Implements ggml_backend_cuda_split_buffer_get_tensor for the
  replicated / row-split / col-split inverse paths
- Early-reject guard in src/llama.cpp that auto-downgrades -sm graph
  to -sm layer (with a warning) when incompatible loader flags are set:
  -ncmoe, -cmoe, -ot, -rtr, -muge

New CLI flag:
- -gap | --graph-attn-precision <f16|f32>  (default f16)

See the PR description for the full validation matrix (3 archs x 2/4/8
GPU counts), perf numbers, VRAM accounting, and known limitations.

* Some tweaks

* materialize lambda: per-head split for graph-mode tp_replicate

7dd19e19 changed wk_b/wv_b distribution from mirror to per-head split
(split_dim=2) via prepare_split_tensors. That path only fires when
wk_b/wv_b are loaded from GGUF.

Models that store only wkv_b in GGUF derive wk_b/wv_b at load via
llm_prepare_mla, going through the materialize lambda, which was
untouched and still produced mirror replicas (split_dim=-1, full n_head
per device).

build_deepseek2_tp_attention now does mul_mat(wk_b_local, q_nope_perm)
without the prior view_3d slice, so a mirror replica passes an n_head
tensor where the kernel expects n_head_local. Result: silent SIGSEGV
right after model load.

Mirror logic in materialize is replaced with the same per-head split as
prepare_split_tensors: head_offsets derived from wo split, each rank
gets a tensor with ne[2]=n_head_local, data copied from the appropriate
source byte slice. Singular `computed` tensor keeps full metadata for
tensors_by_name lookups.

Tested: 8x3090, -sm graph -mla 3 -fa on now boots cleanly and
sweep-benches without crash. Log confirms new path: "Computed
blk.X.attn_k_b.weight ... split across N devices on dim=2".

* cleanup: indent fix + remove dead view_3d slicing and debug printf

- build_deepseek2.cpp: re-indent the self_attention block in
  build_deepseek2_layer_attention (lines 253-670). Block was at column 0
  inside a function body; now at the expected 4/8-space indent.
- build_deepseek2.cpp: drop the commented-out view_3d slicing and debug
  printfs left over after 7dd19e19's switch to direct mul_mat on
  per-rank wk_b_local / wv_b_local. Update the stale 'wk_b is
  replicated (split_dim=-1)' comment to match the new split_dim=2
  reality.
- ggml-cuda.cu: remove the leftover debug printf in
  ggml_backend_cuda_split_buffer_get_tensor.

No behavior change. Verified with a clean rebuild and DSV2.5 +
GLM-4.7-Flash sweep-bench runs.

* llm_load_tensors: gate incompatible-flag warning to MLA archs

The -ncmoe / -rtr / -muge / -ot warning under -sm graph currently fires
for all archs that support graph mode. That's an over-reach: the
incompatibility is specific to the MLA TP paths (DEEPSEEK2, GLM_DSA,
MISTRAL4) — Gemma4 graph mode existed pre-PR and works with those flags.
Gate the warning to MLA archs only.

Also refreshes two stale comments left over from the wk_b/wv_b
mirror -> per-head-split rewrite:
- src/llama.cpp llm_prepare_mla: "Replicate wk_b/wv_b ..." now reads
  "Per-head split wk_b/wv_b ..." to match what the materialize lambda
  actually does post-823a39e2.
- src/llama-load-tensors.cpp distribute_mla_tensors_for_split_mode_graph:
  drop the wkv_b row-split mention (wkv_b is no longer created under
  graph mode after 7dd19e19) and correct the wk_b/wv_b distribution
  description (per-head split, not per-device replicated).

---------

Co-authored-by: Kawrakow <iwankawrakow@gmail.com>
2026-05-19 08:36:17 +03:00
..