mirror of
https://github.com/ikawrakow/ik_llama.cpp.git
synced 2026-06-28 04:30:15 -05:00
Extends -sm graph (split-mode graph) to MLA-style attention across the DEEPSEEK2, GLM_DSA, and MISTRAL4 architectures. Previously these archs fell back to -sm layer regardless of the user's flag. Implementation: - Per-rank attention build in build_deepseek2_tp_attention with view-sliced FlashAttention, split-buffer output projection, and ggml_reduce across devices - wk_b / wv_b absorbed weights replicated per device via materialize() in llm_prepare_mla (these can't live in a split buffer) - KV cache replication path (replicated_k_l) for graph-mode TP - distribute_mla_tensors_for_split_mode_graph routes attention/norm tensors into ctx_split; expert tensors stay per-layer - Implements ggml_backend_cuda_split_buffer_get_tensor for the replicated / row-split / col-split inverse paths - Early-reject guard in src/llama.cpp that auto-downgrades -sm graph to -sm layer (with a warning) when incompatible loader flags are set: -ncmoe, -cmoe, -ot, -rtr, -muge New CLI flag: - -gap | --graph-attn-precision <f16|f32> (default f16) See the PR description for the full validation matrix (3 archs x 2/4/8 GPU counts), perf numbers, VRAM accounting, and known limitations.