MLA tensor parallelism under -sm graph (DEEPSEEK2/GLM_DSA/MISTRAL4) (#1821)

* MLA tensor parallelism under -sm graph (DEEPSEEK2/GLM_DSA/MISTRAL4) Extends -sm graph (split-mode graph) to MLA-style attention across the DEEPSEEK2, GLM_DSA, and MISTRAL4 architectures. Previously these archs fell back to -sm layer regardless of the user's flag. Implementation: - Per-rank attention build in build_deepseek2_tp_attention with view-sliced FlashAttention, split-buffer output projection, and ggml_reduce across devices - wk_b / wv_b absorbed weights replicated per device via materialize() in llm_prepare_mla (these can't live in a split buffer) - KV cache replication path (replicated_k_l) for graph-mode TP - distribute_mla_tensors_for_split_mode_graph routes attention/norm tensors into ctx_split; expert tensors stay per-layer - Implements ggml_backend_cuda_split_buffer_get_tensor for the replicated / row-split / col-split inverse paths - Early-reject guard in src/llama.cpp that auto-downgrades -sm graph to -sm layer (with a warning) when incompatible loader flags are set: -ncmoe, -cmoe, -ot, -rtr, -muge New CLI flag: - -gap | --graph-attn-precision <f16|f32> (default f16) See the PR description for the full validation matrix (3 archs x 2/4/8 GPU counts), perf numbers, VRAM accounting, and known limitations. * Some tweaks * materialize lambda: per-head split for graph-mode tp_replicate 7dd19e19 changed wk_b/wv_b distribution from mirror to per-head split (split_dim=2) via prepare_split_tensors. That path only fires when wk_b/wv_b are loaded from GGUF. Models that store only wkv_b in GGUF derive wk_b/wv_b at load via llm_prepare_mla, going through the materialize lambda, which was untouched and still produced mirror replicas (split_dim=-1, full n_head per device). build_deepseek2_tp_attention now does mul_mat(wk_b_local, q_nope_perm) without the prior view_3d slice, so a mirror replica passes an n_head tensor where the kernel expects n_head_local. Result: silent SIGSEGV right after model load. Mirror logic in materialize is replaced with the same per-head split as prepare_split_tensors: head_offsets derived from wo split, each rank gets a tensor with ne[2]=n_head_local, data copied from the appropriate source byte slice. Singular `computed` tensor keeps full metadata for tensors_by_name lookups. Tested: 8x3090, -sm graph -mla 3 -fa on now boots cleanly and sweep-benches without crash. Log confirms new path: "Computed blk.X.attn_k_b.weight ... split across N devices on dim=2". * cleanup: indent fix + remove dead view_3d slicing and debug printf - build_deepseek2.cpp: re-indent the self_attention block in build_deepseek2_layer_attention (lines 253-670). Block was at column 0 inside a function body; now at the expected 4/8-space indent. - build_deepseek2.cpp: drop the commented-out view_3d slicing and debug printfs left over after 7dd19e19's switch to direct mul_mat on per-rank wk_b_local / wv_b_local. Update the stale 'wk_b is replicated (split_dim=-1)' comment to match the new split_dim=2 reality. - ggml-cuda.cu: remove the leftover debug printf in ggml_backend_cuda_split_buffer_get_tensor. No behavior change. Verified with a clean rebuild and DSV2.5 + GLM-4.7-Flash sweep-bench runs. * llm_load_tensors: gate incompatible-flag warning to MLA archs The -ncmoe / -rtr / -muge / -ot warning under -sm graph currently fires for all archs that support graph mode. That's an over-reach: the incompatibility is specific to the MLA TP paths (DEEPSEEK2, GLM_DSA, MISTRAL4) — Gemma4 graph mode existed pre-PR and works with those flags. Gate the warning to MLA archs only. Also refreshes two stale comments left over from the wk_b/wv_b mirror -> per-head-split rewrite: - src/llama.cpp llm_prepare_mla: "Replicate wk_b/wv_b ..." now reads "Per-head split wk_b/wv_b ..." to match what the materialize lambda actually does post-823a39e2. - src/llama-load-tensors.cpp distribute_mla_tensors_for_split_mode_graph: drop the wkv_b row-split mention (wkv_b is no longer created under graph mode after 7dd19e19) and correct the wk_b/wv_b distribution description (per-head split, not per-device replicated). --------- Co-authored-by: Kawrakow <iwankawrakow@gmail.com>
2026-06-28 04:30:15 -05:00 · 2026-05-19 06:36:17 +01:00 · 2026-05-19 06:36:17 +01:00 · c07a052315
commit c07a052315
parent 104846ddee
13 changed files with 1191 additions and 511 deletions
--- a/common/common.cpp
+++ b/common/common.cpp
@ -2074,6 +2074,11 @@ bool gpt_params_find_arg(int argc, char ** argv, const std::string & arg, gpt_pa
        params.reduce_type = argv[i];
        return true;
    }
+    if (arg == "-gap" || arg == "--graph-attn-precision") {
+        CHECK_ARG
+        params.graph_attn_precision = argv[i];
+        return true;
+    }
    if (arg == "--numa") {
        CHECK_ARG
        std::string value(argv[i]);
@ -2885,6 +2890,7 @@ void gpt_params_print_usage(int /*argc*/, char ** argv, const gpt_params & param
    options.push_back({ "*",         "-smf16, --split-mode-f16,",       "Use f16 for data exchange between GPUs (default: %d)", true});
    options.push_back({ "*",         "-smf32, --split-mode-f32,",       "Use f32 for data exchange between GPUs (default: %d)", false});
    options.push_back({ "*",         "-grt, --graph-reduce-type",       "Type for data exchange between GPUs (default: %s)", "f32"});
+    options.push_back({ "*",         "-gap, --graph-attn-precision",    "Flash-attn precision under -sm graph (default: %s)", "f16"});
    options.push_back({ "*",         "-smgs, --split-mode-graph-scheduling,", "Force Split Mode Graph Scheduling (default: %d)", params.split_mode_graph_scheduling});
    options.push_back({ "*",         "-sas,  --scheduler_async,",       "Async evaluation of compute graphs: %d)", params.scheduler_async});
    options.push_back({ "*",         "-vq, --validate-quants",          "validate quantized data while loading the model (default: %d)", params.validate_quants});
@ -4077,6 +4083,7 @@ struct llama_context_params common_context_params_to_llama(const gpt_params & pa
    cparams.type_k = kv_cache_type_from_str(params.cache_type_k);
    cparams.type_v = kv_cache_type_from_str(params.cache_type_v);
    cparams.type_reduce = ggml_type_from_str(params.reduce_type);
+    cparams.type_graph_attn = ggml_type_from_str(params.graph_attn_precision);
    if (!cparams.flash_attn && ggml_is_quantized(cparams.type_v)) {
        throw std::runtime_error("Quantized V cache cannot be used without flash attention");
    }
--- a/common/common.h
+++ b/common/common.h
@ -433,6 +433,7 @@ struct gpt_params {
    std::string cache_type_v = "f16"; // KV cache data type for the V

    std::string reduce_type = "f16";
+    std::string graph_attn_precision = "f16";

    std::string type_k_first = "f16";
    std::string type_k_last  = "f16";
--- a/ggml/include/ggml.h
+++ b/ggml/include/ggml.h
@ -240,7 +240,7 @@
 // if you need to load more than 64 model shards.
 #define GGML_MAX_CONTEXTS       64
 #endif
-#define GGML_MAX_SRC            10
+#define GGML_MAX_SRC            12
 #ifndef GGML_MAX_NAME
 #define GGML_MAX_NAME           64
 #endif
--- a/ggml/src/ggml-cuda.cu
+++ b/ggml/src/ggml-cuda.cu
@ -1127,6 +1127,18 @@ GGML_CALL static void ggml_backend_cuda_split_buffer_set_tensor([[maybe_unused]]
            }
        }
    }
+    else if (extra->split_dim == 2) {
+        size_t cur_offset = 0;
+        for (int i = 0; i < extra->n_device; ++i) {
+            auto split = extra->splits[i];
+            if (!split) continue;
+            ggml_cuda_set_device(i);
+            auto size = ggml_nbytes(split);
+            const char * buf_host = (const char *)data + cur_offset;
+            CUDA_CHECK(cudaMemcpyAsync(split->data, buf_host, size, cudaMemcpyHostToDevice, cudaStreamPerThread));
+            cur_offset += size;
+        }
+    }
    else {
        fprintf(stderr, "%s: not implemented for split dim %d\n", __func__, extra->split_dim == 0);
        GGML_ABORT("fatal error");
@ -1140,12 +1152,125 @@ GGML_CALL static void ggml_backend_cuda_split_buffer_set_tensor([[maybe_unused]]
 }

 GGML_CALL static void ggml_backend_cuda_split_buffer_get_tensor([[maybe_unused]] ggml_backend_buffer_t buffer, const ggml_tensor * tensor,
-        [[maybe_unused]] void * data, size_t offset, size_t size) {
-    // split tensors must always be set in their entirety at once
+        void * data, size_t offset, size_t size) {
+    // split tensors must always be read in their entirety at once
    GGML_ASSERT(offset == 0);
    GGML_ASSERT(size == ggml_nbytes(tensor));

-    GGML_ABORT("not implemented");
+    if (!tensor->extra) return;
+
+    // Inverse of split_buffer_set_tensor; refuses paths with no defined inverse.
+    auto extra = (ggml_split_tensor_t *)tensor->extra;
+    GGML_ASSERT(extra->n_device <= ggml_backend_cuda_get_device_count());
+
+    // Repacked types are block-de-interleaved by set_tensor; no runtime inverse.
+    {
+        const ggml_type t = tensor->type;
+        const bool is_repacked =
+            t == GGML_TYPE_Q4_0_R8  || t == GGML_TYPE_Q5_0_R4  || t == GGML_TYPE_Q8_0_R8  ||
+            t == GGML_TYPE_Q2_K_R4  || t == GGML_TYPE_Q3_K_R4  || t == GGML_TYPE_Q4_K_R4  ||
+            t == GGML_TYPE_Q5_K_R4  || t == GGML_TYPE_Q6_K_R4  || t == GGML_TYPE_IQ4_NL_R4 ||
+            t == GGML_TYPE_IQ4_XS_R8 || t == GGML_TYPE_Q6_0_R4;
+        if (is_repacked) {
+            GGML_ABORT("%s: get_tensor of repacked type %s is not invertible",
+                       __func__, ggml_type_name(t));
+        }
+    }
+
+    // Explicit-ranges form (non-contiguous expert assignments) is not invertible.
+    void * extra_ptr = nullptr;
+    memcpy(&extra_ptr, tensor->op_params, sizeof(extra_ptr));
+    if (extra_ptr) {
+        GGML_ABORT("%s: get_tensor with explicit ranges is not implemented", __func__);
+    }
+
+    if (extra->split_dim < 0) {
+        // Replicated: read from first present device.
+        GGML_ASSERT(ggml_is_contiguous(tensor));
+        for (int i = 0; i < extra->n_device; ++i) {
+            auto split = extra->splits[i];
+            if (!split) continue;
+            GGML_ASSERT(split->type == tensor->type);
+            ggml_cuda_set_device(i);
+            CUDA_CHECK(cudaMemcpyAsync(data, split->data, ggml_nbytes(tensor),
+                                       cudaMemcpyDeviceToHost, cudaStreamPerThread));
+            CUDA_CHECK(cudaStreamSynchronize(cudaStreamPerThread));
+            return;
+        }
+        GGML_ABORT("%s: no device holds a copy of the replicated tensor", __func__);
+    }
+    else if (extra->split_dim == 0) {
+        // Row-split (concat along ne[0]).
+        GGML_ASSERT(ggml_is_contiguous(tensor));
+        auto tt = ggml_internal_get_type_traits(tensor->type);
+        GGML_ASSERT(tt.row_meta_size == 0);
+        std::vector<char> host_buffer;
+        int64_t ne0_acc = 0;
+        for (int i = 0; i < extra->n_device; ++i) {
+            auto split = extra->splits[i];
+            if (!split) continue;
+            GGML_ASSERT(split->type == tensor->type);
+            GGML_ASSERT(split->ne[0] % tt.blck_size == 0);
+            const size_t split_row_size = ggml_row_size(split->type, split->ne[0]);
+            const size_t dev_bytes      = (size_t)ggml_nrows(split) * split_row_size;
+            if (host_buffer.size() < dev_bytes) host_buffer.resize(dev_bytes);
+            ggml_cuda_set_device(i);
+            CUDA_CHECK(cudaMemcpyAsync(host_buffer.data(), split->data, dev_bytes,
+                                       cudaMemcpyDeviceToHost, cudaStreamPerThread));
+            CUDA_CHECK(cudaStreamSynchronize(cudaStreamPerThread));
+            const size_t source_offset = (ne0_acc / tt.blck_size) * tt.type_size;
+            for (int64_t i02 = 0; i02 < split->ne[2]; ++i02) {
+                for (int64_t i01 = 0; i01 < split->ne[1]; ++i01) {
+                    const char * src = host_buffer.data() + (i02*split->ne[1] + i01) * split_row_size;
+                    char       * dst = (char *)data + i02*tensor->nb[2] + i01*tensor->nb[1] + source_offset;
+                    memcpy(dst, src, split_row_size);
+                }
+            }
+            ne0_acc += split->ne[0];
+        }
+    }
+    else if (extra->split_dim == 1) {
+        // Column/ne[1] split.
+        const size_t row_size = ggml_row_size(tensor->type, tensor->ne[0]);
+        if (tensor->ne[2] > 1) {
+            std::vector<char> host_buffer;
+            int64_t ne1_acc = 0;
+            for (int i = 0; i < extra->n_device; ++i) {
+                auto split = extra->splits[i];
+                if (!split) continue;
+                const size_t dev_bytes = ggml_nbytes(split);
+                if (host_buffer.size() < dev_bytes) host_buffer.resize(dev_bytes);
+                ggml_cuda_set_device(i);
+                CUDA_CHECK(cudaMemcpyAsync(host_buffer.data(), split->data, dev_bytes,
+                                           cudaMemcpyDeviceToHost, cudaStreamPerThread));
+                CUDA_CHECK(cudaStreamSynchronize(cudaStreamPerThread));
+                for (int64_t i02 = 0; i02 < split->ne[2]; ++i02) {
+                    const char * src = host_buffer.data() + i02 * split->ne[1] * row_size;
+                    char       * dst = (char *)data + i02*tensor->nb[2] + ne1_acc*tensor->nb[1];
+                    memcpy(dst, src, split->ne[1] * row_size);
+                }
+                ne1_acc += split->ne[1];
+            }
+        } else {
+            size_t cur_offset = 0;
+            for (int i = 0; i < extra->n_device; ++i) {
+                auto split = extra->splits[i];
+                if (!split) continue;
+                ggml_cuda_set_device(i);
+                const size_t dev_bytes = ggml_nbytes(split);
+                CUDA_CHECK(cudaMemcpyAsync((char *)data + cur_offset, split->data, dev_bytes,
+                                           cudaMemcpyDeviceToHost, cudaStreamPerThread));
+                cur_offset += dev_bytes;
+            }
+            for (int i = 0; i < extra->n_device; ++i) {
+                if (!extra->splits[i]) continue;
+                CUDA_CHECK(cudaStreamSynchronize(cudaStreamPerThread));
+            }
+        }
+    }
+    else {
+        GGML_ABORT("%s: not implemented for split_dim %d", __func__, extra->split_dim);
+    }
 }

 GGML_CALL static void ggml_backend_cuda_split_buffer_clear(ggml_backend_buffer_t buffer, uint8_t value) {
@ -1847,8 +1972,9 @@ static void ggml_cuda_op_mul_mat(
    }

    const int64_t src1_col_stride = ne11;
-    if (quantization_done && ne11 == 1 && ne12 > 1 && ne13 == 1 && ne02 == ne12 && ne02 == dst->ne[2]) {
-        //printf("invoking fast path for %s x %s\n", src0->name, src1->name);
+    // split-buffer src0 has data == NULL; per-device dispatch happens in the slow path below.
+    const bool src0_is_split = src0->buffer && ggml_backend_buft_is_cuda_split(src0->buffer->buft);
+    if (quantization_done && ne11 == 1 && ne12 > 1 && ne13 == 1 && ne02 == ne12 && ne02 == dst->ne[2] && !src0_is_split) {
        int id = ctx.device;
        char  *  src0_dd_i =  dev[id].src0_dd;
        float * src1_ddf_i = dev[id].src1_ddf;
@ -4452,6 +4578,16 @@ GGML_CALL static enum ggml_status ggml_backend_cuda_graph_compute(ggml_backend_t

 GGML_CALL static bool ggml_backend_cuda_supports_op(ggml_backend_t backend, const ggml_tensor * op) {
    ggml_backend_cuda_context * cuda_ctx = (ggml_backend_cuda_context *) backend->context;
+
+    // Non-mul_mat ops can't read a split-buffer parent (no data ptr); let the scheduler fall back to CPU.
+    if (op->op != GGML_OP_MUL_MAT && op->op != GGML_OP_MUL_MAT_ID) {
+        for (int i = 0; i < GGML_MAX_SRC; i++) {
+            if (op->src[i] && op->src[i]->buffer && ggml_backend_buft_is_cuda_split(op->src[i]->buffer->buft)) {
+                return false;
+            }
+        }
+    }
+
    switch (op->op) {
        case GGML_OP_UNARY:
            switch (ggml_get_unary_op(op)) {
--- a/ggml/src/ggml-cuda/fattn-new-mma.cu
+++ b/ggml/src/ggml-cuda/fattn-new-mma.cu
@ -2132,7 +2132,7 @@ static void ggml_cuda_flash_attn_ext_mma_f16_switch_ncols1(ggml_backend_cuda_con
        //    return;
        //} else {
        if (Q->ne[1] <= 8/ncols2) {
-            if constexpr (DKQ == 512) {
+            if constexpr (DKQ == 512 || DKQ == 576) {
                ggml_cuda_flash_attn_ext_mma_f16_case<DKQ, DV, 2, ncols2>(ctx, dst);
            } else {
                ggml_cuda_flash_attn_ext_mma_f16_case<DKQ, DV, 8/ncols2, ncols2>(ctx, dst);
@ -2275,8 +2275,14 @@ void ggml_cuda_flash_attn_ext_mma_new(ggml_backend_cuda_context & ctx, ggml_tens
        }
        return;
    }
+    if (gqa_ratio % 12 == 0 && Q->ne[1] <= 4 && K->ne[1] >= 2048) {
+        ggml_cuda_flash_attn_ext_mma_f16_case<576, 512, 1, 16>(ctx, dst);
+        return;
+    }
    if (gqa_ratio % 16 == 0) {
        ggml_cuda_flash_attn_ext_mma_f16_switch_ncols1<576, 512, 16>(ctx, dst);
+    } else if (gqa_ratio % 8 == 0) {
+        ggml_cuda_flash_attn_ext_mma_f16_switch_ncols1<576, 512, 8>(ctx, dst);
    } else if (gqa_ratio % 4 == 0) {
        ggml_cuda_flash_attn_ext_mma_f16_switch_ncols1<576, 512, 4>(ctx, dst);
    } else {
--- a/include/llama.h
+++ b/include/llama.h
@ -464,6 +464,7 @@ extern "C" {
        enum ggml_type type_k; // data type for K cache [EXPERIMENTAL]
        enum ggml_type type_v; // data type for V cache [EXPERIMENTAL]
        enum ggml_type type_reduce; // data type for reduce operations
+        enum ggml_type type_graph_attn; // flash-attn precision under -sm graph
        enum ggml_type type_k_first;
        enum ggml_type type_k_last;
        enum ggml_type type_v_first;
--- a/src/graphs/build_deepseek2.cpp
+++ b/src/graphs/build_deepseek2.cpp
--- a/src/llama-build-context.h
+++ b/src/llama-build-context.h
@ -258,6 +258,25 @@ struct llm_build_context {

    ggml_cgraph * build_deepseek2();

+    ggml_tensor * build_deepseek2_tp_attention(
+            ggml_cgraph * gf, int il,
+            ggml_tensor * inpL,
+            ggml_tensor * KQ_mask, ggml_tensor * inp_pos,
+            ggml_tensor * rope_cache,
+            float kq_scale, float attn_factor_scaled,
+            bool use_f32_attn_precision,
+            bool is_lite);
+
+    ggml_tensor * build_deepseek2_layer_attention(
+            ggml_cgraph * gf, int il,
+            ggml_tensor * inpL,
+            ggml_tensor * KQ_mask, ggml_tensor * inp_pos,
+            ggml_tensor * rope_cache,
+            float kq_scale, float attn_factor_scaled,
+            bool use_f32_attn_precision,
+            bool is_lite,
+            bool pp_opt);
+
    ggml_cgraph * build_glm4_moe();

    ggml_cgraph * build_bitnet();
--- a/src/llama-context.h
+++ b/src/llama-context.h
@ -66,6 +66,9 @@ struct llama_kv_cache {
    std::vector<llama_split_tensor> split_v_l;
    std::vector<llama_split_tensor> split_s_l;

+    // Per-device replicas of the MLA compressed-latent KV cache (-sm graph for DEEPSEEK2/GLM_DSA/MISTRAL4).
+    std::vector<llama_split_tensor> replicated_k_l;
+
    std::vector<struct ggml_context *> ctxs;
    std::vector<ggml_backend_buffer_t> bufs;

--- a/src/llama-cparams.h
+++ b/src/llama-cparams.h
@ -50,6 +50,7 @@ struct llama_cparams {
    int  worst_graph_tokens;

    enum ggml_type reduce_type;
+    enum ggml_type graph_attn_precision;
    enum llama_pooling_type pooling_type;
    enum llama_mtp_op_type mtp_op_type;

--- a/src/llama-load-tensors.cpp
+++ b/src/llama-load-tensors.cpp
@ -2497,12 +2497,15 @@ bool create_tensors_helper::create_deepseek2_tensors(const LLM_TN & tn) {

        auto & layer = model.layers[i];

-        layer.attn_norm = create_tensor(ctx_layer, tn(LLM_TENSOR_ATTN_NORM, "weight", i), {n_embd});
+        // Under -sm graph/attn, norms need per-device replicas via prepare_split_tensors(-1, ...).
+        auto norm_ctx = (model.split_mode == LLAMA_SPLIT_MODE_GRAPH ||
+                         model.split_mode == LLAMA_SPLIT_MODE_ATTN) ? ctx_split : ctx_layer;
+        layer.attn_norm = create_tensor(norm_ctx, tn(LLM_TENSOR_ATTN_NORM, "weight", i), {n_embd});
        if (!is_lite) {
-            layer.attn_q_a_norm = create_tensor(ctx_layer, tn(LLM_TENSOR_ATTN_Q_A_NORM, "weight", i), {q_lora_rank});
+            layer.attn_q_a_norm = create_tensor(norm_ctx, tn(LLM_TENSOR_ATTN_Q_A_NORM, "weight", i), {q_lora_rank});
        }

-        layer.attn_kv_a_norm = create_tensor(ctx_layer, tn(LLM_TENSOR_ATTN_KV_A_NORM, "weight", i), {kv_lora_rank});
+        layer.attn_kv_a_norm = create_tensor(norm_ctx, tn(LLM_TENSOR_ATTN_KV_A_NORM, "weight", i), {kv_lora_rank});

        bool merged = false;
        if (ml.merge_qkv) {
@ -2541,7 +2544,10 @@ bool create_tensors_helper::create_deepseek2_tensors(const LLM_TN & tn) {
            layer.wkv_a_mqa = create_tensor(ctx_split, tn(LLM_TENSOR_ATTN_KV_A_MQA, "weight", i),{n_embd, kv_lora_rank + (n_embd_head_qk_rope)});
        }

-        layer.wkv_b     = create_tensor(ctx_split, tn(LLM_TENSOR_ATTN_KV_B,     "weight", i),
+        // Keep wkv_b on a single-device buffer; llm_prepare_mla reads it back to derive wk_b/wv_b.
+        auto wkv_b_ctx = (model.split_mode == LLAMA_SPLIT_MODE_GRAPH ||
+                          model.split_mode == LLAMA_SPLIT_MODE_ATTN) ? ctx_layer : ctx_split;
+        layer.wkv_b     = create_tensor(wkv_b_ctx, tn(LLM_TENSOR_ATTN_KV_B,     "weight", i),
                {kv_lora_rank, n_head * (n_embd_head_qk_nope + n_embd_head_v)}, llama_model_loader::TENSOR_NOT_REQUIRED);
        if (!layer.wkv_b) {
            // Incompatible mainline model. Let's see if we can still load it
@ -2554,15 +2560,18 @@ bool create_tensors_helper::create_deepseek2_tensors(const LLM_TN & tn) {
        }
        layer.wo        = create_tensor(ctx_split, tn(LLM_TENSOR_ATTN_OUT,      "weight", i), {              n_head * (                      n_embd_head_v), n_embd});

-        layer.ffn_norm = create_tensor(ctx_layer, tn(LLM_TENSOR_FFN_NORM, "weight", i), {n_embd});
+        layer.ffn_norm = create_tensor(norm_ctx, tn(LLM_TENSOR_FFN_NORM, "weight", i), {n_embd});

        if (i < (int) hparams.n_layer_dense_lead) {
            layer.ffn_gate = create_tensor(ctx_split, tn(LLM_TENSOR_FFN_GATE, "weight", i), {n_embd,   n_ff});
            layer.ffn_down = create_tensor(ctx_split, tn(LLM_TENSOR_FFN_DOWN, "weight", i), {  n_ff, n_embd});
            layer.ffn_up   = create_tensor(ctx_split, tn(LLM_TENSOR_FFN_UP,   "weight", i), {n_embd,   n_ff});
        } else {
-            layer.ffn_gate_inp = create_tensor(ctx_layer, tn(LLM_TENSOR_FFN_GATE_INP, "weight", i), {n_embd, n_expert});
-            layer.ffn_exp_probs_b = create_tensor(ctx_layer, tn(LLM_TENSOR_FFN_EXP_PROBS_B, "bias", i), {n_expert}, 1);
+            // llm_build_std_moe_ffn needs per-device extras on these under -sm graph/attn.
+            const auto moe_ctx = (model.split_mode == LLAMA_SPLIT_MODE_GRAPH ||
+                                  model.split_mode == LLAMA_SPLIT_MODE_ATTN) ? ctx_split : ctx_layer;
+            layer.ffn_gate_inp = create_tensor(moe_ctx, tn(LLM_TENSOR_FFN_GATE_INP, "weight", i), {n_embd, n_expert});
+            layer.ffn_exp_probs_b = create_tensor(moe_ctx, tn(LLM_TENSOR_FFN_EXP_PROBS_B, "bias", i), {n_expert}, 1);

            GGML_ASSERT(n_expert      > 0);
            GGML_ASSERT(n_expert_used > 0);
@ -2612,10 +2621,16 @@ bool create_tensors_helper::create_glm_dsa_tensors(const LLM_TN & tn) {

        auto & layer = model.layers[i];

-        layer.attn_norm = create_tensor(ctx_layer, tn(LLM_TENSOR_ATTN_NORM, "weight", i), {n_embd}, flags);
-        layer.attn_q_a_norm = create_tensor(ctx_layer, tn(LLM_TENSOR_ATTN_Q_A_NORM, "weight", i), {q_lora_rank}, flags);
+        // Under -sm graph/attn, norms and MoE-gate tensors need per-device replicas in ctx_split.
+        const auto graph_or_attn = (model.split_mode == LLAMA_SPLIT_MODE_GRAPH ||
+                                    model.split_mode == LLAMA_SPLIT_MODE_ATTN);
+        auto norm_ctx = graph_or_attn ? ctx_split : ctx_layer;
+        auto moe_ctx  = graph_or_attn ? ctx_split : ctx_layer;

-        layer.attn_kv_a_norm = create_tensor(ctx_layer, tn(LLM_TENSOR_ATTN_KV_A_NORM, "weight", i), {kv_lora_rank}, flags);
+        layer.attn_norm = create_tensor(norm_ctx, tn(LLM_TENSOR_ATTN_NORM, "weight", i), {n_embd}, flags);
+        layer.attn_q_a_norm = create_tensor(norm_ctx, tn(LLM_TENSOR_ATTN_Q_A_NORM, "weight", i), {q_lora_rank}, flags);
+
+        layer.attn_kv_a_norm = create_tensor(norm_ctx, tn(LLM_TENSOR_ATTN_KV_A_NORM, "weight", i), {kv_lora_rank}, flags);

        bool merged = false;
        if (ml.merge_qkv) {
@ -2658,15 +2673,15 @@ bool create_tensors_helper::create_glm_dsa_tensors(const LLM_TN & tn) {
        layer.indexer_attn_k   = create_tensor(ctx_split, tn(LLM_TENSOR_INDEXER_ATTN_K,   "weight", i), {n_embd, hparams.indexer_head_size}, flags);
        layer.indexer_attn_q_b = create_tensor(ctx_split, tn(LLM_TENSOR_INDEXER_ATTN_Q_B, "weight", i), {q_lora_rank, hparams.indexer_n_head * hparams.indexer_head_size}, flags);

-        layer.ffn_norm = create_tensor(ctx_layer, tn(LLM_TENSOR_FFN_NORM, "weight", i), {n_embd}, flags);
+        layer.ffn_norm = create_tensor(norm_ctx, tn(LLM_TENSOR_FFN_NORM, "weight", i), {n_embd}, flags);

        if (i < (int) hparams.n_layer_dense_lead) {
            layer.ffn_gate = create_tensor(ctx_split, tn(LLM_TENSOR_FFN_GATE, "weight", i), {n_embd,   n_ff}, flags);
            layer.ffn_down = create_tensor(ctx_split, tn(LLM_TENSOR_FFN_DOWN, "weight", i), {  n_ff, n_embd}, flags);
            layer.ffn_up   = create_tensor(ctx_split, tn(LLM_TENSOR_FFN_UP,   "weight", i), {n_embd,   n_ff}, flags);
        } else {
-            layer.ffn_gate_inp = create_tensor(ctx_layer, tn(LLM_TENSOR_FFN_GATE_INP, "weight", i), {n_embd, n_expert}, flags);
-            layer.ffn_exp_probs_b = create_tensor(ctx_layer, tn(LLM_TENSOR_FFN_EXP_PROBS_B, "bias", i), {n_expert}, flags);
+            layer.ffn_gate_inp = create_tensor(moe_ctx, tn(LLM_TENSOR_FFN_GATE_INP, "weight", i), {n_embd, n_expert}, flags);
+            layer.ffn_exp_probs_b = create_tensor(moe_ctx, tn(LLM_TENSOR_FFN_EXP_PROBS_B, "bias", i), {n_expert}, flags);

            GGML_ASSERT(n_expert      > 0);
            GGML_ASSERT(n_expert_used > 0);
@ -3719,7 +3734,7 @@ bool create_tensors_helper::merge_qkv(const LLM_TN & tn, int i, int bias, bool i

 static void prepare_split_tensors(int split_dim, ggml_context * ctx, ggml_tensor * tensor, llama_split_tensor & split_tensor,
        const std::vector<int> & splits, std::vector<size_t> & mem_used) {
-    GGML_ASSERT(split_dim <= 1);
+    GGML_ASSERT(split_dim <= 2);
    GGML_ASSERT(splits.size() > 1);
    std::string name{tensor->name};
    split_tensor.tensor_splits.resize(splits.size());
@ -3744,6 +3759,17 @@ static void prepare_split_tensors(int split_dim, ggml_context * ctx, ggml_tensor
                split_tensor.tensor_splits[i] = nullptr;
            }
        }
+    }
+    else if (split_dim == 2) {
+        for (int i = 0; i < int(splits.size()); ++i) {
+            if (splits[i] > 0) {
+                split_tensor.tensor_splits[i] = ggml_new_tensor_3d(ctx, tensor->type, tensor->ne[0], tensor->ne[1], splits[i]);
+                auto name_i = name + '.' + std::to_string(i);
+                ggml_set_name(split_tensor.tensor_splits[i], name_i.c_str());
+            } else {
+                split_tensor.tensor_splits[i] = nullptr;
+            }
+        }
    } else {
        for (int i = 0; i < int(splits.size()); ++i) {
            if (splits[i] > 0) {
@ -3769,6 +3795,80 @@ static void prepare_split_tensors(int split_dim, ggml_context * ctx, ggml_tensor
    }
 }

+// MLA tensor distribution for -sm graph / -sm attn.
+// q_a/wkv_a_mqa/norms replicated; q_b row-split by Q head; wo row-split.
+// wk_b/wv_b are per-head split (split_dim=2) — loaded directly when present
+// in the GGUF, or produced per-head-split by llm_prepare_mla()'s materialize
+// lambda when only wkv_b is in the GGUF.
+static void distribute_mla_tensors_for_split_mode_graph(
+        llama_layer & layer,
+        const llama_hparams & hparams,
+        const std::vector<float> & cur_splits,
+        std::vector<size_t> & mem_used,
+        ggml_context * ctx_split,
+        int il) {
+    const std::vector<int> mirror(cur_splits.size(), 1);
+
+    const int n_head        = hparams.n_head(il);
+    const int n_embd_head_k = hparams.n_embd_head_k(il);
+    const int n_embd_head_v = hparams.n_embd_head_v(il);
+    const int qk_rope       = hparams.n_rot;
+    const int qk_nope       = n_embd_head_k - qk_rope;
+
+    // granularity=4: keeps wo row blocks K-quant-aligned (% 256) and gqa_ratio % 4 == 0 for FA-MMA.
+    auto split_heads = create_split(n_head, 4, cur_splits, mem_used);
+
+    // Derive per-tensor column/row splits from head splits.
+    auto split_wq_b_cols  = split_heads;
+    for (auto & s : split_wq_b_cols)  s *= n_embd_head_k;
+    auto split_wo_rows    = split_heads;
+    for (auto & s : split_wo_rows)    s *= n_embd_head_v;
+
+    LLAMA_LOG_DEBUG("  MLA layer %d split_heads:", il);
+    for ([[maybe_unused]] auto s : split_heads) LLAMA_LOG_DEBUG(" %d", s);
+    LLAMA_LOG_DEBUG("\n");
+
+    // Replicated norms (Q-LoRA / KV-LoRA)
+    if (layer.attn_q_a_norm) {
+        prepare_split_tensors(-1, ctx_split, layer.attn_q_a_norm,  layer.split_attn_q_a_norm,  mirror, mem_used);
+    }
+    if (layer.attn_kv_a_norm) {
+        prepare_split_tensors(-1, ctx_split, layer.attn_kv_a_norm, layer.split_attn_kv_a_norm, mirror, mem_used);
+    }
+
+    // Q-side: either wq_a + wq_b (Q-LoRA path, DSV3/K2) or wq directly (DSV2-Lite)
+    if (layer.wq_a) {
+        prepare_split_tensors(-1, ctx_split, layer.wq_a, layer.split_wq_a, mirror, mem_used);
+    }
+    if (layer.wq_b) {
+        prepare_split_tensors(1, ctx_split, layer.wq_b, layer.split_wq_b, split_wq_b_cols, mem_used);
+    } else if (layer.wq) {
+        // DSV2-Lite / no-Q-LoRA path: column-split wq directly along the head dim.
+        auto split_wq_cols = split_heads;
+        for (auto & s : split_wq_cols) s *= n_embd_head_k;
+        prepare_split_tensors(1, ctx_split, layer.wq, layer.split_wq, split_wq_cols, mem_used);
+    }
+
+    // wkv_a_mqa, wk_b, wv_b replicated: the per-head 3D batched mul_mat can't read a split src0.
+    if (layer.wkv_a_mqa) {
+        prepare_split_tensors(-1, ctx_split, layer.wkv_a_mqa, layer.split_wkv_a_mqa, mirror, mem_used);
+    }
+    if (layer.wk_b) {
+        prepare_split_tensors( 2, ctx_split, layer.wk_b, layer.split_wk_b, split_heads, mem_used);
+    }
+    if (layer.wv_b) {
+        prepare_split_tensors( 2, ctx_split, layer.wv_b, layer.split_wv_b, split_heads, mem_used);
+    }
+
+    // Output projection: row-split, partial outputs all-reduced after.
+    if (layer.wo) {
+        prepare_split_tensors(0, ctx_split, layer.wo, layer.split_wo, split_wo_rows, mem_used);
+    }
+    if (layer.ffn_norm) {
+        prepare_split_tensors(-1, ctx_split, layer.ffn_norm, layer.split_ffn_norm, mirror, mem_used);
+    }
+}
+
 static void adjust_split(std::vector<float> & split, const std::vector<size_t> & mem_used, int max_gpu) {
    if (max_gpu < 1 || max_gpu >= int(split.size()) || split.size() != mem_used.size()) {
        return;
@ -4464,6 +4564,16 @@ bool create_tensors_helper::create_tensors() {
                }
            }

+            // MLA tensor distribution (DEEPSEEK2/GLM_DSA/MISTRAL4). Detect by arch + absence of wk
+            // since wkv_b can be null when the model was quantized by mainline llama.cpp.
+            if (layer.wo && !layer.wk &&
+                (model.arch == LLM_ARCH_DEEPSEEK2 ||
+                 model.arch == LLM_ARCH_GLM_DSA ||
+                 model.arch == LLM_ARCH_MISTRAL4)) {
+                distribute_mla_tensors_for_split_mode_graph(
+                    layer, hparams, cur_splits, mem_used, ctx_split, il);
+            }
+
            if (layer.ffn_down && layer.ffn_up && layer.ffn_gate) {
                bool use_split = split_tensors.find(layer.ffn_down) != split_tensors.end() &&
                                 split_tensors.find(layer.ffn_gate) != split_tensors.end() &&
--- a/src/llama-model.h
+++ b/src/llama-model.h
@ -219,6 +219,15 @@ struct llama_layer {
    llama_split_tensor split_sinks;
    llama_split_tensor split_wqkv_gate;

+    // MLA per-device shards (-sm graph for DEEPSEEK2/GLM_DSA/MISTRAL4).
+    llama_split_tensor split_wq_a;
+    llama_split_tensor split_wq_b;
+    llama_split_tensor split_wkv_a_mqa;
+    llama_split_tensor split_wk_b;
+    llama_split_tensor split_wv_b;
+    llama_split_tensor split_attn_q_a_norm;
+    llama_split_tensor split_attn_kv_a_norm;
+
    llama_split_tensor split_ssm_wqkv;
    llama_split_tensor split_ssm_wqkv_gate;
    llama_split_tensor split_ssm_in;
@ -374,6 +383,10 @@ struct llama_layer {
    std::unique_ptr<ggml_tensor> computed_wk_b;
    std::unique_ptr<ggml_tensor> computed_wv_b;
    std::unique_ptr<ggml_tensor> computed_wkv_b;
+
+    // Per-device replicas of computed wk_b/wv_b (-sm graph). Buffers owned via model.bufs.
+    std::vector<std::unique_ptr<ggml_tensor>> computed_wk_b_replicas;
+    std::vector<std::unique_ptr<ggml_tensor>> computed_wv_b_replicas;
 };

 struct llama_lora_adapter;
--- a/src/llama.cpp
+++ b/src/llama.cpp
@ -622,13 +622,14 @@ bool llama_context::update_cache_copies() {
            GGML_ASSERT(model.split_mode == LLAMA_SPLIT_MODE_GRAPH || model.split_mode == LLAMA_SPLIT_MODE_ATTN);
            GGML_ASSERT(model.splits.size() > 1);
            auto vl = !kv_self.v_l.empty() && kv_self.v_l[il] ? (ggml_split_tensor_t *)kv_self.v_l[il]->extra : nullptr;
-            GGML_ASSERT(kl && (!kv_self.v_l[il] || vl));
+            GGML_ASSERT(kl && (kv_self.v_l.empty() || !kv_self.v_l[il] || vl));
            if (vl) {
                GGML_ASSERT(kl->n_device == vl->n_device);
            }
            for (int id = 0; id < kl->n_device; ++id) {
                if (!kl->splits[id]) continue;
-                auto& c = cache_copies[2*model.splits.size()*il + 2*id + 0];
+                size_t idx = 2*model.splits.size()*il + 2*id + 0;
+                auto& c = cache_copies[idx];
                if (!c.cpy || c.cpy->op != GGML_OP_CPY || c.cpy->view_src != kl->splits[id]) {
                    return false;
                }
@ -800,6 +801,7 @@ static bool llama_kv_cache_init(
    bool is_mla_attn = model.arch == LLM_ARCH_DEEPSEEK2 || model.arch == LLM_ARCH_GLM_DSA || model.arch == LLM_ARCH_MISTRAL4;

    bool split_cache   = false;
+    bool replicate_mla = false;
    if ((model.split_mode == LLAMA_SPLIT_MODE_GRAPH || model.split_mode == LLAMA_SPLIT_MODE_ATTN) && !is_mla_attn && offload) {
        cache.split_k_l.reserve(n_layer);
        cache.split_v_l.reserve(n_layer);
@ -808,6 +810,10 @@ static bool llama_kv_cache_init(
        }
        split_cache = true;
    }
+    if ((model.split_mode == LLAMA_SPLIT_MODE_GRAPH || model.split_mode == LLAMA_SPLIT_MODE_ATTN) && is_mla_attn && offload) {
+        cache.replicated_k_l.reserve(n_layer);
+        replicate_mla = true;
+    }

    // count used buffer types
    std::map<ggml_backend_buffer_type_t, int> buft_layer_count;
@ -817,7 +823,7 @@ static bool llama_kv_cache_init(
        const int64_t n_mtp_first = n_layer - hparams.nextn_predict_layers;
        for (int64_t i = 0; i < n_layer; ++i) {
            const bool is_mtp_tail = qwen_mtp && i >= n_mtp_first;
-            if (split_cache && !is_mtp_tail) {
+            if ((split_cache || replicate_mla) && !is_mtp_tail) {
                buft_layer_count[model.buft_layer[i].buft_matrix]++;
                if (model.buft_layer[i].buft != model.buft_layer[i].buft_matrix) {
                    buft_layer_count[model.buft_layer[i].buft]++;
@ -835,7 +841,7 @@ static bool llama_kv_cache_init(
    for (auto & it : buft_layer_count) {
        int n_layers = it.second;
        size_t ctx_mem_size = 8u*n_layers*ggml_tensor_overhead();
-        if (split_cache) ctx_mem_size += 4*model.splits.size()*n_layers*ggml_tensor_overhead();
+        if (split_cache || replicate_mla) ctx_mem_size += 4*model.splits.size()*n_layers*ggml_tensor_overhead();
        struct ggml_init_params params = {
            /*.mem_size   =*/ ctx_mem_size,
            /*.mem_buffer =*/ NULL,
@ -910,7 +916,7 @@ static bool llama_kv_cache_init(
                model.arch == LLM_ARCH_QWEN35MOE) &&
                hparams.nextn_predict_layers > 0 && i >= (int)n_mtp_first_layer;
        //struct ggml_context * ctx = split_cache && !qnext_recurrent ? ctx_map.at(model.buft_layer[i].buft_matrix) : offload ? ctx_map.at(model.buft_layer[i].buft) : cache.ctxs.front();
-        struct ggml_context * ctx = (split_cache && !is_mtp_tail_layer) ? ctx_map.at(model.buft_layer[i].buft_matrix) : offload ? ctx_map.at(model.buft_layer[i].buft) : cache.ctxs.front();
+        struct ggml_context * ctx = ((split_cache || replicate_mla) && !is_mtp_tail_layer) ? ctx_map.at(model.buft_layer[i].buft_matrix) : offload ? ctx_map.at(model.buft_layer[i].buft) : cache.ctxs.front();
        ggml_tensor * k = nullptr;
        ggml_tensor * v = nullptr;
        ggml_tensor * s = nullptr;
@ -919,19 +925,40 @@ static bool llama_kv_cache_init(
            const uint32_t n_embd_head_qk_rope = hparams.n_rot;
            const uint32_t kv_lora_rank = hparams.n_lora_kv;
            //LLAMA_LOG_INFO("%s: layer %d: n_embd_head_qk_rope = %d, kv_lora_rank = %d\n", __func__, i, n_embd_head_qk_rope, kv_lora_rank);
-            if (cparams.flash_attn) {
-                ggml_tensor * kv = ggml_new_tensor_2d(ctx, cache.type_k, kv_lora_rank + n_embd_head_qk_rope, kv_size);
-                ggml_format_name(kv, "cache_k_l%d", i);
-                cache.k_l.push_back(kv);
-            } else {
-                auto kv_type = cparams.mla_attn == 1 ? cache.type_k : cache.type_v;
-                ggml_tensor * kv = ggml_new_tensor_2d(ctx, kv_type, kv_lora_rank + n_embd_head_qk_rope, kv_size);
-                ggml_format_name(kv, "cache_k_l%d", i);
-                cache.k_l.push_back(kv);
-                if (cparams.mla_attn == 1) {
-                    ggml_tensor * kvt = ggml_new_tensor_1d(ctx, cache.type_v, kv_lora_rank*kv_size);
-                    ggml_format_name(kvt, "cache_v_l%d", i);
-                    cache.v_l.push_back(kvt);
+            ggml_type primary_kv_type = cparams.flash_attn ? cache.type_k
+                                       : (cparams.mla_attn == 1 ? cache.type_k : cache.type_v);
+            ggml_tensor * kv = ggml_new_tensor_2d(ctx, primary_kv_type, kv_lora_rank + n_embd_head_qk_rope, kv_size);
+            ggml_format_name(kv, "cache_k_l%d", i);
+            cache.k_l.push_back(kv);
+            if (!cparams.flash_attn && cparams.mla_attn == 1) {
+                ggml_tensor * kvt = ggml_new_tensor_1d(ctx, cache.type_v, kv_lora_rank*kv_size);
+                ggml_format_name(kvt, "cache_v_l%d", i);
+                cache.v_l.push_back(kvt);
+            }
+            // Per-device replicas of the compressed latent KV cache (n_device from wo's split).
+            if (replicate_mla && !is_mtp_tail_layer) {
+                auto wo = model.layers[i].wo;
+                if (wo && wo->extra) {
+                    auto extra_wo = (const ggml_split_tensor_t *)wo->extra;
+                    int n_device = extra_wo->n_device;
+                    auto & repl_k_l = cache.replicated_k_l.emplace_back();
+                    repl_k_l.tensor_splits.resize(n_device, nullptr);
+                    for (int is = 0; is < n_device; ++is) {
+                        if (!extra_wo->splits[is]) continue;
+                        ggml_tensor * rkv = ggml_new_tensor_2d(ctx, primary_kv_type,
+                                kv_lora_rank + n_embd_head_qk_rope, kv_size);
+                        auto split_name = std::string("cache_k_l") + std::to_string(i) + '.' + std::to_string(is);
+                        ggml_set_name(rkv, split_name.c_str());
+                        repl_k_l.tensor_splits[is] = rkv;
+                        mem_split[is] += ggml_nbytes(rkv);
+                    }
+                    repl_k_l.ggml.n_device  = n_device;
+                    repl_k_l.ggml.split_dim = -1;
+                    repl_k_l.ggml.splits    = repl_k_l.tensor_splits.data();
+                    kv->extra = (void *)&repl_k_l.ggml;
+                } else {
+                    GGML_ABORT("MLA layer %d: wo lacks split metadata under -sm graph "
+                               "(distribute_mla_tensors_for_split_mode_graph not run?)", i);
                }
            }
            n_mla++;
@ -1086,8 +1113,9 @@ static bool llama_kv_cache_init(
            cache.bufs.push_back(buf);
        }
    }
-    if (split_cache) {
-        LLAMA_LOG_INFO("%s: KV cache size per device:\n", __func__);
+    if (split_cache || replicate_mla) {
+        LLAMA_LOG_INFO("%s: KV cache size per device%s:\n", __func__,
+                       replicate_mla ? " (MLA replicated)" : "");
        for (int i = 0; i < int(mem_split.size()); ++i) printf("    Device %d:  %g MiB\n", i, mem_split[i]/1024./1024.);
    }

@ -2306,24 +2334,107 @@ static void llm_prepare_mla(llama_model & model, int mla) {

            auto name = std::string{"blk."} + std::to_string(il) + ".attn_k_b.weight";

-            l.computed_wk_b = std::make_unique<ggml_tensor>(*wk_b);
-            l.computed_wk_b->buffer = ggml_backend_buft_alloc_buffer(ggml_backend_buffer_get_type(l.wkv_b->buffer), ggml_nbytes(wk_b));
-            l.computed_wk_b->data   = ggml_backend_buffer_get_base(l.computed_wk_b->buffer);
-            l.computed_wk_b->op = GGML_OP_NONE; // we absolutely need to do this, else the backend will attempt to find the parents
-                                                // of wk_b, which no longer exist, and will therefore crash.
-            for (int j = 0; j < GGML_MAX_SRC; ++j) l.computed_wk_b->src[j] = nullptr;
-            ggml_set_name(l.computed_wk_b.get(), name.c_str());
-            ggml_backend_buffer_set_usage(l.computed_wk_b->buffer, GGML_BACKEND_BUFFER_USAGE_WEIGHTS);
-            ggml_backend_tensor_set(l.computed_wk_b.get(), wk_b->data, 0, ggml_nbytes(wk_b));
-            if (ggml_backend_buffer_is_host(l.computed_wk_b->buffer)) {
-                iqk_modify_tensor(l.computed_wk_b.get());
-            }
+            // Per-head split wk_b/wv_b under -sm graph/attn so each rank's batched matmul
+            // reads only its share of heads (split_dim=2), mirroring prepare_split_tensors.
+            const bool tp_replicate =
+                (model.split_mode == LLAMA_SPLIT_MODE_GRAPH || model.split_mode == LLAMA_SPLIT_MODE_ATTN)
+                && l.wo && l.wo->extra;

-            l.wk_b = l.computed_wk_b.get();
-            model.tensors_by_name.push_back(std::make_pair(name, l.wk_b));
+            auto materialize = [&](ggml_tensor * source,
+                                   std::unique_ptr<ggml_tensor> & computed,
+                                   std::vector<std::unique_ptr<ggml_tensor>> & replicas,
+                                   llama_split_tensor & split,
+                                   const std::string & tname) -> ggml_tensor * {
+                if (tp_replicate) {
+                    auto wo_split = (const ggml_split_tensor_t *)l.wo->extra;
+                    const int n_device = wo_split->n_device;
+                    const int64_t n_embd_head_v_full = hparams.n_embd_head_v_full;

-            printf("Computed %s as %d x %d x %d of type %s and stored in buffer %s\n", name.c_str(), (int)wk_b->ne[0], (int)wk_b->ne[1], (int)wk_b->ne[2],
-                    ggml_type_name(wk_b->type), ggml_backend_buffer_name(l.computed_wk_b->buffer));
+                    std::vector<int> head_offsets(n_device + 1, 0);
+                    for (int idx = 0; idx < n_device; ++idx) {
+                        int n_h_id = 0;
+                        if (wo_split->splits[idx]) {
+                            n_h_id = (int)(wo_split->splits[idx]->ne[0] / n_embd_head_v_full);
+                        }
+                        head_offsets[idx + 1] = head_offsets[idx] + n_h_id;
+                    }
+
+                    computed = std::make_unique<ggml_tensor>(*source);
+                    computed->buffer = nullptr;
+                    computed->data   = nullptr;
+                    computed->op = GGML_OP_NONE;
+                    for (int j = 0; j < GGML_MAX_SRC; ++j) computed->src[j] = nullptr;
+                    ggml_set_name(computed.get(), tname.c_str());
+
+                    replicas.resize(n_device);
+                    split.tensor_splits.assign(n_device, nullptr);
+
+                    const size_t head_block_bytes = source->nb[2];
+
+                    for (int id = 0; id < n_device; ++id) {
+                        if (!wo_split->splits[id] || !wo_split->splits[id]->buffer) continue;
+                        const int head_offset  = head_offsets[id];
+                        const int n_head_local = head_offsets[id + 1] - head_offset;
+                        if (n_head_local <= 0) continue;
+
+                        const size_t slice_bytes = (size_t)n_head_local * head_block_bytes;
+                        auto dev_buft = ggml_backend_buffer_get_type(wo_split->splits[id]->buffer);
+                        auto dev_buf  = ggml_backend_buft_alloc_buffer(dev_buft, slice_bytes);
+                        ggml_backend_buffer_set_usage(dev_buf, GGML_BACKEND_BUFFER_USAGE_WEIGHTS);
+                        model.bufs.push_back(dev_buf);
+
+                        replicas[id] = std::make_unique<ggml_tensor>(*source);
+                        auto rep = replicas[id].get();
+                        rep->ne[2] = n_head_local;
+                        rep->nb[3] = rep->nb[2] * (size_t)rep->ne[2];
+                        rep->buffer = dev_buf;
+                        rep->data   = ggml_backend_buffer_get_base(dev_buf);
+                        rep->op = GGML_OP_NONE;
+                        for (int j = 0; j < GGML_MAX_SRC; ++j) rep->src[j] = nullptr;
+                        rep->view_src = nullptr;
+                        rep->view_offs = 0;
+                        rep->extra = nullptr;
+                        ggml_set_name(rep, (tname + "." + std::to_string(id)).c_str());
+
+                        const uint8_t * src_bytes = (const uint8_t *)source->data + (size_t)head_offset * head_block_bytes;
+                        ggml_backend_tensor_set(rep, src_bytes, 0, slice_bytes);
+                        if (ggml_backend_buffer_is_host(rep->buffer)) {
+                            iqk_modify_tensor(rep);
+                        }
+                        split.tensor_splits[id] = rep;
+                    }
+
+                    split.ggml.n_device  = n_device;
+                    split.ggml.split_dim = 2;
+                    split.ggml.splits    = split.tensor_splits.data();
+                    computed->extra = (void *)&split.ggml;
+
+                    printf("Computed %s as %d x %d x %d of type %s, split across %d devices on dim=2\n",
+                            tname.c_str(), (int)source->ne[0], (int)source->ne[1], (int)source->ne[2],
+                            ggml_type_name(source->type), n_device);
+                } else {
+                    computed = std::make_unique<ggml_tensor>(*source);
+                    computed->buffer = ggml_backend_buft_alloc_buffer(ggml_backend_buffer_get_type(l.wkv_b->buffer), ggml_nbytes(source));
+                    computed->data   = ggml_backend_buffer_get_base(computed->buffer);
+                    // GGML_OP_NONE so the backend doesn't try to find the (now-freed) parents of source.
+                    computed->op = GGML_OP_NONE;
+                    for (int j = 0; j < GGML_MAX_SRC; ++j) computed->src[j] = nullptr;
+                    ggml_set_name(computed.get(), tname.c_str());
+                    ggml_backend_buffer_set_usage(computed->buffer, GGML_BACKEND_BUFFER_USAGE_WEIGHTS);
+                    ggml_backend_tensor_set(computed.get(), source->data, 0, ggml_nbytes(source));
+                    if (ggml_backend_buffer_is_host(computed->buffer)) {
+                        iqk_modify_tensor(computed.get());
+                    }
+
+                    printf("Computed %s as %d x %d x %d of type %s and stored in buffer %s\n",
+                            tname.c_str(), (int)source->ne[0], (int)source->ne[1], (int)source->ne[2],
+                            ggml_type_name(source->type), ggml_backend_buffer_name(computed->buffer));
+                }
+                model.tensors_by_name.push_back(std::make_pair(tname, computed.get()));
+                return computed.get();
+            };
+
+            l.wk_b = materialize(wk_b, l.computed_wk_b, l.computed_wk_b_replicas, l.split_wk_b, name);

            ggml_graph_clear(graph);
            auto wv_b = ggml_cont(ctx, ggml_view_3d(ctx, &wkv_b, kv_lora_rank, n_embd_head_v, n_head,
@ -2338,30 +2449,13 @@ static void llm_prepare_mla(llama_model & model, int mla) {

            name = std::string{"blk."} + std::to_string(il) + ".attn_v_b.weight";

-            l.computed_wv_b = std::make_unique<ggml_tensor>(*wv_b);
-            l.computed_wv_b->buffer = ggml_backend_buft_alloc_buffer(ggml_backend_buffer_get_type(l.wkv_b->buffer), ggml_nbytes(wv_b));
-            l.computed_wv_b->data   = ggml_backend_buffer_get_base(l.computed_wv_b->buffer);
-            l.computed_wv_b->op = GGML_OP_NONE; // we absolutely need to do this, else the backend will attempt to find the parents
-                                                // of wk_b, which no longer exist, and will therefore crash.
-            for (int j = 0; j < GGML_MAX_SRC; ++j) l.computed_wv_b->src[j] = nullptr;
-            ggml_set_name(l.computed_wv_b.get(), name.c_str());
-            ggml_backend_buffer_set_usage(l.computed_wv_b->buffer, GGML_BACKEND_BUFFER_USAGE_WEIGHTS);
-            ggml_backend_tensor_set(l.computed_wv_b.get(), wv_b->data, 0, ggml_nbytes(wv_b));
-            if (ggml_backend_buffer_is_host(l.computed_wv_b->buffer)) {
-                iqk_modify_tensor(l.computed_wv_b.get());
-            }
-
-            l.wv_b = l.computed_wv_b.get();
-            model.tensors_by_name.push_back(std::make_pair(name, l.wv_b));
-
-            printf("Computed %s as %d x %d x %d of type %s and stored in buffer %s\n", name.c_str(), (int)wv_b->ne[0], (int)wv_b->ne[1], (int)wv_b->ne[2],
-                    ggml_type_name(wv_b->type), ggml_backend_buffer_name(l.computed_wv_b->buffer));
+            l.wv_b = materialize(wv_b, l.computed_wv_b, l.computed_wv_b_replicas, l.split_wv_b, name);

            ggml_graph_clear(graph);
        }
        ggml_free(ctx);
    }
-    if (mla == 1) return;
+    if (mla == 1 || model.split_mode == LLAMA_SPLIT_MODE_GRAPH) return;

    n_to_compute = 0;
    for (auto& l : model.layers) {
@ -2608,6 +2702,9 @@ static bool is_model_split_supported(const llama_model & model) {
        LLM_ARCH_QWEN35,
        LLM_ARCH_QWEN35MOE,
        LLM_ARCH_GEMMA4,
+        LLM_ARCH_DEEPSEEK2,
+        LLM_ARCH_GLM_DSA,
+        LLM_ARCH_MISTRAL4,
    };
    auto it =  k_supported.find(model.arch);
    return it != k_supported.end();
@ -2878,6 +2975,12 @@ static bool llm_load_tensors(
        const bool unsupported_gemma_split =
            model.arch == LLM_ARCH_GEMMA4_MTP ||
            (model.arch == LLM_ARCH_GEMMA4 && hparams.n_embd_per_layer > 0);
+        const bool is_mla_arch =
+            model.arch == LLM_ARCH_DEEPSEEK2 ||
+            model.arch == LLM_ARCH_GLM_DSA ||
+            model.arch == LLM_ARCH_MISTRAL4;
+        const bool incompatible_loader_opts = is_mla_arch &&
+            (ml.ncmoe > 0 || ml.repack_tensors || ml.merge_up_gate_exps || ml.tensor_buft_overrides);

        if (unsupported_gemma_split) {
            LLAMA_LOG_WARN("\n=========================================================\n");
@ -2887,6 +2990,16 @@ static bool llm_load_tensors(
            LLAMA_LOG_WARN("  => changing split mode to 'layer'\n");
            LLAMA_LOG_WARN("===========================================================\n\n");
            split_mode = LLAMA_SPLIT_MODE_LAYER;
+        } else if (incompatible_loader_opts) {
+            const char * bad_flag = ml.ncmoe > 0           ? "-ncmoe | --n-cpu-moe"
+                                  : ml.repack_tensors      ? "-rtr | --run-time-repack"
+                                  : ml.merge_up_gate_exps  ? "-muge | --merge-up-gate-experts"
+                                                           : "-ot | --override-tensor";
+            LLAMA_LOG_WARN("\n=======================================================\n");
+            LLAMA_LOG_WARN("Split mode 'graph' is not compatible with %s\n", bad_flag);
+            LLAMA_LOG_WARN("  => changing split mode to 'layer'\n");
+            LLAMA_LOG_WARN("=======================================================\n\n");
+            split_mode = LLAMA_SPLIT_MODE_LAYER;
        } else if (!is_model_split_supported(model)) {
            LLAMA_LOG_WARN("\n=======================================================\n");
            LLAMA_LOG_WARN("Split mode 'graph' is not supported for this model\n");
@ -3301,7 +3414,8 @@ static bool llm_load_tensors(

    ml.done_getting_tensors();

-    ml.init_mappings(!defer_expert_mmap, use_mlock ? &model.mlock_mmaps : nullptr, ml.use_thp);
+    // --dry-run skips MAP_POPULATE/WILLNEED — tensor data is never read.
+    ml.init_mappings(!defer_expert_mmap && !dry_run, use_mlock ? &model.mlock_mmaps : nullptr, ml.use_thp);
    model.mappings.reserve(ml.mappings.size());

    // create the backend buffers
@ -3444,8 +3558,13 @@ static bool llm_load_tensors(
        }
    }

-    if (model.arch == LLM_ARCH_DEEPSEEK2 || model.arch == LLM_ARCH_GLM_DSA || model.arch == LLM_ARCH_MISTRAL4) {
-        llm_prepare_mla(model, mla_attn);
+    if ((model.arch == LLM_ARCH_DEEPSEEK2 || model.arch == LLM_ARCH_GLM_DSA || model.arch == LLM_ARCH_MISTRAL4)) {
+        // -sm graph/attn needs wk_b->extra populated; run prepare even under dry-run.
+        const bool graph_mode = (model.split_mode == LLAMA_SPLIT_MODE_GRAPH ||
+                                 model.split_mode == LLAMA_SPLIT_MODE_ATTN);
+        if (!dry_run || graph_mode) {
+            llm_prepare_mla(model, mla_attn);
+        }
    }
    if (model.arch == LLM_ARCH_GEMMA4) {
        llm_scale_gate_inp_s(model, use_mmap_buffer);
@ -5767,6 +5886,7 @@ struct llama_context_params llama_context_default_params() {
        /*.type_k                      =*/ GGML_TYPE_F16,
        /*.type_v                      =*/ GGML_TYPE_F16,
        /*.type_reduce                 =*/ GGML_TYPE_F16,
+        /*.type_graph_attn             =*/ GGML_TYPE_F16,
        /*.type_first_k                =*/ GGML_TYPE_F16,
        /*.type_last_k                 =*/ GGML_TYPE_F16,
        /*.type_first_v                =*/ GGML_TYPE_F16,
@ -6187,6 +6307,11 @@ struct llama_context * llama_init_from_model(
    cparams.worst_graph_tokens = params.worst_case_tokens;

    cparams.reduce_type      = params.type_reduce;
+    cparams.graph_attn_precision = params.type_graph_attn;
+    if (cparams.graph_attn_precision != GGML_TYPE_F16 && cparams.graph_attn_precision != GGML_TYPE_F32) {
+        throw std::runtime_error(format("--graph-attn-precision must be f16 or f32, got %s",
+                                        ggml_type_name(cparams.graph_attn_precision)));
+    }
    cparams.pooling_type     = params.pooling_type;

    cparams.n_ctx            = params.n_ctx           == 0    ? hparams.n_ctx_train           : params.n_ctx;