mirror of
https://github.com/ikawrakow/ik_llama.cpp.git
synced 2026-06-28 04:30:15 -05:00
MLA tensor parallelism under -sm graph (DEEPSEEK2/GLM_DSA/MISTRAL4) (#1821)
* MLA tensor parallelism under -sm graph (DEEPSEEK2/GLM_DSA/MISTRAL4) Extends -sm graph (split-mode graph) to MLA-style attention across the DEEPSEEK2, GLM_DSA, and MISTRAL4 architectures. Previously these archs fell back to -sm layer regardless of the user's flag. Implementation: - Per-rank attention build in build_deepseek2_tp_attention with view-sliced FlashAttention, split-buffer output projection, and ggml_reduce across devices - wk_b / wv_b absorbed weights replicated per device via materialize() in llm_prepare_mla (these can't live in a split buffer) - KV cache replication path (replicated_k_l) for graph-mode TP - distribute_mla_tensors_for_split_mode_graph routes attention/norm tensors into ctx_split; expert tensors stay per-layer - Implements ggml_backend_cuda_split_buffer_get_tensor for the replicated / row-split / col-split inverse paths - Early-reject guard in src/llama.cpp that auto-downgrades -sm graph to -sm layer (with a warning) when incompatible loader flags are set: -ncmoe, -cmoe, -ot, -rtr, -muge New CLI flag: - -gap | --graph-attn-precision <f16|f32> (default f16) See the PR description for the full validation matrix (3 archs x 2/4/8 GPU counts), perf numbers, VRAM accounting, and known limitations. * Some tweaks * materialize lambda: per-head split for graph-mode tp_replicate 7dd19e19 changed wk_b/wv_b distribution from mirror to per-head split (split_dim=2) via prepare_split_tensors. That path only fires when wk_b/wv_b are loaded from GGUF. Models that store only wkv_b in GGUF derive wk_b/wv_b at load via llm_prepare_mla, going through the materialize lambda, which was untouched and still produced mirror replicas (split_dim=-1, full n_head per device). build_deepseek2_tp_attention now does mul_mat(wk_b_local, q_nope_perm) without the prior view_3d slice, so a mirror replica passes an n_head tensor where the kernel expects n_head_local. Result: silent SIGSEGV right after model load. Mirror logic in materialize is replaced with the same per-head split as prepare_split_tensors: head_offsets derived from wo split, each rank gets a tensor with ne[2]=n_head_local, data copied from the appropriate source byte slice. Singular `computed` tensor keeps full metadata for tensors_by_name lookups. Tested: 8x3090, -sm graph -mla 3 -fa on now boots cleanly and sweep-benches without crash. Log confirms new path: "Computed blk.X.attn_k_b.weight ... split across N devices on dim=2". * cleanup: indent fix + remove dead view_3d slicing and debug printf - build_deepseek2.cpp: re-indent the self_attention block in build_deepseek2_layer_attention (lines 253-670). Block was at column 0 inside a function body; now at the expected 4/8-space indent. - build_deepseek2.cpp: drop the commented-out view_3d slicing and debug printfs left over after 7dd19e19's switch to direct mul_mat on per-rank wk_b_local / wv_b_local. Update the stale 'wk_b is replicated (split_dim=-1)' comment to match the new split_dim=2 reality. - ggml-cuda.cu: remove the leftover debug printf in ggml_backend_cuda_split_buffer_get_tensor. No behavior change. Verified with a clean rebuild and DSV2.5 + GLM-4.7-Flash sweep-bench runs. * llm_load_tensors: gate incompatible-flag warning to MLA archs The -ncmoe / -rtr / -muge / -ot warning under -sm graph currently fires for all archs that support graph mode. That's an over-reach: the incompatibility is specific to the MLA TP paths (DEEPSEEK2, GLM_DSA, MISTRAL4) — Gemma4 graph mode existed pre-PR and works with those flags. Gate the warning to MLA archs only. Also refreshes two stale comments left over from the wk_b/wv_b mirror -> per-head-split rewrite: - src/llama.cpp llm_prepare_mla: "Replicate wk_b/wv_b ..." now reads "Per-head split wk_b/wv_b ..." to match what the materialize lambda actually does post-823a39e2. - src/llama-load-tensors.cpp distribute_mla_tensors_for_split_mode_graph: drop the wkv_b row-split mention (wkv_b is no longer created under graph mode after 7dd19e19) and correct the wk_b/wv_b distribution description (per-head split, not per-device replicated). --------- Co-authored-by: Kawrakow <iwankawrakow@gmail.com>
This commit is contained in:
parent
104846ddee
commit
c07a052315
@ -2074,6 +2074,11 @@ bool gpt_params_find_arg(int argc, char ** argv, const std::string & arg, gpt_pa
|
||||
params.reduce_type = argv[i];
|
||||
return true;
|
||||
}
|
||||
if (arg == "-gap" || arg == "--graph-attn-precision") {
|
||||
CHECK_ARG
|
||||
params.graph_attn_precision = argv[i];
|
||||
return true;
|
||||
}
|
||||
if (arg == "--numa") {
|
||||
CHECK_ARG
|
||||
std::string value(argv[i]);
|
||||
@ -2885,6 +2890,7 @@ void gpt_params_print_usage(int /*argc*/, char ** argv, const gpt_params & param
|
||||
options.push_back({ "*", "-smf16, --split-mode-f16,", "Use f16 for data exchange between GPUs (default: %d)", true});
|
||||
options.push_back({ "*", "-smf32, --split-mode-f32,", "Use f32 for data exchange between GPUs (default: %d)", false});
|
||||
options.push_back({ "*", "-grt, --graph-reduce-type", "Type for data exchange between GPUs (default: %s)", "f32"});
|
||||
options.push_back({ "*", "-gap, --graph-attn-precision", "Flash-attn precision under -sm graph (default: %s)", "f16"});
|
||||
options.push_back({ "*", "-smgs, --split-mode-graph-scheduling,", "Force Split Mode Graph Scheduling (default: %d)", params.split_mode_graph_scheduling});
|
||||
options.push_back({ "*", "-sas, --scheduler_async,", "Async evaluation of compute graphs: %d)", params.scheduler_async});
|
||||
options.push_back({ "*", "-vq, --validate-quants", "validate quantized data while loading the model (default: %d)", params.validate_quants});
|
||||
@ -4077,6 +4083,7 @@ struct llama_context_params common_context_params_to_llama(const gpt_params & pa
|
||||
cparams.type_k = kv_cache_type_from_str(params.cache_type_k);
|
||||
cparams.type_v = kv_cache_type_from_str(params.cache_type_v);
|
||||
cparams.type_reduce = ggml_type_from_str(params.reduce_type);
|
||||
cparams.type_graph_attn = ggml_type_from_str(params.graph_attn_precision);
|
||||
if (!cparams.flash_attn && ggml_is_quantized(cparams.type_v)) {
|
||||
throw std::runtime_error("Quantized V cache cannot be used without flash attention");
|
||||
}
|
||||
|
||||
@ -433,6 +433,7 @@ struct gpt_params {
|
||||
std::string cache_type_v = "f16"; // KV cache data type for the V
|
||||
|
||||
std::string reduce_type = "f16";
|
||||
std::string graph_attn_precision = "f16";
|
||||
|
||||
std::string type_k_first = "f16";
|
||||
std::string type_k_last = "f16";
|
||||
|
||||
@ -240,7 +240,7 @@
|
||||
// if you need to load more than 64 model shards.
|
||||
#define GGML_MAX_CONTEXTS 64
|
||||
#endif
|
||||
#define GGML_MAX_SRC 10
|
||||
#define GGML_MAX_SRC 12
|
||||
#ifndef GGML_MAX_NAME
|
||||
#define GGML_MAX_NAME 64
|
||||
#endif
|
||||
|
||||
@ -1127,6 +1127,18 @@ GGML_CALL static void ggml_backend_cuda_split_buffer_set_tensor([[maybe_unused]]
|
||||
}
|
||||
}
|
||||
}
|
||||
else if (extra->split_dim == 2) {
|
||||
size_t cur_offset = 0;
|
||||
for (int i = 0; i < extra->n_device; ++i) {
|
||||
auto split = extra->splits[i];
|
||||
if (!split) continue;
|
||||
ggml_cuda_set_device(i);
|
||||
auto size = ggml_nbytes(split);
|
||||
const char * buf_host = (const char *)data + cur_offset;
|
||||
CUDA_CHECK(cudaMemcpyAsync(split->data, buf_host, size, cudaMemcpyHostToDevice, cudaStreamPerThread));
|
||||
cur_offset += size;
|
||||
}
|
||||
}
|
||||
else {
|
||||
fprintf(stderr, "%s: not implemented for split dim %d\n", __func__, extra->split_dim == 0);
|
||||
GGML_ABORT("fatal error");
|
||||
@ -1140,12 +1152,125 @@ GGML_CALL static void ggml_backend_cuda_split_buffer_set_tensor([[maybe_unused]]
|
||||
}
|
||||
|
||||
GGML_CALL static void ggml_backend_cuda_split_buffer_get_tensor([[maybe_unused]] ggml_backend_buffer_t buffer, const ggml_tensor * tensor,
|
||||
[[maybe_unused]] void * data, size_t offset, size_t size) {
|
||||
// split tensors must always be set in their entirety at once
|
||||
void * data, size_t offset, size_t size) {
|
||||
// split tensors must always be read in their entirety at once
|
||||
GGML_ASSERT(offset == 0);
|
||||
GGML_ASSERT(size == ggml_nbytes(tensor));
|
||||
|
||||
GGML_ABORT("not implemented");
|
||||
if (!tensor->extra) return;
|
||||
|
||||
// Inverse of split_buffer_set_tensor; refuses paths with no defined inverse.
|
||||
auto extra = (ggml_split_tensor_t *)tensor->extra;
|
||||
GGML_ASSERT(extra->n_device <= ggml_backend_cuda_get_device_count());
|
||||
|
||||
// Repacked types are block-de-interleaved by set_tensor; no runtime inverse.
|
||||
{
|
||||
const ggml_type t = tensor->type;
|
||||
const bool is_repacked =
|
||||
t == GGML_TYPE_Q4_0_R8 || t == GGML_TYPE_Q5_0_R4 || t == GGML_TYPE_Q8_0_R8 ||
|
||||
t == GGML_TYPE_Q2_K_R4 || t == GGML_TYPE_Q3_K_R4 || t == GGML_TYPE_Q4_K_R4 ||
|
||||
t == GGML_TYPE_Q5_K_R4 || t == GGML_TYPE_Q6_K_R4 || t == GGML_TYPE_IQ4_NL_R4 ||
|
||||
t == GGML_TYPE_IQ4_XS_R8 || t == GGML_TYPE_Q6_0_R4;
|
||||
if (is_repacked) {
|
||||
GGML_ABORT("%s: get_tensor of repacked type %s is not invertible",
|
||||
__func__, ggml_type_name(t));
|
||||
}
|
||||
}
|
||||
|
||||
// Explicit-ranges form (non-contiguous expert assignments) is not invertible.
|
||||
void * extra_ptr = nullptr;
|
||||
memcpy(&extra_ptr, tensor->op_params, sizeof(extra_ptr));
|
||||
if (extra_ptr) {
|
||||
GGML_ABORT("%s: get_tensor with explicit ranges is not implemented", __func__);
|
||||
}
|
||||
|
||||
if (extra->split_dim < 0) {
|
||||
// Replicated: read from first present device.
|
||||
GGML_ASSERT(ggml_is_contiguous(tensor));
|
||||
for (int i = 0; i < extra->n_device; ++i) {
|
||||
auto split = extra->splits[i];
|
||||
if (!split) continue;
|
||||
GGML_ASSERT(split->type == tensor->type);
|
||||
ggml_cuda_set_device(i);
|
||||
CUDA_CHECK(cudaMemcpyAsync(data, split->data, ggml_nbytes(tensor),
|
||||
cudaMemcpyDeviceToHost, cudaStreamPerThread));
|
||||
CUDA_CHECK(cudaStreamSynchronize(cudaStreamPerThread));
|
||||
return;
|
||||
}
|
||||
GGML_ABORT("%s: no device holds a copy of the replicated tensor", __func__);
|
||||
}
|
||||
else if (extra->split_dim == 0) {
|
||||
// Row-split (concat along ne[0]).
|
||||
GGML_ASSERT(ggml_is_contiguous(tensor));
|
||||
auto tt = ggml_internal_get_type_traits(tensor->type);
|
||||
GGML_ASSERT(tt.row_meta_size == 0);
|
||||
std::vector<char> host_buffer;
|
||||
int64_t ne0_acc = 0;
|
||||
for (int i = 0; i < extra->n_device; ++i) {
|
||||
auto split = extra->splits[i];
|
||||
if (!split) continue;
|
||||
GGML_ASSERT(split->type == tensor->type);
|
||||
GGML_ASSERT(split->ne[0] % tt.blck_size == 0);
|
||||
const size_t split_row_size = ggml_row_size(split->type, split->ne[0]);
|
||||
const size_t dev_bytes = (size_t)ggml_nrows(split) * split_row_size;
|
||||
if (host_buffer.size() < dev_bytes) host_buffer.resize(dev_bytes);
|
||||
ggml_cuda_set_device(i);
|
||||
CUDA_CHECK(cudaMemcpyAsync(host_buffer.data(), split->data, dev_bytes,
|
||||
cudaMemcpyDeviceToHost, cudaStreamPerThread));
|
||||
CUDA_CHECK(cudaStreamSynchronize(cudaStreamPerThread));
|
||||
const size_t source_offset = (ne0_acc / tt.blck_size) * tt.type_size;
|
||||
for (int64_t i02 = 0; i02 < split->ne[2]; ++i02) {
|
||||
for (int64_t i01 = 0; i01 < split->ne[1]; ++i01) {
|
||||
const char * src = host_buffer.data() + (i02*split->ne[1] + i01) * split_row_size;
|
||||
char * dst = (char *)data + i02*tensor->nb[2] + i01*tensor->nb[1] + source_offset;
|
||||
memcpy(dst, src, split_row_size);
|
||||
}
|
||||
}
|
||||
ne0_acc += split->ne[0];
|
||||
}
|
||||
}
|
||||
else if (extra->split_dim == 1) {
|
||||
// Column/ne[1] split.
|
||||
const size_t row_size = ggml_row_size(tensor->type, tensor->ne[0]);
|
||||
if (tensor->ne[2] > 1) {
|
||||
std::vector<char> host_buffer;
|
||||
int64_t ne1_acc = 0;
|
||||
for (int i = 0; i < extra->n_device; ++i) {
|
||||
auto split = extra->splits[i];
|
||||
if (!split) continue;
|
||||
const size_t dev_bytes = ggml_nbytes(split);
|
||||
if (host_buffer.size() < dev_bytes) host_buffer.resize(dev_bytes);
|
||||
ggml_cuda_set_device(i);
|
||||
CUDA_CHECK(cudaMemcpyAsync(host_buffer.data(), split->data, dev_bytes,
|
||||
cudaMemcpyDeviceToHost, cudaStreamPerThread));
|
||||
CUDA_CHECK(cudaStreamSynchronize(cudaStreamPerThread));
|
||||
for (int64_t i02 = 0; i02 < split->ne[2]; ++i02) {
|
||||
const char * src = host_buffer.data() + i02 * split->ne[1] * row_size;
|
||||
char * dst = (char *)data + i02*tensor->nb[2] + ne1_acc*tensor->nb[1];
|
||||
memcpy(dst, src, split->ne[1] * row_size);
|
||||
}
|
||||
ne1_acc += split->ne[1];
|
||||
}
|
||||
} else {
|
||||
size_t cur_offset = 0;
|
||||
for (int i = 0; i < extra->n_device; ++i) {
|
||||
auto split = extra->splits[i];
|
||||
if (!split) continue;
|
||||
ggml_cuda_set_device(i);
|
||||
const size_t dev_bytes = ggml_nbytes(split);
|
||||
CUDA_CHECK(cudaMemcpyAsync((char *)data + cur_offset, split->data, dev_bytes,
|
||||
cudaMemcpyDeviceToHost, cudaStreamPerThread));
|
||||
cur_offset += dev_bytes;
|
||||
}
|
||||
for (int i = 0; i < extra->n_device; ++i) {
|
||||
if (!extra->splits[i]) continue;
|
||||
CUDA_CHECK(cudaStreamSynchronize(cudaStreamPerThread));
|
||||
}
|
||||
}
|
||||
}
|
||||
else {
|
||||
GGML_ABORT("%s: not implemented for split_dim %d", __func__, extra->split_dim);
|
||||
}
|
||||
}
|
||||
|
||||
GGML_CALL static void ggml_backend_cuda_split_buffer_clear(ggml_backend_buffer_t buffer, uint8_t value) {
|
||||
@ -1847,8 +1972,9 @@ static void ggml_cuda_op_mul_mat(
|
||||
}
|
||||
|
||||
const int64_t src1_col_stride = ne11;
|
||||
if (quantization_done && ne11 == 1 && ne12 > 1 && ne13 == 1 && ne02 == ne12 && ne02 == dst->ne[2]) {
|
||||
//printf("invoking fast path for %s x %s\n", src0->name, src1->name);
|
||||
// split-buffer src0 has data == NULL; per-device dispatch happens in the slow path below.
|
||||
const bool src0_is_split = src0->buffer && ggml_backend_buft_is_cuda_split(src0->buffer->buft);
|
||||
if (quantization_done && ne11 == 1 && ne12 > 1 && ne13 == 1 && ne02 == ne12 && ne02 == dst->ne[2] && !src0_is_split) {
|
||||
int id = ctx.device;
|
||||
char * src0_dd_i = dev[id].src0_dd;
|
||||
float * src1_ddf_i = dev[id].src1_ddf;
|
||||
@ -4452,6 +4578,16 @@ GGML_CALL static enum ggml_status ggml_backend_cuda_graph_compute(ggml_backend_t
|
||||
|
||||
GGML_CALL static bool ggml_backend_cuda_supports_op(ggml_backend_t backend, const ggml_tensor * op) {
|
||||
ggml_backend_cuda_context * cuda_ctx = (ggml_backend_cuda_context *) backend->context;
|
||||
|
||||
// Non-mul_mat ops can't read a split-buffer parent (no data ptr); let the scheduler fall back to CPU.
|
||||
if (op->op != GGML_OP_MUL_MAT && op->op != GGML_OP_MUL_MAT_ID) {
|
||||
for (int i = 0; i < GGML_MAX_SRC; i++) {
|
||||
if (op->src[i] && op->src[i]->buffer && ggml_backend_buft_is_cuda_split(op->src[i]->buffer->buft)) {
|
||||
return false;
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
switch (op->op) {
|
||||
case GGML_OP_UNARY:
|
||||
switch (ggml_get_unary_op(op)) {
|
||||
|
||||
@ -2132,7 +2132,7 @@ static void ggml_cuda_flash_attn_ext_mma_f16_switch_ncols1(ggml_backend_cuda_con
|
||||
// return;
|
||||
//} else {
|
||||
if (Q->ne[1] <= 8/ncols2) {
|
||||
if constexpr (DKQ == 512) {
|
||||
if constexpr (DKQ == 512 || DKQ == 576) {
|
||||
ggml_cuda_flash_attn_ext_mma_f16_case<DKQ, DV, 2, ncols2>(ctx, dst);
|
||||
} else {
|
||||
ggml_cuda_flash_attn_ext_mma_f16_case<DKQ, DV, 8/ncols2, ncols2>(ctx, dst);
|
||||
@ -2275,8 +2275,14 @@ void ggml_cuda_flash_attn_ext_mma_new(ggml_backend_cuda_context & ctx, ggml_tens
|
||||
}
|
||||
return;
|
||||
}
|
||||
if (gqa_ratio % 12 == 0 && Q->ne[1] <= 4 && K->ne[1] >= 2048) {
|
||||
ggml_cuda_flash_attn_ext_mma_f16_case<576, 512, 1, 16>(ctx, dst);
|
||||
return;
|
||||
}
|
||||
if (gqa_ratio % 16 == 0) {
|
||||
ggml_cuda_flash_attn_ext_mma_f16_switch_ncols1<576, 512, 16>(ctx, dst);
|
||||
} else if (gqa_ratio % 8 == 0) {
|
||||
ggml_cuda_flash_attn_ext_mma_f16_switch_ncols1<576, 512, 8>(ctx, dst);
|
||||
} else if (gqa_ratio % 4 == 0) {
|
||||
ggml_cuda_flash_attn_ext_mma_f16_switch_ncols1<576, 512, 4>(ctx, dst);
|
||||
} else {
|
||||
|
||||
@ -464,6 +464,7 @@ extern "C" {
|
||||
enum ggml_type type_k; // data type for K cache [EXPERIMENTAL]
|
||||
enum ggml_type type_v; // data type for V cache [EXPERIMENTAL]
|
||||
enum ggml_type type_reduce; // data type for reduce operations
|
||||
enum ggml_type type_graph_attn; // flash-attn precision under -sm graph
|
||||
enum ggml_type type_k_first;
|
||||
enum ggml_type type_k_last;
|
||||
enum ggml_type type_v_first;
|
||||
|
||||
File diff suppressed because it is too large
Load Diff
@ -258,6 +258,25 @@ struct llm_build_context {
|
||||
|
||||
ggml_cgraph * build_deepseek2();
|
||||
|
||||
ggml_tensor * build_deepseek2_tp_attention(
|
||||
ggml_cgraph * gf, int il,
|
||||
ggml_tensor * inpL,
|
||||
ggml_tensor * KQ_mask, ggml_tensor * inp_pos,
|
||||
ggml_tensor * rope_cache,
|
||||
float kq_scale, float attn_factor_scaled,
|
||||
bool use_f32_attn_precision,
|
||||
bool is_lite);
|
||||
|
||||
ggml_tensor * build_deepseek2_layer_attention(
|
||||
ggml_cgraph * gf, int il,
|
||||
ggml_tensor * inpL,
|
||||
ggml_tensor * KQ_mask, ggml_tensor * inp_pos,
|
||||
ggml_tensor * rope_cache,
|
||||
float kq_scale, float attn_factor_scaled,
|
||||
bool use_f32_attn_precision,
|
||||
bool is_lite,
|
||||
bool pp_opt);
|
||||
|
||||
ggml_cgraph * build_glm4_moe();
|
||||
|
||||
ggml_cgraph * build_bitnet();
|
||||
|
||||
@ -66,6 +66,9 @@ struct llama_kv_cache {
|
||||
std::vector<llama_split_tensor> split_v_l;
|
||||
std::vector<llama_split_tensor> split_s_l;
|
||||
|
||||
// Per-device replicas of the MLA compressed-latent KV cache (-sm graph for DEEPSEEK2/GLM_DSA/MISTRAL4).
|
||||
std::vector<llama_split_tensor> replicated_k_l;
|
||||
|
||||
std::vector<struct ggml_context *> ctxs;
|
||||
std::vector<ggml_backend_buffer_t> bufs;
|
||||
|
||||
|
||||
@ -50,6 +50,7 @@ struct llama_cparams {
|
||||
int worst_graph_tokens;
|
||||
|
||||
enum ggml_type reduce_type;
|
||||
enum ggml_type graph_attn_precision;
|
||||
enum llama_pooling_type pooling_type;
|
||||
enum llama_mtp_op_type mtp_op_type;
|
||||
|
||||
|
||||
@ -2497,12 +2497,15 @@ bool create_tensors_helper::create_deepseek2_tensors(const LLM_TN & tn) {
|
||||
|
||||
auto & layer = model.layers[i];
|
||||
|
||||
layer.attn_norm = create_tensor(ctx_layer, tn(LLM_TENSOR_ATTN_NORM, "weight", i), {n_embd});
|
||||
// Under -sm graph/attn, norms need per-device replicas via prepare_split_tensors(-1, ...).
|
||||
auto norm_ctx = (model.split_mode == LLAMA_SPLIT_MODE_GRAPH ||
|
||||
model.split_mode == LLAMA_SPLIT_MODE_ATTN) ? ctx_split : ctx_layer;
|
||||
layer.attn_norm = create_tensor(norm_ctx, tn(LLM_TENSOR_ATTN_NORM, "weight", i), {n_embd});
|
||||
if (!is_lite) {
|
||||
layer.attn_q_a_norm = create_tensor(ctx_layer, tn(LLM_TENSOR_ATTN_Q_A_NORM, "weight", i), {q_lora_rank});
|
||||
layer.attn_q_a_norm = create_tensor(norm_ctx, tn(LLM_TENSOR_ATTN_Q_A_NORM, "weight", i), {q_lora_rank});
|
||||
}
|
||||
|
||||
layer.attn_kv_a_norm = create_tensor(ctx_layer, tn(LLM_TENSOR_ATTN_KV_A_NORM, "weight", i), {kv_lora_rank});
|
||||
layer.attn_kv_a_norm = create_tensor(norm_ctx, tn(LLM_TENSOR_ATTN_KV_A_NORM, "weight", i), {kv_lora_rank});
|
||||
|
||||
bool merged = false;
|
||||
if (ml.merge_qkv) {
|
||||
@ -2541,7 +2544,10 @@ bool create_tensors_helper::create_deepseek2_tensors(const LLM_TN & tn) {
|
||||
layer.wkv_a_mqa = create_tensor(ctx_split, tn(LLM_TENSOR_ATTN_KV_A_MQA, "weight", i),{n_embd, kv_lora_rank + (n_embd_head_qk_rope)});
|
||||
}
|
||||
|
||||
layer.wkv_b = create_tensor(ctx_split, tn(LLM_TENSOR_ATTN_KV_B, "weight", i),
|
||||
// Keep wkv_b on a single-device buffer; llm_prepare_mla reads it back to derive wk_b/wv_b.
|
||||
auto wkv_b_ctx = (model.split_mode == LLAMA_SPLIT_MODE_GRAPH ||
|
||||
model.split_mode == LLAMA_SPLIT_MODE_ATTN) ? ctx_layer : ctx_split;
|
||||
layer.wkv_b = create_tensor(wkv_b_ctx, tn(LLM_TENSOR_ATTN_KV_B, "weight", i),
|
||||
{kv_lora_rank, n_head * (n_embd_head_qk_nope + n_embd_head_v)}, llama_model_loader::TENSOR_NOT_REQUIRED);
|
||||
if (!layer.wkv_b) {
|
||||
// Incompatible mainline model. Let's see if we can still load it
|
||||
@ -2554,15 +2560,18 @@ bool create_tensors_helper::create_deepseek2_tensors(const LLM_TN & tn) {
|
||||
}
|
||||
layer.wo = create_tensor(ctx_split, tn(LLM_TENSOR_ATTN_OUT, "weight", i), { n_head * ( n_embd_head_v), n_embd});
|
||||
|
||||
layer.ffn_norm = create_tensor(ctx_layer, tn(LLM_TENSOR_FFN_NORM, "weight", i), {n_embd});
|
||||
layer.ffn_norm = create_tensor(norm_ctx, tn(LLM_TENSOR_FFN_NORM, "weight", i), {n_embd});
|
||||
|
||||
if (i < (int) hparams.n_layer_dense_lead) {
|
||||
layer.ffn_gate = create_tensor(ctx_split, tn(LLM_TENSOR_FFN_GATE, "weight", i), {n_embd, n_ff});
|
||||
layer.ffn_down = create_tensor(ctx_split, tn(LLM_TENSOR_FFN_DOWN, "weight", i), { n_ff, n_embd});
|
||||
layer.ffn_up = create_tensor(ctx_split, tn(LLM_TENSOR_FFN_UP, "weight", i), {n_embd, n_ff});
|
||||
} else {
|
||||
layer.ffn_gate_inp = create_tensor(ctx_layer, tn(LLM_TENSOR_FFN_GATE_INP, "weight", i), {n_embd, n_expert});
|
||||
layer.ffn_exp_probs_b = create_tensor(ctx_layer, tn(LLM_TENSOR_FFN_EXP_PROBS_B, "bias", i), {n_expert}, 1);
|
||||
// llm_build_std_moe_ffn needs per-device extras on these under -sm graph/attn.
|
||||
const auto moe_ctx = (model.split_mode == LLAMA_SPLIT_MODE_GRAPH ||
|
||||
model.split_mode == LLAMA_SPLIT_MODE_ATTN) ? ctx_split : ctx_layer;
|
||||
layer.ffn_gate_inp = create_tensor(moe_ctx, tn(LLM_TENSOR_FFN_GATE_INP, "weight", i), {n_embd, n_expert});
|
||||
layer.ffn_exp_probs_b = create_tensor(moe_ctx, tn(LLM_TENSOR_FFN_EXP_PROBS_B, "bias", i), {n_expert}, 1);
|
||||
|
||||
GGML_ASSERT(n_expert > 0);
|
||||
GGML_ASSERT(n_expert_used > 0);
|
||||
@ -2612,10 +2621,16 @@ bool create_tensors_helper::create_glm_dsa_tensors(const LLM_TN & tn) {
|
||||
|
||||
auto & layer = model.layers[i];
|
||||
|
||||
layer.attn_norm = create_tensor(ctx_layer, tn(LLM_TENSOR_ATTN_NORM, "weight", i), {n_embd}, flags);
|
||||
layer.attn_q_a_norm = create_tensor(ctx_layer, tn(LLM_TENSOR_ATTN_Q_A_NORM, "weight", i), {q_lora_rank}, flags);
|
||||
// Under -sm graph/attn, norms and MoE-gate tensors need per-device replicas in ctx_split.
|
||||
const auto graph_or_attn = (model.split_mode == LLAMA_SPLIT_MODE_GRAPH ||
|
||||
model.split_mode == LLAMA_SPLIT_MODE_ATTN);
|
||||
auto norm_ctx = graph_or_attn ? ctx_split : ctx_layer;
|
||||
auto moe_ctx = graph_or_attn ? ctx_split : ctx_layer;
|
||||
|
||||
layer.attn_kv_a_norm = create_tensor(ctx_layer, tn(LLM_TENSOR_ATTN_KV_A_NORM, "weight", i), {kv_lora_rank}, flags);
|
||||
layer.attn_norm = create_tensor(norm_ctx, tn(LLM_TENSOR_ATTN_NORM, "weight", i), {n_embd}, flags);
|
||||
layer.attn_q_a_norm = create_tensor(norm_ctx, tn(LLM_TENSOR_ATTN_Q_A_NORM, "weight", i), {q_lora_rank}, flags);
|
||||
|
||||
layer.attn_kv_a_norm = create_tensor(norm_ctx, tn(LLM_TENSOR_ATTN_KV_A_NORM, "weight", i), {kv_lora_rank}, flags);
|
||||
|
||||
bool merged = false;
|
||||
if (ml.merge_qkv) {
|
||||
@ -2658,15 +2673,15 @@ bool create_tensors_helper::create_glm_dsa_tensors(const LLM_TN & tn) {
|
||||
layer.indexer_attn_k = create_tensor(ctx_split, tn(LLM_TENSOR_INDEXER_ATTN_K, "weight", i), {n_embd, hparams.indexer_head_size}, flags);
|
||||
layer.indexer_attn_q_b = create_tensor(ctx_split, tn(LLM_TENSOR_INDEXER_ATTN_Q_B, "weight", i), {q_lora_rank, hparams.indexer_n_head * hparams.indexer_head_size}, flags);
|
||||
|
||||
layer.ffn_norm = create_tensor(ctx_layer, tn(LLM_TENSOR_FFN_NORM, "weight", i), {n_embd}, flags);
|
||||
layer.ffn_norm = create_tensor(norm_ctx, tn(LLM_TENSOR_FFN_NORM, "weight", i), {n_embd}, flags);
|
||||
|
||||
if (i < (int) hparams.n_layer_dense_lead) {
|
||||
layer.ffn_gate = create_tensor(ctx_split, tn(LLM_TENSOR_FFN_GATE, "weight", i), {n_embd, n_ff}, flags);
|
||||
layer.ffn_down = create_tensor(ctx_split, tn(LLM_TENSOR_FFN_DOWN, "weight", i), { n_ff, n_embd}, flags);
|
||||
layer.ffn_up = create_tensor(ctx_split, tn(LLM_TENSOR_FFN_UP, "weight", i), {n_embd, n_ff}, flags);
|
||||
} else {
|
||||
layer.ffn_gate_inp = create_tensor(ctx_layer, tn(LLM_TENSOR_FFN_GATE_INP, "weight", i), {n_embd, n_expert}, flags);
|
||||
layer.ffn_exp_probs_b = create_tensor(ctx_layer, tn(LLM_TENSOR_FFN_EXP_PROBS_B, "bias", i), {n_expert}, flags);
|
||||
layer.ffn_gate_inp = create_tensor(moe_ctx, tn(LLM_TENSOR_FFN_GATE_INP, "weight", i), {n_embd, n_expert}, flags);
|
||||
layer.ffn_exp_probs_b = create_tensor(moe_ctx, tn(LLM_TENSOR_FFN_EXP_PROBS_B, "bias", i), {n_expert}, flags);
|
||||
|
||||
GGML_ASSERT(n_expert > 0);
|
||||
GGML_ASSERT(n_expert_used > 0);
|
||||
@ -3719,7 +3734,7 @@ bool create_tensors_helper::merge_qkv(const LLM_TN & tn, int i, int bias, bool i
|
||||
|
||||
static void prepare_split_tensors(int split_dim, ggml_context * ctx, ggml_tensor * tensor, llama_split_tensor & split_tensor,
|
||||
const std::vector<int> & splits, std::vector<size_t> & mem_used) {
|
||||
GGML_ASSERT(split_dim <= 1);
|
||||
GGML_ASSERT(split_dim <= 2);
|
||||
GGML_ASSERT(splits.size() > 1);
|
||||
std::string name{tensor->name};
|
||||
split_tensor.tensor_splits.resize(splits.size());
|
||||
@ -3744,6 +3759,17 @@ static void prepare_split_tensors(int split_dim, ggml_context * ctx, ggml_tensor
|
||||
split_tensor.tensor_splits[i] = nullptr;
|
||||
}
|
||||
}
|
||||
}
|
||||
else if (split_dim == 2) {
|
||||
for (int i = 0; i < int(splits.size()); ++i) {
|
||||
if (splits[i] > 0) {
|
||||
split_tensor.tensor_splits[i] = ggml_new_tensor_3d(ctx, tensor->type, tensor->ne[0], tensor->ne[1], splits[i]);
|
||||
auto name_i = name + '.' + std::to_string(i);
|
||||
ggml_set_name(split_tensor.tensor_splits[i], name_i.c_str());
|
||||
} else {
|
||||
split_tensor.tensor_splits[i] = nullptr;
|
||||
}
|
||||
}
|
||||
} else {
|
||||
for (int i = 0; i < int(splits.size()); ++i) {
|
||||
if (splits[i] > 0) {
|
||||
@ -3769,6 +3795,80 @@ static void prepare_split_tensors(int split_dim, ggml_context * ctx, ggml_tensor
|
||||
}
|
||||
}
|
||||
|
||||
// MLA tensor distribution for -sm graph / -sm attn.
|
||||
// q_a/wkv_a_mqa/norms replicated; q_b row-split by Q head; wo row-split.
|
||||
// wk_b/wv_b are per-head split (split_dim=2) — loaded directly when present
|
||||
// in the GGUF, or produced per-head-split by llm_prepare_mla()'s materialize
|
||||
// lambda when only wkv_b is in the GGUF.
|
||||
static void distribute_mla_tensors_for_split_mode_graph(
|
||||
llama_layer & layer,
|
||||
const llama_hparams & hparams,
|
||||
const std::vector<float> & cur_splits,
|
||||
std::vector<size_t> & mem_used,
|
||||
ggml_context * ctx_split,
|
||||
int il) {
|
||||
const std::vector<int> mirror(cur_splits.size(), 1);
|
||||
|
||||
const int n_head = hparams.n_head(il);
|
||||
const int n_embd_head_k = hparams.n_embd_head_k(il);
|
||||
const int n_embd_head_v = hparams.n_embd_head_v(il);
|
||||
const int qk_rope = hparams.n_rot;
|
||||
const int qk_nope = n_embd_head_k - qk_rope;
|
||||
|
||||
// granularity=4: keeps wo row blocks K-quant-aligned (% 256) and gqa_ratio % 4 == 0 for FA-MMA.
|
||||
auto split_heads = create_split(n_head, 4, cur_splits, mem_used);
|
||||
|
||||
// Derive per-tensor column/row splits from head splits.
|
||||
auto split_wq_b_cols = split_heads;
|
||||
for (auto & s : split_wq_b_cols) s *= n_embd_head_k;
|
||||
auto split_wo_rows = split_heads;
|
||||
for (auto & s : split_wo_rows) s *= n_embd_head_v;
|
||||
|
||||
LLAMA_LOG_DEBUG(" MLA layer %d split_heads:", il);
|
||||
for ([[maybe_unused]] auto s : split_heads) LLAMA_LOG_DEBUG(" %d", s);
|
||||
LLAMA_LOG_DEBUG("\n");
|
||||
|
||||
// Replicated norms (Q-LoRA / KV-LoRA)
|
||||
if (layer.attn_q_a_norm) {
|
||||
prepare_split_tensors(-1, ctx_split, layer.attn_q_a_norm, layer.split_attn_q_a_norm, mirror, mem_used);
|
||||
}
|
||||
if (layer.attn_kv_a_norm) {
|
||||
prepare_split_tensors(-1, ctx_split, layer.attn_kv_a_norm, layer.split_attn_kv_a_norm, mirror, mem_used);
|
||||
}
|
||||
|
||||
// Q-side: either wq_a + wq_b (Q-LoRA path, DSV3/K2) or wq directly (DSV2-Lite)
|
||||
if (layer.wq_a) {
|
||||
prepare_split_tensors(-1, ctx_split, layer.wq_a, layer.split_wq_a, mirror, mem_used);
|
||||
}
|
||||
if (layer.wq_b) {
|
||||
prepare_split_tensors(1, ctx_split, layer.wq_b, layer.split_wq_b, split_wq_b_cols, mem_used);
|
||||
} else if (layer.wq) {
|
||||
// DSV2-Lite / no-Q-LoRA path: column-split wq directly along the head dim.
|
||||
auto split_wq_cols = split_heads;
|
||||
for (auto & s : split_wq_cols) s *= n_embd_head_k;
|
||||
prepare_split_tensors(1, ctx_split, layer.wq, layer.split_wq, split_wq_cols, mem_used);
|
||||
}
|
||||
|
||||
// wkv_a_mqa, wk_b, wv_b replicated: the per-head 3D batched mul_mat can't read a split src0.
|
||||
if (layer.wkv_a_mqa) {
|
||||
prepare_split_tensors(-1, ctx_split, layer.wkv_a_mqa, layer.split_wkv_a_mqa, mirror, mem_used);
|
||||
}
|
||||
if (layer.wk_b) {
|
||||
prepare_split_tensors( 2, ctx_split, layer.wk_b, layer.split_wk_b, split_heads, mem_used);
|
||||
}
|
||||
if (layer.wv_b) {
|
||||
prepare_split_tensors( 2, ctx_split, layer.wv_b, layer.split_wv_b, split_heads, mem_used);
|
||||
}
|
||||
|
||||
// Output projection: row-split, partial outputs all-reduced after.
|
||||
if (layer.wo) {
|
||||
prepare_split_tensors(0, ctx_split, layer.wo, layer.split_wo, split_wo_rows, mem_used);
|
||||
}
|
||||
if (layer.ffn_norm) {
|
||||
prepare_split_tensors(-1, ctx_split, layer.ffn_norm, layer.split_ffn_norm, mirror, mem_used);
|
||||
}
|
||||
}
|
||||
|
||||
static void adjust_split(std::vector<float> & split, const std::vector<size_t> & mem_used, int max_gpu) {
|
||||
if (max_gpu < 1 || max_gpu >= int(split.size()) || split.size() != mem_used.size()) {
|
||||
return;
|
||||
@ -4464,6 +4564,16 @@ bool create_tensors_helper::create_tensors() {
|
||||
}
|
||||
}
|
||||
|
||||
// MLA tensor distribution (DEEPSEEK2/GLM_DSA/MISTRAL4). Detect by arch + absence of wk
|
||||
// since wkv_b can be null when the model was quantized by mainline llama.cpp.
|
||||
if (layer.wo && !layer.wk &&
|
||||
(model.arch == LLM_ARCH_DEEPSEEK2 ||
|
||||
model.arch == LLM_ARCH_GLM_DSA ||
|
||||
model.arch == LLM_ARCH_MISTRAL4)) {
|
||||
distribute_mla_tensors_for_split_mode_graph(
|
||||
layer, hparams, cur_splits, mem_used, ctx_split, il);
|
||||
}
|
||||
|
||||
if (layer.ffn_down && layer.ffn_up && layer.ffn_gate) {
|
||||
bool use_split = split_tensors.find(layer.ffn_down) != split_tensors.end() &&
|
||||
split_tensors.find(layer.ffn_gate) != split_tensors.end() &&
|
||||
|
||||
@ -219,6 +219,15 @@ struct llama_layer {
|
||||
llama_split_tensor split_sinks;
|
||||
llama_split_tensor split_wqkv_gate;
|
||||
|
||||
// MLA per-device shards (-sm graph for DEEPSEEK2/GLM_DSA/MISTRAL4).
|
||||
llama_split_tensor split_wq_a;
|
||||
llama_split_tensor split_wq_b;
|
||||
llama_split_tensor split_wkv_a_mqa;
|
||||
llama_split_tensor split_wk_b;
|
||||
llama_split_tensor split_wv_b;
|
||||
llama_split_tensor split_attn_q_a_norm;
|
||||
llama_split_tensor split_attn_kv_a_norm;
|
||||
|
||||
llama_split_tensor split_ssm_wqkv;
|
||||
llama_split_tensor split_ssm_wqkv_gate;
|
||||
llama_split_tensor split_ssm_in;
|
||||
@ -374,6 +383,10 @@ struct llama_layer {
|
||||
std::unique_ptr<ggml_tensor> computed_wk_b;
|
||||
std::unique_ptr<ggml_tensor> computed_wv_b;
|
||||
std::unique_ptr<ggml_tensor> computed_wkv_b;
|
||||
|
||||
// Per-device replicas of computed wk_b/wv_b (-sm graph). Buffers owned via model.bufs.
|
||||
std::vector<std::unique_ptr<ggml_tensor>> computed_wk_b_replicas;
|
||||
std::vector<std::unique_ptr<ggml_tensor>> computed_wv_b_replicas;
|
||||
};
|
||||
|
||||
struct llama_lora_adapter;
|
||||
|
||||
241
src/llama.cpp
241
src/llama.cpp
@ -622,13 +622,14 @@ bool llama_context::update_cache_copies() {
|
||||
GGML_ASSERT(model.split_mode == LLAMA_SPLIT_MODE_GRAPH || model.split_mode == LLAMA_SPLIT_MODE_ATTN);
|
||||
GGML_ASSERT(model.splits.size() > 1);
|
||||
auto vl = !kv_self.v_l.empty() && kv_self.v_l[il] ? (ggml_split_tensor_t *)kv_self.v_l[il]->extra : nullptr;
|
||||
GGML_ASSERT(kl && (!kv_self.v_l[il] || vl));
|
||||
GGML_ASSERT(kl && (kv_self.v_l.empty() || !kv_self.v_l[il] || vl));
|
||||
if (vl) {
|
||||
GGML_ASSERT(kl->n_device == vl->n_device);
|
||||
}
|
||||
for (int id = 0; id < kl->n_device; ++id) {
|
||||
if (!kl->splits[id]) continue;
|
||||
auto& c = cache_copies[2*model.splits.size()*il + 2*id + 0];
|
||||
size_t idx = 2*model.splits.size()*il + 2*id + 0;
|
||||
auto& c = cache_copies[idx];
|
||||
if (!c.cpy || c.cpy->op != GGML_OP_CPY || c.cpy->view_src != kl->splits[id]) {
|
||||
return false;
|
||||
}
|
||||
@ -800,6 +801,7 @@ static bool llama_kv_cache_init(
|
||||
bool is_mla_attn = model.arch == LLM_ARCH_DEEPSEEK2 || model.arch == LLM_ARCH_GLM_DSA || model.arch == LLM_ARCH_MISTRAL4;
|
||||
|
||||
bool split_cache = false;
|
||||
bool replicate_mla = false;
|
||||
if ((model.split_mode == LLAMA_SPLIT_MODE_GRAPH || model.split_mode == LLAMA_SPLIT_MODE_ATTN) && !is_mla_attn && offload) {
|
||||
cache.split_k_l.reserve(n_layer);
|
||||
cache.split_v_l.reserve(n_layer);
|
||||
@ -808,6 +810,10 @@ static bool llama_kv_cache_init(
|
||||
}
|
||||
split_cache = true;
|
||||
}
|
||||
if ((model.split_mode == LLAMA_SPLIT_MODE_GRAPH || model.split_mode == LLAMA_SPLIT_MODE_ATTN) && is_mla_attn && offload) {
|
||||
cache.replicated_k_l.reserve(n_layer);
|
||||
replicate_mla = true;
|
||||
}
|
||||
|
||||
// count used buffer types
|
||||
std::map<ggml_backend_buffer_type_t, int> buft_layer_count;
|
||||
@ -817,7 +823,7 @@ static bool llama_kv_cache_init(
|
||||
const int64_t n_mtp_first = n_layer - hparams.nextn_predict_layers;
|
||||
for (int64_t i = 0; i < n_layer; ++i) {
|
||||
const bool is_mtp_tail = qwen_mtp && i >= n_mtp_first;
|
||||
if (split_cache && !is_mtp_tail) {
|
||||
if ((split_cache || replicate_mla) && !is_mtp_tail) {
|
||||
buft_layer_count[model.buft_layer[i].buft_matrix]++;
|
||||
if (model.buft_layer[i].buft != model.buft_layer[i].buft_matrix) {
|
||||
buft_layer_count[model.buft_layer[i].buft]++;
|
||||
@ -835,7 +841,7 @@ static bool llama_kv_cache_init(
|
||||
for (auto & it : buft_layer_count) {
|
||||
int n_layers = it.second;
|
||||
size_t ctx_mem_size = 8u*n_layers*ggml_tensor_overhead();
|
||||
if (split_cache) ctx_mem_size += 4*model.splits.size()*n_layers*ggml_tensor_overhead();
|
||||
if (split_cache || replicate_mla) ctx_mem_size += 4*model.splits.size()*n_layers*ggml_tensor_overhead();
|
||||
struct ggml_init_params params = {
|
||||
/*.mem_size =*/ ctx_mem_size,
|
||||
/*.mem_buffer =*/ NULL,
|
||||
@ -910,7 +916,7 @@ static bool llama_kv_cache_init(
|
||||
model.arch == LLM_ARCH_QWEN35MOE) &&
|
||||
hparams.nextn_predict_layers > 0 && i >= (int)n_mtp_first_layer;
|
||||
//struct ggml_context * ctx = split_cache && !qnext_recurrent ? ctx_map.at(model.buft_layer[i].buft_matrix) : offload ? ctx_map.at(model.buft_layer[i].buft) : cache.ctxs.front();
|
||||
struct ggml_context * ctx = (split_cache && !is_mtp_tail_layer) ? ctx_map.at(model.buft_layer[i].buft_matrix) : offload ? ctx_map.at(model.buft_layer[i].buft) : cache.ctxs.front();
|
||||
struct ggml_context * ctx = ((split_cache || replicate_mla) && !is_mtp_tail_layer) ? ctx_map.at(model.buft_layer[i].buft_matrix) : offload ? ctx_map.at(model.buft_layer[i].buft) : cache.ctxs.front();
|
||||
ggml_tensor * k = nullptr;
|
||||
ggml_tensor * v = nullptr;
|
||||
ggml_tensor * s = nullptr;
|
||||
@ -919,19 +925,40 @@ static bool llama_kv_cache_init(
|
||||
const uint32_t n_embd_head_qk_rope = hparams.n_rot;
|
||||
const uint32_t kv_lora_rank = hparams.n_lora_kv;
|
||||
//LLAMA_LOG_INFO("%s: layer %d: n_embd_head_qk_rope = %d, kv_lora_rank = %d\n", __func__, i, n_embd_head_qk_rope, kv_lora_rank);
|
||||
if (cparams.flash_attn) {
|
||||
ggml_tensor * kv = ggml_new_tensor_2d(ctx, cache.type_k, kv_lora_rank + n_embd_head_qk_rope, kv_size);
|
||||
ggml_format_name(kv, "cache_k_l%d", i);
|
||||
cache.k_l.push_back(kv);
|
||||
} else {
|
||||
auto kv_type = cparams.mla_attn == 1 ? cache.type_k : cache.type_v;
|
||||
ggml_tensor * kv = ggml_new_tensor_2d(ctx, kv_type, kv_lora_rank + n_embd_head_qk_rope, kv_size);
|
||||
ggml_format_name(kv, "cache_k_l%d", i);
|
||||
cache.k_l.push_back(kv);
|
||||
if (cparams.mla_attn == 1) {
|
||||
ggml_tensor * kvt = ggml_new_tensor_1d(ctx, cache.type_v, kv_lora_rank*kv_size);
|
||||
ggml_format_name(kvt, "cache_v_l%d", i);
|
||||
cache.v_l.push_back(kvt);
|
||||
ggml_type primary_kv_type = cparams.flash_attn ? cache.type_k
|
||||
: (cparams.mla_attn == 1 ? cache.type_k : cache.type_v);
|
||||
ggml_tensor * kv = ggml_new_tensor_2d(ctx, primary_kv_type, kv_lora_rank + n_embd_head_qk_rope, kv_size);
|
||||
ggml_format_name(kv, "cache_k_l%d", i);
|
||||
cache.k_l.push_back(kv);
|
||||
if (!cparams.flash_attn && cparams.mla_attn == 1) {
|
||||
ggml_tensor * kvt = ggml_new_tensor_1d(ctx, cache.type_v, kv_lora_rank*kv_size);
|
||||
ggml_format_name(kvt, "cache_v_l%d", i);
|
||||
cache.v_l.push_back(kvt);
|
||||
}
|
||||
// Per-device replicas of the compressed latent KV cache (n_device from wo's split).
|
||||
if (replicate_mla && !is_mtp_tail_layer) {
|
||||
auto wo = model.layers[i].wo;
|
||||
if (wo && wo->extra) {
|
||||
auto extra_wo = (const ggml_split_tensor_t *)wo->extra;
|
||||
int n_device = extra_wo->n_device;
|
||||
auto & repl_k_l = cache.replicated_k_l.emplace_back();
|
||||
repl_k_l.tensor_splits.resize(n_device, nullptr);
|
||||
for (int is = 0; is < n_device; ++is) {
|
||||
if (!extra_wo->splits[is]) continue;
|
||||
ggml_tensor * rkv = ggml_new_tensor_2d(ctx, primary_kv_type,
|
||||
kv_lora_rank + n_embd_head_qk_rope, kv_size);
|
||||
auto split_name = std::string("cache_k_l") + std::to_string(i) + '.' + std::to_string(is);
|
||||
ggml_set_name(rkv, split_name.c_str());
|
||||
repl_k_l.tensor_splits[is] = rkv;
|
||||
mem_split[is] += ggml_nbytes(rkv);
|
||||
}
|
||||
repl_k_l.ggml.n_device = n_device;
|
||||
repl_k_l.ggml.split_dim = -1;
|
||||
repl_k_l.ggml.splits = repl_k_l.tensor_splits.data();
|
||||
kv->extra = (void *)&repl_k_l.ggml;
|
||||
} else {
|
||||
GGML_ABORT("MLA layer %d: wo lacks split metadata under -sm graph "
|
||||
"(distribute_mla_tensors_for_split_mode_graph not run?)", i);
|
||||
}
|
||||
}
|
||||
n_mla++;
|
||||
@ -1086,8 +1113,9 @@ static bool llama_kv_cache_init(
|
||||
cache.bufs.push_back(buf);
|
||||
}
|
||||
}
|
||||
if (split_cache) {
|
||||
LLAMA_LOG_INFO("%s: KV cache size per device:\n", __func__);
|
||||
if (split_cache || replicate_mla) {
|
||||
LLAMA_LOG_INFO("%s: KV cache size per device%s:\n", __func__,
|
||||
replicate_mla ? " (MLA replicated)" : "");
|
||||
for (int i = 0; i < int(mem_split.size()); ++i) printf(" Device %d: %g MiB\n", i, mem_split[i]/1024./1024.);
|
||||
}
|
||||
|
||||
@ -2306,24 +2334,107 @@ static void llm_prepare_mla(llama_model & model, int mla) {
|
||||
|
||||
auto name = std::string{"blk."} + std::to_string(il) + ".attn_k_b.weight";
|
||||
|
||||
l.computed_wk_b = std::make_unique<ggml_tensor>(*wk_b);
|
||||
l.computed_wk_b->buffer = ggml_backend_buft_alloc_buffer(ggml_backend_buffer_get_type(l.wkv_b->buffer), ggml_nbytes(wk_b));
|
||||
l.computed_wk_b->data = ggml_backend_buffer_get_base(l.computed_wk_b->buffer);
|
||||
l.computed_wk_b->op = GGML_OP_NONE; // we absolutely need to do this, else the backend will attempt to find the parents
|
||||
// of wk_b, which no longer exist, and will therefore crash.
|
||||
for (int j = 0; j < GGML_MAX_SRC; ++j) l.computed_wk_b->src[j] = nullptr;
|
||||
ggml_set_name(l.computed_wk_b.get(), name.c_str());
|
||||
ggml_backend_buffer_set_usage(l.computed_wk_b->buffer, GGML_BACKEND_BUFFER_USAGE_WEIGHTS);
|
||||
ggml_backend_tensor_set(l.computed_wk_b.get(), wk_b->data, 0, ggml_nbytes(wk_b));
|
||||
if (ggml_backend_buffer_is_host(l.computed_wk_b->buffer)) {
|
||||
iqk_modify_tensor(l.computed_wk_b.get());
|
||||
}
|
||||
// Per-head split wk_b/wv_b under -sm graph/attn so each rank's batched matmul
|
||||
// reads only its share of heads (split_dim=2), mirroring prepare_split_tensors.
|
||||
const bool tp_replicate =
|
||||
(model.split_mode == LLAMA_SPLIT_MODE_GRAPH || model.split_mode == LLAMA_SPLIT_MODE_ATTN)
|
||||
&& l.wo && l.wo->extra;
|
||||
|
||||
l.wk_b = l.computed_wk_b.get();
|
||||
model.tensors_by_name.push_back(std::make_pair(name, l.wk_b));
|
||||
auto materialize = [&](ggml_tensor * source,
|
||||
std::unique_ptr<ggml_tensor> & computed,
|
||||
std::vector<std::unique_ptr<ggml_tensor>> & replicas,
|
||||
llama_split_tensor & split,
|
||||
const std::string & tname) -> ggml_tensor * {
|
||||
if (tp_replicate) {
|
||||
auto wo_split = (const ggml_split_tensor_t *)l.wo->extra;
|
||||
const int n_device = wo_split->n_device;
|
||||
const int64_t n_embd_head_v_full = hparams.n_embd_head_v_full;
|
||||
|
||||
printf("Computed %s as %d x %d x %d of type %s and stored in buffer %s\n", name.c_str(), (int)wk_b->ne[0], (int)wk_b->ne[1], (int)wk_b->ne[2],
|
||||
ggml_type_name(wk_b->type), ggml_backend_buffer_name(l.computed_wk_b->buffer));
|
||||
std::vector<int> head_offsets(n_device + 1, 0);
|
||||
for (int idx = 0; idx < n_device; ++idx) {
|
||||
int n_h_id = 0;
|
||||
if (wo_split->splits[idx]) {
|
||||
n_h_id = (int)(wo_split->splits[idx]->ne[0] / n_embd_head_v_full);
|
||||
}
|
||||
head_offsets[idx + 1] = head_offsets[idx] + n_h_id;
|
||||
}
|
||||
|
||||
computed = std::make_unique<ggml_tensor>(*source);
|
||||
computed->buffer = nullptr;
|
||||
computed->data = nullptr;
|
||||
computed->op = GGML_OP_NONE;
|
||||
for (int j = 0; j < GGML_MAX_SRC; ++j) computed->src[j] = nullptr;
|
||||
ggml_set_name(computed.get(), tname.c_str());
|
||||
|
||||
replicas.resize(n_device);
|
||||
split.tensor_splits.assign(n_device, nullptr);
|
||||
|
||||
const size_t head_block_bytes = source->nb[2];
|
||||
|
||||
for (int id = 0; id < n_device; ++id) {
|
||||
if (!wo_split->splits[id] || !wo_split->splits[id]->buffer) continue;
|
||||
const int head_offset = head_offsets[id];
|
||||
const int n_head_local = head_offsets[id + 1] - head_offset;
|
||||
if (n_head_local <= 0) continue;
|
||||
|
||||
const size_t slice_bytes = (size_t)n_head_local * head_block_bytes;
|
||||
auto dev_buft = ggml_backend_buffer_get_type(wo_split->splits[id]->buffer);
|
||||
auto dev_buf = ggml_backend_buft_alloc_buffer(dev_buft, slice_bytes);
|
||||
ggml_backend_buffer_set_usage(dev_buf, GGML_BACKEND_BUFFER_USAGE_WEIGHTS);
|
||||
model.bufs.push_back(dev_buf);
|
||||
|
||||
replicas[id] = std::make_unique<ggml_tensor>(*source);
|
||||
auto rep = replicas[id].get();
|
||||
rep->ne[2] = n_head_local;
|
||||
rep->nb[3] = rep->nb[2] * (size_t)rep->ne[2];
|
||||
rep->buffer = dev_buf;
|
||||
rep->data = ggml_backend_buffer_get_base(dev_buf);
|
||||
rep->op = GGML_OP_NONE;
|
||||
for (int j = 0; j < GGML_MAX_SRC; ++j) rep->src[j] = nullptr;
|
||||
rep->view_src = nullptr;
|
||||
rep->view_offs = 0;
|
||||
rep->extra = nullptr;
|
||||
ggml_set_name(rep, (tname + "." + std::to_string(id)).c_str());
|
||||
|
||||
const uint8_t * src_bytes = (const uint8_t *)source->data + (size_t)head_offset * head_block_bytes;
|
||||
ggml_backend_tensor_set(rep, src_bytes, 0, slice_bytes);
|
||||
if (ggml_backend_buffer_is_host(rep->buffer)) {
|
||||
iqk_modify_tensor(rep);
|
||||
}
|
||||
split.tensor_splits[id] = rep;
|
||||
}
|
||||
|
||||
split.ggml.n_device = n_device;
|
||||
split.ggml.split_dim = 2;
|
||||
split.ggml.splits = split.tensor_splits.data();
|
||||
computed->extra = (void *)&split.ggml;
|
||||
|
||||
printf("Computed %s as %d x %d x %d of type %s, split across %d devices on dim=2\n",
|
||||
tname.c_str(), (int)source->ne[0], (int)source->ne[1], (int)source->ne[2],
|
||||
ggml_type_name(source->type), n_device);
|
||||
} else {
|
||||
computed = std::make_unique<ggml_tensor>(*source);
|
||||
computed->buffer = ggml_backend_buft_alloc_buffer(ggml_backend_buffer_get_type(l.wkv_b->buffer), ggml_nbytes(source));
|
||||
computed->data = ggml_backend_buffer_get_base(computed->buffer);
|
||||
// GGML_OP_NONE so the backend doesn't try to find the (now-freed) parents of source.
|
||||
computed->op = GGML_OP_NONE;
|
||||
for (int j = 0; j < GGML_MAX_SRC; ++j) computed->src[j] = nullptr;
|
||||
ggml_set_name(computed.get(), tname.c_str());
|
||||
ggml_backend_buffer_set_usage(computed->buffer, GGML_BACKEND_BUFFER_USAGE_WEIGHTS);
|
||||
ggml_backend_tensor_set(computed.get(), source->data, 0, ggml_nbytes(source));
|
||||
if (ggml_backend_buffer_is_host(computed->buffer)) {
|
||||
iqk_modify_tensor(computed.get());
|
||||
}
|
||||
|
||||
printf("Computed %s as %d x %d x %d of type %s and stored in buffer %s\n",
|
||||
tname.c_str(), (int)source->ne[0], (int)source->ne[1], (int)source->ne[2],
|
||||
ggml_type_name(source->type), ggml_backend_buffer_name(computed->buffer));
|
||||
}
|
||||
model.tensors_by_name.push_back(std::make_pair(tname, computed.get()));
|
||||
return computed.get();
|
||||
};
|
||||
|
||||
l.wk_b = materialize(wk_b, l.computed_wk_b, l.computed_wk_b_replicas, l.split_wk_b, name);
|
||||
|
||||
ggml_graph_clear(graph);
|
||||
auto wv_b = ggml_cont(ctx, ggml_view_3d(ctx, &wkv_b, kv_lora_rank, n_embd_head_v, n_head,
|
||||
@ -2338,30 +2449,13 @@ static void llm_prepare_mla(llama_model & model, int mla) {
|
||||
|
||||
name = std::string{"blk."} + std::to_string(il) + ".attn_v_b.weight";
|
||||
|
||||
l.computed_wv_b = std::make_unique<ggml_tensor>(*wv_b);
|
||||
l.computed_wv_b->buffer = ggml_backend_buft_alloc_buffer(ggml_backend_buffer_get_type(l.wkv_b->buffer), ggml_nbytes(wv_b));
|
||||
l.computed_wv_b->data = ggml_backend_buffer_get_base(l.computed_wv_b->buffer);
|
||||
l.computed_wv_b->op = GGML_OP_NONE; // we absolutely need to do this, else the backend will attempt to find the parents
|
||||
// of wk_b, which no longer exist, and will therefore crash.
|
||||
for (int j = 0; j < GGML_MAX_SRC; ++j) l.computed_wv_b->src[j] = nullptr;
|
||||
ggml_set_name(l.computed_wv_b.get(), name.c_str());
|
||||
ggml_backend_buffer_set_usage(l.computed_wv_b->buffer, GGML_BACKEND_BUFFER_USAGE_WEIGHTS);
|
||||
ggml_backend_tensor_set(l.computed_wv_b.get(), wv_b->data, 0, ggml_nbytes(wv_b));
|
||||
if (ggml_backend_buffer_is_host(l.computed_wv_b->buffer)) {
|
||||
iqk_modify_tensor(l.computed_wv_b.get());
|
||||
}
|
||||
|
||||
l.wv_b = l.computed_wv_b.get();
|
||||
model.tensors_by_name.push_back(std::make_pair(name, l.wv_b));
|
||||
|
||||
printf("Computed %s as %d x %d x %d of type %s and stored in buffer %s\n", name.c_str(), (int)wv_b->ne[0], (int)wv_b->ne[1], (int)wv_b->ne[2],
|
||||
ggml_type_name(wv_b->type), ggml_backend_buffer_name(l.computed_wv_b->buffer));
|
||||
l.wv_b = materialize(wv_b, l.computed_wv_b, l.computed_wv_b_replicas, l.split_wv_b, name);
|
||||
|
||||
ggml_graph_clear(graph);
|
||||
}
|
||||
ggml_free(ctx);
|
||||
}
|
||||
if (mla == 1) return;
|
||||
if (mla == 1 || model.split_mode == LLAMA_SPLIT_MODE_GRAPH) return;
|
||||
|
||||
n_to_compute = 0;
|
||||
for (auto& l : model.layers) {
|
||||
@ -2608,6 +2702,9 @@ static bool is_model_split_supported(const llama_model & model) {
|
||||
LLM_ARCH_QWEN35,
|
||||
LLM_ARCH_QWEN35MOE,
|
||||
LLM_ARCH_GEMMA4,
|
||||
LLM_ARCH_DEEPSEEK2,
|
||||
LLM_ARCH_GLM_DSA,
|
||||
LLM_ARCH_MISTRAL4,
|
||||
};
|
||||
auto it = k_supported.find(model.arch);
|
||||
return it != k_supported.end();
|
||||
@ -2878,6 +2975,12 @@ static bool llm_load_tensors(
|
||||
const bool unsupported_gemma_split =
|
||||
model.arch == LLM_ARCH_GEMMA4_MTP ||
|
||||
(model.arch == LLM_ARCH_GEMMA4 && hparams.n_embd_per_layer > 0);
|
||||
const bool is_mla_arch =
|
||||
model.arch == LLM_ARCH_DEEPSEEK2 ||
|
||||
model.arch == LLM_ARCH_GLM_DSA ||
|
||||
model.arch == LLM_ARCH_MISTRAL4;
|
||||
const bool incompatible_loader_opts = is_mla_arch &&
|
||||
(ml.ncmoe > 0 || ml.repack_tensors || ml.merge_up_gate_exps || ml.tensor_buft_overrides);
|
||||
|
||||
if (unsupported_gemma_split) {
|
||||
LLAMA_LOG_WARN("\n=========================================================\n");
|
||||
@ -2887,6 +2990,16 @@ static bool llm_load_tensors(
|
||||
LLAMA_LOG_WARN(" => changing split mode to 'layer'\n");
|
||||
LLAMA_LOG_WARN("===========================================================\n\n");
|
||||
split_mode = LLAMA_SPLIT_MODE_LAYER;
|
||||
} else if (incompatible_loader_opts) {
|
||||
const char * bad_flag = ml.ncmoe > 0 ? "-ncmoe | --n-cpu-moe"
|
||||
: ml.repack_tensors ? "-rtr | --run-time-repack"
|
||||
: ml.merge_up_gate_exps ? "-muge | --merge-up-gate-experts"
|
||||
: "-ot | --override-tensor";
|
||||
LLAMA_LOG_WARN("\n=======================================================\n");
|
||||
LLAMA_LOG_WARN("Split mode 'graph' is not compatible with %s\n", bad_flag);
|
||||
LLAMA_LOG_WARN(" => changing split mode to 'layer'\n");
|
||||
LLAMA_LOG_WARN("=======================================================\n\n");
|
||||
split_mode = LLAMA_SPLIT_MODE_LAYER;
|
||||
} else if (!is_model_split_supported(model)) {
|
||||
LLAMA_LOG_WARN("\n=======================================================\n");
|
||||
LLAMA_LOG_WARN("Split mode 'graph' is not supported for this model\n");
|
||||
@ -3301,7 +3414,8 @@ static bool llm_load_tensors(
|
||||
|
||||
ml.done_getting_tensors();
|
||||
|
||||
ml.init_mappings(!defer_expert_mmap, use_mlock ? &model.mlock_mmaps : nullptr, ml.use_thp);
|
||||
// --dry-run skips MAP_POPULATE/WILLNEED — tensor data is never read.
|
||||
ml.init_mappings(!defer_expert_mmap && !dry_run, use_mlock ? &model.mlock_mmaps : nullptr, ml.use_thp);
|
||||
model.mappings.reserve(ml.mappings.size());
|
||||
|
||||
// create the backend buffers
|
||||
@ -3444,8 +3558,13 @@ static bool llm_load_tensors(
|
||||
}
|
||||
}
|
||||
|
||||
if (model.arch == LLM_ARCH_DEEPSEEK2 || model.arch == LLM_ARCH_GLM_DSA || model.arch == LLM_ARCH_MISTRAL4) {
|
||||
llm_prepare_mla(model, mla_attn);
|
||||
if ((model.arch == LLM_ARCH_DEEPSEEK2 || model.arch == LLM_ARCH_GLM_DSA || model.arch == LLM_ARCH_MISTRAL4)) {
|
||||
// -sm graph/attn needs wk_b->extra populated; run prepare even under dry-run.
|
||||
const bool graph_mode = (model.split_mode == LLAMA_SPLIT_MODE_GRAPH ||
|
||||
model.split_mode == LLAMA_SPLIT_MODE_ATTN);
|
||||
if (!dry_run || graph_mode) {
|
||||
llm_prepare_mla(model, mla_attn);
|
||||
}
|
||||
}
|
||||
if (model.arch == LLM_ARCH_GEMMA4) {
|
||||
llm_scale_gate_inp_s(model, use_mmap_buffer);
|
||||
@ -5767,6 +5886,7 @@ struct llama_context_params llama_context_default_params() {
|
||||
/*.type_k =*/ GGML_TYPE_F16,
|
||||
/*.type_v =*/ GGML_TYPE_F16,
|
||||
/*.type_reduce =*/ GGML_TYPE_F16,
|
||||
/*.type_graph_attn =*/ GGML_TYPE_F16,
|
||||
/*.type_first_k =*/ GGML_TYPE_F16,
|
||||
/*.type_last_k =*/ GGML_TYPE_F16,
|
||||
/*.type_first_v =*/ GGML_TYPE_F16,
|
||||
@ -6187,6 +6307,11 @@ struct llama_context * llama_init_from_model(
|
||||
cparams.worst_graph_tokens = params.worst_case_tokens;
|
||||
|
||||
cparams.reduce_type = params.type_reduce;
|
||||
cparams.graph_attn_precision = params.type_graph_attn;
|
||||
if (cparams.graph_attn_precision != GGML_TYPE_F16 && cparams.graph_attn_precision != GGML_TYPE_F32) {
|
||||
throw std::runtime_error(format("--graph-attn-precision must be f16 or f32, got %s",
|
||||
ggml_type_name(cparams.graph_attn_precision)));
|
||||
}
|
||||
cparams.pooling_type = params.pooling_type;
|
||||
|
||||
cparams.n_ctx = params.n_ctx == 0 ? hparams.n_ctx_train : params.n_ctx;
|
||||
|
||||
Loading…
x
Reference in New Issue
Block a user