ik_llama.cpp

mirror of https://github.com/ikawrakow/ik_llama.cpp.git synced 2026-06-28 04:30:15 -05:00

Author	SHA1	Message	Date
mb8565	0ffdf509ab	ggml : fix set_rows CPU crash when the destination is F32 (#2038 ) The CPU `set_rows` kernel for F32 sources fetches `type_traits[dst->type].from_float` and calls it for every scattered row. F32 has no `from_float` entry, it is NULL in `type_traits`, so any `set_rows` into an F32 destination calls a NULL function pointer and segfaults. Other destination types work because they all have a real `from_float`. Repro (CPU backend, standalone ggml graph): dst = new_tensor_2d(F32, 8, 6) // F32 destination src = new_tensor_2d(F32, 8, 4) idx = new_tensor_1d(I64, 4) // {0,2,4,5} out = ggml_set_rows(dst, src, idx) // ggml_backend_graph_compute(cpu, ...) -> SIGSEGV on current main When the destination is F32, copy the row with `memcpy` instead of going through `from_float`. The I32 and I64 index branches both get the same treatment. An assert guards the remaining case, non-F32 dst with a NULL `from_float`, so a future unsupported type fails loudly instead of crashing. I ran a normal model after this and it still decodes fine (DeepSeek-V2-Lite-Q4_K_M, CPU, coherent output), and the non-F32 path is untouched. On the F32 path you pay one `memcpy` per row in place of the indirect call. Co-authored-by: local-llm <local-llm@local-llm-R740.cruvis.org> Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-26 08:46:26 +02:00
mb8565	4553cd0059	cuda : fix MLA flash-attn vec decode for asymmetric K/V head sizes (#2031 ) The flash-attn vec kernels walk the KV cache in blocks of Dk rows for the score loop but accumulate V in blocks of Dv. With Dk == Dv that is the same thing, so normal attention shapes are fine. For absorbed MLA shapes where the K and V head sizes differ (Dk=576/Dv=512 and Dk=192/Dv=128) the two loops step a different number of KV rows, so K and V drift out of sync after the first block and the V pointer reads the wrong cache rows. This only shows up at decode (batch=1) on cards that fall back to the vec kernel for MLA, which on NVIDIA is pre-Volta. There deepseek2/GLM MLA models with -mla 1 -fa 1 or -mla 3 -fa 1 decode coherently for short prompts but collapse into garbage once n_kv passes the first KV block (Dk=576). Prefill/PPL is unaffected because prefill takes the tile kernel, not the vec kernel. Fix: the score loop already covers Dk KV rows, so the V loop and the V pointer step Dk rows too. For asymmetric Dk>Dv the V row is only Dv wide, so threads with tid >= Dv have no V element (their VKQ lane is discarded at the output store anyway) and read 0 instead of stepping past the row. The change keys off the compile-time Dk != Dv, so every symmetric instantiation compiles to byte-identical code and modern GPUs (which never take this vec path for MLA) are unaffected. Validated on a Tesla P100 (sm_60) with DeepSeek-V2-Lite Q4_K_M: decode coherence restored for -mla 1/3 -fa 1, KLD vs the -fa 0 soft_max path drops from 4.79 to 1.4e-4 (same top token 27% -> 100%) at c1024, and TG is unchanged (82.8 t/s). Co-authored-by: mb8565 <244351746+mb8565@users.noreply.github.com> Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-25 08:56:17 +02:00
Kawrakow	d5507e33ae	Split mode graph for dense Gemma4 assistant (#2022 ) * WIP: Split mode graph for Gemma4 assistant Something is not right - acceptance drops to nearly zero. * Per model CUDA contexts Still not working!? * This works The issue was that I was not correctly calculating the number of KV heads for the split KV cache. * Compiler warnings * It is better to use llama_context pointers as keys	2026-06-24 18:29:32 +02:00
Nexes the Elder	7cacf28eec	Fix minor GGML discrepancies (#2016 ) * fix: wrong stride in batched quantized add1 (nb0 -> nb3) ggml_compute_forward_add1_q_f32 used i3*nb0 (element stride) instead of i3*nb3 (batch stride) for the destination row pointer. This causes all add1 operations with quantized types and batch > 1 to write to wrong memory locations. The src0 pointer on the line above correctly uses nb03. * fix: wrong dimension limits in dup_f16 non-contiguous path The destination index wrapping in ggml_compute_forward_dup_f16 used source dimensions (ne00/ne01/ne02/ne03) instead of destination dimensions (ne0/ne1/ne2/ne3). While source and destination shapes are currently identical for dup, using the wrong variables is incorrect by design. * fix: wrong dimension limits in dup_bf16 non-contiguous path Same fix as the dup_f16 path: destination index wrapping used source dimensions (ne00/ne01/ne02/ne03) instead of destination dimensions (ne0/ne1/ne2/ne3). Copy-paste error from the contiguous path. * fix: ACC work size uses src[1] instead of src[0] The dequantization work buffer for quantized ACC was sized using src[1]->ne[0] instead of src[0]->ne[0]. Since src[0] is the tensor being dequantized, its dimensions should determine the buffer size. * fix: missing work size for SOFT_CAP_MAX and ROPE_BACK Both ops dereference params->wdata in their forward functions but had no work size allocation (cur = 0), causing a NULL pointer dereference when any thread attempted to use wdata. * fix: wrong dim in sum_rows_f32 dimension decomposition Line 14404 used ne01*ne0 (= ne01*1) instead of ne01*ne02 for the i3 term in the flat row index formula. When ne02 > 1 (batched 2D inputs), this causes wrong memory access and corrupted results.	2026-06-24 09:09:33 +02:00
Farmadupe	7ccf1d2095	allow user to use THP for host allocations with GGML_CUDA_HOST_MALLOC_THP (#2010 ) * allow user to use THP for host allocations with GGML_CUDA_HOST_MALLOC_THP * Remove useless symbol check	2026-06-23 15:13:41 +02:00
Nexes the Elder	2d3ecd5e19	Fix minor CUDA discrepancies (part 2) (#2015 ) * fix: wrong tensor index in BF16 fused RMS norm add path (norm.cu:1039) The BF16 branch of ggml_cuda_op_fused_rms_rms_add used dst->src[2]->data for the second weight pointer, but should have used dst->src[3]->data. This caused reading float weights from the wrong bf16 input tensor. The F32 and F16 branches both correctly reference src[3], and the assertions at lines 1013-1015 confirm src[3] is the F32 weight tensor. * fix: off-by-one bounds check in 7 dmmv kernels (row > nrows -> row >= nrows) Seven K-quant dequantize_mul_mat_vec kernels used row > nrows for bounds checking instead of row >= nrows. Since rows are 0-indexed (0..nrows-1), the check missed the row == nrows case, allowing a potential out-of-bounds memory write when grid dimensions produce exactly nrows. The templated dequantize_mul_mat_vec<type> kernel at line 667 already used the correct row >= nrows pattern. * fix: typo in function name iqk_mul_mat_vec_q_kerne -> iqk_mul_mat_vec_q_kernel Truncated function name in iqk_mmvq_templates.cuh was missing trailing 'l'. * fix: print actual split_dim value in set_tensor error message (ggml-cuda.cu) fprintf used extra->split_dim == 0 which evaluates to boolean 0 or 1 instead of the actual split dimension value. When this fatal error is hit for an unsupported split_dim, the user could not diagnose which value caused the problem. * fix: wrong src index in gate bias stride for fused up-gate MoE path ggml_cuda_add_id for the gate bias used dst->src[4]->nb[1] as the stride argument instead of dst->src[5]->nb[1]. This was a copy-paste error from the up-bias code (lines 3220-3224) where src[4] is correct. If src[4] and src[5] have different strides, the bias addition produces incorrect results. * fix: wrong row count for gate projection MMQ in fused up-gate MoE path ggml_cuda_op_mul_mat_q for the gate projection (src0_2) used src0_1->ne[1] as row_high instead of src0_2->ne[1]. This copy-paste error causes processing the wrong number of rows if the up and gate projections have different row counts. The gemv path (line ~3563) correctly used src0_2->ne[1].	2026-06-23 14:03:22 +02:00
Nexes the Elder	9eaf86a7c7	Fix minor CUDA discrepancies (#2005 ) * CUDA : typo * CUDA: Add missing GGML_CALL to function definition * CUDA: only log GGML_CUDA_FORCE_MMQ/CUBLAS when enabled * CUDA: Fix softcap bug in flash_attn_tile_ext_f16 The else branch (softcap != 0) incorrectly called launch_fattn_tile_f16_64_128 with use_softcap=false instead of true, causing logit softcap to be silently ignored for the col_per_block=32, parallel_blocks=1 path.	2026-06-23 09:37:48 +02:00
magikRUKKOLA	72440a19fc	on-demand tensor reload (#1989 ) * host-swap tensor loop the host-swap functionality is only triggered when the certain env. variables are declared * target_include_directories tweak * hot-swap tensor support two intrusions: 1.) at the model loading to collect the snapshot 2.) the modification of the `/health` HTTP endpoint to be able to trigger the hot-swap via sending the `llama-server` the HTTP-request. both a braced by the specific env. variables hot-swap tensor support; graph invalidation ggml_backend_cuda_invalidate_graphs export * hot-swap tensor support graph invalidation implementation; extended debug output (commented out) * llama_reload_changed_tensors export * tensor hot-swap on-demand reload cpu-only/hybrid/gpu-only with split mode layer/graph full support implementation * docs * reuse the gguf parsing from llama.cpp gguf_init_from_file, gguf_find_tensor, ggml_get_tensor * remove the manual scheduling for hybrid inference * update docs * tensor shape validation * update docs * update docs accidentally wiped the previous changes; so recovered them * revert the GGML_CUDA_MAX_DEVICES to 16 * update llama_reload_changed_tensor update llama_reload_changed_tensor, revert CMakeLists.txt * update llama_reload_changed_tensor * GGML_MAX_SRC GGML_MAX_SRC compile-time definition support * GGML_MAX_SRC GGML_MAX_SRC compile-time definition support * GGML_MAX_SRC GGML_MAX_SRC compile-time definition support * llama_reload_changed_tensor update llama_reload_changed_tensor definition * refactory move the tensor-reloading implementation to llama-reload.cpp, llama-reload-info.h; some bugfixes and code reduction * revert added back the missing newline * update docs * reload_info constructor * bugfix: cpu-only TODO: improve the working environment by compiling for multiple hardware configurations; possibly make a test pipeline * cpu-only bugfix set the fix again after unsuccessful sync with main * windows os compilation fix #include <string> * fix windows os build error C2039: 'string': is not a member of 'std' * remove dead file * implement perplexity in server * Revert "implement perplexity in server"	2026-06-22 16:36:34 +02:00
a1batross	6c00e87ac8	cmake: drop ggml-blas.h from GGML_PUBLIC_HEADERS (#2007 )	2026-06-21 07:49:09 +02:00
Kawrakow	b21653a56f	Fully remove any BLAS remnants (#2001 ) * Fully remove any BLAS remnants * Also these	2026-06-19 17:26:09 +02:00
Nexes the Elder	b3dfb7858c	AVX VNNI auto-activation for MSVC ; HAVE_VNNI256 path for IQ4_XS_R8 and Qx_0 R4 quants. (#1991 ) * AVX VNNI auto-activation Enables auto-detect of AVX VNNI and its definition in the CMakeLists Detected by ik_llama.cpp. * IQ4_XS R8: Enable AVX-VNNI 256-bit path with MSVC compatibility Migrate mul_mat_iq4_xs_r8_q8_k_avx2() from HAVE_FANCY_SIMD to HAVE_VNNI256. Changes (6 guard sites + 8 intrinsic calls in iqk_gemm_kquants.cpp): - Replaced 3x #ifdef HAVE_FANCY_SIMD with #ifdef HAVE_VNNI256 - Replaced 3x #ifndef HAVE_FANCY_SIMD with #ifndef HAVE_VNNI256 - Replaced 8x raw _mm256_dpbusd_epi32 with ggml_mm256_dpbusd_epi32 (the ggml wrapper resolves to _mm256_dpbusd_avx_epi32 on MSVC via the iqk_config.h macro, which is the correct MSVC AVX-VNNI intrinsic available under /arch:AVX2; raw _mm256_dpbusd_epi32 does not exist in MSVC headers without AVX-512) Impact: - IQ4_XS_R8 matmul now uses VNNI256 on CPUs with AVX-VNNI but no AVX-512 (e.g. Intel Arrow Lake / Core Ultra 265K) - Previously limited to HAVE_FANCY_SIMD (full AVX-512) exclusively - This path is exercised when models are loaded with -rtr / --run-time-repack (in-memory repack) or when using --repack to create a permanent IQ4_XS_R8 file. Standard IQ4_XS does not auto-convert to IQ4_XS_R8 at load time. * Qx_0 R4 legacy quants: Enable VNNI256 path for AVX-VNNI CPUs with MSVC compatibility Three changes in iqk_gemm_legacy_quants.cpp: 1. DotHelper (line 23): Extend VNNI condition to include HAVE_VNNI256 (not just __AVX512VNNI__+VL) and use ggml_mm256_dpbusd_epi32 wrapper for MSVC compatibility. This fixes Q6_0 non-R4 path and all other quant types routed through UnsignedDot/SignedDot. 2. accum_q4_0_quants (line 994), mul_mat_q5_0_r4_q8_2_avx2 (lines 1202, 1223), mul_mat_q6_0_r4_q8_2_avx2 (lines 1375, 1394): Replace #ifdef HAVE_FANCY_SIMD / #ifndef HAVE_FANCY_SIMD with HAVE_VNNI256 (which correctly detects AVX-VNNI without requiring full AVX-512). Also replace raw _mm256_dpbusd_epi32 with ggml_mm256_dpbusd_epi32 wrapper. These paths were dead code on Arrow Lake (HAVE_FANCY_SIMD requires full AVX-512 which Arrow Lake lacks). Now they compile and use the hardware VNNI instruction (vpdpbusd) via __AVXVNNI__. Note: remaining HAVE_FANCY_SIMD guards in this file guard true AVX-512 paths (_mm512_* intrinsics) and are left unchanged. * Simplify def	2026-06-18 18:05:19 +02:00
Farmadupe	21f918c185	faster ggml_cuda_host_malloc (#1988 )	2026-06-18 10:01:34 +02:00
Kawrakow	71af16a6b7	Fix DFlash performance with split mode graph (#1980 )	2026-06-17 18:40:02 +02:00
Kawrakow	6be3a488d3	CUDA FA: faster TG when GQA is 16 and head size is 128	2026-06-15 11:46:02 +00:00
Kawrakow	f81673c7db	Merge pull request #1972 from ikawrakow/ik/minimaxm3_smgraph Split mode graph for MiniMax-M3	2026-06-15 13:44:19 +02:00
Kawrakow	1dc4ea938a	Merge pull request #1962 from ikawrakow/ik/fix_1961 Fix #1961	2026-06-15 13:00:27 +02:00
Kawrakow	c24d50dd88	Split mode graph for MiniMax-M3	2026-06-15 08:41:34 +00:00
Nexesenex	0fdac83272	Fix Q8_0 graph reduce type Analogous to the BF16 fix in eea6a82b25, this adds proper Q8_0 type handling in ggml_cuda_op_add: - Add k_add_q8_0_f32 kernel: dequantize Q8_0, add F32, store F32 - Add k_add_q8_0_q8_0_f32 kernel: dequantize two Q8_0, add, store F32 - Add Q8_0+Q8_0/Q8_0+F32/F32+Q8_0 branches in the F32 dst (else) block, preventing Q8_0 data from falling through to the incorrect half cast - Expand Q8_0 dst branch to handle F32+Q8_0->Q8_0 (swapped args), not just Q8_0+F32->Q8_0	2026-06-14 16:13:17 +02:00
Kawrakow	c73bfbe9ce	Fix #1961	2026-06-14 07:42:39 +00:00
Kawrakow	e6f8112f3b	Adjust CUDA FA kernel parameters for head size 512 on Turing (#1942 )	2026-06-10 07:49:21 +02:00
Kawrakow	366e478cb6	Bug fixes (#1940 ) * Bug fixes * More	2026-06-10 07:45:49 +02:00
Kawrakow	a38d29232d	CPU FA: disable mask optimization (#1935 )	2026-06-09 09:13:19 +02:00
Kawrakow	eea6a82b25	Fix bf16 graph reduce type (#1938 )	2026-06-08 16:51:05 +02:00
Kawrakow	1660459db5	CUDA FA: cover Gemma4-4B/2B assistant (#1934 )	2026-06-08 08:18:26 +02:00
Kawrakow	b50b0919d5	CPU FA: Check for empty attention mask (#1923 )	2026-06-08 07:54:57 +02:00
Kawrakow	1b53a58bf9	Enable split mode graph for Gemma4-12B (#1922 )	2026-06-05 10:59:22 +02:00
Chip Bradford	19dcc1f7d1	CUDA : support head_dim 512 with gqa_ratio % 8 (unblocks Gemma 4 12B) (#1921 ) The MMA flash-attention dispatcher only instantiated ncols2 = 8 and 4 for head_dim 512, so any other GQA ratio hit GGML_ABORT. Gemma 4 12B's global attention layers use head_dim 512 with a 16:1 GQA ratio (16 query heads / 1 KV head), which aborts at load. Because MTP speculative decoding requires flash attention, this also blocks the Gemma 4 12B MTP drafter entirely. Instantiating ncols2 = 16 there is not viable: it exceeds the maximum dynamic shared memory on Ada (cudaFuncSetAttribute returns invalid argument). Instead, route gqa_ratio % 8 == 0 (covering 8 and 16) through the existing ncols2 = 8 kernel, which already iterates over Q-head groups (iter_z = ceil(gqa_ratio / ncols2)). gqa_ratio 8 and 4 behavior is unchanged; this mirrors the divisor dispatch already used for the 576x512 case below. Verified on RTX 4070 Ti SUPER (Ada, cc 8.9): Gemma 4 12B + MTP drafter now runs with flash attention; draft acceptance 43-95% by workload, 1.5-2.2x end-to-end speedup. The 26B-A4B drafter (gqa_ratio 8) is unaffected. Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-04 17:36:10 +02:00
Kawrakow	3bf7e836c2	Allow Hadamard transform for head sizes that are not power of 2 (#1883 ) * Disable K Hadamard transform if K-head size is not a power of 2 * Allow Hadamard transform for head sizes that are not power of 2 * Give more details why Hadamard is not possible * Arghh	2026-05-27 18:29:32 +03:00
Gearstickle	4fbd0c441b	fa: preserve early-termination, fix multi-slot correctness via union of masks (#1880 ) * fa: fix FlashQKV early-termination causing S=0 assertion with --parallel N>1 The backward-scan optimization in compute_helper/compute_helper_q checks only one mask position per k_step block on the last query row (q_step-1) to find where valid KV entries end. When q_step > 1 and different query rows have non-overlapping valid KV regions (multi-slot / --parallel N>1), the scan on the last row's mask can miss blocks that contain valid entries for earlier rows. This causes those rows to accumulate S=0, triggering the GGML_ASSERT(S > 0) in normalize_and_store_1row. Fix: remove the early-termination scan at all 4 sites and iterate all nk1/k_step blocks unconditionally. The mask already handles correctness: fully-masked blocks produce smax=-inf and skip V accumulation, so the performance cost is minimal for TG (small nq1) and acceptable for PP. Fixes #809 * fa: refactor multi-slot mask fix into mask_effective_nk1() helper Replace 4× inlined early-termination scans with a shared helper that computes the effective K boundary by scanning ALL query mask rows (union-of-masks). This is the minimal fix for multi-slot parallel inference where different slots have different sequence lengths. The helper returns the k_step-aligned boundary covering the longest active sequence across all rows, preserving single-slot performance (single row = same boundary as before). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> --------- Co-authored-by: Turbomen008 <Turbomen008@users.noreply.github.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-05-26 16:16:49 +03:00
Kawrakow	b3d39cff8b	Fix split mode graph for Qwen35-MoE + MTP (#1861 )	2026-05-22 09:23:53 +03:00
Kawrakow	4b73de246b	Fix crash with split mode graph and partial offload (#1857 )	2026-05-21 13:36:01 +03:00
David Young	aefb8bdd99	MLA TP -khad: ggml_dequant_hadamard fused op + wv_b/wk_b_pp Hadamard fold (#1852 ) * ggml: ggml_dequant_hadamard fused op for MLA -khad path Adds a new ggml op that fuses (ggml_cast -> F32) + (ggml_hadamard) into a single kernel. Reads a quantized (or F16/F32) source and produces a per- Hadamard-block F32 chunk with the inverse transform applied, without materializing a full-size F32 intermediate buffer. Motivation: the MLA pp_opt path in build_deepseek2.cpp un-encodes the H-applied cache_nope view at every PP call. Today that runs as a cast (quant -> F32) followed by a separate ggml_hadamard kernel, costing two full-size F32 passes per layer per rank per call. Fusing them halves the bandwidth on the un-encode and removes one kernel launch. CUDA kernels in dequant_hadamard.cu lift the Walsh-Hadamard butterfly from hadamard.cu and dequant helpers from dequantize.cuh: * qr=1 layout (q8_0): consecutive dequant pair, stage 1 fused with load * qr=2 layout (q4_0 / q4_1 / q5_0 / q5_1 / q6_0 / iq4_nl): dequant pair at stride qk/2, explicit stage 1 after sync * F16 has a dedicated kernel * F32 source falls back to the standalone Hadamard op CPU impl in iqk_cpu_ops.cpp composes the existing type_traits.to_float dequant with fast_ht for graph completeness. nh in {64, 128, 256, 512}. * MLA-TP: Hadamard pretransform of wv_b/wk_b_pp for -khad Fold the 64-block orthonormal Hadamard into wv_b and wk_b_pp once at context init so the pp_opt mul_mats consume the K cache in its on-disk encoded basis. The per-PP-call cache_nope un-Hadamard is then skipped (rope half still un-applied — it goes to FA via concat, no wk_b multiply). Math is identity by H^T H = I: mul_mat(H@wv_b, H@cache) = wv_b^T @ cache. For mla=2/3 absorb, composes correctly with the existing post-FA ggml_hadamard(kqv_compressed, 64). All-or-nothing across layers under a castable type-allowlist (excludes 1-3 bpw IQ types whose requant blows up beyond PPL noise). Models with ineligible weights fall back to the runtime un-Hadamard path unchanged. Composes with the fused ggml_dequant_hadamard op (prior commit): with the fold active only the rope half still runs the runtime transform, via the fused kernel. * MLA-TP: fix TG with -khad after wv_b/wk_b_pp fold The absorb branch of build_deepseek2_tp_attention applies ggml_hadamard to kqv_compressed after FA, then multiplies by wv_b. Pre-fold this was needed because wv_b was un-encoded; with the wv_b fold (prior commit) the mul_mat already expects H-encoded kqv_compressed: mul_mat(H @ wv_b, kqv_encoded) = wv_b^T @ H @ H @ kqv_unencoded = wv_b^T @ kqv_unencoded (H @ H = I) Skip the post-FA hadamard when model.khad_pretransformed is set so the two H applications cancel instead of double-applying. Affects the absorb branch: TG (n_tokens=1), short-context PP (n_kv < 1024), and models without wk_b_pp. Long-context PP goes through the pp_opt branch and is unrelated/unchanged. Reported by @ikawrakow on PR 1852. Verified across mla={1,2,3} x khad={on,off} x -ctk={q8_0,q4_0} on GLM-4.7-Flash IQ5_K and the unsloth IQ4_XS variant ik used to reproduce. * ggml_hadamard: accept F16 and quant sources; drop GGML_OP_DEQUANT_HADAMARD Per @ikawrakow review on PR 1852: subsume the per-source-type dispatch into the existing GGML_OP_HADAMARD instead of carrying a separate enum entry, op constructor, and standalone files. ggml_hadamard's API is unchanged from the call-site perspective. The constructor's F32-only assertion is dropped; ggml_cuda_op_hadamard and iqk_hadamard now dispatch internally: - F32 source: existing F32 butterfly (unchanged) - F16 source: dedicated kernel - q8_0 / q4_0 / q4_1 / q5_0 / q5_1 / q6_0 / iq4_nl: fused dequant + butterfly kernel (lifted from the deleted dequant_hadamard.cu) - CPU side composes traits.to_float with fast_ht Net diff: -80 lines. Removes dequant_hadamard.{cu,cuh}, the enum entry, op table rows, ggml_dequant_hadamard constructor, dispatch cases, and the DEQUANT_HADAMARD supports_op block. Verified clean build + TG smoke (mla=3 +khad q8 on GLM-4.7-Flash-IQ4_XS, same coherent output as prior commit on feat/dequant-hadamard).	2026-05-21 07:29:15 +03:00
Kawrakow	40254a51da	Fix MTP when -no-gr is used (#1848 )	2026-05-20 13:38:33 +03:00
David Young	c07a052315	MLA tensor parallelism under -sm graph (DEEPSEEK2/GLM_DSA/MISTRAL4) (#1821 ) * MLA tensor parallelism under -sm graph (DEEPSEEK2/GLM_DSA/MISTRAL4) Extends -sm graph (split-mode graph) to MLA-style attention across the DEEPSEEK2, GLM_DSA, and MISTRAL4 architectures. Previously these archs fell back to -sm layer regardless of the user's flag. Implementation: - Per-rank attention build in build_deepseek2_tp_attention with view-sliced FlashAttention, split-buffer output projection, and ggml_reduce across devices - wk_b / wv_b absorbed weights replicated per device via materialize() in llm_prepare_mla (these can't live in a split buffer) - KV cache replication path (replicated_k_l) for graph-mode TP - distribute_mla_tensors_for_split_mode_graph routes attention/norm tensors into ctx_split; expert tensors stay per-layer - Implements ggml_backend_cuda_split_buffer_get_tensor for the replicated / row-split / col-split inverse paths - Early-reject guard in src/llama.cpp that auto-downgrades -sm graph to -sm layer (with a warning) when incompatible loader flags are set: -ncmoe, -cmoe, -ot, -rtr, -muge New CLI flag: - -gap | --graph-attn-precision <f16|f32> (default f16) See the PR description for the full validation matrix (3 archs x 2/4/8 GPU counts), perf numbers, VRAM accounting, and known limitations. * Some tweaks * materialize lambda: per-head split for graph-mode tp_replicate 7dd19e19 changed wk_b/wv_b distribution from mirror to per-head split (split_dim=2) via prepare_split_tensors. That path only fires when wk_b/wv_b are loaded from GGUF. Models that store only wkv_b in GGUF derive wk_b/wv_b at load via llm_prepare_mla, going through the materialize lambda, which was untouched and still produced mirror replicas (split_dim=-1, full n_head per device). build_deepseek2_tp_attention now does mul_mat(wk_b_local, q_nope_perm) without the prior view_3d slice, so a mirror replica passes an n_head tensor where the kernel expects n_head_local. Result: silent SIGSEGV right after model load. Mirror logic in materialize is replaced with the same per-head split as prepare_split_tensors: head_offsets derived from wo split, each rank gets a tensor with ne[2]=n_head_local, data copied from the appropriate source byte slice. Singular `computed` tensor keeps full metadata for tensors_by_name lookups. Tested: 8x3090, -sm graph -mla 3 -fa on now boots cleanly and sweep-benches without crash. Log confirms new path: "Computed blk.X.attn_k_b.weight ... split across N devices on dim=2". * cleanup: indent fix + remove dead view_3d slicing and debug printf - build_deepseek2.cpp: re-indent the self_attention block in build_deepseek2_layer_attention (lines 253-670). Block was at column 0 inside a function body; now at the expected 4/8-space indent. - build_deepseek2.cpp: drop the commented-out view_3d slicing and debug printfs left over after 7dd19e19's switch to direct mul_mat on per-rank wk_b_local / wv_b_local. Update the stale 'wk_b is replicated (split_dim=-1)' comment to match the new split_dim=2 reality. - ggml-cuda.cu: remove the leftover debug printf in ggml_backend_cuda_split_buffer_get_tensor. No behavior change. Verified with a clean rebuild and DSV2.5 + GLM-4.7-Flash sweep-bench runs. * llm_load_tensors: gate incompatible-flag warning to MLA archs The -ncmoe / -rtr / -muge / -ot warning under -sm graph currently fires for all archs that support graph mode. That's an over-reach: the incompatibility is specific to the MLA TP paths (DEEPSEEK2, GLM_DSA, MISTRAL4) — Gemma4 graph mode existed pre-PR and works with those flags. Gate the warning to MLA archs only. Also refreshes two stale comments left over from the wk_b/wv_b mirror -> per-head-split rewrite: - src/llama.cpp llm_prepare_mla: "Replicate wk_b/wv_b ..." now reads "Per-head split wk_b/wv_b ..." to match what the materialize lambda actually does post-823a39e2. - src/llama-load-tensors.cpp distribute_mla_tensors_for_split_mode_graph: drop the wkv_b row-split mention (wkv_b is no longer created under graph mode after 7dd19e19) and correct the wk_b/wv_b distribution description (per-head split, not per-device replicated). --------- Co-authored-by: Kawrakow <iwankawrakow@gmail.com>	2026-05-19 08:36:17 +03:00
Joel Farthing	f43a9f1cf6	Add per-byte CUDA MoE offload threshold (#1813 ) Co-authored-by: Joel Farthing <262452229+joelfarthing@users.noreply.github.com>	2026-05-19 08:35:05 +03:00
Kawrakow	3e573cfea6	MTP: option to use re-quantized output tensor for better TG performance (#1809 ) * Option to use re-quantized output tensor for MTP * Remove quantize extra output option * Handle interleaved types	2026-05-16 14:40:18 +03:00
Jun Yamog	53cd4d0ff0	fix: use mmq for volta quantized matmuls (#1785 )	2026-05-15 08:11:49 +03:00
Forkoz	8a0f912cb2	Remove outdated asserts from mmproj (#1795 )	2026-05-13 17:40:11 +03:00
Kawrakow	6b221f0c1f	Fix ggml_nbytes (#1798 )	2026-05-13 17:39:25 +03:00
Kawrakow	397150caa2	MTP: faster recurrent state restore (#1791 ) * MTP: store ready per step convolution states * Cleanup	2026-05-13 11:00:24 +03:00
Kawrakow	86b5d076c5	Gemma4 MTP: avoid casting KV cache to f32 (#1786 )	2026-05-13 09:11:27 +03:00
Jun Yamog	8b0cd0357a	fix: keep sm70 cublas f32 outputs in f32 (#1776 )	2026-05-12 07:38:42 +03:00
Kawrakow	cec1a6c1f5	MTP: Reuse graphs (again) (#1780 )	2026-05-12 07:36:12 +03:00
Kawrakow	eb570eb966	MTP: Avoid per step SSM copy (#1778 ) * Avoid copying the per-step SSM state (CUDA) * Avoid copying the per-step SSM state (CPU) * Allocate only what is necessary for per-step SSM state * Cleanup	2026-05-11 18:15:55 +03:00
Kawrakow	3557b446f8	Avoid recurrent state copy (#1777 )	2026-05-11 13:13:59 +03:00
Samuel Oliveira Alves	c2b8bca807	Add MTP Support for Gemma 4 (#1744 ) * gemma-mtp: build the arch to load the MTP model * gemma-mtp: fix mtp kv state * gemma-mtp: refactor some functions and create gguf * gemma-mtp: make usable for embeddings models variant * gemma-mtp: fix qwen mtp load in graph split * gemma-mtp: refactor tensor creation and adjust output tensor handling * Gemma 4 MTP: improve tensor handling, and adjust split mode logic	2026-05-10 07:44:20 +03:00
XZiar	ab0f22b819	Use AVX version VNNI intrinsic when AVX512VNNI not available. (#1748 ) * Use AVX version VNNI intrinsic when AVX512VNNI not available. * remove changes under HAVE_FANCY_SIMD --------- Co-authored-by: XZiar <xziar@xziar.xziar>	2026-05-09 09:02:06 +03:00
Kawrakow	2f0b47c19d	Use async copies to save/restore recurrent state (#1759 )	2026-05-09 08:31:56 +03:00
Alex	98950267c6	ggml : default GGML_WIN_VER to 0x0A00 (Windows 10) (#1755 ) The default of 0x602 (Windows 8) causes a build failure on any toolchain where _WIN32_WINNT propagates into vendored cpp-httplib (notably MinGW with the bundled w64devkit GCC). cpp-httplib's httplib.h has, for some time now, contained: #ifdef _WIN32 #if defined(_WIN32_WINNT) && _WIN32_WINNT < 0x0A00 #error "cpp-httplib doesn't support Windows 8 or lower. Please use Windows 10 or later." #endif #endif so the entire llama-server target fails to compile on Windows + MinGW unless the user passes -DGGML_WIN_VER=0x0A00 manually. Bumping the default to 0x0A00 (Windows 10) keeps Windows 8 reachable for anyone who explicitly requests it (-DGGML_WIN_VER=0x602) while letting the default Windows + MinGW build succeed end-to-end. Windows 8 / 8.1 reached end of support in January 2023, and Windows 10 is a strict superset of the Win8 surface used elsewhere (PrefetchVirtualMemory etc.), so this is strictly additive on the API side. Verified by building with w64devkit 2.8.0 (gcc 16.1.0) on Windows 11 without any -DGGML_WIN_VER override: all 266 ninja targets link cleanly, including bin/llama-server.exe, and llama-cli runs Qwen3-4B-Thinking-2507 IQ4_XS at ~6.2 tok/s with q8_0 KV at 4096 context.	2026-05-08 13:23:04 +03:00
dmaivel	1beaaa002d	speculative: enable MTP per-step checkpoints with CPU recurrent layers (#1724 )	2026-05-03 08:14:56 +03:00


675 Commits