* fix: wrong stride in batched quantized add1 (nb0 -> nb3)
ggml_compute_forward_add1_q_f32 used i3*nb0 (element stride) instead of
i3*nb3 (batch stride) for the destination row pointer. This causes all
add1 operations with quantized types and batch > 1 to write to wrong
memory locations. The src0 pointer on the line above correctly uses nb03.
* fix: wrong dimension limits in dup_f16 non-contiguous path
The destination index wrapping in ggml_compute_forward_dup_f16 used
source dimensions (ne00/ne01/ne02/ne03) instead of destination dimensions
(ne0/ne1/ne2/ne3). While source and destination shapes are currently
identical for dup, using the wrong variables is incorrect by design.
* fix: wrong dimension limits in dup_bf16 non-contiguous path
Same fix as the dup_f16 path: destination index wrapping used source
dimensions (ne00/ne01/ne02/ne03) instead of destination dimensions
(ne0/ne1/ne2/ne3). Copy-paste error from the contiguous path.
* fix: ACC work size uses src[1] instead of src[0]
The dequantization work buffer for quantized ACC was sized using
src[1]->ne[0] instead of src[0]->ne[0]. Since src[0] is the tensor
being dequantized, its dimensions should determine the buffer size.
* fix: missing work size for SOFT_CAP_MAX and ROPE_BACK
Both ops dereference params->wdata in their forward functions but had
no work size allocation (cur = 0), causing a NULL pointer dereference
when any thread attempted to use wdata.
* fix: wrong dim in sum_rows_f32 dimension decomposition
Line 14404 used ne01*ne0 (= ne01*1) instead of ne01*ne02 for the
i3 term in the flat row index formula. When ne02 > 1 (batched 2D
inputs), this causes wrong memory access and corrupted results.
* Handle Cohere2MoE unopened thinking before tools
* Cohere2MoE: route unopened thinking to reasoning_content; test in active target
Follow-up to #1968. Gate extract_reasoning on reasoning_format only (drop the
"&& enable_thinking" addition) so the unopened-thinking handling does not also
change where an opened thinking block is routed. Under --reasoning off
(enable_thinking=false, reasoning_format defaults to DEEPSEEK) an orphaned
thinking block is now quarantined in reasoning_content with clean content and a
native tool call, instead of leaking the thinking prose into the user-facing
answer.
Move the Cohere2MoE end-to-end parser cases into tests/test-chat-auto-parser.cpp,
which CMake actually builds. tests/test-chat.cpp has been disabled in
tests/CMakeLists.txt since #723, so cohere coverage added there never ran in CI;
revert the local band-aids to that file.
* Cohere2MoE: harden parser from NMC eval findings
---------
Co-authored-by: Joel Farthing <262452229+joelfarthing@users.noreply.github.com>
`{% set %}` of a non-loop variable inside a `{% for %}` body leaks
across iterations when the assignment is conditionally skipped. Each
iteration should start with a clean scope, matching standard Jinja2
semantics.
This fixes the issue with GLM-5.2 chat template when:
* turn 1 is a tool call with reasoning
* turn 2 is a tool call without reasoning
In this case, the reasoning content for turn 1 would be wrongly
duplicated to turn 2, resulting in degraded model performance.
ggml_print_tensor() accumulated `sum` inside the truncated print loop, which
skips the middle of each row when ne[0] > 2*n (n = 3). The printed `sum =`
therefore only covered the first n and last n elements per row, not the whole
tensor. For example a {2688, 5} tensor reported the sum of 30 of its 13440
values. That makes the value useless for numerically comparing two runs, and it
disagrees with mainline llama.cpp, whose eval-callback sums every element in a
separate pass.
This factors the per-element read into a small helper and computes the sum in
its own loop over all elements (double accumulator). The truncated print is a
separate, unchanged pass, so only the printed `sum =` value changes. The change
is confined to examples/eval-callback.
Co-authored-by: mb8565 <244351746+mb8565@users.noreply.github.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
* fix: wrong tensor index in BF16 fused RMS norm add path (norm.cu:1039)
The BF16 branch of ggml_cuda_op_fused_rms_rms_add used dst->src[2]->data
for the second weight pointer, but should have used dst->src[3]->data.
This caused reading float weights from the wrong bf16 input tensor.
The F32 and F16 branches both correctly reference src[3], and the
assertions at lines 1013-1015 confirm src[3] is the F32 weight tensor.
* fix: off-by-one bounds check in 7 dmmv kernels (row > nrows -> row >= nrows)
Seven K-quant dequantize_mul_mat_vec kernels used row > nrows for bounds
checking instead of row >= nrows. Since rows are 0-indexed (0..nrows-1),
the check missed the row == nrows case, allowing a potential out-of-bounds
memory write when grid dimensions produce exactly nrows.
The templated dequantize_mul_mat_vec<type> kernel at line 667 already used
the correct row >= nrows pattern.
* fix: typo in function name iqk_mul_mat_vec_q_kerne -> iqk_mul_mat_vec_q_kernel
Truncated function name in iqk_mmvq_templates.cuh was missing trailing 'l'.
* fix: print actual split_dim value in set_tensor error message (ggml-cuda.cu)
fprintf used extra->split_dim == 0 which evaluates to boolean 0 or 1
instead of the actual split dimension value. When this fatal error is
hit for an unsupported split_dim, the user could not diagnose which
value caused the problem.
* fix: wrong src index in gate bias stride for fused up-gate MoE path
ggml_cuda_add_id for the gate bias used dst->src[4]->nb[1] as the stride
argument instead of dst->src[5]->nb[1]. This was a copy-paste error from
the up-bias code (lines 3220-3224) where src[4] is correct. If src[4]
and src[5] have different strides, the bias addition produces incorrect
results.
* fix: wrong row count for gate projection MMQ in fused up-gate MoE path
ggml_cuda_op_mul_mat_q for the gate projection (src0_2) used
src0_1->ne[1] as row_high instead of src0_2->ne[1]. This copy-paste
error causes processing the wrong number of rows if the up and gate
projections have different row counts. The gemv path (line ~3563)
correctly used src0_2->ne[1].
* CUDA : typo
* CUDA: Add missing GGML_CALL to function definition
* CUDA: only log GGML_CUDA_FORCE_MMQ/CUBLAS when enabled
* CUDA: Fix softcap bug in flash_attn_tile_ext_f16
The else branch (softcap != 0) incorrectly called launch_fattn_tile_f16_64_128
with use_softcap=false instead of true, causing logit softcap to be silently
ignored for the col_per_block=32, parallel_blocks=1 path.
* host-swap tensor loop
the host-swap functionality is only triggered when the certain env. variables are declared
* target_include_directories tweak
* hot-swap tensor support
two intrusions:
1.) at the model loading to collect the snapshot
2.) the modification of the `/health` HTTP endpoint to be able to trigger the hot-swap via sending the `llama-server` the HTTP-request.
*both a braced by the specific env. variables
* hot-swap tensor support; graph invalidation
ggml_backend_cuda_invalidate_graphs export
* hot-swap tensor support
graph invalidation implementation; extended debug output (commented out)
* llama_reload_changed_tensors export
* tensor hot-swap on-demand reload
cpu-only/hybrid/gpu-only with split mode layer/graph full support implementation
* docs
* reuse the gguf parsing from llama.cpp
gguf_init_from_file, gguf_find_tensor, ggml_get_tensor
* remove the manual scheduling for hybrid inference
* update docs
* tensor shape validation
* update docs
* update docs
accidentally wiped the previous changes; so recovered them
* revert the GGML_CUDA_MAX_DEVICES to 16
* update llama_reload_changed_tensor
update llama_reload_changed_tensor, revert CMakeLists.txt
* update llama_reload_changed_tensor
* GGML_MAX_SRC
GGML_MAX_SRC compile-time definition support
* GGML_MAX_SRC
GGML_MAX_SRC compile-time definition support
* GGML_MAX_SRC
GGML_MAX_SRC compile-time definition support
* llama_reload_changed_tensor
update llama_reload_changed_tensor definition
* refactory
move the tensor-reloading implementation to llama-reload.cpp, llama-reload-info.h; some bugfixes and code reduction
* revert
added back the missing newline
* update docs
* reload_info constructor
* bugfix: cpu-only
TODO: improve the working environment by compiling for multiple hardware configurations; possibly make a test pipeline
* cpu-only bugfix
set the fix again after unsuccessful sync with main
* windows os compilation fix
#include <string>
* fix windows os build
error C2039: 'string': is not a member of 'std'
* remove dead file
* implement perplexity in server
* Revert "implement perplexity in server"
* AVX VNNI auto-activation
Enables auto-detect of AVX VNNI and its definition in the CMakeLists
Detected by ik_llama.cpp.
* IQ4_XS R8: Enable AVX-VNNI 256-bit path with MSVC compatibility
Migrate mul_mat_iq4_xs_r8_q8_k_avx2() from HAVE_FANCY_SIMD to HAVE_VNNI256.
Changes (6 guard sites + 8 intrinsic calls in iqk_gemm_kquants.cpp):
- Replaced 3x #ifdef HAVE_FANCY_SIMD with #ifdef HAVE_VNNI256
- Replaced 3x #ifndef HAVE_FANCY_SIMD with #ifndef HAVE_VNNI256
- Replaced 8x raw _mm256_dpbusd_epi32 with ggml_mm256_dpbusd_epi32
(the ggml wrapper resolves to _mm256_dpbusd_avx_epi32 on MSVC via
the iqk_config.h macro, which is the correct MSVC AVX-VNNI intrinsic
available under /arch:AVX2; raw _mm256_dpbusd_epi32 does not exist
in MSVC headers without AVX-512)
Impact:
- IQ4_XS_R8 matmul now uses VNNI256 on CPUs with AVX-VNNI but no
AVX-512 (e.g. Intel Arrow Lake / Core Ultra 265K)
- Previously limited to HAVE_FANCY_SIMD (full AVX-512) exclusively
- This path is exercised when models are loaded with -rtr / --run-time-repack
(in-memory repack) or when using --repack to create a permanent IQ4_XS_R8 file.
Standard IQ4_XS does not auto-convert to IQ4_XS_R8 at load time.
* Qx_0 R4 legacy quants: Enable VNNI256 path for AVX-VNNI CPUs with MSVC compatibility
Three changes in iqk_gemm_legacy_quants.cpp:
1. DotHelper (line 23): Extend VNNI condition to include HAVE_VNNI256
(not just __AVX512VNNI__+VL) and use ggml_mm256_dpbusd_epi32
wrapper for MSVC compatibility. This fixes Q6_0 non-R4 path
and all other quant types routed through UnsignedDot/SignedDot.
2. accum_q4_0_quants (line 994), mul_mat_q5_0_r4_q8_2_avx2
(lines 1202, 1223), mul_mat_q6_0_r4_q8_2_avx2 (lines 1375, 1394):
Replace #ifdef HAVE_FANCY_SIMD / #ifndef HAVE_FANCY_SIMD with
HAVE_VNNI256 (which correctly detects AVX-VNNI without requiring
full AVX-512). Also replace raw _mm256_dpbusd_epi32 with
ggml_mm256_dpbusd_epi32 wrapper.
These paths were dead code on Arrow Lake (HAVE_FANCY_SIMD requires
full AVX-512 which Arrow Lake lacks). Now they compile and use
the hardware VNNI instruction (vpdpbusd) via __AVXVNNI__.
Note: remaining HAVE_FANCY_SIMD guards in this file guard true
AVX-512 paths (_mm512_* intrinsics) and are left unchanged.
* Simplify def
* Use hidden state from prev token from qwen mtp
* Fix Qwen35 MTP warmup
* Cleanup + remove unnecessary crippling performance by not using accept to sample draft token
* Provide API to gtet the model arch string
---------
Co-authored-by: SamuelOliveirads <samueloliveira32df@gmail.com>
When llama_set_causal_attn(false) is called on a causal model (e.g.
Gemma-4 during vision image decode), llama_set_inputs took the non-causal
else-branch (designed for pure embedding models).
That path wrote the F16 mask with stride n_tokens instead of n_kv, and iterated batch
indices rather than KV cache cells.
The result was that every image query row beyond the first was
written at the wrong offset, leaving stale -inf values from
previous decodes visible to the GPU kernel. Any conversation
that had built up prior KV mask data would produce all-inf attention scores
for most image tokens, collapsing softmax to NaN and aborting at sampling.
Resolves#1984
The graph builder for Minimax M3 (build_minimaxm3.cpp) was not passing
model.layers[il].ffn_up_gate_exps to llm_build_std_moe_ffn, unlike
Minimax M2 and all other MoE model graph builders.
When -muge (merge_up_gate_experts) is enabled, the merge creates a single
ffn_up_gate_exps tensor with ffn_up_exps and ffn_gate_exps as views.
Only the parent merged tensor gets the split 'extra' pointer set.
Without passing it as up_gate_exps parameter, the function sees null
split pointers for up/gate (the views) while split_down_exps is valid,
causing the assertion at llama-build-context.cpp:1453 to fail.