* AVX VNNI auto-activation
Enables auto-detect of AVX VNNI and its definition in the CMakeLists
Detected by ik_llama.cpp.
* IQ4_XS R8: Enable AVX-VNNI 256-bit path with MSVC compatibility
Migrate mul_mat_iq4_xs_r8_q8_k_avx2() from HAVE_FANCY_SIMD to HAVE_VNNI256.
Changes (6 guard sites + 8 intrinsic calls in iqk_gemm_kquants.cpp):
- Replaced 3x #ifdef HAVE_FANCY_SIMD with #ifdef HAVE_VNNI256
- Replaced 3x #ifndef HAVE_FANCY_SIMD with #ifndef HAVE_VNNI256
- Replaced 8x raw _mm256_dpbusd_epi32 with ggml_mm256_dpbusd_epi32
(the ggml wrapper resolves to _mm256_dpbusd_avx_epi32 on MSVC via
the iqk_config.h macro, which is the correct MSVC AVX-VNNI intrinsic
available under /arch:AVX2; raw _mm256_dpbusd_epi32 does not exist
in MSVC headers without AVX-512)
Impact:
- IQ4_XS_R8 matmul now uses VNNI256 on CPUs with AVX-VNNI but no
AVX-512 (e.g. Intel Arrow Lake / Core Ultra 265K)
- Previously limited to HAVE_FANCY_SIMD (full AVX-512) exclusively
- This path is exercised when models are loaded with -rtr / --run-time-repack
(in-memory repack) or when using --repack to create a permanent IQ4_XS_R8 file.
Standard IQ4_XS does not auto-convert to IQ4_XS_R8 at load time.
* Qx_0 R4 legacy quants: Enable VNNI256 path for AVX-VNNI CPUs with MSVC compatibility
Three changes in iqk_gemm_legacy_quants.cpp:
1. DotHelper (line 23): Extend VNNI condition to include HAVE_VNNI256
(not just __AVX512VNNI__+VL) and use ggml_mm256_dpbusd_epi32
wrapper for MSVC compatibility. This fixes Q6_0 non-R4 path
and all other quant types routed through UnsignedDot/SignedDot.
2. accum_q4_0_quants (line 994), mul_mat_q5_0_r4_q8_2_avx2
(lines 1202, 1223), mul_mat_q6_0_r4_q8_2_avx2 (lines 1375, 1394):
Replace #ifdef HAVE_FANCY_SIMD / #ifndef HAVE_FANCY_SIMD with
HAVE_VNNI256 (which correctly detects AVX-VNNI without requiring
full AVX-512). Also replace raw _mm256_dpbusd_epi32 with
ggml_mm256_dpbusd_epi32 wrapper.
These paths were dead code on Arrow Lake (HAVE_FANCY_SIMD requires
full AVX-512 which Arrow Lake lacks). Now they compile and use
the hardware VNNI instruction (vpdpbusd) via __AVXVNNI__.
Note: remaining HAVE_FANCY_SIMD guards in this file guard true
AVX-512 paths (_mm512_* intrinsics) and are left unchanged.
* Simplify def
* Use hidden state from prev token from qwen mtp
* Fix Qwen35 MTP warmup
* Cleanup + remove unnecessary crippling performance by not using accept to sample draft token
* Provide API to gtet the model arch string
---------
Co-authored-by: SamuelOliveirads <samueloliveira32df@gmail.com>
When llama_set_causal_attn(false) is called on a causal model (e.g.
Gemma-4 during vision image decode), llama_set_inputs took the non-causal
else-branch (designed for pure embedding models).
That path wrote the F16 mask with stride n_tokens instead of n_kv, and iterated batch
indices rather than KV cache cells.
The result was that every image query row beyond the first was
written at the wrong offset, leaving stale -inf values from
previous decodes visible to the GPU kernel. Any conversation
that had built up prior KV mask data would produce all-inf attention scores
for most image tokens, collapsing softmax to NaN and aborting at sampling.
Resolves#1984
The graph builder for Minimax M3 (build_minimaxm3.cpp) was not passing
model.layers[il].ffn_up_gate_exps to llm_build_std_moe_ffn, unlike
Minimax M2 and all other MoE model graph builders.
When -muge (merge_up_gate_experts) is enabled, the merge creates a single
ffn_up_gate_exps tensor with ffn_up_exps and ffn_gate_exps as views.
Only the parent merged tensor gets the split 'extra' pointer set.
Without passing it as up_gate_exps parameter, the function sees null
split pointers for up/gate (the views) while split_down_exps is valid,
causing the assertion at llama-build-context.cpp:1453 to fail.
Analogous to the BF16 fix in eea6a82b25, this adds proper Q8_0
type handling in ggml_cuda_op_add:
- Add k_add_q8_0_f32 kernel: dequantize Q8_0, add F32, store F32
- Add k_add_q8_0_q8_0_f32 kernel: dequantize two Q8_0, add, store F32
- Add Q8_0+Q8_0/Q8_0+F32/F32+Q8_0 branches in the F32 dst (else) block,
preventing Q8_0 data from falling through to the incorrect half cast
- Expand Q8_0 dst branch to handle F32+Q8_0->Q8_0 (swapped args), not
just Q8_0+F32->Q8_0
* Refactor speculative decoding: move logic outside of server
* remove duplicated tokens in mtp kv cache
* narrow to only discard draft cells in MTP
* revert mtp_speculative_gen_draft
With --parallel 1, a client disconnect/timeout on a *queued* request aborts the
*active* decode of a different client (llama_decode: failed to decode, ret = -3 /
"Decode process is cancelled by user"), releasing the slot with the request
unfinished. To the active client the stream silently stalls and never returns,
while the server reports healthy — easy to misdiagnose as a network/proxy wedge.
Root cause: llama_decode_stop() signals a process-global stop flag that the
active decode loop polls. examples/server/server.cpp calls it *ungated* from the
request reader's connection-closed paths, so any reader closing (including a
queued, not-yet-running task's) trips the global flag against whatever decode is
currently active. Adjacent to #1576/#1673 ("clear sticky stop flag" +
hybrid/recurrent ret=-3), which did not gate these call sites against non-active
readers, so the queued-cancel-kills-active cascade still fires on current main.
Fix (minimal gate): add server_response_reader::any_task_on_slot() and gate the
three llama_decode_stop() sites on it, so the global stop is signalled only when
one of THIS reader's tasks is on a slot (the active decode). A queued task's
disconnect then only drops that queued task. Verified in production under heavy
concurrent, frequently-cancelled load (hundreds of queued-task cancels, zero
active-decode kills). Stdlib-only reproducer in the PR description.
Caveat: any_task_on_slot() reads the slots vector from the reader thread — the
same race class as the existing process-global flag; can be tightened to a
per-context/per-task cancellation if preferred.