4666 Commits

Author SHA1 Message Date
Kawrakow
f96eaddba8
Revert DFlash SWA optimization (#2039) 2026-06-26 11:00:09 +02:00
Kawrakow
1255b1e479
Minor DFlash tweaks (#2034) 2026-06-26 10:31:03 +02:00
Farmadupe
af62a37acd
Prune examples/llava. Dead code. (#2025)
examples/llava has been replaced by mtmd since late 2025, and has 
been out-of-build in ik_llama.cpp since examples/CMakeLists.txt removed it
in #798.

Repointed descriptions from llava to mtmd where they remained.
2026-06-26 08:48:48 +02:00
mb8565
c713bd599b
llama : fix CPU-only load crash on a CUDA build (device_mem out-of-bounds) (#2037)
Loading a model with no GPU layers on a binary built with CUDA crashes in
`llm_load_tensors`. The GPU-fit block is guarded by `if (device_count > 0)`, but
`device_count` comes from `model.splits`, which always has at least one entry
(`{1.0f}`). The memory array it indexes, `device_mem`, is sized by `model.devices`,
which is empty when no GPU is present or when the model is loaded with `-ngl 0`. So the
block runs with `device_count >= 1` and reads `device_mem[0]` out of bounds.

Repro: build with `-DGGML_CUDA=ON` on a host that has no usable GPU, or hide the GPUs
with `CUDA_VISIBLE_DEVICES=""`, then load any model. The load segfaults inside the fit
loop (confirmed with DeepSeek-V2-Lite-Q4_K_M). With a real GPU present `model.devices`
is non-empty even at `-ngl 0`, so the crash needs the empty-device case.

The fix is to also require `!model.devices.empty()` before entering the GPU-fit block.
CPU-only placement is already handled earlier (all layers go to the CPU when there are
no GPU layers), so skipping this block on a CPU-only load is correct.

GPU loads still take the block since `model.devices` is non-empty. CPU-only loads on a
CUDA build now finish and decode normally instead of crashing.
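
The guard described above can be sketched as follows. This is a minimal illustration, not the actual `llm_load_tensors` code: `ModelSketch` and `should_run_gpu_fit` are hypothetical names, and only the relationship between `splits` and `devices` is taken from the commit message.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the two model fields named in the commit message.
struct ModelSketch {
    std::vector<float> splits;   // always has at least one entry, e.g. {1.0f}
    std::vector<int>   devices;  // empty on a CPU-only load
};

// Returns true only when the GPU-fit block may safely index per-device memory.
bool should_run_gpu_fit(const ModelSketch & model) {
    assert(!model.splits.empty());  // invariant stated in the commit message
    const std::size_t device_count = model.splits.size();
    // Old guard: device_count > 0 alone, which is always true.
    // New guard: additionally require a non-empty device list.
    return device_count > 0 && !model.devices.empty();
}
```

With an empty `devices` vector the function returns false, so an array sized by `devices` is never indexed.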

Co-authored-by: local-llm <local-llm@local-llm-R740.cruvis.org>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-26 08:47:19 +02:00
mb8565
0ffdf509ab
ggml : fix set_rows CPU crash when the destination is F32 (#2038)
The CPU `set_rows` kernel for F32 sources fetches `type_traits[dst->type].from_float`
and calls it for every scattered row. F32 has no `from_float` entry (it is NULL in
`type_traits`), so any `set_rows` into an F32 destination calls a NULL function pointer
and segfaults. Other destination types work because they all have a real `from_float`.

Repro (CPU backend, standalone ggml graph):

    dst = new_tensor_2d(F32, 8, 6)   // F32 destination
    src = new_tensor_2d(F32, 8, 4)
    idx = new_tensor_1d(I64, 4)      // {0,2,4,5}
    out = ggml_set_rows(dst, src, idx)
    // ggml_backend_graph_compute(cpu, ...) -> SIGSEGV on current main

When the destination is F32, copy the row with `memcpy` instead of going through
`from_float`. The I32 and I64 index branches both get the same treatment. An assert
guards the remaining case, non-F32 dst with a NULL `from_float`, so a future
unsupported type fails loudly instead of crashing.

I ran a normal model after this and it still decodes fine (DeepSeek-V2-Lite-Q4_K_M,
CPU, coherent output), and the non-F32 path is untouched. On the F32 path you pay one
`memcpy` per row in place of the indirect call.
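
A minimal sketch of the row-scatter logic described above, assuming a simplified converter signature. `from_float_fn` and `set_rows_sketch` are hypothetical names and do not reproduce the real ggml kernel or its threading.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical converter signature, modelled on the from_float idea in the text.
using from_float_fn = void (*)(const float * src, void * dst, int64_t n);

// Scatter the rows of src into dst at the row indices given by idx.
// dst_is_f32 selects the plain-copy path; otherwise from_float must be non-NULL.
void set_rows_sketch(void * dst, std::size_t dst_row_bytes, bool dst_is_f32,
                     from_float_fn from_float,
                     const float * src, int64_t ne0,
                     const std::vector<int64_t> & idx) {
    for (std::size_t r = 0; r < idx.size(); ++r) {
        char * dst_row = static_cast<char *>(dst) + static_cast<std::size_t>(idx[r]) * dst_row_bytes;
        const float * src_row = src + r * static_cast<std::size_t>(ne0);
        if (dst_is_f32) {
            // F32 destination: no conversion needed, copy the row as-is.
            std::memcpy(dst_row, src_row, static_cast<std::size_t>(ne0) * sizeof(float));
        } else {
            // Any other type must supply a converter; fail loudly if it does not.
            assert(from_float != nullptr);
            from_float(src_row, dst_row, ne0);
        }
    }
}
```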

Co-authored-by: local-llm <local-llm@local-llm-R740.cruvis.org>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-26 08:46:26 +02:00
Kawrakow
b84902d2ad
Split mode graph for dense Qwen35 MTP (#2027)
* WIP: Split mode graph for Gemma4 assistant

Something is not right - acceptance drops to nearly zero.

* Per model CUDA contexts

Still not working!?

* This works

The issue was that I was not correctly calculating the number
of KV heads for the split KV cache.

* Compiler warnings

* It is better to use llama_context pointers as keys

* Split mode graph for dense Qwen35 MTP
2026-06-25 11:12:22 +02:00
Farmadupe
d3e86a5431
Free raw multimedia data from server_tokens after encoding, as it will never be read again (#2029)
The data in server_tokens.map_idx_to_media.tokens_image.batch_f32 is read exactly once,
by mtmd_encode, but it was retained for as long as the input image was present
in the sequence. Add a manual free function to clear this data after encoding.

Solves:
* Memory wasted in struct server_tokens
* The same wasted memory in the ram cache 
* Long copy durations cloning this data to/from ram cache
* Accounting failures in ram cache (`batch_f32` can be larger than a sequence's entire KV)
* The above accounting failures leading to terminal memory leaks in pathological cases
* Remove JSON serialization for `batch_f32`, which was unused and had no foreseeable use case
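
The idea of dropping a read-once buffer can be sketched like this. `ImageChunkSketch` and `free_raw_after_encode` are hypothetical names used only for illustration; the real structures in the server are more involved.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for an image chunk that holds raw float input data.
struct ImageChunkSketch {
    std::vector<float> batch_f32;  // raw input, read once by the encoder
    bool encoded = false;
};

// Release the raw buffer once encoding is done.
void free_raw_after_encode(ImageChunkSketch & chunk) {
    assert(chunk.encoded);  // freeing before the encode would lose the input
    // Swapping with an empty temporary releases the allocation, not just the size.
    std::vector<float>().swap(chunk.batch_f32);
}
```

Swapping with a temporary is used instead of `clear()` because `clear()` alone keeps the allocated capacity.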
2026-06-25 10:18:32 +02:00
Joel Farthing
bdf5c081dc
DFlash: enable sliding-window attention for draft models (#2021)
* DFlash: bound intra-block draft tokens to the SWA window

The SWA mask builder applied the sliding-window distance check only to the
cross-context section; the intra-block draft-token loop masked causal-only,
so a draft token could attend to earlier block tokens beyond n_swa. Apply
the same window bound ((j - block_k) < swa_window) in both the F16 and F32
paths so it matches the cross-context section.

Behavior-neutral for dense models: the SWA mask tensor is only allocated
when the model has SWA layers (build_dflash.cpp needs_swa_mask gate), so
for dense targets the changed block is unreachable.

* DFlash: enable sliding-window attention for draft models

DFlash drafts can be trained with sliding-window attention for long context,
but the runtime ignored it: the draft loader never read the window keys and the
converter never emitted them, so SWA-trained drafts always ran full-attention.
Enable it end to end and fix the dormant SWA graph path it exposes:

- convert_hf_to_gguf.py (DFlashDraftModel): emit attention.sliding_window + an
  all-layers sliding_window_pattern when the source config sets use_sliding_window.
- llama-hparams.cpp (LLM_ARCH_DFLASH_DRAFT): read sliding_window + pattern into
  n_swa / swa_layers.
- build_dflash.cpp + llama-dflash.cpp: the SWA mask path had never run; an all-SWA
  draft turned the full kq_mask into a dead graph node the scheduler never backs
  with a buffer, then the input-set wrote it unconditionally (GGML_ASSERT buf!=NULL).
  Create + set each mask only when a layer uses it; derive mask dims from whichever
  mask is live. Dense/mixed drafts are byte-identical.

Validated on gemma-4-26B-A4B at long context (cross_ctx 8176 > window 2048): no
crash, no short-context regression, SWA-on recovers long-context draft acceptance.

* DFlash: derive draft SWA pattern from layer_types

The converter emitted an all-layers SWA pattern ([True]*n_layers). The z-lab
DFlash drafts are sliding-window on every layer except a final full-attention
(global) layer, so this ran that global layer as sliding-window and clipped its
long-context view. Read layer_types and emit the matching per-layer pattern
(sliding_attention -> True), falling back to all-SWA only when layer_types is
absent.
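
The mapping from layer_types to a per-layer pattern can be sketched as below. The converter itself is Python; this C++ version only illustrates the rule, and `swa_pattern` is a hypothetical name. The `"full_attention"` label in the usage is an assumption; any value other than `"sliding_attention"` maps to false.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Returns one flag per layer: true means the layer uses sliding-window attention.
std::vector<bool> swa_pattern(const std::vector<std::string> & layer_types, std::size_t n_layers) {
    if (layer_types.empty()) {
        // layer_types absent: fall back to sliding-window on every layer.
        return std::vector<bool>(n_layers, true);
    }
    assert(layer_types.size() == n_layers);
    std::vector<bool> pattern;
    pattern.reserve(n_layers);
    for (const std::string & type : layer_types) {
        pattern.push_back(type == "sliding_attention");
    }
    return pattern;
}
```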

---------

Co-authored-by: Joel Farthing <262452229+joelfarthing@users.noreply.github.com>
2026-06-25 09:06:54 +02:00
mb8565
4553cd0059
cuda : fix MLA flash-attn vec decode for asymmetric K/V head sizes (#2031)
The flash-attn vec kernels walk the KV cache in blocks of Dk rows for the
score loop but accumulate V in blocks of Dv. With Dk == Dv that is the same
thing, so normal attention shapes are fine. For absorbed MLA shapes where the
K and V head sizes differ (Dk=576/Dv=512 and Dk=192/Dv=128) the two loops step
a different number of KV rows, so K and V drift out of sync after the first
block and the V pointer reads the wrong cache rows.

This only shows up at decode (batch=1) on cards that fall back to the vec
kernel for MLA, which on NVIDIA is pre-Volta. There deepseek2/GLM MLA models
with -mla 1 -fa 1 or -mla 3 -fa 1 decode coherently for short prompts but
collapse into garbage once n_kv passes the first KV block (Dk=576). Prefill/PPL
is unaffected because prefill takes the tile kernel, not the vec kernel.

Fix: the score loop already covers Dk KV rows, so the V loop and the V pointer
step Dk rows too. For asymmetric Dk>Dv the V row is only Dv wide, so threads
with tid >= Dv have no V element (their VKQ lane is discarded at the output
store anyway) and read 0 instead of stepping past the row.

The change keys off the compile-time Dk != Dv, so every symmetric instantiation
compiles to byte-identical code and modern GPUs (which never take this vec path
for MLA) are unaffected.

Validated on a Tesla P100 (sm_60) with DeepSeek-V2-Lite Q4_K_M: decode coherence
restored for -mla 1/3 -fa 1, KLD vs the -fa 0 soft_max path drops from 4.79 to
1.4e-4 (same top token 27% -> 100%) at c1024, and TG is unchanged (82.8 t/s).
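
The drift between the two loops can be illustrated with plain arithmetic. This is a simplified model of the stepping described above, not the kernel code; `block_start_row` and `kv_row_drift` are hypothetical helpers.

```cpp
#include <cassert>

// KV row at which a loop starts block number `block` when it advances
// rows_per_step rows per block.
int block_start_row(int block, int rows_per_step) {
    assert(block >= 0 && rows_per_step > 0);
    return block * rows_per_step;
}

// Difference between where the score loop (stepping k_step rows) and the
// V loop (stepping v_step rows) believe block number `block` begins.
int kv_row_drift(int block, int k_step, int v_step) {
    return block_start_row(block, k_step) - block_start_row(block, v_step);
}
```

For Dk=576 and Dv=512 the drift is zero for the first block and 64 rows for the second, which matches the observation that output degrades once n_kv passes the first block. Stepping both loops by Dk keeps the drift at zero.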

Co-authored-by: mb8565 <244351746+mb8565@users.noreply.github.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-25 08:56:17 +02:00
Kawrakow
d5507e33ae
Split mode graph for dense Gemma4 assistant (#2022)
* WIP: Split mode graph for Gemma4 assistant

Something is not right - acceptance drops to nearly zero.

* Per model CUDA contexts

Still not working!?

* This works

The issue was that I was not correctly calculating the number
of KV heads for the split KV cache.

* Compiler warnings

* It is better to use llama_context pointers as keys
2026-06-24 18:29:32 +02:00
Kawrakow
bf23a7599c
Avoid Gemma4 assistant strange tensor name warnings (#2023) 2026-06-24 11:23:22 +02:00
Nexes the Elder
7cacf28eec
Fix minor GGML discrepancies (#2016)
* fix: wrong stride in batched quantized add1 (nb0 -> nb3)

ggml_compute_forward_add1_q_f32 used i3*nb0 (element stride) instead of
i3*nb3 (batch stride) for the destination row pointer. This causes all
add1 operations with quantized types and batch > 1 to write to wrong
memory locations. The src0 pointer on the line above correctly uses nb03.

* fix: wrong dimension limits in dup_f16 non-contiguous path

The destination index wrapping in ggml_compute_forward_dup_f16 used
source dimensions (ne00/ne01/ne02/ne03) instead of destination dimensions
(ne0/ne1/ne2/ne3). While source and destination shapes are currently
identical for dup, using the wrong variables is incorrect by design.

* fix: wrong dimension limits in dup_bf16 non-contiguous path

Same fix as the dup_f16 path: destination index wrapping used source
dimensions (ne00/ne01/ne02/ne03) instead of destination dimensions
(ne0/ne1/ne2/ne3). Copy-paste error from the contiguous path.

* fix: ACC work size uses src[1] instead of src[0]

The dequantization work buffer for quantized ACC was sized using
src[1]->ne[0] instead of src[0]->ne[0]. Since src[0] is the tensor
being dequantized, its dimensions should determine the buffer size.

* fix: missing work size for SOFT_CAP_MAX and ROPE_BACK

Both ops dereference params->wdata in their forward functions but had
no work size allocation (cur = 0), causing a NULL pointer dereference
when any thread attempted to use wdata.

* fix: wrong dim in sum_rows_f32 dimension decomposition

Line 14404 used ne01*ne0 (= ne01*1) instead of ne01*ne02 for the
i3 term in the flat row index formula. When ne02 > 1 (batched 2D
inputs), this causes wrong memory access and corrupted results.
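
A minimal sketch of the flat row index decomposition that the last item refers to, assuming the usual ggml convention that `ne01`, `ne02` and `ne03` are the sizes of dimensions 1 to 3. `RowIndex` and `decompose_row` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

struct RowIndex {
    int64_t i1, i2, i3;
};

// Split a flat row index ir in [0, ne01*ne02*ne03) into (i1, i2, i3).
RowIndex decompose_row(int64_t ir, int64_t ne01, int64_t ne02, int64_t ne03) {
    assert(ir >= 0 && ir < ne01 * ne02 * ne03);
    const int64_t i3 = ir / (ne02 * ne01);
    const int64_t i2 = (ir - i3 * ne02 * ne01) / ne01;
    // The i3 term must use ne01*ne02 here as well; any other product
    // yields a wrong i1 whenever i3 > 0 and ne02 > 1.
    const int64_t i1 = ir - i3 * ne02 * ne01 - i2 * ne01;
    return {i1, i2, i3};
}
```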
2026-06-24 09:09:33 +02:00
Joel Farthing
8686ea708b
chat: Cohere2MoE/North Code: parse unopened thinking under --reasoning off (follow-up to #1968) (#2012)
* Handle Cohere2MoE unopened thinking before tools

* Cohere2MoE: route unopened thinking to reasoning_content; test in active target

Follow-up to #1968. Gate extract_reasoning on reasoning_format only (drop the
"&& enable_thinking" addition) so the unopened-thinking handling does not also
change where an opened thinking block is routed. Under --reasoning off
(enable_thinking=false, reasoning_format defaults to DEEPSEEK) an orphaned
thinking block is now quarantined in reasoning_content with clean content and a
native tool call, instead of leaking the thinking prose into the user-facing
answer.

Move the Cohere2MoE end-to-end parser cases into tests/test-chat-auto-parser.cpp,
which CMake actually builds. tests/test-chat.cpp has been disabled in
tests/CMakeLists.txt since #723, so cohere coverage added there never ran in CI;
revert the local band-aids to that file.

* Cohere2MoE: harden parser from NMC eval findings

---------

Co-authored-by: Joel Farthing <262452229+joelfarthing@users.noreply.github.com>
2026-06-24 09:04:41 +02:00
Yap Sok Ann
5a4fa17947
Load glm-dsa indexer tensors as optional (ggml-org/llama.cpp#24770) (#2017) 2026-06-24 09:04:09 +02:00
Yap Sok Ann
997b289d93
jinja: give each for-loop iteration a fresh scope (#2018)
`{% set %}` of a non-loop variable inside a `{% for %}` body leaks
across iterations when the assignment is conditionally skipped. Each
iteration should start with a clean scope, matching standard Jinja2
semantics.

This fixes the issue with GLM-5.2 chat template when:
* turn 1 is a tool call with reasoning
* turn 2 is a tool call without reasoning

In this case, the reasoning content for turn 1 would be wrongly
duplicated to turn 2, resulting in degraded model performance.
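
The leak can be reproduced with a toy model of loop scoping. This is only a sketch of the semantics, not the template engine's code: `TurnSketch`, `render_reasoning` and the `fresh_scope` switch are hypothetical.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

struct TurnSketch {
    bool        has_reasoning;
    std::string reasoning;
};

// Renders the value of a loop-local variable for each turn.
// fresh_scope = false models the old behaviour (one scope shared by all iterations),
// fresh_scope = true models a clean scope per iteration.
std::vector<std::string> render_reasoning(const std::vector<TurnSketch> & turns, bool fresh_scope) {
    std::vector<std::string> rendered;
    std::map<std::string, std::string> scope;
    for (const TurnSketch & turn : turns) {
        if (fresh_scope) {
            scope.clear();  // each iteration starts without leftovers
        }
        if (turn.has_reasoning) {
            scope["reasoning_content"] = turn.reasoning;  // conditional {% set %}
        }
        const auto it = scope.find("reasoning_content");
        rendered.push_back(it == scope.end() ? std::string() : it->second);
    }
    assert(rendered.size() == turns.size());
    return rendered;
}
```

With a shared scope, a second turn without reasoning still sees the first turn's value; with a fresh scope it renders empty.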
2026-06-24 08:58:36 +02:00
mb8565
a7d35d51dc
eval-callback : sum over the full tensor, not just the printed slice (#2019)
ggml_print_tensor() accumulated `sum` inside the truncated print loop, which
skips the middle of each row when ne[0] > 2*n (n = 3). The printed `sum =`
therefore only covered the first n and last n elements per row, not the whole
tensor. For example a {2688, 5} tensor reported the sum of 30 of its 13440
values. That makes the value useless for numerically comparing two runs, and it
disagrees with mainline llama.cpp, whose eval-callback sums every element in a
separate pass.

This factors the per-element read into a small helper and computes the sum in
its own loop over all elements (double accumulator). The truncated print is a
separate, unchanged pass, so only the printed `sum =` value changes. The change
is confined to examples/eval-callback.
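
The difference between the two passes can be sketched for the 2D case. `printed_elements` and `sum_all` are hypothetical helpers, not the functions in examples/eval-callback, and the real code handles up to four dimensions and several element types.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Number of elements a truncated print visits: the first n and last n of each
// row when the row is longer than 2*n, otherwise the whole row.
int64_t printed_elements(int64_t ne0, int64_t ne1, int64_t n) {
    assert(ne0 >= 0 && ne1 >= 0 && n >= 0);
    const int64_t per_row = ne0 > 2 * n ? 2 * n : ne0;
    return per_row * ne1;
}

// Sum over every element in a separate pass, using a double accumulator.
double sum_all(const std::vector<float> & data) {
    double sum = 0.0;
    for (float v : data) {
        sum += v;
    }
    return sum;
}
```

For the {2688, 5} tensor with n = 3, the truncated print touches 30 elements while the full pass covers all 13440.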

Co-authored-by: mb8565 <244351746+mb8565@users.noreply.github.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-24 08:57:19 +02:00
firecoperana
befbc0945b
server: variance based checkpoint eviction (#2020)
Co-authored-by: firecoperana <firecoperana>
2026-06-24 08:54:07 +02:00
Farmadupe
7ccf1d2095
allow user to use THP for host allocations with GGML_CUDA_HOST_MALLOC_THP (#2010)
* allow user to use THP for host allocations with GGML_CUDA_HOST_MALLOC_THP

* Remove useless symbol check
2026-06-23 15:13:41 +02:00
Nexes the Elder
2d3ecd5e19
Fix minor CUDA discrepancies (part 2) (#2015)
* fix: wrong tensor index in BF16 fused RMS norm add path (norm.cu:1039)

The BF16 branch of ggml_cuda_op_fused_rms_rms_add used dst->src[2]->data
for the second weight pointer, but should have used dst->src[3]->data.
This caused reading float weights from the wrong bf16 input tensor.

The F32 and F16 branches both correctly reference src[3], and the
assertions at lines 1013-1015 confirm src[3] is the F32 weight tensor.

* fix: off-by-one bounds check in 7 dmmv kernels (row > nrows -> row >= nrows)

Seven K-quant dequantize_mul_mat_vec kernels used row > nrows for bounds
checking instead of row >= nrows. Since rows are 0-indexed (0..nrows-1),
the check missed the row == nrows case, allowing a potential out-of-bounds
memory write when grid dimensions produce exactly nrows.

The templated dequantize_mul_mat_vec<type> kernel at line 667 already used
the correct row >= nrows pattern.

* fix: typo in function name iqk_mul_mat_vec_q_kerne -> iqk_mul_mat_vec_q_kernel

Truncated function name in iqk_mmvq_templates.cuh was missing trailing 'l'.

* fix: print actual split_dim value in set_tensor error message (ggml-cuda.cu)

fprintf used extra->split_dim == 0 which evaluates to boolean 0 or 1
instead of the actual split dimension value. When this fatal error is
hit for an unsupported split_dim, the user could not diagnose which
value caused the problem.

* fix: wrong src index in gate bias stride for fused up-gate MoE path

ggml_cuda_add_id for the gate bias used dst->src[4]->nb[1] as the stride
argument instead of dst->src[5]->nb[1]. This was a copy-paste error from
the up-bias code (lines 3220-3224) where src[4] is correct. If src[4]
and src[5] have different strides, the bias addition produces incorrect
results.

* fix: wrong row count for gate projection MMQ in fused up-gate MoE path

ggml_cuda_op_mul_mat_q for the gate projection (src0_2) used
src0_1->ne[1] as row_high instead of src0_2->ne[1]. This copy-paste
error causes processing the wrong number of rows if the up and gate
projections have different row counts. The gemv path (line ~3563)
correctly used src0_2->ne[1].
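
The off-by-one bounds check from the dmmv item in this commit can be illustrated on the CPU. This is a sketch under the assumption that more rows are launched than exist; `write_rows` is a hypothetical stand-in for a kernel, with the loop variable playing the role of the per-thread row index.

```cpp
#include <cassert>
#include <vector>

// CPU stand-in for a kernel launched over launched_rows rows, of which only
// nrows are valid. Returns how many rows were actually written.
int write_rows(std::vector<float> & out, int nrows, int launched_rows) {
    assert(static_cast<int>(out.size()) >= nrows);
    int written = 0;
    for (int row = 0; row < launched_rows; ++row) {
        // Correct guard. With `row > nrows`, row == nrows would pass the check
        // and write one element past the end of a buffer holding nrows rows.
        if (row >= nrows) {
            continue;
        }
        out[row] = 1.0f;
        ++written;
    }
    return written;
}
```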
2026-06-23 14:03:22 +02:00
Nexes the Elder
9eaf86a7c7
Fix minor CUDA discrepancies (#2005)
* CUDA : typo

* CUDA: Add missing GGML_CALL to function definition

* CUDA: only log GGML_CUDA_FORCE_MMQ/CUBLAS when enabled

* CUDA: Fix softcap bug in flash_attn_tile_ext_f16

The else branch (softcap != 0) incorrectly called launch_fattn_tile_f16_64_128
with use_softcap=false instead of true, causing logit softcap to be silently
ignored for the col_per_block=32, parallel_blocks=1 path.
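
To show what is lost when the flag is passed as false, here is a sketch of logit softcapping in its common tanh form. That the kernel uses exactly this form is an assumption; `score_with_softcap` is a hypothetical scalar helper, not the CUDA code.

```cpp
#include <cassert>
#include <cmath>

// Applies softcap * tanh(score / softcap) when use_softcap is set,
// otherwise returns the score unchanged.
float score_with_softcap(float score, float softcap, bool use_softcap) {
    assert(!use_softcap || softcap != 0.0f);
    return use_softcap ? softcap * std::tanh(score / softcap) : score;
}
```

With the flag off, a large score passes through unbounded, which is the silent behaviour the fix removes for that path.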
2026-06-23 09:37:48 +02:00
Jun Yamog
69a8336d08
Add native MiniMax-M3 tool call parser (#2008) 2026-06-23 09:36:02 +02:00
Joel Farthing
b2b4f66fa0
tests: add Seed-OSS chat template fixture (#2014)
Co-authored-by: Joel Farthing <262452229+joelfarthing@users.noreply.github.com>
2026-06-23 09:35:28 +02:00
empty-quiver
b47b90d0be
Add Laguna M.1 GGUF support (#2003) 2026-06-22 16:53:10 +02:00
Joel Farthing
64fceb70bc
DFlash: use persistent FA-ready K/V cache (#1997)
* Prototype physical-order DFlash KV cache

(cherry picked from commit f9093d9ee57cf66f6ce44c42524158bb1449d1c9)

* Use persistent FA-ready DFlash KV cache

(cherry picked from commit cfed6ae456b5448ac0053fbd5994037af845a69a)

* Address DFlash review cleanup

---------

Co-authored-by: Joel Farthing <262452229+joelfarthing@users.noreply.github.com>
2026-06-22 16:49:35 +02:00
magikRUKKOLA
72440a19fc
on-demand tensor reload (#1989)
* hot-swap tensor loop

the hot-swap functionality is only triggered when certain environment variables are set

* target_include_directories tweak

* hot-swap tensor support

two intrusions:
1.) at model loading, to collect the snapshot
2.) a modification of the `/health` HTTP endpoint so that the hot-swap can be triggered by sending `llama-server` an HTTP request.
Both are gated by specific environment variables.

* hot-swap tensor support; graph invalidation

ggml_backend_cuda_invalidate_graphs export

* hot-swap tensor support

graph invalidation implementation;  extended debug output (commented out)

* llama_reload_changed_tensors export

* tensor hot-swap on-demand reload

cpu-only/hybrid/gpu-only with split mode layer/graph full support implementation

* docs

* reuse the gguf parsing from llama.cpp

gguf_init_from_file, gguf_find_tensor, ggml_get_tensor

* remove the manual scheduling for hybrid inference

* update docs

* tensor shape validation

* update docs

* update docs

accidentally wiped the previous changes;  so recovered them

* revert the GGML_CUDA_MAX_DEVICES to 16

* update llama_reload_changed_tensor

update llama_reload_changed_tensor, revert CMakeLists.txt

* update llama_reload_changed_tensor

* GGML_MAX_SRC

GGML_MAX_SRC compile-time definition support

* GGML_MAX_SRC

GGML_MAX_SRC compile-time definition support

* GGML_MAX_SRC

GGML_MAX_SRC compile-time definition support

* llama_reload_changed_tensor

update llama_reload_changed_tensor definition

* refactor

move the tensor-reloading implementation to llama-reload.cpp, llama-reload-info.h;  some bugfixes and code reduction

* revert

added back the missing newline

* update docs

* reload_info constructor

* bugfix: cpu-only

TODO: improve the working environment by compiling for multiple hardware configurations;  possibly make a test pipeline

* cpu-only bugfix

set the fix again after unsuccessful sync with main

* windows os compilation fix

#include <string>

* fix windows os build

error C2039: 'string': is not a member of 'std'

* remove dead file

* implement perplexity in server

* Revert "implement perplexity in server"
2026-06-22 16:36:34 +02:00
a1batross
6c00e87ac8
cmake: drop ggml-blas.h from GGML_PUBLIC_HEADERS (#2007) 2026-06-21 07:49:09 +02:00
Kawrakow
d47f484d29
Force Gemma4 assistant to be loaded on last GPU (#1999)
* Allow graph reuse for Gemma4 MTP

* Force Gemma4 assistant to be loaded on last GPU
2026-06-19 18:17:13 +02:00
Kawrakow
8369cf7412
Allow graph reuse for Gemma4 MTP (#1996) 2026-06-19 18:16:53 +02:00
Kawrakow
b21653a56f
Fully remove any BLAS remnants (#2001)
* Fully remove any BLAS remnants

* Also these
2026-06-19 17:26:09 +02:00
Kawrakow
4bcfe5b872
Add compatibility for llama.cpp Gemma4 assistant GGUFs (#1995) 2026-06-19 11:24:54 +02:00
Samuel Oliveira Alves
d5c04c15fd
Clean up redundancy in DFlash graph and small logic fixes (#1994) 2026-06-19 09:04:54 +02:00
Kawrakow
7321648844
Fix Gemma4 MTP compute graph (#1993)
* Fix MTP warmup for GLM models

* Fix Gemma4 MTP compute graph
2026-06-19 09:00:44 +02:00
Kawrakow
0d59973e4a
Fix MTP warmup for GLM models (#1992) 2026-06-19 08:59:55 +02:00
Nexes the Elder
b3dfb7858c
AVX VNNI auto-activation for MSVC ; HAVE_VNNI256 path for IQ4_XS_R8 and Qx_0 R4 quants. (#1991)
* AVX VNNI auto-activation

Enables auto-detection of AVX-VNNI in the CMakeLists and adds the corresponding
definition, which ik_llama.cpp then detects.

* IQ4_XS R8: Enable AVX-VNNI 256-bit path with MSVC compatibility

Migrate mul_mat_iq4_xs_r8_q8_k_avx2() from HAVE_FANCY_SIMD to HAVE_VNNI256.

Changes (6 guard sites + 8 intrinsic calls in iqk_gemm_kquants.cpp):
- Replaced 3x #ifdef HAVE_FANCY_SIMD with #ifdef HAVE_VNNI256
- Replaced 3x #ifndef HAVE_FANCY_SIMD with #ifndef HAVE_VNNI256
- Replaced 8x raw _mm256_dpbusd_epi32 with ggml_mm256_dpbusd_epi32
  (the ggml wrapper resolves to _mm256_dpbusd_avx_epi32 on MSVC via
  the iqk_config.h macro, which is the correct MSVC AVX-VNNI intrinsic
  available under /arch:AVX2; raw _mm256_dpbusd_epi32 does not exist
  in MSVC headers without AVX-512)

Impact:
- IQ4_XS_R8 matmul now uses VNNI256 on CPUs with AVX-VNNI but no
  AVX-512 (e.g. Intel Arrow Lake / Core Ultra 265K)
- Previously limited to HAVE_FANCY_SIMD (full AVX-512) exclusively
- This path is exercised when models are loaded with -rtr / --run-time-repack
  (in-memory repack) or when using --repack to create a permanent IQ4_XS_R8 file.
  Standard IQ4_XS does not auto-convert to IQ4_XS_R8 at load time.

* Qx_0 R4 legacy quants: Enable VNNI256 path for AVX-VNNI CPUs with MSVC compatibility

Three changes in iqk_gemm_legacy_quants.cpp:

1. DotHelper (line 23): Extend VNNI condition to include HAVE_VNNI256
   (not just __AVX512VNNI__+VL) and use ggml_mm256_dpbusd_epi32
   wrapper for MSVC compatibility. This fixes Q6_0 non-R4 path
   and all other quant types routed through UnsignedDot/SignedDot.

2. accum_q4_0_quants (line 994), mul_mat_q5_0_r4_q8_2_avx2
   (lines 1202, 1223), mul_mat_q6_0_r4_q8_2_avx2 (lines 1375, 1394):
   Replace #ifdef HAVE_FANCY_SIMD / #ifndef HAVE_FANCY_SIMD with
   HAVE_VNNI256 (which correctly detects AVX-VNNI without requiring
   full AVX-512). Also replace raw _mm256_dpbusd_epi32 with
   ggml_mm256_dpbusd_epi32 wrapper.

These paths were dead code on Arrow Lake (HAVE_FANCY_SIMD requires
full AVX-512 which Arrow Lake lacks). Now they compile and use
the hardware VNNI instruction (vpdpbusd) via __AVXVNNI__.

Note: remaining HAVE_FANCY_SIMD guards in this file guard true
AVX-512 paths (_mm512_* intrinsics) and are left unchanged.

* Simplify def
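
For readers unfamiliar with the instruction behind the VNNI paths in this commit, the following is a scalar reference for one 32-bit lane of a dpbusd-style operation. It is a sketch of the arithmetic only, not of the intrinsic wrapper; `dpbusd_lane` is a hypothetical name.

```cpp
#include <cassert>
#include <cstdint>

// Scalar reference for one 32-bit lane: four unsigned bytes from a are
// multiplied with four signed bytes from b, and the products are added to acc
// (no saturation).
int32_t dpbusd_lane(int32_t acc, const uint8_t a[4], const int8_t b[4]) {
    assert(a != nullptr && b != nullptr);
    for (int i = 0; i < 4; ++i) {
        acc += static_cast<int32_t>(a[i]) * static_cast<int32_t>(b[i]);
    }
    return acc;
}
```

A 256-bit vector holds eight such lanes, so one instruction performs 32 byte multiplications.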
2026-06-18 18:05:19 +02:00
Kawrakow
3b81f63acd Update AUTHORS 2026-06-18 08:11:41 +00:00
Farmadupe
21f918c185
faster ggml_cuda_host_malloc (#1988) 2026-06-18 10:01:34 +02:00
Kawrakow
f5e5753c32
Fix Qwen35 mtp warmup (#1987)
* Use hidden state from prev token from qwen mtp

* Fix Qwen35 MTP warmup

* Cleanup + stop crippling performance unnecessarily: do not use accept to sample the draft token

* Provide API to get the model arch string

---------

Co-authored-by: SamuelOliveirads <samueloliveira32df@gmail.com>
2026-06-18 09:03:40 +02:00
Kawrakow
71af16a6b7
Fix DFlash performance with split mode graph (#1980) 2026-06-17 18:40:02 +02:00
gapeleon
4f220159b8
Fix (Gemma-4 Vision): Correct KQ mask fill for causal models in non-causal flash-attn mode (#1985)
When llama_set_causal_attn(false) is called on a causal model (e.g.
Gemma-4 during vision image decode), llama_set_inputs took the non-causal
else-branch (designed for pure embedding models).

That path wrote the F16 mask with stride n_tokens instead of n_kv, and iterated batch
indices rather than KV cache cells.

The result was that every image query row beyond the first was
written at the wrong offset, leaving stale -inf values from
previous decodes visible to the GPU kernel. Any conversation
that had built up prior KV mask data would produce all-inf attention scores
for most image tokens, collapsing softmax to NaN and aborting at sampling.
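
The effect of the wrong stride can be shown with a simplified model in which every cell is visible and the buffer starts out full of stale -inf values. `fill_visible` and `count_stale` are hypothetical helpers; the real mask logic in `llama_set_inputs` also handles sequence ids and positions.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Mark n_cols cells as visible (0.0f) in each of n_rows query rows of a
// row-major mask whose consecutive rows are row_stride elements apart.
void fill_visible(std::vector<float> & mask, std::size_t n_rows, std::size_t n_cols, std::size_t row_stride) {
    assert(n_rows == 0 || (n_rows - 1) * row_stride + n_cols <= mask.size());
    for (std::size_t i = 0; i < n_rows; ++i) {
        for (std::size_t j = 0; j < n_cols; ++j) {
            mask[i * row_stride + j] = 0.0f;
        }
    }
}

// Count cells that still hold an infinite value from an earlier decode.
std::size_t count_stale(const std::vector<float> & mask) {
    std::size_t n = 0;
    for (float v : mask) {
        if (std::isinf(v)) {
            ++n;
        }
    }
    return n;
}
```

Filling with stride and column count n_kv clears every cell. Filling with n_tokens for both leaves the second query row, which starts at offset n_kv, untouched.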

Resolves #1984
2026-06-17 16:52:45 +02:00
Jun Yamog
064d23a6f8
Codex CLI Responses Compatibility (#1964)
* responses: skip known unsupported Responses tool types from Codex CLI

- Skip namespace, web_search, image_generation tools instead of HTTP 500
- Reject unknown non-function tool types with controlled error
- Preserve function tool conversion logic unchanged

Fixes Codex CLI 0.133.0 compatibility where it sends mixed tool types.

* responses: harden codex compatibility coverage

* responses: expose Codex model catalog metadata
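
The tool-type handling from the first item of this commit can be sketched as a three-way classification. `ToolAction` and `classify_tool_type` are hypothetical names; only the listed type strings and the skip/reject behaviour come from the commit message.

```cpp
#include <cassert>
#include <string>

enum class ToolAction {
    Convert,  // function tools go through the existing conversion logic
    Skip,     // known unsupported types are dropped silently
    Reject    // anything else produces a controlled error
};

ToolAction classify_tool_type(const std::string & type) {
    if (type == "function") {
        return ToolAction::Convert;
    }
    if (type == "namespace" || type == "web_search" || type == "image_generation") {
        return ToolAction::Skip;
    }
    return ToolAction::Reject;
}
```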
2026-06-16 15:28:16 +02:00
Joel Farthing
d37d92b54c
chat: add Cohere2MoE North Code parser (#1968)
Co-authored-by: Joel Farthing <262452229+joelfarthing@users.noreply.github.com>
2026-06-16 15:27:30 +02:00
Kawrakow
8420f91ae3
Merge pull request #1977 from ikawrakow/ik/dflash_fix_cpu
Fix DFlash on the CPU
2026-06-16 15:26:23 +02:00
Kawrakow
6f45163a95 Fix DFlash on the CPU 2026-06-16 13:22:36 +00:00
Kawrakow
f9078e169b
Merge pull request #1970 from SamuelOliveirads/feat/dflash-implementation
Add DFlash support
2026-06-16 15:07:55 +02:00
Kawrakow
11c9935ce8
Merge pull request #1893 from ikawrakow/ik/gemma4_mtmd_blindness
Fix Gemma4 vision
2026-06-16 07:47:37 +02:00
SamuelOliveirads
ad24046b51 minor refactor in DFlash kv cache graph 2026-06-15 18:22:56 -03:00
Kawrakow
2f524850a1
Merge pull request #1973 from ikawrakow/ik/fattn_mma_gqa_16
CUDA FA: faster TG when GQA is 16 and head size is 128
2026-06-15 15:24:01 +02:00
Kawrakow
37ea89cabf
Merge pull request #1974 from Nexesenex/fix_muge_crash_minimax_m3
Fix Minimax M3 crash when -muge merges up/gate experts
2026-06-15 15:07:49 +02:00
Nexesenex
3c9680fd3c Fix Minimax M3 crash when -muge merges up/gate experts
The graph builder for Minimax M3 (build_minimaxm3.cpp) was not passing
model.layers[il].ffn_up_gate_exps to llm_build_std_moe_ffn, unlike
Minimax M2 and all other MoE model graph builders.

When -muge (merge_up_gate_experts) is enabled, the merge creates a single
ffn_up_gate_exps tensor with ffn_up_exps and ffn_gate_exps as views.
Only the parent merged tensor gets the split 'extra' pointer set.
Without passing it as up_gate_exps parameter, the function sees null
split pointers for up/gate (the views) while split_down_exps is valid,
causing the assertion at llama-build-context.cpp:1453 to fail.
2026-06-15 15:00:32 +02:00
Kawrakow
6be3a488d3 CUDA FA: faster TG when GQA is 16 and head size is 128 2026-06-15 11:46:02 +00:00