4636 Commits

Author SHA1 Message Date
Samuel Oliveira Alves
d5c04c15fd
clean up redundancy in dflash graph and small logic (#1994) 2026-06-19 09:04:54 +02:00
Kawrakow
7321648844
Fix Gemma4 MTP compute graph (#1993)
* Fix MTP warmup for GLM models

* Fix Gemma4 MTP compute graph
2026-06-19 09:00:44 +02:00
Kawrakow
0d59973e4a
Fix MTP warmup for GLM models (#1992) 2026-06-19 08:59:55 +02:00
Nexes the Elder
b3dfb7858c
AVX VNNI auto-activation for MSVC; HAVE_VNNI256 path for IQ4_XS_R8 and Qx_0 R4 quants. (#1991)
* AVX VNNI auto-activation

Enables auto-detection of AVX VNNI and its definition in the CMakeLists,
so that it is detected by ik_llama.cpp.

* IQ4_XS R8: Enable AVX-VNNI 256-bit path with MSVC compatibility

Migrate mul_mat_iq4_xs_r8_q8_k_avx2() from HAVE_FANCY_SIMD to HAVE_VNNI256.

Changes (6 guard sites + 8 intrinsic calls in iqk_gemm_kquants.cpp):
- Replaced 3x #ifdef HAVE_FANCY_SIMD with #ifdef HAVE_VNNI256
- Replaced 3x #ifndef HAVE_FANCY_SIMD with #ifndef HAVE_VNNI256
- Replaced 8x raw _mm256_dpbusd_epi32 with ggml_mm256_dpbusd_epi32
  (the ggml wrapper resolves to _mm256_dpbusd_avx_epi32 on MSVC via
  the iqk_config.h macro, which is the correct MSVC AVX-VNNI intrinsic
  available under /arch:AVX2; raw _mm256_dpbusd_epi32 does not exist
  in MSVC headers without AVX-512)
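
For readers unfamiliar with the instruction behind these intrinsics, the following is a minimal scalar sketch of what a 256-bit VPDPBUSD computes: each of the 8 int32 lanes accumulates four products of an unsigned byte with a signed byte. The names `Lanes32`, `Bytes32U`, `Bytes32S` and `dpbusd_ref` are illustrative assumptions and are not part of ik_llama.cpp; the code is a reference model, not the wrapper itself.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical types for illustration: one 256-bit register viewed as
// 8 x int32 accumulators, 32 x uint8 and 32 x int8 operands.
using Lanes32  = std::array<int32_t, 8>;
using Bytes32U = std::array<uint8_t, 32>;
using Bytes32S = std::array<int8_t, 32>;

// Scalar reference model of a 256-bit VPDPBUSD (non-saturating form).
inline Lanes32 dpbusd_ref(Lanes32 acc, const Bytes32U& a, const Bytes32S& b) {
    for (int lane = 0; lane < 8; ++lane) {
        int32_t sum = 0;  // at most 4 * 255 * 127 in magnitude, cannot overflow
        for (int k = 0; k < 4; ++k) {
            sum += int32_t(a[4 * lane + k]) * int32_t(b[4 * lane + k]);
        }
        // Add through uint32_t so wrap-around is well defined in C++.
        acc[lane] = int32_t(uint32_t(acc[lane]) + uint32_t(sum));
    }
    return acc;
}
```

With every `a` byte set to 2, every `b` byte set to -3 and every accumulator set to 10, each lane ends up at 10 + 4 * (2 * -3) = -14.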

Impact:
- IQ4_XS_R8 matmul now uses VNNI256 on CPUs with AVX-VNNI but no
  AVX-512 (e.g. Intel Arrow Lake / Core Ultra 265K)
- Previously limited to HAVE_FANCY_SIMD (full AVX-512) exclusively
- This path is exercised when models are loaded with -rtr / --run-time-repack
  (in-memory repack) or when using --repack to create a permanent IQ4_XS_R8 file.
  Standard IQ4_XS does not auto-convert to IQ4_XS_R8 at load time.

* Qx_0 R4 legacy quants: Enable VNNI256 path for AVX-VNNI CPUs with MSVC compatibility

Three changes in iqk_gemm_legacy_quants.cpp:

1. DotHelper (line 23): Extend VNNI condition to include HAVE_VNNI256
   (not just __AVX512VNNI__+VL) and use ggml_mm256_dpbusd_epi32
   wrapper for MSVC compatibility. This fixes Q6_0 non-R4 path
   and all other quant types routed through UnsignedDot/SignedDot.

2. accum_q4_0_quants (line 994), mul_mat_q5_0_r4_q8_2_avx2
   (lines 1202, 1223), mul_mat_q6_0_r4_q8_2_avx2 (lines 1375, 1394):
   Replace #ifdef HAVE_FANCY_SIMD / #ifndef HAVE_FANCY_SIMD with
   HAVE_VNNI256 (which correctly detects AVX-VNNI without requiring
   full AVX-512). Also replace raw _mm256_dpbusd_epi32 with
   ggml_mm256_dpbusd_epi32 wrapper.

These paths were dead code on Arrow Lake (HAVE_FANCY_SIMD requires
full AVX-512 which Arrow Lake lacks). Now they compile and use
the hardware VNNI instruction (vpdpbusd) via __AVXVNNI__.

Note: remaining HAVE_FANCY_SIMD guards in this file guard true
AVX-512 paths (_mm512_* intrinsics) and are left unchanged.

* Simplify def
2026-06-18 18:05:19 +02:00
Kawrakow
3b81f63acd Update AUTHORS 2026-06-18 08:11:41 +00:00
Farmadupe
21f918c185
faster ggml_cuda_host_malloc (#1988) 2026-06-18 10:01:34 +02:00
Kawrakow
f5e5753c32
Fix Qwen35 mtp warmup (#1987)
* Use hidden state from prev token from qwen mtp

* Fix Qwen35 MTP warmup

* Cleanup + remove unnecessary performance crippling by not using accept to sample the draft token

* Provide API to get the model arch string

---------

Co-authored-by: SamuelOliveirads <samueloliveira32df@gmail.com>
2026-06-18 09:03:40 +02:00
Kawrakow
71af16a6b7
Fix DFlash performance with split mode graph (#1980) 2026-06-17 18:40:02 +02:00
gapeleon
4f220159b8
Fix (Gemma-4 Vision): Correct KQ mask fill for causal models in non-causal flash-attn mode (#1985)
When llama_set_causal_attn(false) is called on a causal model (e.g.
Gemma-4 during vision image decode), llama_set_inputs took the non-causal
else-branch (designed for pure embedding models).

That path wrote the F16 mask with stride n_tokens instead of n_kv, and iterated batch
indices rather than KV cache cells.

The result was that every image query row beyond the first was
written at the wrong offset, leaving stale -inf values from
previous decodes visible to the GPU kernel. Any conversation
that had built up prior KV mask data would produce all-inf attention scores
for most image tokens, collapsing softmax to NaN and aborting at sampling.
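
The stride problem described above can be illustrated with a simplified model. This is a sketch under stated assumptions, not the code in llama_set_inputs: the mask is a plain row-major float buffer of n_tokens rows by n_kv columns, it starts out filled with -inf to stand in for stale data, and the function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

static const float kNegInf = -std::numeric_limits<float>::infinity();

// Intended pattern in this model: one row of n_kv entries per query token,
// iterating KV cells; the first n_used cells are visible (0), the rest -inf.
std::vector<float> fill_mask_by_kv(int n_tokens, int n_kv, int n_used) {
    assert(n_used <= n_kv);
    std::vector<float> mask(std::size_t(n_tokens) * n_kv, kNegInf);
    for (int i = 0; i < n_tokens; ++i)
        for (int j = 0; j < n_used; ++j)
            mask[std::size_t(i) * n_kv + j] = 0.0f;
    return mask;
}

// Faulty pattern: row stride n_tokens and a loop over batch indices.
// Rows after the first land at the wrong offset and keep stale -inf values.
std::vector<float> fill_mask_by_batch(int n_tokens, int n_kv) {
    assert(n_tokens <= n_kv);
    std::vector<float> mask(std::size_t(n_tokens) * n_kv, kNegInf);
    for (int i = 0; i < n_tokens; ++i)
        for (int j = 0; j < n_tokens; ++j)
            mask[std::size_t(i) * n_tokens + j] = 0.0f;
    return mask;
}

// True if every entry the kernel would read for this row is -inf.
bool row_fully_masked(const std::vector<float>& mask, int row, int n_kv) {
    for (int j = 0; j < n_kv; ++j)
        if (mask[std::size_t(row) * n_kv + j] != kNegInf) return false;
    return true;
}
```

With n_tokens = 2 and n_kv = 4, the faulty fill writes only the first four floats of the buffer, so the second query row is read back as entirely -inf, which is the condition that drives softmax to NaN.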

Resolves #1984
2026-06-17 16:52:45 +02:00
Jun Yamog
064d23a6f8
Codex CLI Responses Compatibility (#1964)
* responses: skip known unsupported Responses tool types from Codex CLI

- Skip namespace, web_search, image_generation tools instead of HTTP 500
- Reject unknown non-function tool types with controlled error
- Preserve function tool conversion logic unchanged

Fixes Codex CLI 0.133.0 compatibility where it sends mixed tool types.
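
The skip/reject policy listed above could be sketched roughly as follows. This is an illustration only: `filter_tool_types` and its signature are hypothetical and do not reflect the actual server code, which works on JSON tool objects rather than bare type strings.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper: keeps "function" tools, silently skips the known
// unsupported types, and reports the first unknown type as an error.
// Returns false (and sets `error`) when an unknown type is found.
bool filter_tool_types(const std::vector<std::string>& types,
                       std::vector<std::string>& kept,
                       std::string& error) {
    kept.clear();
    for (const std::string& type : types) {
        if (type == "function") {
            kept.push_back(type);  // converted as before
        } else if (type == "namespace" || type == "web_search" ||
                   type == "image_generation") {
            continue;              // known but unsupported: skip, no HTTP 500
        } else {
            error = "unsupported tool type: " + type;
            return false;          // controlled error for unknown types
        }
    }
    assert(kept.size() <= types.size());
    return true;
}
```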

* responses: harden codex compatibility coverage

* responses: expose Codex model catalog metadata
2026-06-16 15:28:16 +02:00
Joel Farthing
d37d92b54c
chat: add Cohere2MoE North Code parser (#1968)
Co-authored-by: Joel Farthing <262452229+joelfarthing@users.noreply.github.com>
2026-06-16 15:27:30 +02:00
Kawrakow
8420f91ae3
Merge pull request #1977 from ikawrakow/ik/dflash_fix_cpu
Fix DFlash on the CPU
2026-06-16 15:26:23 +02:00
Kawrakow
6f45163a95 Fix DFlash on the CPU 2026-06-16 13:22:36 +00:00
Kawrakow
f9078e169b
Merge pull request #1970 from SamuelOliveirads/feat/dflash-implementation
Add DFlash support
2026-06-16 15:07:55 +02:00
Kawrakow
11c9935ce8
Merge pull request #1893 from ikawrakow/ik/gemma4_mtmd_blindness
Fix Gemma4 vision
2026-06-16 07:47:37 +02:00
SamuelOliveirads
ad24046b51 minor refactor in DFlash kv cache graph 2026-06-15 18:22:56 -03:00
Kawrakow
2f524850a1
Merge pull request #1973 from ikawrakow/ik/fattn_mma_gqa_16
CUDA FA: faster TG when GQA is 16 and head size is 128
2026-06-15 15:24:01 +02:00
Kawrakow
37ea89cabf
Merge pull request #1974 from Nexesenex/fix_muge_crash_minimax_m3
Fix Minimax M3 crash when -muge merges up/gate experts
2026-06-15 15:07:49 +02:00
Nexesenex
3c9680fd3c Fix Minimax M3 crash when -muge merges up/gate experts
The graph builder for Minimax M3 (build_minimaxm3.cpp) was not passing
model.layers[il].ffn_up_gate_exps to llm_build_std_moe_ffn, unlike
Minimax M2 and all other MoE model graph builders.

When -muge (merge_up_gate_experts) is enabled, the merge creates a single
ffn_up_gate_exps tensor with ffn_up_exps and ffn_gate_exps as views.
Only the parent merged tensor gets the split 'extra' pointer set.
Without passing it as up_gate_exps parameter, the function sees null
split pointers for up/gate (the views) while split_down_exps is valid,
causing the assertion at llama-build-context.cpp:1453 to fail.
2026-06-15 15:00:32 +02:00
Kawrakow
6be3a488d3 CUDA FA: faster TG when GQA is 16 and head size is 128 2026-06-15 11:46:02 +00:00
Kawrakow
f81673c7db
Merge pull request #1972 from ikawrakow/ik/minimaxm3_smgraph
Split mode graph for MiniMax-M3
2026-06-15 13:44:19 +02:00
Kawrakow
e927adc4ad
Merge pull request #1969 from Farmadupe/resize_algo_fix
Correct image resize algorithm for all qwens after qwen2vl and gemma4
2026-06-15 13:39:11 +02:00
Kawrakow
00d96744de
Merge pull request #1967 from Farmadupe/stb_image_resize2
Replace image resizers with avx2/neon simd impls from stb_image_resize2.h
2026-06-15 13:38:31 +02:00
Kawrakow
1dc4ea938a
Merge pull request #1962 from ikawrakow/ik/fix_1961
Fix #1961
2026-06-15 13:00:27 +02:00
Kawrakow
c24d50dd88 Split mode graph for MiniMax-M3 2026-06-15 08:41:34 +00:00
Kawrakow
567854aeab
Merge pull request #1963 from jkyamog/minimax-m3-support
Add preliminary MiniMax-M3 support
2026-06-15 10:16:10 +02:00
Jun Yamog
c08d194edd Use standard graph helpers for MiniMax-M3 2026-06-15 02:00:46 +00:00
Jun Yamog
c538210e6d Add MiniMax-M3 chat template 2026-06-15 01:29:13 +00:00
SamuelOliveirads
6cae8c7ba2 clean logs 2026-06-14 21:07:57 -03:00
Thomas Green
19f08160ad Correct image resize algorithm for all qwens after qwen2vl and gemma4 2026-06-14 21:57:11 +01:00
Thomas Green
574f22b3c7 Replace image resizers with avx2/neon simd impls from stb_image_resize2.h 2026-06-14 20:28:08 +01:00
SamuelOliveirads
0d75eee35a remove duplicated code and unnecessary refactor 2026-06-14 16:02:02 -03:00
Kawrakow
4f1ec69ae5
Merge pull request #1965 from Nexesenex/fix_q8_0_graph_reduce_type
CUDA: Fix Q8_0 graph reduce type
2026-06-14 16:32:48 +02:00
Nexesenex
0fdac83272 Fix Q8_0 graph reduce type
Analogous to the BF16 fix in eea6a82b25, this adds proper Q8_0
type handling in ggml_cuda_op_add:

- Add k_add_q8_0_f32 kernel: dequantize Q8_0, add F32, store F32
- Add k_add_q8_0_q8_0_f32 kernel: dequantize two Q8_0, add, store F32
- Add Q8_0+Q8_0/Q8_0+F32/F32+Q8_0 branches in the F32 dst (else) block,
  preventing Q8_0 data from falling through to the incorrect half cast
- Expand Q8_0 dst branch to handle F32+Q8_0->Q8_0 (swapped args), not
  just Q8_0+F32->Q8_0
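
As a CPU-side illustration of what a "dequantize Q8_0, add F32, store F32" kernel computes, here is a minimal sketch. It assumes the usual Q8_0 layout of 32 int8 quants sharing one scale per block; the scale is held as a float here for simplicity (ggml stores it as fp16), and `BlockQ8` and `add_q8_0_f32_ref` are hypothetical names, not the CUDA kernels from this commit.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr int kBlockSize = 32;  // quants per Q8_0 block

// Simplified block: float scale instead of the fp16 scale used by ggml.
struct BlockQ8 {
    float  d;               // per-block scale
    int8_t qs[kBlockSize];  // quantized values
};

// Reference: dst[i] = dequant(x)[i] + y[i], all results stored as F32.
void add_q8_0_f32_ref(const std::vector<BlockQ8>& x,
                      const std::vector<float>& y,
                      std::vector<float>& dst) {
    assert(y.size() == x.size() * kBlockSize);
    dst.resize(y.size());
    for (std::size_t ib = 0; ib < x.size(); ++ib) {
        for (int j = 0; j < kBlockSize; ++j) {
            const std::size_t i = ib * kBlockSize + j;
            dst[i] = x[ib].d * float(x[ib].qs[j]) + y[i];
        }
    }
}
```

The key point is that the quantized operand is expanded to F32 before the addition, rather than being reinterpreted as half-precision data.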
2026-06-14 16:13:17 +02:00
Jun Yamog
0df00b3b94 Add preliminary MiniMax-M3 support 2026-06-14 12:23:20 +00:00
Kawrakow
c73bfbe9ce Fix #1961 2026-06-14 07:42:39 +00:00
Kawrakow
670a3f6f5b
Merge pull request #1960 from BeccaLabs/fix/rpc-device-init
fix: initialize rpc_device endpoint and device index before parsing
2026-06-14 08:14:07 +02:00
SamuelOliveirads
3b1a0f88d5 Add logging for DFlash statistics and clean up workspace handling 2026-06-13 20:14:08 -03:00
BECCA-Labs
053202f97a fix: initialize rpc_device endpoint and device index before parsing 2026-06-13 16:13:44 -05:00
SamuelOliveirads
3a1d46c4d1 Merge remote-tracking branch 'origin/main' into feat/dflash-implementation
# Conflicts:
#	common/common.cpp
#	common/speculative.cpp
#	convert_hf_to_gguf.py
#	examples/server/server-context.cpp
#	examples/server/server-context.h
#	src/llama-arch.cpp
#	src/llama-arch.h
#	src/llama-model.cpp
#	src/llama.cpp
2026-06-13 17:27:52 -03:00
Kawrakow
5f917a64b3
Merge pull request #1958 from ikawrakow/ik/handle_think_no_space 2026-06-12 21:27:23 +02:00
Samuel Oliveira Alves
8a38025174
Refactor: Move spec outside server (#1949)
* Refactor speculative decoding: move logic outside of server

* remove duplicated tokens in mtp kv cache

* narrow to only discard draft cells in MTP

* revert mtp_speculative_gen_draft
2026-06-12 18:12:39 +02:00
Farmadupe
d1339249d7
Cleanup: Unify location of m-rope repacking for token and embd (#1924)
* unify location of rope-position-array rewriting prior to ubatching

* Reorder terms.
2026-06-12 08:27:50 +02:00
Simon Lundell
b1eb8bb0a1
server: gate llama_decode_stop() to the active decode (fix queued-cancel cascade) (#1941)
With --parallel 1, a client disconnect/timeout on a *queued* request aborts the
*active* decode of a different client (llama_decode: failed to decode, ret = -3 /
"Decode process is cancelled by user"), releasing the slot with the request
unfinished. To the active client the stream silently stalls and never returns,
while the server reports healthy — easy to misdiagnose as a network/proxy wedge.

Root cause: llama_decode_stop() signals a process-global stop flag that the
active decode loop polls. examples/server/server.cpp calls it *ungated* from the
request reader's connection-closed paths, so any reader closing (including a
queued, not-yet-running task's) trips the global flag against whatever decode is
currently active. Adjacent to #1576/#1673 ("clear sticky stop flag" +
hybrid/recurrent ret=-3), which did not gate these call sites against non-active
readers, so the queued-cancel-kills-active cascade still fires on current main.

Fix (minimal gate): add server_response_reader::any_task_on_slot() and gate the
three llama_decode_stop() sites on it, so the global stop is signalled only when
one of THIS reader's tasks is on a slot (the active decode). A queued task's
disconnect then only drops that queued task. Verified in production under heavy
concurrent, frequently-cancelled load (hundreds of queued-task cancels, zero
active-decode kills). Stdlib-only reproducer in the PR description.

Caveat: any_task_on_slot() reads the slots vector from the reader thread — the
same race class as the existing process-global flag; can be tightened to a
per-context/per-task cancellation if preferred.
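
The gating idea can be sketched in isolation as follows. Only the name any_task_on_slot comes from the description above; `Slot`, `Reader`, `g_stop` and `on_connection_closed` are hypothetical stand-ins, and the real server types and locking are not modelled.

```cpp
#include <atomic>
#include <cassert>
#include <set>
#include <vector>

// Stand-in for the process-global stop flag polled by the decode loop.
std::atomic<bool> g_stop{false};

struct Slot {
    int task_id = -1;  // -1 means the slot is idle
};

struct Reader {
    std::set<int> task_ids;  // tasks owned by this reader

    // True if one of this reader's tasks currently occupies a slot.
    bool any_task_on_slot(const std::vector<Slot>& slots) const {
        for (const Slot& s : slots)
            if (s.task_id >= 0 && task_ids.count(s.task_id) > 0) return true;
        return false;
    }

    // Gated stop: only signal the global flag when this reader owns the
    // active decode; a queued task's disconnect leaves the flag untouched.
    void on_connection_closed(const std::vector<Slot>& slots) {
        if (any_task_on_slot(slots)) g_stop.store(true);
    }
};
```

In this model a reader whose only task is still queued closes its connection without affecting the decode running for another client.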
2026-06-12 08:25:44 +02:00
Marian M.
5fb707d19b
Update docs (#1956)
* Update README.md

Models, MTP, fit

* Update parameters.md

Disclaimer, terms, new flags, graph split list.
2026-06-12 08:24:22 +02:00
Kawrakow
175819b4fb Style 2026-06-12 06:19:06 +00:00
Kawrakow
3dbc3241b9 Handle forced-open reasoning tag without trailing whitespace 2026-06-12 05:43:11 +00:00
Joel Farthing
8d91d3c3d9
common: gate empty-start reasoning extraction (#1955)
Co-authored-by: Joel Farthing <262452229+joelfarthing@users.noreply.github.com>
2026-06-12 07:16:24 +02:00
Kawrakow
022bd00aab
Optimize Cohere2-MoE graph parallel (#1948)
* Optimize Cohere2-MoE graph parallel

* Minor
2026-06-11 07:26:42 +02:00
firecoperana
ca0c1c5f85
fix Qwen3.6 outputting blank <think></think> in response when thinking is off (#1951)
Co-authored-by: firecoperana <firecoperana>
2026-06-11 07:26:07 +02:00