ik_llama.cpp

mirror of https://github.com/ikawrakow/ik_llama.cpp.git synced 2026-06-28 04:30:15 -05:00

Author	SHA1	Message	Date
Kawrakow	bf23a7599c	Avoid Gemma4 assistant strange tensor name warnings (#2023 )	2026-06-24 11:23:22 +02:00
Nexes the Elder	7cacf28eec	Fix minor GGML discrepancies (#2016 ) * fix: wrong stride in batched quantized add1 (nb0 -> nb3) ggml_compute_forward_add1_q_f32 used i3*nb0 (element stride) instead of i3*nb3 (batch stride) for the destination row pointer. This causes all add1 operations with quantized types and batch > 1 to write to wrong memory locations. The src0 pointer on the line above correctly uses nb03. * fix: wrong dimension limits in dup_f16 non-contiguous path The destination index wrapping in ggml_compute_forward_dup_f16 used source dimensions (ne00/ne01/ne02/ne03) instead of destination dimensions (ne0/ne1/ne2/ne3). While source and destination shapes are currently identical for dup, using the wrong variables is incorrect by design. * fix: wrong dimension limits in dup_bf16 non-contiguous path Same fix as the dup_f16 path: destination index wrapping used source dimensions (ne00/ne01/ne02/ne03) instead of destination dimensions (ne0/ne1/ne2/ne3). Copy-paste error from the contiguous path. * fix: ACC work size uses src[1] instead of src[0] The dequantization work buffer for quantized ACC was sized using src[1]->ne[0] instead of src[0]->ne[0]. Since src[0] is the tensor being dequantized, its dimensions should determine the buffer size. * fix: missing work size for SOFT_CAP_MAX and ROPE_BACK Both ops dereference params->wdata in their forward functions but had no work size allocation (cur = 0), causing a NULL pointer dereference when any thread attempted to use wdata. * fix: wrong dim in sum_rows_f32 dimension decomposition Line 14404 used ne01*ne0 (= ne01*1) instead of ne01*ne02 for the i3 term in the flat row index formula. When ne02 > 1 (batched 2D inputs), this causes wrong memory access and corrupted results.	2026-06-24 09:09:33 +02:00
Joel Farthing	8686ea708b	chat: Cohere2MoE/North Code: parse unopened thinking under --reasoning off (follow-up to #1968 ) (#2012 ) * Handle Cohere2MoE unopened thinking before tools * Cohere2MoE: route unopened thinking to reasoning_content; test in active target Follow-up to #1968. Gate extract_reasoning on reasoning_format only (drop the "&& enable_thinking" addition) so the unopened-thinking handling does not also change where an opened thinking block is routed. Under --reasoning off (enable_thinking=false, reasoning_format defaults to DEEPSEEK) an orphaned thinking block is now quarantined in reasoning_content with clean content and a native tool call, instead of leaking the thinking prose into the user-facing answer. Move the Cohere2MoE end-to-end parser cases into tests/test-chat-auto-parser.cpp, which CMake actually builds. tests/test-chat.cpp has been disabled in tests/CMakeLists.txt since #723, so cohere coverage added there never ran in CI; revert the local band-aids to that file. * Cohere2MoE: harden parser from NMC eval findings --------- Co-authored-by: Joel Farthing <262452229+joelfarthing@users.noreply.github.com>	2026-06-24 09:04:41 +02:00
Yap Sok Ann	5a4fa17947	Load glm-dsa indexer tensors as optional (ggml-org/llama.cpp#24770) (#2017 )	2026-06-24 09:04:09 +02:00
Yap Sok Ann	997b289d93	jinja: give each for-loop iteration a fresh scope (#2018 ) `{% set %}` of a non-loop variable inside a `{% for %}` body leaks across iterations when the assignment is conditionally skipped. Each iteration should start with a clean scope, matching standard Jinja2 semantics. This fixes the issue with GLM-5.2 chat template when: * turn 1 is a tool call with reasoning * turn 2 is a tool call without reasoning In this case, the reasoning content for turn 1 would be wrongly duplicated to turn 2, resulting in degraded model performance.	2026-06-24 08:58:36 +02:00
mb8565	a7d35d51dc	eval-callback : sum over the full tensor, not just the printed slice (#2019 ) ggml_print_tensor() accumulated `sum` inside the truncated print loop, which skips the middle of each row when ne[0] > 2*n (n = 3). The printed `sum =` therefore only covered the first n and last n elements per row, not the whole tensor. For example a {2688, 5} tensor reported the sum of 30 of its 13440 values. That makes the value useless for numerically comparing two runs, and it disagrees with mainline llama.cpp, whose eval-callback sums every element in a separate pass. This factors the per-element read into a small helper and computes the sum in its own loop over all elements (double accumulator). The truncated print is a separate, unchanged pass, so only the printed `sum =` value changes. The change is confined to examples/eval-callback. Co-authored-by: mb8565 <244351746+mb8565@users.noreply.github.com> Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-24 08:57:19 +02:00
firecoperana	befbc0945b	server: variance based checkpoint eviction (#2020 ) Co-authored-by: firecoperana <firecoperana>	2026-06-24 08:54:07 +02:00
Farmadupe	7ccf1d2095	allow user to use THP for host allocations with GGML_CUDA_HOST_MALLOC_THP (#2010 ) * allow user to use THP for host allocations with GGML_CUDA_HOST_MALLOC_THP * Remove useless symbol check	2026-06-23 15:13:41 +02:00
Nexes the Elder	2d3ecd5e19	Fix minor CUDA discrepancies (part 2) (#2015 ) * fix: wrong tensor index in BF16 fused RMS norm add path (norm.cu:1039) The BF16 branch of ggml_cuda_op_fused_rms_rms_add used dst->src[2]->data for the second weight pointer, but should have used dst->src[3]->data. This caused reading float weights from the wrong bf16 input tensor. The F32 and F16 branches both correctly reference src[3], and the assertions at lines 1013-1015 confirm src[3] is the F32 weight tensor. * fix: off-by-one bounds check in 7 dmmv kernels (row > nrows -> row >= nrows) Seven K-quant dequantize_mul_mat_vec kernels used row > nrows for bounds checking instead of row >= nrows. Since rows are 0-indexed (0..nrows-1), the check missed the row == nrows case, allowing a potential out-of-bounds memory write when grid dimensions produce exactly nrows. The templated dequantize_mul_mat_vec<type> kernel at line 667 already used the correct row >= nrows pattern. * fix: typo in function name iqk_mul_mat_vec_q_kerne -> iqk_mul_mat_vec_q_kernel Truncated function name in iqk_mmvq_templates.cuh was missing trailing 'l'. * fix: print actual split_dim value in set_tensor error message (ggml-cuda.cu) fprintf used extra->split_dim == 0 which evaluates to boolean 0 or 1 instead of the actual split dimension value. When this fatal error is hit for an unsupported split_dim, the user could not diagnose which value caused the problem. * fix: wrong src index in gate bias stride for fused up-gate MoE path ggml_cuda_add_id for the gate bias used dst->src[4]->nb[1] as the stride argument instead of dst->src[5]->nb[1]. This was a copy-paste error from the up-bias code (lines 3220-3224) where src[4] is correct. If src[4] and src[5] have different strides, the bias addition produces incorrect results. * fix: wrong row count for gate projection MMQ in fused up-gate MoE path ggml_cuda_op_mul_mat_q for the gate projection (src0_2) used src0_1->ne[1] as row_high instead of src0_2->ne[1]. 
This copy-paste error causes processing the wrong number of rows if the up and gate projections have different row counts. The gemv path (line ~3563) correctly used src0_2->ne[1].	2026-06-23 14:03:22 +02:00
Nexes the Elder	9eaf86a7c7	Fix minor CUDA discrepancies (#2005 ) * CUDA : typo * CUDA: Add missing GGML_CALL to function definition * CUDA: only log GGML_CUDA_FORCE_MMQ/CUBLAS when enabled * CUDA: Fix softcap bug in flash_attn_tile_ext_f16 The else branch (softcap != 0) incorrectly called launch_fattn_tile_f16_64_128 with use_softcap=false instead of true, causing logit softcap to be silently ignored for the col_per_block=32, parallel_blocks=1 path.	2026-06-23 09:37:48 +02:00
Jun Yamog	69a8336d08	Add native MiniMax-M3 tool call parser (#2008 )	2026-06-23 09:36:02 +02:00
Joel Farthing	b2b4f66fa0	tests: add Seed-OSS chat template fixture (#2014 ) Co-authored-by: Joel Farthing <262452229+joelfarthing@users.noreply.github.com>	2026-06-23 09:35:28 +02:00
empty-quiver	b47b90d0be	Add Laguna M.1 GGUF support (#2003 )	2026-06-22 16:53:10 +02:00
Joel Farthing	64fceb70bc	DFlash: use persistent FA-ready K/V cache (#1997 ) * Prototype physical-order DFlash KV cache (cherry picked from commit f9093d9ee57cf66f6ce44c42524158bb1449d1c9) * Use persistent FA-ready DFlash KV cache (cherry picked from commit cfed6ae456b5448ac0053fbd5994037af845a69a) * Address DFlash review cleanup --------- Co-authored-by: Joel Farthing <262452229+joelfarthing@users.noreply.github.com>	2026-06-22 16:49:35 +02:00
magikRUKKOLA	72440a19fc	on-demand tensor reload (#1989 ) * host-swap tensor loop the host-swap functionality is only triggered when certain env. variables are declared * target_include_directories tweak * hot-swap tensor support two intrusions: 1.) at the model loading to collect the snapshot 2.) the modification of the `/health` HTTP endpoint to be able to trigger the hot-swap via sending the `llama-server` the HTTP-request. both are braced by the specific env. variables hot-swap tensor support; graph invalidation ggml_backend_cuda_invalidate_graphs export * hot-swap tensor support graph invalidation implementation; extended debug output (commented out) * llama_reload_changed_tensors export * tensor hot-swap on-demand reload cpu-only/hybrid/gpu-only with split mode layer/graph full support implementation * docs * reuse the gguf parsing from llama.cpp gguf_init_from_file, gguf_find_tensor, ggml_get_tensor * remove the manual scheduling for hybrid inference * update docs * tensor shape validation * update docs * update docs accidentally wiped the previous changes; so recovered them * revert the GGML_CUDA_MAX_DEVICES to 16 * update llama_reload_changed_tensor update llama_reload_changed_tensor, revert CMakeLists.txt * update llama_reload_changed_tensor * GGML_MAX_SRC GGML_MAX_SRC compile-time definition support * GGML_MAX_SRC GGML_MAX_SRC compile-time definition support * GGML_MAX_SRC GGML_MAX_SRC compile-time definition support * llama_reload_changed_tensor update llama_reload_changed_tensor definition * refactor move the tensor-reloading implementation to llama-reload.cpp, llama-reload-info.h; some bugfixes and code reduction * revert added back the missing newline * update docs * reload_info constructor * bugfix: cpu-only TODO: improve the working environment by compiling for multiple hardware configurations; possibly make a test pipeline * cpu-only bugfix set the fix again after unsuccessful sync with main * windows os compilation fix #include <string> * fix 
windows os build error C2039: 'string': is not a member of 'std' * remove dead file * implement perplexity in server * Revert "implement perplexity in server"	2026-06-22 16:36:34 +02:00
a1batross	6c00e87ac8	cmake: drop ggml-blas.h from GGML_PUBLIC_HEADERS (#2007 )	2026-06-21 07:49:09 +02:00
Kawrakow	d47f484d29	Force Gemma4 assistant to be loaded on last GPU (#1999 ) * Allow graph reuse for Gemma4 MTP * Force Gemma4 assistant to be loaded on last GPU	2026-06-19 18:17:13 +02:00
Kawrakow	8369cf7412	Allow graph reuse for Gemma4 MTP (#1996 )	2026-06-19 18:16:53 +02:00
Kawrakow	b21653a56f	Fully remove any BLAS remnants (#2001 ) * Fully remove any BLAS remnants * Also these	2026-06-19 17:26:09 +02:00
Kawrakow	4bcfe5b872	Add compatibility for llama.cpp Gemma4 assistant GGUFs (#1995 )	2026-06-19 11:24:54 +02:00
Samuel Oliveira Alves	d5c04c15fd	clean redundancy in dflash graph and small logics (#1994 )	2026-06-19 09:04:54 +02:00
Kawrakow	7321648844	Fix Gemma4 MTP compute graph (#1993 ) * Fix MTP warmup for GLM models * Fix Gemma4 MTP compute graph	2026-06-19 09:00:44 +02:00
Kawrakow	0d59973e4a	Fix MTP warmup for GLM models (#1992 )	2026-06-19 08:59:55 +02:00
Nexes the Elder	b3dfb7858c	AVX VNNI auto-activation for MSVC ; HAVE_VNNI256 path for IQ4_XS_R8 and Qx_0 R4 quants. (#1991 ) * AVX VNNI auto-activation Enables auto-detect of AVX VNNI and its definition in the CMakeLists Detected by ik_llama.cpp. * IQ4_XS R8: Enable AVX-VNNI 256-bit path with MSVC compatibility Migrate mul_mat_iq4_xs_r8_q8_k_avx2() from HAVE_FANCY_SIMD to HAVE_VNNI256. Changes (6 guard sites + 8 intrinsic calls in iqk_gemm_kquants.cpp): - Replaced 3x #ifdef HAVE_FANCY_SIMD with #ifdef HAVE_VNNI256 - Replaced 3x #ifndef HAVE_FANCY_SIMD with #ifndef HAVE_VNNI256 - Replaced 8x raw _mm256_dpbusd_epi32 with ggml_mm256_dpbusd_epi32 (the ggml wrapper resolves to _mm256_dpbusd_avx_epi32 on MSVC via the iqk_config.h macro, which is the correct MSVC AVX-VNNI intrinsic available under /arch:AVX2; raw _mm256_dpbusd_epi32 does not exist in MSVC headers without AVX-512) Impact: - IQ4_XS_R8 matmul now uses VNNI256 on CPUs with AVX-VNNI but no AVX-512 (e.g. Intel Arrow Lake / Core Ultra 265K) - Previously limited to HAVE_FANCY_SIMD (full AVX-512) exclusively - This path is exercised when models are loaded with -rtr / --run-time-repack (in-memory repack) or when using --repack to create a permanent IQ4_XS_R8 file. Standard IQ4_XS does not auto-convert to IQ4_XS_R8 at load time. * Qx_0 R4 legacy quants: Enable VNNI256 path for AVX-VNNI CPUs with MSVC compatibility Three changes in iqk_gemm_legacy_quants.cpp: 1. DotHelper (line 23): Extend VNNI condition to include HAVE_VNNI256 (not just __AVX512VNNI__+VL) and use ggml_mm256_dpbusd_epi32 wrapper for MSVC compatibility. This fixes Q6_0 non-R4 path and all other quant types routed through UnsignedDot/SignedDot. 2. accum_q4_0_quants (line 994), mul_mat_q5_0_r4_q8_2_avx2 (lines 1202, 1223), mul_mat_q6_0_r4_q8_2_avx2 (lines 1375, 1394): Replace #ifdef HAVE_FANCY_SIMD / #ifndef HAVE_FANCY_SIMD with HAVE_VNNI256 (which correctly detects AVX-VNNI without requiring full AVX-512). 
Also replace raw _mm256_dpbusd_epi32 with ggml_mm256_dpbusd_epi32 wrapper. These paths were dead code on Arrow Lake (HAVE_FANCY_SIMD requires full AVX-512 which Arrow Lake lacks). Now they compile and use the hardware VNNI instruction (vpdpbusd) via __AVXVNNI__. Note: remaining HAVE_FANCY_SIMD guards in this file guard true AVX-512 paths (_mm512_* intrinsics) and are left unchanged. * Simplify def	2026-06-18 18:05:19 +02:00
Kawrakow	3b81f63acd	Update AUTHORS	2026-06-18 08:11:41 +00:00
Farmadupe	21f918c185	faster ggml_cuda_host_malloc (#1988 )	2026-06-18 10:01:34 +02:00
Kawrakow	f5e5753c32	Fix Qwen35 mtp warmup (#1987 ) * Use hidden state from prev token from qwen mtp * Fix Qwen35 MTP warmup * Cleanup + remove unnecessary crippling performance by not using accept to sample draft token * Provide API to get the model arch string --------- Co-authored-by: SamuelOliveirads <samueloliveira32df@gmail.com>	2026-06-18 09:03:40 +02:00
Kawrakow	71af16a6b7	Fix DFlash performance with split mode graph (#1980 )	2026-06-17 18:40:02 +02:00
gapeleon	4f220159b8	Fix (Gemma-4 Vision): Correct KQ mask fill for causal models in non-causal flash-attn mode (#1985 ) When llama_set_causal_attn(false) is called on a causal model (e.g. Gemma-4 during vision image decode), llama_set_inputs took the non-causal else-branch (designed for pure embedding models). That path wrote the F16 mask with stride n_tokens instead of n_kv, and iterated batch indices rather than KV cache cells. The result was that every image query row beyond the first was written at the wrong offset, leaving stale -inf values from previous decodes visible to the GPU kernel. Any conversation that had built up prior KV mask data would produce all-inf attention scores for most image tokens, collapsing softmax to NaN and aborting at sampling. Resolves #1984	2026-06-17 16:52:45 +02:00
Jun Yamog	064d23a6f8	Codex CLI Responses Compatibility (#1964 ) * responses: skip known unsupported Responses tool types from Codex CLI - Skip namespace, web_search, image_generation tools instead of HTTP 500 - Reject unknown non-function tool types with controlled error - Preserve function tool conversion logic unchanged Fixes Codex CLI 0.133.0 compatibility where it sends mixed tool types. * responses: harden codex compatibility coverage * responses: expose Codex model catalog metadata	2026-06-16 15:28:16 +02:00
Joel Farthing	d37d92b54c	chat: add Cohere2MoE North Code parser (#1968 ) Co-authored-by: Joel Farthing <262452229+joelfarthing@users.noreply.github.com>	2026-06-16 15:27:30 +02:00
Kawrakow	8420f91ae3	Merge pull request #1977 from ikawrakow/ik/dflash_fix_cpu Fix DFlash on the CPU	2026-06-16 15:26:23 +02:00
Kawrakow	6f45163a95	Fix DFlash on the CPU	2026-06-16 13:22:36 +00:00
Kawrakow	f9078e169b	Merge pull request #1970 from SamuelOliveirads/feat/dflash-implementation Add DFlash support	2026-06-16 15:07:55 +02:00
Kawrakow	11c9935ce8	Merge pull request #1893 from ikawrakow/ik/gemma4_mtmd_blindness Fix Gemma4 vision	2026-06-16 07:47:37 +02:00
SamuelOliveirads	ad24046b51	minor refactor in DFlash kv cache graph	2026-06-15 18:22:56 -03:00
Kawrakow	2f524850a1	Merge pull request #1973 from ikawrakow/ik/fattn_mma_gqa_16 CUDA FA: faster TG when GQA is 16 and head size is 128	2026-06-15 15:24:01 +02:00
Kawrakow	37ea89cabf	Merge pull request #1974 from Nexesenex/fix_muge_crash_minimax_m3 Fix Minimax M3 crash when -muge merges up/gate experts	2026-06-15 15:07:49 +02:00
Nexesenex	3c9680fd3c	Fix Minimax M3 crash when -muge merges up/gate experts The graph builder for Minimax M3 (build_minimaxm3.cpp) was not passing model.layers[il].ffn_up_gate_exps to llm_build_std_moe_ffn, unlike Minimax M2 and all other MoE model graph builders. When -muge (merge_up_gate_experts) is enabled, the merge creates a single ffn_up_gate_exps tensor with ffn_up_exps and ffn_gate_exps as views. Only the parent merged tensor gets the split 'extra' pointer set. Without passing it as up_gate_exps parameter, the function sees null split pointers for up/gate (the views) while split_down_exps is valid, causing the assertion at llama-build-context.cpp:1453 to fail.	2026-06-15 15:00:32 +02:00
Kawrakow	6be3a488d3	CUDA FA: faster TG when GQA is 16 and head size is 128	2026-06-15 11:46:02 +00:00
Kawrakow	f81673c7db	Merge pull request #1972 from ikawrakow/ik/minimaxm3_smgraph Split mode graph for MiniMax-M3	2026-06-15 13:44:19 +02:00
Kawrakow	e927adc4ad	Merge pull request #1969 from Farmadupe/resize_algo_fix Correct image resize algorithm for all qwens after qwen2vl and gemma4	2026-06-15 13:39:11 +02:00
Kawrakow	00d96744de	Merge pull request #1967 from Farmadupe/stb_image_resize2 Replace image resizers with avx2/neon simd impls from stb_img_resize2.h	2026-06-15 13:38:31 +02:00
Kawrakow	1dc4ea938a	Merge pull request #1962 from ikawrakow/ik/fix_1961 Fix #1961	2026-06-15 13:00:27 +02:00
Kawrakow	c24d50dd88	Split mode graph for MiniMax-M3	2026-06-15 08:41:34 +00:00
Kawrakow	567854aeab	Merge pull request #1963 from jkyamog/minimax-m3-support Add preliminary MiniMax-M3 support	2026-06-15 10:16:10 +02:00
Jun Yamog	c08d194edd	Use standard graph helpers for MiniMax-M3	2026-06-15 02:00:46 +00:00
Jun Yamog	c538210e6d	Add MiniMax-M3 chat template	2026-06-15 01:29:13 +00:00
SamuelOliveirads	6cae8c7ba2	clean logs	2026-06-14 21:07:57 -03:00
Thomas Green	19f08160ad	Correct image resize algorithm for all qwens after qwen2vl and gemma4	2026-06-14 21:57:11 +01:00
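
Note on a7d35d51dc (#2019): the message says the old `sum =` value was accumulated inside the truncated print loop, so a {2688, 5} tensor with n = 3 summed only 30 of its 13440 values. The sketch below reproduces that arithmetic on a plain float buffer. It is a minimal illustration, not the code from examples/eval-callback: `SumResult`, `truncated_sum` and `full_sum` are hypothetical names, and the row-major float layout is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Accumulated sum plus the number of elements that contributed to it.
struct SumResult {
    double sum;
    std::size_t count;
};

// Hypothetical reconstruction of the old behaviour: the sum is accumulated
// inside the truncated print loop, so when ne0 > 2*n only the first n and
// last n elements of each row are visited.
SumResult truncated_sum(const std::vector<float>& data,
                        std::size_t ne0, std::size_t ne1, std::size_t n) {
    SumResult r{0.0, 0};
    for (std::size_t i1 = 0; i1 < ne1; ++i1) {
        for (std::size_t i0 = 0; i0 < ne0; ++i0) {
            if (ne0 > 2 * n && i0 == n) {
                i0 = ne0 - n;  // jump over the middle of the row, as the printer does
            }
            r.sum += data[i1 * ne0 + i0];
            ++r.count;
        }
    }
    return r;
}

// Behaviour described for the fix: a separate pass over every element,
// using a double accumulator.
SumResult full_sum(const std::vector<float>& data,
                   std::size_t ne0, std::size_t ne1) {
    SumResult r{0.0, 0};
    for (std::size_t i = 0; i < ne0 * ne1; ++i) {
        r.sum += data[i];
        ++r.count;
    }
    return r;
}
```

Calling both functions with ne0 = 2688, ne1 = 5 and n = 3 gives element counts of 30 and 13440 respectively, matching the figures quoted in the commit message.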

4656 Commits
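
Note on 2d3ecd5e19 (#2015): the message describes seven dmmv kernels guarding with `row > nrows` where `row >= nrows` was needed. The host-side sketch below shows why the first form lets the index `row == nrows` through. It only simulates a launch with a loop; `skip_row_old`, `skip_row_new` and `count_out_of_bounds_rows` are hypothetical names, not functions from the CUDA sources.

```cpp
#include <cassert>

// Guard as described for the kernels before the fix.
inline bool skip_row_old(int row, int nrows) { return row > nrows; }

// Guard after the fix: valid rows are 0..nrows-1, so row == nrows must be skipped.
inline bool skip_row_new(int row, int nrows) { return row >= nrows; }

// Hypothetical stand-in for a kernel launch: one "thread" per row index in
// 0..launched_rows-1. Returns how many threads pass the guard although
// their row index is outside the valid range.
template <typename Guard>
int count_out_of_bounds_rows(int launched_rows, int nrows, Guard skip) {
    int out_of_bounds = 0;
    for (int row = 0; row < launched_rows; ++row) {
        if (skip(row, nrows)) {
            continue;  // thread exits early
        }
        if (row >= nrows) {
            ++out_of_bounds;  // a real kernel would write past the end of dst here
        }
    }
    return out_of_bounds;
}
```

With nrows = 7 and a launch that covers row indices 0..7, the old guard admits one out-of-range row and the new guard admits none.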