ik_llama.cpp

mirror of https://github.com/ikawrakow/ik_llama.cpp.git synced 2026-06-28 04:30:15 -05:00

Author	SHA1	Message	Date
Samuel Oliveira Alves	35fbe08d6e	disable MTP for parallel slots (#1804 )	2026-05-15 07:11:04 +03:00
Samuel Oliveira Alves	0fcffdb64d	feat: map Gemma 4 tensor and support with imatrix (#1796 )	2026-05-14 09:01:24 +03:00
Marian M.	b2e7f7f6cd	Update docs (#1800 ) * Update README.md - New model - New features * Update parameters.md - Recent new parameters	2026-05-14 08:44:58 +03:00
Kawrakow	949bb8f1d6	More MTP tweaks (#1792 )	2026-05-13 17:55:43 +03:00
ubergarm	ca52a825db	feat: add --threads-mtmd for independent multimodal thread count (#1797 ) Add `-tm` / `--threads-mtmd` to control CPU thread count used during multimodal image/audio processing (mmproj encoding), separate from the main LLM thread count. This allows running the LLM on GPU with minimal CPU threads (e.g. `-t 1`) to reduce sync overhead, while using many threads (e.g. `-tm 16`) for CPU-bound mmproj encoding with `--no-mmproj-offload`. Fallback chain when `-tm` is not specified: 1. `--threads-batch` (-tb) — multimodal encoding is a batch/prefill-like operation, so it makes sense to track with batch thread count 2. `--threads` (-t) — final default Works with both mtmd-cli and llama-server. AI: ubergarm/Qwen3.6-27B-GGUF MTP IQ4_KS 15.113 GiB (4.752 BPW) + pi.dev	2026-05-13 17:44:43 +03:00
Forkoz	8a0f912cb2	Remove outdated asserts from mmproj (#1795 )	2026-05-13 17:40:11 +03:00
Kawrakow	6b221f0c1f	Fix ggml_nbytes (#1798 )	2026-05-13 17:39:25 +03:00
Kawrakow	397150caa2	MTP: faster recurrent state restore (#1791 ) * MTP: store ready per step convolution states * Cleanup	2026-05-13 11:00:24 +03:00
Kawrakow	86b5d076c5	Gemma4 MTP: avoid casting KV cache to f32 (#1786 )	2026-05-13 09:11:27 +03:00
ubergarm	f478a3ec0b	fix: only inflate n_batch for GPU-offloaded mmproj, not CPU (#1788 ) The get_batch_ubatch() function unconditionally inflated n_batch and n_ubatch whenever --mmproj was specified, regardless of whether the mmproj model actually ran on the GPU. This boosted batch size applies to both the main context and the MTP draft context, since params_base.speculative.cparams_dft is derived from common_context_params_to_llama(params_base). When mmproj runs on CPU (--no-mmproj-offload), this batch inflation is unnecessary for mmproj itself (CPU compute is sized by image dimensions independently), but it still inflates the MTP compute buffer proportionally. For large images (e.g. --image-max-tokens 4096), the MTP compute buffer ballooned to ~2020 MiB and triggered an OOM even though the mmproj model was fully on CPU and should have saved VRAM. Restrict the batch inflation to !params.mmproj.path.empty() && params.mmproj_use_gpu so it only triggers when mmproj actually occupies GPU memory. When mmproj runs on CPU, the existing per-chunk decode splitting in mtmd_helper_decode_image_chunk_impl handles large images correctly with the default batch size. AI: ubergarm/Qwen3.6-27B-GGUF MTP IQ4_KS 15.113 GiB (4.752 BPW) + pi.dev	2026-05-13 09:08:42 +03:00
firecoperana	cdc288bc97	server: reset cache tokens after pp stops (#1787 ) Co-authored-by: firecoperana <firecoperana>	2026-05-13 09:05:32 +03:00
Kawrakow	f9a93c37e2	Fix GLM-4.5 MTP loading (#1784 )	2026-05-12 18:06:17 +03:00
Jun Yamog	8b0cd0357a	fix: keep sm70 cublas f32 outputs in f32 (#1776 )	2026-05-12 07:38:42 +03:00
Kawrakow	cec1a6c1f5	MTP: Reuse graphs (again) (#1780 )	2026-05-12 07:36:12 +03:00
Samuel Oliveira Alves	be8435793e	Pre-allocate buffers for hybrid model checkpoints (#1774 ) * hybrid-spec: improve recurrent checkpoint handling in speculative decoding * change per-step save to support scheduling and asynchronous tensor operations * remove redundant backend tensor fallback * improve recurrent tensor handling for split graph	2026-05-12 07:21:25 +03:00
Lingfeng Ren	c2f498ab4c	MTP: use target slot position for drafting (#1781 )	2026-05-12 07:21:03 +03:00
Kawrakow	eb570eb966	MTP: Avoid per step SSM copy (#1778 ) * Avoid copying the per-step SSM state (CUDA) * Avoid copying the per-step SSM state (CPU) * Allocate only what is necessary for per-step SSM state * Cleanup	2026-05-11 18:15:55 +03:00
Kawrakow	3557b446f8	Avoid recurrent state copy (#1777 )	2026-05-11 13:13:59 +03:00
Kawrakow	94940cd882	MTP: enable per step recurrent state for split mode graph (#1773 )	2026-05-11 12:40:04 +03:00
Lingfeng Ren	35845dd975	server : support MTP with multimodal prompts (#1758 ) Synchronize MTP state after mtmd decode batches so multimodal prompt chunks do not desync the draft context.	2026-05-11 09:51:07 +03:00
Kawrakow	23127139cb	Fix Mistral3 split mode graph (#1771 )	2026-05-10 17:05:13 +03:00
Kawrakow	4bbdb8ed0b	Faster per step recurrent state restore when using MTP (#1767 )	2026-05-10 07:51:06 +03:00
Samuel Oliveira Alves	c2b8bca807	Add MTP Support for Gemma 4 (#1744 ) * gemma-mtp: build the arch to load the MTP model * gemma-mtp: fix mtp kv state * gemma-mtp: refactor some functions and create gguf * gemma-mtp: make usable for embeddings models variant * gemma-mtp: fix qwen mtp load in graph split * gemma-mtp: refactor tensor creation and adjust output tensor handling * Gemma 4 MTP: improve tensor handling, and adjust split mode logic	2026-05-10 07:44:20 +03:00
XZiar	ab0f22b819	Use AVX version VNNI intrinsic when AVX512VNNI not available. (#1748 ) * Use AVX version VNNI intrinsic when AVX512VNNI not available. * remove changes under HAVE_FANCY_SIMD --------- Co-authored-by: XZiar <xziar@xziar.xziar>	2026-05-09 09:02:06 +03:00
Alex	51331f4973	Fix two speculative-decoding crashes that prevent any usage (#1760 ) This patch addresses two latent bugs in examples/speculative/speculative.cpp that prevent llama-speculative.exe from running on greedy sampling (temp=0) or producing rejection-sampling output (temp>0): 1. Line 191: `params.sparams.grammar = { COMMON_GRAMMAR_TYPE_NONE, "" };` invokes `common_grammar(type, grammar)` which asserts `type != NONE || !grammar.empty()`. Both conditions fail with the intended-to-be-empty grammar, so every speculative run hits a hard `GGML_ASSERT` in common/sampling.h:63 immediately after model load. Fix: default-construct via `common_grammar{}` to bypass the field-init constructor. 2. Lines 293-294: `GGML_ASSERT(dist_tgt.sorted)` and `GGML_ASSERT(dist_dft.sorted)` fire whenever the draft sampler does not set the .sorted flag (which is most modern sampler paths). Comment them out — the next ~10 lines re-sort both distributions by id explicitly, so the assertion is incorrect anyway. Fix: replace the asserts with an explanatory comment. After both fixes, `llama-speculative.exe` runs to completion. The acceptance-rate measurement at temp=0 still looks suspicious (0% across same-family draft/target pairs), but that is a different issue out of scope for this PR. Tested on Qwen3-0.6B-IQ4_XS drafting Qwen3-1.7B-IQ4_XS, both base models from `bartowski/Qwen_Qwen3-*-GGUF` on Windows + ik_llama.cpp build at HEAD of windows-mingw-default-win10 (which is itself a follow-up to PR #1755).	2026-05-09 08:36:38 +03:00
Kawrakow	96127976f2	Use AVX2 when available for greedy speculative sampling (#1761 ) * Use AVX2 when available for greedy speculative sampling * Avoid some code duplication	2026-05-09 08:32:20 +03:00
Kawrakow	2f0b47c19d	Use async copies to save/restore recurrent state (#1759 )	2026-05-09 08:31:56 +03:00
Kawrakow	9f60de9cc5	Fix discarding tokens from the KV cache during MTP drafting (#1757 )	2026-05-09 08:31:25 +03:00
Alex	98950267c6	ggml : default GGML_WIN_VER to 0x0A00 (Windows 10) (#1755 ) The default of 0x602 (Windows 8) causes a build failure on any toolchain where _WIN32_WINNT propagates into vendored cpp-httplib (notably MinGW with the bundled w64devkit GCC). cpp-httplib's httplib.h has, for some time now, contained: #ifdef _WIN32 #if defined(_WIN32_WINNT) && _WIN32_WINNT < 0x0A00 #error "cpp-httplib doesn't support Windows 8 or lower. Please use Windows 10 or later." #endif #endif so the entire llama-server target fails to compile on Windows + MinGW unless the user passes -DGGML_WIN_VER=0x0A00 manually. Bumping the default to 0x0A00 (Windows 10) keeps Windows 8 reachable for anyone who explicitly requests it (-DGGML_WIN_VER=0x602) while letting the default Windows + MinGW build succeed end-to-end. Windows 8 / 8.1 reached end of support in January 2023, and Windows 10 is a strict superset of the Win8 surface used elsewhere (PrefetchVirtualMemory etc.), so this is strictly additive on the API side. Verified by building with w64devkit 2.8.0 (gcc 16.1.0) on Windows 11 without any -DGGML_WIN_VER override: all 266 ninja targets link cleanly, including bin/llama-server.exe, and llama-cli runs Qwen3-4B-Thinking-2507 IQ4_XS at ~6.2 tok/s with q8_0 KV at 4096 context.	2026-05-08 13:23:04 +03:00
joelfarthing	9a26522af2	qwen35moe : support MTP tail layer (#1745 ) Co-authored-by: Joel Farthing <262452229+joelfarthing@users.noreply.github.com>	2026-05-07 15:46:41 +03:00
Zhekun Hu	9ddb510787	Add Turing and Ampere (A100) GGML to docker build file (#1691 ) * Add Turing and Ampere (A100) GGML to docker build file At the moment, the docker file for image builds do not build for CUDA architectures below 8.6, and ik_llama.cpp specifies support for architectures Turing and above, this PR sets the CUDA architecture list to include the architecture for Turing (7.5) and A100 (8.0) * Remove 80 because few ppl have A100s and it does seem like many cuda arches cause issues for build * switch to 86-real and 89-real with 75, 80, 90 using virtual ptx jit * nvm, even adding 90-virtual causes linker error --------- Co-authored-by: Codex <codex@local>	2026-05-07 12:58:58 +03:00
Henrik Berglund	75f0ab300e	Update repository clone instructions in build.md (#1753 )	2026-05-07 12:57:06 +03:00
dungquixote42	b93721902b	Add Expiring Logit Bias (#1731 ) * initial commit * fix substr() out of range * add tilde (~) as bias range indicator * fix runtime error when the first entry is exitword	2026-05-06 09:25:38 +03:00
firecoperana	39b3a188e8	server: fix mtmd checkpoint restore and avoid checkpoint host copies (#1743 ) Co-authored-by: firecoperana <firecoperana>	2026-05-06 08:42:21 +03:00
Kawrakow	e722f0bb73	MTP tweaks (#1741 )	2026-05-06 08:35:11 +03:00
Kawrakow	8b56d813a9	MTP improvements (#1736 ) * MTP improvements * Cleanup	2026-05-05 08:05:24 +03:00
Andrew Moryakov	45dfd80371	readme : link "Build for CPU" to AVX-512 build flags reference (#1735 ) Adds a short note in README's "Build for CPU" section pointing to the AVX-512 build flags reference in docs/build.md (added by #1729). The vanilla `cmake -B build -DGGML_NATIVE=ON` example shown right above silently falls back to the AVX2 path on AMD Zen4 / Intel Sapphire Rapids+ hardware; users hitting "my Zen4 build is slow" tend to look at the README first, so a single-paragraph cross-reference here saves them from having to dig through docs/ to find the right knob. No content moved — README still has its own short example, the new paragraph just points at the deeper reference.	2026-05-04 15:35:24 +03:00
Andrew Moryakov	a67287124d	scripts : add build-zen.{sh,bat} helpers for AVX-512-capable CPUs (#1734 ) Adds two thin one-liner-ish helpers that invoke `cmake` with the flag set documented in docs/build.md "CPU build flags for AVX-512", so that users on AMD Zen4 / Intel Sapphire Rapids+ hardware get the IQK HAVE_FANCY_SIMD path activated without having to remember the five relevant `GGML_AVX512_*=ON` options. scripts/build-zen.sh - Linux / macOS bash wrapper scripts/build-zen.bat - Windows MSVC wrapper (run from a "x64 Native Tools Command Prompt") Both default to a "build" output directory, both pass through to the same cmake invocation, and both work alongside the existing build options (no behavioural change to vanilla CMake builds). .gitignore: added `!scripts/build-.sh` / `!scripts/build-.bat` exceptions, in line with the existing `!build-info.sh` / `!build.zig` exceptions, so the scripts directory build helpers don't get caught by the broad `build` artifact pattern. This is a follow-up to #1729 — the docs section explains why these flags matter, this PR makes them one command away.	2026-05-04 15:34:28 +03:00
Andrew Moryakov	485c431b9d	docs : restructure AVX-512 build flags section, recommend GGML_AVX512_*=ON first (#1733 ) Per @ikawrakow follow-up suggestion in #1729 to "offer the original version at the beginning and note that in case that does not work, they can use GGML_ARCH_FLAGS in that way". Restructured the docs/build.md AVX-512 section so that the recommended high-level CMake options come first, with GGML_ARCH_FLAGS as the fallback for cases where the high-level options don't propagate the necessary macros (older MSVC, ARM cross-compile, exotic toolchains). Empirical confirmation that GGML_AVX512_*=ON activates HAVE_FANCY_SIMD: on MSVC 2022, the resulting compile line (read from build/.../flags.make) contains both `/arch:AVX512` (from GGML_AVX512=ON) and explicit `-D__AVX512VNNI__` / `-D__AVX512VBMI__` / `-D__AVX512BF16__` (added by the matching GGML_AVX512_*=ON options via add_compile_definitions(...) at ggml/src/CMakeLists.txt:1361-1372). The runtime banner prints `HAVE_FANCY_SIMD is defined` and `system_info: AVX512_VNNI = 1`. Also added a brief note about the separate HAVE_VNNI256 gate in iqk_config.h:52-54, which gives meaningful speedups on AVX2-only CPUs with the VNNI extension (some Alder/Raptor Lake parts). Documentation only — no code changes.	2026-05-04 15:32:38 +03:00
Samuel Oliveira Alves	a342831115	suffix-spec: load corpus in chunks (#1721 )	2026-05-04 07:56:07 +03:00
Andrew Moryakov	418d60a909	docs : add AVX-512 build flags reference for Zen4 / Sapphire Rapids+ (#1729 ) The IQK quantized GEMM kernels (ggml/src/iqk/iqk_gemm_.cpp) are gated by HAVE_FANCY_SIMD in iqk_config.h, which requires five AVX-512 macros to be defined: __AVX512F__, __AVX512VNNI__, __AVX512VL__, __AVX512BW__, __AVX512DQ__. If they are not defined, the AVX-512 quantized matmul path is skipped silently — no build warning, no runtime symptom, just lower performance than the hardware can deliver. Surprises users on Windows/MSVC where -march=native semantics are not propagated. Adds a docs/build.md section that documents: - Which macros gate which path (HAVE_FANCY_SIMD for quant GEMM, __AVX512F__ alone for f16/f32, __AVX512BF16__ for bf16, __AVXVNNI__ for AVX2+VNNI-only CPUs). - Linux/GCC: GGML_NATIVE=ON (default) handles this automatically on Zen4 / Sapphire Rapids; just verify with objdump. - Windows/MSVC and cross-compile: explicit GGML_ARCH_FLAGS with -D__AVX512 defines is required. - Note on Zen4 implementing AVX-512 as 256-bit double-pumped. Documentation only — no code changes, no behavioural changes, no new CMake options introduced.	2026-05-03 17:35:01 +03:00
dmaivel	38c200373f	Fix graph reuse regression with MTP checkpoints (#1728 )	2026-05-03 12:04:39 +03:00
Heath Albritton	bc549da0f7	server : catch sampler/grammar exceptions to avoid process abort (#1725 ) (#1726 ) Wrap the two slot-level sample/accept call sites in try/catch (std::exception). On exception: log, send_error to the task, release the slot, continue serving. Matches the existing try/catch around common_sampler_init in the same file. Without this, llama_grammar_accept_token throwing "Unexpected empty grammar stack after accepting piece: <pad> (0)" (reproducible on Gemma 4 + json_schema + ctx_shift, see #1725) unwinds out of update_slots -> queue start_loop -> main, hits std::terminate, and aborts the whole server process.	2026-05-03 08:21:09 +03:00
Kawrakow	e76700119d	Support Mimo-2.5 (#1723 )	2026-05-03 08:16:02 +03:00
Kawrakow	2dd3818083	MTP: better graph reuse (#1713 )	2026-05-03 08:15:21 +03:00
dmaivel	1beaaa002d	speculative: enable MTP per-step checkpoints with CPU recurrent layers (#1724 )	2026-05-03 08:14:56 +03:00
firecoperana	b8eb8ccbb5	server: revert llama_decode stop (#1722 ) Co-authored-by: firecoperana <firecoperana>	2026-05-02 18:19:59 +03:00
firecoperana	9f1deefa71	server: revert checkpoint fix (#1716 ) Co-authored-by: firecoperana <firecoperana>	2026-05-02 16:09:28 +03:00
KeinNiemand	9c7d8b07cc	perplexity : fix large-vocab logit offset overflow (#1717 ) Co-authored-by: Codex <noreply@openai.com>	2026-05-02 16:08:49 +03:00
dmaivel	1b14f56693	speculative: keep MTP draft hidden state alive across steps (#1718 )	2026-05-02 16:05:41 +03:00

4497 Commits