Add `-tm` / `--threads-mtmd` to control the CPU thread count used during
multimodal image/audio processing (mmproj encoding), separate from the
main LLM thread count.
This allows running the LLM on GPU with minimal CPU threads (e.g. `-t 1`)
to reduce sync overhead, while using many threads (e.g. `-tm 16`) for
CPU-bound mmproj encoding with `--no-mmproj-offload`.
Fallback chain when `-tm` is not specified:
1. `--threads-batch` (`-tb`): multimodal encoding is a batch/prefill-like
operation, so it makes sense for it to follow the batch thread count
2. `--threads` (`-t`): final default
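The fallback chain above can be sketched roughly as follows. This is only an illustration: the `thread_opts` struct, its field names, and the "value <= 0 means not specified" convention are assumptions of the sketch, not the actual code in this change.

```cpp
#include <cstdint>

// Hypothetical holder for the three thread-count options. In this sketch a
// value <= 0 means "not specified on the command line" (an assumption).
struct thread_opts {
    int32_t n_threads       = 4;   // --threads / -t
    int32_t n_threads_batch = -1;  // --threads-batch / -tb
    int32_t n_threads_mtmd  = -1;  // --threads-mtmd / -tm
};

// Resolve the thread count used for mmproj encoding:
// -tm if given, else -tb if given, else -t.
inline int32_t resolve_mtmd_threads(const thread_opts & o) {
    if (o.n_threads_mtmd  > 0) return o.n_threads_mtmd;
    if (o.n_threads_batch > 0) return o.n_threads_batch;
    return o.n_threads;
}
```

With `-t 1 -tm 16` this resolves to 16 for mmproj encoding while the LLM itself keeps a single CPU thread.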
Works with both mtmd-cli and llama-server.
AI: ubergarm/Qwen3.6-27B-GGUF MTP IQ4_KS 15.113 GiB (4.752 BPW) + pi.dev
The get_batch_ubatch() function unconditionally inflated n_batch and
n_ubatch whenever --mmproj was specified, regardless of whether the
mmproj model actually ran on the GPU. This boosted batch size applies
to both the main context and the MTP draft context, since
params_base.speculative.cparams_dft is derived from
common_context_params_to_llama(params_base).
When mmproj runs on CPU (--no-mmproj-offload), this batch inflation
is unnecessary for mmproj itself (CPU compute is sized by image
dimensions independently), but it still inflates the MTP compute buffer
proportionally. For large images (e.g. --image-max-tokens 4096), the
MTP compute buffer ballooned to ~2020 MiB and triggered an OOM even
though the mmproj model was fully on CPU and should have saved VRAM.
Restrict the batch inflation to !params.mmproj.path.empty() &&
params.mmproj_use_gpu so it only triggers when mmproj actually occupies
GPU memory. When mmproj runs on CPU, the existing per-chunk decode
splitting in mtmd_helper_decode_image_chunk_impl handles large images
correctly with the default batch size.
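A minimal sketch of the gating described above. The struct is a trimmed-down stand-in, and the `min_mm_batch` floor and the use of `std::max` are placeholders for whatever inflation the real get_batch_ubatch() applies; only the condition itself is taken from this message.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <utility>

// Hypothetical, trimmed-down stand-in for the real parameter struct.
struct sketch_mmproj { std::string path; };
struct sketch_params {
    sketch_mmproj mmproj;
    bool    mmproj_use_gpu = true;  // false corresponds to --no-mmproj-offload
    int32_t n_batch        = 2048;
    int32_t n_ubatch       = 512;
};

// Returns {n_batch, n_ubatch}. min_mm_batch is a made-up floor used only to
// make the inflation visible in this sketch.
inline std::pair<int32_t, int32_t> get_batch_ubatch_sketch(
        const sketch_params & p, int32_t min_mm_batch = 4096) {
    int32_t n_batch  = p.n_batch;
    int32_t n_ubatch = p.n_ubatch;

    // Inflate only when an mmproj is loaded AND it actually lives on the GPU.
    const bool mmproj_on_gpu = !p.mmproj.path.empty() && p.mmproj_use_gpu;
    if (mmproj_on_gpu) {
        n_batch  = std::max(n_batch,  min_mm_batch);
        n_ubatch = std::max(n_ubatch, min_mm_batch);
    }
    return {n_batch, n_ubatch};
}
```

Because the draft context parameters are derived from the same base parameters, leaving the batch sizes untouched in the CPU case also keeps the MTP compute buffer at its default size.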
AI: ubergarm/Qwen3.6-27B-GGUF MTP IQ4_KS 15.113 GiB (4.752 BPW) + pi.dev
* Avoid copying the per-step SSM state (CUDA)
* Avoid copying the per-step SSM state (CPU)
* Allocate only what is necessary for per-step SSM state
* Cleanup
* Use AVX version VNNI intrinsic when AVX512VNNI not available.
* remove changes under HAVE_FANCY_SIMD
---------
Co-authored-by: XZiar <xziar@xziar.xziar>
This patch addresses two latent bugs in examples/speculative/speculative.cpp
that prevent llama-speculative.exe from running with greedy sampling
(temp=0) or producing rejection-sampling output (temp>0):
1. Line 191: `params.sparams.grammar = { COMMON_GRAMMAR_TYPE_NONE, "" };`
invokes `common_grammar(type, grammar)` which asserts
`type != NONE || !grammar.empty()`. Both conditions fail with the
intended-to-be-empty grammar, so every speculative run hits a hard
`GGML_ASSERT` in common/sampling.h:63 immediately after model load.
Fix: default-construct via `common_grammar{}` to bypass the
field-init constructor.
2. Lines 293-294: `GGML_ASSERT(dist_tgt.sorted)` and
`GGML_ASSERT(dist_dft.sorted)` fire whenever the draft sampler does
not set the .sorted flag (which is the case for most modern sampler
paths). The next ~10 lines explicitly re-sort both distributions by
id, so the assertions are incorrect anyway.
Fix: replace the asserts with an explanatory comment.
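The first bug can be reproduced in isolation with a small stand-in type. Everything named `sketch_*` below is hypothetical; only the checked condition is copied from the assert quoted in item 1, and the sketch throws instead of aborting so the failure can be observed.

```cpp
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical stand-ins; the real types live in common/sampling.h.
enum sketch_grammar_type { SKETCH_GRAMMAR_NONE, SKETCH_GRAMMAR_USER };

struct sketch_grammar {
    sketch_grammar_type type = SKETCH_GRAMMAR_NONE;
    std::string         grammar;

    // Default construction: empty grammar, no check performed.
    sketch_grammar() = default;

    // Field-init constructor: same condition as the assert quoted above,
    // but throwing instead of aborting.
    sketch_grammar(sketch_grammar_type t, std::string g)
        : type(t), grammar(std::move(g)) {
        if (!(type != SKETCH_GRAMMAR_NONE || !grammar.empty())) {
            throw std::logic_error("NONE type with empty grammar");
        }
    }
};

// True if constructing { NONE, "" } through the field-init constructor fails.
inline bool ctor_rejects_none_empty() {
    try {
        sketch_grammar g{SKETCH_GRAMMAR_NONE, ""};
        (void) g;
        return false;
    } catch (const std::logic_error &) {
        return true;
    }
}
```

Here `sketch_grammar{}` yields the same "no grammar" state without ever evaluating the check, which is the shape of the fix described in item 1.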
After both fixes, `llama-speculative.exe` runs to completion. The
acceptance-rate measurement at temp=0 still looks suspicious (0%
across same-family draft/target pairs), but that is a different
issue out of scope for this PR.
Tested on Qwen3-0.6B-IQ4_XS drafting Qwen3-1.7B-IQ4_XS, both base
models from `bartowski/Qwen_Qwen3-*-GGUF` on Windows + ik_llama.cpp
build at HEAD of windows-mingw-default-win10 (which is itself a
follow-up to PR #1755).
The default of 0x602 (Windows 8) causes a build failure on any toolchain
where _WIN32_WINNT propagates into vendored cpp-httplib (notably MinGW with
the bundled w64devkit GCC). cpp-httplib's httplib.h has, for some time
now, contained:
    #ifdef _WIN32
    #if defined(_WIN32_WINNT) && _WIN32_WINNT < 0x0A00
    #error "cpp-httplib doesn't support Windows 8 or lower. Please use Windows 10 or later."
    #endif
    #endif
so the entire llama-server target fails to compile on Windows + MinGW
unless the user passes -DGGML_WIN_VER=0x0A00 manually.
Bumping the default to 0x0A00 (Windows 10) keeps Windows 8 reachable for
anyone who explicitly requests it (-DGGML_WIN_VER=0x602) while letting the
default Windows + MinGW build succeed end-to-end. Windows 8 / 8.1 reached
end of support in January 2023, and Windows 10 is a strict superset of the
Win8 surface used elsewhere (PrefetchVirtualMemory etc.), so this is
strictly additive on the API side.
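As a compile-time illustration of why the old default fails and the new one passes, the guard quoted above reduces to a single comparison. The constant and function names here are made up for the sketch.

```cpp
#include <cstdint>

// Version constants as used in this message (names are hypothetical).
constexpr uint32_t SKETCH_WINNT_WIN8  = 0x0602;  // old GGML_WIN_VER default
constexpr uint32_t SKETCH_WINNT_WIN10 = 0x0A00;  // new GGML_WIN_VER default

// Mirrors the guard quoted from httplib.h: anything below Windows 10
// would hit the #error.
constexpr bool httplib_accepts(uint32_t win32_winnt) {
    return win32_winnt >= SKETCH_WINNT_WIN10;
}

static_assert(!httplib_accepts(SKETCH_WINNT_WIN8),  "0x602 trips the #error");
static_assert( httplib_accepts(SKETCH_WINNT_WIN10), "0x0A00 passes the guard");
```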
Verified by building with w64devkit 2.8.0 (gcc 16.1.0) on Windows 11
without any -DGGML_WIN_VER override: all 266 ninja targets link cleanly,
including bin/llama-server.exe, and llama-cli runs Qwen3-4B-Thinking-2507
IQ4_XS at ~6.2 tok/s with q8_0 KV at 4096 context.
* Add Turing and Ampere (A100) GGML to docker build file
At the moment, the Docker file for image builds does not build for CUDA architectures below 8.6, although ik_llama.cpp specifies support for Turing and above. This PR sets the CUDA architecture list to include Turing (7.5) and A100 (8.0).
* Remove 80 because few people have A100s and building for many CUDA archs does seem to cause build issues
* switch to 86-real and 89-real with 75, 80, 90 using virtual ptx jit
* nvm, even adding 90-virtual causes linker error
---------
Co-authored-by: Codex <codex@local>
Adds a short note in README's "Build for CPU" section pointing to the
AVX-512 build flags reference in docs/build.md (added by #1729).
The vanilla `cmake -B build -DGGML_NATIVE=ON` example shown right above
silently falls back to the AVX2 path on AMD Zen4 / Intel Sapphire
Rapids+ hardware; users hitting "my Zen4 build is slow" tend to look at
the README first, so a single-paragraph cross-reference here saves them
from having to dig through docs/ to find the right knob.
No content moved — README still has its own short example, the new
paragraph just points at the deeper reference.
Adds two thin one-liner-ish helpers that invoke `cmake` with the flag
set documented in docs/build.md "CPU build flags for AVX-512", so that
users on AMD Zen4 / Intel Sapphire Rapids+ hardware get the IQK
HAVE_FANCY_SIMD path activated without having to remember the five
relevant `GGML_AVX512_*=ON` options.
  scripts/build-zen.sh   - Linux / macOS bash wrapper
  scripts/build-zen.bat  - Windows MSVC wrapper (run from a
                           "x64 Native Tools Command Prompt")
Both default to a "build" output directory, both pass through to the
same cmake invocation, and both work alongside the existing build
options (no behavioural change to vanilla CMake builds).
.gitignore: added `!scripts/build-*.sh` / `!scripts/build-*.bat`
exceptions, in line with the existing `!build-info.sh` / `!build.zig`
exceptions, so the scripts directory build helpers don't get caught
by the broad `build*` artifact pattern.
This is a follow-up to #1729 — the docs section explains why these
flags matter, this PR makes them one command away.
Per @ikawrakow's follow-up suggestion in #1729 to "offer the original version
at the beginning and note that in case that does not work, they can use
GGML_ARCH_FLAGS in that way".
Restructured the docs/build.md AVX-512 section so that the recommended
high-level CMake options come first, with GGML_ARCH_FLAGS as the fallback
for cases where the high-level options don't propagate the necessary
macros (older MSVC, ARM cross-compile, exotic toolchains).
Empirical confirmation that GGML_AVX512_*=ON activates HAVE_FANCY_SIMD:
on MSVC 2022, the resulting compile line (read from build/.../flags.make)
contains both `/arch:AVX512` (from GGML_AVX512=ON) and explicit
`-D__AVX512VNNI__` / `-D__AVX512VBMI__` / `-D__AVX512BF16__` (added by
the matching GGML_AVX512_*=ON options via add_compile_definitions(...)
at ggml/src/CMakeLists.txt:1361-1372). The runtime banner prints
`HAVE_FANCY_SIMD is defined` and `system_info: AVX512_VNNI = 1`.
Also added a brief note about the separate HAVE_VNNI256 gate in
iqk_config.h:52-54, which gives meaningful speedups on AVX2-only CPUs
with the VNNI extension (some Alder/Raptor Lake parts).
Documentation only — no code changes.
The IQK quantized GEMM kernels (ggml/src/iqk/iqk_gemm_*.cpp) are gated
by HAVE_FANCY_SIMD in iqk_config.h, which requires five AVX-512 macros
to be defined: __AVX512F__, __AVX512VNNI__, __AVX512VL__, __AVX512BW__,
__AVX512DQ__. If they are not defined, the AVX-512 quantized matmul
path is skipped silently — no build warning, no runtime symptom, just
lower performance than the hardware can deliver. This tends to surprise
users on Windows/MSVC, where -march=native semantics are not propagated.
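One way to check which of the five macros a given compiler invocation actually defines is a small probe like the one below. It is a sketch, not part of iqk_config.h: the function names are hypothetical, and it only reports on the translation unit it is compiled in, so it needs to be built with the same flags as the ggml sources to be meaningful.

```cpp
#include <string>
#include <vector>

// Returns the names of the macros (from the list of five above) that this
// translation unit was compiled WITHOUT.
inline std::vector<std::string> missing_fancy_simd_macros() {
    std::vector<std::string> missing;
#ifndef __AVX512F__
    missing.push_back("__AVX512F__");
#endif
#ifndef __AVX512VNNI__
    missing.push_back("__AVX512VNNI__");
#endif
#ifndef __AVX512VL__
    missing.push_back("__AVX512VL__");
#endif
#ifndef __AVX512BW__
    missing.push_back("__AVX512BW__");
#endif
#ifndef __AVX512DQ__
    missing.push_back("__AVX512DQ__");
#endif
    return missing;
}

// True only if all five macros are defined, i.e. the condition under which
// the message above says HAVE_FANCY_SIMD is enabled.
inline bool fancy_simd_would_be_enabled() {
    return missing_fancy_simd_macros().empty();
}
```

Printing the returned list from a test binary makes the otherwise silent fallback visible at build-verification time.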
Adds a docs/build.md section that documents:
- Which macros gate which path (HAVE_FANCY_SIMD for quant GEMM,
__AVX512F__ alone for f16/f32, __AVX512BF16__ for bf16, __AVXVNNI__
for AVX2+VNNI-only CPUs).
- Linux/GCC: GGML_NATIVE=ON (default) handles this automatically on
Zen4 / Sapphire Rapids; just verify with objdump.
- Windows/MSVC and cross-compile: explicit GGML_ARCH_FLAGS with
-D__AVX512* defines is required.
- Note on Zen4 implementing AVX-512 as 256-bit double-pumped.
Documentation only — no code changes, no behavioural changes, no
new CMake options introduced.
Wrap the two slot-level sample/accept call sites in
try/catch (std::exception). On exception: log, send_error to the
task, release the slot, continue serving. Matches the existing
try/catch around common_sampler_init in the same file.
Without this, llama_grammar_accept_token throwing
"Unexpected empty grammar stack after accepting piece: <pad> (0)"
(reproducible on Gemma 4 + json_schema + ctx_shift, see #1725)
unwinds out of update_slots -> queue start_loop -> main, hits
std::terminate, and aborts the whole server process.
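The containment pattern can be sketched as follows. The slot type and loop are simplified stand-ins (all `sketch_*` names are hypothetical); storing the message and clearing `active` stand in for the log, send_error and slot release steps of the real server.

```cpp
#include <exception>
#include <functional>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical, simplified slot: the callable stands in for the
// sample/accept call sites, which may throw.
struct sketch_slot {
    bool                 active = true;
    std::string          error;              // stand-in for log + send_error
    std::function<int()> sample_and_accept;  // may throw std::exception
};

// Processes every active slot; a throwing slot is failed and released
// individually instead of unwinding out of the loop. Returns the number
// of slots that sampled successfully.
inline int update_slots_sketch(std::vector<sketch_slot> & slots) {
    int n_ok = 0;
    for (auto & slot : slots) {
        if (!slot.active) {
            continue;
        }
        try {
            slot.sample_and_accept();
            n_ok++;
        } catch (const std::exception & e) {
            slot.error  = e.what();  // report the error for this task only
            slot.active = false;     // release the slot
            continue;                // keep serving the remaining slots
        }
    }
    return n_ok;
}
```

Without the try/catch, the first throwing slot would propagate the exception out of the loop and, with no handler further up, end in std::terminate.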