llama.cpp

mirror of https://github.com/ggml-org/llama.cpp.git synced 2026-06-27 23:50:20 -05:00

History

Matt Corallo 19620004f5

vulkan: Block-load Q3_K/Q6_K block data and subtract on 32b ints (#23056 )

Q2_K/Q3_K/Q6_K do much better when using MMVQ on Intel BMG even
though they're only 2-byte aligned, and Q3_K still wins on
NVIDIA as well.

mesa isn't all that great at coalescing back-to-back loads from
alternating arrays, so we force it instead. Further, we can do
subtraction directly on a full int32_t rather than an i8vec4
with bit twiddling because the high bit is always free to start.

On Intel BMG on mesa, the switch to MMVQ provides an immediate
~57% perf increase in tg128 for unsloth/Qwen3.5-9B-GGUF:Q3_K and
~78% perf increase in tg128 for unsloth/Qwen3.5-9B-GGUF:Q6_K.

The futher switch to block loads leads to a ~24% perf increase in
tg128 for unsloth/Qwen3.5-9B-GGUF:Q3_K and a ~48% perf increase in
tg128 for unsloth/Qwen3.5-9B-GGUF:Q6_K.

Finally, Xe2 wins on MMVQ even for small k, so we take the NVIDIA
override for K quants on Xe2 as well.

2026-06-01 11:46:48 +02:00

cmake

ggml : Parallelize quant LUT init (#23595 )

2026-05-25 10:15:46 +03:00

include

ggml.h: correct ggml_silu_back arg docstring (a=dy, b=x) (ggml/1500)

2026-05-25 12:38:01 +03:00

src

vulkan: Block-load Q3_K/Q6_K block data and subtract on 32b ints (#23056 )

2026-06-01 11:46:48 +02:00

.gitignore

vulkan : cmake integration (#8119 )

2024-07-13 18:12:39 +02:00

CMakeLists.txt

ggml : bump version to 0.13.1 (ggml/1523)

2026-05-29 09:56:08 +03:00