mirror of
https://github.com/ikawrakow/ik_llama.cpp.git
synced 2026-06-28 04:30:15 -05:00
docs : add AVX-512 build flags reference for Zen4 / Sapphire Rapids+ (#1729)
The IQK quantized GEMM kernels (ggml/src/iqk/iqk_gemm_*.cpp) are gated by HAVE_FANCY_SIMD in iqk_config.h, which requires five AVX-512 macros to be defined: __AVX512F__, __AVX512VNNI__, __AVX512VL__, __AVX512BW__, __AVX512DQ__. If they are not defined, the AVX-512 quantized matmul path is skipped silently: no build warning, no runtime symptom, just lower performance than the hardware can deliver. This surprises users on Windows/MSVC, where -march=native semantics are not propagated.

Adds a docs/build.md section that documents:

- Which macros gate which path (HAVE_FANCY_SIMD for quant GEMM, __AVX512F__ alone for f16/f32, __AVX512BF16__ for bf16, __AVXVNNI__ for AVX2+VNNI-only CPUs).
- Linux/GCC: GGML_NATIVE=ON (default) handles this automatically on Zen4 / Sapphire Rapids; just verify with objdump.
- Windows/MSVC and cross-compile: explicit GGML_ARCH_FLAGS with -D__AVX512* defines is required.
- Note on Zen4 implementing AVX-512 as 256-bit double-pumped.

Documentation only: no code changes, no behavioural changes, no new CMake options introduced.
parent
38c200373f
commit
418d60a909
@ -181,6 +181,76 @@ llama_new_context_with_model: CUDA_Host compute buffer size = 8.31 MiB
gmake CC=/usr/local/bin/clang15 CXX=/usr/local/bin/clang++15 -j4
```

## CPU build flags for AVX-512 (Zen4 / Sapphire Rapids+)

The IQK quantized GEMM kernels in `ggml/src/iqk/iqk_gemm_*.cpp` (the dominant
hot path for quantized prompt processing) are gated by the `HAVE_FANCY_SIMD`
macro defined in
[`ggml/src/iqk/iqk_config.h`](../ggml/src/iqk/iqk_config.h):

```c
#if defined(__AVX512F__) && defined(__AVX512VNNI__) && \
    defined(__AVX512VL__) && defined(__AVX512BW__) && defined(__AVX512DQ__)
#define HAVE_FANCY_SIMD
#endif
```

If these five macros are not all defined at compile time, the AVX-512
quantized matmul path is skipped and the build falls back to AVX2. There is
no warning at build time and no obvious symptom at runtime; performance is
simply lower than what an AVX-512-capable CPU (AMD Zen4 / Intel Sapphire
Rapids+) can deliver. A few related gates are worth knowing about:

- `f16`/`f32` GEMM is gated only by `__AVX512F__`.
- Native `bf16` GEMM and the use of a `bf16` KV cache in flash attention are
  gated by `__AVX512BF16__`.
- For AVX2-only CPUs that implement the VNNI extension (`vpdpbusd`), the
  equivalent "fancy" path is gated by `__AVXVNNI__`. VNNI alone is
  responsible for most of the speedup on quantized matmul.

### Linux / GCC

With `GGML_NATIVE=ON` (the default unless cross-compiling), a recent GCC
resolves `-march=native` on Zen4 / Sapphire Rapids hardware to a target that
defines all of the macros above, so manual configuration is usually
unnecessary. To verify, count the VNNI instructions in the built binary:

```bash
objdump -d build/bin/llama-cli | grep -c vpdpbusd
# A non-trivial count (hundreds) means VNNI compiled in.
# Zero means the IQK kernels fell back to AVX2.
```

### Windows / MSVC and other cases that need explicit defines

MSVC does not propagate `-march=native` semantics, and in cross-compile
scenarios `GGML_NATIVE` is intentionally disabled. In both cases the
macros must be supplied explicitly via `GGML_ARCH_FLAGS`, which the build
system forwards verbatim to the C/C++ compiler line:

```bash
# -D__AVX512BF16__ additionally enables the bf16 path;
# drop it if the target CPU lacks AVX512-BF16.
cmake -B build -DCMAKE_BUILD_TYPE=Release \
  -DGGML_ARCH_FLAGS="-D__AVX512F__ -D__AVX512VNNI__ -D__AVX512VL__ -D__AVX512BW__ -D__AVX512DQ__ -D__AVX512BF16__"
cmake --build build --config Release
```

For AVX2 CPUs that have VNNI but not AVX-512, the equivalent is:

```bash
cmake -B build -DCMAKE_BUILD_TYPE=Release \
  -DGGML_ARCH_FLAGS="-D__AVXVNNI__"
```

After the build completes, the same `objdump | grep -c vpdpbusd` check
confirms that the VNNI quantized path was compiled in.

### Note on Zen4 throughput

On Zen4 the AVX-512 implementation is 256-bit double-pumped: each `_mm512_*`
op issues two micro-ops, for a throughput of roughly one AVX-512 op per two
cycles. The wider register width and reduced loop overhead still produce
measurable gains over AVX2 on prompt processing for IQK kernels.

## Metal Build

On macOS, Metal is enabled by default. Using Metal makes the computation run on the GPU.
