Per @ikawrakow's follow-up suggestion in #1729 to "offer the original version
at the beginning and note that in case that does not work, they can use
GGML_ARCH_FLAGS in that way".
Restructured the docs/build.md AVX-512 section so that the recommended
high-level CMake options come first, with GGML_ARCH_FLAGS as the fallback
for cases where the high-level options don't propagate the necessary
macros (older MSVC, ARM cross-compile, exotic toolchains).
Empirical confirmation that GGML_AVX512_*=ON activates HAVE_FANCY_SIMD:
on MSVC 2022, the resulting compile line (read from build/.../flags.make)
contains both `/arch:AVX512` (from GGML_AVX512=ON) and explicit
`-D__AVX512VNNI__` / `-D__AVX512VBMI__` / `-D__AVX512BF16__` (added by
the matching GGML_AVX512_*=ON options via add_compile_definitions(...)
at ggml/src/CMakeLists.txt:1361-1372). The runtime banner prints
`HAVE_FANCY_SIMD is defined` and `system_info: AVX512_VNNI = 1`.
Also added a brief note about the separate HAVE_VNNI256 gate in
iqk_config.h:52-54, which gives meaningful speedups on AVX2-only CPUs
with the VNNI extension (some Alder/Raptor Lake parts).
Documentation only — no code changes.
The IQK quantized GEMM kernels (ggml/src/iqk/iqk_gemm_*.cpp) are gated
by HAVE_FANCY_SIMD in iqk_config.h, which requires five AVX-512 macros
to be defined: __AVX512F__, __AVX512VNNI__, __AVX512VL__, __AVX512BW__,
__AVX512DQ__. If they are not defined, the AVX-512 quantized matmul
path is skipped silently — no build warning, no runtime symptom, just
lower performance than the hardware can deliver. This surprises users on
Windows/MSVC, where -march=native semantics are not propagated.
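A minimal sketch of the kind of gate described above, useful for checking
whether a given compiler invocation would enable the path. This is not the
text of iqk_config.h: `missing_fancy_simd_macros()` and
`fancy_simd_would_be_enabled()` are hypothetical helper names, and only the
five macro names are taken from the description above. It assumes C++14 or
later.

```cpp
// Hypothetical helper, not code from iqk_config.h: counts how many of the
// five AVX-512 macros listed above are NOT defined for this translation unit.
constexpr int missing_fancy_simd_macros() {
    int missing = 0;
#if !defined(__AVX512F__)
    ++missing;
#endif
#if !defined(__AVX512VNNI__)
    ++missing;
#endif
#if !defined(__AVX512VL__)
    ++missing;
#endif
#if !defined(__AVX512BW__)
    ++missing;
#endif
#if !defined(__AVX512DQ__)
    ++missing;
#endif
    return missing;
}

// Mirrors the "all five must be defined" rule described for HAVE_FANCY_SIMD.
constexpr bool fancy_simd_would_be_enabled() {
    return missing_fancy_simd_macros() == 0;
}
```

Compiled with the same flags as the project, a line such as
`static_assert(fancy_simd_would_be_enabled(), "AVX-512 macros missing");`
turns the otherwise silent fallback into a build error.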
Adds a docs/build.md section that documents:
- Which macros gate which path (HAVE_FANCY_SIMD for quant GEMM,
__AVX512F__ alone for f16/f32, __AVX512BF16__ for bf16, __AVXVNNI__
for AVX2+VNNI-only CPUs).
- Linux/GCC: GGML_NATIVE=ON (default) handles this automatically on
Zen4 / Sapphire Rapids; just verify with objdump.
- Windows/MSVC and cross-compile: explicit GGML_ARCH_FLAGS with
-D__AVX512* defines is required.
- That Zen4 implements AVX-512 as double-pumped 256-bit operations.
Documentation only — no code changes, no behavioural changes, no
new CMake options introduced.
* Fix compilation on clang-cl.exe
Fixes https://github.com/ikawrakow/ik_llama.cpp/issues/1169
See the bitwise arithmetic here: https://clang.llvm.org/doxygen/avx512fintrin_8h_source.html
Clang (like GCC) supports a language feature called Vector Extensions.
To Clang, `__m512i` is not just a "struct" or a "bag of bits"; it is recognized by the compiler as a native vector type.
Because it is a native vector type, Clang automatically maps standard C operators to the corresponding hardware instructions.
When you write `a | b`, Clang sees that a and b are 512-bit integer vectors.
It implicitly understands that the bitwise OR operator (|) applies to these vectors.
It automatically generates the VPORQ (or VPORD) instruction without needing any helper function.
MSVC follows a stricter, more traditional C++ model regarding intrinsics.
In MSVC, __m512i is defined in the header files (<immintrin.h>) as a struct or union (e.g., typedef struct __m512i { ... } __m512i). To the MSVC compiler, it is essentially a user-defined data type, not a fundamental language primitive like int or float.
Standard C++ does not define what `|` means for a user-defined struct.
MSVC does not have the same "Vector Extensions" that automatically apply operators to these structs.
When you write `a | b` in MSVC, the compiler looks for a definition of `operator|` for the `__m512i` struct. Since the standard headers don't provide one, compilation fails with an error.
You must use the explicit intrinsic function provided by Intel/MSVC: `_mm512_or_si512(a, b)`.
To get the nice syntax `(a | b)` in MSVC, you have to manually "teach" the compiler what `|` means by defining the `operator|` overload yourself.
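The overload idea can be sketched portably as follows. This is an illustration only, not the actual change made for clang-cl: `Vec512`, `or_si512` and `splat` are hypothetical stand-ins (for the MSVC-style `__m512i` struct, for `_mm512_or_si512`, and for a broadcast helper), chosen so the example compiles on any compiler and on CPUs without AVX-512.

```cpp
#include <cstdint>

// Hypothetical stand-in for the way MSVC models __m512i: an ordinary
// aggregate holding 512 bits, with no built-in operators.
struct Vec512 {
    std::uint64_t lanes[8];
};

// Stand-in for the explicit intrinsic _mm512_or_si512(a, b):
// a lane-by-lane bitwise OR.
constexpr Vec512 or_si512(const Vec512& a, const Vec512& b) {
    return Vec512{{a.lanes[0] | b.lanes[0], a.lanes[1] | b.lanes[1],
                   a.lanes[2] | b.lanes[2], a.lanes[3] | b.lanes[3],
                   a.lanes[4] | b.lanes[4], a.lanes[5] | b.lanes[5],
                   a.lanes[6] | b.lanes[6], a.lanes[7] | b.lanes[7]}};
}

// The overload that "teaches" the compiler what `|` means for the struct:
// it simply forwards to the explicit function.
constexpr Vec512 operator|(const Vec512& a, const Vec512& b) {
    return or_si512(a, b);
}

// Hypothetical helper for building test values: every lane set to v.
constexpr Vec512 splat(std::uint64_t v) {
    return Vec512{{v, v, v, v, v, v, v, v}};
}
```

With the real type, the overload body would be `return _mm512_or_si512(a, b);`. As an assumption not confirmed by this change, such an overload would likely need to be limited to plain MSVC (for example `#if defined(_MSC_VER) && !defined(__clang__)`), since Clang and GCC already treat `__m512i` as a native vector type.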
* Update README.md with build instructions for Windows
The current README lacks any guide for Windows users, even though the build process on that platform is quite complicated.
* Update build.md with instructions about clang-cl.exe
Adds step-by-step build instructions for Windows
* Apply suggestions from code review
Co-authored-by: Kawrakow <iwankawrakow@gmail.com>
* Polish build.md for Windows usage
Added a usage example for Windows
* Apply suggestions from code review
---------
Co-authored-by: Kawrakow <iwankawrakow@gmail.com>
* Merging mainline - WIP
* Merging mainline - WIP
AVX2 and CUDA appear to work.
CUDA performance seems slightly (~1-2%) lower, as is so often
the case with llama.cpp/ggml after some "improvements" have been made.
* Merging mainline - fix Metal
* Remove check
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>