* opencl: rework FA kernel for f16 and f32
* opencl: flash-attention prefill prepass kernels
- flash_attn_kv_pad_f16 pads the tail KV tile to a BLOCK_N multiple
- flash_attn_mask_pad_f16 pads the matching mask tile
- flash_attn_blk_f16 classifies each KV tile per query block as
fully masked / mixed / fully unmasked, so
the main kernel can skip fully-masked tiles
and the mask lookup for fully-unmasked ones
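A host-side sketch of the two ideas in the prepass list above: padding a length to a BLOCK_N multiple and classifying one mask tile. The names `TileKind`, `pad_to_multiple` and `classify_tile` are hypothetical, and the mask convention (-infinity masks a position, 0.0f leaves it untouched) is an assumption, not something taken from the OpenCL kernels.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical names for illustration; not the kernels' actual identifiers.
enum class TileKind { FullyMasked, Mixed, FullyUnmasked };

// Round n up to the next multiple of block_n (block_n must be > 0).
size_t pad_to_multiple(size_t n, size_t block_n) {
    return (n + block_n - 1) / block_n * block_n;
}

// Classify the mask tile [q0, q1) x [k0, k1) of a row-major n_q x n_kv mask.
// Assumed convention: -infinity masks a position, 0.0f leaves it untouched.
TileKind classify_tile(const std::vector<float>& mask, size_t n_kv,
                       size_t q0, size_t q1, size_t k0, size_t k1) {
    bool all_masked = true;
    bool all_zero   = true;
    for (size_t q = q0; q < q1; ++q) {
        for (size_t k = k0; k < k1; ++k) {
            const float m = mask[q * n_kv + k];
            if (!(std::isinf(m) && m < 0.0f)) all_masked = false;
            if (m != 0.0f)                    all_zero   = false;
        }
    }
    if (all_masked) return TileKind::FullyMasked;   // tile can be skipped
    if (all_zero)   return TileKind::FullyUnmasked; // mask lookup can be skipped
    return TileKind::Mixed;                         // needs the per-element mask
}
```

In this sketch any finite non-zero mask value makes a tile mixed, since the mask would still have to be read for it.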
* opencl: FA kernels for q4_0 and q8_0
* opencl: `set_rows` for f32 to q8_0/q4_0
* opencl: dequant kernels for q4_0 and q8_0
* opencl: add FA tile tuning table with override
* opencl: wire host side for FA
* opencl: q4_0 MoE tensors are also SOA'ed
* opencl: cosmetic fix
* opencl: refactor, also clarify some code paths in comments
* opencl: fix infinity for `-cl-finite-math-only`
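For reference, a minimal CPU sketch of decoding one q4_0 block, the format the dequant and FA entries above operate on. It follows the usual ggml array-of-structs block (32 values, one scale, 16 packed bytes) and stores the scale as `float` for simplicity, which is an assumption: ggml stores it as fp16, and the OpenCL backend may use a different (SoA) layout. `BlockQ4_0` and `dequantize_q4_0` are illustrative names.

```cpp
#include <cassert>
#include <cstdint>

constexpr int QK4_0 = 32; // values per block

// Simplified block: the scale is a float here, not fp16.
struct BlockQ4_0 {
    float   d;              // per-block scale
    uint8_t qs[QK4_0 / 2];  // two 4-bit quants per byte
};

// Decode one block into 32 floats. The low nibbles hold the first half of
// the block and the high nibbles the second half; both are offset by 8.
void dequantize_q4_0(const BlockQ4_0& blk, float* y) {
    for (int j = 0; j < QK4_0 / 2; ++j) {
        const int lo = (blk.qs[j] & 0x0F) - 8;
        const int hi = (blk.qs[j] >> 4) - 8;
        y[j]             = lo * blk.d;
        y[j + QK4_0 / 2] = hi * blk.d;
    }
}
```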
---------
Co-authored-by: Li He <lih@qti.qualcomm.com>
* [CUDA] Added a cudaMemcpy2DAsync fast path to ggml_cuda_cpy
Add a CUDA ggml_cpy fast path for same-type, same-shape strided copies that are just 2D pitched block copies.
When tensors are not fully contiguous but each row is contiguous, it now uses cudaMemcpy2DAsync instead of the slow element-wise scalar copy kernel.
This fixes the GDN recurrent snapshot update with -np 4, where rollback slots are separated by cache stride gaps.
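A host-only sketch of what a 2D pitched block copy does, assuming the same parameter roles as cudaMemcpy2DAsync (destination pitch, source pitch, row width in bytes, row count). `copy_2d_pitched` and `rows_are_contiguous` are hypothetical helpers that run on the CPU with memcpy; they are not the code added to ggml_cuda_cpy.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical eligibility check: elements within a row are densely packed
// on both sides, so each row can be moved as one contiguous byte run.
bool rows_are_contiguous(size_t elem_size, size_t nb0_src, size_t nb0_dst) {
    return nb0_src == elem_size && nb0_dst == elem_size;
}

// CPU stand-in for a 2D pitched copy: `height` rows of `width` bytes each,
// with independent byte pitches for the source and the destination.
void copy_2d_pitched(void* dst, size_t dpitch, const void* src, size_t spitch,
                     size_t width, size_t height) {
    assert(width <= dpitch && width <= spitch);
    auto*       d = static_cast<unsigned char*>(dst);
    const auto* s = static_cast<const unsigned char*>(src);
    for (size_t r = 0; r < height; ++r) {
        std::memcpy(d + r * dpitch, s + r * spitch, width);
    }
}
```

The bytes between `width` and the pitch are never touched, which is what lets rows separated by stride gaps be copied without an element-wise kernel.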
* Add new tests that execute the new optimized strided copy path
* Return unsupported for strided copy in OpenVINO, as new tests are failing
* CUDA: Improve performance via fewer synchronizations between tokens (#17795)
* Adds CPU-to-CUDA copy capability to
ggml_backend_cuda_cpy_tensor_async()
* Adds function to relax sync requirements between input copies on
supported backends (CUDA for now)
* Exchanges synchronous copy with async copy function.
* Adds macro guards to allow compilation in non-CUDA builds
* Reworked backend detection in ggml-backend.cpp to avoid linking
conflicts
* Relax requirement of checks in async CUDA copies from backend and buffer type to just buffer type, to avoid linking issues
* Minor cleanup
* Makes opt-in to relax use of explicit syncs more general. Backends like
vulkan which require a synchronization between HtoD copies and graph
execution could also adopt this change now.
* Reintroduces stricter check for CPU->CUDA backend async copy via
GGML_DEVICE_TYPE_CPU.
* Corrects initialization of ggml_backend_sync_mode in
ggml_backend_sched_split initialization
* Simplifies synchronizations to adhere to `saaasg` pattern.
* Apply suggestion from @ggerganov (src->buffer to buf_src)
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Apply suggestion from @ggerganov (src->buffer to buf_src) v2
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Apply suggestions from @johannesgaessler code review
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* Adds single-GPU synchronizations to multi-GPU settings to fix hip backend pipeline parallel bugs.
* Scheduler Hardening: Exclude hip/MUSA from copy_from_host CPU split ->
GPU split optimization
* Scheduler Hardening: Re-adding original additional synchronizations for
non-async backends
* Adds disclaimer to hip/musa exclusion of copy_from_host. Highlights that it is out of
precaution, but that no perf-impact is visible, and that it can be
revisited separately anytime.
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* vulkan: add INTEL_PRE_XE2 arch enum and enable coopmat1 on Intel Xe-LPG Plus (1/3, Xe1-ARLH)
Co-authored-by: Xia, Jie <jie.xia@intel.com>
Co-authored-by: Liu, Russell <russell.liu@intel.com>
* Address comments of bf16 and trailing whitespace
* Rename INTEL_PRE_XE2 to INTEL_XE1 and remove driver workaround
* Add Windows driver check
---------
Co-authored-by: Xia, Jie <jie.xia@intel.com>
Co-authored-by: Liu, Russell <russell.liu@intel.com>
* ggml-cpu: fix SVE leftover path in ggml_vec_dot_f32
2D convolutions with kernel size 9 produced different results on SVE
enabled ARM devices. After debugging it turned out that ggml_vec_dot_f32
was using data from inactive lanes.
Use svmla_f32_m(pg, sum1, ax1, ay1) so inactive lanes retain sum1.
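A scalar model of the merging behaviour described above, not real SVE code: `mla_merge` mimics a predicated multiply-accumulate in which inactive lanes keep the accumulator. The lane count `VL`, the helper names and the sentinel placed in the inactive lanes are assumptions made for illustration.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

constexpr size_t VL = 4;            // assumed lane count, for illustration only
using Vec  = std::array<float, VL>;
using Pred = std::array<bool, VL>;

// Active lanes compute acc + a * b; inactive lanes return acc unchanged.
Vec mla_merge(const Pred& pg, const Vec& acc, const Vec& a, const Vec& b) {
    Vec r = acc;
    for (size_t i = 0; i < VL; ++i) {
        if (pg[i]) r[i] = acc[i] + a[i] * b[i];
    }
    return r;
}

// Dot product with a predicated leftover step for n not divisible by VL.
float dot_with_tail(const float* x, const float* y, size_t n) {
    Vec  sum{};                     // zero-initialized accumulators
    Pred all;
    all.fill(true);
    size_t i = 0;
    for (; i + VL <= n; i += VL) {
        Vec a, b;
        for (size_t l = 0; l < VL; ++l) { a[l] = x[i + l]; b[l] = y[i + l]; }
        sum = mla_merge(all, sum, a, b);
    }
    if (i < n) {
        Pred pg{};                  // all lanes inactive by default
        Vec  a, b;
        a.fill(1e30f);              // sentinel: must never reach the result
        b.fill(1e30f);
        for (size_t l = 0; i + l < n; ++l) {
            pg[l] = true;
            a[l]  = x[i + l];
            b[l]  = y[i + l];
        }
        sum = mla_merge(pg, sum, a, b);
    }
    float total = 0.0f;
    for (float s : sum) total += s;
    return total;
}
```

With a length of 9, as in the kernel-size-9 convolution case, the tail step has one active lane and the sentinel values in the other three lanes do not affect the sum.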
* cont : clean-up
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Add failing test-case to test-backend-ops
Extracted from https://github.com/ggml-org/llama.cpp/issues/24072
* Minimize repro with help of AI
N = 8 * (65535 - 1) + 1 = 524273
* Port and adjust workaround from 0ba798341e
Fall-back should share code, also relax y-z constraint to be inclusive
* Add test-case + fallback also for y dim
* Fix x-guards: the limit is 2^{31}-1, so inclusive of INT_MAX
* Fix overflow problems for transposed copy kernel
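The repro size above can be checked with a few lines of arithmetic. This sketch assumes that 65535 is a per-dimension block-count limit and that the factor 8 is the number of elements handled per block; both readings are inferred from the formula and are not stated in the log. Under those assumptions N is the smallest size that needs exactly 65535 blocks, which is why an exclusive bound rejects it and an inclusive one accepts it.

```cpp
#include <cassert>
#include <cstdint>

constexpr int64_t kGridLimit = 65535; // assumed per-dimension block-count limit
constexpr int64_t kPerBlock  = 8;     // assumed elements handled per block

constexpr int64_t ceil_div(int64_t a, int64_t b) { return (a + b - 1) / b; }

// Exclusive vs. inclusive comparison against the limit.
constexpr bool fits_exclusive(int64_t n) { return ceil_div(n, kPerBlock) <  kGridLimit; }
constexpr bool fits_inclusive(int64_t n) { return ceil_div(n, kPerBlock) <= kGridLimit; }

// N = 8 * (65535 - 1) + 1
constexpr int64_t kN = kPerBlock * (kGridLimit - 1) + 1;
static_assert(kN == 524273, "matches the value quoted in the log");
static_assert(ceil_div(kN, kPerBlock) == kGridLimit, "needs exactly 65535 blocks");
```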
* Sycl tp stage1 (#1)
* SYCL: tensor parallelism (--split-mode tensor) for dual-GPU
Adds the comm_init/comm_free/comm_allreduce_tensor trio that the
meta-backend queries via get_proc_address to enable backend-specific
all-reduce, mirroring the pattern used by ggml-cuda.cu.
For N=2 (the common dual-GPU case) implements a degenerate ring
all-reduce with two size-branched paths:
* Small (nelem < 32768): FP32 direct memcpy + per-device ADD kernel
chained via depends_on(memcpy_event). 4 SYCL submissions/call.
* Large (nelem >= 32768): BF16-compressed. Each device compresses
FP32 -> BF16 in a local outbox, cross-device memcpys to the peer's
inbox (HALF the PCIe bytes), then decompresses + adds into the
local FP32 partial. 6 SYCL submissions/call but PCIe bytes halved
-- wins for any tensor where PCIe dominates kernel time.
Threshold and BF16 path pattern mirror the CUDA NCCL allreduce.
Storage: ONE persistent uint8_t buffer per device, 4 * nelem bytes
(matches both path layouts: FP32 = nelem floats; BF16 outbox + inbox =
nelem uint16_t, i.e. 2 * nelem bytes, each). Single alloc+free per
device keeps the SYCL pool's strict-LIFO invariant trivial.
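A CPU-only sketch of the BF16-compressed path and the buffer sizing described above, with two std::vector objects standing in for the two devices. The helper names are hypothetical, the round-to-nearest-even conversion is an assumption (the log does not say how the compression rounds), and NaN inputs are not handled.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

constexpr size_t kBf16Threshold = 32768; // nelem >= threshold uses the BF16 path

bool use_bf16_path(size_t nelem) { return nelem >= kBf16Threshold; }

// Both layouts fit in the same 4 * nelem byte scratch buffer.
size_t fp32_layout_bytes(size_t nelem) { return nelem * sizeof(float); }
size_t bf16_layout_bytes(size_t nelem) { return 2 * nelem * sizeof(uint16_t); } // outbox + inbox

// Assumed rounding: round-to-nearest-even on the dropped 16 bits.
uint16_t fp32_to_bf16(float f) {
    uint32_t u;
    std::memcpy(&u, &f, sizeof(u));
    u += 0x7FFFu + ((u >> 16) & 1u);
    return static_cast<uint16_t>(u >> 16);
}

float bf16_to_fp32(uint16_t h) {
    const uint32_t u = static_cast<uint32_t>(h) << 16;
    float f;
    std::memcpy(&f, &u, sizeof(f));
    return f;
}

// Two-device exchange: compress locally, swap outboxes, decompress + add.
void allreduce_pair_bf16(std::vector<float>& dev0, std::vector<float>& dev1) {
    const size_t n = dev0.size();
    assert(dev1.size() == n);
    std::vector<uint16_t> out0(n), out1(n);              // local outboxes
    for (size_t i = 0; i < n; ++i) {
        out0[i] = fp32_to_bf16(dev0[i]);
        out1[i] = fp32_to_bf16(dev1[i]);
    }
    const std::vector<uint16_t> in0 = out1, in1 = out0;  // stands in for the cross-device memcpy
    for (size_t i = 0; i < n; ++i) {
        dev0[i] += bf16_to_fp32(in0[i]);
        dev1[i] += bf16_to_fp32(in1[i]);
    }
}
```

In this sketch each device adds the rounded peer value to its own unrounded partial, so the two results can differ in the low bits when the inputs are not exactly representable in BF16.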
Initial impl handles N=2 FP32 contiguous tensors. Other cases return
false, causing the meta-backend to use its generic butterfly fallback.
Per-call sync is intentionally omitted. SYCL in-order queue semantics
ensure that the meta-backend's next compute on the same per-device
queue waits for our final ADD, and the next allreduce's first op on
the same persistent buffer waits via the same queue. Only comm_free
does an explicit final wait.
OneCCL is NOT used: OneCCL 2021.17 hardcodes single-device-per-process
in communicator_impl.hpp:47 (condition devices.size() == 1), which is
incompatible with llama.cpp's single-process multi-GPU model.
Measured on dual Intel Arc Pro B70 (NEO 26.05.x, oneAPI 2025.3 +
DPC++ nightly):
Llama-3.3-70B Q4_K_M, -sm tensor -fa 1 -ctk f16 -ctv f16:
pp512 = 377.08 t/s (vs 313.65 layer mode = +20.2%)
tg128 = 17.40 t/s (vs 9.74 layer mode = +78.6%)
Qwen3-Coder-Next-80B-A3B Q3_K_M (MoE):
pp512 = 216.56 t/s (vs 156.58 meta-backend butterfly = +38.3%)
tg128 = 17.60 t/s (vs 14.31 meta-backend butterfly = +23.0%)
Qwen3-4B Q4_K_M:
pp64 = 984.51 t/s, tg16 = 49.29 t/s
Llama-3.3-70B in SYCL TP now comfortably beats production layer mode
on both prefill and decode. Coder-Next-80B-A3B (MoE) also wins on
both — the BF16 path is what unlocks the many-medium-allreduces
prefill pattern.
Build/CMake: no changes. No new dependencies. ~210 lines added across
ggml-sycl.h and ggml-sycl.cpp.
* Fix comments
* documentation update to address PR feedback
* Bring over my device-to-device memcpy changes
* move the dev2dev_memcpy calls to the upstream 7-parameter variety
* Fix a typo and remove a trailing whitespace
* hex-mm: new weight layout and fusion updates
* hvx-mm: unroll the new tiled vec_dots to optimize hvx register util
* hex-mm: optimize dyn.quant format for q8_0 and q8_1 to reduce overhead in vec_dots.
* hvx-mm: parallel quantizer per block for large rows
* hvx-mm: simplify and further optimize dyn.quant and vec_dots
* hvx-mm: keep intermediate per tile accumulators in fp16
* hmx-mm: optimize weight dequant by aligning the repacked tiles with the DMA
* hmx-mm: remove qweight scratch and just use vtcm_weight
* hmx-mm: remove all unused and obsolete code
* hmx-mm: the new tiled repack format is here to stay -- rename all x4x2 to _tiled
* hmx-mm: improve activation processing with dma prefetch
* hex-mm: fix hmx/hvx fallback logic and MUL_MAT_ID allocation (unbreaks OLMoE)
* hex-mm: align the weight tiles with dma just like we did in hmx-mm
* hex-mm: factor out common mm bits into htp/matmul-ops.h
* hex-mm: start moving mm kernel selection to the host
* hex-mm: move all of the matmul param compute into the host
* hmx-mm: restore pipelined mode
* hmx-mm: unroll the dequant functions to optimize register usage
* hmx-mm: further improve activation process
* hex-mm: use vtcm_seq_alloc for all vtcm allocations and define more common functions
* hex-mm: improve mm optimizer to account for number of activation threads
* hex-mm: fix matmul-id kernel params selection (unbreaks OLMoE and LFM)
* hexagon: remove support for arch < v73 since HMX is now required for most use-cases
* hex-mm: cleanup naming for consistency
* hex-mm: make sure matmul fusion accounts for vtcm allocation
* hex-mm: minor cleanup for kernel_params definition
* hex-mm: replace hardcoded limits with proper checks for vtcm requirements
* hex-mm: add support for non-tiled mm as a fallback option and factor out hvx kernels into separate header
* hex-mm: remove unused functions
* hex-mm: add shorthand for MM_SELECT in run-tool script
* hvx-mm: factor out hvx/hmx microkernels and unify matmul entry and dispatch
* hex-mm: further cleanup matmul fallback path
* hex-mm: refactor matmul entry point and dispatch a bit further
* hexagon: update cmake build to enable hmx for everything
* hex-ops: optimize kernel_param updates and include summary in the logs
* hex-mm: add support for GGML_HEXAGON_MM_SELECT
* hex-mm: add hex-common header
* hex-mm: pass correct number of tasks to workpool
* hex-mm: add proper checks for no-work in dyn.quant tasks
* hex-mm: convert all quantizers into a macro
* hex-mm: fix hvx-flat fallback to pass all MUL_MAT tests
* hex-mm: vectorize q8_1 quantizer
* hex-mm: improve fused ffn mm stride handling
* hex-mm: consistent use of n_threads and pipeline in kernel_params
* hexagon: minor formatting
* hex-mm: update MUL_MAT_ID kernel_param handling to make sure host/npu are in sync
* hvx-mm: go back to accumulating in fp32 in tiled hvx kernels, more accurate and same perf
* hvx-mm: unroll the loops and remove masking that is not needed for tiled accums
* hmx-mm: optimize activation processing (split loops, some unrolling, etc)
* hmx-mm: minor optimization for output processing
* hex-mm: consistent use of uint32_t and size_t in mm kernels
* hex-mm: remove legacy restrictions for rows to be multiple of 256
* hexagon: replace sprintf with snprintf
* hex-mm: relax hardcoded nrows checks and rely on VTCM size requirements
* hexagon: minor alignment fix
* hexagon: fix trailing spaces
* hex-mm: relax padding from 256 to 128 (leftovers)
* hex-mm: remove redundant checks for weight align to 128
we always use 2D dma for the weights and align them properly
* hmx-mm: MUL_MAT_ID better work distribution between hvx threads and hmx tracing
* hex-mm: specialize per-token mmid activation handling
* hex-profile: update python scripts to handle kernel-params section in the logging output
* hex-mm: move n_prefetch (aka dma_depth) into kernel params and remove unused fields
* hex-trace: use easier to parse format, simplify and fix post-proc scripts
* hmx-mm: relax 32 row limit for output processing which helps utilization
* hmx-mm: use start-chunk idx for tracing info
* hmx-mm: parameterize activation dma pipeline
* hexagon: add support for simple graph caching to avoid recomputing kernel-params
* hex-mm: remove left-over repack functions
* hex-mm: tighten n_prefetch asserts
* hex-mm: remove duplicate round/align_up helper
* hexagon: cleanup common header used in host/npu
* hexagon: update early wakeup threshold
* hmx-mm: define cost constants and update solver to assume that repacked ne[1] is padded to 32
* hmx-mm: make precompute_matmul a bit more readable (split into smaller functions, etc)
* hex-mm: remove n_threads constraint
* hex-mm: minor formatting updates
* hex-mm: remove obsolete profiling logs
* hex-mm: restore hardcode gate to refuse lm-head to avoid repacking that tensor
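As background for the dyn.quant entries in the list above, here is a minimal CPU sketch of quantizing one block of activations to the standard ggml q8_0 scheme (32 values, scale = amax / 127). It is not the optimized Hexagon layout these commits introduce; the scale is kept as `float` instead of fp16 for simplicity, and `BlockQ8_0` and `quantize_q8_0` are illustrative names.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>

constexpr int QK8_0 = 32; // values per block

// Simplified block: the scale is a float here, not fp16.
struct BlockQ8_0 {
    float  d;           // per-block scale
    int8_t qs[QK8_0];   // quantized values in [-127, 127]
};

// Quantize 32 floats: scale by the block's absolute maximum, then round.
BlockQ8_0 quantize_q8_0(const float* x) {
    float amax = 0.0f;
    for (int i = 0; i < QK8_0; ++i) amax = std::max(amax, std::fabs(x[i]));

    BlockQ8_0 blk{};
    blk.d = amax / 127.0f;
    const float id = blk.d != 0.0f ? 1.0f / blk.d : 0.0f; // avoid dividing by zero
    for (int i = 0; i < QK8_0; ++i) {
        blk.qs[i] = static_cast<int8_t>(std::lround(x[i] * id));
    }
    return blk;
}
```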
* vulkan-shaders-gen: fail the build when a shader fails to compile
vulkan-shaders-gen did not detect shader-compile subprocess failures, so a
broken libggml-vulkan could be produced while the build reported success and
the breakage only surfaced at run time. execute_command() discarded the child
exit code (POSIX waitpid passed nullptr for status; the Windows branch never
called GetExitCodeProcess) and string_to_spv decided success only from whether
stderr was empty, so a non-zero exit with empty stderr, or a subprocess that
failed to launch, was treated as success.
Return the child exit code from execute_command() (WEXITSTATUS on POSIX,
GetExitCodeProcess on Windows), treat a non-zero exit or non-empty stderr or a
launch exception as a failure, and record it in an atomic flag. main() checks
the flag after process_shaders() and returns EXIT_FAILURE before writing the
output files, so the build stops instead of emitting a broken backend.
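The POSIX half of this fix can be sketched as follows. This is a hedged illustration, not the actual execute_command(): `run_command` is a hypothetical name, stderr capture is left out, and only the exit-code handling (a real status passed to waitpid, then WEXITSTATUS) is shown.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <vector>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// POSIX-only sketch: run a child process and return its exit code.
int run_command(const std::vector<std::string>& args) {
    if (args.empty()) throw std::invalid_argument("empty command");

    std::vector<char*> argv;
    for (const auto& a : args) argv.push_back(const_cast<char*>(a.c_str()));
    argv.push_back(nullptr);

    const pid_t pid = fork();
    if (pid < 0) throw std::runtime_error("fork failed");
    if (pid == 0) {
        execvp(argv[0], argv.data());
        _exit(127);                      // only reached if exec itself failed
    }

    int status = 0;                      // passing nullptr here would discard the result
    if (waitpid(pid, &status, 0) < 0) throw std::runtime_error("waitpid failed");
    if (WIFEXITED(status)) return WEXITSTATUS(status);
    return -1;                           // child was terminated by a signal
}
```

A caller can then treat any non-zero return value as a failed compile, independently of whether anything was written to stderr.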
Fixes #24393
Signed-off-by: liminfei-amd <91481003+liminfei-amd@users.noreply.github.com>
* vulkan-shaders-gen: simplify compile_failed access and drop unreachable return
Address review feedback on #24450:
- Access the std::atomic<bool> compile_failed directly (= / implicit bool)
instead of .store()/.load(); the flag stays atomic because the worker
threads in process_shaders() set it concurrently.
- Remove the unreachable trailing return -1 in execute_command(): on POSIX the
child _exit()s after execvp and the parent returns (fork()<0 throws); on
Windows the block returns the exit code.
Signed-off-by: liminfei-amd <91481003+liminfei-amd@users.noreply.github.com>
---------
Signed-off-by: liminfei-amd <91481003+liminfei-amd@users.noreply.github.com>
* vulkan: make SQR/SQRT/SIN/COS/CLAMP/LEAKY_RELU use unary.comp
* vulkan: make NORM support noncontig
* add noncontiguous row test cases for norm/l2_norm, handle this in the CPU backend and l2_norm.comp
* fix supports_op for cuda and webgpu
The result-checking and test debug paths in ggml-vulkan.cpp call ggml_graph_compute_with_ctx() to compute a CPU reference graph, but that symbol is defined in ggml-cpu, which ggml-vulkan does not link. Enabling -DGGML_VULKAN_CHECK_RESULTS=ON (or -DGGML_VULKAN_RUN_TESTS=ON) therefore fails to link with an unresolved external (e.g. LNK2019 on MSVC, undefined reference on GCC/Clang). This regressed after ggml-cpu was split into its own library. Link ggml-cpu under those two options so the debug builds link again.
Signed-off-by: Wyatt Caldwell <218154709+Detensable@users.noreply.github.com>
* ggml-cpu: support K tails in Power10 MMA Q8/Q4 matmul
This patch removes the requirement that K be divisible by kc in the tinyBlas_Q0_PPC tiled matmul path. Process the final K panel using its actual depth and pass the reduced panel size through packing and kernel execution. This allows more workloads to use the MMA kernel and reduces fallback to mnpack.
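The K-tail handling can be illustrated with a plain scalar loop that walks K in panels of depth kc and lets the last panel be shallower. This is a generic sketch, not the tinyBlas_Q0_PPC code: `panel_depths` and `dot_paneled` are hypothetical names and no MMA intrinsics or packing are involved.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Depth of each K panel; the final panel uses its actual (possibly smaller) depth.
std::vector<size_t> panel_depths(size_t K, size_t kc) {
    assert(kc > 0);
    std::vector<size_t> depths;
    for (size_t k0 = 0; k0 < K; k0 += kc) {
        depths.push_back(std::min(kc, K - k0));
    }
    return depths;
}

// int8 dot product accumulated panel by panel, with no K % kc == 0 requirement.
int32_t dot_paneled(const int8_t* a, const int8_t* b, size_t K, size_t kc) {
    int32_t acc = 0;
    size_t  k0  = 0;
    for (size_t depth : panel_depths(K, kc)) {
        for (size_t k = 0; k < depth; ++k) {
            acc += int32_t(a[k0 + k]) * int32_t(b[k0 + k]);
        }
        k0 += depth;
    }
    return acc;
}
```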
* Apply suggestion from @taronaeo
Co-authored-by: Aaron Teo <taronaeo@gmail.com>
---------
Co-authored-by: Aaron Teo <taronaeo@gmail.com>
* cuda: add GGML_OP_COL2IM_1D, follow-up to the CPU op
* cuda: col2im_1d use fast_div_modulo for the index decomposition
* cuda: col2im_1d tighten supports_op, type match and contiguous dst
* rename GGML_SYCL_SUPPORT_LEVEL_ZERO to GGML_SYCL_SUPPORT_LEVEL_ZERO_API, and GGML_SYCL_ENABLE_LEVEL_ZERO to GGML_SYCL_USE_LEVEL_ZERO_API
* fix code format
* fix error introduced by rebase
* ggml: Conditionally enable power11 backend based on compiler support
Guard POWER11 backend creation behind a compiler flag check for -mcpu=power11. This avoids build failures on current GCC/Clang toolchains while preserving forward compatibility once POWER11 support becomes available.
* Update CMakeLists.txt
ggml-cpu: Use -mcpu=power10 for P10 and P11
Reuse existing rope kernels with a function constant to toggle forward/backward
rotation, avoiding duplicate kernel code.
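A C++ analogue of the idea, using a template parameter in place of a shader function constant: one rotation routine serves both directions and only the sign of the sine term changes. `rope_pair` is a hypothetical helper operating on a single pair of values; the real kernels handle whole rows, frequency scaling and more.

```cpp
#include <cassert>
#include <cmath>

// Rotate the pair (x0, x1) by +theta (forward) or -theta (backward).
// The compile-time flag plays the role of the function constant.
template <bool forward>
void rope_pair(float& x0, float& x1, float theta) {
    const float c  = std::cos(theta);
    const float s  = forward ? std::sin(theta) : -std::sin(theta);
    const float r0 = x0 * c - x1 * s;
    const float r1 = x0 * s + x1 * c;
    x0 = r0;
    x1 = r1;
}
```

Applying `rope_pair<true>` and then `rope_pair<false>` with the same angle returns the original pair up to rounding, which is the property the backward pass relies on.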
Assisted-by: pi:llama.cpp/Qwen3.6-27B
* metal : add f16 and bf16 support for concat operator
Extend the Metal backend concat operator to support f16 and bf16 tensor
types in addition to the existing f32 and i32 support.
- Template kernel_concat on type T with specializations for float, half,
bfloat, and int
- Add type-specific pipeline getter ggml_metal_library_get_pipeline_concat()
- Update device support check to allow f16 unconditionally and bf16 when
device supports bfloat16
- Update dispatch to select the correct kernel specialization by type
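The templating step in the list above can be sketched on the CPU as one function template instantiated per element type. `concat_dim0` is a hypothetical helper that concatenates two row-major 2D buffers along the innermost dimension; it is not the Metal kernel, and `uint16_t` merely stands in for a 16-bit storage type such as half or bfloat.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Concatenate a (ne0_a x nrows) and b (ne0_b x nrows) row by row.
// The body is identical for every element type T.
template <typename T>
std::vector<T> concat_dim0(const std::vector<T>& a, size_t ne0_a,
                           const std::vector<T>& b, size_t ne0_b, size_t nrows) {
    assert(a.size() == ne0_a * nrows && b.size() == ne0_b * nrows);
    std::vector<T> dst;
    dst.reserve(a.size() + b.size());
    for (size_t r = 0; r < nrows; ++r) {
        dst.insert(dst.end(), a.begin() + r * ne0_a, a.begin() + (r + 1) * ne0_a);
        dst.insert(dst.end(), b.begin() + r * ne0_b, b.begin() + (r + 1) * ne0_b);
    }
    return dst;
}
```

Dispatch then reduces to picking the instantiation that matches the tensor type, for example `concat_dim0<float>` or `concat_dim0<uint16_t>`.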
Assisted-by: pi:llama.cpp/Qwen3.6-27B
* metal : extend concat operator to support f16, bf16, i8, i16 and i64
Assisted-by: pi:llama.cpp/Qwen3.6-27B
* add dev2dev memcpy by SYCL API
* mv GGML_SYCL_DEV2DEV_MEMCPY to runtime table
* update the detect method for p2p comm
* fix the error introduced while resolving a conflict
---------
Co-authored-by: Neo Zhang <NA>