llama.cpp

mirror of https://github.com/ggml-org/llama.cpp.git synced 2026-06-27 23:50:20 -05:00

History

opencl: flash attention improvement (#25069)

* opencl: rework FA kernel for f16 and f32

* opencl: flash-attention prefill prepass kernels

- flash_attn_kv_pad_f16    pads the tail KV tile to a BLOCK_N multiple
- flash_attn_mask_pad_f16  pads the matching mask tile
- flash_attn_blk_f16       classifies each KV tile per query block as
                           fully masked / mixed / fully unmasked, so
                           the main kernel can skip fully-masked tiles
                           and the mask lookup for fully-unmasked ones

* opencl: FA kernels for q4_0 and q8_0

* opencl: `set_rows` for f32 to q8_0/q4_0

* opencl: dequant kernels for q4_0 and q8_0

* opencl: add FA tile tuning table with override

* opencl: wire host side for FA

* opencl: q4_0 MoE tensors are also SOA'ed

* opencl: cosmetic fix

* opencl: refactor, also clarify some code paths in comments

* opencl: fix infinity for `-cl-finite-math-only`
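
The prepass steps listed above (padding the tail KV tile and its mask tile to a BLOCK_N multiple, then classifying each KV tile per query block) can be illustrated with a small host-side sketch. This is not the OpenCL kernel code from the commit. It assumes an additive float mask where 0 means visible and negative infinity means masked, and every name in it (TileKind, kNegInf, pad_to_multiple, pad_mask, classify_tiles, block_m) is hypothetical and chosen only for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical names throughout; assumed mask convention: 0 = visible, -inf = masked.
enum class TileKind { FullyMasked, Mixed, FullyUnmasked };

const float kNegInf = -std::numeric_limits<float>::infinity();

// Round n up to the next multiple of block_n (padded KV length).
std::size_t pad_to_multiple(std::size_t n, std::size_t block_n) {
    assert(block_n > 0);
    return (n + block_n - 1) / block_n * block_n;
}

// Copy a row-major [n_q][n_kv] mask into [n_q][n_kv_padded].
// Padding columns are set to -inf so padded keys never contribute.
std::vector<float> pad_mask(const std::vector<float>& mask, std::size_t n_q,
                            std::size_t n_kv, std::size_t n_kv_padded) {
    assert(mask.size() == n_q * n_kv && n_kv_padded >= n_kv);
    std::vector<float> out(n_q * n_kv_padded, kNegInf);
    for (std::size_t q = 0; q < n_q; ++q) {
        std::copy(mask.begin() + q * n_kv, mask.begin() + (q + 1) * n_kv,
                  out.begin() + q * n_kv_padded);
    }
    return out;
}

// Classify every (query block, KV tile) pair of a padded mask.
// Result is row-major [n_q_blocks][n_kv_tiles].
std::vector<TileKind> classify_tiles(const std::vector<float>& mask, std::size_t n_q,
                                     std::size_t n_kv_padded, std::size_t block_m,
                                     std::size_t block_n) {
    assert(block_m > 0 && block_n > 0 && n_kv_padded % block_n == 0);
    assert(mask.size() == n_q * n_kv_padded);
    const std::size_t n_qb = (n_q + block_m - 1) / block_m;
    const std::size_t n_kb = n_kv_padded / block_n;
    std::vector<TileKind> kinds(n_qb * n_kb);
    for (std::size_t qb = 0; qb < n_qb; ++qb) {
        const std::size_t q_end = std::min(n_q, (qb + 1) * block_m);
        for (std::size_t kb = 0; kb < n_kb; ++kb) {
            bool any_masked = false, any_visible = false;
            for (std::size_t q = qb * block_m; q < q_end; ++q) {
                for (std::size_t k = kb * block_n; k < (kb + 1) * block_n; ++k) {
                    const float m = mask[q * n_kv_padded + k];
                    if (std::isinf(m) && m < 0) any_masked = true;
                    else                        any_visible = true;
                }
            }
            kinds[qb * n_kb + kb] = !any_visible ? TileKind::FullyMasked
                                  : any_masked   ? TileKind::Mixed
                                                 : TileKind::FullyUnmasked;
        }
    }
    return kinds;
}
```

With such a table, a main loop over KV tiles could skip FullyMasked tiles entirely and skip the per-element mask read for FullyUnmasked tiles. Only Mixed tiles would need the mask lookup.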

---------

Co-authored-by: Li He <lih@qti.qualcomm.com>

2026-06-27 15:36:06 -07:00

cmake

ggml : Parallelize quant LUT init (#23595)

2026-05-25 10:15:46 +03:00

include

sycl : support --split-mode tensor (#24152)

2026-06-25 08:35:21 +03:00

src

opencl: flash attention improvement (#25069)

2026-06-27 15:36:06 -07:00

.gitignore

vulkan : cmake integration (#8119)

2024-07-13 18:12:39 +02:00

CMakeLists.txt

ggml : bump version to 0.15.3 (ggml/1550)

2026-06-26 15:04:42 +03:00