2668 Commits

Author SHA1 Message Date
Ruben Ortlam
e82beaa60d
vulkan: add fwht support for Intel with shmem reduction (#23964)
* vulkan: add fwht support for Intel with shmem reduction

* don't use N as workgroup size

* disable subgroup shuffle on MoltenVK AMD

* disable fwht shader on Intel Windows due to driver bug
2026-06-05 19:44:40 +02:00
Charles Xu
3ecfb150a4
kleidiai : dynamic chunck-based scheduling for hybrid execution (#23819) 2026-06-05 10:11:47 +03:00
Oliver Simons
2154a0fdcf
CUDA: enroll mul_mat_vec_q_moe into pdl (#24087)
* Enroll mul_mat_vec_q_moe into PDL, boosting MTP performance on BW

Data collected on a B4500:

Before
```
(llama.cpp) ➜  llama.cpp git:(master) ✗ python mtp-bench.py
  code_python        pred= 192 draft= 150 acc= 116 rate=0.773 tok/s=202.8
  code_cpp           pred= 192 draft= 147 acc= 117 rate=0.796 tok/s=212.8
  explain_concept    pred= 192 draft= 161 acc= 110 rate=0.683 tok/s=196.4
  summarize          pred= 192 draft= 138 acc= 122 rate=0.884 tok/s=226.6
  qa_factual         pred= 192 draft= 138 acc= 121 rate=0.877 tok/s=225.1
  translation        pred= 192 draft= 158 acc= 112 rate=0.709 tok/s=201.5
  creative_short     pred= 192 draft= 160 acc= 110 rate=0.688 tok/s=197.2
  stepwise_math      pred= 192 draft= 150 acc= 115 rate=0.767 tok/s=209.2
  long_code_review   pred= 192 draft= 148 acc= 116 rate=0.784 tok/s=208.9
```
After
```
(llama.cpp) ➜  llama.cpp git:(master) ✗ python mtp-bench.py
  code_python        pred= 192 draft= 150 acc= 116 rate=0.773 tok/s=211.9
  code_cpp           pred= 192 draft= 147 acc= 117 rate=0.796 tok/s=224.6
  explain_concept    pred= 192 draft= 161 acc= 110 rate=0.683 tok/s=207.8
  summarize          pred= 192 draft= 138 acc= 122 rate=0.884 tok/s=240.2
  qa_factual         pred= 192 draft= 138 acc= 121 rate=0.877 tok/s=238.5
  translation        pred= 192 draft= 158 acc= 112 rate=0.709 tok/s=213.4
  creative_short     pred= 192 draft= 160 acc= 110 rate=0.688 tok/s=208.8
  stepwise_math      pred= 192 draft= 150 acc= 115 rate=0.767 tok/s=221.7
  long_code_review   pred= 192 draft= 148 acc= 116 rate=0.784 tok/s=220.7
```

Server launched with:
```
➜  llama.cpp git:(osimons/enroll_mul_mat_vec_q_moe_into_PDL) ✗ ./build-x64-linux-gcc-reldbg/bin/llama-server \
    -m /mnt/share/gguf/unsloth/Qwen3.6-35B-A3B-MTP-GGUF/Qwen3.6-35B-A3B-UD-Q4_K_M.gguf -dio \
    --spec-type draft-mtp \
    --spec-draft-n-max 2 \
    -ngl all \
    -fa on \
    --host 0.0.0.0 \
    --port 8080 -np 1 --chat-template-kwargs "{\"preserve_thinking\": true}"
```

* LC to overlap with following kernels
2026-06-05 08:37:34 +02:00
Mason Milburn
7fe2ae45ab
sycl : port multi-column MMVQ from CUDA backend (#21845)
mmvq:

Port the ncols_dst optimization from ggml-cuda/mmvq.cu to SYCL.
Read weights once per dispatch instead of once per column.
Covers all standard quant types + reorder paths for Q4_0, Q8_0,
Q3_K, Q4_K, Q5_K, Q6_K. IQ types (except IQ4_XS) excluded due to
incompatible vec_dot signatures.

ggml-sycl:

The weight reorder was only bootstrapped on single-token mat-vec
(ne[1] == 1). Speculative / MTP verify issues only multi-column mat-vec,
so it never triggered the reorder and ran on the slower non-reorder
kernel. Bootstrap it on small multi-column batches (ne[1] <= 8) too.
2026-06-05 08:10:31 +03:00
Kartik Sirohi
4c51309617
ggml: vectorize ggml_vec_dot_q4_1_q8_1 with WASM SIMD128 (#22209)
* ggml: vectorize ggml_vec_dot_q4_1_q8_1 with WASM SIMD128

Optimize the inner loop of ggml_vec_dot_q4_1_q8_1_generic using
WASM SIMD128 intrinsics, gated behind #ifdef __wasm_simd128__ so
non-wasm builds are completely unaffected.

Approach:
- single wasm_v128_load covers all 32 packed 4-bit weights
- nibbles unpacked via AND/SHR into two u8x16 registers
- widened to i16 before multiply (WASM SIMD has no i8*i8 instruction)
- 4x wasm_i32x4_dot_i16x8 calls accumulate all 32 element pairs
- horizontal reduce via 4x wasm_i32x4_extract_lane

Benchmark (node v25, emcc -O3 -msimd128, 64 blocks x QK8_1=32,
200k iterations):

| impl   | ns/call | speedup |
|--------|---------|---------|
| scalar |   880.7 |   1.00x |
| simd   |   257.8 |   3.42x |

Correctness verified against scalar reference across 10 random seeds
with exact output match.

* ggml: move q4_1_q8_1 WASM SIMD implementation to wasm backend

Relocate the SIMD128 implementation of ggml_vec_dot_q4_1_q8_1 to ggml/src/ggml-cpu/arch/wasm/quants.c to follow architecture-specific layout. Restore the generic implementation in ggml/src/ggml-cpu/quants.c.
Move for loop in the else block.

* ggml: use generic q4_1_q8_1 fallback in wasm backend
2026-06-04 16:12:38 +03:00
Georgi Gerganov
3d1998634e
metal : reduce rset heartbeat from 500ms -> 5ms (#24074) 2026-06-04 08:05:32 +03:00
Reese Levine
e8c54893f2
ggml-webgpu: FlashAttention refactor + standardize quantization support (#23834)
* Start work on flash_attn refactor

* Refactor

* Split k/v quantization

* Refactor and abstract quantization logic for flash_attn and mul_mat

* Add quantization support to tile path

* formatting

* Move to functions, add a check
2026-06-04 08:05:04 +03:00
rehan-10xengineer
3c7450cee1
ggml-cpu: extend RVV quantization vec dot to higher VLENs (#22754)
* ggml-cpu: add rvv 512b,1024b impls for iq4_xs

* ggml-cpu: refactor; add rvv 512b, 1024b impls for q6_K, i-quants

* ggml-cpu: refactor; add 512 and 1024 implementations of tq3_s, iq3_xxs, iq2_s, iq2_xs, iq2_xxs

improve iq2_xs impl for rvv 256

Co-authored-by: Rehan Qasim <rehan.qasim@10xengineers.ai>

---------

Co-authored-by: taimur-10x <taimur.ahmad@10xengineers.ai>
Co-authored-by: Rehan Qasim <rehan.qasim@10xengineers.ai>
2026-06-04 08:03:40 +03:00
Andreas Kieslinger
9e58d4d692
Avoid PDL race conditions by disabling __restrict__ when PDL is used (#24030)
* Removes __restrict__ from PDL kernel headers due to incompatibility with
PDL. Adds preprocessor directives based on arch in kernel body to add
__restrict__ to retain performance on older architectures.

* Simplifies new __restrict__ usage via macro

* Add hopper to PDL __restrict__ fix.

Co-authored-by: Oliver Simons <osimons@nvidia.com>

---------

Co-authored-by: Oliver Simons <osimons@nvidia.com>
2026-06-03 13:56:42 +02:00
Charles Xu
3571fa5435
ggml-cpu: use runtime SVE width in FWHT (#24059) 2026-06-03 13:45:10 +03:00
Aman Gupta
f8f0a47a55
cuda: reserve space for quantize kv-cache at startup (#23907)
* cuda: reserve space for quantize kv-cache at startup

* address review comments

* remove forward decl

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* remove assert in ggml-cuda.cu

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2026-06-03 18:39:59 +08:00
lhez
63e66fdd23
opencl: use flat variants of q4_K and q6_K gemv for very large M (#24006) 2026-06-02 14:16:17 -07:00
Max Krasnyansky
5c394fdc8b
hexagon: profiler output fix and script updates (#24042)
* hex-ops: fix profiler output (ie remove the redundant NONEs)

* hex-prof: update profiling script to support tot.usec column
2026-06-02 14:08:29 -07:00
Max Krasnyansky
8f7f3bf141
hexagon: MUL_MAT, MUL_MAT_ID, FLASH_ATTN and GDN cleanup and optimizations for latest models (#23989)
* hex-mm: initial support for F32 * F32 -> F32 matmuls

* hex-rms-norm: fix src1 stride use in fused rms_norm_mul

* hex-ops: clear spad pointers in the ops that clober it

This fixes an odd case where fused rms-norm-mul was failing but only in qwen3.5-2B and only at searth op-bath sizes.

* hmx-mm: add support for F32 * F32 -> F32 matmul_2d on HMX

Decided to use Q4_0 * F32 -> F32 matmul for this.
Q4_0 gets dequantized and tiled into F16, and here we quantize and tile F32 into F16.
Super simple and pretty efficient.

* hmx-mm: route f16 2D matmuls through the same kernel used for all other types

* hmx-mm: re-introduce pipelined vs non-pipelined mode that we used to have but is much more generic way

This update futher improves matmul performance and at the same time removes most of the redudant logic
we had in different paths.

* hmx-fa: slighlty improved pipeline simimar to matmul updates

* hmx-mm: initial version of MAT_MUL_ID support for HMX

* hmx-mm: fixed mxfp4 handling for MUL_MAT_ID

* hex-gdn: optimize GATED_DELTA_NET

DMA prefetch/double-buff, vectorize everything with HVX, in other words -- the usual :)

* hmx-mm: missed one more case where we can use fastmod

* hexagon: update DCVS settings for a slight perf bump

* hmx-fa: use fastdiv in hmx-flash-attn

* hmx-fa: precompute slope values to avoid disrupting the inner loop

* hvx-utils/fa: new HVX helpers for powf and logf and using those to speed up FA alibi

* hex-ops: fixed a bug in fusion logic that was messing up the order of the src tensors when some srcs are empty

* hex-fa: correctly fallback to HVX if we have sinks or the dims are not quite right
2026-06-01 23:40:08 -07:00
Todor Boinovski
d178a11818
hexagon: add gelu_quick (#24007) 2026-06-01 23:19:07 -07:00
Anav Prasad
1fd5f48037
clean up unused variables warnings (#23975) 2026-06-02 10:38:37 +08:00
lhez
210a6570ce
opencl: fix compiler warnings for non-adreno path (#23922)
* opencl: fix compiler warnings for non-adreno path

* opencl: fix const cast warning
2026-06-01 19:15:09 -07:00
Masashi Yoshimura
b8275a8acc
revert to using global_invocation_id for cpy shader (#23955) 2026-06-01 16:59:06 -07:00
shaofeiqi
27d9ed8397
opencl: add basic support for q5_0 and q5_1 (#23548)
* opencl: add general q5_0 support

* opencl: add general q5_1 support

* opencl: support non-uniform workgrp size

---------

Co-authored-by: Li He <lih@qti.qualcomm.com>
2026-06-01 10:06:50 -07:00
Shrivas Shankar
95b8b8ec1a
metal: template GLU kernels to support f16/f32 (#23882)
Drops the hardcoded f32 GLU kernels in favor of a single template. We now load/store in the native tensor type (half or float) to save memory bandwidth, but keep the actual ALU compute in float to avoid exploding math in geglu/swiglu. Also opened up the dispatch gate to allow f16 inputs.
2026-06-01 15:40:28 +03:00
Jeff Bolz
55ac0909e5
vulkan: don't hold the device mutex while compiling pipelines (#23641)
* vulkan: don't hold the device mutex while compiling pipelines

We need to hold a lock while we traverse all pipelines and lazily initialize
them, but we don't need to hold it while the pipeline is being compiled. And
it doesn't need to be the same lock as the device mutex. We call load_shaders
each time a pipeline is needed, so we only need to compile that one pipeline
(and, for example, don't want to end up compiling a pipeline that another
thread should be compiling).

* remove 'needed'
2026-06-01 14:04:01 +02:00
Winston Ma
bef69f1306
vulkan: reduce host memory lock contention (#23376)
* vulkan: reduces lock contention

* replace unique_lock with lock_guard
2026-06-01 14:03:32 +02:00
Johannes Gäßler
8e6fff84de
TP: quantized KV cache support (#23792)
* TP: quantized KV cache support

* fix partial view

* remove overly strict assert
2026-06-01 12:30:10 +02:00
Matt Corallo
19620004f5
vulkan: Block-load Q3_K/Q6_K block data and subtract on 32b ints (#23056)
Q2_K/Q3_K/Q6_K do much better when using MMVQ on Intel BMG even
though they're only 2-byte aligned, and Q3_K still wins on
NVIDIA as well.

mesa isn't all that great at coalescing back-to-back loads from
alternating arrays, so we force it instead. Further, we can do
subtraction directly on a full int32_t rather than an i8vec4
with bit twiddling because the high bit is always free to start.

On Intel BMG on mesa, the switch to MMVQ provides an immediate
~57% perf increase in tg128 for unsloth/Qwen3.5-9B-GGUF:Q3_K and
~78% perf increase in tg128 for unsloth/Qwen3.5-9B-GGUF:Q6_K.

The futher switch to block loads leads to a ~24% perf increase in
tg128 for unsloth/Qwen3.5-9B-GGUF:Q3_K and a ~48% perf increase in
tg128 for unsloth/Qwen3.5-9B-GGUF:Q6_K.

Finally, Xe2 wins on MMVQ even for small k, so we take the NVIDIA
override for K quants on Xe2 as well.
2026-06-01 11:46:48 +02:00
Winston Ma
f8c0a19d46
vulkan: Removed unused functions (#23175) 2026-06-01 11:46:23 +02:00
Neo Zhang
a51142497a
[SYCL] Support Q4_1, Q5_0, Q5_1 in Flash-attention (#23812)
* support Q4_1, Q5_0, Q5_1

* update ut case
2026-06-01 09:53:53 +03:00
Neo Zhang
4162522688
[SYCL] Add more types in GET_ROWS OP (#23710)
* add to support Q1_0, NVFP4, IQ2_XXS, IQ2_XS, IQ2_S, IQ3_XXS, IQ1_S, IQ1_M, IQ3_S, IQ4_NL, IQ4_XS, I32, MXFP4, Q2_K, Q3_K, Q5_K, and Q6_K in GET_ROWS OP

* correct the link
2026-06-01 09:53:04 +03:00
Neo Zhang
44e211cecf
sycl : Optimize Q3_K mul_mat by reorder (#23725) 2026-06-01 09:50:55 +03:00
lhez
d6588daa80
opencl: support bf16 by converting to f16 (#23839) 2026-05-30 10:17:47 -07:00
Georgi Gerganov
2d9b7c8e98
metal : restore im2col implementation for large kernels (#23901) 2026-05-30 15:26:13 +03:00
Jinyang He
d48a56effb
ggml : add some lsx support (#23798)
* loongarch : optimize LSX fp16 load/store with native intrinsics

Use __lsx_vfcvtl_s_h and __lsx_vfcvt_h_s instead of scalar loops in
__lsx_f16x4_load and __lsx_f16x4_store.

* loongarch : add LSX implementation for q8_0 dot product

* loongarch : add LSX implementation for q6_K dot product

* loongarch : add LSX implementation for iq4_xs dot product

* Improve reduce ops when sun int16 pairs to int32
2026-05-30 11:53:26 +03:00
Ruben Ortlam
6e093b80ea
vulkan: add Flash Attention support for BFloat16 KV cache (#23420)
* vulkan: add flash attention bf16 kv support

* vulkan: bf16 FA coopmat1 support

* vulkan: bf16 FA coopmat2 support

* fix FA bf16 f32 fallback

* fix FA bf16 coopmat1 shader

* fix FA bf16 coopmat2 shader

* code cleanup

* cleanup comment change

* address feedback

* add O_TYPE for cm2 FA

* use O_TYPE for gqaStore function

* reduce BFLOAT16 ifdefs
2026-05-30 10:39:31 +02:00
Reese Levine
151f3a98e9
ggml-webgpu: Check earlier for WebGPU required features (#23879) 2026-05-29 14:16:05 -07:00
Reese Levine
b22da25889
ggml-webgpu: add q4_0/q8_0 SET_ROWS (#23760)
* Add q8_0 and q4_0 set_rows

* Add fast(er) quantization set_rows path

* formatting/naming

* a little more naming

* Remove unused constant

* Don't override other override

* Avoid bitcast

* Narrow relaxation
2026-05-29 14:14:11 -07:00
Oliver Simons
6ed481eea4
CUDA: Check PTX version on host side to guard PDL dispatch (#23530)
* CUDA: Check PTX version on host side to guard PDL dispatch

Checking on `__CUDA_ARCH_LIST__` alone is insufficient for JIT, as this
variable doesn't differentiate between compiling for say sm_90, sm_90a
or sm_90f (so forward-jittable PTX vs. arch/family-specific PTX).

Thus, one can have a bug when compiling with
`DCMAKE_CUDA_ARCHITECTURES="89;90a"`, where current code would wrongly
dispatch to PDL on sm_90/sm_120 in forward-JIT mode.

This PR fixes this issue by checking `cudaFuncAttributes::ptxVersion` of
the incoming kernel at runtime. A check on ptxVersion alone is
sufficient, as device-codes will always be >= ptxVersion (and any
violation of this would be a severe bug in CUDA/nvcc), see:
 https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/#gpu-code-code-code

* Implement MurmurHash3 mixer for better hash distribution

Magic constants were taken from boost:
2698b43803/include/boost/container_hash/detail/hash_mix.hpp (L19-L65)

* Update ggml/src/ggml-cuda/common.cuh

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Address review comments, make seed non-zero

* Apply code-formatting

* Replace std::size_t -> size_t for consistency

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2026-05-29 12:28:18 +02:00
fairydreaming
1f0aa2a696
model : support for DeepseekV32ForCausalLM with generic DeepSeek Sparse Attention (DSA) implementation (#23346)
* llama : support DeepSeek V3.2 model family (with DSA lightning indexer)

* convert : handle DeepseekV32ForCausalLM architecture

* ggml : support for f16 GGML_OP_FILL

* memory : separate hparams argument in llama_kv_cache constructor

* memory : add llama_kv_cache_dsa memory (KV cache + lightning indexer cache)

* llama : support for LLM_ARCH_DEEPSEEK32

* model : llama_model_deepseek32 implementation

* model : merge two scale operations into one in DSA lightning indexer implementation

* chore : remove unused code

* model : support NVFP4 in DeepSeek V3.2

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* memory : refactoring TODO

Co-authored-by: ggerganov <ggerganov@users.noreply.github.com>

---------

Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: ggerganov <ggerganov@users.noreply.github.com>
2026-05-29 10:15:17 +02:00
Georgi Gerganov
ea02bc37f5 ggml : bump version to 0.13.1 (ggml/1523) 2026-05-29 09:56:08 +03:00
Andreas Kieslinger
241cbd41d2
cuda : disables launch_fattn PDL enrollment due to compiler bug (#23825) 2026-05-29 07:46:10 +03:00
Matt Corallo
33c718db1f
meta : Add missing buffer set in allreduce fallback !COMPUTE clear (#23480)
Without this at least the vulkan backend will skip the `* 0` for
!COMPUTE tensors, causing corrupt output.
2026-05-29 06:30:24 +03:00
Max Krasnyansky
19e92c33ef
hexagon: basic/generic op fusion support and RMS_NORM+MUL fusion (#23835)
Updating infra to enable op fusion and using RMS_NORM+MUL as the use-case.
2026-05-28 14:05:54 -07:00
lhez
408ae2b9e5
opencl: move backend info printing into its own function (#23702)
* opencl: move backend info print into its own function

* opencl: move new log line

* opencl: fix for non adreno path
2026-05-28 11:05:42 -07:00
fl0rianr
30af6e2b98
ggml: auto apply iGPU flag CUDA/HIP if integrated device (#23007) 2026-05-28 15:01:14 +02:00
redfox
d7be46189f
mmvq Optim: add MMVQ_PARAMETERS_TURING(mmvq_parameter_table_id) for … (#23729)
* mmvq Optim:  add MMVQ_PARAMETERS_TURING(mmvq_parameter_table_id) for SM75 TURING

* avoid a mismatch for JIT compilation of Turing device code for Ampere or newer

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

---------

Co-authored-by: Copilot <copilot@github.com>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2026-05-28 14:51:14 +02:00
Jaden_Mach
bc81d47aba
CUDA: route batch>=4 quantized matmul to MMQ on AMD MFMA hardware (#23227)
* CUDA: per-quant MMVQ/MMQ batch threshold on AMD MFMA hardware

The dispatcher uses a single global threshold (MMVQ_MAX_BATCH_SIZE = 8)
to choose between mul_mat_vec_q (per-row GEMV) and mul_mat_q (MFMA-tiled
GEMM) for quantized matmul. On AMD CDNA, the optimal crossover differs
substantially by quant family because the per-row GEMV cost is dominated
by dequantisation, not the dot-product itself: K-quants pay a heavier
super-block decode and so MMQ wins sooner; legacy and IQ quants have
lean decode and stay ahead until the batch fully populates an MFMA tile.

This patch introduces ggml_cuda_should_use_mmvq(type, cc, ne11) -> bool,
mirroring the existing ggml_cuda_should_use_mmq, and gates per-quant
thresholds on amd_mfma_available(cc):

  Q3_K, Q4_K, Q5_K  : MMVQ <= 3   (MMQ wins from batch=4: +5% .. +76%)
  Q2_K, Q6_K        : MMVQ <= 5   (MMQ wins from batch=6: +8% .. +35%)
  others            : MMVQ <= 8   (legacy & IQ regress under MMQ; unchanged)

Non-AMD-MFMA paths (NVIDIA, RDNA, CDNA1 without MFMA) are byte-identical
to master. GGML_CUDA_FORCE_MMVQ=1 restores the original global threshold
for A/B testing.

Measured on MI250X (gfx90a, ROCm 7.2.1) with Llama-3.2-3B-Instruct,
llama-bench pp512 across all 20 supported quants, ubatch 1..8, 10 reps.
Full table in PR description.

  Selected pp512 throughput (tok/s, ub=8):
    Q4_K_S:  559 -> 940  (+68%)
    Q5_K_S:  503 -> 884  (+76%)
    Q3_K_S:  629 -> 879  (+40%)
    Q2_K  :  615 -> 809  (+32%)
    Q6_K  :  582 -> 776  (+33%)

  Selected pp512 throughput (tok/s, ub=4):
    Q4_K_S:  444 -> 480  (+ 8%)
    Q4_0  :  682 -> 685  (+ 0%)   (no regression - retains MMVQ)
    IQ4_XS:  706 -> 698  (- 1%)   (no regression - retains MMVQ)

* CUDA: address review — inline MMVQ batch table, drop env hatch & doc block

* tune kernel selection logic for CDNA1

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2026-05-28 14:50:25 +02:00
Max Krasnyansky
a919001134
hexagon: minor refresh for HMX FA and MM (#23796)
* hex-fa: clean up qf32/fp32 handling and stride handling

* hex-fa: fix corner case fp NAN issues that were cause bad output from gemma4 on v79

* hex-fa: vectorize leftover handling

* hex-fa: avoid HVX fallback during token gen HMX has more FP16 compute capacity

* hmx-mm: remove dead code

* hmx-mm: use fastdiv in x4x2 dequant

* hmx-mm: sandwich dequant and scatter to improve perf

* hmx-mm: fixed rebase conflicts

* hmx-mm: further improve weight dequant by doing early type dispatch and precomputing fastdiv

* hmx-mm: an even earlier dispatch for per-type dequant

* hmx-mm: dequant linear types like q4_0 and q4_1 without the LUTs

This is a bit faster than LUT.

* hex-cmake: one more tweak for lto

---------

Co-authored-by: Trivikram Reddy <tamarnat@qti.qualcomm.com>
2026-05-28 04:49:11 -07:00
Jeff Bolz
48e7078ee0
vulkan: fast path for walsh-hadamard transform (#23687)
* vulkan: fast path for walsh-hadamard transform

* disable for intel due to segfault
2026-05-28 13:18:43 +02:00
Winston Ma
7c48fb81ce
vulkan: fix wrong index variable in inner loop (#23665) 2026-05-28 12:48:34 +02:00
Winston Ma
91eb8f4fa0
vulkan: Fix memory logger unsafe iterator access (#23667) 2026-05-28 12:46:07 +02:00
fairydreaming
09e7b76c93
cuda : fix KQ mask offset integer overflow in fattn MMA kernel (#23610)
Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
2026-05-28 10:55:42 +02:00
Martin Klacer
e31cdaa0eb
ggml: fixed Arm SVE usage bug in vec.h, vec.cpp (#22841)
* Updated vec.h/vec.cpp code to accumulate to F32 rather than F16



Change-Id: I0cb789347f2bf60ffaf9047319f727e788c825f8

Signed-off-by: Martin Klacer <martin.klacer@arm.com>
Co-authored-by: Milos Puzovic <Milos.Puzovic@arm.com>
2026-05-28 10:04:21 +03:00