ik_llama.cpp

mirror of https://github.com/ikawrakow/ik_llama.cpp.git synced 2026-06-28 04:30:15 -05:00


fa: preserve early-termination, fix multi-slot correctness via union of masks (#1880)

* fa: fix FlashQKV early-termination causing S=0 assertion with --parallel N>1

The backward-scan optimization in compute_helper/compute_helper_q checks
only one mask position per k_step block on the last query row (q_step-1)
to find where valid KV entries end. When q_step > 1 and different query
rows have non-overlapping valid KV regions (multi-slot / --parallel N>1),
the scan on the last row's mask can miss blocks that contain valid entries
for earlier rows. This causes those rows to accumulate S=0, triggering
the GGML_ASSERT(S > 0) in normalize_and_store_1row.

Fix: remove the early-termination scan at all 4 sites and iterate all
nk1/k_step blocks unconditionally. The mask already handles correctness:
fully-masked blocks produce smax=-inf and skip V accumulation, so the
performance cost is minimal for TG (small nq1) and acceptable for PP.
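
The claim that fully-masked blocks are harmless can be illustrated with a minimal single-row sketch. This is not the code in compute_helper: the name attend_row, the one-scalar-per-key value layout, and the float mask convention (0 for visible, -inf for masked) are assumptions made for illustration only.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical result of attending one query row: S is the softmax normalizer.
struct RowResult { float S; float out; };

// Hypothetical single-row online-softmax accumulation over k_step-sized blocks.
// scores/mask/values each hold at least nk_limit entries; the head dimension is
// reduced to one scalar per key to keep the sketch short.
static RowResult attend_row(const std::vector<float>& scores,
                            const std::vector<float>& mask,
                            const std::vector<float>& values,
                            int k_step, int nk_limit) {
    assert(k_step > 0);
    float M = -INFINITY;  // running maximum of masked scores
    float S = 0.0f;       // running sum of exp(score - M)
    float acc = 0.0f;     // running weighted sum of values
    for (int k0 = 0; k0 < nk_limit; k0 += k_step) {
        const int k1 = std::min(k0 + k_step, nk_limit);
        float smax = -INFINITY;
        for (int k = k0; k < k1; ++k) smax = std::max(smax, scores[k] + mask[k]);
        if (smax == -INFINITY) continue;  // fully-masked block: nothing to accumulate
        const float newM = std::max(M, smax);
        const float scale = (M == -INFINITY) ? 0.0f : std::exp(M - newM);
        S *= scale;
        acc *= scale;
        for (int k = k0; k < k1; ++k) {
            const float s = scores[k] + mask[k];
            if (s == -INFINITY) continue;  // masked key inside a partly visible block
            const float p = std::exp(s - newM);
            S += p;
            acc += p * values[k];
        }
        M = newM;
    }
    return { S, S > 0.0f ? acc / S : 0.0f };
}
```

For a row whose only visible keys sit in the second block, stopping after the first block (nk_limit = k_step) leaves S at 0, which is the condition GGML_ASSERT(S > 0) guards against. Iterating both blocks skips the masked one and yields S > 0.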

Fixes #809

* fa: refactor multi-slot mask fix into mask_effective_nk1() helper

Replace 4× inlined early-termination scans with a shared helper that
computes the effective K boundary by scanning ALL query mask rows
(union-of-masks). This is the minimal fix for multi-slot parallel
inference where different slots have different sequence lengths.

The helper returns the k_step-aligned boundary covering the longest
active sequence across all rows, preserving single-slot performance
(single row = same boundary as before).
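
A rough sketch of the union-of-masks idea, contrasted with a last-row-only scan. The function names, the row-major float mask (0 for visible, -inf for masked), and the signatures are hypothetical and are not taken from mask_effective_nk1(); the actual helper may differ in mask type and scan strategy.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical: returns the smallest multiple of k_step (capped at nk) that
// covers the last visible key of ANY query row. mask is row-major [nq][nk].
static int effective_nk_union(const std::vector<float>& mask, int nq, int nk, int k_step) {
    assert(k_step > 0 && nq > 0 && nk >= 0);
    assert(mask.size() == static_cast<std::size_t>(nq) * static_cast<std::size_t>(nk));
    int last_valid = -1;  // highest visible key index seen so far across rows
    for (int q = 0; q < nq; ++q) {
        const float* row = mask.data() + static_cast<std::size_t>(q) * nk;
        // Only keys beyond the current boundary can extend it.
        for (int k = nk - 1; k > last_valid; --k) {
            if (row[k] != -INFINITY) { last_valid = k; break; }
        }
    }
    if (last_valid < 0) return 0;  // every row is fully masked
    const int bound = (last_valid / k_step + 1) * k_step;
    return bound < nk ? bound : nk;
}

// Hypothetical: the same boundary computed from the last query row only,
// which is the behaviour that misses keys visible to earlier rows.
static int effective_nk_last_row(const std::vector<float>& mask, int nq, int nk, int k_step) {
    assert(nq > 0);
    assert(mask.size() == static_cast<std::size_t>(nq) * static_cast<std::size_t>(nk));
    const std::vector<float> last(mask.begin() + static_cast<std::ptrdiff_t>(nq - 1) * nk,
                                  mask.end());
    return effective_nk_union(last, 1, nk, k_step);
}
```

With two rows, nk = 8 and k_step = 4, where row 0 sees keys 4..6 and row 1 sees keys 0..2, the last-row scan stops at 4 and leaves row 0 with no visible keys, while the union scan returns 8. For a single row both functions agree.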

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Turbomen008 <Turbomen008@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

2026-05-26 16:16:49 +03:00

cmake

Merge mainline llama.cpp (#3)

2024-07-27 07:55:01 +02:00

include

MLA TP -khad: ggml_dequant_hadamard fused op + wv_b/wk_b_pp Hadamard fold (#1852)

2026-05-21 07:29:15 +03:00

src

fa: preserve early-termination, fix multi-slot correctness via union of masks (#1880)

2026-05-26 16:16:49 +03:00

.gitignore

Merge mainline llama.cpp (#3)

2024-07-27 07:55:01 +02:00

CMakeLists.txt

ggml : default GGML_WIN_VER to 0x0A00 (Windows 10) (#1755)

2026-05-08 13:23:04 +03:00