mirror of
https://github.com/ikawrakow/ik_llama.cpp.git
synced 2026-06-28 04:30:15 -05:00
* fa: fix FlashQKV early-termination causing S=0 assertion with --parallel N>1

The backward-scan optimization in compute_helper/compute_helper_q checks only one mask position per k_step block on the last query row (q_step-1) to find where valid KV entries end. When q_step > 1 and different query rows have non-overlapping valid KV regions (multi-slot / --parallel N>1), the scan on the last row's mask can miss blocks that contain valid entries for earlier rows. This causes those rows to accumulate S=0, triggering the GGML_ASSERT(S > 0) in normalize_and_store_1row.

Fix: remove the early-termination scan at all 4 sites and iterate all nk1/k_step blocks unconditionally. The mask already handles correctness: fully-masked blocks produce smax=-inf and skip V accumulation, so the performance cost is minimal for TG (small nq1) and acceptable for PP.

Fixes #809

* fa: refactor multi-slot mask fix into mask_effective_nk1() helper

Replace 4× inlined early-termination scans with a shared helper that computes the effective K boundary by scanning ALL query mask rows (union-of-masks). This is the minimal fix for multi-slot parallel inference where different slots have different sequence lengths.

The helper returns the k_step-aligned boundary covering the longest active sequence across all rows, preserving single-slot performance (single row = same boundary as before).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Turbomen008 <Turbomen008@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
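
Illustrative sketch (not the code from this commit): the union-of-masks boundary described for mask_effective_nk1() can be approximated as below. Everything here is an assumption made for illustration: the function names effective_nk1_union and effective_nk1_last_row are hypothetical, the mask is modelled as a dense row-major float matrix with -inf marking masked KV positions, and the old scan is assumed to probe the first position of each k_step block. The real helper's signature, mask type, and strides may differ.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// A KV position is treated as masked when its mask value is -inf (assumed layout).
static bool is_masked(float v) { return std::isinf(v) && v < 0.0f; }

// Hypothetical union-of-masks boundary: find the highest KV index that is
// unmasked in ANY query row, then round up to a multiple of k_step.
static int effective_nk1_union(const std::vector<float>& mask,
                               int n_rows, int nk1, int k_step) {
    assert(k_step > 0 && nk1 % k_step == 0);
    assert(mask.size() == static_cast<std::size_t>(n_rows) * nk1);
    int last_valid = -1;  // highest unmasked KV index seen so far
    for (int r = 0; r < n_rows; ++r) {
        const float* row = mask.data() + static_cast<std::size_t>(r) * nk1;
        // Scan backward; stop early once this row cannot raise last_valid.
        for (int k = nk1 - 1; k > last_valid; --k) {
            if (!is_masked(row[k])) { last_valid = k; break; }
        }
    }
    // Round (last_valid + 1) up to a whole number of k_step blocks.
    return ((last_valid + k_step) / k_step) * k_step;
}

// For contrast, a sketch of the scan the commit describes as buggy: it looks
// only at the last row and probes one position per k_step block.
static int effective_nk1_last_row(const std::vector<float>& mask,
                                  int n_rows, int nk1, int k_step) {
    assert(k_step > 0 && nk1 % k_step == 0);
    const float* row = mask.data() + static_cast<std::size_t>(n_rows - 1) * nk1;
    int n_blocks = nk1 / k_step;
    while (n_blocks > 0 && is_masked(row[(n_blocks - 1) * k_step])) --n_blocks;
    return n_blocks * k_step;
}
```

With two query rows from different slots, where row 0 is valid only at KV positions 4..7 and row 1 only at 0..3 (nk1 = 8, k_step = 4), the last-row scan stops at 4 and row 0 never sees a valid entry, which corresponds to the S=0 case. The union scan returns 8. For a single row the two functions agree, matching the "single row = same boundary as before" note.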