Analogous to the BF16 fix in eea6a82b25, this adds proper Q8_0
type handling in ggml_cuda_op_add:
- Add k_add_q8_0_f32 kernel: dequantize Q8_0, add F32, store F32
- Add k_add_q8_0_q8_0_f32 kernel: dequantize two Q8_0, add, store F32
- Add Q8_0+Q8_0/Q8_0+F32/F32+Q8_0 branches in the F32 dst (else) block,
  preventing Q8_0 data from falling through to the incorrect half cast
- Expand Q8_0 dst branch to handle F32+Q8_0->Q8_0 (swapped args), not
  just Q8_0+F32->Q8_0
* Refactor speculative decoding: move logic outside of server
* remove duplicated tokens in mtp kv cache
* narrow to only discard draft cells in MTP
* revert mtp_speculative_gen_draft
With --parallel 1, a client disconnect/timeout on a *queued* request aborts the
*active* decode of a different client (llama_decode: failed to decode, ret = -3 /
"Decode process is cancelled by user"), releasing the slot with the request
unfinished. To the active client the stream silently stalls and never returns,
while the server reports healthy — easy to misdiagnose as a network/proxy wedge.
Root cause: llama_decode_stop() signals a process-global stop flag that the
active decode loop polls. examples/server/server.cpp calls it *ungated* from the
request reader's connection-closed paths, so any reader closing (including a
queued, not-yet-running task's) trips the global flag against whatever decode is
currently active. Adjacent to #1576/#1673 ("clear sticky stop flag" +
hybrid/recurrent ret=-3), which did not gate these call sites against non-active
readers, so the queued-cancel-kills-active cascade still fires on current main.
Fix (minimal gate): add server_response_reader::any_task_on_slot() and gate the
three llama_decode_stop() sites on it, so the global stop is signalled only when
one of THIS reader's tasks is on a slot (the active decode). A queued task's
disconnect then only drops that queued task. Verified in production under heavy
concurrent, frequently-cancelled load (hundreds of queued-task cancels, zero
active-decode kills). Stdlib-only reproducer in the PR description.
Caveat: any_task_on_slot() reads the slots vector from the reader thread, which
is the same race class as the existing process-global flag. This can be
tightened to per-context or per-task cancellation if preferred.
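
The shape of the gate can be sketched as follows. This is a simplified,
single-threaded illustration rather than the server code: slot_sketch,
reader_sketch, g_stop_flag_sketch and decode_stop_sketch are hypothetical
stand-ins for the server slot, server_response_reader, the process-global stop
flag and llama_decode_stop(), and the field names are assumptions.

```cpp
#include <atomic>
#include <unordered_set>
#include <vector>

// Hypothetical stand-in for a server slot: which task it runs, if any.
struct slot_sketch {
    int  id_task       = -1;
    bool is_processing = false;
};

// Stand-in for the process-global stop flag polled by the decode loop.
static std::atomic<bool> g_stop_flag_sketch{false};

// Stand-in for llama_decode_stop(): signals the global flag.
inline void decode_stop_sketch() { g_stop_flag_sketch.store(true); }

// Hypothetical stand-in for the request reader.
struct reader_sketch {
    std::unordered_set<int>          id_tasks;        // tasks owned by this reader
    const std::vector<slot_sketch> * slots = nullptr; // shared slot list

    // True only if one of THIS reader's tasks is currently on a slot.
    bool any_task_on_slot() const {
        for (const slot_sketch & s : *slots) {
            if (s.is_processing && id_tasks.count(s.id_task) > 0) {
                return true;
            }
        }
        return false;
    }

    // Connection-closed path: signal the global stop only when gated in.
    void on_connection_closed() const {
        if (any_task_on_slot()) {
            decode_stop_sketch();
        }
        // A queued task is simply dropped; the active decode is untouched.
    }
};
```

With one slot running task 1, a reader that owns only queued task 2 leaves the
flag clear when it closes, while the reader that owns task 1 sets it. The sketch
does not model the cross-thread read of the slot list noted in the caveat.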
* (qwen3vl) Correct calculation for injection point of deepstack image embeddings
The injection point for deepstack embeddings used the hyperparameter accessor n_embd_inp(), which counted the hidden state twice and caused an out-of-bounds array access. The correct accessor is n_embd().
* Fix m-rope when pipeline parallelism is enabled
The MMA flash-attention dispatcher only instantiated ncols2 = 8 and 4 for
head_dim 512, so any other GQA ratio hit GGML_ABORT. Gemma 4 12B's global
attention layers use head_dim 512 with a 16:1 GQA ratio (16 query heads /
1 KV head), which aborts at load. Because MTP speculative decoding requires
flash attention, this also blocks the Gemma 4 12B MTP drafter entirely.
Instantiating ncols2 = 16 there is not viable: it exceeds the maximum dynamic
shared memory on Ada (cudaFuncSetAttribute returns invalid argument). Instead,
route gqa_ratio % 8 == 0 (covering 8 and 16) through the existing ncols2 = 8
kernel, which already iterates over Q-head groups (iter_z = ceil(gqa_ratio /
ncols2)). gqa_ratio 8 and 4 behavior is unchanged; this mirrors the divisor
dispatch already used for the 576x512 case below.
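
The selection logic described above can be sketched on the host as below. This
is an illustration only: the function names are hypothetical, a return value of
0 stands in for the GGML_ABORT path, and keeping the ncols2 = 4 case as an exact
match on gqa_ratio is an assumption, since the text only states that its
behavior is unchanged.

```cpp
// Pick the ncols2 kernel variant for head_dim 512 (hypothetical helper).
// Returns 0 where the real dispatcher would abort.
inline int pick_ncols2_hd512_sketch(int gqa_ratio) {
    if (gqa_ratio % 8 == 0) {
        return 8;   // covers gqa_ratio 8 and 16 with the existing kernel
    }
    if (gqa_ratio == 4) {
        return 4;   // assumption: exact match, as before this change
    }
    return 0;       // unsupported ratio
}

// Number of Q-head groups the kernel iterates over:
// iter_z = ceil(gqa_ratio / ncols2), computed with integer arithmetic.
inline int iter_z_sketch(int gqa_ratio, int ncols2) {
    return (gqa_ratio + ncols2 - 1) / ncols2;
}
```

For a 16:1 ratio this selects ncols2 = 8 with two Q-head groups, so no larger
kernel instantiation (and no extra shared memory) is needed; for ratio 8 it
selects the same kernel with a single group.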
Verified on RTX 4070 Ti SUPER (Ada, cc 8.9): Gemma 4 12B + MTP drafter now
runs with flash attention; draft acceptance 43-95% by workload, 1.5-2.2x
end-to-end speedup. The 26B-A4B drafter (gqa_ratio 8) is unaffected.
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>