markaalonzo 48819dadaf
server: fix ret=-3 on hybrid/recurrent prompt cache, and clear sticky stop flag (#1673)
Two related issues that manifest as 'llama_decode ret=-3' on hybrid
architectures (e.g. Qwen3.5/3.6 MoE, Qwen3-Next), matching the symptom
reported in #1576.

1) server_context::apply_checkpoint() was written around transformer KV
   semantics (pos_min / pos_max per-token window). For hybrid and pure
   recurrent models the per-token pos_min threshold does not apply: the
   recurrent state is a single snapshot, and the server-side checkpoint
   is a whole-prefix record. The old selector 'cur.pos_min < pos_min_thold'
   can succeed on a checkpoint whose pos_max is past the current n_past,
   and, more commonly, fall through to do_reset = true, which zeros
   slot.n_past / slot.n_past_prompt. Zeroing in place while the recurrent
   state in the context is still populated makes the next decode batch
   disagree with the live state, so llama_decode() returns ret=-3.

   This change gates the checkpoint path on
   llama_model_has_recurrent(llama_get_model(slot.ctx)):
   - selector uses pos_max <= slot.n_past && pos_max < pos_next
     (whole-prefix match, leaves at least one token to decode);
   - on miss, slot state is preserved rather than zeroed, letting
     update_slots() continue from the already-valid n_past_prompt;
   - the erase loop drops any checkpoint whose pos_max > pos_next,
     matching the rewind semantics for recurrent state.

   Transformer behavior is unchanged.
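
   The recurrent selector and erase rules above can be sketched as
   follows. This is a minimal illustration, not the code in
   server-context.cpp: the checkpoint struct and both function names are
   hypothetical, and picking the matching checkpoint with the largest
   pos_max is an assumption about tie-breaking.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a server-side checkpoint record.
struct checkpoint {
    int32_t pos_min;
    int32_t pos_max;
};

// Returns the index of the checkpoint to restore, or -1 on a miss.
// Rule from the commit message: pos_max <= n_past && pos_max < pos_next.
// Assumption: among matches, prefer the one with the largest pos_max.
int select_recurrent_checkpoint(const std::vector<checkpoint> & cps,
                                int32_t n_past, int32_t pos_next) {
    int best = -1;
    for (int i = 0; i < (int) cps.size(); ++i) {
        const checkpoint & cur = cps[i];
        const bool whole_prefix = cur.pos_max <= n_past;   // snapshot not ahead of the slot
        const bool leaves_token = cur.pos_max <  pos_next; // at least one token left to decode
        if (whole_prefix && leaves_token &&
            (best < 0 || cur.pos_max > cps[best].pos_max)) {
            best = i;
        }
    }
    // On a miss the caller keeps the slot state as-is instead of zeroing it.
    return best;
}

// Erase rule: drop every checkpoint whose pos_max is past pos_next.
void erase_stale_checkpoints(std::vector<checkpoint> & cps, int32_t pos_next) {
    cps.erase(std::remove_if(cps.begin(), cps.end(),
                             [pos_next](const checkpoint & c) {
                                 return c.pos_max > pos_next;
                             }),
              cps.end());
}
```

   A return value of -1 corresponds to the "on miss" bullet: the slot's
   n_past / n_past_prompt are left untouched.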

2) stop_internal_decode is a file-static global in src/llama.cpp, set by
   llama_decode_stop() (called on client disconnect) and polled inside
   the decode loop to bail out with ret=-3. The flag is only cleared on
   one conditional path in server_slot::release(), so a stop signal that
   arrives after the interrupted llama_decode() has already returned
   bleeds into the NEXT decode call and causes an immediate ret=-3 with
   no work performed. Clear it at the top of the public llama_decode()
   entry so the signal is scoped to the in-flight decode it was meant
   for.
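
   The sticky-flag problem and the entry-point clear can be illustrated
   with a small stand-alone model. All names here are hypothetical
   stand-ins (stop_flag for stop_internal_decode, decode_stop for
   llama_decode_stop), and the use of std::atomic<bool> is an assumption
   rather than a description of the actual declaration in src/llama.cpp.

```cpp
#include <atomic>

// Hypothetical stand-in for the file-static stop flag.
static std::atomic<bool> stop_flag{false};

// Stand-in for the stop request issued on client disconnect.
void decode_stop() { stop_flag.store(true); }

// Inner decode loop: polls the flag and bails out with -3 if it is set.
static int run_steps(int n_steps) {
    for (int i = 0; i < n_steps; ++i) {
        if (stop_flag.load()) {
            return -3;
        }
        // ... one unit of decode work would run here ...
    }
    return 0;
}

// Old behavior: a stale flag from a previous request is still visible.
int decode_sticky(int n_steps) {
    return run_steps(n_steps);
}

// New behavior: clear the flag on entry so a stop request only affects
// the decode that is in flight when the request arrives.
int decode_scoped(int n_steps) {
    stop_flag.store(false);
    return run_steps(n_steps);
}
```

   In this model, a decode_stop() that lands after the previous decode
   has returned makes decode_sticky() fail immediately with -3, while
   decode_scoped() performs its work and returns 0.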

Build-verified: llama-server with GGML_CUDA=ON, -DCMAKE_CUDA_ARCHITECTURES=86
(sm_86), IQK flash-attn + matmul enabled. No new APIs introduced —
llama_model_has_recurrent is already public and already used elsewhere in
server-context.cpp.

Closes #1576
2026-04-23 09:19:17 +02:00