mirror of
https://github.com/ikawrakow/ik_llama.cpp.git
synced 2026-06-28 04:30:15 -05:00
Two related issues that manifest as 'llama_decode ret=-3' on hybrid architectures (e.g. Qwen3.5/3.6 MoE, Qwen3-Next), matching the symptom reported in #1576.

1) server_context::apply_checkpoint() was written around transformer KV semantics (pos_min / pos_max per-token window). For hybrid and pure recurrent models the per-token pos_min threshold does not apply: the recurrent state is a single snapshot, and the server-side checkpoint is a whole-prefix record. The old selector 'cur.pos_min < pos_min_thold' can succeed on a checkpoint whose pos_max is past the current n_past, and, more commonly, fall through to do_reset = true, which zeros slot.n_past / slot.n_past_prompt. Zeroing in-place while the recurrent state in the context is still populated makes the next decode batch disagree with the live state, returning ret=-3.

This change gates the checkpoint path on llama_model_has_recurrent(llama_get_model(slot.ctx)):

- selector uses pos_max <= slot.n_past && pos_max < pos_next (whole-prefix match, leaves at least one token to decode);
- on miss, slot state is preserved rather than zeroed, letting update_slots() continue from the already-valid n_past_prompt;
- the erase loop drops any checkpoint whose pos_max > pos_next, matching the rewind semantics for recurrent state.

Transformer behavior is unchanged.

2) stop_internal_decode is a file-static global in src/llama.cpp, set by llama_decode_stop() (called on client disconnect) and polled inside the decode loop to bail out with ret=-3. The flag is only cleared on one conditional path in server_slot::release(), so a stop signal that arrives after the interrupted llama_decode() has already returned bleeds into the NEXT decode call and causes an immediate ret=-3 with no work performed. Clear it at the top of the public llama_decode() entry so the signal is scoped to the in-flight decode it was meant for.

Build-verified: llama-server with GGML_CUDA=ON, -DCMAKE_CUDA_ARCHITECTURES=86 (sm_86), IQK flash-attn + matmul enabled.
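
For illustration, the recurrent-model selector and erase rule from item 1 can be sketched in isolation. This is not the code in server-context.cpp: checkpoint_sketch, select_recurrent_checkpoint and erase_stale_checkpoints are hypothetical names, and scanning newest-first over a vector ordered oldest to newest is an assumption made for the sketch.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a server-side checkpoint record.
struct checkpoint_sketch {
    int32_t pos_min; // ignored for recurrent models in this sketch
    int32_t pos_max; // last position covered by the whole-prefix snapshot
};

// Returns the index of the newest usable checkpoint, or -1 on a miss.
// Usable means: the snapshot covers a prefix the slot already holds
// (pos_max <= n_past) and leaves at least one token to decode
// (pos_max < pos_next). On a miss the caller keeps its slot state as-is.
// Assumption: cps is ordered oldest to newest.
inline int select_recurrent_checkpoint(const std::vector<checkpoint_sketch> & cps,
                                       int32_t n_past, int32_t pos_next) {
    for (int i = static_cast<int>(cps.size()) - 1; i >= 0; --i) {
        if (cps[i].pos_max <= n_past && cps[i].pos_max < pos_next) {
            return i;
        }
    }
    return -1;
}

// Drops every checkpoint whose pos_max lies past pos_next, since a recurrent
// snapshot taken beyond the rewind point cannot be reused.
inline void erase_stale_checkpoints(std::vector<checkpoint_sketch> & cps, int32_t pos_next) {
    cps.erase(std::remove_if(cps.begin(), cps.end(),
                             [pos_next](const checkpoint_sketch & c) {
                                 return c.pos_max > pos_next;
                             }),
              cps.end());
}
```

A miss (-1) here corresponds to the "slot state is preserved" branch above: nothing is zeroed, and decoding continues from the existing n_past_prompt.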
No new APIs introduced: llama_model_has_recurrent is already public and already used elsewhere in server-context.cpp.

Closes #1576
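
The stop-flag scoping from item 2 can be shown with a toy model. This is a minimal sketch, not the code in src/llama.cpp: stop_flag_sketch, decode_stop_sketch and decode_sketch are hypothetical stand-ins for stop_internal_decode, llama_decode_stop() and llama_decode(), and the use of std::atomic<bool> is an assumption of the sketch rather than a statement about the real flag's type.

```cpp
#include <atomic>

// Hypothetical stand-in for the file-static stop flag.
static std::atomic<bool> stop_flag_sketch{false};

// Stand-in for the stop request issued on client disconnect.
inline void decode_stop_sketch() {
    stop_flag_sketch.store(true);
}

// Stand-in for the public decode entry point.
// n_steps:      number of iterations of the simulated decode loop
// stop_at_step: iteration at which a disconnect is simulated (-1 = never)
// Returns 0 on completion, -3 when interrupted by the stop flag.
inline int decode_sketch(int n_steps, int stop_at_step = -1) {
    // Clear any stale signal at entry, so a stop request that arrived after
    // the previous decode returned cannot abort this one.
    stop_flag_sketch.store(false);

    for (int i = 0; i < n_steps; ++i) {
        if (i == stop_at_step) {
            decode_stop_sketch(); // simulated disconnect during this decode
        }
        if (stop_flag_sketch.load()) {
            return -3; // interrupted in flight
        }
        // ... one unit of decode work would run here ...
    }
    return 0;
}
```

With the clear at entry, calling decode_stop_sketch() while no decode is running has no effect on the next decode_sketch() call, while a stop raised inside the loop still returns -3.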