server: gate llama_decode_stop() to the active decode (fix queued-cancel cascade) (#1941)

With --parallel 1, a client disconnect or timeout on a *queued* request aborts the *active* decode of a different client (llama_decode: failed to decode, ret = -3 / "Decode process is cancelled by user") and releases the slot with that request unfinished. To the active client the stream silently stalls and never returns, while the server still reports healthy, so the failure is easy to misdiagnose as a network/proxy wedge.

Root cause: llama_decode_stop() sets a process-global stop flag that the active decode loop polls. examples/server/server.cpp calls it *ungated* from the request reader's connection-closed paths, so any reader closing (including the reader of a queued, not-yet-running task) trips the global flag against whatever decode is currently active.

Related: #1576/#1673 ("clear sticky stop flag" + hybrid/recurrent ret=-3) did not gate these call sites against non-active readers, so the queued-cancel-kills-active cascade still fires on current main.

Fix (minimal gate): add server_response_reader::any_task_on_slot() and gate the three llama_decode_stop() call sites on it, so the global stop is signalled only when one of THIS reader's tasks is on a slot (the active decode). A queued task's disconnect then drops only that queued task.

Testing: verified in production under heavy concurrent, frequently-cancelled load (hundreds of queued-task cancels, zero active-decode kills). A stdlib-only reproducer is in the PR description.

Caveat: any_task_on_slot() is a best-effort cross-thread read of the slots vector from the reader thread, the same race class as the existing process-global flag. It can be tightened to per-context or per-task cancellation if preferred.
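
The cascade and the gate can be illustrated with a small standalone model. This is a sketch under stated assumptions, not the server code: Slot, Reader, g_stop_decode, on_disconnect() and active_decode_killed() are hypothetical stand-ins, and only the any_task_on_slot() predicate mirrors the patch. The model assumes one slot (--parallel 1) with task 1 decoding and task 2 queued.

```cpp
#include <atomic>
#include <cassert>
#include <set>
#include <vector>

// Hypothetical stand-in for the process-global stop flag that the
// decode loop polls (the real flag lives inside the library).
static std::atomic<bool> g_stop_decode{false};

// Simplified slot: only the fields the gate needs.
struct Slot {
    int  id_task    = -1;
    bool processing = false;
    bool is_processing() const { return processing; }
};

// Simplified reader: owns a set of task ids and can see the slot table.
struct Reader {
    std::set<int>             id_tasks; // tasks owned by this reader
    const std::vector<Slot> * slots;    // shared slot table (not owned)

    // Same predicate as the patch: is one of OUR tasks on a busy slot?
    bool any_task_on_slot() const {
        for (const auto & slot : *slots) {
            if (slot.is_processing() && id_tasks.count(slot.id_task)) return true;
        }
        return false;
    }

    // Models the connection-closed path. Ungated: always signal stop.
    // Gated: signal stop only if this reader owns the active decode.
    void on_disconnect(bool gated) const {
        if (!gated || any_task_on_slot()) g_stop_decode.store(true);
    }
};

// Task 1 is decoding on the only slot; task 2 is queued. Returns true if
// the disconnect of the reader owning `disconnecting_task` would set the
// global stop flag, i.e. abort the active decode.
bool active_decode_killed(int disconnecting_task, bool gated) {
    g_stop_decode.store(false);
    std::vector<Slot> slots(1);
    slots[0].id_task    = 1;
    slots[0].processing = true;
    Reader reader{{disconnecting_task}, &slots};
    reader.on_disconnect(gated);
    return g_stop_decode.load();
}
```

In this model, active_decode_killed(2, false) is true (the queued reader's disconnect kills task 1's decode), active_decode_killed(2, true) is false, and active_decode_killed(1, true) is still true, so a disconnect by the active client continues to cancel its own decode.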
2026-06-28 04:30:15 -05:00 · 2026-06-12 08:25:44 +02:00 · 2026-06-12 08:25:44 +02:00 · b1eb8bb0a1
commit b1eb8bb0a1
parent 5fb707d19b
1 changed file with 15 additions and 3 deletions
--- a/examples/server/server.cpp
+++ b/examples/server/server.cpp
@@ -330,6 +330,18 @@ struct server_response_reader {
        return !cancelled && received_count < id_tasks.size();
    }

+    // cancel-cascade fix: true only if one of THIS reader's tasks is on a
+    // slot (the active decode). Used to gate llama_decode_stop() so a queued/
+    // deferred task's disconnect cannot abort another task's active decode via
+    // the process-global stop_internal_decode flag. Best-effort cross-thread
+    // read (slots are not resized at runtime; same race class as the global).
+    bool any_task_on_slot() const {
+        for (const auto & slot : ctx_server.slots) {
+            if (slot.is_processing() && id_tasks.count(slot.id_task)) return true;
+        }
+        return false;
+    }
+
    // return nullptr if should_stop() is true before receiving a result
    // note: if one error is received, it will stop further processing and return error result
    server_task_result_ptr next(const std::function<bool()>& should_stop) {
@@ -1127,7 +1139,7 @@ int main(int argc, char ** argv) {
                // non-stream, wait for the results
                auto all_results = rd->wait_for_all(is_connection_closed);
                if (all_results.is_terminated) {
-                    llama_decode_stop(); // send a signal to stop decode process
+                    if (rd->any_task_on_slot()) llama_decode_stop(); // cancel-cascade fix: stop only if OUR task is the active decode
                    return; // connection is closed
                }
                else if (all_results.error) {
@@ -1150,7 +1162,7 @@ int main(int argc, char ** argv) {
                // ref: https://github.com/ggml-org/llama.cpp/pull/16486#discussion_r2419657309
                server_task_result_ptr first_result = rd->next(is_connection_closed);
                if (first_result == nullptr) {
-                    llama_decode_stop(); // send a signal to stop decode process
+                    if (rd->any_task_on_slot()) llama_decode_stop(); // cancel-cascade fix: stop only if OUR task is the active decode
                    return; // connection is closed
                }
                else if (first_result->is_error()) {
@@ -1480,7 +1492,7 @@ int main(int argc, char ** argv) {

        // collect results
        if (all_results.is_terminated) {
-            llama_decode_stop();
+            if (rd.any_task_on_slot()) llama_decode_stop(); // cancel-cascade fix: stop only if OUR task is the active decode
            return; // connection is closed
        }
        else if (all_results.error) {