server: gate llama_decode_stop() to the active decode (fix queued-cancel cascade) (#1941)

With --parallel 1, a client disconnect/timeout on a *queued* request aborts the
*active* decode of a different client (llama_decode: failed to decode, ret = -3 /
"Decode process is cancelled by user"), releasing the slot with the request
unfinished. To the active client the stream silently stalls and never returns,
while the server reports healthy — easy to misdiagnose as a network/proxy wedge.

Root cause: llama_decode_stop() signals a process-global stop flag that the
active decode loop polls. examples/server/server.cpp calls it *ungated* from the
request reader's connection-closed paths, so any reader closing (including a
queued, not-yet-running task's) trips the global flag against whatever decode is
currently active. This is adjacent to #1576/#1673 ("clear sticky stop flag" +
hybrid/recurrent ret=-3); those changes did not gate these call sites against
non-active readers, so the queued-cancel-kills-active cascade still fires on
current main.

Fix (minimal gate): add server_response_reader::any_task_on_slot() and gate the
three llama_decode_stop() sites on it, so the global stop is signalled only when
one of THIS reader's tasks is on a slot (the active decode). A queued task's
disconnect then only drops that queued task. Verified in production under heavy
concurrent, frequently-cancelled load (hundreds of queued-task cancels, zero
active-decode kills). Stdlib-only reproducer in the PR description.
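
For readers without the PR description at hand, the cascade and the gate can be modelled in a few lines. This is a simplified sketch, not the server code: Slot, Reader, g_stop, decode_stop(), on_disconnect() and stop_fired() are hypothetical stand-ins, and only the shape of any_task_on_slot() mirrors this commit.

```cpp
#include <atomic>
#include <cassert>
#include <set>
#include <vector>

// Hypothetical stand-ins; these are NOT the real server.cpp types.
static std::atomic<bool> g_stop{false};            // models the process-global stop flag
static void decode_stop() { g_stop.store(true); }  // models llama_decode_stop()

struct Slot {
    int  id_task;     // task currently assigned to this slot
    bool processing;  // true while the slot is decoding
};

struct Reader {
    std::set<int>             id_tasks;  // tasks owned by this reader
    const std::vector<Slot> * slots;     // shared slot table (not owned)

    // Same shape as the gate added in this commit: true only if one of this
    // reader's tasks currently occupies a processing slot.
    bool any_task_on_slot() const {
        assert(slots != nullptr);
        for (const auto & s : *slots) {
            if (s.processing && id_tasks.count(s.id_task)) return true;
        }
        return false;
    }

    // Connection-closed path. gated == false models the old, ungated call.
    void on_disconnect(bool gated) const {
        if (!gated || any_task_on_slot()) decode_stop();
    }
};

// Resets the flag, runs the disconnect path, and reports whether the
// global stop was signalled.
static bool stop_fired(const Reader & r, bool gated) {
    g_stop.store(false);
    r.on_disconnect(gated);
    return g_stop.load();
}
```

With a single slot processing task 1, a reader that owns only queued task 2 makes stop_fired(reader, false) return true (the cascade) and stop_fired(reader, true) return false, while the reader that owns task 1 still signals the stop in the gated case.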

Caveat: any_task_on_slot() reads the slots vector from the reader thread — the
same race class as the existing process-global flag; can be tightened to a
per-context/per-task cancellation if preferred.
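
One possible shape for the per-task tightening mentioned in the caveat, as a sketch only: every name below (CancelToken, Task, make_task, cancel, run_decode) is hypothetical and nothing here exists in the codebase. The idea is that each task carries its own flag, so the decode loop polls only the flag of the task it is running and a queued task's cancel cannot reach it.

```cpp
#include <atomic>
#include <cassert>
#include <memory>

// Hypothetical per-task cancellation token; not an existing llama.cpp type.
struct CancelToken {
    std::atomic<bool> cancelled{false};
};

struct Task {
    int                          id;
    std::shared_ptr<CancelToken> token;  // shared by the reader and the decode loop
};

static Task make_task(int id) {
    return Task{id, std::make_shared<CancelToken>()};
}

// Reader side: cancel only the task whose connection closed.
static void cancel(const Task & t) {
    t.token->cancelled.store(true);
}

// Decode side: run up to max_steps, polling only this task's own token.
// Returns the number of steps completed.
static int run_decode(const Task & t, int max_steps) {
    assert(t.token != nullptr);
    int done = 0;
    while (done < max_steps && !t.token->cancelled.load()) {
        ++done;
    }
    return done;
}
```

In this model, cancelling a queued task leaves run_decode() for the active task unaffected, because the two tasks never share a flag.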
Simon Lundell 2026-06-12 08:25:44 +02:00 committed by GitHub
parent 5fb707d19b
commit b1eb8bb0a1

@@ -330,6 +330,18 @@ struct server_response_reader {
         return !cancelled && received_count < id_tasks.size();
     }
 
+    // cancel-cascade fix: true only if one of THIS reader's tasks is on a
+    // slot (the active decode). Used to gate llama_decode_stop() so a queued/
+    // deferred task's disconnect cannot abort another task's active decode via
+    // the process-global stop_internal_decode flag. Best-effort cross-thread
+    // read (slots are not resized at runtime; same race class as the global).
+    bool any_task_on_slot() const {
+        for (const auto & slot : ctx_server.slots) {
+            if (slot.is_processing() && id_tasks.count(slot.id_task)) return true;
+        }
+        return false;
+    }
+
     // return nullptr if should_stop() is true before receiving a result
     // note: if one error is received, it will stop further processing and return error result
     server_task_result_ptr next(const std::function<bool()>& should_stop) {
@@ -1127,7 +1139,7 @@ int main(int argc, char ** argv) {
         // non-stream, wait for the results
         auto all_results = rd->wait_for_all(is_connection_closed);
         if (all_results.is_terminated) {
-            llama_decode_stop(); // send a signal to stop decode process
+            if (rd->any_task_on_slot()) llama_decode_stop(); // cancel-cascade fix: stop only if OUR task is the active decode
             return; // connection is closed
         }
         else if (all_results.error) {
@@ -1150,7 +1162,7 @@ int main(int argc, char ** argv) {
         // ref: https://github.com/ggml-org/llama.cpp/pull/16486#discussion_r2419657309
         server_task_result_ptr first_result = rd->next(is_connection_closed);
         if (first_result == nullptr) {
-            llama_decode_stop(); // send a signal to stop decode process
+            if (rd->any_task_on_slot()) llama_decode_stop(); // cancel-cascade fix: stop only if OUR task is the active decode
             return; // connection is closed
         }
         else if (first_result->is_error()) {
@@ -1480,7 +1492,7 @@ int main(int argc, char ** argv) {
         // collect results
         if (all_results.is_terminated) {
-            llama_decode_stop();
+            if (rd.any_task_on_slot()) llama_decode_stop(); // cancel-cascade fix: stop only if OUR task is the active decode
             return; // connection is closed
         }
         else if (all_results.error) {