mirror of https://github.com/ikawrakow/ik_llama.cpp.git
synced 2026-06-28 04:30:15 -05:00
server: gate llama_decode_stop() to the active decode (fix queued-cancel cascade) (#1941)

With --parallel 1, a client disconnect or timeout on a *queued* request aborts the
*active* decode of a different client (llama_decode: failed to decode, ret = -3 /
"Decode process is cancelled by user") and releases the slot with that request
unfinished. To the active client the stream silently stalls and never returns
while the server still reports healthy, which is easy to misdiagnose as a
network/proxy wedge.

Root cause: llama_decode_stop() sets a process-global stop flag that the
active decode loop polls. examples/server/server.cpp calls it *ungated* from the
request reader's connection-closed paths, so any reader that closes (including
the reader of a queued, not-yet-running task) trips the global flag against
whatever decode is currently active. This is adjacent to #1576/#1673 ("clear
sticky stop flag" + hybrid/recurrent ret=-3), which did not gate these call
sites against non-active readers, so the queued-cancel-kills-active cascade
still fires on current main.
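
To make the root cause concrete, here is a minimal single-threaded sketch of an ungated process-global stop flag. Every name in it (`g_stop_decode`, `fake_decode_stop`, `fake_decode`, `run_cascade`) is hypothetical and stands in for the real server code. The -3 return value mirrors the log line quoted above, and clearing the flag on observation is an assumption of this sketch, not a statement about the library.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical stand-in for the process-global stop flag.
static std::atomic<bool> g_stop_decode{false};

// Stand-in for llama_decode_stop(): it takes no argument, so the caller
// cannot say *which* decode it wants to stop.
void fake_decode_stop() { g_stop_decode.store(true); }

// Stand-in for the active decode loop. It polls the global flag before each
// step and returns -3 when the flag is set, 0 when all steps complete.
// on_step runs after each step so a caller can inject a "disconnect".
template <typename F>
int fake_decode(int n_steps, F on_step) {
    assert(n_steps >= 0);
    for (int i = 0; i < n_steps; ++i) {
        // Assumption: the flag is cleared when observed.
        if (g_stop_decode.exchange(false)) return -3;
        on_step(i);
    }
    return 0;
}

// Client A is decoding. Client B is only queued, but its connection-closed
// path calls the ungated stop, which aborts client A's decode.
int run_cascade() {
    g_stop_decode.store(false);
    return fake_decode(8, [](int step) {
        if (step == 2) fake_decode_stop(); // queued client B disconnects
    });
}
```

Calling `run_cascade()` returns -3 even though the decoding client never disconnected, which is the cascade described above in miniature.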

Fix (minimal gate): add server_response_reader::any_task_on_slot() and gate the
three llama_decode_stop() call sites on it, so the global stop is signalled only
when one of THIS reader's tasks is on a slot (the active decode). A queued task's
disconnect then drops only that queued task. Verified in production under heavy
concurrent, frequently-cancelled load (hundreds of queued-task cancels, zero
active-decode kills). Stdlib-only reproducer in the PR description.
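
The gate can be illustrated in isolation with simplified types. The sketch below is not the server code: `SketchSlot`, `SketchReader`, `sketch_decode_stop`, and `stops_after_disconnect` are hypothetical, and only the member names `id_task`, `id_tasks`, and `any_task_on_slot` are borrowed from the diff. It assumes a single slot where task 7 is decoding and task 8 is still queued.

```cpp
#include <cassert>
#include <unordered_set>
#include <vector>

// Hypothetical, simplified mirror of a server slot.
struct SketchSlot {
    int  id_task;    // task currently assigned to this slot
    bool processing; // true while the slot is decoding
};

static int g_stop_calls = 0;                  // counts global stop signals
void sketch_decode_stop() { ++g_stop_calls; } // stand-in for llama_decode_stop()

struct SketchReader {
    std::unordered_set<int>         id_tasks; // tasks owned by this reader
    const std::vector<SketchSlot> & slots;    // view of the server's slots

    // True only if one of this reader's tasks occupies a processing slot.
    bool any_task_on_slot() const {
        for (const auto & slot : slots) {
            if (slot.processing && id_tasks.count(slot.id_task)) return true;
        }
        return false;
    }

    // Connection-closed path: signal the global stop only when gated in.
    void on_connection_closed() const {
        if (any_task_on_slot()) sketch_decode_stop();
    }
};

// Returns how many stop signals a disconnect of the given task produces
// while task 7 is the active decode.
int stops_after_disconnect(int disconnecting_task) {
    assert(disconnecting_task >= 0);
    g_stop_calls = 0;
    std::vector<SketchSlot> slots;
    slots.push_back(SketchSlot{7, true});
    std::unordered_set<int> ids;
    ids.insert(disconnecting_task);
    SketchReader reader{ids, slots};
    reader.on_connection_closed();
    return g_stop_calls;
}
```

In this sketch `stops_after_disconnect(8)` produces no stop signal because task 8 is not on a slot, while `stops_after_disconnect(7)` produces exactly one.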

Caveat: any_task_on_slot() reads the slots vector from the reader thread, which
is the same race class as the existing process-global flag; it can be tightened
to per-context/per-task cancellation if preferred.
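
One possible shape for the per-task tightening mentioned in the caveat is a cancellation token per task, so a decode loop polls only its own flag. This is purely a sketch of that idea: `CancelToken`, `TokenRegistry`, `token_decode`, and `run_with_cancel_of` are hypothetical and do not exist in the diff, and a real server would also need to protect the registry map with a mutex, which is omitted here.

```cpp
#include <atomic>
#include <cassert>
#include <memory>
#include <unordered_map>

// Hypothetical per-task cancellation token.
using CancelToken = std::shared_ptr<std::atomic<bool>>;

struct TokenRegistry {
    std::unordered_map<int, CancelToken> tokens; // task id -> token

    // Registers a task and returns its token.
    CancelToken add(int id_task) {
        auto tok = std::make_shared<std::atomic<bool>>(false);
        tokens[id_task] = tok;
        return tok;
    }

    // Cancels one task only; unknown ids are ignored.
    void cancel(int id_task) {
        auto it = tokens.find(id_task);
        if (it != tokens.end()) it->second->store(true);
    }
};

// Stand-in decode loop that polls only the token it was given.
// Returns -3 when cancelled, 0 when all steps complete.
template <typename F>
int token_decode(const std::atomic<bool> & cancelled, int n_steps, F on_step) {
    assert(n_steps >= 0);
    for (int i = 0; i < n_steps; ++i) {
        if (cancelled.load()) return -3;
        on_step(i);
    }
    return 0;
}

// Task 7 is decoding and task 8 is queued; one of them is cancelled mid-run.
int run_with_cancel_of(int cancelled_task) {
    TokenRegistry reg;
    CancelToken active = reg.add(7);
    reg.add(8);
    return token_decode(*active, 8, [&](int step) {
        if (step == 2) reg.cancel(cancelled_task);
    });
}
```

With this design, cancelling the queued task 8 leaves the active decode untouched, and only cancelling task 7 makes the loop return -3, so no gate on the caller side is needed.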
parent 5fb707d19b
commit b1eb8bb0a1
@@ -330,6 +330,18 @@ struct server_response_reader {
         return !cancelled && received_count < id_tasks.size();
     }
 
+    // cancel-cascade fix: true only if one of THIS reader's tasks is on a
+    // slot (the active decode). Used to gate llama_decode_stop() so a queued/
+    // deferred task's disconnect cannot abort another task's active decode via
+    // the process-global stop_internal_decode flag. Best-effort cross-thread
+    // read (slots are not resized at runtime; same race class as the global).
+    bool any_task_on_slot() const {
+        for (const auto & slot : ctx_server.slots) {
+            if (slot.is_processing() && id_tasks.count(slot.id_task)) return true;
+        }
+        return false;
+    }
+
     // return nullptr if should_stop() is true before receiving a result
     // note: if one error is received, it will stop further processing and return error result
     server_task_result_ptr next(const std::function<bool()>& should_stop) {
@@ -1127,7 +1139,7 @@ int main(int argc, char ** argv) {
             // non-stream, wait for the results
             auto all_results = rd->wait_for_all(is_connection_closed);
             if (all_results.is_terminated) {
-                llama_decode_stop(); // send a signal to stop decode process
+                if (rd->any_task_on_slot()) llama_decode_stop(); // cancel-cascade fix: stop only if OUR task is the active decode
                 return; // connection is closed
             }
             else if (all_results.error) {
@@ -1150,7 +1162,7 @@ int main(int argc, char ** argv) {
             // ref: https://github.com/ggml-org/llama.cpp/pull/16486#discussion_r2419657309
             server_task_result_ptr first_result = rd->next(is_connection_closed);
             if (first_result == nullptr) {
-                llama_decode_stop(); // send a signal to stop decode process
+                if (rd->any_task_on_slot()) llama_decode_stop(); // cancel-cascade fix: stop only if OUR task is the active decode
                 return; // connection is closed
             }
             else if (first_result->is_error()) {
@@ -1480,7 +1492,7 @@ int main(int argc, char ** argv) {
 
         // collect results
         if (all_results.is_terminated) {
-            llama_decode_stop();
+            if (rd.any_task_on_slot()) llama_decode_stop(); // cancel-cascade fix: stop only if OUR task is the active decode
             return; // connection is closed
         }
         else if (all_results.error) {