* hot-swap tensor loop
the hot-swap functionality is triggered only when certain environment variables are set
* target_include_directories tweak
* hot-swap tensor support
two intrusions:
1.) at model loading, to collect the snapshot
2.) a modification of the `/health` HTTP endpoint, so that the hot-swap can be triggered by sending `llama-server` an HTTP request.
Both are guarded by the specific environment variables.
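A minimal sketch of the guard pattern described above. The helper name and the variable name used in the usage note are hypothetical; this message does not name the actual environment variables.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// Hypothetical helper (not from the patch): returns true when the named
// environment variable is set to a non-empty value other than "0".
static bool env_flag_enabled(const char * name) {
    assert(name != nullptr);
    const char * value = std::getenv(name);
    if (value == nullptr) {
        return false; // variable not set: feature stays off
    }
    const std::string v(value);
    return !v.empty() && v != "0";
}
```

Both intrusion points would then be wrapped in a check such as `if (env_flag_enabled("HYPOTHETICAL_HOT_SWAP_VAR")) { ... }`, so default builds behave as before.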
* hot-swap tensor support; graph invalidation
ggml_backend_cuda_invalidate_graphs export
* hot-swap tensor support
graph invalidation implementation; extended debug output (commented out)
* llama_reload_changed_tensors export
* tensor hot-swap on-demand reload
full support for cpu-only, hybrid and gpu-only inference with split mode layer/graph
* docs
* reuse the gguf parsing from llama.cpp
gguf_init_from_file, gguf_find_tensor, ggml_get_tensor
* remove the manual scheduling for hybrid inference
* update docs
* tensor shape validation
* update docs
* update docs
accidentally wiped the previous changes, so restored them
* revert the GGML_CUDA_MAX_DEVICES to 16
* update llama_reload_changed_tensor
update llama_reload_changed_tensor, revert CMakeLists.txt
* update llama_reload_changed_tensor
* GGML_MAX_SRC
GGML_MAX_SRC compile-time definition support
* llama_reload_changed_tensor
update llama_reload_changed_tensor definition
* refactor
move the tensor-reloading implementation to llama-reload.cpp, llama-reload-info.h; some bugfixes and code reduction
* revert
added back the missing newline
* update docs
* reload_info constructor
* bugfix: cpu-only
TODO: improve the working environment by compiling for multiple hardware configurations; possibly make a test pipeline
* cpu-only bugfix
reapply the fix after an unsuccessful sync with main
* windows os compilation fix
#include <string>
* fix windows os build
error C2039: 'string': is not a member of 'std'
* remove dead file
* implement perplexity in server
* Revert "implement perplexity in server"
* Refactor speculative decoding: move logic outside of server
* remove duplicated tokens in mtp kv cache
* narrow to only discard draft cells in MTP
* revert mtp_speculative_gen_draft
With --parallel 1, a client disconnect/timeout on a *queued* request aborts the
*active* decode of a different client (llama_decode: failed to decode, ret = -3 /
"Decode process is cancelled by user"), releasing the slot with the request
unfinished. To the active client the stream silently stalls and never returns,
while the server reports healthy — easy to misdiagnose as a network/proxy wedge.
Root cause: llama_decode_stop() signals a process-global stop flag that the
active decode loop polls. examples/server/server.cpp calls it *ungated* from the
request reader's connection-closed paths, so any reader closing (including a
queued, not-yet-running task's) trips the global flag against whatever decode is
currently active. Adjacent to #1576/#1673 ("clear sticky stop flag" +
hybrid/recurrent ret=-3), which did not gate these call sites against non-active
readers, so the queued-cancel-kills-active cascade still fires on current main.
Fix (minimal gate): add server_response_reader::any_task_on_slot() and gate the
three llama_decode_stop() sites on it, so the global stop is signalled only when
one of THIS reader's tasks is on a slot (the active decode). A queued task's
disconnect then only drops that queued task. Verified in production under heavy
concurrent, frequently-cancelled load (hundreds of queued-task cancels, zero
active-decode kills). Stdlib-only reproducer in the PR description.
Caveat: any_task_on_slot() reads the slots vector from the reader thread — the
same race class as the existing process-global flag; can be tightened to a
per-context/per-task cancellation if preferred.
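A minimal sketch of the gate described above, not the actual server code. Only `any_task_on_slot()` and the idea of a process-global stop flag come from this message; the struct names, fields and signature are assumptions for illustration.

```cpp
#include <atomic>
#include <unordered_set>
#include <vector>

// Stand-in for the process-global stop flag that the decode loop polls.
static std::atomic<bool> g_stop_requested{false};

// Hypothetical slot: id of the task currently decoding, -1 when idle.
struct slot_sketch {
    int id_task = -1;
};

// Hypothetical reader: owns the ids of the tasks it posted.
struct reader_sketch {
    std::unordered_set<int> task_ids;

    // True only when one of THIS reader's tasks occupies a slot.
    bool any_task_on_slot(const std::vector<slot_sketch> & slots) const {
        for (const auto & s : slots) {
            if (s.id_task >= 0 && task_ids.count(s.id_task) > 0) {
                return true;
            }
        }
        return false;
    }

    // Connection-closed path: signal the global stop only for an active task.
    void on_connection_closed(const std::vector<slot_sketch> & slots) {
        if (any_task_on_slot(slots)) {
            g_stop_requested.store(true);
        }
        // A queued task is simply dropped (queue handling not modelled here).
    }
};
```

As in the caveat above, this sketch reads the slots from the reader side without synchronization; a per-task cancellation token would remove that race.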
* (qwen3vl) Correct calculation for injection point of deepstack image embeddings
The injection point for deepstack embeddings used the hyperparameter accessor n_embd_inp(), which double-counted the hidden state and caused an out-of-bounds array access. The correct accessor is n_embd().
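The arithmetic behind the fix can be illustrated as follows. This sketch assumes a per-token input row that holds the main embedding followed by one n_embd-wide slice per deepstack layer; that layout and the struct below are assumptions, not taken from the patch.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical hyperparameter holder (assumed layout, see lead-in).
struct hparams_sketch {
    size_t n_embd;             // hidden size of the language model
    size_t n_deepstack_layers; // number of deepstack slices per token

    // Full per-token input width: main embedding plus all deepstack slices.
    size_t n_embd_inp() const { return n_embd * (1 + n_deepstack_layers); }
};

// Offset (in floats) of deepstack slice `il` within one token's input row.
// The stride is n_embd; using n_embd_inp() here would step past the row.
static size_t deepstack_offset(const hparams_sketch & hp, size_t il) {
    assert(il < hp.n_deepstack_layers);
    return (il + 1) * hp.n_embd;
}
```

With n_embd = 4 and three deepstack slices the row is 16 floats wide: the last slice starts at 12 and ends at 16, whereas a stride of n_embd_inp() would place it at 48.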
* Fix m-rope when pipeline parallelism is enabled
update config
ui: fix audio and video modality detection (#23756)
When model props are fetched asynchronously from the server,
modelPropsVersion is incremented to trigger reactivity, but
only the vision effect was listening to it.
webui: update ignore files
ui: handle audio/vnd.wave as audio WAV file (#23754)
Firefox on Linux uses this MIME type
ui: exclude generated build dirs from prettier and eslint so lint errors stop being masked (#23910)
webui: add custom CSS injection via config (#23904)
* webui: add custom CSS injection via config
register a customCSS setting in the Developer section under Custom JSON,
syncable so it rides the existing ui-config pass through. inject the value
into a single style element in the head, reactive on the setting. lets an
operator theme a prebuilt binary through --ui-config without rebuilding,
and lets a user set it from the settings panel.
move the textContent write into a use: action on the head style node.
the action is the idiomatic way to touch a node, so the no-dom-manipulating
lint rule is satisfied without a disable. value stays text through
textContent, never parsed as HTML.
* Update tools/ui/src/lib/constants/settings-keys.ts
Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
* ui: address review from @allozaur, rename custom config key to customJson with migration
rename the custom config key to customJson across the type, the chat
request builder, the settings save check and the custom tools reader,
keeping the custom API param name unchanged. add a non destructive
migration that copies the legacy custom key to customJson at startup.
only render the head style tag when custom CSS is set.
---------
Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
server: real-time reasoning interruption via control endpoint (#23971)
Builds on the manual reasoning budget trigger from #23949. Adds a
CONTROL task that mirrors the CANCEL path on the live slot and calls
common_sampler_reasoning_budget_force to end thinking mid-generation.
POST /v1/chat/completions/control with { id_slot, action }, opt-in
reasoning_control arms the budget sampler on demand. Router and single
model. Minimal WebUI button as a skeleton for further UI work.
* ui: track reasoning phase via explicit streaming state
Add isReasoning to the chat store, mirroring the isLoading pattern:
per conversation map, private setter, public accessor and reactive
export. Set from the stream callbacks, true on reasoning chunks, false
on the first content chunk, reset on stream end and resynced on
conversation switch. The skip button now keys off isReasoning so it
shows only during the thinking phase, not the whole generation.
* ui: extract control endpoint and action into constants
Move the chat completion routes, the slots route and the reasoning
control action out of chat.service into api-endpoints and a dedicated
control-actions module. No behavior change, drops the magic strings so
the control protocol has a single source of truth.
* server: target reasoning control by completion id
Address @ngxson review on the control endpoint.
Switch from id_slot to the chat completion id to avoid a TOCTOU: the
slot can be reassigned between the lookup and the control request, so
matching the live completion (oaicompat_cmpl_id) is safe and a finished
one simply matches nothing. Rename the action to reasoning_end, guard
it on the reasoning_control flag of the target slot, and reduce the
response to {success} with an optional message.
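A minimal sketch of targeting by completion id, under stated assumptions: the struct names, fields and messages below are hypothetical, and only the matching rule, the reasoning_control guard and the {success, message} response shape come from this message.

```cpp
#include <string>
#include <vector>

// Hypothetical view of a slot for the purpose of the control request.
struct live_slot_sketch {
    std::string cmpl_id;                    // id of the live completion, empty when idle
    bool reasoning_control       = false;   // opt-in flag set by the original request
    bool reasoning_end_requested = false;   // set when the control request is accepted
};

struct control_result_sketch {
    bool        success;
    std::string message; // optional, empty on success
};

// Match the live completion by id instead of by slot index, so a slot that
// was reassigned in the meantime can never be hit by a stale request.
static control_result_sketch request_reasoning_end(std::vector<live_slot_sketch> & slots,
                                                   const std::string & completion_id) {
    for (auto & s : slots) {
        if (s.cmpl_id.empty() || s.cmpl_id != completion_id) {
            continue;
        }
        if (!s.reasoning_control) {
            return {false, "reasoning_control is not enabled for this completion"};
        }
        s.reasoning_end_requested = true;
        return {true, ""};
    }
    return {false, "no live completion with this id"}; // finished or unknown id
}
```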
* ui: target reasoning control by completion id
Keep the streamed completion id on the message and post it back to the
control endpoint instead of probing /slots. Drops the slot discovery
and the TOCTOU that came with it. Action renamed to reasoning_end,
response read as {success}.
* server: address review from @ngxson
Move the control fields into task_params and drop the redundant
comments on the control path.
* server: document the reasoning control endpoint
* Update tools/ui/src/lib/types/database.d.ts
Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
* ui: rename cmplId to completionId
Per @allozaur review, clearer name for the streamed completion id.
* ui: wire completion id capture through the agentic flow
The webui streams through the agentic flow, which relayed onModel but
not onCompletionId, so the completion id never reached the message and
the control request was never sent. Relay it through the flow and its
callbacks type, declare id on the chunk type, and log an explicit error
when the button fires without a usable id.
* ui: target reasoning control model from the message
The model is a property of the completion, so read it from the streaming
message like the id, not from the model dropdown which is unrelated UI
state. Makes the request self-consistent by construction instead of just
unlikely to drift.
---------
Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
ui: Add Thinking mode toggle with reasoning effort levels + improvements for Chat Form Add Action UI (#23434)
Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
* fix: Model tags
ui: simplify network error handling (#23431)
Previously, error-to-string conversion was split across two different files,
with one function converting errors into strings and another analyzing
those strings to generate yet another string.
Now the error handling for network fetches has been centralised and
uses HTTP error codes directly wherever possible to generate the
human-readable error strings.
It also fixes an issue where all JSON errors reported from the backend,
such as "Invalid API key", were incorrectly turned into
"Failed to connect to server" due to poor matching logic in the
now-gone getErrorMessage function.
update html
ui: Mermaid Diagrams in chat + interactive preview (#24032)
webui: fix tool selector toggle/counter, key tools by stable identity (#24065)
* webui: fix tool selector toggle/counter, key tools by stable identity
Key the disabled set, counts and toggles by a stable per-tool key
instead of bare function name, deduped from one canonical list. Per-tool
checkboxes become presentational (single row handler, no nested button),
category checkboxes drop the tristate (n/total carries partial). One
getEnabledToolsForLLM keeps normalized MCP schemas and dedupes by name.
* ui: use SvelteSet and SvelteMap for local tool collections to satisfy svelte/prefer-svelte-reactivity
Co-authored-by: firecoperana <firecoperana>
* Add configuration files for format, presets and examples
* add clang in pre-commit config
* remove clang configurations
* Refactor .gitignore for consistency in formatting
common/chat, server: refactor, move all conversion functions to common, add tests (#20690)
jinja : remove unused header (#22310)
common : fix jinja warnings with clang 21 (#22313)
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
chat: fix handling of space in reasoning markers (#22353)
* chat: fix handling of space in reasoning markers
common : re-arm reasoning budget after DONE on new <think> (#22323)
common : determine generation prompt using longest common prefix (#22657)
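One way to read the technique named in this title, as a sketch only: render the template once without and once with the generation prompt, then take whatever follows their longest common prefix. Both function names are hypothetical and the actual change may differ.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Length of the longest common prefix of two strings, in bytes.
static size_t common_prefix_len(const std::string & a, const std::string & b) {
    const size_t n = std::min(a.size(), b.size());
    size_t i = 0;
    while (i < n && a[i] == b[i]) {
        ++i;
    }
    return i;
}

// Hypothetical helper: the part of `with_gen` that follows the prefix it
// shares with `without_gen` is treated as the generation prompt.
static std::string generation_prompt_suffix(const std::string & without_gen,
                                            const std::string & with_gen) {
    return with_gen.substr(common_prefix_len(without_gen, with_gen));
}
```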
common/autoparser: fixes for newline handling / forced tool calls (#22654)
* chat/autoparser: the fixes
* Move optspace() to chat-peg-parser, comment out server tests invalidated due to content now allowed with forced tool calls.
* Trim whitespace on apply instead
common/chat : preserve media markers for typed-content templates (#22634)
common : revert reasoning budget +inf logit bias (#22740)
common : do not wrap raw strings in schema parser for tagged parsers (#22827)
common : enable streaming JSON argument values (#23173)
* common : remove atomic from json arguments
* common : remove parsing logic on JSON arguments
common : do not pass prompt tokens to reasoning budget sampler (#22488)
reasoning-budget: clone should do a deep-copy (#23095)
Co-authored-by: Piotr Wilkin (ilintar) <piotr.wilkin@syndatis.com>
If a slot is reused for a standard completion (`/v1/completions`) after
being used for a chat completion (`/v1/chat/completions`), the previous
chat's PEG parser would remain active in the slot's parameters. This
caused standard text completions to throw on the raw text.
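A minimal sketch of the failure mode and one possible remedy, assuming the fix is to start every request from fresh per-request parameters. All types and names below are hypothetical and do not mirror the server's real structures.

```cpp
#include <memory>
#include <string>

// Hypothetical stand-in for the chat PEG parser.
struct parser_sketch {
    std::string name;
};

// Hypothetical per-request parameters stored on a slot.
struct slot_params_sketch {
    std::shared_ptr<parser_sketch> chat_parser; // set only for chat completions
};

struct reusable_slot_sketch {
    slot_params_sketch params;

    // Reset the parameters before applying the new request, so a parser left
    // over from an earlier chat completion cannot leak into a text completion.
    void start_request(bool is_chat) {
        params = slot_params_sketch{};
        if (is_chat) {
            params.chat_parser = std::make_shared<parser_sketch>(parser_sketch{"peg"});
        }
    }
};
```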
Add `-tm` / `--threads-mtmd` to control CPU thread count used during
multimodal image/audio processing (mmproj encoding), separate from the
main LLM thread count.
This allows running the LLM on GPU with minimal CPU threads (e.g. `-t 1`)
to reduce sync overhead, while using many threads (e.g. `-tm 16`) for
CPU-bound mmproj encoding with `--no-mmproj-offload`.
Fallback chain when `-tm` is not specified:
1. `--threads-batch` (-tb) — multimodal encoding is a batch/prefill-like
operation, so it makes sense to track with batch thread count
2. `--threads` (-t) — final default
Works with both mtmd-cli and llama-server.
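The fallback chain above can be sketched as follows. The struct, its field names and the convention that a value <= 0 means "not specified" are assumptions for illustration, not the actual parameter layout.

```cpp
#include <cassert>

// Hypothetical option holder; <= 0 means the flag was not given.
struct thread_opts_sketch {
    int n_threads       = 4;  // --threads (-t), always resolved to a value
    int n_threads_batch = -1; // --threads-batch (-tb)
    int n_threads_mtmd  = -1; // --threads-mtmd (-tm)
};

// Resolve the thread count for multimodal encoding: -tm, then -tb, then -t.
static int resolve_mtmd_threads(const thread_opts_sketch & o) {
    assert(o.n_threads > 0);
    if (o.n_threads_mtmd > 0) {
        return o.n_threads_mtmd;
    }
    if (o.n_threads_batch > 0) {
        return o.n_threads_batch;
    }
    return o.n_threads;
}
```

For example, `-t 1 -tm 16` resolves to 16 threads for mmproj encoding while the LLM keeps a single CPU thread.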
AI: ubergarm/Qwen3.6-27B-GGUF MTP IQ4_KS 15.113 GiB (4.752 BPW) + pi.dev
This patch addresses two latent bugs in examples/speculative/speculative.cpp
that prevent llama-speculative.exe from running on greedy sampling
(temp=0) or producing rejection-sampling output (temp>0):
1. Line 191: `params.sparams.grammar = { COMMON_GRAMMAR_TYPE_NONE, "" };`
invokes `common_grammar(type, grammar)` which asserts
`type != NONE || !grammar.empty()`. Both conditions fail with the
intended-to-be-empty grammar, so every speculative run hits a hard
`GGML_ASSERT` in common/sampling.h:63 immediately after model load.
Fix: default-construct via `common_grammar{}` to bypass the
field-init constructor.
2. Lines 293-294: `GGML_ASSERT(dist_tgt.sorted)` and
`GGML_ASSERT(dist_dft.sorted)` fire whenever the draft sampler does
not set the .sorted flag (which is most modern sampler paths).
Comment them out — the next ~10 lines re-sort both distributions
by id explicitly, so the assertion is incorrect anyway.
Fix: replace the asserts with an explanatory comment.
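The first bug can be illustrated with a simplified stand-in for the grammar type. The names below are hypothetical and plain `assert` stands in for `GGML_ASSERT`; only the asserted condition is taken from the description above.

```cpp
#include <cassert>
#include <string>
#include <utility>

enum grammar_type_sketch { GRAMMAR_NONE, GRAMMAR_USER };

// Simplified stand-in for the grammar struct described in item 1.
struct grammar_sketch {
    grammar_type_sketch type = GRAMMAR_NONE;
    std::string         text;

    // Default construction yields the empty grammar without any check.
    grammar_sketch() = default;

    // Field-init constructor: rejects the (NONE, "") combination, which is
    // exactly what the original call site passed.
    grammar_sketch(grammar_type_sketch t, std::string g) : type(t), text(std::move(g)) {
        assert(type != GRAMMAR_NONE || !text.empty());
    }
};
```

Under this model `grammar_sketch{}` is the safe way to express "no grammar", while `grammar_sketch(GRAMMAR_NONE, "")` trips the assertion.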
After both fixes, `llama-speculative.exe` runs to completion. The
acceptance-rate measurement at temp=0 still looks suspicious (0%
across same-family draft/target pairs), but that is a different
issue out of scope for this PR.
Tested on Qwen3-0.6B-IQ4_XS drafting Qwen3-1.7B-IQ4_XS, both base
models from `bartowski/Qwen_Qwen3-*-GGUF` on Windows + ik_llama.cpp
build at HEAD of windows-mingw-default-win10 (which is itself a
follow-up to PR #1755).