llama.cpp

mirror of https://github.com/ggml-org/llama.cpp.git synced 2026-06-27 23:50:20 -05:00

Author	SHA1	Message	Date
Pascal	1a87dcdc45	server + ui: SSE Replay Buffer (#23226 ) * server: SSE replay buffer, survives client disconnect Opt in on POST /v1/chat/completions when the client sends X-Stream-Resume: 1 and a non empty X-Conversation-Id. The conv id is the session identity end to end, no extra opaque token. The drain runs detached server side and buffers SSE bytes, the generation survives HTTP disconnect, F5, or lets users switch from iOS Safari to another app without losing the actively generated response. Routes: GET /v1/stream/<conv_id>?from=N replay GET /v1/streams[?conversation_id=X] list, drives sidebar spinners DELETE /v1/stream/<conv_id> Stop, idempotent Router parent fans out to children for list and delete, probes on GET to route to the owner, fans out DELETE on POST so "one session per conv" holds across model swaps. WebUI: the layout snapshots /v1/streams at mount and on visibilitychange, the sidebar reflects live inferences across all convs. The chat page reattaches on mount, append vs fresh is detected from existing content so continue mid stream keeps its prefix. update_slots: on llama_memory_seq_rm refusal at a deep position, full clear of the seq and reprefill from zero instead of GGML_ABORT. OAI strict path unchanged when the opt in headers are absent. * server: create stream session only after post_tasks succeeds * server, ui: drop X-Stream-Resume, X-Conversation-Id alone enables the replay buffer * server: drop magic 17, derive the X-Conversation-Id header length from sizeof at build time * refactor: address review feedback from ngxson * server-context: cleaning * server-stream: fix use-after-free on rd Guard stop_producer with a shared alive flag, flipped by on_stream_end before rd dies. Prevents a late cancel (session eviction by a later POST on the same conv_id, or a DELETE arriving after the producer ended) from touching a destroyed rd. 
* ui: fix cross-conversation contamination Scope streaming flags per conv so one finishing does not unflag the others, guard discoverActiveStream against concurrent runs to avoid duplicate attaches, and stop racing syncRemoteRunningStreams for the sidebar set. * server-http: keep request alive in detached SSE drain The response next() lambda may reach into request via &req long after on_complete reset the request shared_ptr. Capture request in the detached thread so it outlives the drain. ui: address review feedback from coder543 Forward Authorization to /v1/stream and /v1/streams fetches, the resumable routes must obey --api-key like the rest of the API. Wrap reader.read() in a try/catch, the underlying connection drop rejects with TypeError instead of resolving done=true, treat it as a premature end of stream so the existing resume loop kicks in. Freeze the model at session start in chatStreamingStates.model and thread it through cancel and resume, the dropdown selection may have changed since the POST and the server side identity is fixed at that time. * format * ui: remove unused selectedModelName * server-stream: poll session->is_cancelled() in stream_aware_should_stop Address review feedback from coder543. The cancel propagation through rd.stop() relies on the slot eventually processing the cancel task and posting a result that notifies the recv condvar, remove_waiting_task_ids does not notify directly. Add a defensive poll on session->is_cancelled() so the producer-side next() loop exits on its next iteration after cancel() without waiting for the cancel task to round trip through a slot. * server-stream, ui: replace GET /v1/streams with POST /v1/streams/lookup Address review feedback from coder543. Listing live sessions leaks the conversation_id of every concurrent user, which defeats the random UUID unguessability. 
The new route takes {conversation_ids: [...]} in the body and returns matches only for the ids the caller already owns, so foreign UUIDs stay private. The router fans out the same POST to every child and aggregates, the WebUI passes the convs visible in its sidebar. * ui: read conv ids from IndexedDB in syncRemoteRunningStreams The conversations store is not hydrated yet at +layout onMount, so the sidebar spinners stayed off for background convs until the user clicked on them. Read straight from the DB to dodge the init race. * server-models: deduplicate stream lookup timeouts behind one constant * ui: extract visibility kick grace into a stream constant, bump to 1000 ms * make it safer & more simple * server-stream: survive client disconnect via stream_pipe::finish_producer After the RAII rewrite the generation stopped the moment the client disconnected. httplib bails its content provider on the is_peer_alive check at the top of write_content_chunked, so returning true from the provider never keeps it producing: the response resets, rd is destroyed and its task gets cancelled. Reinstate the disconnect survival inside the pipe. stream_pipe gains finish_producer, which pumps the response next() into the ring buffer until the generation ends, and mark_producer_done for the clean wire end. server-http only triggers them: mark before sink.done on a clean close, finish in on_complete when the peer left early. No detach, no stream logic in server-http beyond the trigger, and the strict OAI path is untouched when no pipe is attached. Known limitation: finish_producer pumps synchronously on the http worker, so a disconnected stream keeps its worker busy until the generation ends. A follow-up will move the drain off the http worker so no worker is held. 
* server-stream: drain disconnected streams on a manager owned thread The previous commit pumped the post disconnect drain synchronously in on_complete, on the http worker, so a disconnected stream kept its worker busy until the generation ended. Under a wave of reloads or tab closes that pins workers from the pool. Move the drain off the http worker. on_complete now hands the response to stream_session_manager::adopt_orphan, which pumps it to completion on a manager owned thread and releases the worker at once. One thread per disconnected stream still generating, stored in a list, joined and reaped on the next adopt, by the GC, and at shutdown. No detach, the thread lifecycle is fully owned by the manager. needs_drain gates the handoff so a cleanly finished stream never spawns a thread, and the strict OAI path stays untouched when no pipe is attached. stop_gc now cancels sessions before finalizing them, so an in flight drain sees is_cancelled and exits instead of blocking the shutdown join until the generation ends naturally. * ui: add missing JSDoc * server-stream: drain on the http worker, drop the manager thread Address @ngxson review: httplib runs a large dynamic pool and a worker blocked in next() sits on a condvar instead of burning cpu, so draining the rest of the generation on that worker is fine and much simpler than a dedicated thread. on_complete calls finish_producer directly again. Removes adopt_orphan, the orphan thread list and its reaping, the stop_gc session cancel that only existed to unblock those threads, and the now dead drain_shutdown flag. * server-stream: split stream_pipe into producer and consumer classes Address @ngxson review: one class covering both ends was messy. stream_pipe is now a base holding the session and is_cancelled, with stream_pipe_producer (write, mark_producer_done, finish_producer, cleanup, finalizes on destruct) and stream_pipe_consumer (read only, no finalize) deriving from it. 
Drops the is_producer_ discriminator and its runtime guards, the type now encodes the role. res.spipe is retyped to shared_ptr<stream_pipe_producer> since it is only ever a producer. No behavior change. * server-stream: rename producer methods to unix pipe semantics Address @ngxson review: mark_producer_done becomes done(), finish_producer becomes close(), matching a unix pipe write end. The producer_done_ member follows as done_. write() is unchanged. No behavior change. * server, ui: route resumable streams via a conv map, persist resume identity Address ngxson review: drop the polling probe, proxy_post records a conv_id -> model map and the stream routes resolve the owning child with one lookup. The map is the single source of truth, the ::model suffix stays for child session uniqueness but the router never parses it. UI: the server keys a session by the POST time identity (conv::model), but reload probed with the bare conv id and missed model tagged sessions, so F5 stopped the stream and sidebar spinners stayed off. Persist the model and rebuild the exact identity on resume, single conv and bulk sidebar both send it. Add unit coverage for the identity round trip. * ui: resolve continue target by id to stop cross-conversation flash on switch * ui: skip stream resume when the abort is intentional * server: move the conv id to model map into a self contained tracker Address review from ngxson: server_models held two mutexes side by side, the global one and a bare conv_model_mu guarding a loose map, which made the locking hard to follow. Wrap the map and its lock in a small conv_model_tracker struct that owns its mutex, one mutex per struct. The remember, lookup and forget methods move inline into the tracker, server_models exposes a single conv_models member and the routes call models.conv_models.lookup and friends. No behavior change, the map stays the single source of truth for routing resumable streams to a child. 
* ui: replace stream magic values with enums and shared constants Address review from allozaur: lift the inline literals around the resumable stream code into named symbols so the intent is explicit and reusable. * ui: fold the stream resume and discovery helpers into ChatService Address review from allozaur: drop the two standalone stream-.service files. They were used only by the chat service and store, carried no shared state, and did not follow the static class pattern the other services use, so a separate abstraction was not warranted. Move the helpers onto ChatService as static methods. No behavior change, tests now exercise them through ChatService. docs: document the SSE replay buffer in server README-dev Add the resumable streaming section, list stream_session_manager in the backend component inventory, and link PR 23226 in the related PRs. * ui: align attachServerStream call with onCompletionId param in handleStreamResponse * server-http: rename del_ to del to match get and post * ui: address review feedback from allozaur * ui: drop duplicate SSE constants, keep sse.ts canonical * ui: use svelte:document for the visibilitychange listener address review from allozaur: replace the manual document.addEventListener in onMount with a declarative <svelte:document onvisibilitychange>. svelte handles attach, detach and SSR, so the typeof document guard and the onMount cleanup go away. onMount keeps only the first load snapshot. * server: trim redundant stream drain comments Address review from ngxson * server: balance and clean up stream comments remove redundant comments and tighten the verbose ones across the resumable stream code, keeping the concurrency and lifetime rationale that is not obvious from the code. 
also fix two stale comments in server.cpp and server-models.h that still described the old ::model suffix probe and fan out routing, now replaced by the conv_id -> model map Address review from ngxson * ui: balance and clean up stream comments dedup repeated rationale (frozen conv::model identity, the lookup privacy note, the abort patterns) down to one canonical spot, tighten the verbose blocks, and keep the concurrency and resume-offset reasoning. fix stale comments in stream-identity.ts and chat.service.ts that still described the old loopback probe and fan out routing, now the conv_id -> model map. --------- Co-authored-by: Xuan Son Nguyen <son@huggingface.co>	2026-06-26 09:31:29 +02:00
Xuan-Son Nguyen	e9d1b76d0a	server: use status code 403 for disabled features (#24970) * server: use status code 403 for disabled features * cont * fix test case	2026-06-25 16:36:40 +02:00
Xuan-Son Nguyen	60bc8866b1	common: refactor model handling (#24980) * common: refactor models handling * remote preset * cont * rm skip_download option * missing header * fix plan.model_files * fix --offline case * move hf_plan to download * refactor * rm redundant curr_ex, add comments * adapt	2026-06-25 15:17:51 +02:00
Xuan-Son Nguyen	75ad0b23ed	server: fix remote preset handling, add test (#24938) * server: add test for remote preset * fix remote preset handling * fix * fix test	2026-06-23 13:28:34 +02:00
Xuan-Son Nguyen	721354fbdf	server: (router) move model downloading to dedicated process (#24834) * server: real-time model load progress tracking via /models/sse * update docs * server: move model download to child process * rm unused * fix most problems * clean up * nit fixes * fix test case * do not detach() thread * shorter MODEL_DOWNLOAD_TIMEOUT in test * throttle	2026-06-22 18:24:04 +02:00
Xuan-Son Nguyen	2b686a9120	server: refactor child --> router communication (#24821) * server: refactor child --> router communication * fix wakeup case * add docs * improve update_status() * nits	2026-06-20 01:02:26 +02:00
Xuan-Son Nguyen	8c2d6f6475	server: add --agent arg, remove redundant webui naming compat (#24801) * server: add --agent arg, remove redundant webui naming compat * correct env * fix the test * llama-gen-docs * nits: wordings	2026-06-19 16:06:13 +02:00
Xuan-Son Nguyen	552258c535	server: (router) rework -hf preset repo (#24739) * server: temporarily remove HF remote preset * rework remove preset.ini support * rm unused get_remote_preset_whitelist() * print warning * add docs * rm stray file	2026-06-18 12:45:23 +02:00
Xuan-Son Nguyen	4b4d13ae72	server: (router) add model management API (#23976) * wip * server: (router) add SSE realtime updates API * nits * wip * add download API * add download api * update docs * add delete endpoint * fix std::terminate * fix crash * fix 2 * add tests * nits	2026-06-17 18:04:58 +02:00
Xuan-Son Nguyen	18ef86ecec	server: skip unused log lines on router mode (#24463)	2026-06-11 11:36:35 +02:00
Xuan-Son Nguyen	f5c6ae1827	mtmd, server: add "placeholder bitmap" for counting tokens, add /input_tokens API (#23913) mtmd: add "placeholder bitmap" for counting tokens w/o preprocessing * fast path skip preproc for placeholder * fix build * correct the api * add server endpoint + tests * add object name * update docs * add proxy handling * fix build * fix audio input path * use is_placeholder in process_mtmd_prompt() * nits * nits (2) * docs: clarify chat/completions/input_tokens is not official * fix merge problem	2026-06-06 11:06:51 +02:00
Pascal	354ebac8cb	server: real-time reasoning interruption via control endpoint (#23971 ) * server: real-time reasoning interruption via control endpoint Builds on the manual reasoning budget trigger from #23949. Adds a CONTROL task that mirrors the CANCEL path on the live slot and calls common_sampler_reasoning_budget_force to end thinking mid-generation. POST /v1/chat/completions/control with { id_slot, action }, opt-in reasoning_control arms the budget sampler on demand. Router and single model. Minimal WebUI button as a skeleton for further UI work. * ui: track reasoning phase via explicit streaming state Add isReasoning to the chat store, mirroring the isLoading pattern: per conversation map, private setter, public accessor and reactive export. Set from the stream callbacks, true on reasoning chunks, false on the first content chunk, reset on stream end and resynced on conversation switch. The skip button now keys off isReasoning so it shows only during the thinking phase, not the whole generation. * ui: extract control endpoint and action into constants Move the chat completion routes, the slots route and the reasoning control action out of chat.service into api-endpoints and a dedicated control-actions module. No behavior change, drops the magic strings so the control protocol has a single source of truth. * server: target reasoning control by completion id Address @ngxson review on the control endpoint. Switch from id_slot to the chat completion id to avoid a TOCTOU: the slot can be reassigned between the lookup and the control request, so matching the live completion (oaicompat_cmpl_id) is safe and a finished one simply matches nothing. Rename the action to reasoning_end, guard it on the reasoning_control flag of the target slot, and reduce the response to {success} with an optional message. * ui: target reasoning control by completion id Keep the streamed completion id on the message and post it back to the control endpoint instead of probing /slots. 
Drops the slot discovery and the TOCTOU that came with it. Action renamed to reasoning_end, response read as {success}. * server: address review from @ngxson Move the control fields into task_params and drop the redundant comments on the control path. * server: document the reasoning control endpoint * Update tools/ui/src/lib/types/database.d.ts Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> * ui: rename cmplId to completionId Per @allozaur review, clearer name for the streamed completion id. * ui: wire completion id capture through the agentic flow The webui streams through the agentic flow, which relayed onModel but not onCompletionId, so the completion id never reached the message and the control request was never sent. Relay it through the flow and its callbacks type, declare id on the chunk type, and log an explicit error when the button fires without a usable id. * ui: target reasoning control model from the message The model is a property of the completion, so read it from the streaming message like the id, not from the model dropdown which is unrelated UI state. Makes the request self-consistent by construction instead of just unlikely to drift. --------- Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>	2026-06-02 07:26:20 +02:00
Adrien Gallouët	29f1482221	app : introduce the llama unified executable (#23296) * app : introduce the llama unified executable Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Use serve for server Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Hide completion and bench, add help command Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Remove STATIC Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Use -impl targets instead of -lib Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Revert "Remove STATIC" This reverts commit cc44caccb9902b34a3531633edac911e5b3d65cd. --------- Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2026-05-20 13:22:22 +02:00
Pascal	64b38b561b	server: skip device enumeration in router mode to avoid creating CUDA primary context (#23137)	2026-05-16 21:21:06 +02:00
Aleksander Grygier	59778f0196	ui: Restructure repo to use `tools/ui` folder and `ui` / `UI` / `llama-ui` / `LLAMA_UI` naming (#23064 ) * webui: Move static build output from `tools/server/public` to `build/ui` directory * refactor: Move to `tools/ui` * refactor: rename CMake variables and preprocessor defines - Rename LLAMA_BUILD_WEBUI -> LLAMA_BUILD_UI (old kept as deprecated) - Rename LLAMA_USE_PREBUILT_WEBUI -> LLAMA_USE_PREBUILT_UI (old kept as deprecated) - Backward compat: old vars auto-forward to new ones with DEPRECATION warning - Rename internal vars: WEBUI_SOURCE -> UI_SOURCE, WEBUI_SOURCE_DIR -> UI_SOURCE_DIR, etc. - Rename HF bucket: LLAMA_WEBUI_HF_BUCKET -> LLAMA_UI_HF_BUCKET - Emit both LLAMA_BUILD_WEBUI and LLAMA_BUILD_UI preprocessor defines - Emit both LLAMA_WEBUI_DEFAULT_ENABLED and LLAMA_UI_DEFAULT_ENABLED * refactor: rename CLI flags (--webui -> --ui) with backward compat - Add --ui/--no-ui (old --webui/--no-webui kept as deprecated aliases) - Add --ui-config (old --webui-config kept as deprecated alias) - Add --ui-config-file (old --webui-config-file kept as deprecated alias) - Add --ui-mcp-proxy/--no-ui-mcp-proxy (old --webui-mcp-proxy kept as deprecated) - Add new env vars: LLAMA_ARG_UI, LLAMA_ARG_UI_CONFIG, LLAMA_ARG_UI_CONFIG_FILE, LLAMA_ARG_UI_MCP_PROXY - C++ struct fields: params.ui, params.ui_config_json, params.ui_mcp_proxy added alongside old fields - Backward compat: old fields synced to new ones in g_params_to_internals * refactor: update C++ server internals with backward compat - Rename json_webui_settings -> json_ui_settings (both kept in server_context_meta) - Rename params.webui usage -> params.ui (both synced, old still works) - JSON API emits both "ui"/"ui_settings" and "webui"/"webui_settings" keys - Server routes use params.ui_mcp_proxy \|\| params.webui_mcp_proxy - Preprocessor guards use #if defined(LLAMA_BUILD_UI) \|\| defined(LLAMA_BUILD_WEBUI) * refactor: rename CI/CD workflows, artifacts, and build script - Rename 
webui-build.yml -> ui-build.yml; artifact webui-build -> ui-build - Rename webui-publish.yml -> ui-publish.yml; var HF_BUCKET_WEBUI_STATIC_OUTPUT -> HF_BUCKET_UI_STATIC_OUTPUT - Rename server-webui.yml -> server-ui.yml; job webui-build/checks -> ui-build/checks - Update server.yml: job/artifact refs webui-build -> ui-build - Update release.yml: all webui-build/publish refs -> ui-build/publish; HF_TOKEN_WEBUI_STATIC_OUTPUT -> HF_TOKEN_UI_STATIC_OUTPUT - Update server-self-hosted.yml: webui-build -> ui-build - Update build-self-hosted.yml: HF_WEBUI_VERSION -> HF_UI_VERSION - Rename webui-download.cmake -> ui-download.cmake (internal refs updated) - Update labeler.yml: server/webui -> server/ui path label * docs: update CODEOWNERS and server README docs - Update CODEOWNERS: team ggml-org/llama-webui -> ggml-org/llama-ui, path /tools/server/webui/ -> /tools/ui/ - Update server README.md: CLI tables show --ui flags with deprecated --webui aliases - Update server README-dev.md: "WebUI" -> "UI", paths updated to tools/ui/ * fix: Small fixes for UI build * fix: CMake.txt syntax * chore: Formatting * fix: `.editorconfig` for llama-ui * chore: Formatting * refactor: Use `APP_NAME` in Error route * refactor: Cleanup * refactor: Single migration service * make llama-ui a linkable target * fix: UI Build output * fix: Missing change * fix: separate llama-ui npm build output into build/tools/ui/dist subfolder + use cmake npm build instead of downloading ui-build.yml artifacts in CI * refactor: UI workflows cleanup --------- Co-authored-by: Xuan Son Nguyen <son@huggingface.co>	2026-05-16 02:02:40 +02:00
Georgi Gerganov	67b2b7f2f2	logs : reduce (#23021) * logs : reduce * args : fix envs * server : fix build * common : print verbosity level at start * server : clean-up logs * server : print prompt processing timings + sampling params * minor : whitespaces	2026-05-14 13:05:52 +03:00
Xuan-Son Nguyen	29debb3a6a	server: support Vertex AI compatible API (#22545) * server: support Vertex AI compatible API * a bit safer * support other AIP_* env var * various fixes * if AIP_MODE is unset, do nothing * fix test case * fix windows build	2026-05-08 15:23:04 +02:00
Xuan-Son Nguyen	9dcf835528	server: (router) expose child model info from router's /v1/models (#22683) * server: (router) expose child model info from router's /v1/models * update docs	2026-05-08 14:42:15 +02:00
Georgi Gerganov	2bacb1eb77	server : validate --tools CLI argument against known tool names (#22538) Previously, unknown tool names passed via --tools were silently ignored. Now the server validates each tool name at startup and exits with an error if an unrecognized tool is specified, listing the available tools. Assisted-by: llama.cpp:local pi	2026-05-05 06:35:27 +03:00
Georgi Gerganov	cfe9838d26	fit-params : refactor + add option to output estimated memory per device (#22171) * fit-params : add option to output estimated memory per device * cont : minor * cont : refactor * cont : move fit params implementation to libcommon * cont : header * cont : headers * cont : codeowners	2026-04-21 09:54:36 +03:00
Georgi Gerganov	cf8b0dbda9	server : remove /api endpoints (#22165) * server : remove /api endpoints * cont : remove /api/tags	2026-04-20 20:41:19 +03:00
Georgi Gerganov	6990e2f1f7	libs : rename libcommon -> libllama-common (#21936) * cmake : allow libcommon to be shared * cmake : rename libcommon to libllama-common * cont : set -fPIC for httplib * cont : export all symbols * cont : fix build_info exports * libs : add libllama-common-base * log : add common_log_get_verbosity_thold()	2026-04-17 11:11:46 +03:00
Xuan-Son Nguyen	e489a5ca0e	server: support OAI /v1/audio/transcriptions API (#21863) * server: support OAI /v1/audio/transcriptions API * address autoreview comments * correct default response_format value	2026-04-14 11:09:52 +02:00
ddh0	5d3a4a7da5	server : fix logging of build + system info (#21460) This PR changes the logging that occurs at startup of llama-server. Currently, it is redundant (including CPU information twice) and it is missing the build + commit info.	2026-04-05 16:14:02 +02:00
Adrien Gallouët	41361c8599	common : move up common_init() and fix Windows UTF-8 logs (#21176) The build info is now only for debug, so we avoid the duplicate with `--version`. The UTF-8 setup at the beginning is needed to avoid logging garbage on Windows. Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2026-03-31 12:53:41 +02:00
Xuan-Son Nguyen	20197b6fe3	server: add built-in tools backend support (#20898) * wip: server_tools * refactor * displayName -> display_name * snake_case everywhere * rm redundant field * change arg to --tools all * add readme mention * llama-gen-docs	2026-03-27 10:07:11 +01:00
Xuan-Son Nguyen	49bfddeca1	server: allow router to report child instances sleep status (#20849) * server: allow router to report child instances sleep status * refactor * move sleeping to state * nits	2026-03-22 18:33:52 +01:00
Aleksander Grygier	f6235a41ef	webui: Agentic Loop + MCP Client with support for Tools, Resources and Prompts (#18655)	2026-03-06 10:00:39 +01:00
SamareshSingh	cb8f4fa3f8	Fix locale-dependent float printing in GGUF metadata (#17331) * Set C locale for consistent float formatting across all binaries. * Add C locale setting to all tools binaries Add std::setlocale(LC_NUMERIC, "C") to all 16 binaries in the tools/ directory to ensure consistent floating-point formatting. * Apply suggestion from @JohannesGaessler --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2026-03-04 09:30:40 +01:00
Sami Kama	5596a35791	server: Mirroring /v1/responses to /responses to match /v1/chat/completions pattern (#19873)	2026-02-28 00:44:42 +08:00
Pascal	2e7e638523	server : support multiple model aliases via comma-separated --alias (#19926) * server : support multiple model aliases via comma-separated --alias * server : update --alias description and regenerate docs * server : multiple model aliases and tags - address review feedback from ngxson - --alias accepts comma-separated values (std::set, no duplicates) - --tags for informational metadata (not used for routing) - aliases resolve transparently in router via get_meta/has_model - /v1/models exposes aliases and tags fields * regenerate docs * nits * server : use first alias as model_name for backward compat address review feedback from ngxson * server : add single-model test for aliases and tags	2026-02-27 07:05:23 +01:00
손희준	fbbf3ad190	server: /v1/responses (partial) (#18486) * from previous PR * Make instruction(system) as first message * Convert [input_message] (text/image/file) * Rename convert_responses_to_chatcmpl(body) -> response_body * Initial tool call support * Erase instructions field from chatcmpl body * Feed reasoning texts to chat template * Use std::vector instead of opaque json array * Make output_item.added events consistent * Move `server_task_result_cmpl_partial::update` from header to source * Match ID of output_item.added and .done events * Add function_call only if there is no "fc_" prefix * Add function call output at non-streaming API * Test if ID is persistent * Add doc * Fix style - use trailing comma * Rewrite state management * catch up with upstream/master * Fix style - "type" is the first item of SSE data * Explicitly check "instructions" from response_body * Make lambdas static * Check if reasoning content exists * Add `oai_resp_id` to task_result_state(also initialized at ctor), server_task_result_cmpl_partial, and server_task_result_cmpl_final * Reject `input_file` since it is not supported by chatcmpl * Add "fc_" prefix to non-streaming function call id as coderabbit pointed out --------- Co-authored-by: openingnow <>	2026-01-21 17:47:23 +01:00
Vladislav Sayapin	da143b9940	server : fix router child env in containerized environments (#18562)	2026-01-05 14:12:05 +01:00
Xuan-Son Nguyen	6ce863c803	server: prevent data race from HTTP threads (#18263 ) * server: prevent data race from HTTP threads * fix params * fix default_generation_settings * nits: make handle_completions_impl looks less strange * stricter const * fix GGML_ASSERT(idx < states.size()) * move index to be managed by server_response_reader * http: make sure req & res lifecycle are tied together * fix compile * fix index handling buggy * fix data race for lora endpoint * nits: fix shadow variable * nits: revert redundant changes * nits: correct naming for json_webui_settings	2025-12-22 14:23:34 +01:00
Xuan-Son Nguyen	ddcb75dd8a	server: add auto-sleep after N seconds of idle (#18228 ) * implement sleeping at queue level * implement server-context suspend * add test * add docs * optimization: add fast path * make sure to free llama_init * nits * fix use-after-free * allow /models to be accessed during sleeping, fix use-after-free * don't allow accessing /models during sleep, it is not thread-safe * fix data race on accessing props and model_meta * small clean up * trailing whitespace * rm outdated comments	2025-12-21 02:24:42 +01:00
Pascal	6ce3d85796	server: (webui) add --webui-config (#18028 ) * server/webui: add server-side WebUI config support Add CLI arguments --webui-config (inline JSON) and --webui-config-file (file path) to configure WebUI default settings from server side. Backend changes: - Parse JSON once in server_context::load_model() for performance - Cache parsed config in webui_settings member (zero overhead on /props) - Add proper error handling in router mode with try/catch - Expose webui_settings in /props endpoint for both router and child modes Frontend changes: - Add 14 configurable WebUI settings via parameter sync - Add tests for webui settings extraction - Fix subpath support with base path in API calls Addresses feedback from @ngxson and @ggerganov * server: address review feedback from ngxson * server: regenerate README with llama-gen-docs	2025-12-17 21:45:45 +01:00
Xuan-Son Nguyen	bde461de8c	server: (router) allow child process to report status via stdout (#18110) * server: (router) allow child process to report status via stdout * apply suggestions	2025-12-17 14:54:11 +01:00
yifant-code	59977eba7b	server: fix crash when batch > ubatch with embeddings (#17912 ) * server: fix crash when batch > ubatch with embeddings (#12836) Fixes #12836 where the server crashes with GGML_ASSERT failure when running with embeddings enabled and n_batch > n_ubatch. Root cause: Embeddings use non-causal attention which requires all tokens to be processed within a single ubatch. When n_batch > n_ubatch, the server attempts to split processing, causing assertion failure. Solution: - Add parameter validation in main() after common_params_parse() - When embeddings enabled and n_batch > n_ubatch: * Log warnings explaining the issue * Automatically set n_batch = n_ubatch * Prevent server crash This follows the approach suggested by @ggerganov in issue #12836. Note: This supersedes stalled PR #12940 which attempted a runtime fix in the old examples/server/server.cpp location. This implementation validates at startup in tools/server/server.cpp (current location). Testing: - Build: Compiles successfully - Validation triggers: Warns when -b > -ub with --embedding - Auto-correction works: Adjusts n_batch = n_ubatch - No false positives: Valid params don't trigger warnings - Verified on macOS M3 Pro with embedding model * Update tools/server/server.cpp --------- Co-authored-by: ytian218 <ytian218@bloomberg.net> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-12-16 14:27:36 +02:00
Xuan-Son Nguyen	7b1db3d3b7	arg: clarify auto kvu/np being set on server (#17997 ) * arg: clarify auto kvu/np being set on server * improve docs * use invalid_argument	2025-12-16 12:01:27 +01:00
Xuan-Son Nguyen	13628d8bdb	server: add --media-path for local media files (#17697 ) * server: add --media-path for local media files * remove unused fn	2025-12-02 22:49:20 +01:00
Chad Voegele	c4357dcc35	Server: Change Invalid Schema from Server Error (500) to User Error (400) (#17572 ) * Make invalid schema a user error (400) * Move invalid_argument exception handler to ex_wrapper * Fix test * Simplify test back to original pattern	2025-12-02 17:33:50 +01:00
Xuan-Son Nguyen	ec18edfcba	server: introduce API for serving / loading / unloading multiple models (#17470 ) * server: add model management and proxy * fix compile error * does this fix windows? * fix windows build * use subprocess.h, better logging * add test * fix windows * feat: Model/Router server architecture WIP * more stable * fix unsafe pointer * also allow terminate loading model * add is_active() * refactor: Architecture improvements * tmp apply upstream fix * address most problems * address thread safety issue * address review comment * add docs (first version) * address review comment * feat: Improved UX for model information, modality interactions etc * chore: update webui build output * refactor: Use only the message data `model` property for displaying model used info * chore: update webui build output * add --models-dir param * feat: New Model Selection UX WIP * chore: update webui build output * feat: Add auto-mic setting * feat: Attachments UX improvements * implement LRU * remove default model path * better --models-dir * add env for args * address review comments * fix compile * refactor: Chat Form Submit component * ad endpoint docs * Merge remote-tracking branch 'webui/allozaur/server_model_management_v1_2' into xsn/server_model_maagement_v1_2 Co-authored-by: Aleksander <aleksander.grygier@gmail.com> * feat: Add copy to clipboard to model name in model info dialog * feat: Model unavailable UI state for model selector * feat: Chat Form Actions UI logic improvements * feat: Auto-select model from last assistant response * chore: update webui build output * expose args and exit_code in API * add note * support extra_args on loading model * allow reusing args if auto_load * typo docs * oai-compat /models endpoint * cleaner * address review comments * feat: Use `model` property for displaying the `repo/model-name` naming format * refactor: Attachments data * chore: update webui build output * refactor: Enum imports * feat: Improve Model Selector 
responsiveness * chore: update webui build output * refactor: Cleanup * refactor: Cleanup * refactor: Formatters * chore: update webui build output * refactor: Copy To Clipboard Icon component * chore: update webui build output * refactor: Cleanup * chore: update webui build output * refactor: UI badges * chore: update webui build output * refactor: Cleanup * refactor: Cleanup * chore: update webui build output * add --models-allow-extra-args for security * nits * add stdin_file * fix merge * fix: Retrieve lost setting after resolving merge conflict * refactor: DatabaseStore -> DatabaseService * refactor: Database, Conversations & Chat services + stores architecture improvements (WIP) * refactor: Remove redundant settings * refactor: Multi-model business logic WIP * chore: update webui build output * feat: Switching models logic for ChatForm or when regenerating messges + modality detection logic * chore: update webui build output * fix: Add `untrack` inside chat processing info data logic to prevent infinite effect * fix: Regenerate * feat: Remove redundant settigns + rearrange * fix: Audio attachments * refactor: Icons * chore: update webui build output * feat: Model management and selection features WIP * chore: update webui build output * refactor: Improve server properties management * refactor: Icons * chore: update webui build output * feat: Improve model loading/unloading status updates * chore: update webui build output * refactor: Improve API header management via utility functions * remove support for extra args * set hf_repo/docker_repo as model alias when posible * refactor: Remove ConversationsService * refactor: Chat requests abort handling * refactor: Server store * tmp webui build * refactor: Model modality handling * chore: update webui build output * refactor: Processing state reactivity * fix: UI * refactor: Services/Stores syntax + logic improvements Refactors components to access stores directly instead of using exported getter functions. 
This change centralizes store access and logic, simplifying component code and improving maintainability by reducing the number of exported functions and promoting direct store interaction. Removes exported getter functions from `chat.svelte.ts`, `conversations.svelte.ts`, `models.svelte.ts` and `settings.svelte.ts`. * refactor: Architecture cleanup * feat: Improve statistic badges * feat: Condition available models based on modality + better model loading strategy & UX * docs: Architecture documentation * feat: Update logic for PDF as Image * add TODO for http client * refactor: Enhance model info and attachment handling * chore: update webui build output * refactor: Components naming * chore: update webui build output * refactor: Cleanup * refactor: DRY `getAttachmentDisplayItems` function + fix UI * chore: update webui build output * fix: Modality detection improvement for text-based PDF attachments * refactor: Cleanup * docs: Add info comment * refactor: Cleanup * re * refactor: Cleanup * refactor: Cleanup * feat: Attachment logic & UI improvements * refactor: Constants * feat: Improve UI sidebar background color * chore: update webui build output * refactor: Utils imports + move types to `app.d.ts` * test: Fix Storybook mocks * chore: update webui build output * test: Update Chat Form UI tests * refactor: Tooltip Provider from core layout * refactor: Tests to separate location * decouple server_models from server_routes * test: Move demo test to tests/server * refactor: Remove redundant method * chore: update webui build output * also route anthropic endpoints * fix duplicated arg * fix invalid ptr to shutdown_handler * server : minor * rm unused fn * add ?autoload=true\|false query param * refactor: Remove redundant code * docs: Update README documentations + architecture & data flow diagrams * fix: Disable autoload on calling server props for the model * chore: update webui build output * fix ubuntu build * fix: Model status reactivity * fix: Modality 
detection for MODEL mode * chore: update webui build output --------- Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-12-01 19:41:04 +01:00
Xuan-Son Nguyen	ab49f094d2	server: move server-context to its own cpp|h (#17595 ) * git mv * add server-context.h * add server-context.h * clean up headers * cont : cleanup * also expose server_response_reader (to be used by CLI) * fix windows build * decouple server_routes and server_http --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-11-29 22:04:44 +01:00
o7si	3ce7a65c2f	server: fix: /metrics endpoint returning JSON-escaped Prometheus format (#17386 ) * fix: /metrics endpoint returning JSON-escaped Prometheus format * mod: remove string overload from ok() method	2025-11-28 19:14:00 +01:00
Fredrik Hultin	ddf9f94389	server : add Anthropic Messages API support (#17570 ) * server : add Anthropic Messages API support * remove -@pytest.mark.slow from tool calling/jinja tests * server : remove unused code and slow/skip on test_anthropic_vision_base64_with_multimodal_model in test_anthropic_api.py * server : removed redundant n field logic in anthropic_params_from_json * server : use single error object instead of error_array in streaming response handler for /v1/chat/completions and use unordered_set instead of set in to_json_anthropic_stream() * server : refactor Anthropic API to use OAI conversion * make sure basic test always go first * clean up * clean up api key check, add test --------- Co-authored-by: Xuan Son Nguyen <son@huggingface.co>	2025-11-28 12:57:04 +01:00
Xuan-Son Nguyen	b8372eecd9	server: split server.cpp code into server/common/task/queue (#17362 ) * add server-task, server-common * add server-queue * rm redundant includes * move enum stop_type to server-task * server : headers cleanup --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-11-24 14:41:53 +01:00
Xuan-Son Nguyen	0de8878c96	server: split HTTP into its own interface (#17216 ) * server: split HTTP into its own interface * move server-http and httplib to its own file * add the remaining endpoints * fix exception/error handling * renaming * missing header * fix missing windows header * fix error responses from http layer * fix slot save/restore handler * fix case where only one stream chunk is returned * add NOMINMAX * do not call sink.write on empty data * use safe_json_to_str for SSE * clean up * add some comments * improve usage of next() * bring back the "server is listening on" message * more generic handler * add req.headers * move the chat template print to init() * add req.path * cont : minor --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-11-17 22:05:44 +01:00
Georgi Gerganov	5b2093becc	server : handle context overflow during decode (#17267 ) * server : handle context overflow during decode * server : minor refactor	2025-11-16 09:23:37 +02:00
Xuan-Son Nguyen	9b17d74ab7	mtmd: add mtmd_log_set (#17268 )	2025-11-14 15:56:19 +01:00
Georgi Gerganov	d396b43748	server : fix "can batch with" bug (#17263 )	2025-11-14 14:03:45 +02:00

156 Commits