26 Commits

Author SHA1 Message Date
Pascal
1a87dcdc45
server + ui: SSE Replay Buffer (#23226)
* server: SSE replay buffer, survives client disconnect

Opt in on POST /v1/chat/completions when the client sends
X-Stream-Resume: 1 and a non empty X-Conversation-Id. The conv id is
the session identity end to end, no extra opaque token. The drain
runs detached server side and buffers SSE bytes, the generation
survives HTTP disconnect, F5, or lets users switch from iOS Safari
to another app without losing the actively generated response.

Routes:
  GET    /v1/stream/<conv_id>?from=N       replay
  GET    /v1/streams[?conversation_id=X]   list, drives sidebar spinners
  DELETE /v1/stream/<conv_id>              Stop, idempotent

Router parent fans out to children for list and delete, probes on GET
to route to the owner, fans out DELETE on POST so "one session per
conv" holds across model swaps.

WebUI: the layout snapshots /v1/streams at mount and on
visibilitychange, the sidebar reflects live inferences across all
convs. The chat page reattaches on mount, append vs fresh is detected
from existing content so continue mid stream keeps its prefix.

update_slots: on llama_memory_seq_rm refusal at a deep position, full
clear of the seq and reprefill from zero instead of GGML_ABORT.

OAI strict path unchanged when the opt in headers are absent.

* server: create stream session only after post_tasks succeeds

* server, ui: drop X-Stream-Resume, X-Conversation-Id alone enables the replay buffer

* server: drop magic 17, derive the X-Conversation-Id header length from sizeof at build time

* refactor: address review feedback from ngxson

* server-context: cleaning

* server-stream: fix use-after-free on rd

Guard stop_producer with a shared alive flag, flipped by on_stream_end
before rd dies. Prevents a late cancel (session eviction by a later
POST on the same conv_id, or a DELETE arriving after the producer
ended) from touching a destroyed rd.

* ui: fix cross-conversation contamination

Scope streaming flags per conv so one finishing does not unflag the
others, guard discoverActiveStream against concurrent runs to avoid
duplicate attaches, and stop racing syncRemoteRunningStreams for the
sidebar set.

* server-http: keep request alive in detached SSE drain

The response next() lambda may reach into *request via &req long
after on_complete reset the request shared_ptr. Capture request in
the detached thread so it outlives the drain.

* ui: address review feedback from coder543

Forward Authorization to /v1/stream and /v1/streams fetches, the resumable routes
must obey --api-key like the rest of the API.

Wrap reader.read() in a try/catch, the underlying connection drop rejects with
TypeError instead of resolving done=true, treat it as a premature end of stream
so the existing resume loop kicks in.

Freeze the model at session start in chatStreamingStates.model and thread it
through cancel and resume, the dropdown selection may have changed since the
POST and the server side identity is fixed at that time.

* format

* ui: remove unused selectedModelName

* server-stream: poll session->is_cancelled() in stream_aware_should_stop

Address review feedback from coder543. The cancel propagation through
rd.stop() relies on the slot eventually processing the cancel task and
posting a result that notifies the recv condvar, remove_waiting_task_ids
does not notify directly. Add a defensive poll on session->is_cancelled()
so the producer-side next() loop exits on its next iteration after
cancel() without waiting for the cancel task to round trip through a slot.

* server-stream, ui: replace GET /v1/streams with POST /v1/streams/lookup

Address review feedback from coder543. Listing live sessions leaks the
conversation_id of every concurrent user, which defeats the random UUID
unguessability. The new route takes {conversation_ids: [...]} in the
body and returns matches only for the ids the caller already owns, so
foreign UUIDs stay private. The router fans out the same POST to every
child and aggregates, the WebUI passes the convs visible in its sidebar.

* ui: read conv ids from IndexedDB in syncRemoteRunningStreams

The conversations store is not hydrated yet at +layout onMount, so the
sidebar spinners stayed off for background convs until the user clicked
on them. Read straight from the DB to dodge the init race.

* server-models: deduplicate stream lookup timeouts behind one constant

* ui: extract visibility kick grace into a stream constant, bump to 1000 ms

* make it safer & more simple

* server-stream: survive client disconnect via stream_pipe::finish_producer

After the RAII rewrite the generation stopped the moment the client
disconnected. httplib bails its content provider on the is_peer_alive
check at the top of write_content_chunked, so returning true from the
provider never keeps it producing: the response resets, rd is destroyed
and its task gets cancelled.

Reinstate the disconnect survival inside the pipe. stream_pipe gains
finish_producer, which pumps the response next() into the ring buffer
until the generation ends, and mark_producer_done for the clean wire
end. server-http only triggers them: mark before sink.done on a clean
close, finish in on_complete when the peer left early. No detach, no
stream logic in server-http beyond the trigger, and the strict OAI path
is untouched when no pipe is attached.

Known limitation: finish_producer pumps synchronously on the http
worker, so a disconnected stream keeps its worker busy until the
generation ends. A follow-up will move the drain off the http worker so
no worker is held.

* server-stream: drain disconnected streams on a manager owned thread

The previous commit pumped the post disconnect drain synchronously in
on_complete, on the http worker, so a disconnected stream kept its
worker busy until the generation ended. Under a wave of reloads or tab
closes that pins workers from the pool.

Move the drain off the http worker. on_complete now hands the response
to stream_session_manager::adopt_orphan, which pumps it to completion on
a manager owned thread and releases the worker at once. One thread per
disconnected stream still generating, stored in a list, joined and
reaped on the next adopt, by the GC, and at shutdown. No detach, the
thread lifecycle is fully owned by the manager. needs_drain gates the
handoff so a cleanly finished stream never spawns a thread, and the
strict OAI path stays untouched when no pipe is attached.

stop_gc now cancels sessions before finalizing them, so an in flight
drain sees is_cancelled and exits instead of blocking the shutdown join
until the generation ends naturally.

* ui: add missing JSDoc

* server-stream: drain on the http worker, drop the manager thread

Address @ngxson review: httplib runs a large dynamic pool and a worker
blocked in next() sits on a condvar instead of burning cpu, so draining
the rest of the generation on that worker is fine and much simpler than
a dedicated thread.

on_complete calls finish_producer directly again. Removes adopt_orphan,
the orphan thread list and its reaping, the stop_gc session cancel that
only existed to unblock those threads, and the now dead drain_shutdown
flag.

* server-stream: split stream_pipe into producer and consumer classes

Address @ngxson review: one class covering both ends was messy. stream_pipe
is now a base holding the session and is_cancelled, with stream_pipe_producer
(write, mark_producer_done, finish_producer, cleanup, finalizes on destruct)
and stream_pipe_consumer (read only, no finalize) deriving from it.

Drops the is_producer_ discriminator and its runtime guards, the type now
encodes the role. res.spipe is retyped to shared_ptr<stream_pipe_producer>
since it is only ever a producer. No behavior change.

* server-stream: rename producer methods to unix pipe semantics

Address @ngxson review: mark_producer_done becomes done(), finish_producer
becomes close(), matching a unix pipe write end. The producer_done_ member
follows as done_. write() is unchanged. No behavior change.

* server, ui: route resumable streams via a conv map, persist resume identity

Address ngxson review: drop the polling probe, proxy_post records a conv_id ->
model map and the stream routes resolve the owning child with one lookup. The
map is the single source of truth, the ::model suffix stays for child session
uniqueness but the router never parses it.

UI: the server keys a session by the POST time identity (conv::model), but reload
probed with the bare conv id and missed model tagged sessions, so F5 stopped the
stream and sidebar spinners stayed off. Persist the model and rebuild the exact
identity on resume, single conv and bulk sidebar both send it.

Add unit coverage for the identity round trip.

* ui: resolve continue target by id to stop cross-conversation flash on switch

* ui: skip stream resume when the abort is intentional

* server: move the conv id to model map into a self contained tracker

Address review from ngxson: server_models held two mutexes side by side, the
global one and a bare conv_model_mu guarding a loose map, which made the locking
hard to follow. Wrap the map and its lock in a small conv_model_tracker struct
that owns its mutex, one mutex per struct. The remember, lookup and forget
methods move inline into the tracker, server_models exposes a single conv_models
member and the routes call models.conv_models.lookup and friends. No behavior
change, the map stays the single source of truth for routing resumable streams
to a child.

* ui: replace stream magic values with enums and shared constants

Address review from allozaur: lift the inline literals around the resumable
stream code into named symbols so the intent is explicit and reusable.

* ui: fold the stream resume and discovery helpers into ChatService

Address review from allozaur: drop the two standalone stream-*.service files.
They were used only by the chat service and store, carried no shared state, and
did not follow the static class pattern the other services use, so a separate
abstraction was not warranted. Move the helpers onto ChatService as static
methods. No behavior change, tests now exercise them through ChatService.

* docs: document the SSE replay buffer in server README-dev

Add the resumable streaming section, list stream_session_manager in the
backend component inventory, and link PR 23226 in the related PRs.

* ui: align attachServerStream call with onCompletionId param in handleStreamResponse

* server-http: rename del_ to del to match get and post

* ui: address review feedback from allozaur

* ui: drop duplicate SSE constants, keep sse.ts canonical

* ui: use svelte:document for the visibilitychange listener

address review from allozaur: replace the manual document.addEventListener
in onMount with a declarative <svelte:document onvisibilitychange>. svelte
handles attach, detach and SSR, so the typeof document guard and the onMount
cleanup go away. onMount keeps only the first load snapshot.

* server: trim redundant stream drain comments

Address review from ngxson

* server: balance and clean up stream comments

remove redundant comments and tighten the verbose ones across the resumable
stream code, keeping the concurrency and lifetime rationale that is not obvious
from the code. also fix two stale comments in server.cpp and server-models.h
that still described the old ::model suffix probe and fan out routing, now
replaced by the conv_id -> model map

Address review from ngxson

* ui: balance and clean up stream comments

dedup repeated rationale (frozen conv::model identity, the lookup privacy note,
the abort patterns) down to one canonical spot, tighten the verbose blocks, and
keep the concurrency and resume-offset reasoning. fix stale comments in
stream-identity.ts and chat.service.ts that still described the old loopback
probe and fan out routing, now the conv_id -> model map.

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2026-06-26 09:31:29 +02:00
Xuan-Son Nguyen
721354fbdf
server: (router) move model downloading to dedicated process (#24834)
* server: real-time model load progress tracking via /models/sse

* update docs

* server: move model download to child process

* rm unused

* fix most problems

* clean up

* nit fixes

* fix test case

* do not detact() thread

* shorter MODEL_DOWNLOAD_TIMEOUT in test

* throttle
2026-06-22 18:24:04 +02:00
Xuan-Son Nguyen
d6d899580d
server: real-time model load progress tracking via /models/sse (#24828)
* server: real-time model load progress tracking via /models/sse

* update docs

* add mutex for notify_to_router

* correct docs
2026-06-21 11:58:14 +02:00
Xuan-Son Nguyen
2b686a9120
server: refactor child --> router communication (#24821)
* server: refactor child --> router communication

* fix wakeup case

* add docs

* improve update_status()

* nits
2026-06-20 01:02:26 +02:00
Xuan-Son Nguyen
175147e8f6
server: remove all internal mentions about "webui" (#24817) 2026-06-19 22:12:46 +02:00
Xuan-Son Nguyen
8c2d6f6475
server: add --agent arg, remove redundant webui naming compat (#24801)
* server: add --agent arg, remove redundant webui naming compat

* corrent env

* fix the test

* llama-gen-docs

* nits: wordings
2026-06-19 16:06:13 +02:00
Xuan-Son Nguyen
4b4d13ae72
server: (router) add model management API (#23976)
* wip

* server: (router) add SSE realtime updates API

* nits

* wip

* add download API

* add download api

* update docs

* add delete endpoint

* fix std::terminate

* fix crash

* fix 2

* add tests

* nits
2026-06-17 18:04:58 +02:00
Xuan-Son Nguyen
06d26dfdff
download: add option to skip_download (#23059)
* download: add option to skip_download

* fix

* fix 2

* if file doesn't exist, respect skip_download flag
2026-05-29 16:30:55 +02:00
Aleksander Grygier
59778f0196
ui: Restructure repo to use tools/ui folder and ui / UI / llama-ui / LLAMA_UI naming (#23064)
* webui: Move static build output from `tools/server/public` to `build/ui` directory

* refactor: Move to `tools/ui`

* refactor: rename CMake variables and preprocessor defines

- Rename LLAMA_BUILD_WEBUI -> LLAMA_BUILD_UI (old kept as deprecated)
- Rename LLAMA_USE_PREBUILT_WEBUI -> LLAMA_USE_PREBUILT_UI (old kept as deprecated)
- Backward compat: old vars auto-forward to new ones with DEPRECATION warning
- Rename internal vars: WEBUI_SOURCE -> UI_SOURCE, WEBUI_SOURCE_DIR -> UI_SOURCE_DIR, etc.
- Rename HF bucket: LLAMA_WEBUI_HF_BUCKET -> LLAMA_UI_HF_BUCKET
- Emit both LLAMA_BUILD_WEBUI and LLAMA_BUILD_UI preprocessor defines
- Emit both LLAMA_WEBUI_DEFAULT_ENABLED and LLAMA_UI_DEFAULT_ENABLED

* refactor: rename CLI flags (--webui -> --ui) with backward compat

- Add --ui/--no-ui (old --webui/--no-webui kept as deprecated aliases)
- Add --ui-config (old --webui-config kept as deprecated alias)
- Add --ui-config-file (old --webui-config-file kept as deprecated alias)
- Add --ui-mcp-proxy/--no-ui-mcp-proxy (old --webui-mcp-proxy kept as deprecated)
- Add new env vars: LLAMA_ARG_UI, LLAMA_ARG_UI_CONFIG, LLAMA_ARG_UI_CONFIG_FILE, LLAMA_ARG_UI_MCP_PROXY
- C++ struct fields: params.ui, params.ui_config_json, params.ui_mcp_proxy added alongside old fields
- Backward compat: old fields synced to new ones in g_params_to_internals

* refactor: update C++ server internals with backward compat

- Rename json_webui_settings -> json_ui_settings (both kept in server_context_meta)
- Rename params.webui usage -> params.ui (both synced, old still works)
- JSON API emits both "ui"/"ui_settings" and "webui"/"webui_settings" keys
- Server routes use params.ui_mcp_proxy || params.webui_mcp_proxy
- Preprocessor guards use #if defined(LLAMA_BUILD_UI) || defined(LLAMA_BUILD_WEBUI)

* refactor: rename CI/CD workflows, artifacts, and build script

- Rename webui-build.yml -> ui-build.yml; artifact webui-build -> ui-build
- Rename webui-publish.yml -> ui-publish.yml; var HF_BUCKET_WEBUI_STATIC_OUTPUT -> HF_BUCKET_UI_STATIC_OUTPUT
- Rename server-webui.yml -> server-ui.yml; job webui-build/checks -> ui-build/checks
- Update server.yml: job/artifact refs webui-build -> ui-build
- Update release.yml: all webui-build/publish refs -> ui-build/publish; HF_TOKEN_WEBUI_STATIC_OUTPUT -> HF_TOKEN_UI_STATIC_OUTPUT
- Update server-self-hosted.yml: webui-build -> ui-build
- Update build-self-hosted.yml: HF_WEBUI_VERSION -> HF_UI_VERSION
- Rename webui-download.cmake -> ui-download.cmake (internal refs updated)
- Update labeler.yml: server/webui -> server/ui path label

* docs: update CODEOWNERS and server README docs

- Update CODEOWNERS: team ggml-org/llama-webui -> ggml-org/llama-ui, path /tools/server/webui/ -> /tools/ui/
- Update server README.md: CLI tables show --ui flags with deprecated --webui aliases
- Update server README-dev.md: "WebUI" -> "UI", paths updated to tools/ui/

* fix: Small fixes for UI build

* fix: CMake.txt syntax

* chore: Formatting

* fix: `.editorconfig` for llama-ui

* chore: Formatting

* refactor: Use `APP_NAME` in Error route

* refactor: Cleanup

* refactor: Single migration service

* make llama-ui a linkable target

* fix: UI Build output

* fix: Missing change

* fix: separate llama-ui npm build output into build/tools/ui/dist subfolder + use cmake npm build instead of downloading ui-build.yml artifacts in CI

* refactor: UI workflows cleanup

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2026-05-16 02:02:40 +02:00
Xuan-Son Nguyen
7bfe120c21
mtmd, server, common: expose modalities to /v1/models (#22952)
* mtmd, server, common: expose modalities to /v1/models

* fix build

* rename to mtmd_caps
2026-05-12 19:08:07 +02:00
Xuan-Son Nguyen
9dcf835528
server: (router) expose child model info from router's /v1/models (#22683)
* server: (router) expose child model info from router's /v1/models

* update docs
2026-05-08 14:42:15 +02:00
Xuan-Son Nguyen
935a340292
server: implement /models?reload=1 (#21848) 2026-05-04 16:23:26 +02:00
tha80
983ca8992e
server: (router) Forward form-data to model server (Fixes #22044) (#22118)
* This commit enables the router to forward form-data to model server.
Fixes #22044 (enabling to use the /v1/audio/transcriptions in router mode)

* * Applied the suggestion from Copilots first comment: using the non-throwing json::parse overload.
* Addressed Copilots third comment by extending the files representation to also include filename and content-type
* Addressed Copilots fourth comment by making the RNG thread_local

* Changed variable body from std::string to std::ostringstream in build_multipart_body
as suggested by ngxson in https://github.com/ggml-org/llama.cpp/pull/22118#discussion_r3127099053

* Added sanitize_field lambda in build_multipart_body for key, filename and content_type
as suggested by ngxson in https://github.com/ggml-org/llama.cpp/pull/22118#discussion_r3127104647

* explicitly checking if value/item is string before calling value/item.get<std::string>()
as requested by ngxson in https://github.com/ggml-org/llama.cpp/pull/22118#discussion_r3127111279

* Added double quote to the sanitize lambda and throw on json parse failure

---------

Co-authored-by: Ralph Paßgang <ralph@trust-it.de>
2026-04-27 23:55:00 +02:00
Xuan-Son Nguyen
49bfddeca1
server: allow router to report child instances sleep status (#20849)
* server: allow router to report child instances sleep status

* refactor

* move sleeping to state

* nits
2026-03-22 18:33:52 +01:00
Evan Huus
23fbfcb1ad
server: Parse port numbers from MCP server URLs in CORS proxy (#20208)
* Parse port numbers from MCP server URLs

* Pass scheme to http proxy for determining whether to use SSL

* Fix download on non-standard port and re-add port to logging

* add test

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2026-03-09 17:47:54 +01:00
Pascal
2e7e638523
server : support multiple model aliases via comma-separated --alias (#19926)
* server : support multiple model aliases via comma-separated --alias

* server : update --alias description and regenerate docs

* server : multiple model aliases and tags

- address review feedback from ngxson
- --alias accepts comma-separated values (std::set, no duplicates)
- --tags for informational metadata (not used for routing)
- aliases resolve transparently in router via get_meta/has_model
- /v1/models exposes aliases and tags fields

* regenerate docs

* nits

* server : use first alias as model_name for backward compat

address review feedback from ngxson

* server : add single-model test for aliases and tags
2026-02-27 07:05:23 +01:00
Vladislav Sayapin
da143b9940
server : fix router child env in containerized environments (#18562) 2026-01-05 14:12:05 +01:00
wbtek
5b1248c9af
server : Cmdline arg -to changes http read timeout from current 600sec default (#18279)
* Prevent crash if TTFT >300sec, boosted to 90 days

* server : allow configurable HTTP timeouts for child models

* server : pass needed timeouts from params only

---------

Co-authored-by: Greg Slocum <fromgit@wbtek.slocum.net>
2025-12-29 17:12:48 +01:00
Xuan-Son Nguyen
f5acfb2ffa
server: (router) add stop-timeout option (#18350)
* server: (router) add stop-timeout option

* also allow stop while loading

* add docs

* unload_lru: also wait for unload to complete
2025-12-24 23:47:49 +01:00
Xuan-Son Nguyen
9e39a1e6a9
server: support load model on startup, support preset-only options (#18206)
* server: support autoload model, support preset-only options

* add docs

* load-on-startup

* fix

* Update common/arg.cpp

Co-authored-by: Pascal <admin@serveurperso.com>

---------

Co-authored-by: Pascal <admin@serveurperso.com>
2025-12-20 09:25:27 +01:00
Xuan-Son Nguyen
98c1c7a7bf
presets: refactor, allow cascade presets from different sources, add global section (#18169)
* presets: refactor, allow cascade presets from different sources

* update docs

* fix neg arg handling

* fix empty mmproj

* also filter out server-controlled args before to_ini()

* skip loading custom_models if not specified

* fix unset_reserved_args

* fix crash on windows
2025-12-19 12:08:20 +01:00
Pascal
6ce3d85796
server: (webui) add --webui-config (#18028)
* server/webui: add server-side WebUI config support

Add CLI arguments --webui-config (inline JSON) and --webui-config-file
(file path) to configure WebUI default settings from server side.

Backend changes:
- Parse JSON once in server_context::load_model() for performance
- Cache parsed config in webui_settings member (zero overhead on /props)
- Add proper error handling in router mode with try/catch
- Expose webui_settings in /props endpoint for both router and child modes

Frontend changes:
- Add 14 configurable WebUI settings via parameter sync
- Add tests for webui settings extraction
- Fix subpath support with base path in API calls

Addresses feedback from @ngxson and @ggerganov

* server: address review feedback from ngxson

* server: regenerate README with llama-gen-docs
2025-12-17 21:45:45 +01:00
Xuan-Son Nguyen
bde461de8c
server: (router) allow child process to report status via stdout (#18110)
* server: (router) allow child process to report status via stdout

* apply suggestions
2025-12-17 14:54:11 +01:00
Pascal
f32ca51bfe
server: add presets (config) when using multiple models (#17859)
* llama-server: recursive GGUF loading

Replace flat directory scan with recursive traversal using
std::filesystem::recursive_directory_iterator. Support for
nested vendor/model layouts (e.g. vendor/model/*.gguf).
Model name now reflects the relative path within --models-dir
instead of just the filename. Aggregate files by parent
directory via std::map before constructing local_model

* server : router config POC (INI-based per-model settings)

* server: address review feedback from @aldehir and @ngxson

PEG parser usage improvements:
- Simplify parser instantiation (remove arena indirection)
- Optimize grammar usage (ws instead of zero_or_more, remove optional wrapping)
- Fix last line without newline bug (+ operator instead of <<)
- Remove redundant end position check

Feature scope:
- Remove auto-reload feature (will be separate PR per @ngxson)
- Keep config.ini auto-creation and template generation
- Preserve per-model customization logic

Co-authored-by: aldehir <aldehir@users.noreply.github.com>
Co-authored-by: ngxson <ngxson@users.noreply.github.com>

* server: adopt aldehir's line-oriented PEG parser

Complete rewrite of INI parser grammar and visitor:
- Use p.chars(), p.negate(), p.any() instead of p.until()
- Support end-of-line comments (key=value # comment)
- Handle EOF without trailing newline correctly
- Strict identifier validation ([a-zA-Z_][a-zA-Z0-9_.-]*)
- Simplified visitor (no pending state, no trim needed)
- Grammar handles whitespace natively via eol rule

Business validation preserved:
- Reject section names starting with LLAMA_ARG_*
- Accept only keys starting with LLAMA_ARG_*
- Require explicit section before key-value pairs

Co-authored-by: aldehir <aldehir@users.noreply.github.com>

* server: fix CLI/env duplication in child processes

Children now receive minimal CLI args (executable, model, port, alias)
instead of inheriting all router args. Global settings pass through
LLAMA_ARG_* environment variables only, eliminating duplicate config
warnings.

Fixes: Router args like -ngl, -fa were passed both via CLI and env,
causing 'will be overwritten' warnings on every child spawn

* add common/preset.cpp

* fix compile

* cont

* allow custom-path models

* add falsey check

* server: fix router model discovery and child process spawning

- Sanitize model names: replace / and \ with _ for display
- Recursive directory scan with relative path storage
- Convert relative paths to absolute when spawning children
- Filter router control args from child processes
- Refresh args after port assignment for correct port value
- Fallback preset lookup for compatibility
- Fix missing argv[0]: store server binary path before base_args parsing

* Revert "server: fix router model discovery and child process spawning"

This reverts commit e3832b42eeea7fcb108995966c7584479f745857.

* clarify about "no-" prefix

* correct render_args() to include binary path

* also remove arg LLAMA_ARG_MODELS_PRESET for child

* add co-author for ini parser code

Co-authored-by: aldehir <hello@alde.dev>

* also set LLAMA_ARG_HOST

* add CHILD_ADDR

* Remove dead code

---------

Co-authored-by: aldehir <aldehir@users.noreply.github.com>
Co-authored-by: ngxson <ngxson@users.noreply.github.com>
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
Co-authored-by: aldehir <hello@alde.dev>
2025-12-10 22:18:21 +01:00
Pascal
5ceed62421
server: fix duplicate HTTP headers in multiple models mode (#17698)
* llama-server: fix duplicate HTTP headers in multiple models mode (#17693)

* llama-server: address review feedback from ngxson

- restrict scope of header after std::move
- simplify header check (remove unordered_set)
2025-12-03 10:28:43 +01:00
Xuan-Son Nguyen
ec18edfcba
server: introduce API for serving / loading / unloading multiple models (#17470)
* server: add model management and proxy

* fix compile error

* does this fix windows?

* fix windows build

* use subprocess.h, better logging

* add test

* fix windows

* feat: Model/Router server architecture WIP

* more stable

* fix unsafe pointer

* also allow terminate loading model

* add is_active()

* refactor: Architecture improvements

* tmp apply upstream fix

* address most problems

* address thread safety issue

* address review comment

* add docs (first version)

* address review comment

* feat: Improved UX for model information, modality interactions etc

* chore: update webui build output

* refactor: Use only the message data `model` property for displaying model used info

* chore: update webui build output

* add --models-dir param

* feat: New Model Selection UX WIP

* chore: update webui build output

* feat: Add auto-mic setting

* feat: Attachments UX improvements

* implement LRU

* remove default model path

* better --models-dir

* add env for args

* address review comments

* fix compile

* refactor: Chat Form Submit component

* ad endpoint docs

* Merge remote-tracking branch 'webui/allozaur/server_model_management_v1_2' into xsn/server_model_maagement_v1_2

Co-authored-by: Aleksander <aleksander.grygier@gmail.com>

* feat: Add copy to clipboard to model name in model info dialog

* feat: Model unavailable UI state for model selector

* feat: Chat Form Actions UI logic improvements

* feat: Auto-select model from last assistant response

* chore: update webui build output

* expose args and exit_code in API

* add note

* support extra_args on loading model

* allow reusing args if auto_load

* typo docs

* oai-compat /models endpoint

* cleaner

* address review comments

* feat: Use `model` property for displaying the `repo/model-name` naming format

* refactor: Attachments data

* chore: update webui build output

* refactor: Enum imports

* feat: Improve Model Selector responsiveness

* chore: update webui build output

* refactor: Cleanup

* refactor: Cleanup

* refactor: Formatters

* chore: update webui build output

* refactor: Copy To Clipboard Icon component

* chore: update webui build output

* refactor: Cleanup

* chore: update webui build output

* refactor: UI badges

* chore: update webui build output

* refactor: Cleanup

* refactor: Cleanup

* chore: update webui build output

* add --models-allow-extra-args for security

* nits

* add stdin_file

* fix merge

* fix: Retrieve lost setting after resolving merge conflict

* refactor: DatabaseStore -> DatabaseService

* refactor: Database, Conversations & Chat services + stores architecture improvements (WIP)

* refactor: Remove redundant settings

* refactor: Multi-model business logic WIP

* chore: update webui build output

* feat: Switching models logic for ChatForm or when regenerating messges + modality detection logic

* chore: update webui build output

* fix: Add `untrack` inside chat processing info data logic to prevent infinite effect

* fix: Regenerate

* feat: Remove redundant settigns + rearrange

* fix: Audio attachments

* refactor: Icons

* chore: update webui build output

* feat: Model management and selection features WIP

* chore: update webui build output

* refactor: Improve server properties management

* refactor: Icons

* chore: update webui build output

* feat: Improve model loading/unloading status updates

* chore: update webui build output

* refactor: Improve API header management via utility functions

* remove support for extra args

* set hf_repo/docker_repo as model alias when posible

* refactor: Remove ConversationsService

* refactor: Chat requests abort handling

* refactor: Server store

* tmp webui build

* refactor: Model modality handling

* chore: update webui build output

* refactor: Processing state reactivity

* fix: UI

* refactor: Services/Stores syntax + logic improvements

Refactors components to access stores directly instead of using exported getter functions.

This change centralizes store access and logic, simplifying component code and improving maintainability by reducing the number of exported functions and promoting direct store interaction.

Removes exported getter functions from `chat.svelte.ts`, `conversations.svelte.ts`, `models.svelte.ts` and `settings.svelte.ts`.

* refactor: Architecture cleanup

* feat: Improve statistic badges

* feat: Condition available models based on modality + better model loading strategy & UX

* docs: Architecture documentation

* feat: Update logic for PDF as Image

* add TODO for http client

* refactor: Enhance model info and attachment handling

* chore: update webui build output

* refactor: Components naming

* chore: update webui build output

* refactor: Cleanup

* refactor: DRY `getAttachmentDisplayItems` function + fix UI

* chore: update webui build output

* fix: Modality detection improvement for text-based PDF attachments

* refactor: Cleanup

* docs: Add info comment

* refactor: Cleanup

* re

* refactor: Cleanup

* refactor: Cleanup

* feat: Attachment logic & UI improvements

* refactor: Constants

* feat: Improve UI sidebar background color

* chore: update webui build output

* refactor: Utils imports + move types to `app.d.ts`

* test: Fix Storybook mocks

* chore: update webui build output

* test: Update Chat Form UI tests

* refactor: Tooltip Provider from core layout

* refactor: Tests to separate location

* decouple server_models from server_routes

* test: Move demo test  to tests/server

* refactor: Remove redundant method

* chore: update webui build output

* also route anthropic endpoints

* fix duplicated arg

* fix invalid ptr to shutdown_handler

* server : minor

* rm unused fn

* add ?autoload=true|false query param

* refactor: Remove redundant code

* docs: Update README documentations + architecture & data flow diagrams

* fix: Disable autoload on calling server props for the model

* chore: update webui build output

* fix ubuntu build

* fix: Model status reactivity

* fix: Modality detection for MODEL mode

* chore: update webui build output

---------

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-12-01 19:41:04 +01:00