Adds an opt-in LLAMA_BUILD_MTMD CMake option so build-xcframework.sh
can link libmtmd.a into the framework binary without pulling in the
rest of tools/ (which doesn't cross-build cleanly to iOS/tvOS/visionOS).
- CMakeLists.txt: new option, default OFF. When on with
LLAMA_BUILD_TOOLS=OFF, only the tools/mtmd subdir is added. Useful
for any binding that wants just libmtmd (Apple XCFramework, WASM).
- tools/mtmd/CMakeLists.txt: gate the CLI exe targets on
LLAMA_BUILD_TOOLS. Gating on LLAMA_BUILD_COMMON is not enough: it
defaults ON in standalone builds, and visionOS xcodebuild then fails
with "install TARGETS given no BUNDLE DESTINATION for MACOSX_BUNDLE
executable target 'llama-mtmd-cli'".
- build-xcframework.sh: turn the option on, pass -DLLAMA_BUILD_MTMD,
add libmtmd.a to combine_static_libraries, and copy mtmd.h and
mtmd-helper.h into the framework Headers dir. The umbrella module
map then exposes them, so Swift / Obj-C consumers can import the
mtmd C API directly.
After this, nm on ios-arm64/llama.framework/llama shows 52 _mtmd_
symbols. Verified end-to-end: a Swift target links the produced
framework and calls mtmd_default_marker, mtmd_bitmap_init, etc.
without a shim on macos / iphoneos / iphonesimulator / xros slices.
Co-authored-by: Abraham Gonzalez <abraham@theabecaster.com>
* ui: show model load progress on the selector trigger
Mirror the in-dropdown stage progress as a thin bar on the selector
trigger, so the active model's load percent stays visible when the menu
is closed. Same status gating and composite fraction as the dropdown
row, so both bars track the selected model in sync.
Suggested-by: Julien Chaumond <@julien-c>
* ui: show model load progress bar on the in-conversation model selector
* ui: tune model load indicator to a pulsing highlight (suggested by @ngxson)
Also wire the indicator onto the mobile sheet trigger, which was missing
it since mobile uses the sheet instead of the dropdown.
* ui: thin (@allozaur) pulsating (@ngxson) model load bar
* server : improve message span logic
* cont : cast size_t to int32_t in comparisons
* server : create checkpoints before every user msg
* chat : remove \n in gemma4 delimiters
* chat : merge msg delimiter structs into one
* cont : reword comment
* cont : initialize tokens in delimiter
* cont : add server_tokens::get_raw_tokens() for mtmd
* cont : move message finding to server_tokens and skip mtmd tokens
* cont : update cohere2moe parser
* cont : increase min-step to 8192 and always produce a chkpt for last user message
* server: real-time model load progress tracking via /models/sse
* update docs
* server: move model download to child process
* rm unused
* fix most problems
* clean up
* nit fixes
* fix test case
* do not detach() thread
* shorter MODEL_DOWNLOAD_TIMEOUT in test
* throttle
* ui: model status and load progress via /models/sse feed
* ui: centralize SSE wire-format delimiters into shared constants for the chat and /models/sse parsers
* ui: type /models/sse event names as a ServerModelsSseEventType enum
Address review from allozaur
line_start = -1 was normalized to n + 1, so append inserted at
lines.begin() + n + 1, one past end(), causing a heap-buffer-overflow in
vector::_M_range_insert.
Normalize -1 to n (insert at end()), restrict -1 to append mode, and reject
it for replace/delete instead of silently clobbering the last line.
Parenthesize the insert offset so empty-file append computes the position as
an int first, avoiding a transient begin() - 1 on a null vector data pointer.
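A minimal sketch of the normalization described above, not the actual
patch. The names edit_mode, resolve_line_start and append_lines are
hypothetical, and the sketch assumes that line_start = k means "insert after
line k" with 0 meaning "before the first line".

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical edit modes; the names are assumptions for this sketch.
enum class edit_mode { append, replace, remove };

// Resolve line_start against a file of n lines.
// -1 means "end of file" and is accepted in append mode only.
static int resolve_line_start(int line_start, int n, edit_mode mode) {
    if (line_start == -1) {
        if (mode != edit_mode::append) {
            // replace/delete must name a real line; do not guess the last one
            throw std::invalid_argument("line_start = -1 is only valid for append");
        }
        return n; // insert at end(), not at n + 1
    }
    if (line_start < 0 || line_start > n) {
        throw std::out_of_range("line_start out of range");
    }
    return line_start;
}

// Insert new_lines after line `line_start` (0 = before the first line).
static void append_lines(std::vector<std::string> & lines, int line_start,
                         const std::vector<std::string> & new_lines) {
    const int n = static_cast<int>(lines.size());
    // The offset is computed as an int first, so no iterator arithmetic
    // ever produces a position before begin() or past end().
    const int pos = resolve_line_start(line_start, n, edit_mode::append);
    lines.insert(lines.begin() + pos, new_lines.begin(), new_lines.end());
}
```

With pos restricted to [0, n], the insert position is always a valid
iterator, including for an empty vector where begin() == end().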
* common/peg : refactor until gbnf grammar into an ac automaton
* cont : add a test with multiple strings
* cont : pad state with 0s so rules line up
* cont : clean up comments
* cont : use set everywhere
* cont : inline state num string padding
* cont : add a ref to PR
* cont : fix regression in server-tools.cpp
* server: avoid forwarding auth headers in CORS proxy
* format
* fix test
* fix e2e test
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
Use std::partial_sort to order only the requested top-n tokens instead
of sorting the full vocabulary.
logprobs sort: vocab=128000 n_top=0 iters=100
full sort: 8555.6 us/op
partial sort: 704.3 us/op
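A rough sketch of the idea, assuming the candidates are held as
(token id, logprob) pairs. The token_logprob alias and top_n_logprobs helper
are hypothetical names for illustration and do not come from the patch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// (token id, log-probability); the alias name is an assumption for this sketch.
using token_logprob = std::pair<int, float>;

// Return the n_top most likely tokens in descending logprob order.
static std::vector<token_logprob> top_n_logprobs(const std::vector<float> & logprobs,
                                                 std::size_t n_top) {
    std::vector<token_logprob> cur;
    cur.reserve(logprobs.size());
    for (std::size_t i = 0; i < logprobs.size(); ++i) {
        cur.emplace_back(static_cast<int>(i), logprobs[i]);
    }

    n_top = std::min(n_top, cur.size());
    const auto mid = cur.begin() + static_cast<std::ptrdiff_t>(n_top);

    // Only [begin, mid) ends up sorted; the rest is left in unspecified order.
    std::partial_sort(cur.begin(), mid, cur.end(),
        [](const token_logprob & a, const token_logprob & b) {
            return a.second > b.second;
        });

    cur.resize(n_top);
    return cur;
}
```

std::partial_sort does roughly N log(n_top) comparisons instead of N log N,
which is where the saving comes from when n_top is much smaller than the
vocabulary.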
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
Absorb get_slot_by_id logic into get_available_slot so slot selection
is handled by a single function call. When a specific slot id is
requested, the LCP similarity check still runs to enable proper
prompt cache updates.
Assisted-by: pi:llama.cpp/Qwen3.6-27B
* server: add "X-Accel-Buffering": "no" header to streaming endpoints
This header tells Nginx, when it acts as a reverse proxy, not to buffer responses. It is only set on streaming endpoints.
Without it, Nginx breaks streaming for certain applications (notably the Pi coding harness).
* ui : add model selector storybook stories
Covers list, favorites, single-model, all status states
(loading/loaded/sleeping/failed/idle), and selection states.
* ui : improve model selector mobile UX with hover media queries
Use @media (hover:none) to show action buttons directly on touch
devices and color-code them by model status (amber=sleeping,
green=loaded, muted=idle). Status dots are hidden on touch devices.
Desktop hover behavior is unchanged.
Throw on grammar parse failure so the server returns HTTP 400
instead of silently dropping the constraint.
Add a regression test for the invalid-grammar response.
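The control flow can be sketched as below. This is an illustration only:
parse_grammar_stub stands in for the real grammar parser, and
init_grammar_or_throw and handle_completion are hypothetical names, not the
server's actual functions.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Stand-in for the real grammar parser (assumption for this sketch):
// accepts only text that contains a "::=" rule definition.
static bool parse_grammar_stub(const std::string & text) {
    return text.find("::=") != std::string::npos;
}

// Throw on parse failure instead of continuing without the constraint.
static void init_grammar_or_throw(const std::string & grammar) {
    if (!grammar.empty() && !parse_grammar_stub(grammar)) {
        throw std::invalid_argument("failed to parse grammar");
    }
}

// Hypothetical request handler: maps the exception to an HTTP status code.
static int handle_completion(const std::string & grammar) {
    try {
        init_grammar_or_throw(grammar);
    } catch (const std::invalid_argument &) {
        return 400; // bad request: the client sent an invalid grammar
    }
    return 200;
}
```

An empty grammar still means "no constraint requested" in this sketch, so
only a grammar that was supplied and failed to parse is rejected.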
Fixes #24144