* common : add common_chat_split_by_role
* cont : fix spans to reach end of message
* server: fix checkpoints creation
- extract message_spans from chat templates
- find the prompt token position before the latest user message
- split prompt batching at that position
- create a context checkpoint before the latest user input
- avoid periodic mid-prompt checkpoints when that position is known
- handle multimodal prompts when mapping text/template positions to server prompt tokens
- add --checkpoint-min-step to control minimum spacing between checkpoints
* cont : clean-up
* Support autoparser detection for message barriers
* server: fix message span delimiter and update docs
---------
Co-authored-by: Alde Rojas <hello@alde.dev>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Piotr Wilkin <piotr.wilkin@syndatis.com>
* pi : update
* ci : fix ios build
* ci : fix andoroid
* ci : fix apple builds
* cmake : add install() for impl libraries
Add install(TARGETS <target> LIBRARY) for all -impl libraries that were
changed from STATIC to shared (controlled by BUILD_SHARED_LIBS) in
commit bb28c1fe2. Without this, cmake --install fails to copy the shared
libraries, causing runtime errors like:
llama-server: error while loading shared libraries: libllama-server-impl.so
Ref: https://github.com/ggml-org/llama.cpp/issues/23494#issuecomment-4512912515
Assisted-by: llama.cpp:local pi
* ci : fix xcframework build
* cmake : remove STATIC from impl libraries, allow BUILD_SHARED_LIBS control
Remove explicit STATIC from all -impl libraries (server, cli, completion, bench,
batched-bench, fit-params, quantize, perplexity) so BUILD_SHARED_LIBS controls
shared vs static linkage.
Add WINDOWS_EXPORT_ALL_SYMBOLS ON for proper DLL export on Windows.
Assisted-by: llama.cpp:local pi
* cmake : enable LLAMA_BUILD_APP by default
Assisted-by: llama.cpp:local pi
* ci : disable app in build-cmake-pkg.yml
- HunyuanOCR shares the same HF arch and vision layout as HunyuanVL butwas split into a separate path that skipped the +0.1 bilinear sampler used by the HF reference.
- Collapse OCR into the HUNYUANVL projector + HUNYUAN_VL text arch
* spec: support MTP
* fix batch size
* rename files
* cont : simplify (#7)
* MTP: clean-up (#9)
* MTP: clean-up
* review: use llama_context_type instead of llama_graph_type
* review: remove llama_model_has_mtp
* review: fix convert issues
* convert: fix pycheck
* review: formatting
* use `mtp-` for identifying mtp models
* convert: fix mtp conversion
* mtp -> draft-mtp
* remove unused llama_arch
* add need_embd in speculative
* llama: allow partial seq_rm for GDN models for speculative decoding
Currently speculative checkpoint needs to restart from a checkpoint
after some draft tokens are not accepted, this leads to some wastage in
running the target again. This PR adds the ability to rollback upto
`draft_max` by storing the GDN intermediates.
* fix pending state
* vulkan: add GDN partial rollback
* meta: extend check to axis 1
* metal: add GDN partial rollback
Extend the gated delta net kernel to store intermediate states for
partial rollback support on the Metal backend.
- Add K (snapshot slot count) as a function constant
- Read input state from slot 0 of the 3D state tensor
- Write intermediate states to different slots during token loop
- For K=1, maintain backward-compatible single-slot behavior
Ref: 8c05923630
Assisted-by: llama.cpp:local pi
* delta_net_base: use ggml_pad instead of new_tensor
* review: add need_rs_seq
* review: rename part_bounded to n_rs
* review: deslop comments
* review: rename, add asserts
* server : adjust checkpoint logic (#11)
* server : adjust checkpoint logic
* cont : rm asserts
* server-context: fix early exit
* spec : fix compatibility with n-gram and add TODOs (#13)
* metal : cleanup
* llama : fix faulty bitwise check in recurrent memory
* server : disable RS-based MTP in combination with other spec types
* spec : add TODOs
* cont : fix comment
* cont : update comment
* common : fix logic for ngram + mtp compat
* llama-memory: enable checkpointing with partial rollback
* cont: add test-case for loading into a dirty ctx
* llama-memory-recurrent: clear rs_idx in clear
* download: fix mtp path
* llama-arch: fix enorm op
* docs: update docs
* conversion: fix type annotations
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* spec : refactor
* spec : drop support for incompatible vocabs
* spec : update common_speculative_init()
* cont : pass seq_id
* cont : dedup ctx_seq_rm_type
* server : sketch the ctx_dft decode loop
* server : draft prompt cache and checkpoints
* server : improve ctx names
* server, spec : transition to unified spec context
* cont : sync main and drft contexts
* cont : async drft eval when possible
* cont : handle non-ckpt models
* cont : pass correct n_past for drafting
* cont : process images throught the draft context
* spec : handle draft running out of context
* server : fix mtmd draft processing
* server : fix URL for draft model
* server : add comment
* server : clean-up + dry
* speculative-simple : update
* spec : fix n_past type
* server : fix slot ctx_drft ptr
* tools : update readme
* naming : improve consistency
* spec : refactor for multi-sequence speculative context
* cont : prepare params
* cont : prepare params
* spec : support parallel drafts
* server : support parallel drafting
* llama : reuse device buffers when possible
* server, spec : clean-up
* cont : clean-up
* cont : minor
* spec : reset `drafting` flag at the end
* spec : introduce `common_speculative_process()`
* spec : allow for multiple spec types (chain of speculators)
* replace old type field of type common_speculative_type in the
common_params_speculative struct with a vector to allow multiple
types to be specified
* introduce common_get_enabled_speculative_impls(const std::vector<enum common_speculative_type>)
to figure out which implementations the user has enabled
* introduce common_speculative_type_from_names(const std::vector<std::string> & names)
to parse the already user provided spec types
* all speculators run sequentially, best one wins (we verify its drafted tokens)
* maximize expected accepted tokens for current round by calculating the
product between the probability of accepting current token (n_acc_tokens / n_gen_drafts)
and the draft's length
---------
Co-authored-by: Petros Sideris <petros.sideris@nokia.com>
* docs : update speculative decoding parameters after refactor (#22397)
Update docs/speculative.md to reflect the new parameter naming scheme
introduced in PR #22397:
- Replace --draft-max/--draft-min with --spec-draft-n-max/--spec-draft-n-min
- Replace --spec-ngram-size-n/m with per-implementation variants
- Add documentation for all new --spec-ngram-*- parameters
- Update all example commands
Assisted-by: llama.cpp:local pi
* pi : add rule to use gh CLI for GitHub resources
Assisted-by: llama.cpp:local pi
* docs : run llama-gen-docs
* arg : fix typo
This change implements the third requested change in issue 20429.
Because defaults.sampling contains the reasoning budget token count and
the reasoning budget message, it's not necessary to assign them to
struct variables.
* chat: fix parallel_tool_calls default setting based on model capabilities, add tests for parallel tool calls and structured outputs
* Fix ty errors.
* Fix flake8 err
This change refactors the reasoning_budget_message parameter from the
common params into the sampling parameters specifically. It also removes
the reasoning_budget common parameter and standardizes on the existing
reasoning_budget_tokens parameter in the sampling configuration.
Issue: https://github.com/ggml-org/llama.cpp/issues/20429
Original PR: https://github.com/ggml-org/llama.cpp/pull/20297
The build info is now only for debug, so we avoid the duplicate
with `--version`.
The UTF-8 setup at the beginning is needed to avoid logging
garbage on Windows.
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* misc : prefer ggml-org models in docs and examples
Prefer referring to known-good quantizations under ggml-org rather than
3rd-party uploaders.
* remove accidentally committed file
* server : support multiple model aliases via comma-separated --alias
* server : update --alias description and regenerate docs
* server : multiple model aliases and tags
- address review feedback from ngxson
- --alias accepts comma-separated values (std::set, no duplicates)
- --tags for informational metadata (not used for routing)
- aliases resolve transparently in router via get_meta/has_model
- /v1/models exposes aliases and tags fields
* regenerate docs
* nits
* server : use first alias as model_name for backward compat
address review feedback from ngxson
* server : add single-model test for aliases and tags