* simpler n_rewind logic, delete old comments
* use more consistent names, add updt_w_cur to json schema
* align comments
* refactor review logic, update struct/variable names
* revert cosmetic changes
* check enable/disable in llama_prep_adaptive_p_impl()
* delete extra whitespaces after statement
* show target in debug prints
* more concise debug print
* delete old comments
* update with loop instead of move()
* comment out all adaptive p debug prints
* more debug prints
* move review() variables: common_sampler struct -> common_sampler_review() args
* match n_unsent type
* fix merge bugs, delete adaptive p references in buffer_and_check_string_ban()
* restore accidental erasure
* Revert "adaptive p: collect probability before logit bias"
This reverts commit 1434878461c49d1a2a9047fc15d5e7b78421fd2a.
* server : support multi-modal context checkpoints and prompt caching
do not create checkpoint right after image processing
improve mtmd check for slot ops
fix context shift
do not abort if template parse failed
* change to debug message when detecting ban token
---------
Co-authored-by: firecoperana <firecoperana>
* WIP: support pre-merged up/gate experts
Haha, mainline has elected to arrange the merged tensors
the other way around compared to what I had done in the on-the-fly merge.
* Change the order of on-the-fly packed up/gate
* OpenAI
* CUDA TG
* CPU
* WIP
* WIP
* WIP
* WIP
* WIP
* WIP
* WIP
Loads and starts running, crashes with illegal memory access in
quantize_mmq_q8_1. This almost always indicates NaNs in the input
to the MoE FFN part.
* WIP
* WIP
Loads and runs, wrong results (very high PPL)
Performance looks promising, around 25% better than previous sm graph.
Needs f32 or bf16 graph reduce type.
* WIP - still wrong
* Fix after rebase
* WIP
* WIP
* This seems to be working for dense Qwen3.5!!!
* WIP: Qwen3-Next is not quite working
* Some cleanup
* Disable Qwen3-Next for now
* Disable graph parallel when mmproj was specified
* Read/write split recurrent state
* That should not crash
* Re-enable vision - it works now
* Recurrent layers should now be counted for split cache
---------
Co-authored-by: Piotr Wilkin <piotr.wilkin@syndatis.com>
common : add nemotron 3 parsing (#18077)
common : add parser for ministral/mistral large 3/devstral 2 (#17713)
common : default content to an empty string (#18485)
chat: make tool description and parameters optional per OpenAI spec (#18478)
Per the OpenAI API specification, both 'description' and 'parameters'
fields in tool function definitions are optional. Previously, the parser
would throw an exception if these fields were missing.
Attempts to fix#17667
common : implement new jinja template engine (#18462)
---------
Co-authored-by: Alde Rojas <hello@alde.dev>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
jinja: correct member access rule (#18905)
jinja : fix lexing of float literals with sign (#18901)
jinja : add missing tojson filter for bool (#18900)
jinja : attribute support for join, map and sort (#18883)
jinja : fix object item order (and properly implement dictsort) (#18904)
tests : add test-jinja -py option for cross-checking (#18906)
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
ci : run test-jinja -py on high perf [no ci] (#18916)
jinja : fix undefined keys and attributes and int/float as bool (#18924)
jinja: support none|string (#18995)
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
jinja : implement mixed type object keys (#18955)
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
jinja : undefined should be treated as sequence/iterable (return string/array) by filters/tests (#19147)
`tojson` is not a supported `undefined` filter
keep it DRY and fix some types
jinja : do not pass empty tools and add some none filters (#19176)
jinja : add unordered_map include to value.h [no ci] (#19205)
jinja : add missing 'in' test to template engine (#19004) (#19239)
The jinja template parser was missing the 'in' test from
global_builtins(), causing templates using reject("in", ...),
select("in", ...), or 'x is in(y)' to fail with
"selectattr: unknown test 'in'".
This broke tool-calling for Qwen3-Coder and any other model
whose chat template uses the 'in' test.
Added test_is_in supporting array, string, and object containment
checks, mirroring the existing 'in' operator logic in runtime.cpp.
Includes test cases for all three containment types plus
reject/select filter usage.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
---------
Co-authored-by: Sid Mohan <sidmohan0@users.noreply.github.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
Add Jinja support for "indent" string filter (#19529)
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
add vendor
refactor chat
server : support preserving reasoning_content in assistant message (#18994)
chat : fix translategemma crash on common_chat_format_example (#19019)
chat: fix language input for translategemma (#19052)
Co-authored-by: Aldehir Rojas <hello@alde.dev>
---------
Co-authored-by: Aldehir Rojas <hello@alde.dev>
chat: fix case where template accepts type content only (#19419)
mtmd : chat : Fix extra \n between text and media marker (#19595)
Thanks to @tugot17 for detecting and reporting the issue.
For vision models (e.g. LFM2.5-VL-1.6B and Qwen/Qwen3-VL-4B-Instruct) `llama-mtmd-cli` produces identical output to HF implementation.
However `llama-server` doesn't. I traced it down to extra newline
inserted after `<__media__>`.
This happens in `to_json_oaicompat`, that treats media markers as text
and joins all parts with `\n` separator.
PR introduces new type `media_marker` and uses it for media markers.
Extra logic is added to prevent insertion of newlines before and after
media markers.
With this change number of input tokens is identical to HF
implementation and as a result the output is also identical.
I explored other ways to address the issue
* remove completely `\n` between text parts in `to_json_oaicompat`
* merge text messages in server-common.cpp before sending them to `to_json_oaicompat`
Please propose alternative ways of fixing this issue.
Co-authored-by: Piotr Wilkin (ilintar) <piotr.wilkin@syndatis.com>
---------
Co-authored-by: Piotr Wilkin (ilintar) <piotr.wilkin@syndatis.com>
common : merge qwen3-coder and nemotron nano 3 parsers (#19765)
common : fix improper trimming in XML parser on complete message (#19805)
Co-authored-by: Jules LEIDELINGER <11395311+julio75012@users.noreply.github.com>
jinja: correct stats for tojson and string filters (#19785)
jinja : correct default size for string slices (#19913)
common : handle unicode during partial json parsing (#16526)
common : fix json schema with '\' in literals (#17307)
add back qwen_coder_xml and mirothinker
Co-authored-by: Aldehir Rojas <hello@alde.dev>
* fix adaptive p sampler rewinding too far back
* update comments
* correct default value for total_weight, more comments
* new variables/names
* update comment for n_rewind
* move null pointer check back to common_sampler_review()
* refactor weighted_sum and total_weight to vector<pair>, better boundary check in llama_review_adaptive_p_impl()
* Revive fused delta-net
* Add command line argument for fused delta net
* Simplify/improve CUDA delta-net
* Add -fdn to llama-bench
* More CUDA fused delta net optimizations
* CPU optimizations
* Much faster fused delta-net on the CPU
It seems it is faster than the chunked implementation!
* Change meaning of fdn from bool flag to threshold value
* Use eps = 1e-6
* Give some nodes a name
* Don't re-apply L2 norm - it has already been done
* This seems quite a bit better
* More tweaks
* Restore per context buffer size log
Not everybody uses models split in 2000 parts, and those who do,
actually want to see the biffer sizes.
* server: enable checkpoint for recurrent models
create checkpoint after cancel
fix ban string and rm context during rewind
add checkpoint interval
only save recurrent cache
* save checkpoint during pp
---------
Co-authored-by: firecoperana <firecoperana>
* Revive fused delta-net
* Add command line argument for fused delta net
* Simplify/improve CUDA delta-net
* Add -fdn to llama-bench
* More CUDA fused delta net optimizations
* CPU optimizations
* Much faster fused delta-net on the CPU
It seems it is faster than the chunked implementation!
* Change meaning of fdn from bool flag to threshold value
* Use eps = 1e-6
* Give some nodes a name
* Display the size of the tensors overriden during the tensor loading
Ex:
`Tensor blk.60.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.60.ffn_up_exps.weight buffer type overriden to CPU`
become
`Tensor blk.60.ffn_up_exps.weight (size = 668467200 bytes) buffer type overriden to CPU
Tensor blk.60.ffn_gate_exps.weight (size = 668467200 bytes) buffer type overriden to CPU`
And pass in debug the later displayed size of the unnamed buffer overrides.
Ex : `llm_load_tensors: CPU buffer size = XXX.XX MiB`
That double display is cluttering the screen without being very informative.
* change bytes display to MiB.
Co-authored-by: Kawrakow <iwankawrakow@gmail.com>
---------
Co-authored-by: Kawrakow <iwankawrakow@gmail.com>
* Partial Requant feature for llama-quantize
- Inspired by the recently portcopied --dry-run feature.
- Allows to partially requantize a split quantized .gguf by requantizing only the missing splits in the destination directory.
- Works both for GGUF which are split tensors by tensors, or by group of several tensors (though this one is not very much tested beyond 2 tensors by split).
- Vibe coded.
* Create output directory if it doesn't exist in llama-quantize
* Create output directory if it doesn't exist in gguf-split
* Add exit when directory fails to be created on Windows
* Use std::filesystem
* cleanup
* wip: port MTP architecture
Ports the Multi-Token Prediction (MTP) architecture to the older `llama.cpp` codebase used by `ikllama`.
Changes include:
- Updating `llama_batch` to support `mtp_params`.
- Modifying `llama_decode_internal` (and `encode`) to handle MTP operations (Warmup, Update, Draft).
- Adding public APIs for MTP state management (`llama_set_draft_input_hidden_state`).
- Adapting the embedding extraction logic to skip MTP update passes.
* Refactors `server_slot` to support generic speculative decoding (MTP or Draft Model).
* core: enable hybrid outputs (logits + embeddings) for MTP support
* fix(mtp): correct KV-cache slot finding for updates
* fix(mtp): persist hidden states to prevent context corruption during drafting
* refactor(mtp): clean unused code
* fix(mtp): update server to new functions name
* fix(mtp): fix graph and save hidden state
* mtp: refactor integration, context params and kv cache search
* mtp: fix hidden state extraction and speculative acceptance flow
* server: fix MTP warmup for long prompts and reset token buffer
* llama: refactor MTP operation state to context parameters
* server: fix n_past calculation in MTP acceptance
* llama: fix mtp enable flags
* speculative: refactor MTP to use common_speculative interface
* context: remove unused signatures
* clip: fix deprecated enum-enum conversion warning
* common: fix format string crash in help message
* context: fix mtp activation logic
* adaptive p: upadte internal state only if not rewinding
* adaptive p: conditional update for speculative decoding
* adaptive p: refactor to rewind instead of update
* adaptive p fix: better comments
* fix rewind check
* add record to handle multi-token rewind
* better comment
* qwen3next: add architecture support and recurrent-state fixes
* qwen3next: optimize broadcast sub and single-seq ssm conv
* cuda: build MoE row mapping on device in mul_mat_id
* cuda: add guarded multi-seq fast path for ssm_conv
* docs: update qwen3next perf report for cuda MoE/SSM tuning
* cuda: reduce qwen3next moe/ssm sync overhead and refresh eval
* qwen3next: split cpu/cuda eval builds and tune PP scheduling
* qwen3next: harden seq-state flow and support optional dense FFN layers
* qwen3next: trim delta-net graph overhead in chunking path
* qwen3next: remove redundant v_conv cont in delta path
* qwen3next: avoid extra cont on linear attention output
* qwen3next: drop redundant cont before recurrent state flatten
* qwen3next: keep recurrent state in 4d layout through delta path
* qwen3next: add fused delta-net op and wire model path
* tests: add backend-op coverage for ggml_delta_net
* qwen3next: add runtime switch for fused delta-net path
* docs: refresh qwen3next perf review and benchmark matrix
* qwen3next: default fused delta-net off and document quality checks
* qwen3next: add decode-only fused delta mode
* qwen3next: make fused delta safe by default and fix fused tensor layout
* qwen3next: warn when forcing fused decode mode
* qwen3next: add fused-delta regression runner script
* qwen3next: integrate fused regression into eval harness
* qwen3next: clean up chunked delta-net shape handling
* qwen3next: add absolute sanity guards to fused regression
* qwen3next: add unified regression runner script
* qwen3next: disable flash-attn for cpu-only contexts
* docs: reconcile qwen3next status and remaining upstream gaps
* common: add qwen3next fused-delta runtime flag
* cuda: add qwen3next delta-net kernel dispatch override
* docs: update qwen3next quality and serving baseline findings
* qwen3next: keep fused delta on safe path and remove PR artifacts
* qwen3next: align autoregressive delta-net decode layout
* Revert "qwen3next: align autoregressive delta-net decode layout"
This reverts commit 9241164a5ea9e032a2456fbf2dd0bf798b264fd7.
* cuda: port solve-tri fast-paths for qwen3next delta-net
* qwen3next: add fused-delta runtime flag and drop env toggle
* qwen3next: make fused delta single-flag and default on
* Account for GPU arch differences
* Revert "cuda: build MoE row mapping on device in mul_mat_id"
This reverts commit 89e9ecfa840b04e88699ab3803eb732cd78727f9.
* qwen3next: drop non-essential MoE scheduling and split heuristics
* qwen3next: avoid generic ggml_sub broadcast changes
* llama: restore only_active_experts log message
* Remove unnecessary hacks, disable fusion for now.
* qwen3next: port hybrid recurrent state memory semantics
* qwen3next: clean up recurrent state slot plumbing
* qwen3next: fix hybrid V-cache layout plumbing
* qwen3next: guard recurrent state slots against kv capacity
* qwen3next: persist recurrent state in session data
- serialize/restore qwen3next cache.s_l in state/session paths\n- bump session and sequence-state file versions for format change\n- fallback to single-token chunking for mixed repeated seq_id batches
* qwen3next: drop unused fused-delta builder path
- remove dead build_delta_net_fused lambda\n- remove unused llm_build_context::fused_delta member
* qwen3next: remove unused fused-delta CLI/context plumbing
- drop -fd/-no-fd options and related YAML dump field\n- remove fused_delta fields from public/internal context params\n- remove fused_delta assignment and logging in context init
* ggml: remove unused DELTA_NET operator stack
* Missing include
* Reorder ops/unary ops
So we don't change again the enum values of the mul mat ops
* Minor
* Discard unnecessary changes in llama-build-context.cpp
* Minor
* Revert "Discard unnecessary changes in llama-build-context.cpp"
This reverts commit edadb80ed68c4c0831e9c22609a9a3af19be9735.
* Increase GGML_SCHED_MAX_SPLITS - required for larger u-batches
* Fix CPU concat in the TG case: 7.25 -> 10.5 t/s for Qwen3Next
* Fix CPU sum_rows: 10.5 -> 13.6 t/s for Qwen3Next
It was single-threaded and was taking ~25% of the computation time
during TG. It is now down to 2%.
Strangely enough, I measure 13.6 t/s with llama-bench, but if I
let the model give me an actual response with llama-cli, I get close
to 17 t/s.
* Fix CPU scale: 13.6 -> 16.7 t/s for Qwen3Next
For Qwen3Next there is a scale op on a largish tensor (548k elements)
that has a single row for TG, so was done in a single thread.
We now simply use blocks of 1024 elements.
* Optimize CPU mul: 16.7 -> 17.6 t/s for Qwen3Next
* CPU: fuse transpose -> cont -> sum_rows -> transpos: 17.6 -> 23.1 t/s for Qwen3Next
* Optimize CPU repeat: 176 -> 200 t/s for Qwen3Next PP-512
* Multithreading for OP_SUB
* Don't commit with timing trace on
* Multithread neg and sigmoid
* Be able to turn on/off fusion more easily (CPU)
* Name the mul_mat ops so we know where the time goes
* WIP
* Much better PP on CUDA
* CUDA: fuse transpose -> cont -> sum_rows -> transpose
Needs non-coontiguous variant of sum_rows.
On the CPU this gave 30+% improvement in TG performance,
on CUDA ist is disapointing 6-7%. I guess, this is because
Georgi's cont CPU implementation was so bad that skipping
it made such a big difference.
* CUDA: faster mul for special case relevant for Qwen3Next
Worth 1% in TG
* Fix CPU OP_CONT
---------
Co-authored-by: yurko <yurko@local>
Co-authored-by: Yurko <yurko@example.com>
Co-authored-by: yurko <yurko@pop-os.tail5a1a6b.ts.net>
Co-authored-by: Yurko Hoshko <YurkoHoshko@users.noreply.github.com>