1163 Commits

Author SHA1 Message Date
Nexes the Elder
094f76ee86
Cleaner log for adjusted splits (#1494)
* sweep-bench: add more skipped patterns to --minilog

* cleaner log for adjusted splits

* Add totalization for adjusted splits

* Clean up semicolons

* Addition for totalizer ^^

* Change accordingly to review

* Forgotten leftover removed

* 'total' instead of 'totalized'
2026-03-24 07:49:40 +01:00
firecoperana
cdf9142aa5
fix grammar stack empty error for qwen3.5 (#1490)
* fix grammar stack empty error for qwen3.5

* Add to --help

---------

Co-authored-by: firecoperana <firecoperana>
2026-03-24 07:48:20 +01:00
Kawrakow
4eb08208f2
Fix misleading quantize error message (#1493) 2026-03-23 13:55:18 +01:00
firecoperana
0c9bc3ed28
server: support --minilog to log request message for completions/response/anthropic and response (#1477)
Co-authored-by: firecoperana <firecoperana>
2026-03-20 16:13:43 +01:00
firecoperana
10b44eca72
server: sync anthropic api code (#1469)
* server: sync anthropic api code

* fix cc header issue

---------

Co-authored-by: firecoperana <firecoperana>
2026-03-20 10:18:45 +01:00
Juk Armstrong
08f81b5afd
Fix batch calculation for image processing (#1475) 2026-03-20 10:16:05 +01:00
Nexes the Elder
6c665f38fd
sweep-bench: add --minilog argument to reduce verbose logging (#1468)
Purpose:
Add --minilog flag to llama-sweep-bench that filters log output to show only essential GPU/layer distribution information while suppressing verbose model metadata and per-layer device assignment messages.

Changes:
- Add llama_selective_log_callback with blacklist approach (sweep-bench.cpp)

Blacklisted patterns (hidden):
- Per-layer device assignments ('Setting default device in layer')
- KV metadata dump header and entries
- Tensor type counts
- Model validation messages
- EOG/special token cache info
- Metadata printout (llm_load_print_meta, print_info)
- Layer sizes table
- Tensor loading info (llm_load_tensors)
- Separator lines
- Most common cases of incomplete/continuation lines are also hidden

All other log output is shown, including:
- GPU VRAM info
- Split/buffer distribution per device
- Graph split estimates
- Final benchmark table and timings
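
The blacklist approach described above can be sketched roughly as follows. This is a minimal illustration, not the actual `llama_selective_log_callback`: the function name `minilog_should_print` and the constant `k_minilog_blacklist` are hypothetical, and only substrings quoted in the commit message are used as patterns.

```cpp
#include <array>
#include <cassert>
#include <string_view>

// Hypothetical blacklist: only substrings quoted in the commit message above.
constexpr std::array<std::string_view, 4> k_minilog_blacklist = {
    "Setting default device in layer",
    "llm_load_print_meta",
    "print_info",
    "llm_load_tensors",
};

// Returns true when a log line should still be printed under --minilog.
bool minilog_should_print(std::string_view line) {
    for (std::string_view pattern : k_minilog_blacklist) {
        // An empty pattern would match (and hide) every line.
        assert(!pattern.empty());
        if (line.find(pattern) != std::string_view::npos) {
            return false;  // blacklisted: suppress
        }
    }
    return true;  // blacklist approach: anything not listed is shown
}
```

A blacklist keeps unknown or newly added log lines visible by default, which matches the "all other log output is shown" behaviour stated above.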
2026-03-20 09:40:56 +01:00
firecoperana
f9b7fe9749
llama: add --dry-run option (#1462)
Co-authored-by: firecoperana <firecoperana>
2026-03-18 17:20:17 +01:00
Nexes the Elder
61fad8b094
Print timings in sweep-bench (#1454) 2026-03-18 06:57:00 +01:00
StrikeOner
a399456c12
fix: propagate CPPHTTPLIB_OPENSSL_SUPPORT to cpp-httplib target when LLAMA_SERVER_SSL=ON (#1451)
Without this, libcpp-httplib.a is compiled without SSL support, causing
an undefined reference to httplib::SSLServer at link time even though
the OpenSSL libraries are present on the link line.

Fixes #1449

Co-authored-by: kerem seyhan <kerem.seyhan@codecut.de>
2026-03-17 16:39:11 +01:00
hksdpc255
fe92e30d1e
server : preserve anthropic thinking blocks in conversion (#1441) 2026-03-16 13:59:19 +01:00
hksdpc255
18a9b4c125
fix chat parser not being used in anthropic api (#1437) 2026-03-16 08:59:01 +01:00
hksdpc255
a655a95378
Prevent adding content that starts with 'x-anthropic-' to system_content. (#1436) 2026-03-16 08:57:09 +01:00
dungquixote42
be2940f57a
Adaptive P sampler: update review logic, delete old code comments, put prep stage after logit bias (#1386)
* simpler n_rewind logic, delete old comments

* use more consistent names, add updt_w_cur to json schema

* align comments

* refactor review logic, update struct/variable names

* revert cosmetic changes

* check enable/disable in llama_prep_adaptive_p_impl()

* delete extra whitespaces after statement

* show target in debug prints

* more concise debug print

* delete old comments

* update with loop instead of move()

* comment out all adaptive p debug prints

* more debug prints

* move review() variables: common_sampler struct -> common_sampler_review() args

* match n_unsent type

* fix merge bugs, delete adaptive p references in buffer_and_check_string_ban()

* restore accidental erasure

* Revert "adaptive p: collect probability before logit bias"

This reverts commit 1434878461c49d1a2a9047fc15d5e7b78421fd2a.
2026-03-14 12:34:12 +01:00
Kawrakow
633c1baa94
Enable imatrix calculation for models with fused ffn_up/gate_exps tensors (#1418) 2026-03-13 17:57:38 +01:00
firecoperana
433531ddae
server : support multi-modal context checkpoints and prompt caching (#1398)
* server : support multi-modal context checkpoints and prompt caching

do not create checkpoint right after image processing

improve mtmd check for slot ops

fix context shift

do not abort if template parse failed

* change to debug message when detecting ban token

---------

Co-authored-by: firecoperana <firecoperana>
2026-03-13 08:07:57 +01:00
SneedwareInc
525d8b8a40
Update server string+regex ban documentation (#1407)
* Update server string/regex ban documentation

* Update README.md

* Update README.md
2026-03-13 07:08:38 +01:00
SneedwareInc
4a247593dc
Make string ban more robust and add regex ban (#1243)
* Test new ctx_sampling->n_rewind system

* CRLF quickfix

* Adaptive p check

* merge banned_n

* Fix attempt 1

* Fix attempt 2
2026-03-11 15:30:27 +01:00
firecoperana
ab1d74074b
common : introduce composable PEG parser combinators for chat parsing and new jinja template engine (#1369)
---------

Co-authored-by: Piotr Wilkin <piotr.wilkin@syndatis.com>

common : add nemotron 3 parsing (#18077)

common : add parser for ministral/mistral large 3/devstral 2 (#17713)

common : default content to an empty string (#18485)

chat: make tool description and parameters optional per OpenAI spec (#18478)

Per the OpenAI API specification, both 'description' and 'parameters'
fields in tool function definitions are optional. Previously, the parser
would throw an exception if these fields were missing.

Attempts to fix #17667

common : implement new jinja template engine (#18462)
---------

Co-authored-by: Alde Rojas <hello@alde.dev>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

jinja: correct member access rule (#18905)

jinja : fix lexing of float literals with sign (#18901)

jinja : add missing tojson filter for bool (#18900)

jinja : attribute support for join, map and sort (#18883)

jinja : fix object item order (and properly implement dictsort) (#18904)

tests : add test-jinja -py option for cross-checking (#18906)

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>


ci : run test-jinja -py on high perf [no ci] (#18916)

jinja : fix undefined keys and attributes and int/float as bool (#18924)

jinja: support none|string (#18995)

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>


jinja : implement mixed type object keys (#18955)

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>

jinja : undefined should be treated as sequence/iterable (return string/array) by filters/tests (#19147)

`tojson` is not a supported `undefined` filter

keep it DRY and fix some types

jinja : do not pass empty tools and add some none filters (#19176)

jinja : add unordered_map include to value.h [no ci] (#19205)

jinja : add missing 'in' test to template engine (#19004) (#19239)

The jinja template parser was missing the 'in' test from
global_builtins(), causing templates using reject("in", ...),
select("in", ...), or 'x is in(y)' to fail with
"selectattr: unknown test 'in'".

This broke tool-calling for Qwen3-Coder and any other model
whose chat template uses the 'in' test.

Added test_is_in supporting array, string, and object containment
checks, mirroring the existing 'in' operator logic in runtime.cpp.

Includes test cases for all three containment types plus
reject/select filter usage.
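
The three containment types mentioned above can be illustrated with a small sketch. The types `Array`, `Object`, `Value` and the function `is_in` are simplified, hypothetical stand-ins; the real `test_is_in` operates on the template engine's own value types, which are not shown here.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <string>
#include <variant>
#include <vector>

// Simplified stand-ins for template values (hypothetical types).
using Array  = std::vector<std::string>;
using Object = std::map<std::string, std::string>;
using Value  = std::variant<Array, std::string, Object>;

// `needle is in(haystack)`: element of an array, substring of a string,
// or key of an object.
bool is_in(const std::string & needle, const Value & haystack) {
    if (const auto * arr = std::get_if<Array>(&haystack)) {
        return std::find(arr->begin(), arr->end(), needle) != arr->end();
    }
    if (const auto * str = std::get_if<std::string>(&haystack)) {
        return str->find(needle) != std::string::npos;
    }
    return std::get<Object>(haystack).count(needle) > 0;
}

// Usage: one check per containment type.
void is_in_examples() {
    assert(is_in("b", Value{Array{"a", "b"}}));
    assert(is_in("ell", Value{std::string("hello")}));
    assert(!is_in("v", Value{Object{{"k", "v"}}}));  // keys only, not values
}
```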

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

---------

Co-authored-by: Sid Mohan <sidmohan0@users.noreply.github.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>

Add Jinja support for "indent" string filter (#19529)

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>


add vendor

refactor chat

server : support preserving reasoning_content in assistant message (#18994)

chat : fix translategemma crash on common_chat_format_example (#19019)

chat: fix language input for translategemma (#19052)

Co-authored-by: Aldehir Rojas <hello@alde.dev>


chat: fix case where template accepts type content only (#19419)

mtmd : chat : Fix extra \n between text and media marker (#19595)

Thanks to @tugot17 for detecting and reporting the issue.

For vision models (e.g. LFM2.5-VL-1.6B and Qwen/Qwen3-VL-4B-Instruct) `llama-mtmd-cli` produces identical output to HF implementation.

However `llama-server` doesn't. I traced it down to extra newline
inserted after `<__media__>`.

This happens in `to_json_oaicompat`, that treats media markers as text
and joins all parts with `\n` separator.

PR introduces new type `media_marker` and uses it for media markers.
Extra logic is added to prevent insertion of newlines before and after
media markers.

With this change number of input tokens is identical to HF
implementation and as a result the output is also identical.
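
The joining rule described above might look roughly like this. It is a sketch under assumptions: `PartType`, `Part` and `join_parts` are hypothetical names, and the real `to_json_oaicompat` works on JSON content parts rather than this simplified struct.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

enum class PartType { text, media_marker };

struct Part {
    PartType    type;
    std::string text;
};

// Joins parts with '\n', except directly before or after a media marker.
std::string join_parts(const std::vector<Part> & parts) {
    std::string out;
    for (std::size_t i = 0; i < parts.size(); ++i) {
        if (i > 0) {
            const bool touches_marker =
                parts[i].type == PartType::media_marker ||
                parts[i - 1].type == PartType::media_marker;
            if (!touches_marker) {
                out += '\n';
            }
        }
        out += parts[i].text;
    }
    return out;
}

// Usage: no newline around the marker, newline between plain text parts.
void join_parts_example() {
    assert(join_parts({{PartType::text, "a"},
                       {PartType::media_marker, "<__media__>"},
                       {PartType::text, "b"}}) == "a<__media__>b");
    assert(join_parts({{PartType::text, "a"}, {PartType::text, "b"}}) == "a\nb");
}
```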

I explored other ways to address the issue
* remove completely `\n` between text parts in `to_json_oaicompat`
* merge text messages in server-common.cpp before sending them to `to_json_oaicompat`

Please propose alternative ways of fixing this issue.

Co-authored-by: Piotr Wilkin (ilintar) <piotr.wilkin@syndatis.com>


common : merge qwen3-coder and nemotron nano 3 parsers (#19765)

common : fix improper trimming in XML parser on complete message (#19805)

Co-authored-by: Jules LEIDELINGER <11395311+julio75012@users.noreply.github.com>

jinja: correct stats for tojson and string filters (#19785)

jinja : correct default size for string slices (#19913)

common : handle unicode during partial json parsing (#16526)

common : fix json schema with '\' in literals (#17307)

add back qwen_coder_xml and mirothinker

Co-authored-by: Aldehir Rojas <hello@alde.dev>
2026-03-09 11:03:33 +01:00
dungquixote42
a903409a5e
fix adaptive p sampler rewinding too far back (#1359)
* fix adaptive p sampler rewinding too far back

* update comments

* correct default value for total_weight, more comments

* new variables/names

* update comment for n_rewind

* move null pointer check back to common_sampler_review()

* refactor weighted_sum and total_weight to vector<pair>, better boundary check in llama_review_adaptive_p_impl()
2026-03-04 13:26:25 +01:00
Kawrakow
fd16a418de
Fix clang warnings on macOS (#1354)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2026-03-03 16:27:16 +01:00
Kawrakow
505e2c57f9
Reduce memory use when processing large images (#1349) 2026-03-02 17:54:56 +01:00
Nexes the Elder
d4ac5f1566
gguf-split: fix the split output files naming (#1336)
* Fix gguf-split.cpp splits output naming

With this fix, the initial extension of the source .gguf file is not included in the naming of the output file before the numeration of the splits.

ex:

No more model.gguf-00001-of-00200.gguf
Instead, model-00001-of-00200.gguf
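
A minimal sketch of the naming rule, assuming the five-digit `-NNNNN-of-NNNNN.gguf` suffix seen in the example above. The function `split_output_name` is hypothetical and is not the code in gguf-split.cpp.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Builds the output name of one split, dropping a trailing ".gguf" from the
// input name before the split numbering is appended.
std::string split_output_name(const std::string & input, int split_no, int split_count) {
    assert(split_no >= 1 && split_no <= split_count);
    const std::string ext = ".gguf";
    std::string prefix = input;
    if (prefix.size() >= ext.size() &&
        prefix.compare(prefix.size() - ext.size(), ext.size(), ext) == 0) {
        prefix.erase(prefix.size() - ext.size());
    }
    char suffix[64];
    std::snprintf(suffix, sizeof(suffix), "-%05d-of-%05d.gguf", split_no, split_count);
    return prefix + suffix;
}
```

With this rule, `split_output_name("model.gguf", 1, 200)` gives `model-00001-of-00200.gguf` rather than `model.gguf-00001-of-00200.gguf`.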

* increase GGML_MAX_CONTEXTS to 2048

* Revert GGML_MAX_CONTEXTS to 64
2026-03-02 08:43:47 +01:00
Kawrakow
d239dabcc6
Graph parallel for Qwen-3.5-MoE (#1347)
* Graph parallel for Qwen3.5-MoE

* Add --max-gpu to llama-bench

* Fix graph reuse when not all GPUs participate in self-attention
2026-03-02 07:48:43 +01:00
firecoperana
8f9e19d57c
server: add checkpoint tolerance and fix grammar_trigger init (#1346)
Co-authored-by: firecoperana <firecoperana>
2026-03-02 07:45:32 +01:00
Kawrakow
04c140fe54
Make vision work with Qwen-3.5 models (#1345) 2026-03-01 17:44:37 +01:00
Kawrakow
0ff3a43289
Bring back #1333 and #1335 (#1340)
* Bring back fused delta net 3

* Remove autoregressive and chunking
2026-02-28 14:31:42 +01:00
Kawrakow
1922449b2c
Revert delta net 3 (#1339)
* Revert "Simplify delta-net (#1335)"

This reverts commit e5fc30244cf638852293390bfdbda856d6b0869e.

* Revert "Fused delta net 3 (#1333)"

This reverts commit 7b68353e0920c0c472bc28c708e38a6766490eb8.
2026-02-28 13:12:08 +01:00
Kawrakow
e5fc30244c
Simplify delta-net (#1335)
* Simplify delta-net

* Minor

* Minor
2026-02-28 11:12:19 +01:00
Kawrakow
7b68353e09
Fused delta net 3 (#1333)
* This is better than chunked

* Keep the state in registers

* Cleanup

* Remove unused stuff

* Minor

* Make fused delta-net the default

* Fix race
2026-02-27 15:02:56 +01:00
firecoperana
3fac78c48b
server: enable checkpoint for recurrent models (#1310)
* server: enable checkpoint for recurrent models

create checkpoint after cancel

fix ban string and rm context during rewind

add checkpoint interval

only save recurrent cache

* save checkpoint during pp

---------

Co-authored-by: firecoperana <firecoperana>
2026-02-26 06:51:18 +01:00
Kawrakow
c77ec4b8b8
Fused delta-net (#1315)
* Revive fused delta-net

* Add command line argument for fused delta net

* Simplify/improve CUDA delta-net

* Add -fdn to llama-bench

* More CUDA fused delta net optimizations

* CPU optimizations

* Much faster fused delta-net on the CPU

It seems it is faster than the chunked implementation!

* Change meaning of fdn from bool flag to threshold value

* Use eps = 1e-6

* Give some nodes a name
2026-02-25 14:12:48 +01:00
Nexes the Elder
170467e835
Llama-quantize: Partial requant feature (#1313)
* Partial Requant feature for llama-quantize

- Inspired by the recently ported --dry-run feature.
- Allows partially requantizing a split quantized .gguf by requantizing only the splits missing from the destination directory.
- Works for GGUFs split tensor by tensor or by groups of several tensors (the latter is not well tested beyond 2 tensors per split).
- Vibe coded.
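
The "only the missing splits" idea can be sketched as a directory check. This is an illustration under assumptions, not the llama-quantize implementation: `missing_splits` is a hypothetical helper, and the `-NNNNN-of-NNNNN.gguf` file naming is assumed from the gguf-split convention.

```cpp
#include <cassert>
#include <cstdio>
#include <filesystem>
#include <string>
#include <vector>

namespace fs = std::filesystem;

// Returns the 1-based numbers of the splits that do not yet exist in dest_dir
// and would therefore still need to be requantized.
std::vector<int> missing_splits(const fs::path & dest_dir,
                                const std::string & prefix,
                                int split_count) {
    assert(split_count >= 1);
    std::vector<int> missing;
    for (int i = 1; i <= split_count; ++i) {
        char name[64];
        std::snprintf(name, sizeof(name), "-%05d-of-%05d.gguf", i, split_count);
        if (!fs::exists(dest_dir / (prefix + name))) {
            missing.push_back(i);
        }
    }
    return missing;
}
```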

* Create output directory if it doesn't exist in llama-quantize

* Create output directory if it doesn't exist in gguf-split

* Add exit when directory fails to be created on Windows

* Use std::filesystem

* cleanup
2026-02-25 07:25:15 +01:00
Joshua Jolley
68431b049a
server: propagate task index to response objects for batch requests (#1303)
When multiple prompts are sent in a single /v1/completions request,
each response needs to carry the correct index so the client can
match results to their corresponding prompts. The index field was
not being set on partial responses, final responses, or embedding
responses, causing batch results to all report index 0.

Set res->index = slot.task->index in send_partial_response,
send_final_response, and send_embedding.
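
Why the index matters on the client side can be shown with a small sketch. The `Task` and `Result` structs and both functions are hypothetical simplifications; only the assignment `res.index = task.index` mirrors the change described above.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

struct Task   { int index; std::string prompt; };
struct Result { int index = 0; std::string text; };

// Server side: every response carries the index of the task it belongs to.
Result make_result(const Task & task, std::string text) {
    Result res;
    res.index = task.index;  // without this, every result would report 0
    res.text  = std::move(text);
    return res;
}

// Client side: results may arrive in any order; the index restores the
// prompt order.
std::vector<std::string> order_by_index(const std::vector<Result> & results) {
    std::vector<std::string> ordered(results.size());
    for (const Result & res : results) {
        assert(res.index >= 0 &&
               static_cast<std::size_t>(res.index) < ordered.size());
        ordered[static_cast<std::size_t>(res.index)] = res.text;
    }
    return ordered;
}
```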

Generated with [Devin](https://cli.devin.ai/docs)

Co-authored-by: Joshua Jolley <jjolley@clearwateranalytics.com>
Co-authored-by: Devin <noreply@cognition.ai>
2026-02-24 15:39:38 +01:00
Kawrakow
cfb6747776
llama-quantize: --dry-run option (#1309) 2026-02-24 15:21:52 +01:00
Samuel Oliveira Alves
09a88c9ae5
Add MTP decoding support for GLM-4.x MoE (#1270)
* wip: port MTP architecture

Ports the Multi-Token Prediction (MTP) architecture to the older `llama.cpp` codebase used by `ikllama`.

Changes include:
- Updating `llama_batch` to support `mtp_params`.
- Modifying `llama_decode_internal` (and `encode`) to handle MTP operations (Warmup, Update, Draft).
- Adding public APIs for MTP state management (`llama_set_draft_input_hidden_state`).
- Adapting the embedding extraction logic to skip MTP update passes.

* Refactors `server_slot` to support generic speculative decoding (MTP or Draft Model).

* core: enable hybrid outputs (logits + embeddings) for MTP support

* fix(mtp): correct KV-cache slot finding for updates

* fix(mtp): persist hidden states to prevent context corruption during drafting

* refactor(mtp): clean unused code

* fix(mtp): update server to new functions name

* fix(mtp): fix graph and save hidden state

* mtp: refactor integration, context params and kv cache search

* mtp: fix hidden state extraction and speculative acceptance flow

* server: fix MTP warmup for long prompts and reset token buffer

* llama: refactor MTP operation state to context parameters

* server: fix n_past calculation in MTP acceptance

* llama: fix mtp enable flags

* speculative: refactor MTP to use common_speculative interface

* context: remove unused signatures

* clip: fix deprecated enum-enum conversion warning

* common: fix format string crash in help message

* context: fix mtp activation logic
2026-02-22 18:14:39 +01:00
firecoperana
66323b92f7
Qwen3.5-MoE: fix regenerating message error (#1295)
Co-authored-by: firecoperana <firecoperana>
2026-02-21 18:24:12 +01:00
dungquixote42
0f411b02e2
Fix adaptive p sampler bug with string ban (#1287)
* adaptive p: update internal state only if not rewinding

* adaptive p: conditional update for speculative decoding

* adaptive p: refactor to rewind instead of update

* adaptive p fix: better comments

* fix rewind check

* add record to handle multi-token rewind

* better comment
2026-02-20 07:11:36 +01:00
rkozuch
b855bf92de
Fix slot prompt updating. (#1285)
Co-authored-by: Rkozuch <you@example.com>
2026-02-19 08:15:49 +01:00
Samuel Oliveira Alves
51df09be8a
Feat - add kimi 2.5 Vision (#1280)
* port kimi 2.5 vision from upstream

* feat(clip): add support for Kimi K2.5 vision model
2026-02-19 08:15:03 +01:00
Samuel Oliveira Alves
88f98c891d
server: add string ban in speculative path (#1274) 2026-02-17 12:33:28 +01:00
RodriMora
102f77b7d3
server: add /v1/responses support (#1184)
* server: add /v1/responses support

* server: fix Responses API model fallback and SSE branching
2026-02-14 08:30:18 +01:00
firecoperana
1cb7e1bf39
spec : add self speculative decoding, ngram and refactor (#1261)
* spec : add self speculative decoding and ngram-mod and refactor

common : use common_ prefix for common library function

llama : use LLAMA_TOKEN_NULL

spec : add self speculative decoding (no draft model required) + refactor

spec : add ngram-mod

spec : various improvements to ngram-map + docs

spec : fix the check-rate logic of ngram-simple

common : add common_speculative_is_compat()

spec : simplify time measurement using common_time_meas

refactor common_sampler_init

refactor common_token_to_piece

refactor and fix cur_p bug

clean up

* spec : remove check rate

* spec: show warnings instead of abort

---------

Co-authored-by: firecoperana <firecoperana>
Co-authored-by: Sascha Rogmann <59577610+srogmann@users.noreply.github.com>
2026-02-13 19:04:55 +01:00
firecoperana
f1ccf340dd
fix model name missing in final response (#1250)
Co-authored-by: firecoperana <firecoperana>
2026-02-07 18:31:39 +02:00
firecoperana
8d952ff183
Server: add string ban (#1185)
* server: add string ban

* increase rewind limit

* init n_buffer

---------

Co-authored-by: firecoperana <firecoperana>
2026-02-05 08:12:34 +02:00
gapeleon
17d101863d
server: add dynamic control vector management endpoints (#1223)
This implements the ability to load, unload, and scale control vectors
(representation engineering) mid-inference, following the existing
task-queue pattern used by LoRA adapters.

New Endpoints:
- GET  /control-vectors
- POST /control-vectors/load
- POST /control-vectors/unload
- POST /control-vectors/apply (handles scaling)

Technical Notes:
- Centralizes vector aggregation logic to share implementation between
  load, unload, and apply tasks.
- Vectors are applied globally to the model context.
- Enforces dimension validation on load to safely reject incompatible
  vectors.
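
The dimension check and the shared aggregation step from the technical notes might be sketched like this. All names here (`ControlVector`, `try_load`, `aggregate`) are hypothetical, and the sketch uses one flat vector of size `n_embd`, whereas the real implementation may store per-layer data.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

struct ControlVector {
    std::vector<float> data;
    float              scale;
};

// Load: reject vectors whose dimension does not match the model.
bool try_load(std::vector<ControlVector> & loaded, ControlVector cv, std::size_t n_embd) {
    if (cv.data.size() != n_embd) {
        return false;
    }
    loaded.push_back(std::move(cv));
    return true;
}

// Shared aggregation: scaled sum of all loaded vectors, recomputed after
// every load, unload, or apply (scale change).
std::vector<float> aggregate(const std::vector<ControlVector> & loaded, std::size_t n_embd) {
    std::vector<float> sum(n_embd, 0.0f);
    for (const ControlVector & cv : loaded) {
        assert(cv.data.size() == n_embd);  // guaranteed by try_load
        for (std::size_t i = 0; i < n_embd; ++i) {
            sum[i] += cv.scale * cv.data[i];
        }
    }
    return sum;
}
```

Recomputing the sum from the full list after each operation is one simple way to share a single code path between load, unload, and apply.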

Co-authored-by: Gapeleon <gapeleon@users.noreply.github.com>
2026-02-04 16:07:18 +02:00
firecoperana
7e8d444033
llama : add token matching support to llama-grammar (#1220)
* llama : add token matching support to llama-grammar

llama : add token matching support to llama-grammar (#17816)

common/grammar : replace problematic backtracking regex `[\s\S]*` (#18342)

* disable tests and fix warnings

---------

Co-authored-by: firecoperana <firecoperana>
2026-02-03 07:57:17 +02:00
Kawrakow
573e23679d
sweep_bench: set number of repetitions (#1176) 2026-01-22 12:28:30 +02:00
firecoperana
d71a3ec315
Server: refactor and rename functions (#1151)
* Server: rename functions and refactor code

rename functions

refactor update slots

rename params_base

rename timings

* change

* Revert kv cache name changes

* Revert 2

* fix test build error

---------

Co-authored-by: firecoperana <firecoperana>
2026-01-18 08:16:57 +02:00
firecoperana
ee463b079e
Webui: add text completions and adaptive_p sampling (#1153)
* Webui: add text completions and adaptive_p sampling

* update description

---------

Co-authored-by: firecoperana <firecoperana>
2026-01-17 08:37:07 +02:00