44 Commits

Author SHA1 Message Date
Kawrakow
0d59973e4a
Fix MTP warmup for GLM models (#1992) 2026-06-19 08:59:55 +02:00
Kawrakow
f5e5753c32
Fix Qwen35 mtp warmup (#1987)
* Use hidden state from prev token from qwen mtp

* Fix Qwen35 MTP warmup

* Cleanup + remove unnecessary crippling of performance by not using accept to sample draft token

* Provide API to get the model arch string

---------

Co-authored-by: SamuelOliveirads <samueloliveira32df@gmail.com>
2026-06-18 09:03:40 +02:00
SamuelOliveirads
6cae8c7ba2 clean logs 2026-06-14 21:07:57 -03:00
SamuelOliveirads
0d75eee35a remove duplicated code and unnecessary refactor 2026-06-14 16:02:02 -03:00
SamuelOliveirads
3b1a0f88d5 Add logging for DFlash statistics and clean up workspace handling 2026-06-13 20:14:08 -03:00
SamuelOliveirads
3a1d46c4d1 Merge remote-tracking branch 'origin/main' into feat/dflash-implementation
# Conflicts:
#	common/common.cpp
#	common/speculative.cpp
#	convert_hf_to_gguf.py
#	examples/server/server-context.cpp
#	examples/server/server-context.h
#	src/llama-arch.cpp
#	src/llama-arch.h
#	src/llama-model.cpp
#	src/llama.cpp
2026-06-13 17:27:52 -03:00
Samuel Oliveira Alves
8a38025174
Refactor: Move spec outside server (#1949)
* Refactor speculative decoding: move logic outside of server

* remove duplicated tokens in mtp kv cache

* narrow to only discard draft cells in MTP

* revert mtp_speculative_gen_draft
2026-06-12 18:12:39 +02:00
Kawrakow
366e478cb6
Bug fixes (#1940)
* Bug fixes

* More
2026-06-10 07:45:49 +02:00
Samuel Oliveira Alves
007d640098
Standardize speculative decoding arguments on the server (#1908)
* refactor spec args

* add shell-safe quoting of string-valued stage keys in speculative decoding
2026-06-04 15:44:57 +02:00
SamuelOliveirads
dc43cdf06b move dflash to its own file 2026-06-02 10:22:13 -03:00
SamuelOliveirads
3d73312d9d apply workspace support for KV cache 2026-06-01 09:55:34 -03:00
SamuelOliveirads
ed403dca27 Use windows update in kv cache 2026-05-31 14:51:21 -03:00
SamuelOliveirads
1369e68471 fix graph mask, swa layers and tokens positions 2026-05-31 11:12:03 -03:00
SamuelOliveirads
532499836e improve DFlash caching and profiling capabilities 2026-05-30 21:36:10 -03:00
SamuelOliveirads
9f5f70cf7e implement target position tracking and context management 2026-05-29 23:11:38 -03:00
SamuelOliveirads
82cff238fe Initial dflash implementation 2026-05-28 18:57:58 -03:00
Kawrakow
3f45ba9387
MTP tweaks 3 (#1862) 2026-05-23 07:23:20 +03:00
Samuel Oliveira Alves
11a1fea9e2
Move embedding management to speculative (#1825)
* refactor speculative decoding with companion context and draft result structures

* feat: add common speculative feature handling in server context

* refactor: move embeddings outside server

* feat: harden draft input hidden state in llama context

* remove unused functions

* refactor: streamline speculative feature handling and remove unused code

* remove redundant code

* remove more unused variables

* refactor: implement speculative feature handling
2026-05-20 17:42:48 +03:00
firecoperana
104846ddee
spec : discard last drafted token with low prob (#1820)
* spec : discard last drafted token with low prob

* Apply suggestion from @ikawrakow

Co-authored-by: Kawrakow <iwankawrakow@gmail.com>

---------

Co-authored-by: firecoperana <firecoperana>
Co-authored-by: Kawrakow <iwankawrakow@gmail.com>
2026-05-19 08:35:35 +03:00
Samuel Oliveira Alves
f4f4b3ff26
Allow dual speculative decoding (#1789)
* wip: test logic to use multiple specs

* feat: introduce composite speculative decoding stages

* handle MTP context and draft invalidation

* fix: allow gemma mtp for speculative stages

* fix: normalize spec stage keys

* refactor: remove enable_mtp flag and improve speculative stage handling

* fix: update cached text tokens handling for stage chains

* feat: implement sync for external MTP after non-MTP accept
2026-05-15 10:10:40 +03:00
Kawrakow
949bb8f1d6
More MTP tweaks (#1792) 2026-05-13 17:55:43 +03:00
Samuel Oliveira Alves
be8435793e
Pre-allocate buffers for hybrid model checkpoints (#1774)
* hybrid-spec: improve recurrent checkpoint handling in speculative decoding

* change per-step save to support scheduling and asynchronous tensor operations

* remove redundant backend tensor fallback

* improve recurrent tensor handling for split graph
2026-05-12 07:21:25 +03:00
Lingfeng Ren
c2f498ab4c
MTP: use target slot position for drafting (#1781) 2026-05-12 07:21:03 +03:00
Lingfeng Ren
35845dd975
server : support MTP with multimodal prompts (#1758)
Synchronize MTP state after mtmd decode batches so multimodal prompt chunks do not desync the draft context.
2026-05-11 09:51:07 +03:00
Samuel Oliveira Alves
c2b8bca807
Add MTP Support for Gemma 4 (#1744)
* gemma-mtp: build the arch to load the MTP model

* gemma-mtp: fix mtp kv state

* gemma-mtp: refactor some functions and create gguf

* gemma-mtp: make usable for embeddings models variant

* gemma-mtp: fix qwen mtp load in graph split

* gemma-mtp: refactor tensor creation and adjust output tensor handling

* Gemma 4 MTP: improve tensor handling, and adjust split mode logic
2026-05-10 07:44:20 +03:00
Kawrakow
96127976f2
Use AVX2 when available for greedy speculative sampling (#1761)
* Use AVX2 when available for greedy speculative sampling

* Avoid some code duplication
2026-05-09 08:32:20 +03:00
Kawrakow
9f60de9cc5
Fix discarding tokens from the KV cache during MTP drafting (#1757) 2026-05-09 08:31:25 +03:00
Kawrakow
e722f0bb73
MTP tweaks (#1741) 2026-05-06 08:35:11 +03:00
Kawrakow
8b56d813a9
MTP improvements (#1736)
* MTP improvements

* Cleanup
2026-05-05 08:05:24 +03:00
dmaivel
1b14f56693
speculative: keep MTP draft hidden state alive across steps (#1718) 2026-05-02 16:05:41 +03:00
Paul Dubs
0a167082a3
Reset i_last when low acceptance streak occurs (#1701)
By resetting i_last to zero, we will include the current context when rebuilding the speculative map.
2026-04-27 14:05:36 +02:00
Samuel Oliveira Alves
ea94afe777
Speculative checkpoints for recurrent models (#1669)
* server: spec checkpoints for recurrent models

* fix: save/restore sampler state during speculative checkpoint

When speculative decoding rejects draft tokens and restores the
recurrent state checkpoint, the sampler (RNG, grammar, prev tokens)
must also be restored to maintain consistency. Without this, the
sampler state reflects the rejected draft tokens, leading to
potential divergence.

Uses common_sampler_clone() to snapshot the sampler before the
speculative batch decode, and restores it on rejection.

* server: snapshot recurrent state in tensor

* reset ngram mod state for rejected tokens

* server: refactor checkpoint state logic

* speculative: fix sampler for checkpoints

* recurrent model: implement recurrent kernel checkpoint

* recurrent model: refactor api

* spec: free rbudget before overwriting
2026-04-24 09:59:30 +02:00
Samuel Oliveira Alves
260622faf6
Self-decoding: Adds support for suffix decoding (#1646)
* speculative: implement suffix-tree decoder

* speculative: add support to cache and tuner
2026-04-18 16:10:10 +02:00
Samuel Oliveira Alves
470d3a3b5b
Add support for parallel graphs to GLM MTP (#1637)
* mtp: fix split graph assert

* Add mtp split graph mode

* remove unused ffn function for unsupported mtp

* revert cuda context synchronization
2026-04-16 08:05:34 +02:00
Samuel Oliveira Alves
557b674f63
Add llama_context to MTP (#1601)
* wip: separate llama_context for MTP with graph reuse

* wip: fix KV cache desync with separate MTP context

* refactor: remove dead mtp logic code, encapsulate KV mirroring

* mtp-context: derive args directly from the main model's context

* mtp: fix kv cache positions

* clean small comments

* minor refactor for context shift
2026-04-09 15:33:56 +02:00
Samuel Oliveira Alves
3de81530c5
Allow tuning of the best args for speculative decoding. (#1595)
* wip: build spec tuner for specific args

* wip: test different reward system

* spec-tune: fix the reward to find best params given a good TPS

* spec-tune: refactor logic for its own file

* minor clean for comments and modules
2026-04-08 08:02:42 +02:00
Samuel Oliveira Alves
1f3e832cb3
Improve mtp acceptance rate (#1499)
* wip: port MTP architecture

Ports the Multi-Token Prediction (MTP) architecture to the older `llama.cpp` codebase used by `ikllama`.

Changes include:
- Updating `llama_batch` to support `mtp_params`.
- Modifying `llama_decode_internal` (and `encode`) to handle MTP operations (Warmup, Update, Draft).
- Adding public APIs for MTP state management (`llama_set_draft_input_hidden_state`).
- Adapting the embedding extraction logic to skip MTP update passes.

* Refactors `server_slot` to support generic speculative decoding (MTP or Draft Model).

* core: enable hybrid outputs (logits + embeddings) for MTP support

* fix(mtp): correct KV-cache slot finding for updates

* fix(mtp): persist hidden states to prevent context corruption during drafting

* refactor(mtp): clean unused code

* fix(mtp): update server to new functions name

* fix(mtp): fix graph and save hidden state

* mtp: refactor integration, context params and kv cache search

* mtp: fix hidden state extraction and speculative acceptance flow

* server: fix MTP warmup for long prompts and reset token buffer

* llama: refactor MTP operation state to context parameters

* server: fix n_past calculation in MTP acceptance

* llama: fix mtp enable flags

* speculative: refactor MTP to use common_speculative interface

* context: remove unused signatures

* clip: fix deprecated enum-enum conversion warning

* common: fix format string crash in help message

* context: fix mtp activation logic

* llama: always use the extracted embedding

* llama: get all embeddings to kv cache

* llama: revert logit to not run mtp for not supported arch

* llama: allocate all the n_outputs for MTP

* wip

* server-context: get only the last embedding for hidden state

* ggml-backend: fix array out of bounds in debug build

* server-context: run mtp kv update for each prompt batch

* revert segmentation fault fixes

* glm-mtp(feat): optimize graph embedding and recursive drafting
2026-03-25 10:20:22 +01:00
Samuel Oliveira Alves
09a88c9ae5
Add MTP decoding support for GLM-4.x MoE (#1270)
* wip: port MTP architecture

Ports the Multi-Token Prediction (MTP) architecture to the older `llama.cpp` codebase used by `ikllama`.

Changes include:
- Updating `llama_batch` to support `mtp_params`.
- Modifying `llama_decode_internal` (and `encode`) to handle MTP operations (Warmup, Update, Draft).
- Adding public APIs for MTP state management (`llama_set_draft_input_hidden_state`).
- Adapting the embedding extraction logic to skip MTP update passes.

* Refactors `server_slot` to support generic speculative decoding (MTP or Draft Model).

* core: enable hybrid outputs (logits + embeddings) for MTP support

* fix(mtp): correct KV-cache slot finding for updates

* fix(mtp): persist hidden states to prevent context corruption during drafting

* refactor(mtp): clean unused code

* fix(mtp): update server to new functions name

* fix(mtp): fix graph and save hidden state

* mtp: refactor integration, context params and kv cache search

* mtp: fix hidden state extraction and speculative acceptance flow

* server: fix MTP warmup for long prompts and reset token buffer

* llama: refactor MTP operation state to context parameters

* server: fix n_past calculation in MTP acceptance

* llama: fix mtp enable flags

* speculative: refactor MTP to use common_speculative interface

* context: remove unused signatures

* clip: fix deprecated enum-enum conversion warning

* common: fix format string crash in help message

* context: fix mtp activation logic
2026-02-22 18:14:39 +01:00
firecoperana
1cb7e1bf39
spec : add self speculative decoding, ngram and refactor (#1261)
* spec : add self speculative decoding and ngram-mod and refactor

common : use common_ prefix for common library function

llama : use LLAMA_TOKEN_NULL

spec : add self speculative decoding (no draft model required) + refactor

spec : add ngram-mod

spec : various improvements to ngram-map + docs

spec : fix the check-rate logic of ngram-simple

common : add common_speculative_is_compat()

spec : simplify time measurement using common_time_meas

refactor common_sampler_init

refactor common_token_to_piece

refactor and fix cur_p bug

clean up

* spec : remove check rate

* spec: show warnings instead of abort

---------

Co-authored-by: firecoperana <firecoperana>
Co-authored-by: Sascha Rogmann <59577610+srogmann@users.noreply.github.com>
2026-02-13 19:04:55 +01:00
firecoperana
d71a3ec315
Server: refactor and rename functions (#1151)
* Server: rename functions and refactor code

rename functions

refactor update slots

rename params_base

rename timings

* change

* Revert kv cache name changes

* Revert 2

* fix test build error

---------

Co-authored-by: firecoperana <firecoperana>
2026-01-18 08:16:57 +02:00
firecoperana
904e994bfb Support --device and --device-draft parameter (#866)
* add --device and --device-draft parameter

* don't print debug message in release mode

* fix

* bug fix to throw exception when no device specified

* add const

---------

Co-authored-by: firecoperana <firecoperana>
2025-10-27 18:13:28 +02:00
firecoperana
d7882c3cf8 Tool calls support from mainline (#723)
* Tool calls support from mainline

* update cmake

* revert api for /completions

* Fix broken thinking process for gpt-oss

* add missing args and fix webui bugs

* add missing args and fix webui bugs2

* Fix reasoning format error

* add usage

* change default post_sampling_probs to true

* add back generated_text

* Remove server endpoints tests

* add log

* Chat fixes

* Remove logs

* webui: revert extra handling of thinking process

---------

Co-authored-by: firecoperana <firecoperana>
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-09-01 08:38:49 +03:00
g2mt
06bed7e01b Port universal assisted decoding to llama-server (#699)
* port universal assisted decoding to server

* fix calls

* fix LOG_INFO

* fix llama_detokenize call

* use emplace_back
2025-08-18 09:22:23 +03:00
g2mt
b6bc5eedad Port speculative decoding from upstream to llama-server (#645)
* server : integrate speculative decoding

* server: Fix field names

* server: fix include, whitespace

* fix compile errors in speculative.cpp

* add llama_sampling_sample_and_accept_n to sampling

* finish porting speculative decoding in server

* port functions from common/speculative, common/sampling

* remove arg

* fix function names

* init params_dft to none

* correct value for n_ctx

* prefix kv cache tensors with model name to avoid conflict

* fix call arguments

* fix spec decoding args

* correct slot.id

* use n_max

* port the rest of sampling funcs

* fix func arguments

* slot.id starts at 1?

* Revert "prefix kv cache tensors with model name to avoid conflict"

This reverts commit fbd5dfd8660ced64a05a23fe3d5526ded635eb4b.

* disable draft logging

* disable logging in speculative.cpp

in mainline, these would be LOG_DEBUG, but since ik_llama doesn't support
it, logging is disabled entirely

* add more draft model parameters

* fix

* pass flash_attn

* add speculative params for parity

* set speculative params in launch_slot_with_task instead
2025-08-16 07:26:44 +03:00