8 Commits

Author SHA1 Message Date
Ruixiang Wang
88a39274ec
spec: add EAGLE3 speculative decoding support (#18039)
* llama : enable layer input extraction

* spec: support eagle3

* eagle3: fix params bug

* eagle3: support Gemma4 eagle3 from RedHatAI

* eagle3: set sync when get features from target

Co-authored-by: tnhnyzc <115956684+tnhnyzc@users.noreply.github.com>

* eagle3 : fix ubatch handling in embd_layer_inp extraction and encoder

Co-authored-by: Doğaç Eldenk <dogacel@gmail.com>

* eagle3: adapt to upstream changes

* eagle3: fix rebase issues and adapt to upstream changes

* eagle3: exclude the eagle3 arch from test-llama-archs

* eagle3: fix editorconfig check failures

* eagle3: fix multi-seq issue in d2t vocab mapping

* cont : minor style / clean-up

* spec : remove `common_speculative_setup_draft_model()`

* llama : clean-up unused API

* eagle3: set d2t vocab mapping in decode graph

* cont : assert layer inputs are configured

* hparams : use n_embd_inp instead of n_embd_target_features

* eagle3: make output.weight optional and inherit from target model when needed

* hparams : generic norm-before-residual param

* llama-ext : consistent names

* cont : fix

* hparams : remove target_hidden_size

* cparams : rename output_layer_inp -> embeddings_layer_inp

* arch : reuse ATTN_NORM_2 instead of adding new hidden norm

* llama : clean-up names

* cont : add assert + comment

* Update conversion/llama.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: tnhnyzc <115956684+tnhnyzc@users.noreply.github.com>
Co-authored-by: Doğaç Eldenk <dogacel@gmail.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-06-12 10:21:06 +03:00
Aman Gupta
04eb4c446d
llama : add Gemma4 MTP (#23398)
2026-06-07 20:50:54 +08:00
Georgi Gerganov
7acb4e8cd2
hparams : refactor hparams.n_layer (#24060)
* hparams : refactor hparams.n_layer

* cont : remove `n_layer_kv()`, use n_layer_all instead

* cont : type consistency

* pi : update SYSTEM.md

* models : fix Step3.5 MTP

* cont : remove duplicate switch cases

* cont : explicitly set `false` to extra layers for `is_swa` and `is_recr`

* cont : fix nextn layer count handling

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-06-05 11:09:36 +03:00
Xuan-Son Nguyen
a731805ced
mtmd, model: allow skip build_vit() (#24077)
* add model

* nits
2026-06-03 17:10:35 +02:00
Georgi Gerganov
06938ac129
tests : add support for qwen3 SSM archs (#24031)
* tests : add support for qwen3 SSM archs

* arch : add LLM_KV_ATTENTION_RECURRENT_LAYERS

* cont : naming + TODOs
2026-06-03 10:15:27 +03:00
ynankani
42928bc14d
model : NvFP4 quantized LM head support (#23046)
* NvFP4 quantized LM head support

Signed-off-by: ynankani <ynankani@nvidia.com>

* Address review comments

Signed-off-by: ynankani <ynankani@nvidia.com>

* Add assert for NvFp4 lm head and tied embeddings

Signed-off-by: ynankani <ynankani@nvidia.com>

* Address review comments

Signed-off-by: ynankani <ynankani@nvidia.com>

* Create output_s tensor only when LM head NvFp4

Signed-off-by: ynankani <ynankani@nvidia.com>

---------

Signed-off-by: ynankani <ynankani@nvidia.com>
2026-05-16 11:09:27 +02:00
ynankani
9f5f0e689c
model : support Gemma4_26B_A4B_NVFP4 (#22804)
* Gemma4_26B_A4B_NvFp4 hf checkpoint convert to gguf format fixes

Signed-off-by: ynankani <ynankani@nvidia.com>

* Apply suggestions from code review

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Address review comments

Signed-off-by: ynankani <ynankani@nvidia.com>

* fix CRLF

Signed-off-by: ynankani <ynankani@nvidia.com>

* Lint error fix

Signed-off-by: ynankani <ynankani@nvidia.com>

---------

Signed-off-by: ynankani <ynankani@nvidia.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-05-08 20:42:09 +02:00
Xuan-Son Nguyen
994118a183
model: move load_hparams and load_tensors to per-model definition (#22004)
* git-friendly migration

* add build_graph

* nits

* exclude old code from build

* wip

* add llm_arch_model_i

* prepare downstream functions

* nits

* nits

* wip

* wip

* add back create_tensor_qkv

* fix files missing include

* enforce one llm_build per arch

* cmake: use glob

* missing model params

* nits

* wip

* wip (2)

* wip (3)

* test-llama-archs is happy

* improve switch case

* move more stuff into llm_arch_model_i

* fix downstream code

* nits

* nits (2)

* fix order

* llama_model_base

* LLAMA_LOAD_LOCALS

* small fix

* fix build errors

* auto

* rm migration script and ifdef
2026-05-04 12:36:59 +02:00