Georgi Gerganov
06938ac129
tests : add support for qwen3 SSM archs ( #24031 )
...
* tests : add support for qwen3 SSM archs
* arch : add LLM_KV_ATTENTION_RECURRENT_LAYERS
* cont : naming + TODOs
2026-06-03 10:15:27 +03:00
Mikhail Podvitskii
4fb16eccce
model: add Mellum architecture ( #23966 )
...
* model: support for Mellum architecture
* model: improve mellum.py formatting
* model: improve mellum.py formatting once again
* deps: downgrade transformers to 4.57.6 (to fix CI)
* deps: remove huggingface_hub dependency
* deps: remove huggingface_hub from test requirements
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-06-02 22:11:12 +03:00
ynankani
42928bc14d
model : NvFP4 quantized LM head support ( #23046 )
...
* NvFP4 quantized LM head support
Signed-off-by: ynankani <ynankani@nvidia.com>
* Address review comments
Signed-off-by: ynankani <ynankani@nvidia.com>
* Add assert for NvFp4 lm head and tied embeddings
Signed-off-by: ynankani <ynankani@nvidia.com>
* Address review comments
Signed-off-by: ynankani <ynankani@nvidia.com>
* Create output_s tensor only when LM head NvFp4
Signed-off-by: ynankani <ynankani@nvidia.com>
---------
Signed-off-by: ynankani <ynankani@nvidia.com>
2026-05-16 11:09:27 +02:00
AesSedai
8e52631d55
model: Add Mimo v2.5 model support ( #22493 )
...
* add mimo-v2.5 support
* mimo-v2.5: fix modify_tensors row split
* mimo-v2.5: forgot `add_attn_value_scale` plumbing
* mimo-v2.5: fix tp dequant to detect tp rows
* mimo-v2.5: fix TP iteration to be descending
* mimo-v2.5: fix comment
* mimo-v2.5: retain fused qkv
* mimo-v2.5: missed the attn_value scale during merge
* mimo-v2.5: fused QKV needs contiguous for scaling attention value
* mimo-v2.5: move `speech_embeddings.` to TextModel filter_tensors
* Update src/llama-hparams.h
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update src/models/mimo2.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update src/models/mimo2.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update convert_hf_to_gguf.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update convert_hf_to_gguf.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update src/models/mimo2.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* mimo-v2.5: include MTP weights in gguf
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-05-07 13:21:58 +02:00
Johannes Gäßler
36dafba5c4
llama: fix llama-model-saver ( #20503 )
...
* llama : add fd-based model loading via llama_model_load_from_fd
* llama : address review feedback for fd-based model loading
* llama : use FILE pointer instead of fd in public API
* llama : use FILE pointer consistently, address review feedback
* fixup
* fix tensor names
* fix llama-model-saver
* roundtrip tests
* fixup
* refactor tests
* fix prints
* fix model saving
* fix CI, disable Chameleon
* print seed
---------
Co-authored-by: Siddhesh2377 <siddheshsonar2377@gmail.com>
2026-03-25 12:53:16 +02:00
Xuan-Son Nguyen
59db9a357d
llama: dynamic head_dim and n_rot for SWA ( #20301 )
...
* llama: dynamic head_dim and n_rot for SWA
* also add gguf_writer wrappers
* fix build
* build_rope_shift arg reorder
2026-03-09 22:22:39 +01:00
Johannes Gäßler
a976ff081b
llama: end-to-end tests ( #19802 )
...
* tests: add end-to-end tests per model architecture
* fixup for rebase
* fix use-after-free in llama-model-loader.cpp
* fix CI
* fix WebGPU
* fix CI
* disable CI for macOS-latest-cmake-arm64
* use expert_weights_scale only if != 0.0f
* comments
2026-03-08 12:30:21 +01:00
Ryan Mangeno
c0d0430340
model : full modern bert support ( #18330 )
...
* full modern bert support
* added gelu op in rank pooling for modern bert
* still working on stuff, added mean calculation before classifier head
* Update convert_hf_to_gguf.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* first layer is dense, as per modern bert research paper
* Update src/llama-graph.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* fixed set input for mean pooling to check if pooling type is ranking since modern bert does mean & rank
* Update src/llama-graph.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update convert_hf_to_gguf.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-02-19 08:52:21 +01:00
Georgi Gerganov
d9c6ce46f7
kv-cache : support V-less cache ( #19067 )
...
* kv-cache : support V-less cache
* cuda : better check for V_is_K_view
* cuda : improve V_is_K_view check
* graph : add comments
* hparams : refactor
2026-01-25 15:48:56 +02:00
Tarek Dakhran
73d284a250
model : add LFM2-ColBert-350M ( #18607 )
...
* model : add LFM2-ColBert-350M
* llama_model_n_embd_out() - returns `hparams.n_embd_out` if set and falls back to `hparams.n_embd`
2026-01-05 19:52:56 +01:00
Sigbjørn Skjæret
88fc854b4b
llama : improve sep token handling ( #14272 )
2025-06-20 14:04:09 +02:00
Johannes Gäßler
10d2af0eaa
llama/ggml: add LLM training support ( #10544 )
...
* llama/ggml: add LLM training support
more compact progress bar
llama_save_model_to_file
llama_opt_param_filter
ggml_graph_dup force_grads
refactor ggml_opt, fix test-opt
* remove logits_all
* refactor CUDA implementation for ACC
* reset graph at beginning of opt period
2025-05-12 14:44:49 +02:00