140 Commits

Author SHA1 Message Date
SamuelOliveirads
ad24046b51 minor refactor in DFlash kv cache graph 2026-06-15 18:22:56 -03:00
SamuelOliveirads
3a1d46c4d1 Merge remote-tracking branch 'origin/main' into feat/dflash-implementation
# Conflicts:
#	common/common.cpp
#	common/speculative.cpp
#	convert_hf_to_gguf.py
#	examples/server/server-context.cpp
#	examples/server/server-context.h
#	src/llama-arch.cpp
#	src/llama-arch.h
#	src/llama-model.cpp
#	src/llama.cpp
2026-06-13 17:27:52 -03:00
Joel Farthing
4a1e2eaa69
model: add Cohere2-MoE North Mini Code support (#1945)
* Add Cohere2 MoE North Mini Code support

* Fix Cohere2 MoE expert tensor emission

* Enhance Cohere2-MoE support by modifying tensor handling and configuration logic

* Fix Cohere2-MoE graph split reduce handling

---------

Co-authored-by: Joel Farthing <262452229+joelfarthing@users.noreply.github.com>
2026-06-10 15:28:27 +02:00
Joel Farthing
bbe1a511ee
model: add Poolside Laguna XS.2 support (#1911)
* llama: register Laguna architecture

* llama: add Laguna graph support

* llama: place Laguna MoE tensors for cpu-moe

* gguf: add Laguna metadata and tokenizer ids

* convert: support Poolside Laguna XS.2

* model: align Laguna RoPE and graph semantics

* model: align Laguna partial offload with review feedback

* model: localize Laguna SWA YaRN defaults

* model: localize Laguna SWA RoPE constants

---------

Co-authored-by: Joel Farthing <262452229+joelfarthing@users.noreply.github.com>
2026-06-08 18:33:12 +02:00
Joel Farthing
dc51c6f9b2
Add Mellum2 architecture support (#1919)
Co-authored-by: Joel Farthing <262452229+joelfarthing@users.noreply.github.com>
2026-06-04 14:28:02 +02:00
SamuelOliveirads
1369e68471 fix graph mask, swa layers and tokens positions 2026-05-31 11:12:03 -03:00
SamuelOliveirads
82cff238fe Initial dflash implementation 2026-05-28 18:57:58 -03:00
Samuel Oliveira Alves
c2b8bca807
Add MTP Support for Gemma 4 (#1744)
* gemma-mtp: build the arch to load the MTP model

* gemma-mtp: fix mtp kv state

* gemma-mtp: refactor some functions and create gguf

* gemma-mtp: make usable for embeddings models variant

* gemma-mtp: fix qwen mtp load in graph split

* gemma-mtp: refactor tensor creation and adjust output tensor handling

* Gemma 4 MTP: improve tensor handling, and adjust split mode logic
2026-05-10 07:44:20 +03:00
saood06
2161ee01cb
Vibe coded script + constants from mainline + pip requirements (#1405) 2026-03-11 15:41:17 +01:00
saood06
8ba7e2b40c
Add support for Seed-OSS (#1218)
* it compiles

* Fix constants.py
2026-02-03 07:39:45 +02:00
Kawrakow
9337229274 Add MXFP4 to gguf-py constants (#1007)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-24 15:43:49 +01:00
Kawrakow
60227f4433 Make gguf-py stuff work with numpy 2.0 (#991)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-20 10:20:55 +01:00
Kawrakow
263be6670b Add support for SmolLM3 (#934)
* Convert from HF

* Model loading and compute graph

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-10 15:40:12 +02:00
firecoperana
e15a215e6b model : Port Minimax M2 from mainline (#907)
Co-authored-by: firecoperana <firecoperana>
2025-11-06 18:09:24 +02:00
Kawrakow
f7adde1043 Adding Ling/Ring (a.k.a., Bailing-MoE2) support (#833)
* Adding Ling/Ring (a.k.a., Bailing-MoE2)

* Add expert group selection (not working, so turned off)

* BailingMoE2 conversion

* WIP

* Bits and pieces

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-10-15 14:20:40 +03:00
firecoperana
079231c291 model : add grok-2 support (#782)
Co-authored-by: firecoperana <firecoperana>
2025-09-23 16:31:01 +02:00
firecoperana
426032c27a Add Ernie 4.5 MOE and 0.3B Support (#759)
* Add Ernie4_5MoeModel

* add ernie 4.5 0.3B model

---------

Co-authored-by: firecoperana <firecoperana>
2025-09-05 11:54:35 +02:00
Thireus ☠
d65d5fe29e Add support for GLM-4.5 models (#668)
* GLM-4.5

* GLM-4.5

* GLM-4.5

* convert_hf_to_gguf.py compatibility bugfix with GLM-4.5

From @ubergarm - https://github.com/ikawrakow/ik_llama.cpp/pull/668#issuecomment-3145913701

* Add ubergarm comments + my own

* Revert to llama.cpp script version that produced good BF16

See: https://github.com/ikawrakow/ik_llama.cpp/pull/668#issuecomment-3147374559

* Support for jinja chat templates

See https://github.com/ikawrakow/ik_llama.cpp/pull/668#issuecomment-3148109962

* GLM-4.5 llama.cpp final port

* Handle TENSOR_SKIP

Ported the hanges from:

f129567dc0
dcbbd2cb05

Except op info since ik_llama.cpp doesn't support this operation.

* Bugfix for TENSOR_SKIP

skip loading if a tensor has the TENSOR_SKIP flag - @ubergarm via https://github.com/ikawrakow/ik_llama.cpp/pull/668#issuecomment-3155297198

* Update llama.cpp

Restore original GGLM_ASSERT

* Fix chat template detection

Changes suggested by @ubergarm - https://github.com/ikawrakow/ik_llama.cpp/pull/668#issuecomment-3155927840

* Revert to original GGML_ASSERT
2025-08-07 07:55:00 +03:00
Kawrakow
e1164e1fd8 Adding IQ1_KT - 1.75 bpw SOTA quants (#616)
* iq1_kt: basics

* iq1_kt: CUDA dequantize

Testing with LlaMA-3.1-8B-Instruct, we get almost the same PPL
as iq2_xxs, so about 0.2 bpw fewer bits for the same quality.

* iq1_kt: CUDA MMQ

* iq1_kt: CUDA MMVQ

* iq1_kt: AVX2 GEMM/GEMV

* iq1_kt: convert/repack to q8_0_r8 (AVX2)

* iq1_kt: slightly faster GEMV

18.6 t/s -> 19.4 t/s

* iq1_kt: NEON GEMM/GEMV

Pathetic as usual

* iq1_kt: slightly faster NEON - still pathetic

* iq1_kt: tiny bit better GEMV on NEON

* iq1_kt: convert/repack to q8_0_r8 (NEON)

* iq1_kt: very slightly faster convert/repack to q8_0_r8 on NEON

* Adding frgotten file

* iq1_kt: add to constants.py

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-07-20 10:05:23 +02:00
Kawrakow
f375799f17 Adding IQ2_KL (#602)
* Experiments for 2.6875 bpw quants

At least according to rmse, this is significantly better than
q2_K, while using only 1/16 more bits per weight.

* iq2_kl: basics

* iq2_kl: CUDA dequantize

* iq2_kl: small improvement in PPL

Also check the two neighbouring values for the block scale
and use the one that minimizes RMSE.

* iq2_kl: MMQ

Quite good: PP-512(L3-8B) = 8472 t/s.

* iq2_kl: MMVQ

We get PP-128(L3-8B) = 162 t/s.
Which means that this is not quite as good as it should be as
(almost) same bpq q2_K is at 170 t/s.

* iq2_kl: Zen4 GEMM/GEMV

Not particularly fast. I may need to think about rearranging the bits.

* iq2_kl: better Zen4

* iq2_kl: convert/repack to q8_k_r8 (AVX2)

* iq2_kl: AVX2 GEMM/GEMV

* iq2_kl: WIP NEON

The compiler started crashing!!!

* iq2_kl: NEON

Had to work around a compiler crash when using vzip2q_u8 using
vqtbl2q_u8.

* iq2_kl: convert/repack to q8_k_r8 (NEON)

* iq2_kl: Metal dequantize

* iq2_kl: Metal GEMV - pretty slow

* iq2_kl: Metal GEMV - slightly better (40 t/s -> 44.5 t/s)

* iq2_kl: Metal GEMV - slightly better (44.5 t/s -> 46.5 t/s)

* iq2_kl: Metal GEMV - slightly better (46.5 t/s -> 47.2 t/s)

* iq2_kl: slightly better Metal dequantize

PP-512 goes to 476 t/s up from 466 t/s.

* iq2_kl: slightly better Metal dequantize

PP-512 goes to 492 t/s up from 476 t/s.

* Add iq2_kl to constants.py

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-07-14 18:55:08 +02:00
Kawrakow
4f56069442 Add iq3_ks to constants.py (#606)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-07-13 19:14:26 +02:00
saood06
02d675717e Support for dots.llm1 models (#573)
* Add llama.cpp changes for dots1 support

* Add python changes for dots1 support

* Fix to make it convert

* Remove V reshaping, remove BOS by default for dots1 and fix warmup to handle models without BOS

* Minor fix

* Remove commented lines
2025-07-10 02:37:36 -05:00
Nexes the Elder
7c5d9aba86 convert_hf_to_gguf.py : conversion from hf weights to Q6_0 (#483)
* Direct conversion from fp16 to Q6_0

* forgotten comma

* More precise infos
2025-06-03 09:30:30 +03:00
Nexes the Elder
63a8b2260e forgotten refs and typo (#478) 2025-05-31 07:36:50 +03:00
Kawrakow
639aee23c5 Add missing gguf-py constants (#458)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-25 09:55:36 +03:00
saood06
a7e5b01540 Fix missing rope_freqs with convert_hf_to_gguf (#402)
* lora : fix llama conversion script with ROPE_FREQS

* convert : refactor rope_freqs generation

This should also fix vocab-only conversion for Phi-3.

* convert : adapt MiniCPM3 to separate rope_freqs insertion

MiniCPM3's tokenizer is treated as a SentencePiece tokenizer to avoid
having to run its custom Python code which mixes tokenization
in the same file as tool calls.

gguf-py : add long and short RoPE factors to tensor mappings

Empty, but the key names are used to populate the mappings.

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
Co-authored-by: Francis Couture-Harpin <git@compilade.net>
2025-05-09 09:17:41 -05:00
saood06
87bfad8437 Support for Llama-3-Nemotron models (#377)
* conflict resolution

* Changes to make work and add longrope support

* Changes to n_attention_wv rule

* Untested support of 253B

* DeciLMCausalModel now reads rope_theta from config.json properly

* Remove errant Granite mentions

* Better n_attention_vw rule

* Update vocab.py

---------

Co-authored-by: Yee Man Chan <ymchan@gmail.com>
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-09 10:09:59 +03:00
Kawrakow
71bc74d738 Add missing enum values for qwen3 and qwen3moe (#356)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-29 10:05:38 +02:00
Ben Harris
8b62ee32ca Apply Qwen3 PR from llama.cpp (#355) 2025-04-29 10:02:08 +02:00
saood06
bf095b682f Update gguf-py constants (#298)
* Update GGMLQuantizationType

* Update LlamaFileType

* Update GGML_QUANT_SIZES
2025-04-24 00:34:10 -05:00
saood06
e6c85a5b95 Add support for bitnet2b_2501 model (#337)
* add support for bitnet2b_2501 model

* Fixes

* Support both model names

---------

Co-authored-by: potassiummmm <zhou.hansong@outlook.com>
2025-04-22 08:34:13 +02:00
Kawrakow
42b0e3921b Add Gemma3 support (text only) (#276)
* WIP Gemma3: not working

* gemma3: build_gemma3 seems to be working now

* Revert changes to convert_hf_to_gguf.py

It wasn't working, so I guess, it is better to leave the
conversion up tp upstream.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-03-22 08:05:10 +01:00
Kawrakow
3e536b95b0 Add optional MLA (#188)
* Deepseek MLA Optimizations

Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>

* Make MLA optional

* Remove some unnecessary copies in the MLA attention

* Deepseek MLA Optimizations V2 (#195)

* Avoid allocating MHA KV cache when MLA is turned on

* Added missing gguf-py file

* Added final optimizations

Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>

* Make sure we do have wk_b and wv_b before enabling MLA

---------

Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

* Use type_k and type_v to set the types of the MLA caches

They were hard-coded at f16.
On my Ryzen-7950X with native bf16 support I get a fairly
significant PP performance boost with bf16 KV-cache:
PP-4096 = 320 t/s up from 292 t/s with fp16 KV-cache.

* Better gemm strategy when nth > nhead

It gives a ~10% PP performance boost for DeepSeek-Lite with 32 threads
(with or without MLA).
Before this commit, when nth > nhead heads were processed
sequentially with all nth threads participating in each
matrix multiplication. Now we ind the gcd of nhead and
nth and split threads into nth/gcd groups, each group
processing nhead/gcd heads.

---------

Co-authored-by: Saood Karim <saood05@gmail.com>
Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-02-09 19:48:44 +02:00
saood06
5c0a01bdaf Deepseek V3 support added (#176)
Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
2025-01-23 18:24:10 +02:00
Kawrakow
1a4cfbcc53 Merge mainline - Aug 12 2024 (#17)
* Merge mainline

* Fix after merge

* Remove CI check

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-08-12 15:14:32 +02:00
Kawrakow
0ceeb11721 Merge mainline llama.cpp (#3)
* Merging mainline - WIP

* Merging mainline - WIP

AVX2 and CUDA appear to work.
CUDA performance seems slightly (~1-2%) lower as it is so often
the case with llama.cpp/ggml after some "improvements" have been made.

* Merging mainline - fix Metal

* Remove check

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-07-27 07:55:01 +02:00
Kawrakow
81576cdcac bitnet: python + llama 2024-06-22 12:02:51 +03:00
Ștefan-Gabriel Muscalu
89d2889200 update: support Qwen2-57B-A14B (#7835)
* update: convert-hf-to-gguf.py to support Qwen2-57B-A14B

* fix: QWEN2MOE support for expert_feed_forward_length

previously, expert ff was taken from n_ff (intermediate size) but it is now properly taken from LLM_KV_EXPERT_FEED_FORWARD_LENGTH

n_ff_exp and n_ff_shared_exp are now properly calculated

* update: convert-hf-to-gguf.py cleanup for Qwen2MoeForCausalLM

* fix: QWEN2MOE support for expert_feed_forward_length

previously, expert ff was taken from n_ff (intermediate size) but it is now properly taken from LLM_KV_EXPERT_FEED_FORWARD_LENGTH

n_ff_exp and n_ff_shexp are now properly calculated
2024-06-17 21:08:46 +02:00
Brian
69a2e7318d gguf-dump.py: add --markdown dump output (#7853)
* gguf-dump.py: add --markdown dump output

* gguf-dump.py: Add toc

* gguf-dump.py: use standard tensor name lookup. Also add tensor ID field

* gguf-dump.py: Add tensor overview count

* gguf-dump.py: fix array preview

* gguf-dump.py: markdownTableWithAlignmentSupport() added

* Add type hints and spacing

Co-authored-by: compilade <git@compilade.net>

* gguf-dump.py: prettyfy dimention

* gguf-dump: right align element count

* gguf-dump.py: element count autosizing

* Apply suggestions from code review

Co-authored-by: compilade <git@compilade.net>

---------

Co-authored-by: compilade <git@compilade.net>
2024-06-17 15:25:20 +10:00
compilade
bad6961237 gguf-py : decouple adding metadata from writing in GGUFWriter (#7827)
Main changes of this PR is to consolidate GGUFWriter.add_key and GGUFWriter.add_val into GGUFWriter.add_key_value. 

In addition use_temp_file is now opt-in instead of opt-out defaulting to False.

Also GGUFWriter now does not require output file name until when actually writing to it.

And GGUFWriter doesn't really need to eagerly prepare the data layout of the metadata
2024-06-09 12:34:29 +10:00
Joan Fontanals
add6ba8d05 llama : add jina v2 base code (#7596)
* feat: add changes to handle jina v2 base code

* fix: do not complicate things

* fix: fix the usage of the code model

* fix: fix comments

* fix: fix linting issues

* fix: remove ollama patches

* style : minor

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-06-06 10:22:41 +03:00
zhangkaihuo
5d9a2f038f llama : MiniCPM support tied embeddings (#7664)
* support lm_head

* remove the code block

---------

Co-authored-by: zhangkaihuo <zhangkaihuo@modelbest.cn>
2024-06-03 10:49:30 +03:00
Galunid
81e37e4303 Move convert.py to examples/convert-legacy-llama.py (#7430)
* Move convert.py to examples/convert-no-torch.py

* Fix CI, scripts, readme files

* convert-no-torch -> convert-legacy-llama

* Move vocab thing to vocab.py

* Fix convert-no-torch -> convert-legacy-llama

* Fix lost convert.py in ci/run.sh

* Fix imports

* Fix gguf not imported correctly

* Fix flake8 complaints

* Fix check-requirements.sh

* Get rid of ADDED_TOKENS_FILE, FAST_TOKENIZER_FILE

* Review fixes
2024-05-30 21:40:00 +10:00
Galunid
771aed5905 gguf-py : Add tokenizer.ggml.pre to gguf-new-metadata.py (#7627) 2024-05-30 02:10:40 +02:00
fairydreaming
e354ad8256 Add support for DeepseekV2ForCausalLM (#7519)
* common : increase max number of experts to 160

* common : add tensors ATTN_Q_A, ATTN_Q_A_NORM, ATTN_Q_B, ATTN_KV_A_MQA, ATTN_KV_A_NORM, ATTN_KV_B needed by DeepSeek-V2 MLA (multi-head latent attention) architecture

* common : add model header parameters: leading_dense_block_count, expert_feed_forward_length, expert_shared_count, expert_weights_scale, attention.q_lora_rank, attention.kv_lora_rank, rope.scaling.yarn_log_multiplier

* convert-hf : add model conversion support for DeepseekV2ForCausalLM

* llama : add model types for DeepSeek-V2 and DeepSeek-V2-Lite models

* llama : add two new llm_build_moe_ffn() arguments: scale_w (whether to scale weights of selected MoE experts) and w_scale (numerical value of the scaling factor)

* llama : add inference support for LLM_ARCH_DEEPSEEK2

---------

Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
2024-05-28 17:07:05 +02:00
compilade
2e044457a1 gguf-py : fix and simplify quantized shape round-trip (#7483)
* gguf-py : fix and simplify quantized shape round-trip

* gguf-py : remove unused import
2024-05-25 11:11:48 +10:00
fairydreaming
0682aaed8d Add support for ArcticForCausalLM (#7020)
* common : increase max number of experts to 128

* common : add tensor LLM_TENSOR_FFN_NORM_EXPS for normalization before MoE that runs in parallel to attention + ffn

* gguf-py : add architecture-specific block mappings that override selected general block mappings

* convert-hf : add model conversion support for ArcticForCausalLM

* convert-hf : use added_tokens_decoder from tokenizer_config.json to redefine tokens from SentencePiece model (only for ArcticForCausalLM)

* llama : add inference support for LLM_ARCH_ARCTIC

---------

Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
2024-05-24 14:31:13 +02:00
Georgi Gerganov
a90628d8a0 ggml : drop support for QK_K=64 (#7473)
* ggml : drop support for QK_K=64

ggml-ci

* opencl : restore QK_K=256 define
2024-05-23 10:00:21 +03:00
liuwei-git
c1a6ad7577 llama : add phi3 128K model support (#7225)
* add phi3 128k support in convert-hf-to-gguf

* add phi3 128k support in cuda

* address build warnings on llama.cpp

* adjust index value in cuda long rope freq factors

* add long rope support in ggml cpu backend

* make freq factors only depend on ctx size

* remove unused rope scaling type 'su' frin gguf converter

* fix flint warnings on convert-hf-to-gguf.py

* set to the short freq factor when context size is small than trained context size

* add one line of comments

* metal : support rope freq_factors

* ggml : update ggml_rope_ext API to support freq. factors

* backends : add dev messages to support rope freq. factors

* minor : style

* tests : update to use new rope API

* backends : fix pragma semicolons

* minor : cleanup

* llama : move rope factors from KV header to tensors

* llama : remove tmp assert

* cuda : fix compile warning

* convert : read/write n_head_kv

* llama : fix uninitialized tensors

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-05-21 23:28:32 +03:00
Georgi Gerganov
60faeefff0 llama : remove Persimmon (#7408)
* llama : remove Persimmon

* requirements : remove
2024-05-21 02:35:28 +10:00