ik_llama.cpp

mirror of https://github.com/ikawrakow/ik_llama.cpp.git synced 2026-06-28 04:30:15 -05:00

Author	SHA1	Message	Date
Joel Farthing	bdf5c081dc	DFlash: enable sliding-window attention for draft models (#2021) * DFlash: bound intra-block draft tokens to the SWA window The SWA mask builder applied the sliding-window distance check only to the cross-context section; the intra-block draft-token loop masked causal-only, so a draft token could attend to earlier block tokens beyond n_swa. Apply the same window bound ((j - block_k) < swa_window) in both the F16 and F32 paths so it matches the cross-context section. Behavior-neutral for dense models: the SWA mask tensor is only allocated when the model has SWA layers (build_dflash.cpp needs_swa_mask gate), so for dense targets the changed block is unreachable. * DFlash: enable sliding-window attention for draft models DFlash drafts can be trained with sliding-window attention for long context, but the runtime ignored it: the draft loader never read the window keys and the converter never emitted them, so SWA-trained drafts always ran full-attention. Enable it end to end and fix the dormant SWA graph path it exposes: - convert_hf_to_gguf.py (DFlashDraftModel): emit attention.sliding_window + an all-layers sliding_window_pattern when the source config sets use_sliding_window. - llama-hparams.cpp (LLM_ARCH_DFLASH_DRAFT): read sliding_window + pattern into n_swa / swa_layers. - build_dflash.cpp + llama-dflash.cpp: the SWA mask path had never run; an all-SWA draft turned the full kq_mask into a dead graph node the scheduler never backs with a buffer, then the input-set wrote it unconditionally (GGML_ASSERT buf!=NULL). Create + set each mask only when a layer uses it; derive mask dims from whichever mask is live. Dense/mixed drafts are byte-identical. Validated on gemma-4-26B-A4B at long context (cross_ctx 8176 > window 2048): no crash, no short-context regression, SWA-on recovers long-context draft acceptance. * DFlash: derive draft SWA pattern from layer_types The converter emitted an all-layers SWA pattern ([True]*n_layers). The z-lab DFlash drafts are sliding-window on every layer except a final full-attention (global) layer, so this ran that global layer as sliding-window and clipped its long-context view. Read layer_types and emit the matching per-layer pattern (sliding_attention -> True), falling back to all-SWA only when layer_types is absent. --------- Co-authored-by: Joel Farthing <262452229+joelfarthing@users.noreply.github.com>	2026-06-25 09:06:54 +02:00
empty-quiver	b47b90d0be	Add Laguna M.1 GGUF support (#2003 )	2026-06-22 16:53:10 +02:00
SamuelOliveirads	0d75eee35a	remove duplicated code and unnecessary refactor	2026-06-14 16:02:02 -03:00
SamuelOliveirads	3a1d46c4d1	Merge remote-tracking branch 'origin/main' into feat/dflash-implementation # Conflicts: # common/common.cpp # common/speculative.cpp # convert_hf_to_gguf.py # examples/server/server-context.cpp # examples/server/server-context.h # src/llama-arch.cpp # src/llama-arch.h # src/llama-model.cpp # src/llama.cpp	2026-06-13 17:27:52 -03:00
Joel Farthing	4a1e2eaa69	model: add Cohere2-MoE North Mini Code support (#1945 ) * Add Cohere2 MoE North Mini Code support * Fix Cohere2 MoE expert tensor emission * Enhance Cohere2-MoE support by modifying tensor handling and configuration logic * Fix Cohere2-MoE graph split reduce handling --------- Co-authored-by: Joel Farthing <262452229+joelfarthing@users.noreply.github.com>	2026-06-10 15:28:27 +02:00
Joel Farthing	bbe1a511ee	model: add Poolside Laguna XS.2 support (#1911 ) * llama: register Laguna architecture * llama: add Laguna graph support * llama: place Laguna MoE tensors for cpu-moe * gguf: add Laguna metadata and tokenizer ids * convert: support Poolside Laguna XS.2 * model: align Laguna RoPE and graph semantics * model: align Laguna partial offload with review feedback * model: localize Laguna SWA YaRN defaults * model: localize Laguna SWA RoPE constants --------- Co-authored-by: Joel Farthing <262452229+joelfarthing@users.noreply.github.com>	2026-06-08 18:33:12 +02:00
Joel Farthing	2f2ca7adb1	convert: support Gemma4UnifiedAssistantForCausalLM (#1925 ) Co-authored-by: Joel Farthing <262452229+joelfarthing@users.noreply.github.com>	2026-06-08 07:43:43 +02:00
Joel Farthing	dc51c6f9b2	Add Mellum2 architecture support (#1919 ) Co-authored-by: Joel Farthing <262452229+joelfarthing@users.noreply.github.com>	2026-06-04 14:28:02 +02:00
SamuelOliveirads	1250f522ed	add qwen, gemma and kimi dflash support	2026-06-01 17:14:25 -03:00
SamuelOliveirads	1369e68471	fix graph mask, swa layers and tokens positions	2026-05-31 11:12:03 -03:00
SamuelOliveirads	82cff238fe	Initial dflash implementation	2026-05-28 18:57:58 -03:00
Samuel Oliveira Alves	c2b8bca807	Add MTP Support for Gemma 4 (#1744 ) * gemma-mtp: build the arch to load the MTP model * gemma-mtp: fix mtp kv state * gemma-mtp: refactor some functions and create gguf * gemma-mtp: make usable for embeddings models variant * gemma-mtp: fix qwen mtp load in graph split * gemma-mtp: refactor tensor creation and adjust output tensor handling * Gemma 4 MTP: improve tensor handling, and adjust split mode logic	2026-05-10 07:44:20 +03:00
saood06	8ba7e2b40c	Add support for Seed-OSS (#1218 ) * it compiles * Fix constants.py	2026-02-03 07:39:45 +02:00
Kawrakow	263be6670b	Add support for SmolLM3 (#934 ) * Convert from HF * Model loading and compute graph --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-11-10 15:40:12 +02:00
firecoperana	e15a215e6b	model : Port Minimax M2 from mainline (#907 ) Co-authored-by: firecoperana <firecoperana>	2025-11-06 18:09:24 +02:00
ubergarm	32540ac619	Ling-1T convert fixup (#837 ) * Conditionally write moe_shared_expert_intermediate_size Ling-1T config.json does not have `moe_shared_expert_intermediate_size`. Ling-flash-2.0a does have it. This small patch just makes the gguf_writer conditionally detect as needed. * Fix Ling-1T missing moe_shared_expert_intermediate_size Thanks CISC for the proper patch to include the needed values!	2025-10-17 07:52:31 +03:00
Kawrakow	f7adde1043	Adding Ling/Ring (a.k.a., Bailing-MoE2) support (#833 ) * Adding Ling/Ring (a.k.a., Bailing-MoE2) * Add expert group selection (not working, so turned off) * BailingMoE2 conversion * WIP * Bits and pieces --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-10-15 14:20:40 +03:00
firecoperana	079231c291	model : add grok-2 support (#782 ) Co-authored-by: firecoperana <firecoperana>	2025-09-23 16:31:01 +02:00
firecoperana	de97c33b40	fix convert error for ernie 4.5 (#774 )	2025-09-11 07:59:24 +02:00
firecoperana	426032c27a	Add Ernie 4.5 MOE and 0.3B Support (#759 ) * Add Ernie4_5MoeModel * add ernie 4.5 0.3B model --------- Co-authored-by: firecoperana <firecoperana>	2025-09-05 11:54:35 +02:00
Thireus ☠	d65d5fe29e	Add support for GLM-4.5 models (#668) * GLM-4.5 * GLM-4.5 * GLM-4.5 * convert_hf_to_gguf.py compatibility bugfix with GLM-4.5 From @ubergarm - https://github.com/ikawrakow/ik_llama.cpp/pull/668#issuecomment-3145913701 * Add ubergarm comments + my own * Revert to llama.cpp script version that produced good BF16 See: https://github.com/ikawrakow/ik_llama.cpp/pull/668#issuecomment-3147374559 * Support for jinja chat templates See https://github.com/ikawrakow/ik_llama.cpp/pull/668#issuecomment-3148109962 * GLM-4.5 llama.cpp final port * Handle TENSOR_SKIP Ported the changes from: `f129567dc0` `dcbbd2cb05` Except op info since ik_llama.cpp doesn't support this operation. * Bugfix for TENSOR_SKIP skip loading if a tensor has the TENSOR_SKIP flag - @ubergarm via https://github.com/ikawrakow/ik_llama.cpp/pull/668#issuecomment-3155297198 * Update llama.cpp Restore original GGML_ASSERT * Fix chat template detection Changes suggested by @ubergarm - https://github.com/ikawrakow/ik_llama.cpp/pull/668#issuecomment-3155927840 * Revert to original GGML_ASSERT	2025-08-07 07:55:00 +03:00
ubergarm	5e357db589	Fixup kimi-k2 convert indentation (#617 )	2025-07-16 15:24:20 +02:00
ubergarm	d3ed217798	kimi-k2 convert script and chat template (#612 ) * convert_hf_to_gguf for Kimi-K2-Instruct Adapt mainline `PR14653` for tokenizer while maintaining proper MLA tensors. Tested with this workflow using deepseek fp8_cast_bf16.py and triton-cpu to upcast the fp8 safetensors to bf16 safetensors then used this convert_hf_to_gguf. * Add Kimi-K2 chat template moonshotai/Kimi-K2-Instruct https://github.com/ikawrakow/ik_llama.cpp/pull/609#issuecomment-3071259454 * kimi-k2 add ass to template to get response	2025-07-15 19:54:04 +02:00
saood06	02d675717e	Support for dots.llm1 models (#573 ) * Add llama.cpp changes for dots1 support * Add python changes for dots1 support * Fix to make it convert * Remove V reshaping, remove BOS by default for dots1 and fix warmup to handle models without BOS * Minor fix * Remove commented lines	2025-07-10 02:37:36 -05:00
Fizz~	27ff5bf57e	Special handling of Seed Coder FIM tokens (#585 ) * Special handling of Seed Coder FIM tokens * vocab: Add Seed Coder pretokenizer * Formatting fix * Update llama.h	2025-07-06 12:13:55 +02:00
Nexes the Elder	7c5d9aba86	convert_hf_to_gguf.py : conversion from hf weights to Q6_0 (#483 ) * Direct conversion from fp16 to Q6_0 * forgotten comma * More precise infos	2025-06-03 09:30:30 +03:00
Nexes the Elder	86170b2048	Legacy quants conversion schemes in convert_hf_to_gguf.py (#449 ) * Legacy quants conversion schemes in convert_hf_to_gguf.py This, notably in order to make smaller conversions to generate an iMatrix file. `Q4_0`,`Q4_1` are here using embeddings, output, attn_k and attn_v in q5_0. `Q5_0`,`Q5_1` are here using embeddings, output, attn_k and attn_v in q8_0. Adapted from the following llama.cpp mainline PR : https://github.com/ggml-org/llama.cpp/pull/9022 Original author @chentyjpm Also, 2 forgotten mentions of FTYPE IQ3_KL in llama.cpp file. * forgotten IQ5_KS case mention	2025-05-24 11:49:10 +03:00
saood06	a7e5b01540	Fix missing rope_freqs with convert_hf_to_gguf (#402 ) * lora : fix llama conversion script with ROPE_FREQS * convert : refactor rope_freqs generation This should also fix vocab-only conversion for Phi-3. * convert : adapt MiniCPM3 to separate rope_freqs insertion MiniCPM3's tokenizer is treated as a SentencePiece tokenizer to avoid having to run its custom Python code which mixes tokenization in the same file as tool calls. gguf-py : add long and short RoPE factors to tensor mappings Empty, but the key names are used to populate the mappings. --------- Co-authored-by: Xuan Son Nguyen <son@huggingface.co> Co-authored-by: Francis Couture-Harpin <git@compilade.net>	2025-05-09 09:17:41 -05:00
saood06	87bfad8437	Support for Llama-3-Nemotron models (#377) * conflict resolution * Changes to make work and add longrope support * Changes to n_attention_wv rule * Untested support of 253B * DeciLMCausalModel now reads rope_theta from config.json properly * Remove errant Granite mentions * Better n_attention_wv rule * Update vocab.py --------- Co-authored-by: Yee Man Chan <ymchan@gmail.com> Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-05-09 10:09:59 +03:00
Ben Harris	8b62ee32ca	Apply Qwen3 PR from llama.cpp (#355 )	2025-04-29 10:02:08 +02:00
saood06	e6c85a5b95	Add support for bitnet2b_2501 model (#337 ) * add support for bitnet2b_2501 model * Fixes * Support both model names --------- Co-authored-by: potassiummmm <zhou.hansong@outlook.com>	2025-04-22 08:34:13 +02:00
Kawrakow	3e536b95b0	Add optional MLA (#188) * Deepseek MLA Optimizations Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com> * Make MLA optional * Remove some unnecessary copies in the MLA attention * Deepseek MLA Optimizations V2 (#195) * Avoid allocating MHA KV cache when MLA is turned on * Added missing gguf-py file * Added final optimizations Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com> * Make sure we do have wk_b and wv_b before enabling MLA --------- Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com> Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com> * Use type_k and type_v to set the types of the MLA caches They were hard-coded at f16. On my Ryzen-7950X with native bf16 support I get a fairly significant PP performance boost with bf16 KV-cache: PP-4096 = 320 t/s up from 292 t/s with fp16 KV-cache. * Better gemm strategy when nth > nhead It gives a ~10% PP performance boost for DeepSeek-Lite with 32 threads (with or without MLA). Before this commit, when nth > nhead, heads were processed sequentially with all nth threads participating in each matrix multiplication. Now we find the gcd of nhead and nth and split threads into nth/gcd groups, each group processing nhead/gcd heads. --------- Co-authored-by: Saood Karim <saood05@gmail.com> Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com> Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-02-09 19:48:44 +02:00
saood06	5c0a01bdaf	Deepseek V3 support added (#176 ) Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>	2025-01-23 18:24:10 +02:00
Kawrakow	1a4cfbcc53	Merge mainline - Aug 12 2024 (#17 ) * Merge mainline * Fix after merge * Remove CI check --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-08-12 15:14:32 +02:00
Kawrakow	0ceeb11721	Merge mainline llama.cpp (#3 ) * Merging mainline - WIP * Merging mainline - WIP AVX2 and CUDA appear to work. CUDA performance seems slightly (~1-2%) lower as it is so often the case with llama.cpp/ggml after some "improvements" have been made. * Merging mainline - fix Metal * Remove check --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-07-27 07:55:01 +02:00

35 Commits