ik_llama.cpp

mirror of https://github.com/ikawrakow/ik_llama.cpp.git synced 2026-06-28 04:30:15 -05:00

Author	SHA1	Message	Date
Justin Martin	40d8cb196a	llama-quantize: enable --extra-output-tensor with COPY (#1871)	2026-05-23 13:52:34 +03:00
Kawrakow	1f8c603d9c	Quantize: add extra output tensor for MTP (#1810) * Quantize: add extra output tensor for MTP * Consistently use --mtp-requantize-output-tensor	2026-05-17 13:59:56 +03:00
Kawrakow	3e573cfea6	MTP: option to use re-quantized output tensor for better TG performance (#1809) * Option to use re-quantized output tensor for MTP * Remove quantize extra output option * Handle interleaved types	2026-05-16 14:40:18 +03:00
Kawrakow	4e1851b01a	imatrix: use data for ffn_up when data for ffn_gate is missing (#1805)	2026-05-15 07:28:34 +03:00
Kawrakow	e5355e9895	Quantization options (#1677)	2026-04-23 09:05:39 +02:00
dmaivel	4f4bcfbe67	Add --defer-experts flag to defer expert mmap residency on Linux (#1634) * Add --defer-experts flag to defer expert mmap residency on Linux * Disable warmup when defer-experts is enabled	2026-04-16 08:54:44 +02:00
Kawrakow	90ec1b80c4	Bonsai support (AVX2, generic) (#1570) * Bonsai support (AVX2, generic) * Fix ARM build --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2026-04-02 16:54:08 +02:00
Kawrakow	8b575c4b1f	Fix re-quantizing a model using row-interleaved quants (#1561)	2026-03-31 15:35:10 +02:00
Kawrakow	9b7db9bc3f	Better --n-cpu-moe (#1464)	2026-03-19 06:57:01 +01:00
Kawrakow	54bcafee16	Allow using -rtr and -muge together (#1444)	2026-03-16 18:26:26 +01:00
Kawrakow	c2b8e95700	Be able to use imatrix computed with merged ffn_gate_up_exps (#1419) * Be able to use imatrix computed with merged ffn_gate_up_exps * Also the other way around	2026-03-13 17:57:56 +01:00
Kawrakow	3208660d20	Be able to quantize mmproj files (#1367)	2026-03-06 07:25:40 +01:00
Kawrakow	87b35dac0c	Faster quantization for MoE models with many experts (#1322)	2026-02-26 06:52:28 +01:00
Nexes the Elder	170467e835	Llama-quantize: Partial requant feature (#1313) * Partial Requant feature for llama-quantize - Inspired by the recently portcopied --dry-run feature. - Allows to partially requantize a split quantized .gguf by requantizing only the missing splits in the destination directory. - Works both for GGUF which are split tensors by tensors, or by group of several tensors (though this one is not very much tested beyond 2 tensors by split). - Vibe coded. * Create output directory if it doesn't exist in llama-quantize * Create output directory if it doesn't exist in gguf-split * Add exit when directory fails to be created on Windows * Use std::filesystem * cleanup	2026-02-25 07:25:15 +01:00
Kawrakow	cfb6747776	llama-quantize: --dry-run option (#1309)	2026-02-24 15:21:52 +01:00
Kawrakow	d81cde5cea	Fix very low bpw missing imatrix check (#1284)	2026-02-19 08:15:26 +01:00
Kawrakow	c03c2d7cc6	Merge ffn_up and ffn_gate experts tensors (#1137) * WIP - not working * WIP - not working * WIP - GPT-OSS working However, extremely stupid. The only way I could correctly repack the up/gate experts is to copy up and gate into host buffers, repack into another host buffer, copy back into the ffn_up_gate_exps tensor. This is going to be very slow for giant 500 GB models. My attempts to do this via a compute graph on the backend holding the tensors was unsuccessful. For GPT-OSS-20B I see ~6-7% better PP when using the original ik_llama.cpp fused_up_gate CUDA implementation, and ~10% when using the small batch size implementation. Other models are not working yet on CUDA as I need to fix the fused mul-unary implementation. * WIP * WIP - Qwen3-MoE (and hopefully all others) working But when I say here and in the previous commit "working", I mean PP is working. TG is still broken. * WIP: TG seems to be working * Minor * Add command line option to merge experts up/gate * Add merge up/gate command line parameter to llama-bench * Turn off merge_up_gate_exps if split mode graph It is not yet implemented * When no bias, allow merging up/gate with tensor overrides * Arghh, we need to increase the context size again * Cleanup	2026-01-12 18:30:53 +02:00
Kawrakow	bf12f502a4	Fix requatizing from row-interleaved quants (#992) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-11-20 11:50:09 +01:00
Kawrakow	e68f50be9a	Allow quantization of ffn_gate_inp (#896) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-11-05 10:44:32 +02:00
Kawrakow	56fc5454ff	Merge Q, K, V (#878) * POC: merge Q, K, V into a single, contiguous tensor Done just for Qwen3-MoE, where I see a 4% uplift in TG. PP performance gain is sub-percent, if any. Still, it seems it makes sense to do it in general given the TG performance gain. * WIP * merge_qkv: it works for gpt-oss ...but we see a smaller TG gain (~1.5%) * WIP * Don't ignore the return value of create_tensors() else, when q, k, v get merged and we are running on the CPU, we get a crash because the backend is trying to use mmap, but that no longer works. * merge_qkv: bias can be required, optional, or mandatory * merge_qkv: glm4.5moe * merge_qkv: add command loine argument to enable * merge_qkv: fix tensor dimensions * merge_qkv: llama-4 * merge_qkv: qwen3 (dense) * merge_qkv: simplify build_qwen3moe * cohere2 - simplify graph building --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-10-30 10:49:48 +02:00
Kawrakow	21a0bfb1c0	Fix PATH_MAX not defined on Windows (#828) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-10-13 09:25:57 +03:00
Kawrakow	4daff01b39	Refactor file llama.cpp (#823) * llama_model and llama_hparams * llama_build_context Surprisingly small reduction in llama.cpp compile time given the reduction in LOCs (22k -> 14k) * LLM_TN llama.cpp compilation: 50 s -> 33 s * llama_quantize * arch names * All graph building is now in llm-build-context.cpp * hparams loading llama.cpp is now just 9300 LOC, but still takes 32 seconds to compile. * We are now at 6 seconds to build the src folder * load -> create We are not actually loading the tensors, but just creating them. --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-10-11 11:35:20 +03:00

22 Commits