22 Commits

Author SHA1 Message Date
Justin Martin
40d8cb196a
llama-quantize: enable --extra-output-tensor with COPY (#1871) 2026-05-23 13:52:34 +03:00
Kawrakow
1f8c603d9c
Quantize: add extra output tensor for MTP (#1810)
* Quantize: add extra output tensor for MTP

* Consistently use --mtp-requantize-output-tensor
2026-05-17 13:59:56 +03:00
Kawrakow
3e573cfea6
MTP: option to use re-quantized output tensor for better TG performance (#1809)
* Option to use re-quantized output tensor for MTP

* Remove quantize extra output option

* Handle interleaved types
2026-05-16 14:40:18 +03:00
Kawrakow
4e1851b01a
imatrix: use data for ffn_up when data for ffn_gate is missing (#1805) 2026-05-15 07:28:34 +03:00
Kawrakow
e5355e9895
Quantization options (#1677) 2026-04-23 09:05:39 +02:00
dmaivel
4f4bcfbe67
Add --defer-experts flag to defer expert mmap residency on Linux (#1634)
* Add --defer-experts flag to defer expert mmap residency on Linux

* Disable warmup when defer-experts is enabled
2026-04-16 08:54:44 +02:00
Kawrakow
90ec1b80c4
Bonsai support (AVX2, generic) (#1570)
* Bonsai support (AVX2, generic)

* Fix ARM build

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2026-04-02 16:54:08 +02:00
Kawrakow
8b575c4b1f
Fix re-quantizing a model using row-interleaved quants (#1561) 2026-03-31 15:35:10 +02:00
Kawrakow
9b7db9bc3f
Better --n-cpu-moe (#1464) 2026-03-19 06:57:01 +01:00
Kawrakow
54bcafee16
Allow using -rtr and -muge together (#1444) 2026-03-16 18:26:26 +01:00
Kawrakow
c2b8e95700
Be able to use imatrix computed with merged ffn_gate_up_exps (#1419)
* Be able to use imatrix computed with merged ffn_gate_up_exps

* Also the other way around
2026-03-13 17:57:56 +01:00
Kawrakow
3208660d20
Be able to quantize mmproj files (#1367) 2026-03-06 07:25:40 +01:00
Kawrakow
87b35dac0c
Faster quantization for MoE models with many experts (#1322) 2026-02-26 06:52:28 +01:00
Nexes the Elder
170467e835
Llama-quantize: Partial requant feature (#1313)
* Partial Requant feature for llama-quantize

- Inspired by the recently portcopied --dry-run feature.
- Allows partially requantizing a split quantized .gguf by requantizing only the splits missing from the destination directory.
- Works for GGUFs split tensor by tensor as well as by groups of several tensors (though the latter is not much tested beyond 2 tensors per split).
- Vibe coded.

* Create output directory if it doesn't exist in llama-quantize

* Create output directory if it doesn't exist in gguf-split

* Add exit when directory fails to be created on Windows

* Use std::filesystem

* cleanup
2026-02-25 07:25:15 +01:00
Kawrakow
cfb6747776
llama-quantize: --dry-run option (#1309) 2026-02-24 15:21:52 +01:00
Kawrakow
d81cde5cea
Fix very low bpw missing imatrix check (#1284) 2026-02-19 08:15:26 +01:00
Kawrakow
c03c2d7cc6
Merge ffn_up and ffn_gate experts tensors (#1137)
* WIP - not working

* WIP - not working

* WIP - GPT-OSS working

However, extremely stupid. The only way I could correctly repack the
up/gate experts is to copy up and gate into host buffers, repack
into another host buffer, copy back into the ffn_up_gate_exps tensor.
This is going to be very slow for giant 500 GB models.

My attempts to do this via a compute graph on the backend holding
the tensors were unsuccessful.

For GPT-OSS-20B I see ~6-7% better PP when using the original
ik_llama.cpp fused_up_gate CUDA implementation, and ~10% when
using the small batch size implementation.

Other models are not working yet on CUDA as I need to fix the
fused mul-unary implementation.

* WIP

* WIP - Qwen3-MoE (and hopefully all others) working

But when I say here and in the previous commit "working",
I mean PP is working. TG is still broken.

* WIP: TG seems to be working

* Minor

* Add command line option to merge experts up/gate

* Add merge up/gate command line parameter to llama-bench

* Turn off merge_up_gate_exps if split mode graph

It is not yet implemented

* When no bias, allow merging up/gate with tensor overrides

* Arghh, we need to increase the context size again

* Cleanup
2026-01-12 18:30:53 +02:00
Kawrakow
bf12f502a4 Fix requantizing from row-interleaved quants (#992)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-20 11:50:09 +01:00
Kawrakow
e68f50be9a Allow quantization of ffn_gate_inp (#896)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-05 10:44:32 +02:00
Kawrakow
56fc5454ff Merge Q, K, V (#878)
* POC: merge Q, K, V into a single, contiguous tensor

Done just for Qwen3-MoE, where I see a 4% uplift in TG.
PP performance gain is sub-percent, if any.
Still, it seems it makes sense to do it in general given
the TG performance gain.

* WIP

* merge_qkv: it works for gpt-oss

...but we see a smaller TG gain (~1.5%)

* WIP

* Don't ignore the return value of create_tensors()

else, when q, k, v get merged and we are running on the CPU,
we get a crash because the backend is trying to use mmap,
but that no longer works.

* merge_qkv: bias can be required, optional, or mandatory

* merge_qkv: glm4.5moe

* merge_qkv: add command line argument to enable

* merge_qkv: fix tensor dimensions

* merge_qkv: llama-4

* merge_qkv: qwen3 (dense)

* merge_qkv: simplify build_qwen3moe

* cohere2 - simplify graph building

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-10-30 10:49:48 +02:00
Kawrakow
21a0bfb1c0 Fix PATH_MAX not defined on Windows (#828)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-10-13 09:25:57 +03:00
Kawrakow
4daff01b39 Refactor file llama.cpp (#823)
* llama_model and llama_hparams

* llama_build_context

Surprisingly small reduction in llama.cpp compile time given
the reduction in LOCs (22k -> 14k)

* LLM_TN

llama.cpp compilation: 50 s -> 33 s

* llama_quantize

* arch names

* All graph building is now in llm-build-context.cpp

* hparams loading

llama.cpp is now just 9300 LOC, but still takes 32 seconds to compile.

* We are now at 6 seconds to build the src folder

* load -> create

We are not actually loading the tensors, but just creating them.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-10-11 11:35:20 +03:00