126 Commits

Author SHA1 Message Date
Kawrakow
d5507e33ae
Split mode graph for dense Gemma4 assistant (#2022)
* WIP: Split mode graph for Gemma4 assistant

Something is not right - acceptance drops to nearly zero.

* Per model CUDA contexts

Still not working!?

* This works

The issue was that I was not correctly calculating the number
of KV heads for the split KV cache.

* Compiler warnings

* It is better to use llama_context pointers as keys
2026-06-24 18:29:32 +02:00
Nexes the Elder
9eaf86a7c7
Fix minor CUDA discrepancies (#2005)
* CUDA : typo

* CUDA: Add missing GGML_CALL to function definition

* CUDA: only log GGML_CUDA_FORCE_MMQ/CUBLAS when enabled

* CUDA: Fix softcap bug in flash_attn_tile_ext_f16

The else branch (softcap != 0) incorrectly called launch_fattn_tile_f16_64_128
with use_softcap=false instead of true, causing logit softcap to be silently
ignored for the col_per_block=32, parallel_blocks=1 path.
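
To make the effect of the wrong flag concrete, here is a minimal sketch. It assumes the common softcap form `cap * tanh(x / cap)`; `apply_softcap` is an illustrative stand-in and not the CUDA kernel's code.

```cpp
#include <cassert>
#include <cmath>

// Illustrative only: models the effect of the flag, not the actual CUDA kernel.
// Assumes the common softcap form cap * tanh(x / cap).
inline float apply_softcap(float logit, float softcap, bool use_softcap) {
    assert(softcap >= 0.0f);
    if (!use_softcap || softcap == 0.0f) {
        return logit; // cap skipped: the logit passes through unbounded
    }
    return softcap * std::tanh(logit / softcap);
}
```

With `use_softcap=false` a large logit is returned unchanged, which is the silent failure described above; with `true` its magnitude stays within the cap.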
2026-06-23 09:37:48 +02:00
magikRUKKOLA
72440a19fc
on-demand tensor reload (#1989)
* host-swap tensor loop

the host-swap functionality is only triggered when certain env. variables are declared

* target_include_directories tweak

* hot-swap tensor support

two intrusions:
1.) at the model loading to collect the snapshot
2.) the modification of the `/health` HTTP endpoint to be able to trigger the hot-swap via sending the `llama-server` the HTTP-request.
*both are guarded by specific env. variables

* hot-swap tensor support; graph invalidation

ggml_backend_cuda_invalidate_graphs export

* hot-swap tensor support

graph invalidation implementation;  extended debug output (commented out)

* llama_reload_changed_tensors export

* tensor hot-swap on-demand reload

cpu-only/hybrid/gpu-only with split mode layer/graph full support implementation

* docs

* reuse the gguf parsing from llama.cpp

gguf_init_from_file, gguf_find_tensor, ggml_get_tensor

* remove the manual scheduling for hybrid inference

* update docs

* tensor shape validation

* update docs

* update docs

accidentally wiped the previous changes;  so recovered them

* revert the GGML_CUDA_MAX_DEVICES to 16

* update llama_reload_changed_tensor

update llama_reload_changed_tensor, revert CMakeLists.txt

* update llama_reload_changed_tensor

* GGML_MAX_SRC

GGML_MAX_SRC compile-time definition support

* GGML_MAX_SRC

GGML_MAX_SRC compile-time definition support

* GGML_MAX_SRC

GGML_MAX_SRC compile-time definition support

* llama_reload_changed_tensor

update llama_reload_changed_tensor definition

* refactor

move the tensor-reloading implementation to llama-reload.cpp, llama-reload-info.h;  some bugfixes and code reduction

* revert

added back the missing newline

* update docs

* reload_info constructor

* bugfix: cpu-only

TODO: improve the working environment by compiling for multiple hardware configurations;  possibly make a test pipeline

* cpu-only bugfix

set the fix again after unsuccessful sync with main

* windows os compilation fix

#include <string>

* fix windows os build

error C2039: 'string': is not a member of 'std'

* remove dead file

* implement perplexity in server

* Revert "implement perplexity in server"
2026-06-22 16:36:34 +02:00
Kawrakow
b21653a56f
Fully remove any BLAS remnants (#2001)
* Fully remove any BLAS remnants

* Also these
2026-06-19 17:26:09 +02:00
Kawrakow
71af16a6b7
Fix DFlash performance with split mode graph (#1980) 2026-06-17 18:40:02 +02:00
David Young
aefb8bdd99
MLA TP -khad: ggml_dequant_hadamard fused op + wv_b/wk_b_pp Hadamard fold (#1852)
* ggml: ggml_dequant_hadamard fused op for MLA -khad path

Adds a new ggml op that fuses (ggml_cast -> F32) + (ggml_hadamard) into a
single kernel. Reads a quantized (or F16/F32) source and produces a per-
Hadamard-block F32 chunk with the inverse transform applied, without
materializing a full-size F32 intermediate buffer.

Motivation: the MLA pp_opt path in build_deepseek2.cpp un-encodes the
H-applied cache_nope view at every PP call. Today that runs as a cast
(quant -> F32) followed by a separate ggml_hadamard kernel, costing two
full-size F32 passes per layer per rank per call. Fusing them halves
the bandwidth on the un-encode and removes one kernel launch.

CUDA kernels in dequant_hadamard.cu lift the Walsh-Hadamard butterfly
from hadamard.cu and dequant helpers from dequantize.cuh:

  * qr=1 layout (q8_0): consecutive dequant pair, stage 1 fused with load
  * qr=2 layout (q4_0 / q4_1 / q5_0 / q5_1 / q6_0 / iq4_nl): dequant pair
    at stride qk/2, explicit stage 1 after sync
  * F16 has a dedicated kernel
  * F32 source falls back to the standalone Hadamard op

CPU impl in iqk_cpu_ops.cpp composes the existing type_traits.to_float
dequant with fast_ht for graph completeness. nh in {64, 128, 256, 512}.

* MLA-TP: Hadamard pretransform of wv_b/wk_b_pp for -khad

Fold the 64-block orthonormal Hadamard into wv_b and wk_b_pp once at
context init so the pp_opt mul_mats consume the K cache in its on-disk
encoded basis. The per-PP-call cache_nope un-Hadamard is then skipped
(rope half still un-applied — it goes to FA via concat, no wk_b multiply).

Math is identity by H^T H = I: mul_mat(H@wv_b, H@cache) = wv_b^T @ cache.
For mla=2/3 absorb, composes correctly with the existing post-FA
ggml_hadamard(kqv_compressed, 64).
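
The identity can be checked numerically. The sketch below is a generic orthonormal Walsh-Hadamard transform over a whole vector, not the 64-block ggml kernel; `hadamard` and `dot` are hypothetical helper names. It shows that transforming both operands leaves their dot product unchanged, and that applying the transform twice returns the input.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// In-place orthonormal Walsh-Hadamard transform; x.size() must be a power of two.
// With the 1/sqrt(n) scaling, H is symmetric and H * H = I.
inline void hadamard(std::vector<float> & x) {
    const std::size_t n = x.size();
    assert(n > 0 && (n & (n - 1)) == 0);
    for (std::size_t h = 1; h < n; h *= 2) {
        for (std::size_t i = 0; i < n; i += 2 * h) {
            for (std::size_t j = i; j < i + h; ++j) {
                const float a = x[j];
                const float b = x[j + h];
                x[j]     = a + b; // butterfly: sum
                x[j + h] = a - b; // butterfly: difference
            }
        }
    }
    const float scale = 1.0f / std::sqrt(static_cast<float>(n));
    for (float & v : x) v *= scale;
}

// Plain dot product, standing in for one row/column pair of a mul_mat.
inline float dot(const std::vector<float> & a, const std::vector<float> & b) {
    assert(a.size() == b.size());
    float s = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}
```

Because `dot(H w, H c)` equals `dot(w, c)` up to rounding, a weight transformed once at load time can consume the encoded cache directly.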

All-or-nothing across layers under a castable type-allowlist (excludes
1-3 bpw IQ types whose requant blows up beyond PPL noise). Models with
ineligible weights fall back to the runtime un-Hadamard path unchanged.

Composes with the fused ggml_dequant_hadamard op (prior commit): with the
fold active only the rope half still runs the runtime transform, via the
fused kernel.

* MLA-TP: fix TG with -khad after wv_b/wk_b_pp fold

The absorb branch of build_deepseek2_tp_attention applies
ggml_hadamard to kqv_compressed after FA, then multiplies by
wv_b. Pre-fold this was needed because wv_b was un-encoded; with
the wv_b fold (prior commit) the mul_mat already expects
H-encoded kqv_compressed:

  mul_mat(H @ wv_b, kqv_encoded) = wv_b^T @ H @ H @ kqv_unencoded
                                 = wv_b^T @ kqv_unencoded   (H @ H = I)

Skip the post-FA hadamard when model.khad_pretransformed is set
so the two H applications cancel instead of double-applying.

Affects the absorb branch: TG (n_tokens=1), short-context PP
(n_kv < 1024), and models without wk_b_pp. Long-context PP goes
through the pp_opt branch and is unrelated/unchanged.

Reported by @ikawrakow on PR 1852. Verified across mla={1,2,3} x
khad={on,off} x -ctk={q8_0,q4_0} on GLM-4.7-Flash IQ5_K and the
unsloth IQ4_XS variant ik used to reproduce.

* ggml_hadamard: accept F16 and quant sources; drop GGML_OP_DEQUANT_HADAMARD

Per @ikawrakow review on PR 1852: subsume the per-source-type dispatch
into the existing GGML_OP_HADAMARD instead of carrying a separate enum
entry, op constructor, and standalone files.

ggml_hadamard's API is unchanged from the call-site perspective. The
constructor's F32-only assertion is dropped; ggml_cuda_op_hadamard and
iqk_hadamard now dispatch internally:

  - F32 source: existing F32 butterfly (unchanged)
  - F16 source: dedicated kernel
  - q8_0 / q4_0 / q4_1 / q5_0 / q5_1 / q6_0 / iq4_nl: fused dequant +
    butterfly kernel (lifted from the deleted dequant_hadamard.cu)
  - CPU side composes traits.to_float with fast_ht

Net diff: -80 lines. Removes dequant_hadamard.{cu,cuh}, the enum entry,
op table rows, ggml_dequant_hadamard constructor, dispatch cases, and
the DEQUANT_HADAMARD supports_op block.

Verified clean build + TG smoke (mla=3 +khad q8 on GLM-4.7-Flash-IQ4_XS,
same coherent output as prior commit on feat/dequant-hadamard).
2026-05-21 07:29:15 +03:00
David Young
c07a052315
MLA tensor parallelism under -sm graph (DEEPSEEK2/GLM_DSA/MISTRAL4) (#1821)
* MLA tensor parallelism under -sm graph (DEEPSEEK2/GLM_DSA/MISTRAL4)

Extends -sm graph (split-mode graph) to MLA-style attention across the
DEEPSEEK2, GLM_DSA, and MISTRAL4 architectures. Previously these archs
fell back to -sm layer regardless of the user's flag.

Implementation:
- Per-rank attention build in build_deepseek2_tp_attention with
  view-sliced FlashAttention, split-buffer output projection, and
  ggml_reduce across devices
- wk_b / wv_b absorbed weights replicated per device via materialize()
  in llm_prepare_mla (these can't live in a split buffer)
- KV cache replication path (replicated_k_l) for graph-mode TP
- distribute_mla_tensors_for_split_mode_graph routes attention/norm
  tensors into ctx_split; expert tensors stay per-layer
- Implements ggml_backend_cuda_split_buffer_get_tensor for the
  replicated / row-split / col-split inverse paths
- Early-reject guard in src/llama.cpp that auto-downgrades -sm graph
  to -sm layer (with a warning) when incompatible loader flags are set:
  -ncmoe, -cmoe, -ot, -rtr, -muge

New CLI flag:
- -gap | --graph-attn-precision <f16|f32>  (default f16)

See the PR description for the full validation matrix (3 archs x 2/4/8
GPU counts), perf numbers, VRAM accounting, and known limitations.

* Some tweaks

* materialize lambda: per-head split for graph-mode tp_replicate

7dd19e19 changed wk_b/wv_b distribution from mirror to per-head split
(split_dim=2) via prepare_split_tensors. That path only fires when
wk_b/wv_b are loaded from GGUF.

Models that store only wkv_b in GGUF derive wk_b/wv_b at load via
llm_prepare_mla, going through the materialize lambda, which was
untouched and still produced mirror replicas (split_dim=-1, full n_head
per device).

build_deepseek2_tp_attention now does mul_mat(wk_b_local, q_nope_perm)
without the prior view_3d slice, so a mirror replica passes an n_head
tensor where the kernel expects n_head_local. Result: silent SIGSEGV
right after model load.

Mirror logic in materialize is replaced with the same per-head split as
prepare_split_tensors: head_offsets derived from wo split, each rank
gets a tensor with ne[2]=n_head_local, data copied from the appropriate
source byte slice. Singular `computed` tensor keeps full metadata for
tensors_by_name lookups.

Tested: 8x3090, -sm graph -mla 3 -fa on now boots cleanly and
sweep-benches without crash. Log confirms new path: "Computed
blk.X.attn_k_b.weight ... split across N devices on dim=2".

* cleanup: indent fix + remove dead view_3d slicing and debug printf

- build_deepseek2.cpp: re-indent the self_attention block in
  build_deepseek2_layer_attention (lines 253-670). Block was at column 0
  inside a function body; now at the expected 4/8-space indent.
- build_deepseek2.cpp: drop the commented-out view_3d slicing and debug
  printfs left over after 7dd19e19's switch to direct mul_mat on
  per-rank wk_b_local / wv_b_local. Update the stale 'wk_b is
  replicated (split_dim=-1)' comment to match the new split_dim=2
  reality.
- ggml-cuda.cu: remove the leftover debug printf in
  ggml_backend_cuda_split_buffer_get_tensor.

No behavior change. Verified with a clean rebuild and DSV2.5 +
GLM-4.7-Flash sweep-bench runs.

* llm_load_tensors: gate incompatible-flag warning to MLA archs

The -ncmoe / -rtr / -muge / -ot warning under -sm graph currently fires
for all archs that support graph mode. That's an over-reach: the
incompatibility is specific to the MLA TP paths (DEEPSEEK2, GLM_DSA,
MISTRAL4) — Gemma4 graph mode existed pre-PR and works with those flags.
Gate the warning to MLA archs only.

Also refreshes two stale comments left over from the wk_b/wv_b
mirror -> per-head-split rewrite:
- src/llama.cpp llm_prepare_mla: "Replicate wk_b/wv_b ..." now reads
  "Per-head split wk_b/wv_b ..." to match what the materialize lambda
  actually does post-823a39e2.
- src/llama-load-tensors.cpp distribute_mla_tensors_for_split_mode_graph:
  drop the wkv_b row-split mention (wkv_b is no longer created under
  graph mode after 7dd19e19) and correct the wk_b/wv_b distribution
  description (per-head split, not per-device replicated).

---------

Co-authored-by: Kawrakow <iwankawrakow@gmail.com>
2026-05-19 08:36:17 +03:00
Kawrakow
397150caa2
MTP: faster recurrent state restore (#1791)
* MTP: store ready per step convolution states

* Cleanup
2026-05-13 11:00:24 +03:00
Kawrakow
86b5d076c5
Gemma4 MTP: avoid casting KV cache to f32 (#1786) 2026-05-13 09:11:27 +03:00
Kawrakow
cec1a6c1f5
MTP: Reuse graphs (again) (#1780) 2026-05-12 07:36:12 +03:00
Kawrakow
eb570eb966
MTP: Avoid per step SSM copy (#1778)
* Avoid copying the per-step SSM state (CUDA)

* Avoid copying the per-step SSM state (CPU)

* Allocate only what is necessary for per-step SSM state

* Cleanup
2026-05-11 18:15:55 +03:00
Kawrakow
3557b446f8
Avoid recurrent state copy (#1777) 2026-05-11 13:13:59 +03:00
Samuel Oliveira Alves
ea94afe777
Speculative checkpoints for recurrent models (#1669)
* server: spec checkpoints for recurrent models

* fix: save/restore sampler state during speculative checkpoint

When speculative decoding rejects draft tokens and restores the
recurrent state checkpoint, the sampler (RNG, grammar, prev tokens)
must also be restored to maintain consistency. Without this, the
sampler state reflects the rejected draft tokens, leading to
potential divergence.

Uses common_sampler_clone() to snapshot the sampler before the
speculative batch decode, and restores it on rejection.
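
The snapshot-and-restore pattern can be sketched as follows. `ToySampler` and `speculate` are hypothetical stand-ins for illustration; the real sampler state and `common_sampler_clone()` are more involved, and a plain struct copy plays the role of the clone here.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for sampler state (RNG + previously accepted tokens).
struct ToySampler {
    std::uint32_t    rng_state = 1;
    std::vector<int> prev_tokens;

    // Consuming a token advances the RNG (xorshift32 step) and the history.
    void accept(int token) {
        assert(rng_state != 0);
        rng_state ^= rng_state << 13;
        rng_state ^= rng_state >> 17;
        rng_state ^= rng_state << 5;
        prev_tokens.push_back(token);
    }
};

// Feeds draft tokens to the sampler; if the draft is rejected, rolls back
// to the snapshot taken before the speculative batch.
inline void speculate(ToySampler & sampler, const std::vector<int> & draft, bool accepted) {
    const ToySampler snapshot = sampler; // copy plays the role of a clone
    for (int t : draft) sampler.accept(t);
    if (!accepted) sampler = snapshot;
}
```

Without the rollback, the RNG and token history would still reflect the rejected draft tokens.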

* server: snapshot recurrent state in tensor

* reset ngram mod state for rejected tokens

* server: refactor checkpoint state logic

* speculative: fix sampler for checkpoints

* recurrent model: implement recurrent kernel checkpoint

* recurrent model: refactor api

* spec: free rbudget before overwriting
2026-04-24 09:59:30 +02:00
Kawrakow
e5355e9895
Quantization options (#1677) 2026-04-23 09:05:39 +02:00
Kawrakow
55d3c05bf7
Fused fused_rms_norm + fused_rms_norm + add (#1627)
* Fuse fused_rms + fused_rms + add

* Dedicated fused_rms_norm + fused_rms_norm + add op

* Cleanup
2026-04-13 13:24:39 +02:00
Kawrakow
90ec1b80c4
Bonsai support (AVX2, generic) (#1570)
* Bonsai support (AVX2, generic)

* Fix ARM build

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2026-04-02 16:54:08 +02:00
Kawrakow
b56a3c2dc9
Better barrier (#1456) 2026-03-19 07:33:00 +01:00
Kawrakow
fd16a418de
Fix clang warnings on macOS (#1354)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2026-03-03 16:27:16 +01:00
Nexes the Elder
d4ac5f1566
gguf-split: fix the split output files naming (#1336)
* Fix gguf-split.cpp splits output naming

With this fix, the initial extension of the source .gguf file is not included in the naming of the output file before the numeration of the splits.

ex:

No more model.gguf-00001-of-00200.gguf
Instead, model-00001-of-00200.gguf
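
A minimal sketch of the naming rule, assuming the five-digit `-NNNNN-of-NNNNN` pattern shown in the example; `split_name` is a hypothetical helper, not the function in gguf-split.cpp.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical helper: builds "<base>-00001-of-00200.gguf", dropping a trailing
// ".gguf" from the input name first so the extension is not repeated.
inline std::string split_name(const std::string & input, int index, int count) {
    assert(index >= 1 && index <= count);
    const std::string ext = ".gguf";
    std::string base = input;
    if (base.size() >= ext.size() &&
        base.compare(base.size() - ext.size(), ext.size(), ext) == 0) {
        base.erase(base.size() - ext.size());
    }
    char suffix[32];
    std::snprintf(suffix, sizeof(suffix), "-%05d-of-%05d", index, count);
    return base + suffix + ext;
}
```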

* increase ggml_max_context to 2048

* Revert GGML_MAX_CONTEXTS to 64
2026-03-02 08:43:47 +01:00
Kawrakow
c77ec4b8b8
Fused delta-net (#1315)
* Revive fused delta-net

* Add command line argument for fused delta net

* Simplify/improve CUDA delta-net

* Add -fdn to llama-bench

* More CUDA fused delta net optimizations

* CPU optimizations

* Much faster fused delta-net on the CPU

It seems it is faster than the chunked implementation!

* Change meaning of fdn from bool flag to threshold value

* Use eps = 1e-6

* Give some nodes a name
2026-02-25 14:12:48 +01:00
Kawrakow
7065488135
Slightly better graph parallel for Qwen3-Next (#1307)
* Make sure we pick the reduced tensor from the right GPU

* Minor
2026-02-24 15:22:30 +01:00
Samuel Oliveira Alves
51df09be8a
Feat - add kimi 2.5 Vision (#1280)
* port kimi 25-vision  from upstream

* feat(clip): add support for Kimi K2.5 vision model
2026-02-19 08:15:03 +01:00
Kawrakow
e30198a553
WIP: Qwen3Next (#1266)
* qwen3next: add architecture support and recurrent-state fixes

* qwen3next: optimize broadcast sub and single-seq ssm conv

* cuda: build MoE row mapping on device in mul_mat_id

* cuda: add guarded multi-seq fast path for ssm_conv

* docs: update qwen3next perf report for cuda MoE/SSM tuning

* cuda: reduce qwen3next moe/ssm sync overhead and refresh eval

* qwen3next: split cpu/cuda eval builds and tune PP scheduling

* qwen3next: harden seq-state flow and support optional dense FFN layers

* qwen3next: trim delta-net graph overhead in chunking path

* qwen3next: remove redundant v_conv cont in delta path

* qwen3next: avoid extra cont on linear attention output

* qwen3next: drop redundant cont before recurrent state flatten

* qwen3next: keep recurrent state in 4d layout through delta path

* qwen3next: add fused delta-net op and wire model path

* tests: add backend-op coverage for ggml_delta_net

* qwen3next: add runtime switch for fused delta-net path

* docs: refresh qwen3next perf review and benchmark matrix

* qwen3next: default fused delta-net off and document quality checks

* qwen3next: add decode-only fused delta mode

* qwen3next: make fused delta safe by default and fix fused tensor layout

* qwen3next: warn when forcing fused decode mode

* qwen3next: add fused-delta regression runner script

* qwen3next: integrate fused regression into eval harness

* qwen3next: clean up chunked delta-net shape handling

* qwen3next: add absolute sanity guards to fused regression

* qwen3next: add unified regression runner script

* qwen3next: disable flash-attn for cpu-only contexts

* docs: reconcile qwen3next status and remaining upstream gaps

* common: add qwen3next fused-delta runtime flag

* cuda: add qwen3next delta-net kernel dispatch override

* docs: update qwen3next quality and serving baseline findings

* qwen3next: keep fused delta on safe path and remove PR artifacts

* qwen3next: align autoregressive delta-net decode layout

* Revert "qwen3next: align autoregressive delta-net decode layout"

This reverts commit 9241164a5ea9e032a2456fbf2dd0bf798b264fd7.

* cuda: port solve-tri fast-paths for qwen3next delta-net

* qwen3next: add fused-delta runtime flag and drop env toggle

* qwen3next: make fused delta single-flag and default on

* Account for GPU arch differences

* Revert "cuda: build MoE row mapping on device in mul_mat_id"

This reverts commit 89e9ecfa840b04e88699ab3803eb732cd78727f9.

* qwen3next: drop non-essential MoE scheduling and split heuristics

* qwen3next: avoid generic ggml_sub broadcast changes

* llama: restore only_active_experts log message

* Remove unnecessary hacks, disable fusion for now.

* qwen3next: port hybrid recurrent state memory semantics

* qwen3next: clean up recurrent state slot plumbing

* qwen3next: fix hybrid V-cache layout plumbing

* qwen3next: guard recurrent state slots against kv capacity

* qwen3next: persist recurrent state in session data

- serialize/restore qwen3next cache.s_l in state/session paths
- bump session and sequence-state file versions for format change
- fallback to single-token chunking for mixed repeated seq_id batches

* qwen3next: drop unused fused-delta builder path

- remove dead build_delta_net_fused lambda
- remove unused llm_build_context::fused_delta member

* qwen3next: remove unused fused-delta CLI/context plumbing

- drop -fd/-no-fd options and related YAML dump field
- remove fused_delta fields from public/internal context params
- remove fused_delta assignment and logging in context init

* ggml: remove unused DELTA_NET operator stack

* Missing include

* Reorder ops/unary ops

So we don't change again the enum values of the mul mat ops

* Minor

* Discard unnecessary changes in llama-build-context.cpp

* Minor

* Revert "Discard unnecessary changes in llama-build-context.cpp"

This reverts commit edadb80ed68c4c0831e9c22609a9a3af19be9735.

* Increase GGML_SCHED_MAX_SPLITS - required for larger u-batches

* Fix CPU concat in the TG case: 7.25 -> 10.5 t/s for Qwen3Next

* Fix CPU sum_rows: 10.5 -> 13.6 t/s for Qwen3Next

It was single-threaded and was taking ~25% of the computation time
during TG. It is now down to 2%.

Strangely enough, I measure 13.6 t/s with llama-bench, but if I
let the model give me an actual response with llama-cli, I get close
to 17 t/s.

* Fix CPU scale: 13.6 -> 16.7 t/s for Qwen3Next

For Qwen3Next there is a scale op on a largish tensor (548k elements)
that has a single row for TG, so was done in a single thread.
We now simply use blocks of 1024 elements.
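
The idea can be sketched like this. `scale_chunk` is a hypothetical function, and the interleaved assignment of blocks to threads is an assumption; only the block size of 1024 comes from the message above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Scales x in place as thread `ith` of `nth`. Work is dealt out in fixed-size
// blocks of elements rather than rows, so a single-row tensor still spreads
// across all threads.
inline void scale_chunk(std::vector<float> & x, float s, int ith, int nth) {
    assert(nth > 0 && ith >= 0 && ith < nth);
    const std::size_t block   = 1024;
    const std::size_t nblocks = (x.size() + block - 1) / block;
    for (std::size_t ib = ith; ib < nblocks; ib += nth) {
        const std::size_t first = ib * block;
        const std::size_t last  = std::min(first + block, x.size());
        for (std::size_t i = first; i < last; ++i) x[i] *= s;
    }
}
```

Each thread calls the function with its own `ith`; together the calls touch every element exactly once.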

* Optimize CPU mul: 16.7 -> 17.6 t/s for Qwen3Next

* CPU: fuse transpose -> cont -> sum_rows -> transpose: 17.6 -> 23.1 t/s for Qwen3Next

* Optimize CPU repeat: 176 -> 200 t/s for Qwen3Next PP-512

* Multithreading for OP_SUB

* Don't commit with timing trace on

* Multithread neg and sigmoid

* Be able to turn on/off fusion more easily (CPU)

* Name the mul_mat ops so we know where the time goes

* WIP

* Much better PP on CUDA

* CUDA: fuse transpose -> cont -> sum_rows -> transpose

Needs non-contiguous variant of sum_rows.
On the CPU this gave 30+% improvement in TG performance,
on CUDA it is a disappointing 6-7%. I guess, this is because
Georgi's cont CPU implementation was so bad that skipping
it made such a big difference.

* CUDA: faster mul for special case relevant for Qwen3Next

Worth 1% in TG

* Fix CPU OP_CONT

---------

Co-authored-by: yurko <yurko@local>
Co-authored-by: Yurko <yurko@example.com>
Co-authored-by: yurko <yurko@pop-os.tail5a1a6b.ts.net>
Co-authored-by: Yurko Hoshko <YurkoHoshko@users.noreply.github.com>
2026-02-16 06:50:28 +01:00
Kawrakow
2a7cc09149
Remove llamafile remnants (#1179) 2026-01-22 13:20:23 +02:00
firecoperana
c03ee1a4d2 server: improve speed of speculative decoding (#1119)
* server: improve speed of speculative decoding

change logs

rpc: add recompute

spec dec fix

* Fix n_batch_size not set to context size for draft model

---------

Co-authored-by: firecoperana <firecoperana>
2026-01-10 08:01:22 +02:00
Kawrakow
519405dc97 Async compute graph evaluation (2 or more GPUs) (#1089)
* WIP: absorb adding input into std_attn and std_ffn

* WIP: NCCL infra

* WIP: add reduce and fake_cpy ops

* WIP

* WIP: graph appears to work, layer is broken

* WIP: Qwen3-MoE works with graph, layer still broken

* WIP: GLM-4.5 graph works

* WIP: fix sm layer (dense)

* WIP: fix sm layer (MoE)

* WIP: fast PP with bespoke 4-GPU NCCL

I guess, I'm not using NCCL the right way as PP is very
low with a single communicator group for 3 or more GPUs.
But if I create 4 communicator groups for pairs of GPUs
(0,1, 2,3, 0,2, 1,3) and use that, PP is fast: I'm hitting
1500 t/s for L3-70B on the 4x3090 system, which is
~20% better than the previous sm graph without NCCL.
But that cannot be the solution (I cannot be creating pairwise
communicators and associated logic for every possible number of GPUs).
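
Why the four pairs suffice for 4 GPUs can be seen in a small host-side simulation. This is a sketch with plain floats standing in for device buffers; `pairwise_all_reduce` is a hypothetical name and no NCCL calls are involved.

```cpp
#include <array>
#include <cassert>
#include <utility>

// Simulates a two-stage all-reduce over 4 devices using only pairwise sums.
// Stage 1 pairs: (0,1), (2,3). Stage 2 pairs: (0,2), (1,3).
inline std::array<float, 4> pairwise_all_reduce(std::array<float, 4> v) {
    const std::pair<int, int> stages[2][2] = {
        {{0, 1}, {2, 3}},
        {{0, 2}, {1, 3}},
    };
    for (const auto & stage : stages) {
        for (const auto & p : stage) {
            const float sum = v[p.first] + v[p.second];
            v[p.first]  = sum; // both members of the pair end up with the sum
            v[p.second] = sum;
        }
    }
    return v;
}
```

After the first stage each device holds the sum of its pair; the second stage combines the two pair sums, so every device ends with the full total.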

* WIP: Cohere2

* Explicitly set device

* Bespoke 3-GPU case

* WIP

* Do not repeat get_rows multiple times

* Fix 3 GPUs

* OK, let's leave it in

* Simple async

* This sync seems enough

* Only do async for 4 or more backends

With 2 GPUs (so, 3 backends) not using async is slightly faster

* Scheduler changes

* Use OpenMP if available

Surprisingly (at least to me), this is quite a bit faster than
std::thread and std::barrier. GLM-4.5-AIR with 4 GPUs is now
at 105 t/s at zero context!

* Do not use OpenMP if there are tensor overrides

* Set omp max active levels

* Be more careful with having set the device before using a stream

* Command line option to turn on async. Set to false by default for now

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-27 08:18:06 +01:00
Kawrakow
ada5cc1523 Fused norm (#1086)
* Adding fused_norm - same idea as fused_rms_norm

* Avoid computing the attention reduce op for cohere2

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-24 15:22:43 +01:00
Kawrakow
0d7eb34185 Graph parallel: the next generation (#1080)
* WIP: absorb adding input into std_attn and std_ffn

* WIP: NCCL infra

* WIP: add reduce and fake_cpy ops

* WIP

* WIP: graph appears to work, layer is broken

* WIP: Qwen3-MoE works with graph, layer still broken

* WIP: GLM-4.5 graph works

* WIP: fix sm layer (dense)

* WIP: fix sm layer (MoE)

* WIP: fast PP with bespoke 4-GPU NCCL

I guess, I'm not using NCCL the right way as PP is very
low with a single communicator group for 3 or more GPUs.
But if I create 4 communicator groups for pairs of GPUs
(0,1, 2,3, 0,2, 1,3) and use that, PP is fast: I'm hitting
1500 t/s for L3-70B on the 4x3090 system, which is
~20% better than the previous sm graph without NCCL.
But that cannot be the solution (I cannot be creating pairwise
communicators and associated logic for every possible number of GPUs).

* WIP: Cohere2

* Explicitly set device

* Bespoke 3-GPU case

* WIP

* Do not repeat get_rows multiple times

* Fix 3 GPUs

* OK, let's leave it in

* Implement the reduce op without NCCL available

* Be able to build without NCCL

cmake -DGGML_NCCL=OFF disables it

* Make --max-gpu work again

* Slightly better for 4 GPUs without NCCL

* Cleanup

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-24 08:31:48 +01:00
Kawrakow
5585ac2aa8 Better PP performance with split mode "graph" and 3+ GPUs (#1069)
* This should do the trick for PP

* Command line option to set max. extra VRAM that the scheduler can use

* Fix bug and cleanup

* Looks like with this change it is working with tensor overrides

* Nah, it is not working

* OK, this seems to be working

* Disable split scheduling with tensor overrides

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-17 07:40:25 +01:00
Kawrakow
f65fefa36c Slightly faster TG for split mode "graph" (#1057)
* Rearrange graph nodes

So that we can do graph portions that are the same on 2 or more
GPUs at the same time.

* Separate graph compute implementation for split mode graph

* This is better

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-12 07:54:37 +01:00
Kawrakow
18fdd80eaf Hadamard transforms for K-cache - CPU only (#1033)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-04 06:51:11 +01:00
Kawrakow
a719349982 POC: CUDA tensor parallel (MoE models) (#1022)
* Remove most of split mode row

* WIP

* WIP: also allocate the KV cache using tensor split

* WIP: it runs with wrong result

But it also looks like the backend scheduler is not going to help:
* It copies mask and input positions to GPU 0
* => RoPE ops must run on GPU 0
* => To proceed attn evaluation, GPU 1 must wait for GPU 0 to finish its
     entire attn calculation
* Same with FFN. The rms_norm gets scheduled on GPU 0. Hence, GPU 1 must
  wait for GPU 0 to finish its entire FFN calculation before it can
  start (as it needs to copy the result of rms_norm from GPU 0)
* => Seems useless without writing a bespoke TP scheduling

* WIP

* This works, but it is slow

* This is slightly better

the graph is still not being computed in parallel.
Why? Because the scheduler creates graph splits where the
result of the computation on one GPU becomes an input for the
other split. Hence, to trigger the computation on the second GPU
one needs to wait for the computation on the first GPU to finish,
even though the two can be done in parallel up to the synchronization
point. So, all that is left to do is to trick the scheduler to create
two splits that can be done in parallel, and then have a graph split
where the results get combined.

* Playing games with the scheduler

This change tricks it into doing the right thing^TM.
Still quite a bit slower than split mode layer for the 8B LlaMA model.
But for the 70B LlaMA it now beats split mode layer for TG:
28 t/s vs 24.4 t/s. PP is 627 t/s vs 744 t/s.
In comparison, split mode "row" in mainline gets
484 t/s PP and 19.3 t/s TG.

* Fix attn split

Granularity for Wq, Wo is not just head size, but
head size * gqa_ratio.
Else the Wk, Wv tensors end up not being a multiple of the
head size when we divide the split determined by Wo with
the gqa_ratio.
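
A worked example of the granularity issue, as a sketch: `split_rows` is a hypothetical helper using equal shares with the remainder on the last device, which is an assumption and not the actual split logic.

```cpp
#include <cassert>
#include <vector>

// Splits `total` rows across devices in equal shares, each a multiple of
// `granularity`. The last device takes the remainder.
inline std::vector<int> split_rows(int total, int n_device, int granularity) {
    assert(n_device > 0 && granularity > 0 && total % granularity == 0);
    const int units = total / granularity;
    std::vector<int> rows(n_device);
    int used = 0;
    for (int d = 0; d < n_device; ++d) {
        const int share = (d == n_device - 1) ? units - used : units / n_device;
        rows[d] = share * granularity;
        used += share;
    }
    return rows;
}
```

Take head size 128, gqa_ratio 8 and 8192 Wo rows on 3 devices. With granularity 128 the first device gets 2688 rows, so its Wk/Wv share is 2688 / 8 = 336 rows, which is not a multiple of 128. With granularity 128 * 8 = 1024 the shares are 2048, 2048 and 4096, giving 256, 256 and 512 rows for Wk/Wv, all whole heads.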

* Show memory used per device

* Make it work with partial offload

but no tensor overrides yet, just ngl < num_layers.

* Allow for f16 source in fused_rms_norm

* This results in faster PP.

Now PP is faster than split mode layer for L3-70B.

* Rename split mode "row" to split mode "graph"

* Leave FFN partial results as f16

* WIP GLM4.5 - runs with wrong results

* WIP GLM4.5 - this works

PP is already better than split mode layer, but TG for zero context
is kind of low - 60 vs 92 t/s. TG becomes better than split mode layer
at around 20k tokens. PP at 26k tokens is 1.55X of sm layer.

* Work around compiler bug

It issues a warning that there is an extra semicolon outside of a function,
but there isn't. If I remove the anonymous namespace and turn the
functions inside into static, the warning disappears, so clearly
a compiler bug.

* Make graph reuse work with split mode graph

* Remove more split mode row remnants

* WIP tensor overrides

Runs with wrong results, don't see where the issue could be.

* This works but is slow

Still does not work for row-interleaved quants

* Slightly better

* Slightly better

* Row-interleaved quants work

* Better

* Minor

* Guard against using split mode "graph" for unsupported models

* Guards against using merge_qkv with split mode "graph"

* WIP split mode attn

Works for LlaMA models, but not for GLM-4.5.
Doesn't seem to improve performance, so I guess no point in trying to
fix it.

* Split mode graph for qwen3moe

* Try to better distribute the splits

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-01 19:25:40 +01:00
firecoperana
15771072c7 RPC: support multiple devices including cpu (#1024)
* RPC support multiple devices

* rpc : update documentation (#16441)

Update the README file to match the newly added functionality of
exposing multiple devices from a single server.

Co-authored-by: Diego Devesa <slarengh@gmail.com>

# Conflicts:
#	examples/rpc/README.md

* Remove memory settings

* rpc : cache and reuse compute graphs (#15405)

Store the last computed graph and reuse it when possible.
Also do not return response from GRAPH_COMPUTE and assume it always
completes successfully. If this is not the case, the server closes
the connection. This saves us a network round trip to the server.

* Add -cpu to include cpu backend

---------

Co-authored-by: firecoperana <firecoperana>
Co-authored-by: Radoslav Gerganov <rgerganov@gmail.com>
2025-11-30 18:48:02 +01:00
firecoperana
869557c8fd Update mtmd to improve accuracy of M-RoPE (#993)
* model : Granite docling + Idefics3 preprocessing (SmolVLM) (#16206)

* feat: Add granite-docling conversion using trillion pretokenizer

Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Add granite-docling vocab pre enum

Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Use granite-docling pre

Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Add clip_is_idefics3

Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Allow multi-token boundary sequences for image templating

Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Add tiling support for idefics3 in clip.cpp

This should likely be moved into llava_uhd::get_slice_instructions, but for
now this avoids disrupting the logic there.

Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Partial support for full templating for idefics3 in mtmd

There are still errors encoding some of the image chunks, but the token
sequence now matches transformers _almost_ perfectly, except for the double
newline before the global image which shows up as two consecutive newline
tokens instead of a single double-newline token. I think this is happening
because the blocks are tokenized separately then concatenated.

Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Fully working image preprocessing for idefics3 w/ resize and slicing

Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Parse the preprocessor config's longest side and add it to the mmproj hparams

Branch: GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Use the longest side instead of size * scale_factor

For Granite Docling, these come out to the same value, but that was just a
coincidence.

Branch: GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Allow batch encoding and remove clip_is_idefics3

Branch: GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor: Remove unnecessary conditionals for empty token vectors

Branch: GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor: Use image_manipulation util

Branch: GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* add test model

---------

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
# Conflicts:
#	convert_hf_to_gguf.py
#	convert_hf_to_gguf_update.py
#	gguf-py/gguf/constants.py
#	gguf-py/gguf/gguf_writer.py
#	src/llama-vocab.cpp
#	src/llama-vocab.h

* mtmd : support home-cooked Mistral Small Omni (#14928)

* model : add LightOnOCR-1B model (#16764)

* model : add LightOnOCR-1B model

* add test
# Conflicts:
#	convert_hf_to_gguf.py
#	gguf-py/gguf/constants.py

* mtmd : fix idefics3 preprocessing (#16806)

* mtmd : fix idefics3 preprocessing

* disable granite test

* fix test for granite

* model: Add support for CogVLM model (#15002)

* Added GGUF mappings for CogVLM model

* Add tensor mapping for CogVLM visual encoder

* Add CogVLM to conversion script, no vision part yet

* Added CogVLM vision model to conversion script

* Add graph for CogVLM CLIP model

* Add graph for CogVLM

* Fixes for CogVLM. Now compiles.

* Model now runs

* Fixes for cogvlm graph

* Account for graph context change after rebase

* Changes for whitespace

* Changes in convert script according to comments

* Switch CogVLM LLM graph to merged QKV tensor

* Use rope_type variable instead of direct definition

* Change CogVLM CLIP encoder to use SWIGLU

* Switch CogVLM CLIP to use merged QKV

* Apply rebase edits and remove ggml_cont call that is now unnecessary

* clean up

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
# Conflicts:
#	convert_hf_to_gguf.py
#	examples/mtmd/clip.cpp
#	gguf-py/gguf/constants.py
#	gguf-py/gguf/tensor_mapping.py
#	src/llama-arch.cpp
#	src/llama-arch.h
#	src/llama-model.cpp
#	src/llama-model.h

* mtmd: refactor preprocessing + support max/min pixels (#16878)

* mtmd: refactor preprocessing + support max/min pixels

* fix mlp type

* implement min/max pixels

* improve hparams

* better image preproc for qwen

* fix

* fix out of bound composite

* fix (2)

* fix token calculation

* get_merge_kernel_size()

* fix llama4 and lfm2

* gonna fix them all

* use simple resize for qwen

* qwen: increase min tokens

* no resize if dst size == src size

* restore to initial min/max tokens value for qwen
# Conflicts:
#	examples/mtmd/clip.cpp

* clip : use FA (#16837)

* clip : use FA

* cont : add warning about unsupported ops

* implement "auto" mode for clip flash attn

* clip : print more detailed op support info during warmup

* cont : remove obsolete comment [no ci]

* improve debugging message

* trailing space

* metal : remove stray return

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>

* model: add Janus Pro for image understanding (#16906)

* Add support for Janus Pro

* Update gguf-py/gguf/tensor_mapping.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update gguf-py/gguf/tensor_mapping.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Address reviewer suggestions

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Add JANUS_PRO constant

* Update clip model handling

Co-authored-by: Xuan-Son Nguyen <son@huggingface.co>

* Update tools/mtmd/clip.cpp

Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>

* Refactor JANUS_PRO handling in clip.cpp

Co-authored-by: Xuan-Son Nguyen <son@huggingface.co>

* Update tools/mtmd/clip.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* em whitespace

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Xuan-Son Nguyen <son@huggingface.co>
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
# Conflicts:
#	convert_hf_to_gguf.py
#	gguf-py/gguf/constants.py
#	gguf-py/gguf/tensor_mapping.py

* mtmd: pad mask for qwen2.5vl (#16954)

* mtmd: pad mask for qwen2.5vl

* improve

* mtmd: add --image-min/max-tokens (#16921)

* mtmd: improve struct initialization (#16981)

* mtmd: allow QwenVL to process larger image by default (#17020)

* Disable flash attention

* mtmd : fix embedding size for image input (#17123)

* mtmd: fix patch_size initialized to random value in audio models (#17128)

* mtmd: fix patch_size initialized to random value in audio models

* add default hparams

* add llama_model_n_embd_inp

* Fix load qwen3 vl

Change batch size

* Add description

* Fix cli build error

---------

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
Co-authored-by: Tianyue-Zhao <zhaotianyue@outlook.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Zhiyong Wang <85110830+ravenouse@users.noreply.github.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
Co-authored-by: firecoperana <firecoperana>
2025-11-29 07:27:15 +01:00
Kawrakow
532a05e466 CUDA: set compute parameters via command line arguments (#910)
* cuda: set compute parameters via command line arguments

* Also llama-bench

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-07 07:11:23 +02:00
Thireus ☠
86597623a5 Port of Qwen3-VL support from mainline (#883)
* Port of Qwen3-VL for latest ik_llama.cpp

- convert_hf_to_gguf.py - Not touched, use llama.cpp to convert model instead
- SYCL and Metal support for imrope not added
- Vulkan support for imrope not tested
- Code not tested

* Bugfix n_embd was declared multiple times

https://github.com/ikawrakow/ik_llama.cpp/pull/883#issuecomment-3471179655

* Fix n_embd issue with qwen3vl

* model.output tensor not required

https://github.com/ikawrakow/ik_llama.cpp/pull/883#discussion_r2480388389

* Improved logic for qkv combined tensors

59ceaf8fcb (r2480395800)
59ceaf8fcb (r2480398187)

* Fix n_embd for merge_qkv() + cleaner code

https://github.com/ikawrakow/ik_llama.cpp/pull/883#discussion_r2481227395

* Revert TENSOR_NOT_REQUIRED
2025-11-04 19:20:54 +02:00
Kawrakow
fb0d5a995c RoPE cache (#887)
* Introducing rope cache

When computing RoPE, the rotation angles in each layer
are exactly the same, and only depend on the token positions
(and other constant, model dependent parameters).
So, I wonder, why don't we compute the angles just once
and then reuse for the Q and K RoPE in each layer?

This commit does it as a POC on the CPU, and uses it in
the Qwen3-MoE compute graph.

* cuda: neox works

* WIP

* rope_cache: norm works

* Fused rope+rope

* Fused rope+rope (norm)

* Fused rms+rms+rope+rope (neox) - not working

* WIP

* Also qwen3

* Add command line arg to disable rope cache

* Disable RoPE cache if rope type is not neox or norm

* Add missing break after merge with main

* Fused fused_rms+fused_rms+rope+rope (with -mqkv)

* Fused fused_rms+fused_rms+rope+rope (without -mqkv)
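
The idea from the first item above (the rotation angles depend only on the token positions and constant model parameters, so they can be computed once and reused for the Q and K RoPE in every layer) can be sketched as follows. This is a minimal illustration in Python, not the code from this commit: the function names, the `theta_base` default, and the list-based layout are all assumptions, and only the NeoX pairing is shown.

```python
import math

def rope_angles(positions, head_dim, theta_base=10000.0):
    """Build the cos/sin tables once; they depend only on positions (hypothetical helper)."""
    half = head_dim // 2
    inv_freq = [theta_base ** (-2.0 * i / head_dim) for i in range(half)]
    cos = [[math.cos(p * f) for f in inv_freq] for p in positions]
    sin = [[math.sin(p * f) for f in inv_freq] for p in positions]
    return cos, sin

def apply_rope_neox(vec, cos_row, sin_row):
    """Rotate one head vector using the NeoX pairing (i, i + head_dim / 2)."""
    half = len(vec) // 2
    out = [0.0] * len(vec)
    for i in range(half):
        x0, x1 = vec[i], vec[i + half]
        out[i] = x0 * cos_row[i] - x1 * sin_row[i]
        out[i + half] = x0 * sin_row[i] + x1 * cos_row[i]
    return out

def run_layers(q_per_layer, k_per_layer, positions, head_dim):
    """Apply RoPE to Q and K of every layer, reusing a single angle table."""
    cos, sin = rope_angles(positions, head_dim)  # computed once, not per layer
    rotated = []
    for q, k in zip(q_per_layer, k_per_layer):
        q_rot = [apply_rope_neox(v, cos[t], sin[t]) for t, v in enumerate(q)]
        k_rot = [apply_rope_neox(v, cos[t], sin[t]) for t, v in enumerate(k)]
        rotated.append((q_rot, k_rot))
    return rotated
```

The saving comes from the trigonometric evaluations: without the table they are repeated for Q and K in each layer, with it they are done once per batch.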

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-03 18:42:20 +02:00
Kawrakow
0549be76e5 Fused mul + multi_add op (#858)
* Adding fused mul+multi_add + CPU implementation

* fused mul+multi_add: CUDA

* fused mul+multi_add: command line argument to disable it

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-10-24 07:40:35 +03:00
Kawrakow
cde642e591 Grouped expert routing (CPU only) (#836)
* Better argsort (CPU)

* Attempt at grouped topk

* This seems to do the trick for grouped experts routing

* Cleanup

* Trying to merge, something is not right

* Working merged grouped top_k (CPU)

* Add command line option to enable grouped expert routing

* Add grouped expert routing option to llama-bench

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-10-16 14:57:02 +03:00
Kawrakow
f7adde1043 Adding Ling/Ring (a.k.a., Bailing-MoE2) support (#833)
* Adding Ling/Ring (a.k.a., Bailing-MoE2)

* Add expert group selection (not working, so turned off)

* BailingMoE2 conversion

* WIP

* Bits and pieces

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-10-15 14:20:40 +03:00
Kawrakow
4e24d48e63 Attention mask tweaks for better long context performance (#825)
* Parallelize mask

We see non-negligible PP gains for long contexts.
More importantly, the strange drop in performance
observed for GPT-OSS for context >= 32k tokens is gone.

* With FA on, create mask as f16 directly

* WIP

* Reduce KQ mask padding to 16

Why was it 64 in the first place?

I don't observe any issues, while TG performance
for long contexts improves by 2-4%.
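
To make the padding change concrete, here is a rough sketch of the arithmetic. It assumes the padding is applied to the token dimension of the KQ mask and that rounding is done to a power-of-two multiple, as ggml's `GGML_PAD` macro does; `pad_to` and `mask_entries` are hypothetical helper names, not functions from the code base.

```python
def pad_to(n, multiple):
    """Round n up to the next multiple (multiple must be a power of two)."""
    return (n + multiple - 1) & ~(multiple - 1)

def mask_entries(n_kv, n_tokens, padding):
    """Number of mask elements for n_tokens rows padded to `padding`, each n_kv wide."""
    return n_kv * pad_to(n_tokens, padding)

# Token generation processes one token at a time, so the padded row count dominates.
old = mask_entries(n_kv=32768, n_tokens=1, padding=64)
new = mask_entries(n_kv=32768, n_tokens=1, padding=16)
print(old // new)  # the mask is 4x smaller with the reduced padding
```

Under these assumptions a single-token batch at long context builds a mask four times smaller, which is consistent with the effect being visible in TG rather than PP.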

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-10-13 14:01:11 +03:00
Kawrakow
c1a0e15377 Port mtmd from mainline + Qwen2/2.5-VL support (#798)
* Add mtmd: the beginning

* Add mtmd: mtmd.cpp compiles

* Add mtmd: clip initialization compiles

* Add mtmd: clip.cpp compiles

* Add mtmd: builds successfully

* Add CPU implementation for GGML_OP_GLU

* Add CUDA implementation for GGML_OP_GLU

* Add CPU implementation for GGML_OP_CONV_2D and GGML_OP_CONV_2D_DW

* Add CUDA implementation for GGML_OP_CONV_2D and GGML_OP_CONV_2D_DW

* Add mtmd: refresh CPU rope

* Add mtmd: refresh CUDA rope

* Add mtmd: add Qwen2-VL

* Add mtmd: Qwen2.5-VL text seems to work with this change

* Add mtmd: fix swiglu

* Add mtmd: use LOG_TEE so generated tokens show up in terminal

* Add mtmd: do not attempt to load a GPU backend if none are available

* GLU, not GPU

* Fix typo

* Fix new/free mismatch

* LOG stuff

* Add mtmd: this fixes gibberish on second image

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-09-27 08:45:29 +02:00
Kawrakow
13c3b6412e Offload only activated experts to the GPU (#698)
* Offload only activated experts

* This seems to do the trick for -fmoe

* Do not recalculate activated experts for fused up/gate

* Log out of bounds access details

* Add a command line argument

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-09-04 12:22:30 +02:00
Kawrakow
8de297b795 Fused FFN_UP+FFN_GATE op (#741)
* Fused up+gate+unary for regular (not MoE) FFN - CPU

* WIP CUDA

* Seems to be working on CUDA

For a dense model we get 2-3% speedup for PP and ~0.6% for TG.

* Add command line option

This time the option is ON by default, and one needs to turn it
off via -no-fug or --no-fused-up-gate
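
As an illustration of what fusing up+gate+unary means, the sketch below compares the unfused and fused forms for a single input vector. It is a simplified Python model, not the CPU or CUDA kernel from this commit; SiLU is assumed as the unary op and all function names are hypothetical.

```python
import math

def silu(x):
    """SiLU activation, used here as the assumed unary op."""
    return x / (1.0 + math.exp(-x))

def matvec(w, x):
    """Plain matrix-vector product over nested lists."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def ffn_unfused(w_up, w_gate, x):
    """Three separate ops: up projection, gate projection, then act(gate) * up."""
    up = matvec(w_up, x)
    gate = matvec(w_gate, x)
    return [silu(g) * u for g, u in zip(gate, up)]

def ffn_fused(w_up, w_gate, x):
    """One pass: both dot products share the read of x, no intermediate tensors."""
    out = []
    for up_row, gate_row in zip(w_up, w_gate):
        u = 0.0
        g = 0.0
        for wu, wg, xi in zip(up_row, gate_row, x):
            u += wu * xi
            g += wg * xi
        out.append(silu(g) * u)
    return out
```

The fused form avoids writing and re-reading the `up` and `gate` intermediates, which is the kind of saving that would explain a small speedup on a dense model.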

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-08-31 18:16:36 +03:00
Kawrakow
ca8c72ff1a AVX512+AVXVNNI GEMM implementation for quants using Q8_K for activations (#710)
* q8_k_r16: basics

* q8_k_r16: iq4_xs now uses q8_k_r16 on Zen4+

PP performance is about the same as using q8_k_r8 on the Ryzen-7950X,
so we expect nice gains on Zen5, and we don't need to worry about
using 2 different q8_k_r8 implementations for fancy SIMD.

* q8_k_r16: iq2_xxs now uses q8_k_r16 on Zen4+

* q8_k_r16: iq2_xs now uses q8_k_r16 on Zen4+

* q8_k_r16: iq2_s now uses q8_k_r16 on Zen4+

* q8_k_r16: iq3_xxs now uses q8_k_r16 on Zen4+

* q8_k_r16: iq3_s now uses q8_k_r16 on Zen4+

* q8_k_r16: iq1_s and iq1_m now use q8_k_r16 on Zen4+

* q8_k_r16: q2_K and q3_K now use q8_k_r16 on Zen4+

* q8_k_r16: iq2_ks and iq2_k now use q8_k_r16 on Zen4+

* q8_k_r16: iq2_kl now uses q8_k_r16 on Zen4+

* q8_k_r16: iq3_ks and iq3_k now use q8_k_r16 on Zen4+

* q8_k_r16: iq4_kss, iq4_ks, and iq4_k now use q8_k_r16 on Zen4+

* q8_k_r16: iq5_ks, iq5_k, and iq6_k now use q8_k_r16 on Zen4+

* Fix AVX2

* Just always set num_rows to 16

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-08-22 06:27:07 +03:00
Kawrakow
6b2c84b099 Revert "Better CPU prompt processing performance for SWA models (#696)" (#701)
This reverts commit 93a4f6089f583207b233c98617bf1d0c0d3b9d83.

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-08-17 15:44:02 +03:00
Kawrakow
d4d017766e Better CPU prompt processing performance for SWA models (#696)
* This does the trick for PP

* Compute mask bounds when creating the mask

* Set mask bounds for all supported SWA models

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-08-17 10:30:27 +03:00
Kawrakow
633e0617b0 Enable CUDA graphs for MoE models + GPT-OSS support (#689)
* gpt-oss: common

* gpt-oss: attention sinks, swiglu_oai

* gpt-oss: WIP llama

Model loads and runs (CPU only), but PPL is much too high
(~1500 for 1st batch vs ~200 in mainline).
Is it because of SWA, because of vocab, or did I introduce a bug somewhere?

* gpt-oss: CPU seems to be working

It was the SWA that was missing in the previous commit.

There are issues with EOG tokens, so this still needs to be added.

* CUDA: ADD_ID

Just a copy from mainline

* gpt-oss: Seems to be working on CUDA

* gpt-oss: add sinks to the attn-vec kernels

* CUDA: add head size of 64 to new mma

Haven't turned it on yet, but observe slightly better PP and slightly
worse TG performance with that.

* gpt-oss: add ability to use -fmoe (only CUDA for now)

* Move row sums to the right place

* Add sinks to iqk flash attention

* gpt_oss: Implement -fmoe on the CPU

* Simdify swiglu_oai

Turning it off for now as performance becomes more variable,
so perhaps I'm running into thermal throttling more often
because of making the CPU work too hard.

* llama: factor out model loader

* Builds successfully

* It runs, but mmap does not work

* Fix llama_mmap so mmap works

* Minor

* Fix CUDA after latest changes

* Attempt to use CUDA graphs with MoE models - not working

* CUDA graphs WIP - still not working

* CUDA graphs - seems to be working

Likely not all MLA variants are working.
I no longer remember why I added the q8_0 cpy that
transposes the tensor, but if really needed, this is now
missing. Also missing is q6_0.

* Make q8_0 cache work for DeepSeek models with CUDA graphs

* cuda: cpy for q6_0

* Fix llama_mmap on non-Linux platforms

* Adding forgotten file

* Iterating on Windows build failures

* cuda: re-add q8_0 -> q8_0 transpose

so mla = 2 can be used with CUDA graphs and q8_0 cache.

* Disable graphs without -fmoe

* Minor

* Turn graphs on by default

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-08-15 09:18:07 +03:00
Kawrakow
e23b2a7cc9 MXFP4 (#682)
* mxfp4: basics

* mxfp4: Zen4 GEMM

* mxfp4: repacked GEMM (AVX2/Zen4)

* mxfp4: AVX2 GEMM

* mxfp4: NEON GEMM

* mxfp4: repacked GEMM (NEON)

* mxfp4: Metal

* Fix quantized K cache without FA (#680)

* Prevent assert with quantized K cache and no FA

* Fix MMQ when running with quantized K cache without FA

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

* Fix for Deepseek r1 parsing (#676)

* Implement function calling / tools for ik_llama.cpp for Kimi K2

* Implement basic tool choice

* Backport llama.cpp tool calls support

* Enhance function calls with improved chat parser and string utilities

- Add new chat.h/chat.cpp and chat-parser.h/chat-parser.cpp for better chat handling
- Improve function calls parsing with fallback to llama.cpp builder pattern
- Add string utility functions (starts_with, ends_with, find_partial_stop)
- Update README with function calls testing instructions
- Enhance Kimi K2 parser and function calls documentation
- Add comprehensive test suite for function calls
- Update CMakeLists.txt and Makefile for new components

* Enhance function calling with unified streaming and parser improvements

- Fix streaming content cleanup to prevent function syntax in output
- Unify content extraction patterns with llama.cpp approach
- Improve Kimi K2 parser robustness and partial content handling
- Add comprehensive test coverage for function call scenarios
- Optimize chat message parsing and diff computation

* Replace hardcoded values in kimi_k2_parser.hpp with named constants

- Add compile-time constants for all token format markers
- Add compile-time constants for XML format markers
- Add compile-time constants for simple format patterns
- Replace all hardcoded string literals with named constants
- Use compile-time length calculation to avoid manual counting
- Improve maintainability and reduce magic numbers throughout parser

* Fix duplicate common_chat_parse definition

- Remove duplicate implementation from chat-parser.cpp
- Keep single implementation in chat.cpp following llama.cpp patterns
- Resolves linker error: multiple definition of common_chat_parse

* Fix JSON assertion failure in function call parsing

- Add proper validation that 'function' field is an object before accessing nested keys
- Handle missing 'arguments' field gracefully with default "{}"
- Prevents crash when parsing malformed tool call JSON structures
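
The validation described in this item can be sketched like this. It is a minimal Python illustration of the same checks, not the C++ code from the commit; `extract_tool_call` and the returned dictionary layout are assumptions made for the example.

```python
import json

def extract_tool_call(raw):
    """Return {'name', 'arguments'} for a well-formed tool call, else None (hypothetical helper)."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(call, dict):
        return None
    function = call.get("function")
    # Check that 'function' is an object before touching nested keys.
    if not isinstance(function, dict):
        return None
    name = function.get("name")
    if not isinstance(name, str) or not name:
        return None
    # A missing 'arguments' field falls back to an empty JSON object.
    arguments = function.get("arguments", "{}")
    if not isinstance(arguments, str):
        arguments = json.dumps(arguments)
    return {"name": name, "arguments": arguments}
```

Malformed input such as `{"function": "oops"}` is rejected with `None` instead of raising, which mirrors the "handle gracefully" behavior the item describes.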

* Add comprehensive Qwen3 XML tool calling support with unit tests

- Implement Qwen3 XML parser with <tool_call>{"name": "func", "arguments": {...}}</tool_call> format
- Add model detection and routing for Qwen3 vs Kimi-K2 formats
- Create 8 comprehensive unit tests covering parsing, streaming, error handling
- Fix token format cleaning bug in kimi_k2_parser.hpp processing order
- Remove progressive parsing code and related utilities
- Add tool injection support for Qwen3 format in server utils

* Add DeepSeek R1 function calling support with comprehensive unit tests

- Implement complete DeepSeek R1 tool call parsing in common_chat_parser.cpp
- Add DeepSeek R1 model detection and tool injection in deepseek_r1_tools.hpp
- Update function_calls.hpp with DeepSeek R1 integration and content extraction
- Update documentation to reflect support for Kimi-K2, Qwen3, and DeepSeek R1 models
- Add comprehensive unit tests for DeepSeek R1 reasoning, tool calls, and integration
- Port exact implementation patterns from original llama.cpp for compatibility

Key features:
- Native DeepSeek R1 format: <|tool▁calls▁begin|>function<|tool▁sep|>name```json{}```<|tool▁call▁end|><|tool▁calls▁end|>
- Reasoning content extraction from <think>...</think> tags
- Multiple tool calls support with separate call blocks
- Model detection for deepseek-r1, deepseek_r1 naming patterns
- Integration with incremental parsing and streaming support
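
For readers unfamiliar with the format, the following is a rough Python sketch of how a response in the native format listed above could be split into reasoning, content, and tool calls. It follows the marker strings exactly as written in this message and is not the parser added by the commit; `parse_deepseek_r1` and the result layout are hypothetical, and streaming or partial input is not handled.

```python
import json
import re

# Markers copied from the format string in the commit message above.
CALLS_BEGIN = "<|tool▁calls▁begin|>"
CALLS_END = "<|tool▁calls▁end|>"
CALL_RE = re.compile(
    r"function<\|tool▁sep\|>(?P<name>[^\n`]+?)\s*```json\s*(?P<args>.*?)\s*```"
    r"\s*<\|tool▁call▁end\|>",
    re.DOTALL,
)
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def parse_deepseek_r1(text):
    """Split a complete response into reasoning, content and tool calls (sketch only)."""
    reasoning = "\n".join(m.strip() for m in THINK_RE.findall(text))
    body = THINK_RE.sub("", text)
    calls = []
    content = body
    start = body.find(CALLS_BEGIN)
    end = body.find(CALLS_END)
    if start != -1 and end > start:
        # Each call block inside the begin/end markers yields one entry.
        for m in CALL_RE.finditer(body[start:end]):
            calls.append({
                "name": m.group("name").strip(),
                "arguments": json.loads(m.group("args")),
            })
        content = body[:start] + body[end + len(CALLS_END):]
    return {"reasoning": reasoning, "content": content.strip(), "tool_calls": calls}
```

Invalid JSON inside a call block raises `json.JSONDecodeError` in this sketch; a production parser would need a fallback for that case.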

* Add partial parsing support for JSON and regex

- json-partial.h/cpp: JSON partial parsing functionality
- regex-partial.h/cpp: Regex partial parsing functionality

* Add format_chat integration tests for Qwen3 tool injection

- Add test_qwen3_format_chat_integration() to validate tool injection pipeline
- Test tool injection conditions and system message enhancement
- Verify JSON formatting and anti-preamble instructions
- Add comprehensive test documentation

Tests confirm tool injection works correctly - conversational preamble
issue is not in ik_llama.cpp but likely in UI configuration.

* Fix Qwen3 tool call parsing - pass model name to parser

Server was not passing model name to parse_chat_message_incremental(),
causing Qwen3 to fall back to Kimi-K2 parser and return tool calls
as content instead of proper tool_calls array.

* Fix non-streaming path to use model-specific parsing

Non-streaming responses were hardcoded to use Kimi-K2 format,
causing Qwen3 XML tool calls to be returned as content instead
of proper tool_calls array. Now uses same model detection as
streaming path for consistency.

* Update Qwen3 function call handling in server and tests

- Enhanced server function call detection and response formatting
- Improved test coverage for Qwen3 tool call scenarios
- Refined XML parsing for better tool execution support

* Add DeepSeek-R1 function call parsing support

Implements comprehensive parsing for all 4 DeepSeek-R1 function call formats:
- Format 1: Standard function call syntax (already supported)
- Format 2: Alternative function call patterns (already supported)
- Format 3: Tools array format - function\n```json\n{"tools": [...]}
- Format 4: XML wrapped format - <tool_call>function</think>Name\n```json\n{...}```</tool_call>

Key changes:
- Added parse_deepseek_r1_tools_array() following original parse_prefixed_json_tool_call_array pattern
- Added parse_deepseek_r1_xml_wrapped() following Hermes-2-Pro XML wrapper patterns
- Integrated both parsers into exception handling chain for robust fallback
- Added comprehensive TDD test coverage for all formats
- Anonymized all confidential information while preserving functionality

Resolves tool_calls_count=0 issue where DeepSeek-R1 models generated valid tool calls
but server failed to parse them correctly.

* Update function_calls.md documentation for DeepSeek-R1 Format 4

- Added Format 4 (XML wrapped) documentation with examples
- Updated implementation notes with correct parser order (3→4→1→2)
- Marked all DeepSeek-R1 formats as working (July 2025 update)
- Updated test status for Format 3 and 4 as passing
- Added parse_deepseek_r1_xml_wrapped() function reference
- Corrected implementation file line numbers

* Fix merge conflict in test-function-calls.cpp

- Removed incomplete merge conflict marker from line 3027
- Ensured all tests compile and pass successfully
- All DeepSeek-R1 formats (1-4) working correctly
- All streaming and content cleaning tests passing

* Fix DeepSeek R1 parsing issue with responses wrapped in think tags

Restore missing consume_rest() call from working PR #648 implementation.
When responses don't contain tool calls, remaining content after reasoning
parsing must be preserved as displayable content.

Fixes issue where entire responses wrapped in <think> tags resulted in
empty content output.

* Implement proper reasoning handling following original llama.cpp patterns

- Add missing reasoning_format and reasoning_in_content fields to common_chat_syntax
- Update try_parse_reasoning to match original llama.cpp logic exactly
- Add TDD test case with reasoning_in_content=true for DeepSeek R1
- Following TDD: test should now pass with proper syntax configuration

Based on original llama.cpp implementation patterns.

* TDD SUCCESS: Fix DeepSeek R1 thinking tag termination issue

- Test passes with reasoning_in_content=true configuration
- Content properly preserved: '<think>content</think>' displays fully
- Reasoning field empty as expected
- Following TDD: test-first approach validates the fix

Next: Update server to automatically apply this configuration.

* Complete server integration fix for DeepSeek R1 thinking tag termination

- Server now automatically sets reasoning_in_content=true for DeepSeek R1 models
- Fixes issue where responses wrapped in <think> tags appear empty to users

* Add TDD test case for DeepSeek R1 thinking tag termination issue

- Test reproduces the exact failure scenario reported by user
- Validates that reasoning_in_content=true fixes the issue
- Demonstrates empty content problem and working solution

* Add remaining TDD test changes for DeepSeek R1 thinking tag fix

* Add debug output after upstream merge

* Remove temporary benchmark and debug files

- Remove tests/benchmark-progressive-parsing.cpp (development tool, not part of core functionality)
- Remove tests/reproduce_bug.sh (debugging script, not needed for PR)

* Port cpu moe options from mainline (#672)

* Port cpu moe options from mainline

* Use strdup and int32_t to follow coding guidelines

* mxfp4: CUDA dequantize

* mxfp4: CUDA GEMV

* mxfp4: CUDA MMQ

* mxfp4: minor CUDA tweaks

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: Anton Sokolchenko <wsevendays@gmail.com>
Co-authored-by: Parsa <61601745+TheLegendOfKitty@users.noreply.github.com>
2025-08-09 08:40:18 +03:00
Kawrakow
e1164e1fd8 Adding IQ1_KT - 1.75 bpw SOTA quants (#616)
* iq1_kt: basics

* iq1_kt: CUDA dequantize

Testing with Llama-3.1-8B-Instruct, we get almost the same PPL
as iq2_xxs, so about 0.2 bpw fewer bits for the same quality.

* iq1_kt: CUDA MMQ

* iq1_kt: CUDA MMVQ

* iq1_kt: AVX2 GEMM/GEMV

* iq1_kt: convert/repack to q8_0_r8 (AVX2)

* iq1_kt: slightly faster GEMV

18.6 t/s -> 19.4 t/s

* iq1_kt: NEON GEMM/GEMV

Pathetic as usual

* iq1_kt: slightly faster NEON - still pathetic

* iq1_kt: tiny bit better GEMV on NEON

* iq1_kt: convert/repack to q8_0_r8 (NEON)

* iq1_kt: very slightly faster convert/repack to q8_0_r8 on NEON

* Adding forgotten file

* iq1_kt: add to constants.py

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-07-20 10:05:23 +02:00