ik_llama.cpp/src at b26521b9ef213bfa136ad34bc1ae7986bd51cb49 - ik_llama.cpp - Jared's Git Server

jdelony/ik_llama.cpp

mirror of https://github.com/ikawrakow/ik_llama.cpp.git synced 2026-06-28 04:30:15 -05:00

thad0ctor b26521b9ef

Fix raw-vs-local device id confusion under -dev/-devd subsets (#1826)

llm_load_tensors stores `default_layer_device[i]` as a local index into
`model.devices` (consistent with `device_mem[]`, `model.splits[]`, and
all graph-building consumers), but the four
`llama_default_buffer_type_offload(model, default_layer_device[i])`
callsites passed it through as if it were a raw device id (post-CVD,
that is, as numbered after `CUDA_VISIBLE_DEVICES` filtering).
Under `-dev`/`-devd` subsets where `model.devices != {0..N-1}`, this
selected the wrong buffer type. Wrap with `model.devices[...]` to match
the existing `model.devices[main_gpu]` pattern on the adjacent lines.
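
As an illustration of the distinction described above, here is a minimal
sketch of the local-index-to-raw-id translation. `toy_model`,
`raw_device_id`, and `raw_ids_for_layers` are hypothetical names invented
for this example and are not part of ik_llama.cpp; only the idea that
`model.devices[local_index]` yields the raw device id comes from the commit
text.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the model's device list: devices[i] holds the raw
// backend device id of the i-th selected device.
struct toy_model {
    std::vector<int> devices;  // e.g. {1, 3} when only raw devices 1 and 3 are selected
};

// Translate a local index (a position in model.devices) into a raw device id.
inline int raw_device_id(const toy_model & model, int local_idx) {
    assert(local_idx >= 0 &&
           static_cast<std::size_t>(local_idx) < model.devices.size());
    return model.devices[static_cast<std::size_t>(local_idx)];
}

// Each entry of default_layer_device is a local index; map every layer's
// entry to the raw id that a buffer-type lookup would need.
inline std::vector<int> raw_ids_for_layers(const toy_model & model,
                                           const std::vector<int> & default_layer_device) {
    std::vector<int> raw;
    raw.reserve(default_layer_device.size());
    for (int local_idx : default_layer_device) {
        raw.push_back(raw_device_id(model, local_idx));
    }
    return raw;
}
```

With `devices = {1, 3}`, a layer assigned local index 1 maps to raw device 3.
Passing the local index through untranslated would have selected raw device 1,
which is the kind of mismatch the paragraph above describes.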

llama_init_from_model has the same bug for `main_gpu`: every consumer
(auto-fit override at line 3428, MTP clamp, the `model.devices[main_gpu]`
translations at lines 3678/3682, and graph-building `splits[main_gpu]`)
treats it as a local index, but the five single-GPU backend init paths
(CUDA, Vulkan, SYCL, Kompute, CANN) pass `model->main_gpu` straight to
the backend init, which expects a raw device id. For example, `-dev CUDA1`
with the default `--main-gpu 0` and `split_mode=NONE` called
`ggml_backend_cuda_init(0)` instead of `ggml_backend_cuda_init(1)`. Compute
`main_gpu_id` once and use it for all five paths.
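
The "compute once, use everywhere" approach for `main_gpu` might look roughly
like the sketch below. `resolve_main_gpu_id`, `toy_backend`, and
`toy_backend_init` are hypothetical stand-ins for illustration; the real
backend init functions and their signatures are not reproduced here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: main_gpu is a local index into devices, so resolve it
// to a raw device id a single time before any backend is initialised.
inline int resolve_main_gpu_id(const std::vector<int> & devices, int main_gpu) {
    assert(main_gpu >= 0 &&
           static_cast<std::size_t>(main_gpu) < devices.size());
    return devices[static_cast<std::size_t>(main_gpu)];
}

// Stand-in for a single-GPU backend init that expects a raw device id.
struct toy_backend {
    int raw_device;
};

inline toy_backend toy_backend_init(int raw_device) {
    return toy_backend{raw_device};
}

// Buggy shape: the local index is handed to the backend as if it were raw.
inline toy_backend init_buggy(const std::vector<int> & /*devices*/, int main_gpu) {
    return toy_backend_init(main_gpu);
}

// Fixed shape: resolve once, then reuse the raw id on every backend path.
inline toy_backend init_fixed(const std::vector<int> & devices, int main_gpu) {
    const int main_gpu_id = resolve_main_gpu_id(devices, main_gpu);
    return toy_backend_init(main_gpu_id);
}
```

For a device subset containing only raw device 1 and the default local index
0, `init_buggy` targets raw device 0 while `init_fixed` targets raw device 1,
mirroring the `-dev CUDA1` example in the commit message.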

2026-05-22 08:32:52 +03:00

..

Fix Gemma4-E4B compute graph (#1855)

2026-05-21 12:46:28 +03:00

CMakeLists.txt

Move embedding management to speculative (#1825)

2026-05-20 17:42:48 +03:00

llama-arch.cpp

Add MTP Support for Gemma 4 (#1744)

2026-05-10 07:44:20 +03:00

llama-arch.h

Add MTP Support for Gemma 4 (#1744)

2026-05-10 07:44:20 +03:00

llama-build-context.cpp

Enable split mode graph for MLA models and partial offload (#1835)

2026-05-20 07:13:55 +03:00

llama-build-context.h

MLA TP prompt processing optimisation (#1841)

2026-05-20 17:03:05 +03:00

llama-context.h

Move embedding management to speculative (#1825)

2026-05-20 17:42:48 +03:00

llama-cparams.h

MLA tensor parallelism under -sm graph (DEEPSEEK2/GLM_DSA/MISTRAL4) (#1821)

2026-05-19 08:36:17 +03:00

llama-delta-net.cpp

MTP: faster recurrent state restore (#1791)

2026-05-13 11:00:24 +03:00

llama-delta-net.h

MTP: faster recurrent state restore (#1791)

2026-05-13 11:00:24 +03:00

llama-expert-io.h

Add --defer-experts flag to defer expert mmap residency on Linux (#1634)

2026-04-16 08:54:44 +02:00

llama-grammar.cpp

common/grammar: fix grammar parsing issues to prevent stack overflow and hangs (#1822)

2026-05-19 08:36:49 +03:00

llama-grammar.h

llama : add token matching support to llama-grammar (#1220)

2026-02-03 07:57:17 +02:00

llama-hparams.cpp

Add MTP Support for Gemma 4 (#1744)

2026-05-10 07:44:20 +03:00

llama-hparams.h

MTP: enable per step recurrent state for split mode graph (#1773)

2026-05-11 12:40:04 +03:00

llama-impl.h

Full graph parallel for Qwen3.5 (dense and MoE) (#1388)

2026-03-10 09:08:24 +01:00

llama-load-tensors.cpp

Fix Gemma4-E4B compute graph (#1855)

2026-05-21 12:46:28 +03:00

llama-mmap.cpp

Add --defer-experts flag to defer expert mmap residency on Linux (#1634)

2026-04-16 08:54:44 +02:00

llama-mmap.h

Add --defer-experts flag to defer expert mmap residency on Linux (#1634)

2026-04-16 08:54:44 +02:00

llama-model-loader.cpp

fix: use int8_t for GGUF bool array loading instead of platform-dependent bool (#1648)

2026-04-17 07:25:07 +02:00

llama-model-loader.h

MTP: option to use re-quantized output tensor for better TG performance (#1809)

2026-05-16 14:40:18 +03:00

llama-model.cpp

Add MTP Support for Gemma 4 (#1744)

2026-05-10 07:44:20 +03:00

llama-model.h

MLA TP -khad: ggml_dequant_hadamard fused op + wv_b/wk_b_pp Hadamard fold (#1852)

2026-05-21 07:29:15 +03:00

llama-quantize.cpp

Quantize: add extra output tensor for MTP (#1810)

2026-05-17 13:59:56 +03:00

llama-quantize.h

Allow using -rtr and -muge together (#1444)

2026-03-16 18:26:26 +01:00

llama-sampling.cpp

Log probabilities on token sampling crash (#1519)

2026-03-26 14:34:41 +01:00

llama-sampling.h

Add adaptive sampling clone and free functions to manage memory (#1851)

2026-05-21 08:11:17 +03:00

llama-spec-features.cpp

Move embedding management to speculative (#1825)

2026-05-20 17:42:48 +03:00

llama-spec-features.h

Move embedding management to speculative (#1825)

2026-05-20 17:42:48 +03:00

llama-vocab.cpp

Gemma4 tokenizer fixes (#1603)

2026-04-09 15:33:28 +02:00

llama-vocab.h

Initial Gemma4 support (#1581)

2026-04-06 10:01:08 +02:00

llama.cpp

Fix raw-vs-local device id confusion under -dev/-devd subsets (#1826)

2026-05-22 08:32:52 +03:00

unicode-data.cpp

Merge mainline llama.cpp (#3)

2024-07-27 07:55:01 +02:00

unicode-data.h

Merge mainline llama.cpp (#3)

2024-07-27 07:55:01 +02:00

unicode-script-data.cpp

Add Unicode allowlist (#1597)

2026-04-10 18:22:57 +02:00

unicode.cpp

Gemma4 tokenizer fixes (#1603)

2026-04-09 15:33:28 +02:00

unicode.h

Add Unicode allowlist (#1597)

2026-04-10 18:22:57 +02:00