ik_llama.cpp/common at 397150caa282a2c4436ac2ee58f76b82f5104c60 - ik_llama.cpp - Jared's Git Server

jdelony/ik_llama.cpp

mirror of https://github.com/ikawrakow/ik_llama.cpp.git synced 2026-06-28 04:30:15 -05:00

ubergarm f478a3ec0b

fix: only inflate n_batch for GPU-offloaded mmproj, not CPU (#1788)

The get_batch_ubatch() function unconditionally inflated n_batch and
n_ubatch whenever --mmproj was specified, regardless of whether the
mmproj model actually ran on the GPU. This boosted batch size applies
to both the main context and the MTP draft context, since
params_base.speculative.cparams_dft is derived from
common_context_params_to_llama(params_base).

When mmproj runs on CPU (--no-mmproj-offload), this batch inflation
is unnecessary for mmproj itself (CPU compute is sized by image
dimensions independently), but it still inflates the MTP compute buffer
proportionally. For large images (e.g. --image-max-tokens 4096), the
MTP compute buffer ballooned to ~2020 MiB and triggered an OOM even
though the mmproj model was fully on CPU and should have saved VRAM.

Restrict the batch inflation to !params.mmproj.path.empty() &&
params.mmproj_use_gpu so it only triggers when mmproj actually occupies
GPU memory. When mmproj runs on CPU, the existing per-chunk decode
splitting in mtmd_helper_decode_image_chunk_impl handles large images
correctly with the default batch size.

AI: ubergarm/Qwen3.6-27B-GGUF MTP IQ4_KS 15.113 GiB (4.752 BPW) + pi.dev
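
The guard described in this commit message can be sketched as follows. This is a minimal illustration, not the code from common.cpp: the struct names, the `get_batch_ubatch_sketch` helper, the default sizes, and the `kInflatedBatch` value are all assumptions made for the example. Only the condition `!params.mmproj.path.empty() && params.mmproj_use_gpu` is taken from the message.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <utility>

// Hypothetical stand-ins for the real parameter structs. Only the field
// names mmproj.path and mmproj_use_gpu come from the commit message.
struct mmproj_params_sketch {
    std::string path;               // empty when --mmproj is not given
};

struct params_sketch {
    mmproj_params_sketch mmproj;
    bool         mmproj_use_gpu = true;  // false models --no-mmproj-offload
    std::int32_t n_batch        = 2048;  // placeholder default
    std::int32_t n_ubatch       = 512;   // placeholder default
};

// Assumed inflation target; the real value is not stated in the message.
constexpr std::int32_t kInflatedBatch = 4096;

// Returns {n_batch, n_ubatch}. The sizes are raised only when an mmproj
// model is configured AND it is offloaded to the GPU.
std::pair<std::int32_t, std::int32_t> get_batch_ubatch_sketch(const params_sketch & params) {
    std::int32_t n_batch  = params.n_batch;
    std::int32_t n_ubatch = params.n_ubatch;

    const bool mmproj_on_gpu = !params.mmproj.path.empty() && params.mmproj_use_gpu;
    if (mmproj_on_gpu) {
        n_batch  = std::max(n_batch,  kInflatedBatch);
        n_ubatch = std::max(n_ubatch, kInflatedBatch);
    }
    return {n_batch, n_ubatch};
}
```

With a non-empty `mmproj.path` and `mmproj_use_gpu = false`, the sketch returns the unmodified sizes, so any context derived from them (such as the MTP draft context mentioned above) is not enlarged.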

2026-05-13 09:08:42 +03:00

Merge mainline llama.cpp (#3)

2024-07-27 07:55:01 +02:00

Autoparser - complete refactoring of parser architecture (#1376)

2026-04-22 10:04:13 +02:00

base64.hpp

llava : expose as a shared library for downstream projects (#3613)

2023-11-07 00:36:23 +03:00

build-info.cpp.in

build : link against build info instead of compiling against it (#3879)

2023-11-02 08:50:16 +02:00

chat-auto-parser-generator.cpp

Autoparser - complete refactoring of parser architecture (#1376)

2026-04-22 10:04:13 +02:00

chat-auto-parser-helpers.cpp

Autoparser - complete refactoring of parser architecture (#1376)

2026-04-22 10:04:13 +02:00

chat-auto-parser-helpers.h

Autoparser - complete refactoring of parser architecture (#1376)

2026-04-22 10:04:13 +02:00

chat-auto-parser.h

Autoparser - complete refactoring of parser architecture (#1376)

2026-04-22 10:04:13 +02:00

chat-diff-analyzer.cpp

Autoparser - complete refactoring of parser architecture (#1376)

2026-04-22 10:04:13 +02:00

chat-peg-parser.cpp

Autoparser - complete refactoring of parser architecture (#1376)

2026-04-22 10:04:13 +02:00

chat-peg-parser.h

Autoparser - complete refactoring of parser architecture (#1376)

2026-04-22 10:04:13 +02:00

chat.cpp

fix: Kimi-K2 parser ignores enable_thinking=false, response goes to reasoning_content (#1686)

2026-04-24 17:37:29 +02:00

chat.h

Autoparser - complete refactoring of parser architecture (#1376)

2026-04-22 10:04:13 +02:00

CMakeLists.txt

Autoparser - complete refactoring of parser architecture (#1376)

2026-04-22 10:04:13 +02:00

common.cpp

fix: only inflate n_batch for GPU-offloaded mmproj, not CPU (#1788)

2026-05-13 09:08:42 +03:00

common.h

Add Expiring Logit Bias (#1731)

2026-05-06 09:25:38 +03:00

console.cpp

check C++ code with -Wmissing-declarations (#3184)

2023-09-15 15:38:27 -04:00

console.h

gguf : new file format with flexible meta data (beta) (#2398)

2023-08-21 23:07:43 +03:00

json-partial.cpp

common : introduce composable PEG parser combinators for chat parsing and new jinja template engine (#1369)

2026-03-09 11:03:33 +01:00

json-partial.h

Move minja and nlohmann/json to vendor (#802)

2025-09-27 09:12:35 +02:00

json-schema-to-grammar.cpp

Autoparser - complete refactoring of parser architecture (#1376)

2026-04-22 10:04:13 +02:00

json-schema-to-grammar.h

common : introduce composable PEG parser combinators for chat parsing and new jinja template engine (#1369)

2026-03-09 11:03:33 +01:00

llguidance.cpp

Tool calls support from mainline (#723)

2025-09-01 08:38:49 +03:00

log.cpp

Refactor chat and server file (#1062)

2025-12-15 08:27:20 +01:00

log.h

Server: refactor and rename functions (#1151)

2026-01-18 08:16:57 +02:00

ngram-cache.cpp

spec : add self speculative decoding, ngram and refactor (#1261)

2026-02-13 19:04:55 +01:00

ngram-cache.h

spec : add self speculative decoding, ngram and refactor (#1261)

2026-02-13 19:04:55 +01:00

ngram-map.cpp

Speculative checkpoints for recurrent models (#1669)

2026-04-24 09:59:30 +02:00

ngram-map.h

spec : add self speculative decoding, ngram and refactor (#1261)

2026-02-13 19:04:55 +01:00

ngram-mod.cpp

spec : add self speculative decoding, ngram and refactor (#1261)

2026-02-13 19:04:55 +01:00

ngram-mod.h

spec : add self speculative decoding, ngram and refactor (#1261)

2026-02-13 19:04:55 +01:00

peg-parser.cpp

Autoparser - complete refactoring of parser architecture (#1376)

2026-04-22 10:04:13 +02:00

peg-parser.h

Autoparser - complete refactoring of parser architecture (#1376)

2026-04-22 10:04:13 +02:00

reasoning-budget.cpp

Autoparser - complete refactoring of parser architecture (#1376)

2026-04-22 10:04:13 +02:00

reasoning-budget.h

Autoparser - complete refactoring of parser architecture (#1376)

2026-04-22 10:04:13 +02:00

regex-partial.cpp

Autoparser - complete refactoring of parser architecture (#1376)

2026-04-22 10:04:13 +02:00

regex-partial.h

Tool calls support from mainline (#723)

2025-09-01 08:38:49 +03:00

sampling.cpp

Use AVX2 when available for greedy speculative sampling (#1761)

2026-05-09 08:32:20 +03:00

sampling.h

Add Expiring Logit Bias (#1731)

2026-05-06 09:25:38 +03:00

spec-tuner.cpp

Pre-allocate buffers for hybrid model checkpoints (#1774)

2026-05-12 07:21:25 +03:00

spec-tuner.h

Pre-allocate buffers for hybrid model checkpoints (#1774)

2026-05-12 07:21:25 +03:00

speculative.cpp

Pre-allocate buffers for hybrid model checkpoints (#1774)

2026-05-12 07:21:25 +03:00

speculative.h

MTP: use target slot position for drafting (#1781)

2026-05-12 07:21:03 +03:00

suffix-tree.cpp

suffix-spec: load corpus in chunks (#1721)

2026-05-04 07:56:07 +03:00

suffix-tree.h

Self-decoding: Adds support for suffix decoding (#1646)

2026-04-18 16:10:10 +02:00

train.cpp

Server: refactor and rename functions (#1151)

2026-01-18 08:16:57 +02:00

train.h

sync : ggml (backend v2) (#3912)

2023-11-13 14:16:23 +02:00

unicode.cpp

Autoparser - complete refactoring of parser architecture (#1376)

2026-04-22 10:04:13 +02:00

unicode.h

Autoparser - complete refactoring of parser architecture (#1376)

2026-04-22 10:04:13 +02:00