From f478a3ec0b61725edd1f327f2defc686cff5bc86 Mon Sep 17 00:00:00 2001
From: ubergarm
Date: Wed, 13 May 2026 02:08:42 -0400
Subject: [PATCH] fix: only inflate n_batch for GPU-offloaded mmproj, not CPU (#1788)

The get_batch_ubatch() function unconditionally inflated n_batch and
n_ubatch whenever --mmproj was specified, regardless of whether the
mmproj model actually ran on the GPU. This boosted batch size applies to
both the main context and the MTP draft context, since
params_base.speculative.cparams_dft is derived from
common_context_params_to_llama(params_base).

When mmproj runs on CPU (--no-mmproj-offload), this batch inflation is
unnecessary for mmproj itself (CPU compute is sized by image dimensions
independently), but it still inflates the MTP compute buffer
proportionally. For large images (e.g. --image-max-tokens 4096), the MTP
compute buffer ballooned to ~2020 MiB and triggered an OOM even though
the mmproj model was fully on CPU and should have saved VRAM.

Restrict the batch inflation to
!params.mmproj.path.empty() && params.mmproj_use_gpu so it only triggers
when mmproj actually occupies GPU memory. When mmproj runs on CPU, the
existing per-chunk decode splitting in
mtmd_helper_decode_image_chunk_impl handles large images correctly with
the default batch size.

AI: ubergarm/Qwen3.6-27B-GGUF MTP IQ4_KS 15.113 GiB (4.752 BPW) + pi.dev
---
 common/common.cpp | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/common/common.cpp b/common/common.cpp
index bb8ed772..9785dcda 100644
--- a/common/common.cpp
+++ b/common/common.cpp
@@ -3531,8 +3531,8 @@ static std::pair get_batch_ubatch(const gpt_params & params) {
     if (params.n_ctx > 0) {
         n_batch = std::min(n_batch, params.n_ctx);
     }
-    if (!params.mmproj.path.empty()) {
-        // temporary fix for qwen mtmd
+    if (!params.mmproj.path.empty() && params.mmproj_use_gpu) {
+        // temporary fix for qwen mtmd (only when mmproj is on GPU)
         n_batch = std::max(n_batch, n_ubatch);
         n_ubatch = n_batch;
         fprintf(stdout, "Adjust batch size for mtmd: u_batch = %d, batch = %d\n", n_ubatch, n_batch);
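
Reviewer sketch (not part of the patch): the gating described in the commit message can be exercised in isolation with something like the following. This is a minimal standalone approximation, not the real function. The struct `batch_params_sketch`, the function name `get_batch_ubatch_sketch`, the flattened `mmproj_path` field, the default values 2048/512, and the (n_batch, n_ubatch) return order are all assumptions made for illustration. Only the n_ctx clamp and the `!path.empty() && mmproj_use_gpu` condition are taken from the hunk above; any logic that precedes the hunk in the real function is omitted.

```cpp
#include <algorithm>
#include <string>
#include <utility>

// Hypothetical stand-in for the few gpt_params fields the hunk reads.
// Field names and defaults are illustrative, not the project's.
struct batch_params_sketch {
    int         n_ctx          = 0;     // 0 means "no clamp" in this sketch
    int         n_batch        = 2048;  // assumed default
    int         n_ubatch       = 512;   // assumed default
    std::string mmproj_path;            // empty when --mmproj is not given
    bool        mmproj_use_gpu = true;  // false models --no-mmproj-offload
};

// Returns {n_batch, n_ubatch}; the ordering is an assumption.
static std::pair<int, int> get_batch_ubatch_sketch(const batch_params_sketch & p) {
    int n_batch  = p.n_batch;
    int n_ubatch = p.n_ubatch;

    // Clamp the logical batch to the context size, as in the hunk's context lines.
    if (p.n_ctx > 0) {
        n_batch = std::min(n_batch, p.n_ctx);
    }

    // Patched condition: inflate only when mmproj is present AND on the GPU.
    if (!p.mmproj_path.empty() && p.mmproj_use_gpu) {
        n_batch  = std::max(n_batch, n_ubatch);
        n_ubatch = n_batch;
    }
    return {n_batch, n_ubatch};
}
```

With these assumed defaults, a GPU-offloaded mmproj raises n_ubatch to match n_batch, while a CPU-side mmproj leaves both values untouched, which is the behavior the commit message relies on to keep the MTP compute buffer small.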