From f478a3ec0b61725edd1f327f2defc686cff5bc86 Mon Sep 17 00:00:00 2001
From: ubergarm <leimgrub@gmail.com>
Date: Wed, 13 May 2026 02:08:42 -0400
Subject: [PATCH] fix: only inflate n_batch for GPU-offloaded mmproj, not CPU
 (#1788)

The get_batch_ubatch() function unconditionally inflated n_batch and
n_ubatch whenever --mmproj was specified, regardless of whether the
mmproj model actually ran on the GPU. The inflated batch size applies
to both the main context and the MTP draft context, because
params_base.speculative.cparams_dft is derived from
common_context_params_to_llama(params_base).

When mmproj runs on CPU (--no-mmproj-offload), this batch inflation
is unnecessary for mmproj itself, because its CPU compute is sized by
the image dimensions rather than by n_batch, but it still inflates the
MTP compute buffer proportionally. With large images (e.g.
--image-max-tokens 4096), the MTP compute buffer ballooned to
~2020 MiB and triggered an OOM, even though the mmproj model ran
entirely on the CPU and should have reduced VRAM use.

Restrict the batch inflation to !params.mmproj.path.empty() &&
params.mmproj_use_gpu so it only triggers when mmproj actually occupies
GPU memory. When mmproj runs on CPU, the existing per-chunk decode
splitting in mtmd_helper_decode_image_chunk_impl handles large images
correctly with the default batch size.
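
For illustration only, the gating described above can be sketched as a
standalone function. sketch_params and sketch_batch_ubatch are
hypothetical stand-ins for gpt_params and get_batch_ubatch(), the
default values are assumed for the example, and the sketch models only
the lines visible in this hunk, not the full function:

```cpp
#include <algorithm>
#include <string>
#include <utility>

// Hypothetical, trimmed-down stand-in for gpt_params. Only the fields
// that this patch touches are modeled; the defaults are illustrative.
struct sketch_params {
    int         n_ctx          = 0;
    int         n_batch        = 2048;
    int         n_ubatch       = 512;
    std::string mmproj_path;           // stands in for params.mmproj.path
    bool        mmproj_use_gpu = true;
};

// Returns {n_batch, n_ubatch} after applying the patched condition.
static std::pair<int, int> sketch_batch_ubatch(const sketch_params & p) {
    int n_batch  = p.n_batch;
    int n_ubatch = p.n_ubatch;
    if (p.n_ctx > 0) {
        n_batch = std::min(n_batch, p.n_ctx);
    }
    // Inflate only when an mmproj is loaded AND it is offloaded to the GPU.
    if (!p.mmproj_path.empty() && p.mmproj_use_gpu) {
        n_batch  = std::max(n_batch, n_ubatch);
        n_ubatch = n_batch;
    }
    return {n_batch, n_ubatch};
}
```

Under these assumed defaults, a CPU-only mmproj leaves n_ubatch at 512,
while a GPU-offloaded mmproj raises it to match n_batch (2048).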

AI: ubergarm/Qwen3.6-27B-GGUF MTP IQ4_KS 15.113 GiB (4.752 BPW) + pi.dev
---
 common/common.cpp | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/common/common.cpp b/common/common.cpp
index bb8ed772..9785dcda 100644
--- a/common/common.cpp
+++ b/common/common.cpp
@@ -3531,8 +3531,8 @@ static std::pair<int, int> get_batch_ubatch(const gpt_params & params) {
     if (params.n_ctx > 0) {
         n_batch = std::min(n_batch, params.n_ctx);
     }
-    if (!params.mmproj.path.empty()) {
-        // temporary fix for qwen mtmd
+    if (!params.mmproj.path.empty() && params.mmproj_use_gpu) {
+        // temporary fix for qwen mtmd (only when mmproj is on GPU)
         n_batch = std::max(n_batch, n_ubatch);
         n_ubatch = n_batch;
         fprintf(stdout, "Adjust batch size for mtmd: u_batch = %d, batch = %d\n", n_ubatch, n_batch);