fix: only inflate n_batch for GPU-offloaded mmproj, not CPU (#1788)

The get_batch_ubatch() function unconditionally inflated n_batch and n_ubatch whenever --mmproj was specified, regardless of whether the mmproj model actually ran on the GPU. This boosted batch size applies to both the main context and the MTP draft context, since params_base.speculative.cparams_dft is derived from common_context_params_to_llama(params_base).

When mmproj runs on CPU (--no-mmproj-offload), this batch inflation is unnecessary for mmproj itself (CPU compute is sized by image dimensions independently), but it still inflates the MTP compute buffer proportionally. For large images (e.g. --image-max-tokens 4096), the MTP compute buffer ballooned to ~2020 MiB and triggered an OOM even though the mmproj model was fully on CPU and should have saved VRAM.

Restrict the batch inflation to !params.mmproj.path.empty() && params.mmproj_use_gpu so it only triggers when mmproj actually occupies GPU memory. When mmproj runs on CPU, the existing per-chunk decode splitting in mtmd_helper_decode_image_chunk_impl handles large images correctly with the default batch size.

AI: ubergarm/Qwen3.6-27B-GGUF MTP IQ4_KS 15.113 GiB (4.752 BPW) + pi.dev
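
For illustration only, the gating described above can be sketched as a standalone function. This is a minimal sketch, not the code in common/common.cpp: sketch_params and sketch_batch_ubatch are hypothetical names, the struct mirrors only the fields the diff touches, and the default values are assumptions chosen for the example.

```cpp
#include <algorithm>
#include <string>
#include <utility>

// Hypothetical stand-in for the few gpt_params fields involved here.
// Field defaults are illustrative assumptions, not the project's defaults.
struct sketch_params {
    int         n_ctx          = 0;
    int         n_batch        = 2048;
    int         n_ubatch       = 512;
    std::string mmproj_path;            // stands in for params.mmproj.path
    bool        mmproj_use_gpu = true;  // false models --no-mmproj-offload
};

// Returns {n_batch, n_ubatch} after applying the same two rules as the diff:
// clamp n_batch to n_ctx, then inflate only for a GPU-offloaded mmproj.
static std::pair<int, int> sketch_batch_ubatch(const sketch_params & p) {
    int n_batch  = p.n_batch;
    int n_ubatch = p.n_ubatch;

    if (p.n_ctx > 0) {
        n_batch = std::min(n_batch, p.n_ctx);
    }

    // Inflate only when an mmproj is configured AND it runs on the GPU.
    if (!p.mmproj_path.empty() && p.mmproj_use_gpu) {
        n_batch  = std::max(n_batch, n_ubatch);
        n_ubatch = n_batch;
    }
    return {n_batch, n_ubatch};
}
```

Under these assumed defaults, a CPU-side mmproj leaves the pair at {2048, 512}, while a GPU-offloaded mmproj raises n_ubatch to match n_batch, giving {2048, 2048}.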
2026-06-28 04:30:15 -05:00 · 2026-05-13 02:08:42 -04:00 · 2026-05-13 02:08:42 -04:00 · f478a3ec0b
commit f478a3ec0b
parent cdc288bc97
1 changed files with 2 additions and 2 deletions
--- a/common/common.cpp
+++ b/common/common.cpp
@@ -3531,8 +3531,8 @@ static std::pair<int, int> get_batch_ubatch(const gpt_params & params) {
    if (params.n_ctx > 0) {
        n_batch = std::min(n_batch, params.n_ctx);
    }
-    if (!params.mmproj.path.empty()) {
-        // temporary fix for qwen mtmd
+    if (!params.mmproj.path.empty() && params.mmproj_use_gpu) {
+        // temporary fix for qwen mtmd (only when mmproj is on GPU)
        n_batch = std::max(n_batch, n_ubatch);
        n_ubatch = n_batch;
        fprintf(stdout, "Adjust batch size for mtmd: u_batch = %d, batch = %d\n", n_ubatch, n_batch);