From c713bd599b9519c70ab8525312ae9f400c783db2 Mon Sep 17 00:00:00 2001
From: mb8565 <mabovermobile@gmail.com>
Date: Fri, 26 Jun 2026 01:47:19 -0500
Subject: [PATCH] llama : fix CPU-only load crash on a CUDA build (device_mem
 out-of-bounds) (#2037)

Loading a model with no GPU layers on a binary built with CUDA crashes in
`llm_load_tensors`. The GPU-fit block is guarded by `if (device_count > 0)`, but
`device_count` comes from `model.splits`, which always has at least one entry
(`{1.0f}`). The memory array it indexes, `device_mem`, is sized by `model.devices`,
which is empty when no GPU is visible to the process. So the block runs with
`device_count >= 1` and reads `device_mem[0]` from an empty array, out of bounds.

Repro: build with `-DGGML_CUDA=ON` on a host that has no usable GPU, or hide the GPUs
with `CUDA_VISIBLE_DEVICES=""`, then load any model. The load segfaults inside the fit
loop (confirmed with DeepSeek-V2-Lite-Q4_K_M). With a real GPU present, `model.devices`
is non-empty even at `-ngl 0`, so `-ngl 0` alone does not trigger the crash; it needs
the empty-device case.

The fix is to also require `!model.devices.empty()` before entering the GPU-fit block.
CPU-only placement is already handled earlier (all layers go to the CPU when there are
no GPU layers), so skipping this block on a CPU-only load is correct.
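
A minimal standalone sketch of the guard, for illustration only. `model_sketch` and
`should_run_gpu_fit` are hypothetical names, not part of llama.cpp, and taking
`device_count` as `model.splits.size()` is an assumption based on the description above:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the two model fields involved; not the real struct.
struct model_sketch {
    std::vector<float> splits  = {1.0f}; // always has at least one entry
    std::vector<int>   devices;          // empty when no GPU is visible
};

// Returns true only when the GPU-fit block may index device_mem safely.
// Assumption: device_count is derived from the number of split entries.
static bool should_run_gpu_fit(const model_sketch & model) {
    const std::size_t device_count = model.splits.size();
    // The old guard checked only device_count > 0, which holds even when
    // devices (and therefore device_mem) is empty.
    return device_count > 0 && !model.devices.empty();
}
```

With an empty `devices` list the helper returns false, so the fit block is skipped; with
at least one device it returns true, as before.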

GPU loads still take the block since `model.devices` is non-empty. CPU-only loads on a
CUDA build now finish and decode normally instead of crashing.

Co-authored-by: local-llm <local-llm@local-llm-R740.cruvis.org>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
---
 src/llama.cpp | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/src/llama.cpp b/src/llama.cpp
index 245509e2..5b665c74 100644
--- a/src/llama.cpp
+++ b/src/llama.cpp
@@ -3564,7 +3564,12 @@ static bool llm_load_tensors(
     model.default_layer_device = std::vector<int32_t>(hparams.n_layer+1, device_count-1);
     int act_gpu_layers = std::min(n_gpu_layers, (int)n_layer + 1);
     std::vector<llama_model_tensor_buft_override> overrides;
-    if (device_count > 0) {
+    // device_count comes from model.splits (at least 1), but device_mem below is sized by
+    // model.devices, which is empty on a CPU-only run of a CUDA build (no usable GPU
+    // visible). This block indexes device_mem[id] for id < device_count, so it reads out of
+    // bounds and crashes unless we also require a non-empty GPU device list. CPU-only
+    // placement is already handled above, so skipping this block is safe.
+    if (device_count > 0 && !model.devices.empty()) {
         std::vector<expert_tensors> experts;
         auto [layer_sizes, max_compute] = get_layer_sizes(ml, model, cache_type_k, cache_type_v, max_ctx_size, mla_attn, n_seq_max, n_ubatch,
                 amb, worst_case_tokens, flash_attn, experts);