From c713bd599b9519c70ab8525312ae9f400c783db2 Mon Sep 17 00:00:00 2001
From: mb8565
Date: Fri, 26 Jun 2026 01:47:19 -0500
Subject: [PATCH] llama : fix CPU-only load crash on a CUDA build (device_mem out-of-bounds) (#2037)

Loading a model with no GPU layers on a binary built with CUDA crashes
in `llm_load_tensors`. The GPU-fit block is guarded by
`if (device_count > 0)`, but `device_count` comes from `model.splits`,
which always has at least one entry (`{1.0f}`). The memory array it
indexes, `device_mem`, is sized by `model.devices`, which is empty when
no GPU is present or when the model is loaded with `-ngl 0`. So the
block runs with `device_count >= 1` and reads `device_mem[0]` out of
bounds.

Repro: build with `-DGGML_CUDA=ON` on a host that has no usable GPU, or
hide the GPUs with `CUDA_VISIBLE_DEVICES=""`, then load any model. The
load segfaults inside the fit loop (confirmed with
DeepSeek-V2-Lite-Q4_K_M). With a real GPU present `model.devices` is
non-empty even at `-ngl 0`, so the crash needs the empty-device case.

The fix is to also require `!model.devices.empty()` before entering the
GPU-fit block. CPU-only placement is already handled earlier (all
layers go to the CPU when there are no GPU layers), so skipping this
block on a CPU-only load is correct. GPU loads still take the block
since `model.devices` is non-empty.

CPU-only loads on a CUDA build now finish and decode normally instead
of crashing.
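
For illustration only, a minimal standalone sketch of the guard change
described above. `ModelSketch`, `gpu_fit_guard_old` and
`gpu_fit_guard_new` are hypothetical names, not llama.cpp types or
functions, and the sketch assumes `device_count` is the number of
entries in `model.splits`.

```cpp
#include <vector>

// Hypothetical stand-in for the two model fields that matter here.
struct ModelSketch {
    std::vector<float> splits;   // assumed to always hold at least one entry, e.g. {1.0f}
    std::vector<int>   devices;  // empty when no GPU device is usable
};

// Old guard: looks only at the split count, so it is true even when the
// device list (and therefore device_mem) is empty.
bool gpu_fit_guard_old(const ModelSketch & m) {
    return m.splits.size() > 0;
}

// New guard: also requires a non-empty device list, so the block is
// skipped when there is nothing for device_mem[id] to refer to.
bool gpu_fit_guard_new(const ModelSketch & m) {
    return m.splits.size() > 0 && !m.devices.empty();
}
```

With `splits = {1.0f}` and an empty `devices`, the old guard lets the
block run and the new guard skips it.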

Co-authored-by: local-llm
Co-authored-by: Claude Opus 4.8
---
 src/llama.cpp | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/src/llama.cpp b/src/llama.cpp
index 245509e2..5b665c74 100644
--- a/src/llama.cpp
+++ b/src/llama.cpp
@@ -3564,7 +3564,12 @@ static bool llm_load_tensors(
     model.default_layer_device = std::vector(hparams.n_layer+1, device_count-1);
     int act_gpu_layers = std::min(n_gpu_layers, (int)n_layer + 1);
     std::vector overrides;
-    if (device_count > 0) {
+    // device_count comes from model.splits (at least 1), but device_mem below is sized by
+    // model.devices, which is empty on a CPU-only run of a CUDA build (no GPU present or
+    // -ngl 0). This block indexes device_mem[id] for id < device_count, so it reads out of
+    // bounds and crashes unless we also require a non-empty GPU device list. CPU-only
+    // placement is already handled above, so skipping this block is safe.
+    if (device_count > 0 && !model.devices.empty()) {
         std::vector experts;
         auto [layer_sizes, max_compute] = get_layer_sizes(ml, model, cache_type_k, cache_type_v, max_ctx_size, mla_attn, n_seq_max,
                 n_ubatch, amb, worst_case_tokens, flash_attn, experts);