llama : fix CPU-only load crash on a CUDA build (device_mem out-of-bounds) (#2037)

Loading a model with no GPU layers on a binary built with CUDA crashes in `llm_load_tensors`.

The GPU-fit block is guarded by `if (device_count > 0)`, but `device_count` comes from `model.splits`, which always has at least one entry (`{1.0f}`). The memory array it indexes, `device_mem`, is sized by `model.devices`, which is empty when no GPU is present or when the model is loaded with `-ngl 0`. So the block runs with `device_count >= 1` and reads `device_mem[0]` out of bounds.

Repro: build with `-DGGML_CUDA=ON` on a host that has no usable GPU, or hide the GPUs with `CUDA_VISIBLE_DEVICES=""`, then load any model. The load segfaults inside the fit loop (confirmed with DeepSeek-V2-Lite-Q4_K_M). With a real GPU present, `model.devices` is non-empty even at `-ngl 0`, so the crash needs the empty-device case.

The fix is to also require `!model.devices.empty()` before entering the GPU-fit block. CPU-only placement is already handled earlier (all layers go to the CPU when there are no GPU layers), so skipping this block on a CPU-only load is correct. GPU loads still take the block, since `model.devices` is non-empty for them.

CPU-only loads on a CUDA build now finish and decode normally instead of crashing.

Co-authored-by: local-llm <local-llm@local-llm-R740.cruvis.org>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 04:30:15 -05:00 · 2026-06-26 01:47:19 -05:00 · 2026-06-26 01:47:19 -05:00 · c713bd599b
commit c713bd599b
parent 0ffdf509ab
1 changed file with 6 additions and 1 deletion
--- a/src/llama.cpp
+++ b/src/llama.cpp
@@ -3564,7 +3564,12 @@ static bool llm_load_tensors(
    model.default_layer_device = std::vector<int32_t>(hparams.n_layer+1, device_count-1);
    int act_gpu_layers = std::min(n_gpu_layers, (int)n_layer + 1);
    std::vector<llama_model_tensor_buft_override> overrides;
-    if (device_count > 0) {
+    // device_count comes from model.splits (at least 1), but device_mem below is sized by
+    // model.devices, which is empty on a CPU-only run of a CUDA build (no GPU present or
+    // -ngl 0). This block indexes device_mem[id] for id < device_count, so it reads out of
+    // bounds and crashes unless we also require a non-empty GPU device list. CPU-only
+    // placement is already handled above, so skipping this block is safe.
+    if (device_count > 0 && !model.devices.empty()) {
        std::vector<expert_tensors> experts;
        auto [layer_sizes, max_compute] = get_layer_sizes(ml, model, cache_type_k, cache_type_v, max_ctx_size, mla_attn, n_seq_max, n_ubatch,
                amb, worst_case_tokens, flash_attn, experts);
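
The mismatch between the two guards can be reproduced outside llama.cpp with a small stand-alone sketch. Everything in it is hypothetical and for illustration only: `toy_model`, `enters_fit_block_old`, `enters_fit_block_new`, and `fit_indices_in_bounds` are not names from the source tree, and the sketch assumes that `device_count` equals the number of entries in `splits` and that `device_mem` has one slot per entry in `devices`, as the commit message describes.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the two model fields involved; not the real llama_model.
struct toy_model {
    std::vector<float> splits;   // assumed to always hold at least one entry, e.g. {1.0f}
    std::vector<int>   devices;  // GPU device list; empty on a CPU-only run
};

// Old guard: looks only at the count derived from splits.
bool enters_fit_block_old(const toy_model & m) {
    const std::size_t device_count = m.splits.size();
    return device_count > 0;
}

// New guard: additionally requires a non-empty GPU device list.
bool enters_fit_block_new(const toy_model & m) {
    const std::size_t device_count = m.splits.size();
    return device_count > 0 && !m.devices.empty();
}

// True when every index the fit loop would touch (id < device_count)
// lies inside device_mem, which is sized by the device list.
bool fit_indices_in_bounds(const toy_model & m) {
    assert(!m.splits.empty());  // precondition taken from the commit message
    const std::size_t device_count = m.splits.size();
    const std::vector<std::size_t> device_mem(m.devices.size(), 0);
    return device_count <= device_mem.size();
}
```

For a CPU-only model such as `toy_model{{1.0f}, {}}`, the old guard lets the block run even though `fit_indices_in_bounds` is false, while the new guard skips it. For `toy_model{{1.0f}, {0}}`, both guards enter the block and the indices stay in bounds.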