llama : fix CPU-only load crash on a CUDA build (device_mem out-of-bounds) (#2037)

Loading a model with no GPU layers on a binary built with CUDA crashes in
`llm_load_tensors`. The GPU-fit block is guarded by `if (device_count > 0)`, but
`device_count` comes from `model.splits`, which always has at least one entry
(`{1.0f}`). The memory array it indexes, `device_mem`, is sized by `model.devices`,
which is empty when no usable GPU is visible to the process. So the block runs with
`device_count >= 1` against an empty `device_mem` and reads `device_mem[0]` out of bounds.
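
The size mismatch can be reproduced in isolation. The sketch below is only an illustration: `toy_model` and `fit_loop_overruns` are hypothetical names, not part of the code base, and the checked `at()` accessor is used so the overrun surfaces as an exception instead of undefined behavior.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for the two model fields discussed above (not the real llama_model).
struct toy_model {
    std::vector<float> splits = {1.0f}; // always has at least one entry
    std::vector<int>   devices;         // empty when no GPU is visible
};

// Returns true if indexing device_mem with id < device_count would run past its end.
bool fit_loop_overruns(const toy_model & model) {
    const std::size_t device_count = model.splits.size();       // counted from splits
    std::vector<std::size_t> device_mem(model.devices.size(), 0); // sized from devices
    try {
        for (std::size_t id = 0; id < device_count; ++id) {
            (void) device_mem.at(id); // checked access; operator[] here would be undefined behavior
        }
    } catch (const std::out_of_range &) {
        return true;
    }
    return false;
}
```

With a default-constructed `toy_model` (no devices) the function returns true; once `devices` holds one entry it returns false.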

Repro: build with `-DGGML_CUDA=ON` on a host that has no usable GPU, or hide the GPUs
with `CUDA_VISIBLE_DEVICES=""`, then load any model. The load segfaults inside the fit
loop (confirmed with DeepSeek-V2-Lite-Q4_K_M). With a real GPU present `model.devices`
is non-empty even at `-ngl 0`, so the crash needs the empty-device case.

The fix is to also require `!model.devices.empty()` before entering the GPU-fit block.
CPU-only placement is already handled earlier (all layers go to the CPU when there are
no GPU layers), so skipping this block on a CPU-only load is correct.
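
As a rough sketch of the guard, assuming a hypothetical helper `should_enter_gpu_fit` (the actual change writes the condition inline in `llm_load_tensors`), with `n_splits` standing in for `model.splits.size()` and `devices` for `model.devices`:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper, for illustration only.
// The GPU-fit block should run only when there is at least one split
// and at least one GPU device to back device_mem.
bool should_enter_gpu_fit(std::size_t n_splits, const std::vector<int> & devices) {
    const std::size_t device_count = n_splits; // mirrors device_count being derived from splits
    return device_count > 0 && !devices.empty();
}
```

A CPU-only load (`n_splits == 1`, empty `devices`) yields false and skips the block; any load with a non-empty device list still enters it.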

GPU loads still take the block since `model.devices` is non-empty. CPU-only loads on a
CUDA build now finish and decode normally instead of crashing.

Co-authored-by: local-llm <local-llm@local-llm-R740.cruvis.org>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
mb8565 2026-06-26 01:47:19 -05:00 committed by GitHub
parent 0ffdf509ab
commit c713bd599b
GPG Key ID: B5690EEEBB952194

@@ -3564,7 +3564,12 @@ static bool llm_load_tensors(
     model.default_layer_device = std::vector<int32_t>(hparams.n_layer+1, device_count-1);
     int act_gpu_layers = std::min(n_gpu_layers, (int)n_layer + 1);
     std::vector<llama_model_tensor_buft_override> overrides;
-    if (device_count > 0) {
+    // device_count comes from model.splits (at least 1), but device_mem below is sized by
+    // model.devices, which is empty on a CPU-only run of a CUDA build (no GPU present or
+    // -ngl 0). This block indexes device_mem[id] for id < device_count, so it reads out of
+    // bounds and crashes unless we also require a non-empty GPU device list. CPU-only
+    // placement is already handled above, so skipping this block is safe.
+    if (device_count > 0 && !model.devices.empty()) {
         std::vector<expert_tensors> experts;
         auto [layer_sizes, max_compute] = get_layer_sizes(ml, model, cache_type_k, cache_type_v, max_ctx_size, mla_attn, n_seq_max, n_ubatch,
                 amb, worst_case_tokens, flash_attn, experts);