graph TD
START([Start]) --> ENV{LLAMA_HOTSWAP_ENABLED?}
ENV -->|No| ENDD([End])
ENV -->|Yes| LOAD[Registration Phase
llama_model_load]
subgraph Load_Time [Load Time]
LOAD --> REG[Populate model.reload->tensor_reload_sources
path / offset / mtime / nbytes]
end
REG --> CALL([User calls
llama_reload_changed_tensors])
CALL --> SNAP{Snapshots
done?}
SNAP -->|No| EAGER[snapshot_all_reload_tensors
Capture original_buffer / data / type / ne / nb
Capture original_splits
Discover MoE siblings via populate_moe_siblings]
EAGER --> DET
SNAP -->|Yes| DET
subgraph Detection [Detection Phase]
DET[reload_changed_tensors] --> STAT[For each registered tensor:
stat source file]
STAT --> CHG{mtime / mtime_ns
changed?}
CHG -->|No| SKIP[Skip]
CHG -->|Yes| META[gguf_find_tensor_meta
Parse GGUF header only
Get new offset / type / size / ne]
META --> DIM{"model ne[i] == file ne[i]?"}
DIM -->|No| SKIP2[Skip: dimension mismatch]
DIM -->|Yes| JOB[Add to job list
Mark returning =
new_type == original_type]
end
JOB --> SORT[Sort jobs
Returning to original FIRST]
subgraph Per_Tensor_Reload [Per-Tensor Reload Loop]
SORT --> LOOP[For each job:
reload_tensor name]
LOOP --> RET{Returning to
original?}
RET -->|Yes| OG_SPLIT{Is split tensor?
tensor->extra != nullptr}
OG_SPLIT -->|Yes| REATT_SP[reattach_split_tensor_to_shared
Restore original_buffer / data / extra
Restore original_splits
Free old private buffers ONLY]
OG_SPLIT -->|No| REATT_NS[Restore original_buffer / data
Restore original_type / ne / nb]
REATT_SP --> ST_ORIG[Set state = ON_ORIGINAL]
REATT_NS --> ST_ORIG
ST_ORIG --> MT[Update file mtime]
RET -->|No| TCHG{Type changed
from snapshot?}
TCHG -->|Yes| APPLY["apply_tensor_type_change
Update tensor->type / nb[]
If split & blck_size>1:
Re-round per-device ne[0] to block multiples"]
TCHG -->|No| KEEP[Keep current metadata]
APPLY --> READ[Read new bytes from disk
into host_buf]
KEEP --> READ
READ --> IS_SPLIT{Is split tensor?}
IS_SPLIT -->|Yes| SPATH[Split Path:
reload_tensor_split_path]
SPATH --> F_SP[Free old main & split buffers
ONLY if state != ON_ORIGINAL]
F_SP --> A_SP[Allocate new main buffer
alloc_buffer_fallback
GPU preferred, CPU fallback]
A_SP --> AL_SP[ggml_backend_tensor_alloc]
AL_SP --> C_SP["ggml_backend_tensor_set
host_buf -> device"]
C_SP --> SIB{Has MoE siblings
in this layer?}
SIB -->|Yes| RESYNC[resync_moe_sibling_splits]
SIB -->|No| ST_DET1[Set state = DETACHED]
subgraph MoE_Resync [MoE Sibling Resync]
RESYNC --> RRET{Is reference
returning to original?}
RRET -->|Yes| R_SIB[reattach_split_tensor_to_shared
for each sibling
Zero-copy restore]
RRET -->|No| PHA[Phase A: Detach siblings
Free old non-original buffers
Alloc new main handles
data = 0x1 dummy]
PHA --> PHB["Phase B: Propagate ref dimensions
to siblings
Recompute nb[] via temp ggml_context"]
PHB --> PHC[Phase C: Alloc per-device
GPU split buffers]
PHC --> PHF{Any GPU alloc
failed?}
PHF -->|Yes| PHD[Phase D: Move ENTIRE layer to CPU
Free GPU splits
Alloc CPU buffer
State = FALLBACK_CPU]
PHF -->|No| PHE[Phase E: ggml_backend_tensor_set
Write sibling data back]
PHD --> PHE
PHE --> ST_DET1
R_SIB --> ST_DET1
end
IS_SPLIT -->|No| NSPATH[Non-Split Path:
reload_tensor_non_split_path]
NSPATH --> F_NS[Free old buffer
ONLY if state != ON_ORIGINAL]
F_NS --> A_NS[Allocate new buffer
alloc_buffer_fallback]
A_NS --> AL_NS[ggml_backend_tensor_alloc]
AL_NS --> C_NS["ggml_backend_tensor_set
host_buf -> device"]
C_NS --> ST_DET2[Set state = DETACHED]
ST_DET2 --> MT
ST_DET1 --> MT
end
MT --> MORE{More jobs?}
MORE -->|Yes| LOOP
MORE -->|No| RELOADED{Any tensor
actually reloaded?}
RELOADED -->|No| ENDD
RELOADED -->|Yes| INV[ggml_backend_cuda_invalidate_graphs
Clear cuda_graphs on ALL devices]
INV --> CTX["Reset cached compute graphs
ctx->prev.reset()
ctx->prev_mtp.reset()"]
CTX --> REUSE[can_reuse_graph sees no cached graph
Forces full graph rebuild
on next eval]
REUSE --> ENDD