graph TD
    START([Start]) --> ENV{LLAMA_HOTSWAP_ENABLED?}
    ENV -->|No| ENDD([End])
    ENV -->|Yes| LOAD["Registration Phase<br/>llama_model_load"]

    subgraph Load_Time [Load Time]
        LOAD --> REG["Populate model.reload->tensor_reload_sources<br/>path / offset / mtime / nbytes"]
    end

    REG --> CALL(["User calls<br/>llama_reload_changed_tensors"])
    CALL --> SNAP{"Snapshots<br/>done?"}
    SNAP -->|No| EAGER["snapshot_all_reload_tensors<br/>Capture original_buffer / data / type / ne / nb<br/>Capture original_splits<br/>Discover MoE siblings via populate_moe_siblings"]
    SNAP -->|Yes| DET
    EAGER --> DET

    subgraph Detection [Detection Phase]
        DET[reload_changed_tensors] --> STAT["For each registered tensor:<br/>stat source file"]
        STAT --> CHG{"mtime / mtime_ns<br/>changed?"}
        CHG -->|No| SKIP[Skip]
        CHG -->|Yes| META["gguf_find_tensor_meta<br/>Parse GGUF header only<br/>Get new offset / type / size / ne"]
        META --> DIM{"model ne[i] == file ne[i]?"}
        DIM -->|No| SKIP2["Skip: dimension mismatch"]
        DIM -->|Yes| JOB["Add to job list<br/>Mark returning =<br/>new_type == original_type"]
    end

    JOB --> SORT["Sort jobs<br/>Returning to original FIRST"]

    subgraph Per_Tensor_Reload [Per-Tensor Reload Loop]
        SORT --> LOOP["For each job:<br/>reload_tensor name"]
        LOOP --> RET{"Returning to<br/>original?"}

        RET -->|Yes| OG_SPLIT{"Is split tensor?<br/>tensor->extra != nullptr"}
        OG_SPLIT -->|Yes| REATT_SP["reattach_split_tensor_to_shared<br/>Restore original_buffer / data / extra<br/>Restore original_splits<br/>Free old private buffers ONLY"]
        OG_SPLIT -->|No| REATT_NS["Restore original_buffer / data<br/>Restore original_type / ne / nb"]
        REATT_SP --> ST_ORIG["Set state = ON_ORIGINAL"]
        REATT_NS --> ST_ORIG
        ST_ORIG --> MT[Update file mtime]

        RET -->|No| TCHG{"Type changed<br/>from snapshot?"}
        TCHG -->|Yes| APPLY["apply_tensor_type_change<br/>Update tensor->type / nb[]<br/>If split & blck_size>1:<br/>Re-round per-device ne[0] to block multiples"]
        TCHG -->|No| KEEP[Keep current metadata]
        APPLY --> READ["Read new bytes from disk<br/>into host_buf"]
        KEEP --> READ
        READ --> IS_SPLIT{"Is split tensor?"}

        IS_SPLIT -->|Yes| SPATH["Split Path:<br/>reload_tensor_split_path"]
        SPATH --> F_SP["Free old main & split buffers<br/>ONLY if state != ON_ORIGINAL"]
        F_SP --> A_SP["Allocate new main buffer<br/>alloc_buffer_fallback<br/>GPU preferred, CPU fallback"]
        A_SP --> AL_SP[ggml_backend_tensor_alloc]
        AL_SP --> C_SP["ggml_backend_tensor_set<br/>host_buf -> device"]
        C_SP --> SIB{"Has MoE siblings<br/>in this layer?"}
        SIB -->|Yes| RESYNC[resync_moe_sibling_splits]
        SIB -->|No| ST_DET1["Set state = DETACHED"]

        subgraph MoE_Resync [MoE Sibling Resync]
            RESYNC --> RRET{"Is reference<br/>returning to original?"}
            RRET -->|Yes| R_SIB["reattach_split_tensor_to_shared<br/>for each sibling<br/>Zero-copy restore"]
            RRET -->|No| PHA["Phase A: Detach siblings<br/>Free old non-original buffers<br/>Alloc new main handles<br/>data = 0x1 dummy"]
            PHA --> PHB["Phase B: Propagate ref dimensions<br/>to siblings<br/>Recompute nb[] via temp ggml_context"]
            PHB --> PHC["Phase C: Alloc per-device<br/>GPU split buffers"]
            PHC --> PHF{"Any GPU alloc<br/>failed?"}
            PHF -->|Yes| PHD["Phase D: Move ENTIRE layer to CPU<br/>Free GPU splits<br/>Alloc CPU buffer<br/>State = FALLBACK_CPU"]
            PHF -->|No| PHE["Phase E: ggml_backend_tensor_set<br/>Write sibling data back"]
            PHD --> PHE
            PHE --> ST_DET1
            R_SIB --> ST_DET1
        end

        IS_SPLIT -->|No| NSPATH["Non-Split Path:<br/>reload_tensor_non_split_path"]
        NSPATH --> F_NS["Free old buffer<br/>ONLY if state != ON_ORIGINAL"]
        F_NS --> A_NS["Allocate new buffer<br/>alloc_buffer_fallback"]
        A_NS --> AL_NS[ggml_backend_tensor_alloc]
        AL_NS --> C_NS["ggml_backend_tensor_set<br/>host_buf -> device"]
        C_NS --> ST_DET2["Set state = DETACHED"]
        ST_DET2 --> MT
        ST_DET1 --> MT
    end

    MT --> MORE{More jobs?}
    MORE -->|Yes| LOOP
    MORE -->|No| RELOADED{"Any tensor<br/>actually reloaded?"}
    RELOADED -->|No| ENDD
    RELOADED -->|Yes| INV["ggml_backend_cuda_invalidate_graphs<br/>Clear cuda_graphs on ALL devices"]
    INV --> CTX["Reset cached compute graphs<br/>ctx->prev.reset()<br/>ctx->prev_mtp.reset()"]
    CTX --> REUSE["can_reuse_graph sees no cached graph<br/>Forces full graph rebuild<br/>on next eval"]
    REUSE --> ENDD