on-demand tensor reload (#1989)

* host-swap tensor loop

the host-swap functionality is only triggered when the certain env. variables are declared

* target_include_directories tweak

* hot-swap tensor support

two intrusions:
1.) at the model loading to collect the snapshot
2.) the modification of the `/health` HTTP endpoint to be able to trigger the hot-swap via sending the `llama-server` the HTTP-request.
*both a braced by the specific env. variables

* hot-swap tensor support; graph invalidation

ggml_backend_cuda_invalidate_graphs export

* hot-swap tensor support

graph invalidation implementation;  extended debug output (commented out)

* llama_reload_changed_tensors export

* tensor hot-swap on-demand reload

cpu-only/hybrid/gpu-only with split mode layer/graph full support implementation

* docs

* reuse the gguf parsing from llama.cpp

gguf_init_from_file, gguf_find_tensor, ggml_get_tensor

* remove the manual scheduling for hybrid inference

* update docs

* tensor shape validation

* update docs

* update docs

accidentally wiped the previous changes;  so recovered them

* revert the GGML_CUDA_MAX_DEVICES to 16

* update llama_reload_changed_tensor

update llama_reload_changed_tensor, revert CMakeLists.txt

* update llama_reload_changed_tensor

* GGML_MAX_SRC

GGML_MAX_SRC compile-time definition support

* GGML_MAX_SRC

GGML_MAX_SRC compile-time definition support

* GGML_MAX_SRC

GGML_MAX_SRC compile-time definition support

* llama_reload_changed_tensor

update llama_reload_changed_tensor definition

* refactory

move the tensor-reloading implementation to llama-reload.cpp, llama-reload-info.h;  some bugfixes and code reduction

* revert

added back the missing newline

* update docs

* reload_info constructor

* bugfix: cpu-only

TODO: improve the working environment by compiling for multiple hardware configurations;  possibly make a test pipeline

* cpu-only bugfix

set the fix again after unsuccessful sync with main

* windows os compilation fix

#include <string>

* fix windows os build

error C2039: 'string': is not a member of 'std'

* remove dead file

* implement perplexity in server

* Revert "implement perplexity in server"
This commit is contained in:
magikRUKKOLA 2026-06-22 14:36:34 +00:00 committed by GitHub
parent 6c00e87ac8
commit 72440a19fc
No known key found for this signature in database
GPG Key ID: B5690EEEBB952194
19 changed files with 1578 additions and 16 deletions

View File

@ -0,0 +1,343 @@
# On-Demand Tensor Reload
## Overview
This patch introduces **selective tensor hot-swapping** for `ik_llama.cpp` models, now with full support for `graph`/`layer` split mode.
It allows individual tensors (or groups of tensors) to be reloaded from their original on-disk GGUF files **without tearing down the process, the `llama_model`, or the `llama_context`**. Tensors may reside on any backend—GPU, CPU, or split across multiple GPUs—and the reload logic preserves that placement.
This is primarily intended for:
* Iterative experimentation and LoRA-like surgical updates.
* Dynamic MoE (Mixture-of-Experts) expert swapping.
* **Mixed-quantization perplexity benchmarks**, where the bulk of a model lives in one quant (e.g., Q4_X) on GPU while individual experts are hot-swapped one-by-one into a different quant (e.g., IQ1_KT) to measure isolated quality impact.
---
## Motivation
Standard `ik_llama.cpp` workflows require restarting the entire executable to pick up new weights. For large models distributed across multiple GPUs—or models that spill into CPU memory—this incurs significant downtime. This patch solves that by:
1. **Tracking provenance**: At load time, every tensor is mapped back to its source GGUF shard, byte offset, and modification time.
2. **Detecting changes**: At runtime, it cheaply `stat()`s the source files to see if a tensors backing data has changed.
3. **Surgical replacement**: Only the changed tensors are re-mapped/re-allocated. The rest of the model stays resident in GPU/CPU memory.
4. **Graph safety**: Cached CUDA graphs are invalidated and the contexts cached compute graphs (`ctx->prev` / `ctx->prev_mtp`) are reset so that the next evaluation rebuilds the graph with the new buffer pointers, sizes, or types.
---
## High-Level Architecture
The patch adds a `reload_info` registry to `llama_model` (defined in `src/llama-reload-info.h`). The lifecycle has five phases:
### 1. Registration Phase (`llama_model_load`)
During model loading, every weight that is successfully mapped gets an entry in `model.reload->tensor_reload_sources` **only when the environment variable `LLAMA_HOTSWAP_ENABLED` is set**:
```cpp
struct tensor_reload_source {
std::string path; // Absolute path to the GGUF shard
size_t data_offset; // Byte offset of the tensor data in the file
size_t nbytes; // Current byte size
int64_t last_mtime; // Last modification time (seconds)
int64_t last_mtime_ns; // Nanosecond precision on Linux
// Snapshots of the *original* loaded state so we can reattach later
ggml_backend_buffer_t original_buffer;
void * original_data;
ggml_type original_type;
int64_t original_ne[GGML_MAX_DIMS];
size_t original_nb[GGML_MAX_DIMS];
ggml_split_tensor_t * original_extra;
std::vector<split_info> original_splits;
std::vector<std::string> sibling_names; // MoE siblings
reload_state state;
};
```
### 2. Snapshot Phase (`snapshot_all_reload_tensors`)
The first time a reload is requested, an **eager snapshot** is taken of every registered tensor and its MoE siblings. This captures the original buffer handles, split descriptors, and strides. This snapshot is essential for:
* **Reattachment**: If a tensor was detached to a private buffer because it grew, but later shrinks back to its original size/type, it can be reattached to the original shared buffer, avoiding memory fragmentation.
* **MoE consistency**: MoE layers often have three sibling tensors (`ffn_down_exps`, `ffn_up_exps`, `ffn_gate_exps`) that must share the same split topology across GPUs.
### 3. Detection Phase (`reload_changed_tensors`)
When the user (or the server health-check loop) calls `llama_reload_changed_tensors()`:
1. It iterates over the registry and `stat()`s each source file.
2. If `mtime` (or `mtime_ns`) differs, it re-parses the GGUF header (`gguf_find_tensor_meta`) to get the new `offset`, `nbytes`, `ggml_type`, and on-disk shape (`ne`).
3. **Shape verification**: If the on-disk dimensions differ from the model tensor (`file_ne[i] != tensor->ne[i]`), the tensor is skipped entirely; the reload logic refuses to change logical shapes.
4. It builds a **sorted job list**: tensors that are **returning to their original snapshot** are processed first. This maximizes the chance of freeing private buffers before allocating new ones, reducing memory pressure.
### 4. Reload Phase (`reload_tensor`)
For each changed tensor, the patch performs a careful in-place update.
#### 0. Shape Verification
Before any metadata or buffer changes, the code verifies that the on-disk `ne[0..3]` exactly match the current model tensor. If any dimension differs, the reload is aborted with a log message and the tensor is left untouched.
#### A. Returning Check
The first decision is whether the tensor's new on-disk type matches its **original** snapshot type (`curr_type == src.original_type`).
* **Returning to original**: The tensor is reattached to its original shared buffer and original split descriptors. Any private buffer allocated during a previous reload is freed (only if the tensor's state is `DETACHED` or `FALLBACK_CPU`). State becomes `ON_ORIGINAL`.
* **Changed**: Proceed to metadata update and buffer reallocation.
#### B. Metadata Update & Block-Size Alignment
If the tensors `ggml_type` changed (e.g., Q4_X → IQ1_KT), the main tensor descriptor and all its split descriptors are updated with new `type` and `nb` values. The logical shape (`ne`) is guaranteed unchanged by the preceding shape verification. However, for fused/multi-GPU splits the per-device boundaries must be recalculated.
**Critical constraint for fused/multi-GPU splits:**
Different quants use different block sizes:
* **Q4_X / Q4_0**: block size **32**
* **IQ1_KT**: block size **256**
When a tensor changes between these types, `apply_tensor_type_change()` re-rounds every GPU slices `ne[0]` to the nearest multiple of the new block size. If this redistribution is not propagated to all siblings in the same MoE layer, the CUDA split backend dispatches rows to the wrong devices and **matmul fails**.
#### C. Buffer Lifecycle
The patch tracks each tensor with a `reload_state` enum (`UNINITIALIZED`, `ON_ORIGINAL`, `DETACHED`, `FALLBACK_CPU`). Buffers are only freed if the state is not `ON_ORIGINAL`, ensuring shared original buffers are never corrupted.
| Scenario | Action |
|----------|--------|
| Returning to original snapshot | **Reattach** to `original_buffer`, restore original splits, free old private buffer if any. |
| Changed type/size while previously on original | **Detach** from the shared buffer to a newly allocated private buffer so the shared region isnt corrupted for other tensors. |
| Changed type/size while already detached | Free old private buffer, allocate new one. |
| Allocation fails on target backend | **CPU fallback**: allocate on `ggml_backend_cpu_buffer_type()` and clear split metadata. State becomes `FALLBACK_CPU`. |
#### D. Split Tensor (Multi-GPU) Handling
For split tensors, the patch:
- Recomputes per-device bounds using the new block-size alignment.
- Reallocates per-device split buffers if necessary.
- **Resyncs MoE siblings**: If `ffn_down_exps` changes its split topology, `ffn_up_exps` and `ffn_gate_exps` in the same layer are forced to adopt identical per-device `ne[0]` distributions and strides. This is required by the CUDA split-backend contract.
#### E. Data Copy
Finally, the tensor bytes are read from the updated file and copied into the (possibly new) backend buffer via `ggml_backend_tensor_set`.
---
## Hybrid CPU/GPU Inference
When running with `--split-mode layer --fit --gpu-layers 99` (or any configuration where the model does not fully fit in VRAM), some tensors naturally land in CPU memory. The hot-swap system fully supports this:
* **CPU tensors are reloadable**: The reload logic reads the new data from disk and copies it into the CPU backend buffer exactly as it would for CUDA buffers.
* **Fallback allocator**: If a GPU buffer allocation fails during a reload (e.g., because an IQ1_KT expert is larger than the original Q4_X expert), the system automatically falls back to a CPU buffer for that tensor.
This allows you to keep, for example, 90 % of an MoE model on 13 GPUs while a few large expert tensors cycle through CPU RAM, or to benchmark quants that vary in size per-expert without worrying about exact VRAM fitting.
---
## API & Environment Variables
### Public C API
```cpp
// include/llama.h
LLAMA_API bool llama_reload_changed_tensors(struct llama_context * ctx);
```
Returns `true` if at least one tensor was reloaded. When this happens, the function also resets the contexts cached compute graphs (`ctx->prev` and `ctx->prev_mtp`) so that the next evaluation performs a full graph rebuild with the new tensor pointers.
### Environment Variables
| Variable | Purpose |
|----------|---------|
| `LLAMA_HOTSWAP_ENABLED` | Enables the hot-swap loop in `perplexity` and the health-check hook in `server`. |
| `LLAMA_PERPLEXITY_PRE_RELOAD_SCRIPT` | Path to an executable script run between perplexity iterations (e.g., to regenerate/re-quantize a tensor file). |
---
## Integration Points
### `examples/perplexity/perplexity.cpp`
When `LLAMA_HOTSWAP_ENABLED` is set, the tool runs in a loop:
1. Perform an initial `llama_reload_changed_tensors()` to apply any pending changes before the first evaluation.
2. Compute perplexity (or Hellaswag, etc.).
3. Print timings and write logs.
4. Execute the optional pre-reload script.
5. Call `llama_reload_changed_tensors(ctx)`. If no tensors changed, exit; otherwise repeat from step 2.
### `examples/server/server.cpp`
On every health-check (`/health`) request, if `LLAMA_HOTSWAP_ENABLED` is set, the server calls `llama_reload_changed_tensors()`. This provides a convenient, external trigger: simply `touch` or overwrite a tensors source GGUF file and poll `/health` to apply the change.
---
## MoE Sibling Resync
MoE weights are often stored as three separate tensors that must be split identically across GPUs. The patch automatically detects these families by suffix:
- `.ffn_down_exps.weight`
- `.ffn_up_exps.weight`
- `.ffn_gate_exps.weight`
When one member of the family is reloaded and its per-device split dimensions change—especially when crossing quant types with different block sizes (Q4_X=32 vs IQ1_KT=256)—`resync_moe_sibling_splits()` is invoked. The logic follows these steps:
1. **Fast path**: If the reference tensor is returning to its original snapshot, the siblings are also reattached to their original snapshots via `reattach_split_tensor_to_shared()`—no data movement is required.
2. **Phase A Detach**: Siblings are detached from shared buffers (freeing only non-original buffers) and new main handles are allocated. Split tensors receive a dummy `data` pointer because the split backend uses `extra->splits`.
3. **Phase B Propagate dimensions**: The reference tensors per-device `ne[0]` distribution is copied to the siblings, and strides (`nb[]`) are recomputed using a temporary `ggml_context`. This step is mandatory because the valid split boundaries depend on the quantization block size.
4. **Phase C Allocate GPU splits**: New per-device GPU buffers are allocated for each sibling split.
5. **Phase D CPU fallback (if needed)**: If any GPU allocation fails, the **entire** sibling group is moved to CPU buffers to maintain consistency.
6. **Phase E Write back**: The original sibling data (which has not changed, only the layout) is written back into the new buffers via `ggml_backend_tensor_set`.
---
## Buffer Lifecycle Details
### Reattachment to Shared Buffers
If a tensor was originally loaded in a large shared GGUF buffer alongside other tensors, and it was previously detached because it grew, the patch attempts to **reattach** it when it returns to its original size and type. This is done by restoring:
- `tensor->buffer = original_buffer`
- `tensor->data = original_data`
- `tensor->extra = original_extra` (restoring all split descriptors)
This prevents unbounded memory growth during iterative experiments where tensors oscillate between two states.
### State Machine
Because `ggml` does not provide native reference counting on buffers, the patch uses a per-tensor state machine to avoid corrupting shared allocations:
* `ON_ORIGINAL`: The tensor still lives in its initial shared buffer. This buffer is **never** freed during reload.
* `DETACHED`: The tensor was moved to a privately allocated buffer. This buffer **is** freed before the next reload.
* `FALLBACK_CPU`: The tensor was moved to CPU memory after a GPU allocation failure.
Only buffers belonging to tensors in the `DETACHED` or `FALLBACK_CPU` states are released, ensuring that shared original buffers remain valid for all other tensors that still reference them.
---
## Limitations & Safety Notes
1. **File path stability**: The source file must remain at the same path. Renaming or removing shards will cause `stat()` or `open()` to fail.
2. **No locking**: There is no file-locking protocol. The user must ensure the GGUF file is not being written to while `ik_llama.cpp` is reading it.
3. **Graph rebuild cost**: While cheaper than a full process restart, rebuilding the CUDA graph (or CPU graph) incurs a one-time latency spike after a reload.
4. **Platform specifics**: Nanosecond mtime checks use `st_mtim.tv_nsec` and are guarded by `#ifdef __linux__`.
5. **Thread safety**: `llama_reload_changed_tensors` is **not** thread-safe with active inference. Ensure the context is idle before calling (the perplexity example naturally guarantees this; the server example only invokes it during the synchronous `/health` handler).
---
## Usage Example: Per-Expert Quantization Sweep (Q4_X ↔ IQ1_KT)
This example benchmarks a massive MoE model where the base weights are **Q4_X**. The tool iteratively replaces individual `ffn_down_exps.weight` tensors with **IQ1_KT** equivalents to measure the isolated perplexity impact of each expert's quantization level.
A sanity check is embedded in the source directory: one of the "IQ1_KT" shard files is actually the original **Q4_X** tensor. When the rotation reaches that slot, the reloaded tensor is byte-for-byte identical to the baseline, so the PPL must match exactly—confirming that the hot-swap machinery introduces no loss.
### 1. Helper script (`tensor-swap.sh`)
Place the rotation script in your model directory (e.g., `/opt/THIREUS/Kimi-K2.6/Q4_X/`). It maintains `.bak` files so that each iteration restores the previous tensor before installing the next candidate.
```bash
#!/bin/bash
set -euo pipefail
TARGET_GLOB="*Q4_X*gguf"
SOURCE_DIR="../smol-IQ1-KT-mist.bin"
TENSOR_NAME_PATTERN="blk\.[0-9]+\.ffn_down_exps\.weight"
# ... (see full script in patch) ...
```
The script scans for target files matching `*Q4_X*gguf` containing `blk.[N].ffn_down_exps.weight`, then pulls replacements from `../smol-IQ1-KT-mist.bin/` by matching the `SPECIAL_TENSOR-NNNN-of-XXXX.gguf` shard number.
### 2. Launch perplexity with hot-swap enabled
```bash
ulimit -n 9999
ulimit -l unlimited
export CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7,8,9,10,11,12"
export LLAMA_HOTSWAP_ENABLED=1
export LLAMA_PERPLEXITY_PRE_RELOAD_SCRIPT=./tensor-swap.sh
export LLAMA_DEBUG=1
# --offload-policy -1,off \
GGML_CUDA_NO_PINNED=1 \
/opt/ik_llama.cpp/ik_llama.cpp/build/bin/llama-perplexity \
--chunks 8 \
-f /opt/ik_llama.cpp/wiki.test.raw \
--model /opt/THIREUS/Kimi-K2.6/Q4_X/Kimi-K2.6-THIREUS-Q4_X-SPECIAL_TENSOR-00001-of-01097.gguf \
--alias THIREUS/Kimi-K2.6-Q4_X.bin \
-b 512 -ub 512 \
--ctx-size 512 \
--fit \
--fit-margin 4200 \
--gpu-fit-margin 0,4400,12,4400 \
--temp 0.0 --top-k 0 --top-p 1.0 \
-ctk f16 \
-ctv q8_0 \
-amb 128 \
-mea 128 \
-wgt 1 \
--mlock \
--split-mode layer \
--graph-reduce-type f16 \
--threads $(grep ^cpu\\scores /proc/cpuinfo | uniq | awk '{print $4}' | xargs -I{} echo "{}-0" | bc) \
-sas \
--gpu-layers 99 \
--no-offload-only-active-experts \
--host 0.0.0.0 \
--port 8080 \
--log-enable \
--logdir /var/log/ \
--jinja \
--special \
--prompt-cache "$HOME/.cache/ik_llama.cpp/prompt-cache.bin" --prompt-cache-all \
--slot-save-path "$HOME/.cache/ik_llama.cpp/slot.bin" \
--lookup-cache-dynamic "$HOME/.cache/ik_llama.cpp/slot.bin" \
--keep -1 \
--slot-prompt-similarity 0.35 \
--metrics \
-cuda fusion=1
```
### 3. What happens
1. The model loads with **Q4_X** weights distributed across 13 GPUs using layer splitting.
2. The first pass computes the baseline perplexity over 8 chunks.
3. `tensor-swap.sh` runs between iterations:
* Restores the previously swapped tensor from `.bak` to its original Q4_X state.
* Copies the next IQ1_KT expert shard into place.
4. `llama_reload_changed_tensors()` detects the `mtime` changes, re-parses the GGUF headers, and reloads the affected `ffn_down_exps.weight` tensor(s).
* The restored tensor **returns to its original Q4_X snapshot** and reattaches to its shared buffer.
* The newly swapped tensor is loaded into a private buffer with the new IQ1_KT data.
* Because Q4_X and IQ1_KT have different block sizes (32 vs 256), the split backend redistributes per-device boundaries and resyncs the MoE siblings (`ffn_up_exps` and `ffn_gate_exps`) to the same layout.
5. The CUDA graphs are invalidated and the next perplexity iteration begins.
6. When the rotation hits the sanity-check slot (where the source file is actually the original Q4_X tensor), the perplexity returns to the exact baseline value, confirming the reload is lossless.
### 4. Expected behavior
```text
snapshot_all_reload_tensors: eager snapshot of all reload tensors + siblings
perplexity: calculating perplexity over 8 chunks, n_ctx=512, batch_size=512, n_seq=1
[1]1.0622,[2]1.2068,[3]1.2327,[4]1.1873,[5]1.1487,[6]1.1283,[7]1.1214,[8]1.1109,
Final estimate: PPL = 1.1109
main: executing pre-reload script: ./tensor-swap.sh
main: [pre-reload] Swapped index 0 (tensor #00918)
reloaded tensor 'blk.1.ffn_down_exps.weight'
perplexity: calculating perplexity over 8 chunks ...
Final estimate: PPL = 1.1105
main: executing pre-reload script: ./tensor-swap.sh
main: [pre-reload] Restored index 0. Advancing to index 1.
main: [pre-reload] Swapped index 1 (tensor #00921)
reloaded tensor 'blk.1.ffn_down_exps.weight'
reloaded tensor 'blk.2.ffn_down_exps.weight'
perplexity: calculating perplexity over 8 chunks ...
Final estimate: PPL = 1.1080
```
Notice that when the script restores a tensor to its original Q4_X shard, the reload reattaches it to the shared buffer with zero copy. When the sanity-check slot is reached, the PPL returns to the exact baseline, proving the mechanism is sound.
---
## Summary of Changed Files
| File | Change |
|------|--------|
| `examples/perplexity/perplexity.cpp` | Hot-swap loop + pre-reload script execution. |
| `examples/server/server.cpp` | Trigger reload on `/health` when env var is set. |
| `ggml/include/ggml-cuda.h` | Add `ggml_backend_cuda_invalidate_graphs()`. |
| `ggml/include/ggml.h` | Conditional `GGML_MAX_SRC` override. |
| `ggml/src/CMakeLists.txt` | Propagate `GGML_MAX_SRC` compile definition. |
| `ggml/src/ggml-cuda.cu` | Implement graph invalidation; debug prints for split tensors. |
| `ggml/src/ggml.c` | Debug print in `ggml_mul_mat_id` for shape mismatches. |
| `include/llama.h` | Declare `llama_reload_changed_tensors()`. |
| `src/llama-mmap.cpp/h` | Expose `llama_file::get_path()` so reload registry knows the source file path. |
| `src/llama-model.h` | Add `std::unique_ptr<reload_info> reload` to `llama_model`. |
| `src/llama-reload-info.h` | **New.** Defines `tensor_reload_source`, `reload_state`, and `reload_info` registry. |
| `src/llama-reload.cpp` | **New.** Core implementation: GGUF header parser, snapshot, reload, MoE resync, buffer management, CPU fallback, shape verification. |
| `src/llama.cpp` | Wire reload registry into `llama_model_load`; reset cached compute graphs (`ctx->prev` / `ctx->prev_mtp`) on reload; export C API. |
| `src/CMakeLists.txt` | Propagate `GGML_MAX_SRC` compile definition. |

View File

@ -0,0 +1,88 @@
graph TD
START([Start]) --> ENV{LLAMA_HOTSWAP_ENABLED?}
ENV -->|No| ENDD([End])
ENV -->|Yes| LOAD[Registration Phase<br/>llama_model_load]
subgraph Load_Time [Load Time]
LOAD --> REG[Populate model.reload->tensor_reload_sources<br/>path / offset / mtime / nbytes]
end
REG --> CALL([User calls<br/>llama_reload_changed_tensors])
CALL --> SNAP{Snapshots<br/>done?}
SNAP -->|No| EAGER[snapshot_all_reload_tensors<br/>Capture original_buffer / data / type / ne / nb<br/>Capture original_splits<br/>Discover MoE siblings via populate_moe_siblings]
SNAP -->|Yes| DET
subgraph Detection [Detection Phase]
DET[reload_changed_tensors] --> STAT[For each registered tensor:<br/>stat source file]
STAT --> CHG{mtime / mtime_ns<br/>changed?}
CHG -->|No| SKIP[Skip]
CHG -->|Yes| META[gguf_find_tensor_meta<br/>Parse GGUF header only<br/>Get new offset / type / size / ne]
META --> DIM{"model ne[i] == file ne[i]?"}
DIM -->|No| SKIP2[Skip: dimension mismatch]
DIM -->|Yes| JOB[Add to job list<br/>Mark returning = <br/>new_type == original_type]
end
JOB --> SORT[Sort jobs<br/>Returning to original FIRST]
subgraph Per_Tensor_Reload [Per-Tensor Reload Loop]
SORT --> LOOP[For each job:<br/>reload_tensor name]
LOOP --> RET{Returning to<br/>original?}
RET -->|Yes| OG_SPLIT{Is split tensor?<br/>tensor->extra != nullptr}
OG_SPLIT -->|Yes| REATT_SP[reattach_split_tensor_to_shared<br/>Restore original_buffer / data / extra<br/>Restore original_splits<br/>Free old private buffers ONLY]
OG_SPLIT -->|No| REATT_NS[Restore original_buffer / data<br/>Restore original_type / ne / nb]
REATT_SP --> ST_ORIG[Set state = ON_ORIGINAL]
REATT_NS --> ST_ORIG
ST_ORIG --> MT[Update file mtime]
RET -->|No| TCHG{Type changed<br/>from snapshot?}
TCHG -->|Yes| APPLY["apply_tensor_type_change<br/>Update tensor->type / nb[]<br/>If split & blck_size>1:<br/>Re-round per-device ne[0] to block multiples"]
TCHG -->|No| KEEP[Keep current metadata]
APPLY --> READ[Read new bytes from disk<br/>into host_buf]
KEEP --> READ
READ --> IS_SPLIT{Is split tensor?}
IS_SPLIT -->|Yes| SPATH[Split Path:<br/>reload_tensor_split_path]
SPATH --> F_SP[Free old main & split buffers<br/>ONLY if state != ON_ORIGINAL]
F_SP --> A_SP[Allocate new main buffer<br/>alloc_buffer_fallback<br/>GPU preferred, CPU fallback]
A_SP --> AL_SP[ggml_backend_tensor_alloc]
AL_SP --> C_SP["ggml_backend_tensor_set<br/>host_buf -> device"]
C_SP --> SIB{Has MoE siblings<br/>in this layer?}
SIB -->|Yes| RESYNC[resync_moe_sibling_splits]
SIB -->|No| ST_DET1[Set state = DETACHED]
subgraph MoE_Resync [MoE Sibling Resync]
RESYNC --> RRET{Is reference<br/>returning to original?}
RRET -->|Yes| R_SIB[reattach_split_tensor_to_shared<br/>for each sibling<br/>Zero-copy restore]
RRET -->|No| PHA[Phase A: Detach siblings<br/>Free old non-original buffers<br/>Alloc new main handles<br/>data = 0x1 dummy]
PHA --> PHB["Phase B: Propagate ref dimensions<br/>to siblings<br/>Recompute nb[] via temp ggml_context"]
PHB --> PHC[Phase C: Alloc per-device<br/>GPU split buffers]
PHC --> PHF{Any GPU alloc<br/>failed?}
PHF -->|Yes| PHD[Phase D: Move ENTIRE layer to CPU<br/>Free GPU splits<br/>Alloc CPU buffer<br/>State = FALLBACK_CPU]
PHF -->|No| PHE[Phase E: ggml_backend_tensor_set<br/>Write sibling data back]
PHD --> PHE
PHE --> ST_DET1
R_SIB --> ST_DET1
end
IS_SPLIT -->|No| NSPATH[Non-Split Path:<br/>reload_tensor_non_split_path]
NSPATH --> F_NS[Free old buffer<br/>ONLY if state != ON_ORIGINAL]
F_NS --> A_NS[Allocate new buffer<br/>alloc_buffer_fallback]
A_NS --> AL_NS[ggml_backend_tensor_alloc]
AL_NS --> C_NS["ggml_backend_tensor_set<br/>host_buf -> device"]
C_NS --> ST_DET2[Set state = DETACHED]
ST_DET2 --> MT
ST_DET1 --> MT
end
MT --> MORE{More jobs?}
MORE -->|Yes| LOOP
MORE -->|No| RELOADED{Any tensor<br/>actually reloaded?}
RELOADED -->|No| ENDD
RELOADED -->|Yes| INV[ggml_backend_cuda_invalidate_graphs<br/>Clear cuda_graphs on ALL devices]
INV --> CTX["Reset cached compute graphs<br/>ctx->prev.reset()<br/>ctx->prev_mtp.reset()"]
CTX --> REUSE[can_reuse_graph sees no cached graph<br/>Forces full graph rebuild<br/>on next eval]
REUSE --> ENDD

View File

@ -20,11 +20,15 @@
#include <array>
#include <fstream>
#include <sstream>
#include <cstdlib>
#if defined(_MSC_VER)
#pragma warning(disable: 4244 4267) // possible loss of data
#endif
// Public C API for hot-swap (defined in src/llama.cpp)
extern "C" bool llama_reload_changed_tensors(struct llama_context * ctx);
struct results_perplexity {
std::vector<llama_token> tokens;
double ppl_value;
@ -2056,21 +2060,62 @@ int main(int argc, char ** argv) {
fprintf(stderr, "%s\n", gpt_params_get_system_info(params).c_str());
}
struct results_perplexity results;
if (params.hellaswag) {
hellaswag_score(ctx, params);
} else if (params.winogrande) {
winogrande_score(ctx, params);
} else if (params.multiple_choice) {
multiple_choice_score(ctx, params);
} else if (params.kl_divergence) {
kl_divergence(ctx, params);
} else {
results = perplexity(ctx, params, n_ctx);
const char * hotswap_env = std::getenv("LLAMA_HOTSWAP_ENABLED");
const char * pre_script = std::getenv("LLAMA_PERPLEXITY_PRE_RELOAD_SCRIPT");
if (hotswap_env) {
llama_reload_changed_tensors(ctx);
}
llama_print_timings(ctx);
write_logfile(ctx, params, model, results);
while (true) {
struct results_perplexity results;
if (params.hellaswag) {
hellaswag_score(ctx, params);
} else if (params.winogrande) {
winogrande_score(ctx, params);
} else if (params.multiple_choice) {
multiple_choice_score(ctx, params);
} else if (params.kl_divergence) {
kl_divergence(ctx, params);
} else {
results = perplexity(ctx, params, n_ctx);
}
llama_print_timings(ctx);
write_logfile(ctx, params, model, results);
if (pre_script) {
fprintf(stderr, "%s: executing pre-reload script: %s\n", __func__, pre_script);
#ifdef _WIN32
FILE * fp = _popen(pre_script, "r");
#else
FILE * fp = popen(pre_script, "r");
#endif
if (fp) {
char buf[256];
while (fgets(buf, sizeof(buf), fp)) {
size_t len = strlen(buf);
if (len > 0 && buf[len-1] == '\n') buf[len-1] = '\0';
fprintf(stderr, "%s: [pre-reload] %s\n", __func__, buf);
}
#ifdef _WIN32
_pclose(fp);
#else
pclose(fp);
#endif
} else {
fprintf(stderr, "%s: failed to execute pre-reload script: %s\n", __func__, pre_script);
}
}
if (hotswap_env) {
if (!llama_reload_changed_tensors(ctx)) {
break;
}
} else {
break;
}
}
llama_free(ctx);
llama_free_model(model);

View File

@ -763,6 +763,14 @@ int main(int argc, char ** argv) {
}
res.set_content(health.dump(), "application/json");
const char * hotswap_env = std::getenv("LLAMA_HOTSWAP_ENABLED");
if (hotswap_env) {
// WARNING: llama_reload_changed_tensors is NOT thread-safe with active inference.
// Only enable this when you can guarantee the server is idle during health checks.
llama_reload_changed_tensors(ctx_server.ctx);
}
break;
}
case SERVER_STATE_LOADING_MODEL:

View File

@ -42,6 +42,8 @@ GGML_API GGML_CALL bool ggml_backend_cuda_register_host_buffer(void * buffer, si
GGML_API GGML_CALL void ggml_backend_cuda_unregister_host_buffer(void * buffer);
GGML_API void ggml_backend_cuda_log_set_callback(ggml_log_callback log_callback, void * user_data);
GGML_API void ggml_backend_cuda_invalidate_graphs(void);
#ifdef __cplusplus
}
#endif

View File

@ -240,7 +240,10 @@
// if you need to load more than 64 model shards.
#define GGML_MAX_CONTEXTS 64
#endif
#ifndef GGML_MAX_SRC
// For the machines with 11+ GPUs use -DGGML_MAX_SRC=N
#define GGML_MAX_SRC 12
#endif
#ifndef GGML_MAX_NAME
#define GGML_MAX_NAME 64
#endif

View File

@ -6,6 +6,9 @@ add_compile_definitions(GGML_SCHED_MAX_COPIES=${GGML_SCHED_MAX_COPIES})
if (GGML_MAX_CONTEXTS)
add_compile_definitions(GGML_MAX_CONTEXTS=${GGML_MAX_CONTEXTS})
endif()
if (GGML_MAX_SRC)
add_compile_definitions(GGML_MAX_SRC=${GGML_MAX_SRC})
endif()
# enable libstdc++ assertions for debug builds
if (CMAKE_SYSTEM_NAME MATCHES "Linux")

View File

@ -301,6 +301,16 @@ const ggml_cuda_device_info & ggml_cuda_info() {
return info;
}
/* ---------- hot-swap: invalidate all cached CUDA graphs ---------- */
extern "C" void ggml_backend_cuda_invalidate_graphs(void) {
auto & info = const_cast<ggml_cuda_device_info &>(ggml_cuda_info());
for (int i = 0; i < info.device_count; ++i) {
if (info.all_ctx[i]) {
info.all_ctx[i]->cuda_graphs.clear();
}
}
}
// #define DEBUG_CUDA_MALLOC
// buffer pool for cuda (legacy)
@ -847,6 +857,9 @@ GGML_CALL static void ggml_backend_cuda_split_buffer_init_tensor([[maybe_unused]
}
//printf(" allocated %zu bytes for tensor %s of type %s, dim = %ld x %ld x %ld. padding: %zu\n", padded_size, split->name, ggml_type_name(split->type),
// split->ne[0], split->ne[1], split->ne[2], padded_size - size);
//printf("DEBUG init_tensor: dev=%d split_ne0=%ld type=%s ggml_nbytes=%zu padded=%zu data_ptr=%p\n",
// i, (long)ne0, ggml_type_name(split->type), size, padded_size, (void*)buf);
//fflush(stdout);
split->data = buf;
auto ctx = new ggml_backend_cuda_buffer_context(i, buf);
auto buft = ggml_backend_cuda_buffer_type(i);
@ -1054,6 +1067,12 @@ GGML_CALL static void ggml_backend_cuda_split_buffer_set_tensor([[maybe_unused]]
memcpy(dst + tt.row_meta_size*n_interleave, src + source_offset, n_interleave*(split_row_size - tt.row_meta_size));
}
}
//printf("DEBUG set_tensor: dev=%d split_ne0=%ld nrows=%d split_row_size=%zu total=%zu "
// "split_data=%p host_data=%p host_capacity=%zu source_offset=%zu\n",
// i, (long)split->ne[0], nrows, split_row_size, nrows*split_row_size,
// (void*)split->data, (void*)host_buffer.data(), host_buffer.size(),
// (size_t)source_offset);
//fflush(stdout);
CUDA_CHECK(cudaMemcpyAsync(split->data, host_buffer.data(), nrows*split_row_size, cudaMemcpyHostToDevice, cudaStreamPerThread));
ne += split->ne[0];
}

View File

@ -7868,6 +7868,14 @@ struct ggml_tensor * ggml_mul_mat_id(
GGML_ASSERT(b->ne[3] == 1); // b is 3d
GGML_ASSERT(ids->ne[2] == 1 && ids->ne[3] == 1); // ids is 2d
GGML_ASSERT(ids->ne[1] == b->ne[2]); // must have an expert list per b row
//// can_mul_mat
//if (as->ne[0] != b->ne[0]) {
// fprintf(stderr, "MUL_MAT_ID_FAIL: as='%s' ne[0]=%ld type=%s | b='%s' ne[0]=%ld type=%s | ids->ne[1]=%ld b->ne[2]=%ld as->ne[1]=%ld as->ne[2]=%ld\n",
// as->name, (long)as->ne[0], ggml_type_name(as->type),
// b->name, (long)b->ne[0], ggml_type_name(b->type),
// (long)ids->ne[1], (long)b->ne[2], (long)as->ne[1], (long)as->ne[2]);
// fflush(stderr);
//}
GGML_ASSERT(as->ne[0] == b->ne[0]); // can_mul_mat
GGML_ASSERT(ids->ne[0] % b->ne[1] == 0); // can broadcast

View File

@ -1579,6 +1579,8 @@ LLAMA_API struct llama_grammar* llama_sampler_init_grammar_lazy_patterns(
LLAMA_API void llama_set_draft_input_hidden_state(struct llama_context * ctx, const float * hidden_state);
LLAMA_API bool llama_reload_changed_tensors(struct llama_context * ctx);
#ifdef __cplusplus
}
#endif

0
llama-mmap.h Normal file
View File

View File

@ -53,6 +53,8 @@ add_library(llama
llama-build-context.cpp
llama-model.h
llama-model.cpp
llama-reload-info.h
llama-reload.cpp
llama-quantize.cpp
llama-arch.h
llama-arch.cpp
@ -141,3 +143,6 @@ endif()
if (GGML_MAX_CONTEXTS)
add_compile_definitions(GGML_MAX_CONTEXTS=${GGML_MAX_CONTEXTS})
endif()
if (GGML_MAX_SRC)
add_compile_definitions(GGML_MAX_SRC=${GGML_MAX_SRC})
endif()

View File

@ -416,5 +416,4 @@ struct llama_context {
void set_mtp_op_type(llama_mtp_op_type value);
int max_nodes(int n_tokens, int n_kv) const;
};

View File

@ -76,7 +76,7 @@ struct llama_file::impl {
return ret;
}
impl(const char * fname, const char * mode) {
impl(const char * fname, const char * mode) : path(fname) {
fp = ggml_fopen(fname, mode);
if (fp == NULL) {
throw std::runtime_error(format("failed to open %s: %s", fname, strerror(errno)));
@ -155,13 +155,15 @@ struct llama_file::impl {
write_raw(&val, sizeof(val));
}
std::string path;
~impl() {
if (fp) {
std::fclose(fp);
}
}
#else
impl(const char * fname, const char * mode) {
impl(const char * fname, const char * mode) : path(fname) {
fp = ggml_fopen(fname, mode);
if (fp == NULL) {
throw std::runtime_error(format("failed to open %s: %s", fname, strerror(errno)));
@ -231,6 +233,7 @@ struct llama_file::impl {
void write_u32(uint32_t val) const {
write_raw(&val, sizeof(val));
}
std::string path;
~impl() {
if (fp) {
@ -681,3 +684,5 @@ const bool llama_mlock::SUPPORTED = false;
size_t llama_path_max() {
return PATH_MAX;
}
const std::string & llama_file::get_path() const { return pimpl->path; }

View File

@ -3,6 +3,7 @@
#include <cstdint>
#include <memory>
#include <vector>
#include <string>
struct llama_file;
struct llama_mmap;
@ -28,6 +29,7 @@ struct llama_file {
void write_raw(const void * ptr, size_t len) const;
void write_u32(uint32_t val) const;
const std::string & get_path() const;
private:
struct impl;

View File

@ -12,6 +12,8 @@
#include <unordered_map>
#include <set>
#include "llama-reload-info.h"
// available llama models
enum e_model {
MODEL_UNKNOWN,
@ -552,6 +554,8 @@ struct llama_model {
std::vector<float> splits;
ggml_backend_buffer_type_t split_buft = nullptr;
std::unique_ptr<reload_info> reload;
};
struct llama_lora_weight {

60
src/llama-reload-info.h Normal file
View File

@ -0,0 +1,60 @@
#pragma once
#include "ggml.h"
#include "ggml-backend.h"
#include <string>
#include <vector>
#include <unordered_map>
#include <atomic>
#include <sys/stat.h>
#include <fstream>
struct llama_model;
struct llama_model_loader;
struct tensor_reload_source {
std::string path;
size_t data_offset = 0;
size_t nbytes = 0;
int64_t last_mtime = 0;
int64_t last_mtime_ns = 0;
ggml_backend_buffer_t original_buffer = nullptr;
void * original_data = nullptr;
ggml_type original_type = GGML_TYPE_COUNT;
size_t original_nbytes = 0;
int64_t original_ne[GGML_MAX_DIMS];
size_t original_nb[GGML_MAX_DIMS];
struct split_info {
int64_t ne[GGML_MAX_DIMS];
size_t nb[GGML_MAX_DIMS];
void * data;
ggml_backend_buffer_t buffer;
struct ggml_tensor * tensor = nullptr;
};
std::vector<split_info> original_splits;
std::vector<std::string> sibling_names;
ggml_split_tensor_t * original_extra = nullptr;
enum class reload_state {
UNINITIALIZED,
ON_ORIGINAL,
DETACHED,
FALLBACK_CPU
};
reload_state state = reload_state::UNINITIALIZED;
};
struct reload_info {
std::unordered_map<std::string, tensor_reload_source> tensor_reload_sources;
std::atomic<bool> reload_snapshots_done{false};
reload_info(const llama_model_loader & ml);
bool reload_tensor(const char * name, llama_model & model);
bool reload_changed_tensors(llama_model & model);
void snapshot_all_reload_tensors(llama_model & model);
};

947
src/llama-reload.cpp Normal file
View File

@ -0,0 +1,947 @@
#include "llama-reload-info.h"
#include "llama-model.h"
#include "llama-model-loader.h"
#ifdef GGML_USE_CUDA
#include "ggml-cuda.h"
#endif
#include <map>
#include <fstream>
#include <algorithm>
#include <vector>
#include <cstdint>
#include <cstring>
// ------------------------------------------------------------------
// Debug helpers
// ------------------------------------------------------------------
static void log_tensor_state(const char * ctx, struct ggml_tensor * t) {
#ifndef NDEBUG
if (!t) {
LLAMA_LOG_DEBUG("%s: tensor=NULL\n", ctx);
return;
}
const char * buft_name = "null";
if (t->buffer) {
auto buft = ggml_backend_buffer_get_type(t->buffer);
if (buft) buft_name = ggml_backend_buft_name(buft);
}
LLAMA_LOG_DEBUG("%s: tensor='%s' type=%s ne={%ld,%ld,%ld,%ld} nb={%zu,%zu,%zu,%zu} "
"buffer=%p data=%p extra=%p buft=%s\n",
ctx, t->name, ggml_type_name(t->type),
(long)t->ne[0], (long)t->ne[1], (long)t->ne[2], (long)t->ne[3],
t->nb[0], t->nb[1], t->nb[2], t->nb[3],
(void*)t->buffer, t->data, (void*)t->extra, buft_name);
#else
(void)ctx;
(void)t;
#endif
}
static void log_split_state(const char * ctx, struct ggml_tensor * t) {
#ifndef NDEBUG
if (!t || !t->extra) {
LLAMA_LOG_DEBUG("%s: no splits (extra=%p)\n", ctx, (void*)(t ? t->extra : nullptr));
return;
}
auto extra = (ggml_split_tensor_t *)t->extra;
LLAMA_LOG_DEBUG("%s: tensor='%s' n_device=%d split_dim=%d\n",
ctx, t->name, extra->n_device, extra->split_dim);
for (int i = 0; i < extra->n_device; ++i) {
if (!extra->splits[i]) {
LLAMA_LOG_DEBUG("%s: split[%d]=NULL\n", ctx, i);
continue;
}
const char * split_buft_name = "null";
if (extra->splits[i]->buffer) {
auto buft = ggml_backend_buffer_get_type(extra->splits[i]->buffer);
if (buft) split_buft_name = ggml_backend_buft_name(buft);
}
LLAMA_LOG_DEBUG("%s: split[%d] type=%s ne={%ld,%ld,%ld,%ld} nb={%zu,%zu,%zu,%zu} "
"buffer=%p data=%p buft=%s\n",
ctx, i, ggml_type_name(extra->splits[i]->type),
(long)extra->splits[i]->ne[0], (long)extra->splits[i]->ne[1],
(long)extra->splits[i]->ne[2], (long)extra->splits[i]->ne[3],
extra->splits[i]->nb[0], extra->splits[i]->nb[1],
extra->splits[i]->nb[2], extra->splits[i]->nb[3],
(void*)extra->splits[i]->buffer, extra->splits[i]->data, split_buft_name);
}
#else
(void)ctx;
(void)t;
#endif
}
// ------------------------------------------------------------------
// GGUF header parser (reuses llama.cpp / ggml GGUF loader)
// ------------------------------------------------------------------
static bool gguf_find_tensor_meta(const char * path, const char * target_name,
size_t & out_offset, size_t & out_nbytes,
ggml_type & out_type,
int64_t out_ne[GGML_MAX_DIMS])
{
struct ggml_context * ctx = nullptr;
struct gguf_init_params params = {
/*.no_alloc = */ true,
/*.ctx = */ &ctx,
};
struct gguf_context * gguf = gguf_init_from_file(path, params);
if (!gguf) {
return false;
}
const int idx = gguf_find_tensor(gguf, target_name);
if (idx < 0) {
ggml_free(ctx);
gguf_free(gguf);
return false;
}
struct ggml_tensor * tensor = ggml_get_tensor(ctx, target_name);
if (!tensor) {
ggml_free(ctx);
gguf_free(gguf);
return false;
}
out_offset = gguf_get_data_offset(gguf) + gguf_get_tensor_offset(gguf, idx);
out_nbytes = ggml_nbytes(tensor);
out_type = tensor->type;
for (int i = 0; i < GGML_MAX_DIMS; ++i) {
out_ne[i] = tensor->ne[i];
}
ggml_free(ctx);
gguf_free(gguf);
return true;
}
// ------------------------------------------------------------------
// Buffer census helper
// ------------------------------------------------------------------
static size_t count_buffer_users(
const std::vector<std::pair<std::string, struct ggml_tensor *>> & tensors_by_name,
ggml_backend_buffer_t buf)
{
if (!buf) return 0;
size_t n = 0;
for (auto & p : tensors_by_name) {
if (p.second->buffer == buf) ++n;
}
return n;
}
static bool is_original_snapshot_buffer(llama_model & model, ggml_backend_buffer_t buf) {
if (!buf) return false;
if (!model.reload) return false;
for (const auto & kv : model.reload->tensor_reload_sources) {
const auto & src = kv.second;
if (buf == src.original_buffer) return true;
for (const auto & os : src.original_splits) {
if (buf == os.buffer) return true;
}
}
return false;
}
// ------------------------------------------------------------------
// Final size estimator
// ------------------------------------------------------------------
static size_t llama_model_compute_final_nbytes(struct ggml_tensor * tensor, ggml_type new_type) {
if (new_type == tensor->type) {
return ggml_nbytes(tensor);
}
return ggml_row_size(new_type, tensor->ne[0]) * ggml_nrows(tensor);
}
// ------------------------------------------------------------------
// Fallback allocator
// ------------------------------------------------------------------
static ggml_backend_buffer_t alloc_buffer_fallback(ggml_backend_buffer_type_t buft, size_t size) {
ggml_backend_buffer_t buf = ggml_backend_buft_alloc_buffer(buft, size);
if (buf) {
LLAMA_LOG_DEBUG("%s: allocated %zu bytes on backend '%s'\n",
__func__, size, ggml_backend_buft_name(buft));
return buf;
}
auto cpu_buft = ggml_backend_cpu_buffer_type();
if (buft == cpu_buft) {
LLAMA_LOG_WARN("%s: CPU alloc failed (%zu bytes)\n", __func__, size);
return nullptr;
}
LLAMA_LOG_WARN("%s: backend alloc failed (%zu bytes on '%s'), trying CPU fallback\n",
__func__, size, ggml_backend_buft_name(buft));
buf = ggml_backend_buft_alloc_buffer(cpu_buft, size);
if (!buf) {
LLAMA_LOG_WARN("%s: CPU fallback alloc failed (%zu bytes)\n", __func__, size);
return nullptr;
}
LLAMA_LOG_DEBUG("%s: allocated %zu bytes on CPU fallback\n", __func__, size);
return buf;
}
// ------------------------------------------------------------------
// MoE sibling resync
// ------------------------------------------------------------------
// MoE layers have three weight tensors per block: gate, up, down.
// The CUDA split backend distributes each tensor across GPUs by splitting
// one dimension (usually dim 0 or 1). Split boundaries must be multiples
// of the quantization block size (e.g. 256 for IQ1_KT). If the reference
// tensor changes quantization type, its block size changes, which changes
// the valid split boundaries. ALL siblings in the same layer MUST adopt
// the SAME per-device split dimensions, otherwise the backend dispatches
// rows to the wrong devices and corrupts inference.
//
// When the reference tensor is back on its original snapshot, siblings
// can simply be reattached to their original snapshots too -- no data
// movement or allocation is required.
// ------------------------------------------------------------------
// ------------------------------------------------------------------
// Sibling name registration
// ------------------------------------------------------------------
static void populate_moe_siblings(const char * name, tensor_reload_source & src) {
LLAMA_LOG_DEBUG("%s: name='%s'\n", __func__, name);
static const char * suffixes[] = {
".ffn_down_exps.weight",
".ffn_up_exps.weight",
".ffn_gate_exps.weight",
};
std::string n(name);
for (const char * sfx : suffixes) {
size_t pos = n.find(sfx);
if (pos == std::string::npos) continue;
std::string base = n.substr(0, pos);
for (const char * other : suffixes) {
if (strcmp(other, sfx) != 0) {
src.sibling_names.push_back(base + other);
LLAMA_LOG_DEBUG("%s: registered sibling '%s' for '%s'\n",
__func__, (base + other).c_str(), name);
}
}
return;
}
LLAMA_LOG_DEBUG("%s: '%s' no MoE suffix matched\n", __func__, name);
}
// ------------------------------------------------------------------
// Snapshot helper
// ------------------------------------------------------------------
static void snapshot_tensor_source(struct ggml_tensor * tensor,
tensor_reload_source & src)
{
if (!tensor || src.original_buffer != nullptr) return;
src.original_buffer = tensor->buffer;
src.original_data = tensor->data;
src.original_nbytes = ggml_nbytes(tensor);
src.original_type = tensor->type;
for (int i = 0; i < GGML_MAX_DIMS; ++i) {
src.original_ne[i] = tensor->ne[i];
src.original_nb[i] = tensor->nb[i];
}
auto extra = (ggml_split_tensor_t *)tensor->extra;
if (extra) {
src.original_extra = extra;
src.original_splits.clear();
for (int i = 0; i < extra->n_device; ++i) {
tensor_reload_source::split_info si;
if (extra->splits[i]) {
for (int j = 0; j < GGML_MAX_DIMS; ++j) {
si.ne[j] = extra->splits[i]->ne[j];
si.nb[j] = extra->splits[i]->nb[j];
}
si.data = extra->splits[i]->data;
si.buffer = extra->splits[i]->buffer;
si.tensor = extra->splits[i];
}
src.original_splits.push_back(si);
}
}
populate_moe_siblings(ggml_get_name(tensor), src);
src.state = tensor_reload_source::reload_state::ON_ORIGINAL;
log_tensor_state("snapshot_tensor_source", tensor);
}
// ------------------------------------------------------------------
// Constructor
// ------------------------------------------------------------------
reload_info::reload_info(const llama_model_loader & ml) {
for (const auto & w : ml.weights) {
if (!w.tensor || w.idx >= (int)ml.files.size()) continue;
struct stat st;
if (stat(ml.files[w.idx]->get_path().c_str(), &st) != 0) continue;
tensor_reload_source src;
src.path = ml.files[w.idx]->get_path();
src.data_offset = w.offs;
src.nbytes = ggml_nbytes(w.tensor);
src.last_mtime = st.st_mtime;
#ifdef __linux__
src.last_mtime_ns = st.st_mtim.tv_nsec;
#endif
tensor_reload_sources[ggml_get_name(w.tensor)] = std::move(src);
}
}
// ------------------------------------------------------------------
// Eager snapshot
// ------------------------------------------------------------------
void reload_info::snapshot_all_reload_tensors(llama_model & model) {
if (this->reload_snapshots_done.exchange(true)) return;
LLAMA_LOG_INFO("%s: eager snapshot of all reload tensors + siblings\n", __func__);
for (auto & kv : tensor_reload_sources) {
struct ggml_tensor * tensor = nullptr;
for (auto & p : model.tensors_by_name) {
if (p.first == kv.first) { tensor = p.second; break; }
}
if (!tensor) continue;
snapshot_tensor_source(tensor, kv.second);
}
for (auto & kv : tensor_reload_sources) {
auto & src = kv.second;
for (const auto & sib_name : src.sibling_names) {
auto it = this->tensor_reload_sources.find(sib_name);
if (it == this->tensor_reload_sources.end()) continue;
if (it->second.original_buffer != nullptr) continue;
struct ggml_tensor * sib = nullptr;
for (auto & p : model.tensors_by_name) {
if (p.first == sib_name) { sib = p.second; break; }
}
if (!sib) continue;
snapshot_tensor_source(sib, it->second);
}
}
}
// ------------------------------------------------------------------
// Re-attachment helper
// ------------------------------------------------------------------
static bool reattach_split_tensor_to_shared(llama_model & model, const char * name) {
auto it = model.reload->tensor_reload_sources.find(name);
if (it == model.reload->tensor_reload_sources.end()) return false;
auto & src = it->second;
if (!src.original_buffer) return false;
struct ggml_tensor * tensor = nullptr;
for (auto & p : model.tensors_by_name) {
if (p.first == name) { tensor = p.second; break; }
}
if (!tensor) return false;
if (tensor->buffer == src.original_buffer) {
log_tensor_state("reattach_split_tensor_to_shared", tensor);
src.state = tensor_reload_source::reload_state::ON_ORIGINAL;
return true;
}
if (tensor->buffer && src.state != tensor_reload_source::reload_state::ON_ORIGINAL) {
ggml_backend_buffer_free(tensor->buffer);
}
tensor->buffer = nullptr;
tensor->data = nullptr;
tensor->buffer = src.original_buffer;
tensor->data = src.original_data;
tensor->type = src.original_type;
for (int i = 0; i < GGML_MAX_DIMS; ++i) {
tensor->ne[i] = src.original_ne[i];
tensor->nb[i] = src.original_nb[i];
}
if (src.original_extra) {
tensor->extra = src.original_extra;
auto extra = (ggml_split_tensor_t *)tensor->extra;
for (int i = 0; i < extra->n_device && i < (int)src.original_splits.size(); ++i) {
auto & os = src.original_splits[i];
if (!extra->splits[i] && os.tensor) {
extra->splits[i] = os.tensor;
}
if (extra->splits[i]) {
if (extra->splits[i]->buffer && extra->splits[i]->buffer != os.buffer &&
src.state != tensor_reload_source::reload_state::ON_ORIGINAL) {
ggml_backend_buffer_free(extra->splits[i]->buffer);
}
extra->splits[i]->data = os.data;
extra->splits[i]->buffer = os.buffer;
extra->splits[i]->type = src.original_type;
for (int j = 0; j < GGML_MAX_DIMS; ++j) {
extra->splits[i]->ne[j] = os.ne[j];
extra->splits[i]->nb[j] = os.nb[j];
}
}
}
}
src.state = tensor_reload_source::reload_state::ON_ORIGINAL;
return true;
}
// ------------------------------------------------------------------
// MoE sibling resync
// ------------------------------------------------------------------
static void resync_moe_sibling_splits(
llama_model & model,
struct ggml_context * /*ctx_tmp*/,
struct ggml_tensor * ref_tensor,
const char * ref_name)
{
std::string name_str(ref_name);
std::string layer_prefix;
std::vector<std::string> suffixes;
if (name_str.find(".ffn_down_exps.weight") != std::string::npos) {
layer_prefix = name_str.substr(0, name_str.find(".ffn_down_exps.weight"));
suffixes = {".ffn_up_exps.weight", ".ffn_gate_exps.weight"};
} else if (name_str.find(".ffn_up_exps.weight") != std::string::npos) {
layer_prefix = name_str.substr(0, name_str.find(".ffn_up_exps.weight"));
suffixes = {".ffn_down_exps.weight", ".ffn_gate_exps.weight"};
} else if (name_str.find(".ffn_gate_exps.weight") != std::string::npos) {
layer_prefix = name_str.substr(0, name_str.find(".ffn_gate_exps.weight"));
suffixes = {".ffn_up_exps.weight", ".ffn_down_exps.weight"};
} else {
return;
}
auto ref_extra = (ggml_split_tensor_t *)ref_tensor->extra;
if (!ref_extra) return;
auto it_ref_src = model.reload->tensor_reload_sources.find(ref_name);
if (it_ref_src != model.reload->tensor_reload_sources.end() && ref_tensor->buffer == it_ref_src->second.original_buffer) {
for (const auto & suffix : suffixes) {
reattach_split_tensor_to_shared(model, (layer_prefix + suffix).c_str());
}
return;
}
struct sibling_job {
std::string name;
struct ggml_tensor * tensor;
ggml_split_tensor_t * extra;
std::vector<char> host_buf;
bool needs_resync = false;
};
std::vector<sibling_job> jobs;
for (const auto & suffix : suffixes) {
std::string sib_name = layer_prefix + suffix;
struct ggml_tensor * sib = nullptr;
for (auto & p : model.tensors_by_name) {
if (p.first == sib_name) { sib = p.second; break; }
}
if (!sib || !sib->extra || sib == ref_tensor) continue;
auto sib_extra = (ggml_split_tensor_t *)sib->extra;
if (sib_extra->n_device != ref_extra->n_device) continue;
int sib_dim = sib_extra->split_dim < 0 ? 0 : sib_extra->split_dim;
int ref_dim = ref_extra->split_dim < 0 ? 0 : ref_extra->split_dim;
bool need = false;
for (int i = 0; i < ref_extra->n_device; ++i) {
bool rh = ref_extra->splits[i] != nullptr;
bool sh = sib_extra->splits[i] != nullptr;
if (rh != sh) { need = true; break; }
if (rh && sh && sib_extra->splits[i]->ne[sib_dim] != ref_extra->splits[i]->ne[ref_dim]) {
need = true; break;
}
}
if (!need) continue;
size_t nbytes = ggml_nbytes(sib);
std::vector<char> buf(nbytes);
ggml_backend_tensor_get(sib, buf.data(), 0, nbytes);
jobs.push_back({sib_name, sib, sib_extra, std::move(buf), true});
}
if (jobs.empty()) return;
log_split_state("resync_moe_sibling_splits", ref_tensor);
// Phase A: Detach / free old buffers, allocate new main handles
for (auto & job : jobs) {
auto sib = job.tensor;
ggml_backend_buffer_type_t buft = sib->buffer
? ggml_backend_buffer_get_type(sib->buffer)
: ggml_backend_cpu_buffer_type();
auto it = model.reload->tensor_reload_sources.find(job.name);
bool was_orig = (it != model.reload->tensor_reload_sources.end() && it->second.state == tensor_reload_source::reload_state::ON_ORIGINAL);
if (sib->buffer) {
if (!was_orig) ggml_backend_buffer_free(sib->buffer);
sib->buffer = nullptr;
sib->data = nullptr;
}
size_t alloc_size = ggml_backend_buft_get_alloc_size(buft, sib);
ggml_backend_buffer_t new_buf = alloc_buffer_fallback(buft, alloc_size);
if (!new_buf) {
job.needs_resync = false;
continue;
}
sib->buffer = new_buf;
sib->data = (void*)0x1; // dummy; split backend uses extra->splits
if (it != model.reload->tensor_reload_sources.end()) {
it->second.state = tensor_reload_source::reload_state::DETACHED;
}
}
// Phase B: Propagate dimensions & recompute strides
for (auto & job : jobs) {
if (!job.needs_resync) continue;
auto sib = job.tensor;
auto sib_extra = job.extra;
for (int i = 0; i < ref_extra->n_device; ++i) {
if (!ref_extra->splits[i]) {
if (sib_extra->splits[i]) sib_extra->splits[i] = nullptr;
continue;
}
if (!sib_extra->splits[i]) continue;
sib_extra->splits[i]->ne[sib_extra->split_dim < 0 ? 0 : sib_extra->split_dim] =
ref_extra->splits[i]->ne[ref_extra->split_dim < 0 ? 0 : ref_extra->split_dim];
}
int n_dims = 0;
for (int i = GGML_MAX_DIMS - 1; i >= 0; --i) {
if (sib->ne[i] != 1) { n_dims = i + 1; break; }
}
size_t ctx_size = ggml_tensor_overhead() * (sib_extra->n_device + 4);
if (ctx_size < 16384) ctx_size = 16384;
struct ggml_init_params p = { ctx_size, NULL, true };
struct ggml_context * ctx = ggml_init(p);
if (ctx) {
for (int i = 0; i < sib_extra->n_device; ++i) {
if (!sib_extra->splits[i]) continue;
auto tmp = ggml_new_tensor(ctx, sib->type, n_dims, sib_extra->splits[i]->ne);
if (tmp) {
for (int j = 0; j < GGML_MAX_DIMS; ++j) {
sib_extra->splits[i]->nb[j] = tmp->nb[j];
}
}
}
ggml_free(ctx);
}
}
// Phase C: Allocate GPU split buffers
bool gpu_failed = false;
#ifdef GGML_USE_CUDA
for (auto & job : jobs) {
if (!job.needs_resync) continue;
auto sib_extra = job.extra;
for (int i = 0; i < sib_extra->n_device; ++i) {
if (!sib_extra->splits[i]) continue;
size_t need = ggml_nbytes(sib_extra->splits[i]);
auto buft = ggml_backend_cuda_buffer_type(i);
auto b = ggml_backend_buft_alloc_buffer(buft, need);
if (!b) { gpu_failed = true; break; }
sib_extra->splits[i]->buffer = b;
sib_extra->splits[i]->data = ggml_backend_buffer_get_base(b);
}
if (gpu_failed) break;
}
#else
// Without CUDA support, force CPU fallback for any resync jobs
for (auto & job : jobs) {
if (job.needs_resync) { gpu_failed = true; break; }
}
#endif
// Phase D: If any GPU alloc failed, move entire layer to CPU
if (gpu_failed) {
for (auto & job : jobs) {
if (!job.needs_resync) continue;
auto sib = job.tensor;
auto sib_extra = job.extra;
for (int i = 0; i < sib_extra->n_device; ++i) {
if (sib_extra->splits[i] && sib_extra->splits[i]->buffer) {
auto it = model.reload->tensor_reload_sources.find(job.name);
bool is_orig = false;
if (it != model.reload->tensor_reload_sources.end() && i < (int)it->second.original_splits.size()) {
is_orig = (sib_extra->splits[i]->buffer == it->second.original_splits[i].buffer);
}
if (!is_orig) ggml_backend_buffer_free(sib_extra->splits[i]->buffer);
sib_extra->splits[i]->buffer = nullptr;
sib_extra->splits[i]->data = nullptr;
}
}
if (sib->buffer) {
auto it = model.reload->tensor_reload_sources.find(job.name);
bool is_orig = (it != model.reload->tensor_reload_sources.end() && it->second.state == tensor_reload_source::reload_state::ON_ORIGINAL);
if (!is_orig) ggml_backend_buffer_free(sib->buffer);
sib->buffer = nullptr;
sib->data = nullptr;
}
size_t need = ggml_nbytes(sib);
auto cpu = alloc_buffer_fallback(ggml_backend_cpu_buffer_type(), need);
if (cpu) {
sib->buffer = cpu;
sib->data = ggml_backend_buffer_get_base(cpu);
auto it = model.reload->tensor_reload_sources.find(job.name);
if (it != model.reload->tensor_reload_sources.end()) it->second.state = tensor_reload_source::reload_state::FALLBACK_CPU;
}
}
}
// Phase E: Write data back
for (auto & job : jobs) {
if (!job.needs_resync) continue;
ggml_backend_tensor_set(job.tensor, job.host_buf.data(), 0, job.host_buf.size());
}
}
// ------------------------------------------------------------------
// reload_tensor_split_path
// ------------------------------------------------------------------
static bool reload_tensor_split_path(
llama_model & model,
struct ggml_tensor * tensor,
tensor_reload_source & src,
const std::vector<char> & host_buf,
ggml_type curr_type,
bool returning_to_original,
ggml_backend_buffer_t old_buf)
{
(void)curr_type;
const char * name = ggml_get_name(tensor);
if (returning_to_original) {
if (old_buf && src.state != tensor_reload_source::reload_state::ON_ORIGINAL) {
ggml_backend_buffer_free(old_buf);
}
tensor->buffer = nullptr;
tensor->data = nullptr;
if (!reattach_split_tensor_to_shared(model, name)) return false;
for (const auto & sib : src.sibling_names) {
reattach_split_tensor_to_shared(model, sib.c_str());
}
return true;
}
ggml_backend_buffer_type_t buft = old_buf
? ggml_backend_buffer_get_type(old_buf)
: ggml_backend_cpu_buffer_type();
if (old_buf && src.state != tensor_reload_source::reload_state::ON_ORIGINAL) {
ggml_backend_buffer_free(old_buf);
}
tensor->buffer = nullptr;
tensor->data = nullptr;
size_t alloc_size = ggml_backend_buft_get_alloc_size(buft, tensor);
ggml_backend_buffer_t new_buf = alloc_buffer_fallback(buft, alloc_size);
if (!new_buf) return false;
ggml_backend_tensor_alloc(new_buf, tensor, ggml_backend_buffer_get_base(new_buf));
//ggml_backend_buffer_init_tensor(tensor->buffer, tensor);
ggml_backend_tensor_set(tensor, host_buf.data(), 0, host_buf.size());
log_tensor_state("reload_tensor_split_path", tensor);
if (tensor->extra) resync_moe_sibling_splits(model, nullptr, tensor, name);
src.state = tensor_reload_source::reload_state::DETACHED;
return true;
}
// ------------------------------------------------------------------
// reload_tensor_non_split_path
// ------------------------------------------------------------------
static bool reload_tensor_non_split_path(
llama_model & model,
struct ggml_tensor * tensor,
tensor_reload_source & src,
const std::vector<char> & host_buf,
ggml_type curr_type,
bool returning_to_original,
ggml_backend_buffer_t old_buf)
{
(void)model;
(void)curr_type;
#ifndef NDEBUG
const char * name = ggml_get_name(tensor);
#endif
if (returning_to_original) {
if (old_buf && src.state != tensor_reload_source::reload_state::ON_ORIGINAL) {
ggml_backend_buffer_free(old_buf);
}
tensor->buffer = src.original_buffer;
tensor->data = src.original_data;
tensor->type = src.original_type;
for (int i = 0; i < GGML_MAX_DIMS; ++i) {
tensor->ne[i] = src.original_ne[i];
tensor->nb[i] = src.original_nb[i];
}
src.state = tensor_reload_source::reload_state::ON_ORIGINAL;
return true;
}
ggml_backend_buffer_type_t buft = old_buf
? ggml_backend_buffer_get_type(old_buf)
: ggml_backend_cpu_buffer_type();
if (old_buf && src.state != tensor_reload_source::reload_state::ON_ORIGINAL) {
ggml_backend_buffer_free(old_buf);
#ifndef NDEBUG
} else if (old_buf) {
LLAMA_LOG_DEBUG("detaching from original snapshot buffer %p for '%s'\n", (void*)old_buf, name);
#endif
}
tensor->buffer = nullptr;
tensor->data = nullptr;
size_t alloc_size = ggml_backend_buft_get_alloc_size(buft, tensor);
ggml_backend_buffer_t new_buf = alloc_buffer_fallback(buft, alloc_size);
if (!new_buf) return false;
ggml_backend_tensor_alloc(new_buf, tensor, ggml_backend_buffer_get_base(new_buf));
ggml_backend_tensor_set(tensor, host_buf.data(), 0, host_buf.size());
src.state = tensor_reload_source::reload_state::DETACHED;
return true;
}
// ------------------------------------------------------------------
// apply_tensor_type_change
// ------------------------------------------------------------------
static bool apply_tensor_type_change(
llama_model & /*model*/,
struct ggml_tensor * tensor,
tensor_reload_source & /*src*/,
ggml_type curr_type)
{
#ifndef NDEBUG
const char * name = ggml_get_name(tensor);
(void)name;
#endif
tensor->type = curr_type;
int n_dims = 0;
for (int i = GGML_MAX_DIMS - 1; i >= 0; --i) {
if (tensor->ne[i] != 1) { n_dims = i + 1; break; }
}
size_t ctx_size = ggml_tensor_overhead() * (1 + (tensor->extra ? ((ggml_split_tensor_t*)tensor->extra)->n_device : 0))
+ ggml_graph_overhead_custom(1, false);
struct ggml_init_params p = { ctx_size, NULL, true };
struct ggml_context * ctx = ggml_init(p);
if (!ctx) return false;
auto tmp = ggml_new_tensor(ctx, curr_type, n_dims, tensor->ne);
if (!tmp) { ggml_free(ctx); return false; }
for (int i = 0; i < GGML_MAX_DIMS; ++i) tensor->nb[i] = tmp->nb[i];
if (tensor->extra) {
auto extra = (ggml_split_tensor_t *)tensor->extra;
auto tt = ggml_internal_get_type_traits(curr_type);
if (tt.blck_size > 1 && extra->split_dim == 0) {
int64_t bs = tt.blck_size;
int n = extra->n_device;
std::vector<int64_t> bounds(n, 0);
int64_t acc = 0;
for (int i = 0; i < n; ++i) {
if (extra->splits[i]) acc += extra->splits[i]->ne[0];
bounds[i] = acc;
}
for (int i = 0; i < n - 1; ++i) {
if (bounds[i] > 0) {
bounds[i] = ((bounds[i] + bs - 1) / bs) * bs;
}
}
bounds[n - 1] = tensor->ne[0];
for (int i = 1; i < n; ++i) {
if (bounds[i] < bounds[i - 1]) bounds[i] = bounds[i - 1];
}
int64_t prev = 0;
for (int i = 0; i < n; ++i) {
if (extra->splits[i]) {
int64_t ne0 = bounds[i] - prev;
if (ne0 <= 0) {
extra->splits[i] = nullptr;
} else {
extra->splits[i]->ne[0] = ne0;
}
}
prev = bounds[i];
}
}
for (int i = 0; i < extra->n_device; ++i) {
auto split = extra->splits[i];
if (!split) continue;
split->type = curr_type;
auto t = ggml_new_tensor(ctx, curr_type, n_dims, split->ne);
if (t) {
for (int j = 0; j < GGML_MAX_DIMS; ++j) split->nb[j] = t->nb[j];
}
}
int64_t sum = 0;
for (int i = 0; i < extra->n_device; ++i) {
if (extra->splits[i]) sum += extra->splits[i]->ne[0];
}
GGML_ASSERT(sum == tensor->ne[0]);
}
ggml_free(ctx);
return true;
}
// ------------------------------------------------------------------
// reload_tensor
// ------------------------------------------------------------------
bool reload_info::reload_tensor(const char * name, llama_model & model) {
auto it = tensor_reload_sources.find(name);
if (it == tensor_reload_sources.end()) return false;
auto & src = it->second;
struct stat st;
if (stat(src.path.c_str(), &st) != 0) return false;
bool changed = (st.st_mtime != src.last_mtime);
#ifdef __linux__
changed = changed || (st.st_mtim.tv_nsec != src.last_mtime_ns);
#endif
if (!changed) return false;
size_t off = 0, file_nbytes = 0;
ggml_type curr_type = GGML_TYPE_COUNT;
int64_t file_ne[GGML_MAX_DIMS];
if (!gguf_find_tensor_meta(src.path.c_str(), name, off, file_nbytes, curr_type, file_ne)) return false;
std::ifstream file(src.path, std::ios::binary);
if (!file) return false;
file.seekg((std::streamoff)off);
if (!file) return false;
struct ggml_tensor * tensor = nullptr;
for (auto & p : model.tensors_by_name) {
if (p.first == name) { tensor = p.second; break; }
}
if (!tensor || !src.original_buffer) return false;
// Refuse to swap if the on-disk shape differs from the model tensor
for (int i = 0; i < GGML_MAX_DIMS; ++i) {
if (tensor->ne[i] != file_ne[i]) {
LLAMA_LOG_INFO("reload_tensor: dimension mismatch for '%s': model ne[%d]=%ld, file ne[%d]=%ld — refusing swap\n",
name, i, (long)tensor->ne[i], i, (long)file_ne[i]);
return false;
}
}
ggml_backend_buffer_t old_buf = tensor->buffer;
bool returning = (curr_type == src.original_type);
std::vector<char> host_buf;
if (!returning) {
if (curr_type != tensor->type) {
if (!apply_tensor_type_change(model, tensor, src, curr_type)) return false;
}
size_t need = ggml_nbytes(tensor);
if (file_nbytes < need) return false;
host_buf.resize(need);
file.read(host_buf.data(), (std::streamsize)need);
if (!file || (size_t)file.gcount() != need) return false;
}
bool ok = false;
if (tensor->extra) {
ok = reload_tensor_split_path(model, tensor, src, host_buf, curr_type, returning, old_buf);
} else {
ok = reload_tensor_non_split_path(model, tensor, src, host_buf, curr_type, returning, old_buf);
}
if (ok) {
src.last_mtime = st.st_mtime;
#ifdef __linux__
src.last_mtime_ns = st.st_mtim.tv_nsec;
#endif
}
return ok;
}
// ------------------------------------------------------------------
// reload_changed_tensors
// ------------------------------------------------------------------
bool reload_info::reload_changed_tensors(llama_model & model) {
snapshot_all_reload_tensors(model);
struct job { const char * name; bool returning; };
std::vector<job> jobs;
for (auto & kv : tensor_reload_sources) {
auto & src = kv.second;
struct stat st;
if (stat(src.path.c_str(), &st) != 0) continue;
bool changed = (st.st_mtime != src.last_mtime);
#ifdef __linux__
changed = changed || (st.st_mtim.tv_nsec != src.last_mtime_ns);
#endif
if (!changed) continue;
size_t off = 0, nbytes = 0;
ggml_type t = GGML_TYPE_COUNT;
int64_t file_ne[GGML_MAX_DIMS];
if (!gguf_find_tensor_meta(src.path.c_str(), kv.first.c_str(), off, nbytes, t, file_ne)) continue;
struct ggml_tensor * tensor = nullptr;
for (auto & p : model.tensors_by_name) {
if (p.first == kv.first) { tensor = p.second; break; }
}
if (!tensor) continue;
bool dims_ok = true;
for (int i = 0; i < GGML_MAX_DIMS; ++i) {
if (tensor->ne[i] != file_ne[i]) {
LLAMA_LOG_INFO("reload_changed_tensors: dimension mismatch for '%s': model ne[%d]=%ld, file ne[%d]=%ld — skipping\n",
kv.first.c_str(), i, (long)tensor->ne[i], i, (long)file_ne[i]);
dims_ok = false;
break;
}
}
if (!dims_ok) continue;
bool returning = (t == src.original_type);
jobs.push_back({kv.first.c_str(), returning});
}
std::sort(jobs.begin(), jobs.end(), [](const job & a, const job & b) {
return a.returning > b.returning;
});
bool r = false;
for (auto & j : jobs) {
if (reload_tensor(j.name, model)) {
r = true;
LLAMA_LOG_INFO("reloaded tensor '%s'\n", j.name);
}
}
if (r) {
#ifdef GGML_USE_CUDA
ggml_backend_cuda_invalidate_graphs();
#endif
}
return r;
}

View File

@ -4158,6 +4158,11 @@ static int llama_model_load(const std::string & fname, llama_model & model, llam
)) {
return -2;
}
// ---- populate reload registry ONLY when hot-swap is requested ----
if (std::getenv("LLAMA_HOTSWAP_ENABLED") != nullptr) {
model.reload = std::make_unique<reload_info>(ml);
}
} catch (const std::exception & err) {
LLAMA_LOG_ERROR("%s: error loading model: %s\n", __func__, err.what());
return -1;
@ -11081,3 +11086,17 @@ void llama_set_mtp_target_context(struct llama_context * ctx, struct llama_conte
size_t llama_fill_from_utf8(void* utf8, void* cpts, void* scripts) {
return unicode_fill_from_utf8((std::string*)utf8, (std::vector<uint32_t>*)cpts, (std::vector<std::string>*)scripts);
}
bool llama_reload_changed_tensors(struct llama_context * ctx) {
if (!ctx) return false;
llama_model & model = const_cast<llama_model &>(ctx->model);
if (!model.reload) return false;
bool result = model.reload->reload_changed_tensors(model);
if (result) {
// Reset cached compute graphs so they are rebuilt with new tensor pointers/sizes
ctx->prev.reset();
ctx->prev_mtp.reset();
}
return result;
}