* feat(convert): Get language model conversion working for 4.1 vision
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat(convert): Skip multimodal tensors for GraniteMoeHybrid (vision 4.0)
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Disable vocab padding for non-hybrid models that use GraniteMoeHybrid
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Plumb python-side vision projector names and mappings
There are several awkward things here:
1. Most of these are essentially identical to the audio qformer tensors. On
the c++ side, that's mapped using the prefix, so the rest of the GGUF
name needs to align, but on the python side there's no prefix notion, so
they all get duplicated.
2. There are a couple of net-new tensors for vision, in particular
PROJ_NORM. In both speech and vision, the QF_PROJ_NORM is qualified as
belonging to the qformer portion, but the GGUF name is simply proj_norm
which conflicts with the ideal name for this new PROJ_NORM that is not
qualified as part of the qformer. To get around this, I used
"proj_layernorm" as the GGUF name.
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Add python side architecture name
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Add python-side plumbing for setting FEATURE_LAYERS hparam
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Add c++ side tensor naming defines
NOTE: Usage of these hasn't been updated to include prefix yet
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat(mtmd): Convert vision_feature_layer to an ordered vector
We need to preserve the ordering of these feature index values so that they
can be mapped to the sub-tensors within the stacked projectors.
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat(mtmd): Add architecture label plumbing
Branch: Granite4Vision
AI-usage: full (OpenCode + qwen3.5:122b)
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat(wip): Add partial conversion for mmproj
This handles stacking the projector tensors and setting the new harams
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Add gguf_writer and constant support for new hparams and deepstack layer arr
Branch: Granite4Vision
AI-usage: draft (OpenCode + qwen3.5:122b)
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Full conversion for mmproj w/ tensor mappings
Branch: Granite4Vision
AI-usage: full (OpenCode + qwen3.5:122b)
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Add lm_head skip for mmproj for 4.0
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: De-alias text_config architecture in convert_lora_to_gguf.py
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Add --trust-remote-code arg to convert_lora_to_gguf.py
This defaults to False, but allows a user to enable it programmaticly
instead of using the interactive prompt.
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: De-alias model.language_model. -> model. for lora adapters
Branch: Granite4Vision
AI-usage: full (OpenCode + qwen3.5:122b)
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Extend language model tensor dealiasing in adapters
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Remove unnecessary registration for GraniteSpeech in language model
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Plumb through mm prefix formatting for qformer tensors
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* refactor: Refactor vision projector tensors to use predictor ID as the block
This is cleaner than stacking them. The modeling file hard-codes
single-layer qformers, so we can punt on the multiipule multi-layer
projectors problem.
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Add spatial offests array hparam conversion
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Add stub plumbing for granite vision in mtmd
Branch: Granite4Vision
AI-usage: draft (OpenCode + qwen3.5:122b)
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Add new hparam and tensor naming in clip-impl.h
New hparams:
- KEY_PROJ_SAMPLE_QUERY_SIDE
- KEY_PROJ_SAMPLE_WINDOW_SIDE
- KEY_PROJ_SPATIAL_OFFSETS
New tensors:
- TN_MULTI_PROJ_IMG_POS
- TN_MULTI_PROJ_QUERY
- TN_MULTI_PROJ_LAYERNORM
- TN_MULTI_PROJ_LINEAR
- TN_MULTI_PROJ_NORM
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Move deepstack_layer_arr to llm hparam instead of mmproj
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Remove IS_DEEPSTACK_LAYERS
This appears to have been added during Qwen3 VL
(https://github.com/ggml-org/llama.cpp/pull/16780), but it was never
actually used.
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* refactor: n_deepstack_layers -> deepstack_layer_arr
The old logic hard coded a correspondence between the first N layers of the
LLM and the 1->N entries in the input embeddings. Now, that relationship is
maintained at loading time if the GGUF value is single-valued. If it is
multi-valued, it loads directly allowing for deepstack layers to be spaced
out throughout the model.
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Use try/catch for single/multi valued deepstack info
The alternative would be to use get_key_or_arr, but then the single value
would be populated through the entire array and we'd need to detect that
and update it with the right correspondence.
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Add deepstack injection point for granite LLM
The use of ggml_add here assumes that the elements of inp_embd will be pre-
arranged to be the full embedding length with only the vision-mask'ed
portions non-zero from the projector. This matches how Qwen3VL does it.
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: add missing vision attn layernorm eps
Branch: Granite4Vision
AI-usage: full (OpenCode + Qwen 3.6-35B)
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* refactor: Hoist qformer tensors into qf_block and hold a vector for multi-proj
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Fix missing prefix template for TN_QF_PROJ_LINEAR
It's not strictly necessary since vision uses the blockwise version, but it
makes the loading consistent.
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Add embedding scale and image grid pinpoints hparams in conversion
Also remove dead parsing for self._deepstack_layer_arr
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Add mtmd KEY_ section for hparams shared with the LLM
In this case, we need the EMBEDDING_SCALE so we can unscale the image
embeddings to compensate for applying embedding scale to the input
embeddings
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Implement c++ hparam parsing
Branch: Granite4Vision
AI-usage: draft (Claude Code)
Co-authored-by: Eli Schwartz <eliyahu.schwartz@ibm.com>
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Flatten pinpoints in conversion
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Add missing break
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: No reason to have modality prefix for img_pos
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Add tensor loading
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix(convert): Fix confusion between proj.norm and proj.qformer.layernorm
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Use the right portion of speech for tensor loading!
Also plumb through the layernorm -> post_norm naming change
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Add logging of deepstack_layers_arr if set
I also changed the print_f output type to int32_t to avoid printing
overflow values for -1. This could cause overflows on the other side, but
I can't imagine a value for any of the current array hparams that would
trigger that.
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Make sure input embeddings are cont before f_embedding_scale
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Add init and mmproj_embd cases for g4v
The n_mmproj_embd is 1+ to make space for the text embedding and all 8
projectors
Branch: Granite4Vision
AI-usage: draft (Bob)
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Invert (h, w) -> (w, h) pinpoints
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Reorder projectors based on llm index and skip the first injection
The multi-projector stack has a strange asymmetry based on how it's
currently implemented for qwen3vl: on the mmproj side, it's all N
projectors, but the output of the "first" (by inp_embd index) projector is
automatically consumed as if it were a standard single-projector mmproj,
so the deepstack portion needs to only contain the 1-N entries.
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Eli Schwartz <eliyahu.schwartz@ibm.com>
* fix: Fix mmproj hparams in conversion
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Eli Schwartz <eliyahu.schwartz@ibm.com>
* fix: Fix ordering/logic for deepstack injection in granite
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Eli Schwartz <eliyahu.schwartz@ibm.com>
* fix: Fix preprocessing config to match what the model needs
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Eli Schwartz <eliyahu.schwartz@ibm.com>
* wip: Partial port of Eli's implementation
This is still pretty broken, but it's getting closer. It now happily
generates tokens, but the values are quite incorrect still. I suspect it's
caused by the mapping of projectors from safetensors to their respective
orders here.
Also, this implementation breaks encapsulation pretty badly in mtmd_encode.
This will need a big refactor to put the G4V-specific encoding logic
somewhere more appropriate.
Branch: Granite4Vision
AI-usage: draft (Claude Code, Bob)
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Eli Schwartz <eliyahu.schwartz@ibm.com>
* fix: Fix the pre-scaling on the input embeddings to correctly invert the scale
We've got tokens! They still don't line up quite right, so something's a
little off, but we're getting much closer now.
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: invert embedding multiplier -> base_scale at load
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Fix setting image_resize_pad after new enum introduced
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Add G4V to mmproj mapping in conversion
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Re-add padding disable for non-hybrid hybrid models
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* refactor: Simplify G4V n_tokens computation
This is slightly more efficient and flexible for when we implement the
unpad cropping. IMO, it's also clearer that it is adding the number of
image_newline tokens (embeddings) to the grid, rather than recomputing the
entire count.
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Add new clip APIs for post-tile-encoding assembly
Granite 4 Vision uses llava-next style pack-and-unpad which requires
injecting the learned newline after each row of the tile grid. A row here
is a single row of the grid which is composed of (grid_x * cols_per_tile) *
(grid_y * rows_per_tile), so the result is newlines injected in between
individual tile rows, thus not something that can be handled with the
standard llava-uhd block-wise endcoding.
Branch: Granite4Vision
AI-usage: draft (Claude Code + Opus 4.7)
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Add model interfaces for granite 4 vision assembler
I'm on the fence about the best organization of this. These free functions
allow the per-architecture logic in clip.cpp to access the model-specific
graph building, but they still require a fair bit of model-specific logic
in clip.cpp which is not ideal.
I think a better approach may be to replicate what is done with the
graph builders themselves (and possibly even make the assembler part of the
model's existing graph builder).
Branch: Granite4Vision
AI-usage: full (Claude Code + Opus 4.7)
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* refactor: Remove all g4v-specific branching from mtmd.cpp in favor of clip assembler
Branch: Granite4Vision
AI-usage: full (Claude Code + Opus 4.7)
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* refactor(mtmd): Consolidate assembler logic into clip_assembler class family
Just like `clip_graph` is the base class for building the model-specific
encoder graphs, `clip_assembler` will be the base class for building the
model-specific assembler graphs. This allows the assembly pattern to follow
how the encoder pattern is implemented where the model-specific logic lives
in a subclass co-located with the encoder graph builder that gets
constructed by a simple factory method.
Branch: Granite4Vision
AI-usage: full (Claude Code + Opus 4.7)
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* style: Comment improvement
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* refactor: granite_vision -> granite4_vision
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Remove dead codepath for Qwen3VL add_vision_is_deepstack
These pieces were never used on the c++ side (removed there in an earlier
commit), so this is just cleanup that I missed before.
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Oops! I did not mean to commit one of my prompt files
But now it's too far back in history to effectively rebase out, even with
interactive and --rebase-merges :(
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Add missing <algorithm> include for std::find
It seems that this was already pulled in on some platforms, but not on
others
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Fix Flake8 warnings in granite conversion module
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* refactor: Remove clip_assembler in favor of clip_image_f32.append_token
Per conversation in the PR, the clip_assembler pattern was too invasive.
This is a compromise that limits model-specific blocks to add_media where
each preprocessed tile is annotated with an injection type, after which all
the token counting logic is generic and the newline injection itself is
handled in the graph based on the value for the given tile image.
Branch: Granite4Vision
AI-usage: draft (Bob, OpenCode + Qwen 3.6 35b)
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* refactor(convert): Split n_deepstack_layers and deepstack_layers (array)
Branch: Granite4Vision
AI-usage: full (Bob, OpenCode + Qwen3.6-35b)
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* refactor(src): Handle n_deepstack_layers and deepstack_layers GGUF keys
Branch: Granite4Vision
AI-usage: draft (Bob, OpenCode + Qwen3.6-35b)
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Fix GGUF key for deepstack_layers_arr
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* refactor: Remove pre-scaling embeddings and skip scaling for raw embd inputs
This follows how gemma3 and gemma4 handle embedding scaling by skipping the
multiplier for raw input embeddings.
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* refactor: deepstack_layers(_arr) -> deepstack_mapping(_arr)
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* refactor: Fully revert changes to n_deepstack_layers and qwen3vl*
Since we're going to keep the GGUF KVs separate, it makes sense to just
keep the hparams separate too to limit the scope of this branch. The down
side is that n_deepstack_layers and deepstack_mapping_arr are potentially
conflicting.
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Revert removal of "is_deepstack_layers" GGUF KV
This KV is not used at all on the c++ side, so it's fully dead, but there's
also no need to conflate this cleanup with the addition of G4V.
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Remove unnecessary ggml_cont and build_forward_expand in cbx
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* style: Clean up comments
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Tighter and more flexible code for g4v_build_block
This could be refactored to look a lot more like granite-speech, but the
overall block constructs before/after the qformer are pretty different, so
for now I'm going to leave it as is and just tighten a bit.
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Remove unnecessary `unordered_set` include
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Add architecture guard on deepstack_mapping_arr printout
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Remove unnecessary AI-gen comment
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Always initialize deepstack_mapping_arr with -1 values
This was causing `test-llama-archs` to fail, likely due to trying to save
the uninitialized values, then re-loading them. It's safer to always
initialize so that other models don't forget and end up with undefined
behavior.
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* style: Remove TODO about block/vs non-block tensor mapping
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* refactor: Move is_vision_feature_layer logic into clip_hparams
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* refactor: Use a bool for append_token
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* style: Remove unnecessary comment
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Remove unused get_model api
yikes!
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* refactor: Rearrange helpers for g4v to be private members and use build_attn
Branch: Granite4Vision
AI-usage: full (Bob, OpenCode + Qwen3.6-35b)
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Fix off-by-one in vision layer index
This was inherited from the Claude Code implementation that pushed the
negative index inversion down into the model file.
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Fix norm/post_norm mixup in conversion
face. palm. :(
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* style: More descriptive tensor names
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Apply PR cleanup for new conversion changes
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* fix(convert): Remove duplicate V_ENC_EMBD_IMGNL
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* refactor: append_token -> add_newline
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* style: Comment cleanup
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Cleaner error handling/checking
NOTE: format_string is not available in granite.cpp (and including
clip-impl.h to get it doesn't compile, so I think it violates the intended
encapsulation), so std::stringstream is the simplest answer.
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
---------
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* spec: support MTP
* fix batch size
* rename files
* cont : simplify (#7)
* MTP: clean-up (#9)
* MTP: clean-up
* review: use llama_context_type instead of llama_graph_type
* review: remove llama_model_has_mtp
* review: fix convert issues
* convert: fix pycheck
* review: formatting
* use `mtp-` for identifying mtp models
* convert: fix mtp conversion
* mtp -> draft-mtp
* remove unused llama_arch
* add need_embd in speculative
* llama: allow partial seq_rm for GDN models for speculative decoding
Currently speculative checkpoint needs to restart from a checkpoint
after some draft tokens are not accepted, this leads to some wastage in
running the target again. This PR adds the ability to rollback upto
`draft_max` by storing the GDN intermediates.
* fix pending state
* vulkan: add GDN partial rollback
* meta: extend check to axis 1
* metal: add GDN partial rollback
Extend the gated delta net kernel to store intermediate states for
partial rollback support on the Metal backend.
- Add K (snapshot slot count) as a function constant
- Read input state from slot 0 of the 3D state tensor
- Write intermediate states to different slots during token loop
- For K=1, maintain backward-compatible single-slot behavior
Ref: 8c05923630
Assisted-by: llama.cpp:local pi
* delta_net_base: use ggml_pad instead of new_tensor
* review: add need_rs_seq
* review: rename part_bounded to n_rs
* review: deslop comments
* review: rename, add asserts
* server : adjust checkpoint logic (#11)
* server : adjust checkpoint logic
* cont : rm asserts
* server-context: fix early exit
* spec : fix compatibility with n-gram and add TODOs (#13)
* metal : cleanup
* llama : fix faulty bitwise check in recurrent memory
* server : disable RS-based MTP in combination with other spec types
* spec : add TODOs
* cont : fix comment
* cont : update comment
* common : fix logic for ngram + mtp compat
* llama-memory: enable checkpointing with partial rollback
* cont: add test-case for loading into a dirty ctx
* llama-memory-recurrent: clear rs_idx in clear
* download: fix mtp path
* llama-arch: fix enorm op
* docs: update docs
* conversion: fix type annotations
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* WIP: add NVFP4 quantization support
* tests
* improve NVFP4 dot product implementation performance and fix bad super call
* typo
* Use nvfp4 kvalues
* vulkan : fix NVFP4 shader compilation by including kvalues_mxfp4 lookup table
* vulcal and perf fixes
* wip
* Fix metal
* fix vulcan
* Rename threshold & fix wrong scale
* Fix MOE
* Shelf backend implementations (CUDA, Metal, Vulkan, arch-specific SIMD)
Remove NVFP4 support from GPU backends and architecture-specific
optimized dot products. These should be added in separate PRs so
backend specialists can review them independently.
Reverted files:
- ggml-cuda: common.cuh, convert.cu, mmq.cu/cuh, mmvq.cu, vecdotq.cuh,
quantize.cu/cuh, mma.cuh, ggml-cuda.cu, fattn-tile.cuh
- ggml-metal: ggml-metal.metal, ggml-metal-device.cpp, ggml-metal-impl.h,
ggml-metal-ops.cpp
- ggml-vulkan: ggml-vulkan.cpp, all vulkan-shaders/*
- ggml-cpu arch: arm/quants.c, x86/quants.c, powerpc/quants.c, s390/quants.c
Core NVFP4 support (type definition, CPU fallback dot product,
quantization, dequantization, conversion) is retained.
* Fix arch-fallback.h: add NVFP4 generic fallback for all platforms
After shelving backend-specific SIMD implementations, the generic
CPU dot product needs to be aliased on ARM, x86, PowerPC, and s390
platforms that previously relied on arch-specific versions.
* quantize: add NVFP4 as a quantization type option
* Fix ggml_fp32_to_ue4m3: handle subnormal values
Previously, values with ue4m3_exp <= 0 were clamped to 0, causing
all small scales to underflow. This made NVFP4 quantization via
llama-quantize produce garbage (PPL = 5.8M) since typical transformer
weights have amax/6.0 in the range 0.001-0.01, which falls in the
UE4M3 subnormal range.
Now subnormals are properly encoded as man * 2^-9 (exp=0, man=1..7),
matching the decode path in ggml_ue4m3_to_fp32.
Result: NVFP4 requantization now produces PPL = 15.25 (vs F16 = 14.33),
comparable to Q4_1 (PPL = 15.81) at slightly lower BPW (4.70 vs 5.15).
* Restore ARM NEON NVFP4 dot product implementation
Restores the optimized ggml_vec_dot_nvfp4_q8_0 for ARM NEON using
vqtbl1q_s8 lookup and ggml_vdotq_s32 dot products.
tg128 performance: 4.37 t/s (generic) -> 13.66 t/s (NEON) = 3.1x speedup
* Optimize ARM NEON NVFP4 dot product: LUT + vpaddq + vfmaq
- Add ue4m3_scale_lut[128] to ggml-common.h replacing branch-heavy
ggml_ue4m3_to_fp32() in the hot loop
- Use vpaddq_s32 for pairwise int32 reduction instead of vaddvq_s32
- Accumulate with vfmaq_f32 into float32x4_t vector accumulators
tg128: 8.1 -> 31.0 t/s (3.8x speedup, 77% of Q4_1 speed)
* ARM NEON NVFP4: rearrange q8 to match nibble layout
Alternative approach: rearrange q8 data to match the NVFP4 lo/hi
nibble layout instead of rearranging the looked-up NVFP4 values.
Eliminates vcombine_s8(vget_low, vget_low) shuffles.
Performance is equivalent (~18.5 t/s) - the bottleneck is the 2x
block overhead from QK=16 vs QK=32, not the shuffle instructions.
* CPU only backend 64 super-block layout
* cleanup
* Remove unused LUT
* int
* exclude NVFP4 from unsupported ops in metal build
* remove quantization for now
* store scales as native UE4M3, preserve original model bits when possible
* Update convert_hf_to_gguf.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* correct comment
* format
* reduce duplication and cleanup
* Address comments
* move detection to prepare_tensors
* Use math instead of const
* Move
* fix comment
* Shelf quantize tests
* Rebase and move check
* cleanup
* lint
* Update gguf-py/gguf/scripts/gguf_convert_endian.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Use fallback quant config
* Simplify
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* organize
* Refactor
* Update convert_hf_to_gguf.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update convert_hf_to_gguf.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update convert_hf_to_gguf.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* add quantize_nvfp4 (required for test_quants.py)
* add quantize_nvfp4 (required for test_quants.py)
* add quantize_nvfp4 (required for test_quants.py)
* fix return type
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* tests: add end-to-end tests per model architecture
* fixup for rebase
* fix use-after-free in llama-model-loader.cpp
* fix CI
* fix WebGPU
* fix CI
* disable CI for macOS-latest-cmake-arm64
* use expert_weights_scale only if != 0.0f
* comments
* Adding --direct-io flag for model loading
* Fixing read_raw() calls
* Fixing Windows read_raw_at
* Changing type off_t to size_t for windows and Renaming functions
* disable direct io when mmap is explicitly enabled
* Use read_raw_unsafe when upload_backend is available, not functional on some devices with Vulkan and SYCL
* Fallback to std::fread in case O_DIRECT fails due to bad address
* Windows: remove const keywords and unused functions
* Update src/llama-mmap.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
---------
Co-authored-by: jtischbein <jtischbein@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
ModernBERT but without `head.norm` so will currently fail to convert and run any other ModernBERT models, PRs with `head.norm` support welcome!
* constants and tensor mappings for modern bert support, model not supported yet but working on getting conversion to work for encoder only
* conversion now working, hf -> gguf
* working on support, now working on building graph
* some cleanup
* cleanup
* continuing
* correct tensor shape for qkv
* fixed tensor mappings and working on buildin graph
* tensor debugging now works -> (llama-eval-callback), instead of simulated gate split with views, GEGLU is now used which does exactly this
* cleanup
* cleanup
* cleanup
* more cleanup
* ubatch issues, the assert for checking equal seqs in llama-graph.cpp when building attention keeps failing, setting ubatch size to 1 when running llama-embedding with --ubatch-size 1 makes it work, but needs to be looked into more
* added cls token per previous modern bert attempt, still working on checking out the rest
* fixed pre tokenizer and still working through previous pr
* working through previous attemp, implimented more accurate conversion per previous attempt, added local sliding window attention that alternates every third layer
* fixed pre tokenizer
* working on swa with local and global alternating attention
* some cleanup and now fails on build attn
* starting to work, and some cleanup, currently failing on last layer construction in graph build
* alternating rope implemented and modern bert graph build succeeds
* fixed asser for equal ubatch seq
* cleanup
* added mask check in vocab
* fixed alternating rope, the hparams.rope_freq_base_train and hparams.rope_freq_base_train_swa were the same and i set them to correct values
* reuse variable
* removed repeat
* standard swa method can be used instead of a new enum being LLAMA_SWA_TYPE_LOCAL
* correct swa layer indexing, is supposed to be 0, 3, 6 ... instead of 1, 4, 7 ...
* more modular hparam setting
* replaced attn out norm with ffn_norm and cosine similarity between hf embds and llama.cpp embds went way up, from 0.05 to 0.24, replaced the cacheless kv with swa todo per the previous conversion
* Update gguf-py/gguf/tensor_mapping.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update convert_hf_to_gguf_update.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update src/llama-model.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update src/llama-vocab.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update src/llama-model.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update gguf-py/gguf/tensor_mapping.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update convert_hf_to_gguf.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update gguf-py/gguf/tensor_mapping.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update gguf-py/gguf/tensor_mapping.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update convert_hf_to_gguf.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update gguf-py/gguf/tensor_mapping.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update gguf-py/gguf/tensor_mapping.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update gguf-py/gguf/tensor_mapping.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update gguf-py/gguf/tensor_mapping.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update gguf-py/gguf/tensor_mapping.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update gguf-py/gguf/tensor_mapping.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update src/llama-graph.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update src/llama-arch.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update src/llama-model.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update src/llama-model.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update src/llama-model.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update src/llama-model.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update src/llama-model.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* removed redundant hparam set
* enums for model sizes
* conversion for modern-bert model supported rather than just granite-small
* Update src/llama-model.cpp
Co-authored-by: Gabe Goodhart <ghart@us.ibm.com>
* Update src/llama-model.cpp
Co-authored-by: Gabe Goodhart <ghart@us.ibm.com>
* fixed ordering of enum for freq_base_swa
* fixed where I added residual, now gives much much better embeddings~
* readded cacheless logic
* removing whitespace
* conversion now working for swa pattern - dense every n layers
* modern bert put into seperate src file
* removing whitespace
* fixed whitespace and newline errors in editorconfig job
* Update convert_hf_to_gguf.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* better naming convention, n_swa_pattern -> swa_period
* reusing sliding_window_pattern key rather than making new dense_every_n_layers key, and adding writing and reading support
* fixing pyright type-check fail
* Update convert_hf_to_gguf.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update gguf-py/gguf/gguf_writer.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update src/llama-hparams.h
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update src/llama-model-saver.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update src/models/modern-bert.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update src/models/modern-bert.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update src/models/modern-bert.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update gguf-py/gguf/gguf_writer.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update src/models/modern-bert.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update src/models/modern-bert.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update src/llama-model.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update src/llama-model-loader.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update src/llama-model-loader.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update src/llama-model-loader.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* added descriptions in llama-model
* fixed tensor mappings for conversion
* Update src/llama-model.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update src/llama-model.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* mapping name for size
* nits
* unused
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Gabe Goodhart <ghart@us.ibm.com>
* Uncached model read
* Removing additional --mmap arg
* Removing trailing whitespaces
* Adding fallback when O_DIRECT is not supported
* Remove branching in llama-model-loader.cpp and reduce code duplications in llama-mmap.cpp
* Adding maybe unused keyword for Mac and Windows.
* File seek aligned
* Removing all branches for direct_io in llama-model-loader.cpp
* Always use alignment from llama_file
* use_mmap=true
* llama: automatically fit args to free memory
llama-fit-params tool
* fix CI
* hints for bug reports, ensure no reallocation
* fix segfault with Vulkan
* add llama-fit-params to CI
* fix CI
* fix CI
* fix CI
* minor adjustments
* fix assignment of 1 dense layer
* fix logger not being reset on model load failure
* remove --n-gpu-layer hint on model load failure
* fix llama-fit-params verbosity
* fix edge case
* fix typo [no ci]
* First attempt
* No permute during convert (fixes qk tensors), proper norm application.
* RoPE = NeoX
* Coherence!
* Migrate xielu params from tensors to hyperparameters
* Simple CUDA kernel
* Revert stupid LLM refactorings
* Chat template support
* configchecker / flake8 errors
* Reorder unary.cu
* I do conclude that LLMs are, in fact, stupid.
* Fix after merge
* Final newline
* Make xIELU an UNARY_OP
* Final newline
* Correctly account for parameter shift
* Argh.
* Update ggml/src/ggml-cpu/unary-ops.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Refactor: remove unused methods, inline and factorize softplus, add const modifiers
* Revert CUDA changes, implement xIELU as a separate OP
* Pesky newline
* Add float2half / half2float for F16 inputs/outputs
* CUDA variants, attempt 2
* Actually, attempt 3
* Update ggml/src/ggml-cuda/unary.cu
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* Missing convert header
* Proper formula and reference for xIELU in the comments.
* Modify unary-ops.cpp to add the functor-based logic besides the template system to retain optimizations
* Apply suggestions from code review
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Add tensor mappings for Apertus to global list instead
* Fix lazy on scalars
* Update ggml/src/ggml-cuda/unary.cu
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* Add comment about the constraints on positive/negative alpha
* Change `softplus` to `ggml_softplus`
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* feat: Add NEMOTRONH to python arch enum
https://github.com/ggml-org/llama.cpp/issues/nemotron-nano-15409
Branch: gabe-l-hart/nvidia-nemotron-nano-15409
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Add NEMOTRONH to c++ arch enum
https://github.com/ggml-org/llama.cpp/issues/nemotron-nano-15409
Branch: gabe-l-hart/nvidia-nemotron-nano-15409
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Add NEMOTRONH to llama-arch layer map
https://github.com/ggml-org/llama.cpp/issues/nemotron-nano-15409
Branch: gabe-l-hart/nvidia-nemotron-nano-15409
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: First pass at conversion for nemotronh
https://github.com/ggml-org/llama.cpp/issues/nemotron-nano-15409
Branch: gabe-l-hart/nvidia-nemotron-nano-15409
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Add a verbose log for each tensor loaded
This is really helpful for diagnosing mismatches between the expected and
received tensors
https://github.com/ggml-org/llama.cpp/issues/nemotron-nano-15409
Branch: gabe-l-hart/nvidia-nemotron-nano-15409
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: First (broken) pass at nemotronh model architecture
It generates tokens, just not valid ones!
https://github.com/ggml-org/llama.cpp/issues/nemotron-nano-15409
Branch: gabe-l-hart/nvidia-nemotron-nano-15409
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Explicitly enable add_bos_token during conversion
The `tokenizer.json`/`tokenizer_config.json` in the model are a bit
contradictory. In the config, add_bos_token is set to False, but the
tokenizer model itself has a post_processor that adds the BOS token via
type: TemplateProcessing
https://github.com/ggml-org/llama.cpp/issues/nemotron-nano-15409
Branch: gabe-l-hart/nvidia-nemotron-nano-15409
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Use relu2 (LLM_FFN_RELU_SQR) for activation in FFN layers
https://github.com/ggml-org/llama.cpp/issues/nemotron-nano-15409
Branch: gabe-l-hart/nvidia-nemotron-nano-15409
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Only allocate attention cache for attention layers (not non-recurrent)
https://github.com/ggml-org/llama.cpp/issues/nemotron-nano-15409
Branch: gabe-l-hart/nvidia-nemotron-nano-15409
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Move residual add to after every block
https://github.com/ggml-org/llama.cpp/issues/nemotron-nano-15409
Branch: gabe-l-hart/nvidia-nemotron-nano-15409
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Use the correct norm tensor for the MLP blocks
https://github.com/ggml-org/llama.cpp/issues/nemotron-nano-15409
Branch: gabe-l-hart/nvidia-nemotron-nano-15409
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* Nemotron-H: MLP gate cleanup (pass NULL for unused gate)
This model does not use a gate in MLP blocks; pass NULLs for gate tensors to make intent clear and avoid unused-pointer noise.
* SSM: respect ssm_dt_rank for dt_dim when provided
Use GGUF-provided time_step_rank (ssm_dt_rank) to set dt_dim when > 0; fallback to max(64, n_embd/16).
* fix: plamo2 - revert dt_dim to default (remove ssm_dt_rank usage)
* Rename nemotronh to nemotron_h for consistency
- Update architecture name from NEMOTRONH to NEMOTRON_H in constants.py
- Change architecture string from 'nemotronh' to 'nemotron_h' in all files
- Update enum LLM_ARCH_NEMOTRONH to LLM_ARCH_NEMOTRON_H
- Update class name llm_build_nemotronh to llm_build_nemotron_h
- Consistent naming with underscore convention (nemotron_h vs nemotronh)
* feat: Support conversion for older NemotronH models
https://github.com/ggml-org/llama.cpp/issues/nemotron-nano-15409
Branch: gabe-l-hart/nvidia-nemotron-nano-15409
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
---------
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Maicon Domingues <dominguesm@outlook.com>
Co-authored-by: weatherman <fxdstudios@gmail.com>
* llama/ggml: add LLM training support
more compact progress bar
llama_save_model_to_file
llama_opt_param_filter
ggml_graph_dup force_grads
refactor ggml_opt, fix test-opt
* remove logits_all
* refactor CUDA implementation for ACC
* reset graph at beginning of opt period
* model : print tensor size during load
* cont : fix units MB -> MiB
Co-authored-by: Diego Devesa <slarengh@gmail.com>
---------
Co-authored-by: Diego Devesa <slarengh@gmail.com>
* impl::load change map bpe_ranks to onordered map for reduce time of impl::load on 30%
* llama_model_loader::init_mapping - replace new llama_mmap to std::make_unique<llama_mmap> for clean code & reduce (/2) time of running init_mappings
* Update src/llama-vocab.cpp
---------
Co-authored-by: lexasub <empty@empty.ru>
Co-authored-by: Diego Devesa <slarengh@gmail.com>
* GGUF: C++ refactor, backend support, misc fixes
remove ggml_tensor.backend
update CODEOWNERS [no ci]
remove gguf_get_data from API
revise GGUF API data types