* feat(convert): Get language model conversion working for 4.1 vision
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat(convert): Skip multimodal tensors for GraniteMoeHybrid (vision 4.0)
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Disable vocab padding for non-hybrid models that use GraniteMoeHybrid
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Plumb python-side vision projector names and mappings
There are several awkward things here:
1. Most of these are essentially identical to the audio qformer tensors. On
the c++ side, that's mapped using the prefix, so the rest of the GGUF
name needs to align, but on the python side there's no prefix notion, so
they all get duplicated.
2. There are a couple of net-new tensors for vision, in particular
PROJ_NORM. In both speech and vision, the QF_PROJ_NORM is qualified as
belonging to the qformer portion, but the GGUF name is simply proj_norm
which conflicts with the ideal name for this new PROJ_NORM that is not
qualified as part of the qformer. To get around this, I used
"proj_layernorm" as the GGUF name.
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Add python side architecture name
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Add python-side plumbing for setting FEATURE_LAYERS hparam
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Add c++ side tensor naming defines
NOTE: Usage of these hasn't been updated to include prefix yet
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat(mtmd): Convert vision_feature_layer to an ordered vector
We need to preserve the ordering of these feature index values so that they
can be mapped to the sub-tensors within the stacked projectors.
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat(mtmd): Add architecture label plumbing
Branch: Granite4Vision
AI-usage: full (OpenCode + qwen3.5:122b)
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat(wip): Add partial conversion for mmproj
This handles stacking the projector tensors and setting the new harams
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Add gguf_writer and constant support for new hparams and deepstack layer arr
Branch: Granite4Vision
AI-usage: draft (OpenCode + qwen3.5:122b)
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Full conversion for mmproj w/ tensor mappings
Branch: Granite4Vision
AI-usage: full (OpenCode + qwen3.5:122b)
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Add lm_head skip for mmproj for 4.0
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: De-alias text_config architecture in convert_lora_to_gguf.py
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Add --trust-remote-code arg to convert_lora_to_gguf.py
This defaults to False, but allows a user to enable it programmaticly
instead of using the interactive prompt.
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: De-alias model.language_model. -> model. for lora adapters
Branch: Granite4Vision
AI-usage: full (OpenCode + qwen3.5:122b)
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Extend language model tensor dealiasing in adapters
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Remove unnecessary registration for GraniteSpeech in language model
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Plumb through mm prefix formatting for qformer tensors
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* refactor: Refactor vision projector tensors to use predictor ID as the block
This is cleaner than stacking them. The modeling file hard-codes
single-layer qformers, so we can punt on the multiipule multi-layer
projectors problem.
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Add spatial offests array hparam conversion
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Add stub plumbing for granite vision in mtmd
Branch: Granite4Vision
AI-usage: draft (OpenCode + qwen3.5:122b)
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Add new hparam and tensor naming in clip-impl.h
New hparams:
- KEY_PROJ_SAMPLE_QUERY_SIDE
- KEY_PROJ_SAMPLE_WINDOW_SIDE
- KEY_PROJ_SPATIAL_OFFSETS
New tensors:
- TN_MULTI_PROJ_IMG_POS
- TN_MULTI_PROJ_QUERY
- TN_MULTI_PROJ_LAYERNORM
- TN_MULTI_PROJ_LINEAR
- TN_MULTI_PROJ_NORM
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Move deepstack_layer_arr to llm hparam instead of mmproj
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Remove IS_DEEPSTACK_LAYERS
This appears to have been added during Qwen3 VL
(https://github.com/ggml-org/llama.cpp/pull/16780), but it was never
actually used.
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* refactor: n_deepstack_layers -> deepstack_layer_arr
The old logic hard coded a correspondence between the first N layers of the
LLM and the 1->N entries in the input embeddings. Now, that relationship is
maintained at loading time if the GGUF value is single-valued. If it is
multi-valued, it loads directly allowing for deepstack layers to be spaced
out throughout the model.
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Use try/catch for single/multi valued deepstack info
The alternative would be to use get_key_or_arr, but then the single value
would be populated through the entire array and we'd need to detect that
and update it with the right correspondence.
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Add deepstack injection point for granite LLM
The use of ggml_add here assumes that the elements of inp_embd will be pre-
arranged to be the full embedding length with only the vision-mask'ed
portions non-zero from the projector. This matches how Qwen3VL does it.
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: add missing vision attn layernorm eps
Branch: Granite4Vision
AI-usage: full (OpenCode + Qwen 3.6-35B)
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* refactor: Hoist qformer tensors into qf_block and hold a vector for multi-proj
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Fix missing prefix template for TN_QF_PROJ_LINEAR
It's not strictly necessary since vision uses the blockwise version, but it
makes the loading consistent.
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Add embedding scale and image grid pinpoints hparams in conversion
Also remove dead parsing for self._deepstack_layer_arr
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Add mtmd KEY_ section for hparams shared with the LLM
In this case, we need the EMBEDDING_SCALE so we can unscale the image
embeddings to compensate for applying embedding scale to the input
embeddings
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Implement c++ hparam parsing
Branch: Granite4Vision
AI-usage: draft (Claude Code)
Co-authored-by: Eli Schwartz <eliyahu.schwartz@ibm.com>
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Flatten pinpoints in conversion
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Add missing break
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: No reason to have modality prefix for img_pos
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Add tensor loading
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix(convert): Fix confusion between proj.norm and proj.qformer.layernorm
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Use the right portion of speech for tensor loading!
Also plumb through the layernorm -> post_norm naming change
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Add logging of deepstack_layers_arr if set
I also changed the print_f output type to int32_t to avoid printing
overflow values for -1. This could cause overflows on the other side, but
I can't imagine a value for any of the current array hparams that would
trigger that.
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Make sure input embeddings are cont before f_embedding_scale
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Add init and mmproj_embd cases for g4v
The n_mmproj_embd is 1+ to make space for the text embedding and all 8
projectors
Branch: Granite4Vision
AI-usage: draft (Bob)
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Invert (h, w) -> (w, h) pinpoints
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Reorder projectors based on llm index and skip the first injection
The multi-projector stack has a strange asymmetry based on how it's
currently implemented for qwen3vl: on the mmproj side, it's all N
projectors, but the output of the "first" (by inp_embd index) projector is
automatically consumed as if it were a standard single-projector mmproj,
so the deepstack portion needs to only contain the 1-N entries.
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Eli Schwartz <eliyahu.schwartz@ibm.com>
* fix: Fix mmproj hparams in conversion
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Eli Schwartz <eliyahu.schwartz@ibm.com>
* fix: Fix ordering/logic for deepstack injection in granite
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Eli Schwartz <eliyahu.schwartz@ibm.com>
* fix: Fix preprocessing config to match what the model needs
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Eli Schwartz <eliyahu.schwartz@ibm.com>
* wip: Partial port of Eli's implementation
This is still pretty broken, but it's getting closer. It now happily
generates tokens, but the values are quite incorrect still. I suspect it's
caused by the mapping of projectors from safetensors to their respective
orders here.
Also, this implementation breaks encapsulation pretty badly in mtmd_encode.
This will need a big refactor to put the G4V-specific encoding logic
somewhere more appropriate.
Branch: Granite4Vision
AI-usage: draft (Claude Code, Bob)
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Eli Schwartz <eliyahu.schwartz@ibm.com>
* fix: Fix the pre-scaling on the input embeddings to correctly invert the scale
We've got tokens! They still don't line up quite right, so something's a
little off, but we're getting much closer now.
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: invert embedding multiplier -> base_scale at load
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Fix setting image_resize_pad after new enum introduced
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Add G4V to mmproj mapping in conversion
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Re-add padding disable for non-hybrid hybrid models
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* refactor: Simplify G4V n_tokens computation
This is slightly more efficient and flexible for when we implement the
unpad cropping. IMO, it's also clearer that it is adding the number of
image_newline tokens (embeddings) to the grid, rather than recomputing the
entire count.
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Add new clip APIs for post-tile-encoding assembly
Granite 4 Vision uses llava-next style pack-and-unpad which requires
injecting the learned newline after each row of the tile grid. A row here
is a single row of the grid which is composed of (grid_x * cols_per_tile) *
(grid_y * rows_per_tile), so the result is newlines injected in between
individual tile rows, thus not something that can be handled with the
standard llava-uhd block-wise endcoding.
Branch: Granite4Vision
AI-usage: draft (Claude Code + Opus 4.7)
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Add model interfaces for granite 4 vision assembler
I'm on the fence about the best organization of this. These free functions
allow the per-architecture logic in clip.cpp to access the model-specific
graph building, but they still require a fair bit of model-specific logic
in clip.cpp which is not ideal.
I think a better approach may be to replicate what is done with the
graph builders themselves (and possibly even make the assembler part of the
model's existing graph builder).
Branch: Granite4Vision
AI-usage: full (Claude Code + Opus 4.7)
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* refactor: Remove all g4v-specific branching from mtmd.cpp in favor of clip assembler
Branch: Granite4Vision
AI-usage: full (Claude Code + Opus 4.7)
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* refactor(mtmd): Consolidate assembler logic into clip_assembler class family
Just like `clip_graph` is the base class for building the model-specific
encoder graphs, `clip_assembler` will be the base class for building the
model-specific assembler graphs. This allows the assembly pattern to follow
how the encoder pattern is implemented where the model-specific logic lives
in a subclass co-located with the encoder graph builder that gets
constructed by a simple factory method.
Branch: Granite4Vision
AI-usage: full (Claude Code + Opus 4.7)
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* style: Comment improvement
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* refactor: granite_vision -> granite4_vision
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Remove dead codepath for Qwen3VL add_vision_is_deepstack
These pieces were never used on the c++ side (removed there in an earlier
commit), so this is just cleanup that I missed before.
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Oops! I did not mean to commit one of my prompt files
But now it's too far back in history to effectively rebase out, even with
interactive and --rebase-merges :(
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Add missing <algorithm> include for std::find
It seems that this was already pulled in on some platforms, but not on
others
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Fix Flake8 warnings in granite conversion module
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* refactor: Remove clip_assembler in favor of clip_image_f32.append_token
Per conversation in the PR, the clip_assembler pattern was too invasive.
This is a compromise that limits model-specific blocks to add_media where
each preprocessed tile is annotated with an injection type, after which all
the token counting logic is generic and the newline injection itself is
handled in the graph based on the value for the given tile image.
Branch: Granite4Vision
AI-usage: draft (Bob, OpenCode + Qwen 3.6 35b)
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* refactor(convert): Split n_deepstack_layers and deepstack_layers (array)
Branch: Granite4Vision
AI-usage: full (Bob, OpenCode + Qwen3.6-35b)
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* refactor(src): Handle n_deepstack_layers and deepstack_layers GGUF keys
Branch: Granite4Vision
AI-usage: draft (Bob, OpenCode + Qwen3.6-35b)
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Fix GGUF key for deepstack_layers_arr
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* refactor: Remove pre-scaling embeddings and skip scaling for raw embd inputs
This follows how gemma3 and gemma4 handle embedding scaling by skipping the
multiplier for raw input embeddings.
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* refactor: deepstack_layers(_arr) -> deepstack_mapping(_arr)
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* refactor: Fully revert changes to n_deepstack_layers and qwen3vl*
Since we're going to keep the GGUF KVs separate, it makes sense to just
keep the hparams separate too to limit the scope of this branch. The down
side is that n_deepstack_layers and deepstack_mapping_arr are potentially
conflicting.
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Revert removal of "is_deepstack_layers" GGUF KV
This KV is not used at all on the c++ side, so it's fully dead, but there's
also no need to conflate this cleanup with the addition of G4V.
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Remove unnecessary ggml_cont and build_forward_expand in cbx
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* style: Clean up comments
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Tighter and more flexible code for g4v_build_block
This could be refactored to look a lot more like granite-speech, but the
overall block constructs before/after the qformer are pretty different, so
for now I'm going to leave it as is and just tighten a bit.
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Remove unnecessary `unordered_set` include
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Add architecture guard on deepstack_mapping_arr printout
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Remove unnecessary AI-gen comment
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Always initialize deepstack_mapping_arr with -1 values
This was causing `test-llama-archs` to fail, likely due to trying to save
the uninitialized values, then re-loading them. It's safer to always
initialize so that other models don't forget and end up with undefined
behavior.
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* style: Remove TODO about block/vs non-block tensor mapping
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* refactor: Move is_vision_feature_layer logic into clip_hparams
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* refactor: Use a bool for append_token
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* style: Remove unnecessary comment
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Remove unused get_model api
yikes!
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* refactor: Rearrange helpers for g4v to be private members and use build_attn
Branch: Granite4Vision
AI-usage: full (Bob, OpenCode + Qwen3.6-35b)
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Fix off-by-one in vision layer index
This was inherited from the Claude Code implementation that pushed the
negative index inversion down into the model file.
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Fix norm/post_norm mixup in conversion
face. palm. :(
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* style: More descriptive tensor names
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Apply PR cleanup for new conversion changes
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* fix(convert): Remove duplicate V_ENC_EMBD_IMGNL
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* refactor: append_token -> add_newline
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* style: Comment cleanup
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Cleaner error handling/checking
NOTE: format_string is not available in granite.cpp (and including
clip-impl.h to get it doesn't compile, so I think it violates the intended
encapsulation), so std::stringstream is the simplest answer.
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
---------
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* initial talkie support, coherent
* reorder to follow convention
* absorb inverse rope
* stop folding scalars to improve quantization
* use broadcasting instead of duplication
* style cleanup
* add scaling support to LoraTorchTensor; use that path in conversion
* use layer_out_scale instead of embd_skip_scale
* move conversion code to a dedicated conversion directory and split the files akin to the src/models architecture
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* convert : add split() method to LoraTorchTensor
* Fix python type-check
* Fix flake8 Lint
* fix: handle positional dim arg in torch.split dispatch
* Fix type-check again
* Fix type-checks
* Remove unit test per reviewers feedback
* work around ty deficiency
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* requirements : update transformers to 5.5.0
This commit updates the transformers dependency to version 5.5.0.
The motivation for this is that transformers 5.5.0 includes support for
Gemma4 and is required to be able to convert Gemma4 models. This is also
causing issues for user of gguf-my-repo.
Refs: https://huggingface.co/spaces/ggml-org/gguf-my-repo/discussions/202
* fix huggingface_hub version
* set version of transformers to 5.5.0
* convert : add ty ignore directives to convert_hf_to_gguf.py
This commit adds `ty: ignore` directives to transformers tokenizers
field/methods to avoid type check errors. There might be better ways to
handle this and perhaps this can be done in a follow up commit.
The motivation for this is that it looks like in transformers 5.5.0
AutoTokenizer.from_pretrained can return generic tokenizer types or None
and the type checker now produces an error when the conversion script
accesses field like tokenizer.vocab.
* convert : add ty ignore to suppress type check errors
* convert : remove incorrect type ignores
* convert : fix remaining python checks
I was running a newer version of ty locally but I've switched to
version 0.0.26 which is what CI uses and I was then able to reproduce
the errors. Sorry about the noise.
* update transformers version to 5.5.1
* fix: TypeError when loading base model remotely in convert_lora_to_gguf
* refactor: simplify base model loading using cache_dir from HuggingFace
* Update convert_lora_to_gguf.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* feat: add remote_hf_model_id to trigger lazy mode in LoRA converter
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* feat: Add python-side constants and conversion for adapter.lora.invocation_string
Branch: gabe-l-hart/alora-support
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Add c++ side constants for adapter.lora.invocation_string
Branch: gabe-l-hart/alora-support
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Parse invocation string for adapters from GGUF
Branch: gabe-l-hart/alora-support
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix(python): Update conversion to alora_invocation_tokens
This is the preferred method in PEFT which is the source of ground truth
https://github.com/huggingface/peft/pull/2609/files#diff-13380145401d203d5935c5189dd09879f990b81aa63e8e3aaff8ce9110333f0e
Branch: gabe-l-hart/alora-support
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix(cpp): Update to alora_invocation_tokens on c++ side
Branch: gabe-l-hart/alora-support
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Add C APIs to get alora invocation token array from lora
Branch: gabe-l-hart/alora-support
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Initial implementation of alora cache logic in server
This does not yet do the part to identify the invocation tokens and only
apply the lora adapter afterwards, but it does seem to produce correct
results if the invocation tokens are the beginning of the uncached input.
Branch: gabe-l-hart/alora-support
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Identify alora invocation sequences
This currently limits to a single enabled alora per slot. Multiple aloras
with different invocation sequences would be possible, but it would require
a more complex integration of the adapter toggling and is not really a well
studied case for alora since it's unclear if one alora can reuse cache from
previous prefill computed with a different alora.
Branch: gabe-l-hart/alora-support
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Only reuse cache for tokens before the alora invocation start
This is a bit of an edge case, but theoretically a user could try the same
query with the alora disabled (just using the base model), then retry with
the alora. The cached tokens from the first pass should be invalid.
Branch: gabe-l-hart/alora-support
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Handle un-cached tokens that come before the alora activation
The solution is to only fill up to the token before the invocation start in
the batch if there are any tokens to be prefilled between those pulled from
cache and the invocation start. When this is detected, the alora is
temporarily disabled with a scale of 0.0, then immediately re-enabled after
it has been initialized for the internal graph. Since the batch does not
complete the prompt tokens, the remaining prompt tokens are handled in the
next task, pulling all of the non-alora tokens from cache and proceeding
with prefill for the alora tokens.
Branch: gabe-l-hart/alora-support
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Use || instead of 'or'
Too much python 🤦
Branch: gabe-l-hart/alora-support
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Fix off-by-one for limiting cached tokens to before alora start
This was the cause of the inconsistent results from the dummy test script
with and without the turn that runs the prompt without the adapter before
running it with the adapter.
Branch: gabe-l-hart/alora-support
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Support backwards-compatibility for "invocation_string" in adapter_config.json
While this has been replaced in the PEFT PR in favor of
alora_invocation_tokens, the existing adapters in the ibm-granite org on HF
use "invocation_string," so this will enable backwards compatibility and
enable testing now (before PEFT PR changes have percolated everywhere).
Branch: gabe-l-hart/alora-support
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Remove duplicate logging
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* feat: Report alora_invocation_string and alora_invocation_tokens from /lora-adapters
Branch: gabe-l-hart/alora-support
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
---------
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* (wip) support mergekit-extracted lora
* support mergekit-extract-lora
* use lora->get_scale
* correct comment
* correct norm name & condition
* add some hints
* convert-lora : make `--base` optional
* lint
* handle case where base_model_name_or_path is invalid
* do not include metadata from base model
* clarify unspecified --base
* add small comment [no ci]
* trigger ci
* convert : refactor rope_freqs generation
This should also fix vocab-only conversion for Phi-3.
* convert : adapt MiniCPM3 to separate rope_freqs insertion
MiniCPM3's tokenizer is treated as a SentencePiece tokenizer to avoid
having to run its custom Python code which mixes tokenization
in the same file as tool calls.
gguf-py : add long and short RoPE factors to tensor mappings
Empty, but the key names are used to populate the mappings.
* gguf-py : fix some metadata name extraction edge cases
* convert_lora : use the lora dir for the model card path
* gguf-py : more metadata edge cases fixes
Multiple finetune versions are now joined together,
and the removal of the basename annotation on trailing versions
is more robust.
* gguf-py : add more name metadata extraction tests
* convert_lora : fix default filename
The default filename was previously hardcoded.
* convert_hf : Model.fname_out can no longer be None
* gguf-py : do not use title case for naming convention
Some models use acronyms in lowercase,
which can't be title-cased like other words,
so it's best to simply use the same case
as in the original model name.
Note that the size label still has an uppercased suffix
to make it distinguishable from the context size of a finetune.
Main thing is that the default output filename will take this form
{name}{parameters}{finetune}{version}{encoding}{kind}
In addition this add and remove some entries in the KV store and adds a metadata class with automatic heuristics capability to derive some values based on model card content
* No Change:
- Internal GGUF Spec
- `general.architecture`
- `general.quantization_version`
- `general.alignment`
- `general.file_type`
- General Model Details
- `general.name`
- `general.author`
- `general.version`
- `general.description`
- Licensing details
- `general.license`
- Typically represents the converted GGUF repo (Unless made from scratch)
- `general.url`
- Model Source during conversion
- `general.source.url`
* Removed:
- Model Source during conversion
- `general.source.huggingface.repository`
* Added:
- General Model Details
- `general.organization`
- `general.finetune`
- `general.basename`
- `general.quantized_by`
- `general.size_label`
- Licensing details
- `general.license.name`
- `general.license.link`
- Typically represents the converted GGUF repo (Unless made from scratch)
- `general.doi`
- `general.uuid`
- `general.repo_url`
- Model Source during conversion
- `general.source.doi`
- `general.source.uuid`
- `general.source.repo_url`
- Base Model Source
- `general.base_model.count`
- `general.base_model.{id}.name`
- `general.base_model.{id}.author`
- `general.base_model.{id}.version`
- `general.base_model.{id}.organization`
- `general.base_model.{id}.url` (Model Website/Paper)
- `general.base_model.{id}.doi`
- `general.base_model.{id}.uuid`
- `general.base_model.{id}.repo_url` (Model Source Repository (git/svn/etc...))
- Array based KV stores
- `general.tags`
- `general.languages`
- `general.datasets`
---------
Co-authored-by: compilade <git@compilade.net>
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
* lora: load to devide buft
* add patch tensor function
* correct tensor patch
* llama_lora_adapter_apply
* correct ggml_backend_tensor_copy
* add llm_build_mm
* fix auto merge
* update based on review comments
* add convert script
* no more transpose A
* add f16 convert
* add metadata check
* add sanity check
* fix ftype
* add requirements
* fix requirements
* fix outfile
* conversion: only allow selected models
* fix types
* cuda : do not use dmmv if the tensor does not have enough cols
* llama : lora fixes
* do not disable mmap with lora
Co-authored-by: slaren <slarengh@gmail.com>
* llm_build_lora_mm_id
* convert_lora : MoE LoRA conversion support
* convert_lora : prefer safetensors, similarly to convert_hf
* convert_hf : simplify modify_tensors for InternLM2
* convert_lora : lazy conversion
* llama : load and use alpha from LoRA adapters
* llama : use llm_build_lora_mm in most model graphs
* auto scale
* Revert "auto scale"
This reverts commit 42415a4874e0f963e4aca6796ea5dfb97cd17464.
* remove redundant params
* Apply suggestions from code review
Co-authored-by: slaren <slarengh@gmail.com>
* change kv metadata
* move add_type to __init__
* convert_hf : move add_type to main()
* convert_lora : use the GGUFWriter from Model instead of overwriting it
---------
Co-authored-by: slaren <slarengh@gmail.com>
Co-authored-by: Francis Couture-Harpin <git@compilade.net>