ik_llama.cpp

mirror of https://github.com/ikawrakow/ik_llama.cpp.git synced 2026-06-28 04:30:15 -05:00

Author	SHA1	Message	Date
firecoperana	befbc0945b	server: variance based checkpoint eviction (#2020 ) Co-authored-by: firecoperana <firecoperana>	2026-06-24 08:54:07 +02:00
magikRUKKOLA	72440a19fc	on-demand tensor reload (#1989 ) * host-swap tensor loop the host-swap functionality is only triggered when certain env. variables are declared * target_include_directories tweak * hot-swap tensor support two intrusions: 1.) at the model loading to collect the snapshot 2.) the modification of the `/health` HTTP endpoint to be able to trigger the hot-swap via sending the `llama-server` the HTTP-request. both are braced by the specific env. variables hot-swap tensor support; graph invalidation ggml_backend_cuda_invalidate_graphs export * hot-swap tensor support graph invalidation implementation; extended debug output (commented out) * llama_reload_changed_tensors export * tensor hot-swap on-demand reload cpu-only/hybrid/gpu-only with split mode layer/graph full support implementation * docs * reuse the gguf parsing from llama.cpp gguf_init_from_file, gguf_find_tensor, ggml_get_tensor * remove the manual scheduling for hybrid inference * update docs * tensor shape validation * update docs * update docs accidentally wiped the previous changes; so recovered them * revert the GGML_CUDA_MAX_DEVICES to 16 * update llama_reload_changed_tensor update llama_reload_changed_tensor, revert CMakeLists.txt * update llama_reload_changed_tensor * GGML_MAX_SRC GGML_MAX_SRC compile-time definition support * GGML_MAX_SRC GGML_MAX_SRC compile-time definition support * GGML_MAX_SRC GGML_MAX_SRC compile-time definition support * llama_reload_changed_tensor update llama_reload_changed_tensor definition * refactor move the tensor-reloading implementation to llama-reload.cpp, llama-reload-info.h; some bugfixes and code reduction * revert added back the missing newline * update docs * reload_info constructor * bugfix: cpu-only TODO: improve the working environment by compiling for multiple hardware configurations; possibly make a test pipeline * cpu-only bugfix set the fix again after unsuccessful sync with main * windows os compilation fix #include <string> * fix windows os build error C2039: 'string': is not a member of 'std' * remove dead file * implement perplexity in server * Revert "implement perplexity in server"	2026-06-22 16:36:34 +02:00
Kawrakow	71af16a6b7	Fix DFlash performance with split mode graph (#1980 )	2026-06-17 18:40:02 +02:00
Jun Yamog	064d23a6f8	Codex CLI Responses Compatibility (#1964 ) * responses: skip known unsupported Responses tool types from Codex CLI - Skip namespace, web_search, image_generation tools instead of HTTP 500 - Reject unknown non-function tool types with controlled error - Preserve function tool conversion logic unchanged Fixes Codex CLI 0.133.0 compatibility where it sends mixed tool types. * responses: harden codex compatibility coverage * responses: expose Codex model catalog metadata	2026-06-16 15:28:16 +02:00
Kawrakow	f9078e169b	Merge pull request #1970 from SamuelOliveirads/feat/dflash-implementation Add DFlash support	2026-06-16 15:07:55 +02:00
Kawrakow	11c9935ce8	Merge pull request #1893 from ikawrakow/ik/gemma4_mtmd_blindness Fix Gemma4 vision	2026-06-16 07:47:37 +02:00
Kawrakow	e927adc4ad	Merge pull request #1969 from Farmadupe/resize_algo_fix Correct image resize algorithm for all qwens after qwen2vl and gemma4	2026-06-15 13:39:11 +02:00
SamuelOliveirads	6cae8c7ba2	clean logs	2026-06-14 21:07:57 -03:00
Thomas Green	19f08160ad	Correct image resize algorithm for all qwens after qwen2vl and gemma4	2026-06-14 21:57:11 +01:00
Thomas Green	574f22b3c7	Replace image resizers with avx2/neon simd impls from stb_img_resize2.h	2026-06-14 20:28:08 +01:00
SamuelOliveirads	0d75eee35a	remove duplicated code and unnecessary refactor	2026-06-14 16:02:02 -03:00
SamuelOliveirads	3a1d46c4d1	Merge remote-tracking branch 'origin/main' into feat/dflash-implementation # Conflicts: # common/common.cpp # common/speculative.cpp # convert_hf_to_gguf.py # examples/server/server-context.cpp # examples/server/server-context.h # src/llama-arch.cpp # src/llama-arch.h # src/llama-model.cpp # src/llama.cpp	2026-06-13 17:27:52 -03:00
Samuel Oliveira Alves	8a38025174	Refactor: Move spec outside server (#1949 ) * Refactor speculative decoding: move logic outside of server * remove duplicated tokens in mtp kv cache * narrow to only discard draft cells in MTP * revert mtp_speculative_gen_draft	2026-06-12 18:12:39 +02:00
Simon Lundell	b1eb8bb0a1	server: gate llama_decode_stop() to the active decode (fix queued-cancel cascade) (#1941 ) With --parallel 1, a client disconnect/timeout on a queued request aborts the active decode of a different client (llama_decode: failed to decode, ret = -3 / "Decode process is cancelled by user"), releasing the slot with the request unfinished. To the active client the stream silently stalls and never returns, while the server reports healthy — easy to misdiagnose as a network/proxy wedge. Root cause: llama_decode_stop() signals a process-global stop flag that the active decode loop polls. examples/server/server.cpp calls it ungated from the request reader's connection-closed paths, so any reader closing (including a queued, not-yet-running task's) trips the global flag against whatever decode is currently active. Adjacent to #1576/#1673 ("clear sticky stop flag" + hybrid/recurrent ret=-3), which did not gate these call sites against non-active readers, so the queued-cancel-kills-active cascade still fires on current main. Fix (minimal gate): add server_response_reader::any_task_on_slot() and gate the three llama_decode_stop() sites on it, so the global stop is signalled only when one of THIS reader's tasks is on a slot (the active decode). A queued task's disconnect then only drops that queued task. Verified in production under heavy concurrent, frequently-cancelled load (hundreds of queued-task cancels, zero active-decode kills). Stdlib-only reproducer in the PR description. Caveat: any_task_on_slot() reads the slots vector from the reader thread — the same race class as the existing process-global flag; can be tightened to a per-context/per-task cancellation if preferred.	2026-06-12 08:25:44 +02:00
Kawrakow	e6f8112f3b	Adjust CUDA FA kernel parameters for head size 512 on Turing (#1942 )	2026-06-10 07:49:21 +02:00
firecoperana	2a1148384c	server: fix double submits of infill (#1944 ) Co-authored-by: firecoperana <firecoperana>	2026-06-10 07:48:15 +02:00
Farmadupe	6b9de3dbaa	Fix mrope application across chunk boundaries (Fixes #993 and #1902 -- part 2) (#1918 ) * (qwen3vl) Correct calculation for injection point of deepstack image embeddings Injection point for deepstack embeddings used hyperparameter n_embd_inp(), which caused the hidden state to be double accounted for, causing an OOB array access. The correct accessor is n_embd() * Fix m-rope when pipeline parallelism is enabled	2026-06-05 17:10:02 +02:00
Farmadupe	1520eda980	prompt cache: Fix assertion that prompt cache does not rewind to middle of image (#1913 )	2026-06-04 17:53:06 +02:00
Samuel Oliveira Alves	007d640098	Standardize speculative decoding arguments on the server (#1908 ) * refactor spec args * add shell-safe quoting of string-valued stage keys in speculative decoding	2026-06-04 15:44:57 +02:00
firecoperana	6c0180d702	server: enable mcp proxy (#1904 ) * update http lib * Add cors proxy --------- Co-authored-by: firecoperana <firecoperana>	2026-06-04 15:43:07 +02:00
firecoperana	074fc7dafd	webui: update llamacpp webui (#1903 ) update config ui: fix audio and video modality detection (#23756) When model props are fetched asynchronously from the server, modelPropsVersion is incremented to trigger reactivity, but only the vision effect was listening to it. webui: update ignore files ui: handle audio/vnd.wave as audio WAV file (#23754) Firefox on Linux uses this MIME type ui: exclude generated build dirs from prettier and eslint so lint errors stop being masked (#23910) webui: add custom CSS injection via config (#23904) * webui: add custom CSS injection via config register a customCSS setting in the Developer section under Custom JSON, syncable so it rides the existing ui-config pass through. inject the value into a single style element in the head, reactive on the setting. lets an operator theme a prebuilt binary through --ui-config without rebuilding, and lets a user set it from the settings panel. move the textContent write into a use: action on the head style node. the action is the idiomatic way to touch a node, so the no-dom-manipulating lint rule is satisfied without a disable. value stays text through textContent, never parsed as HTML. * Update tools/ui/src/lib/constants/settings-keys.ts Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> * ui: address review from @allozaur, rename custom config key to customJson with migration rename the custom config key to customJson across the type, the chat request builder, the settings save check and the custom tools reader, keeping the custom API param name unchanged. add a non destructive migration that copies the legacy custom key to customJson at startup. only render the head style tag when custom CSS is set. --------- Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> server: real-time reasoning interruption via control endpoint (#23971) Builds on the manual reasoning budget trigger from #23949. 
Adds a CONTROL task that mirrors the CANCEL path on the live slot and calls common_sampler_reasoning_budget_force to end thinking mid-generation. POST /v1/chat/completions/control with { id_slot, action }, opt-in reasoning_control arms the budget sampler on demand. Router and single model. Minimal WebUI button as a skeleton for further UI work. * ui: track reasoning phase via explicit streaming state Add isReasoning to the chat store, mirroring the isLoading pattern: per conversation map, private setter, public accessor and reactive export. Set from the stream callbacks, true on reasoning chunks, false on the first content chunk, reset on stream end and resynced on conversation switch. The skip button now keys off isReasoning so it shows only during the thinking phase, not the whole generation. * ui: extract control endpoint and action into constants Move the chat completion routes, the slots route and the reasoning control action out of chat.service into api-endpoints and a dedicated control-actions module. No behavior change, drops the magic strings so the control protocol has a single source of truth. * server: target reasoning control by completion id Address @ngxson review on the control endpoint. Switch from id_slot to the chat completion id to avoid a TOCTOU: the slot can be reassigned between the lookup and the control request, so matching the live completion (oaicompat_cmpl_id) is safe and a finished one simply matches nothing. Rename the action to reasoning_end, guard it on the reasoning_control flag of the target slot, and reduce the response to {success} with an optional message. * ui: target reasoning control by completion id Keep the streamed completion id on the message and post it back to the control endpoint instead of probing /slots. Drops the slot discovery and the TOCTOU that came with it. Action renamed to reasoning_end, response read as {success}. 
* server: address review from @ngxson Move the control fields into task_params and drop the redundant comments on the control path. * server: document the reasoning control endpoint * Update tools/ui/src/lib/types/database.d.ts Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> * ui: rename cmplId to completionId Per @allozaur review, clearer name for the streamed completion id. * ui: wire completion id capture through the agentic flow The webui streams through the agentic flow, which relayed onModel but not onCompletionId, so the completion id never reached the message and the control request was never sent. Relay it through the flow and its callbacks type, declare id on the chunk type, and log an explicit error when the button fires without a usable id. * ui: target reasoning control model from the message The model is a property of the completion, so read it from the streaming message like the id, not from the model dropdown which is unrelated UI state. Makes the request self-consistent by construction instead of just unlikely to drift. --------- Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> ui: Add Thinking mode toggle with reasoning effort levels + improvements for Chat Form Add Action UI (#23434) Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> * fix: Model tags ui: simplify network error handling (#23431) Previously error to string conversion was split in two different files, with one converting errors into strings, and another function analyzing those strings to generate yet another string. Now the error handling for network fetches has been centralised and uses directly HTTP error codes where possible to generate the human-readable error strings. It also fixes an issue where all JSON errors reported from the backend, such as "Invalid API key", would get incorrectly turned into "Failed to connect to server" due to poor matching logic in the now-gone getErrorMessage function. 
update html ui: Mermaid Diagrams in chat + interactive preview (#24032) webui: fix tool selector toggle/counter, key tools by stable identity (#24065) * webui: fix tool selector toggle/counter, key tools by stable identity Key the disabled set, counts and toggles by a stable per-tool key instead of bare function name, deduped from one canonical list. Per-tool checkboxes become presentational (single row handler, no nested button), category checkboxes drop the tristate (n/total carries partial). One getEnabledToolsForLLM keeps normalized MCP schemas and dedupes by name. * ui: use SvelteSet and SvelteMap for local tool collections to satisfy svelte/prefer-svelte-reactivity Co-authored-by: firecoperana <firecoperana>	2026-06-04 15:41:23 +02:00
Farmadupe	e08ad51f15	Insert image pad markers for kimi K2.5 and K2.6 (#1912 )	2026-06-04 09:27:28 +02:00
SamuelOliveirads	1250f522ed	add qwen, gemma and kimi dflash support	2026-06-01 17:14:25 -03:00
SamuelOliveirads	1369e68471	fix graph mask, swa layers and tokens positions	2026-05-31 11:12:03 -03:00
SamuelOliveirads	532499836e	improve DFlash caching and profiling capabilities	2026-05-30 21:36:10 -03:00
Samuel Oliveira Alves	3f40e73c36	expand np guardrail for all mtp types (#1901 )	2026-05-30 16:19:53 +03:00
SamuelOliveirads	9f5f70cf7e	implement target position tracking and context management	2026-05-29 23:11:38 -03:00
SamuelOliveirads	82cff238fe	Initial dflash implementation	2026-05-28 18:57:58 -03:00
Kawrakow	6648aa2e6e	Fix Gemma4 vision	2026-05-28 15:08:46 +00:00
Samuel Oliveira Alves	9f7ba245ab	Update autofix and presets (#1867 ) * Add configuration files for format, presets and examples * add clang in pre-commit config * remove clang configurations * Refactor .gitignore for consistency in formatting	2026-05-24 07:30:44 +03:00
dungquixote42	642c038ccd	Extend expiring logit bias to other sampling parameters (#1770 ) * initial commit * fix underflow bug, add debug prints, update macro/variable names * fix phrases-sharing-1-flag bug, replace macros with struct member function * cleanup * fix file parsing * string_split_open_close() -> string_extract(), improve escape handling * support multiple nested entries * make persistent entries global, simplify file parsing * cosmetic changes * add support for jumping to exitword * update variable names * fix bad search bug * better debug prints, reorg * replace lambda with string_is_found(), add string_unescape() for debug * add support for inline comments * add missing debug print macro * fix type promotion bug * actually fix type promotion bug	2026-05-23 19:19:12 +03:00
Samuel Oliveira Alves	d51036a0c4	fix: reset KV cache and prompt state in server_slot and server_context (#1860 )	2026-05-22 08:14:47 +03:00
Samuel Oliveira Alves	11a1fea9e2	Move embedding management to speculative (#1825 ) * refactor speculative decoding with companion context and draft result structures * feat: add common speculative feature handling in server context * refactor: move embeddings outside server * feat: harden draft input hidden state in llama context * remove unused functions * refactor: streamline speculative feature handling and remove unused code * remove redundant code * remove more unused variables * refactor: implement speculative feature handling	2026-05-20 17:42:48 +03:00
Samuel Oliveira Alves	77413bc900	Add Hadamard parameters to draft model loading (#1840 )	2026-05-19 18:30:41 +03:00
firecoperana	104846ddee	spec : discard last drafted token with low prob (#1820 ) * spec : discard last drafted token with low prob * Apply suggestion from @ikawrakow Co-authored-by: Kawrakow <iwankawrakow@gmail.com> --------- Co-authored-by: firecoperana <firecoperana> Co-authored-by: Kawrakow <iwankawrakow@gmail.com>	2026-05-19 08:35:35 +03:00
firecoperana	f645ed1e2d	AutoParser: improve reasoning budget and handling of space/newline in tool calls (#1819 ) common/chat, server: refactor, move all conversion functions to common, add tests (#20690) jinja : remove unused header (#22310) common : fix jinja warnings with clang 21 (#22313) Signed-off-by: Adrien Gallouët <angt@huggingface.co> chat: fix handling of space in reasoning markers (#22353) * chat: fix handling of space in reasoning markers common : re-arm reasoning budget after DONE on new <think> (#22323) common : determine generation prompt using longest common prefix (#22657) common/autoparser: fixes for newline handling / forced tool calls (#22654) * chat/autoparser: the fixes * Move optspace() to chat-peg-parser, comment out server tests invalidated due to content now allowed with forced tool calls. * Trim whitespace on apply instead common/chat : preserve media markers for typed-content templates (#22634) common : revert reasoning budget +inf logit bias (#22740) common : do not wrap raw strings in schema parser for tagged parsers (#22827) common : enable streaming JSON argument values (#23173) * common : remove atomic from json arguments * common : remove parsing logic on JSON arguments common : do not pass prompt tokens to reasoning budget sampler (#22488) reasoning-budget: clone should do a deep-copy (#23095) Co-authored-by: Piotr Wilkin (ilintar) <piotr.wilkin@syndatis.com>	2026-05-19 08:34:19 +03:00
gapeleon	c35189d83c	fix(server): reset chat parser on slot reuse to prevent crash (#1763 ) (#1794 ) If a slot is reused for a standard completion (`/v1/completions`) after being used for a chat completion (`/v1/chat/completions`), the previous chat's PEG parser would remain active in the slot's parameters. This caused standard text completions to throw on the raw text.	2026-05-17 18:26:45 +03:00
Kawrakow	1f8c603d9c	Quantize: add extra output tensor for MTP (#1810 ) * Quantize: add extra output tensor for MTP * Consistently use --mtp-requantize-output-tensor	2026-05-17 13:59:56 +03:00
Samuel Oliveira Alves	f4f4b3ff26	Allow dual speculative decoding (#1789 ) * wip: test logic to use multiple specs * feat: introduce composite speculative decoding stages * handle MTP context and draft invalidation * fix: allow gemma mtp for speculative stages * fix: normalize spec stage keys * refactor: remove enable_mtp flag and improve speculative stage handling * fix: update cached text tokens handling for stage chains * feat: implement sync for external MTP after non-MTP accept	2026-05-15 10:10:40 +03:00
Samuel Oliveira Alves	40b65d8f54	feat: add support for draft imatrix output file (#1803 )	2026-05-15 08:10:58 +03:00
Kawrakow	ba72890076	Faster imatrix (#1801 ) * Faster imatrix on AVX2 * Slightly better	2026-05-15 07:15:16 +03:00
Samuel Oliveira Alves	35fbe08d6e	disable MTP for parallel slots (#1804 )	2026-05-15 07:11:04 +03:00
Samuel Oliveira Alves	0fcffdb64d	feat: map Gemma 4 tensor and support with imatrix (#1796 )	2026-05-14 09:01:24 +03:00
ubergarm	ca52a825db	feat: add --threads-mtmd for independent multimodal thread count (#1797 ) Add `-tm` / `--threads-mtmd` to control CPU thread count used during multimodal image/audio processing (mmproj encoding), separate from the main LLM thread count. This allows running the LLM on GPU with minimal CPU threads (e.g. `-t 1`) to reduce sync overhead, while using many threads (e.g. `-tm 16`) for CPU-bound mmproj encoding with `--no-mmproj-offload`. Fallback chain when `-tm` is not specified: 1. `--threads-batch` (-tb) — multimodal encoding is a batch/prefill-like operation, so it makes sense to track with batch thread count 2. `--threads` (-t) — final default Works with both mtmd-cli and llama-server. AI: ubergarm/Qwen3.6-27B-GGUF MTP IQ4_KS 15.113 GiB (4.752 BPW) + pi.dev	2026-05-13 17:44:43 +03:00
firecoperana	cdc288bc97	server: reset cache tokens after pp stops (#1787 ) Co-authored-by: firecoperana <firecoperana>	2026-05-13 09:05:32 +03:00
Samuel Oliveira Alves	be8435793e	Pre-allocate buffers for hybrid model checkpoints (#1774 ) * hybrid-spec: improve recurrent checkpoint handling in speculative decoding * change per-step save to support scheduling and asynchronous tensor operations * remove redundant backend tensor fallback * improve recurrent tensor handling for split graph	2026-05-12 07:21:25 +03:00
Lingfeng Ren	c2f498ab4c	MTP: use target slot position for drafting (#1781 )	2026-05-12 07:21:03 +03:00
Lingfeng Ren	35845dd975	server : support MTP with multimodal prompts (#1758 ) Synchronize MTP state after mtmd decode batches so multimodal prompt chunks do not desync the draft context.	2026-05-11 09:51:07 +03:00
Samuel Oliveira Alves	c2b8bca807	Add MTP Support for Gemma 4 (#1744 ) * gemma-mtp: build the arch to load the MTP model * gemma-mtp: fix mtp kv state * gemma-mtp: refactor some functions and create gguf * gemma-mtp: make usable for embeddings models variant * gemma-mtp: fix qwen mtp load in graph split * gemma-mtp: refactor tensor creation and adjust output tensor handling * Gemma 4 MTP: improve tensor handling, and adjust split mode logic	2026-05-10 07:44:20 +03:00
Alex	51331f4973	Fix two speculative-decoding crashes that prevent any usage (#1760 ) This patch addresses two latent bugs in examples/speculative/speculative.cpp that prevent llama-speculative.exe from running on greedy sampling (temp=0) or producing rejection-sampling output (temp>0): 1. Line 191: `params.sparams.grammar = { COMMON_GRAMMAR_TYPE_NONE, "" };` invokes `common_grammar(type, grammar)` which asserts `type != NONE || !grammar.empty()`. Both conditions fail with the intended-to-be-empty grammar, so every speculative run hits a hard `GGML_ASSERT` in common/sampling.h:63 immediately after model load. Fix: default-construct via `common_grammar{}` to bypass the field-init constructor. 2. Lines 293-294: `GGML_ASSERT(dist_tgt.sorted)` and `GGML_ASSERT(dist_dft.sorted)` fire whenever the draft sampler does not set the .sorted flag (which is most modern sampler paths). Comment them out — the next ~10 lines re-sort both distributions by id explicitly, so the assertion is incorrect anyway. Fix: replace the asserts with an explanatory comment. After both fixes, `llama-speculative.exe` runs to completion. The acceptance-rate measurement at temp=0 still looks suspicious (0% across same-family draft/target pairs), but that is a different issue out of scope for this PR. Tested on Qwen3-0.6B-IQ4_XS drafting Qwen3-1.7B-IQ4_XS, both base models from `bartowski/Qwen_Qwen3-*-GGUF` on Windows + ik_llama.cpp build at HEAD of windows-mingw-default-win10 (which is itself a follow-up to PR #1755).	2026-05-09 08:36:38 +03:00
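
Note on ca52a825db (`--threads-mtmd`, #1797) above: the commit message describes a fallback chain for the multimodal thread count (`-tm`, then `-tb`, then `-t`). A minimal sketch of that resolution order follows. The struct, its field names, its default values and the function name are hypothetical and are not taken from the ik_llama.cpp sources; treating a value of 0 or less as "not specified" is likewise an assumption made for illustration.

```cpp
#include <cassert>

// Hypothetical container for the three thread-count options.
// Assumption: a value <= 0 means "not specified on the command line".
struct thread_opts {
    int n_threads       = 4;   // --threads / -t
    int n_threads_batch = -1;  // --threads-batch / -tb
    int n_threads_mtmd  = -1;  // --threads-mtmd / -tm
};

// Resolve the thread count used for multimodal (mmproj) encoding,
// following the order given in the commit message:
//   1. -tm if specified
//   2. -tb, since encoding is a batch/prefill-like operation
//   3. -t as the final default
inline int resolve_mtmd_threads(const thread_opts & o) {
    if (o.n_threads_mtmd  > 0) return o.n_threads_mtmd;
    if (o.n_threads_batch > 0) return o.n_threads_batch;
    assert(o.n_threads > 0);   // the final default must be a usable count
    return o.n_threads;
}
```

With `-t 1 -tm 16`, as in the commit's example, this sketch resolves to 16 threads for mmproj encoding while the main LLM thread count stays at 1.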

1242 Commits