ik_llama.cpp

mirror of https://github.com/ikawrakow/ik_llama.cpp.git synced 2026-06-28 04:30:15 -05:00

Author	SHA1	Message	Date
Thomas Green	574f22b3c7	Replace image resizers with avx2/neon simd impls from stb_img_resize2.h	2026-06-14 20:28:08 +01:00
Kawrakow	4f1ec69ae5	Merge pull request #1965 from Nexesenex/fix_q8_0_graph_reduce_type CUDA: Fix Q8_0 graph reduce type	2026-06-14 16:32:48 +02:00
Nexesenex	0fdac83272	Fix Q8_0 graph reduce type Analogous to the BF16 fix in eea6a82b25, this adds proper Q8_0 type handling in ggml_cuda_op_add: - Add k_add_q8_0_f32 kernel: dequantize Q8_0, add F32, store F32 - Add k_add_q8_0_q8_0_f32 kernel: dequantize two Q8_0, add, store F32 - Add Q8_0+Q8_0/Q8_0+F32/F32+Q8_0 branches in the F32 dst (else) block, preventing Q8_0 data from falling through to the incorrect half cast - Expand Q8_0 dst branch to handle F32+Q8_0->Q8_0 (swapped args), not just Q8_0+F32->Q8_0	2026-06-14 16:13:17 +02:00
Kawrakow	670a3f6f5b	Merge pull request #1960 from BeccaLabs/fix/rpc-device-init fix: initialize rpc_device endpoint and device index before parsing	2026-06-14 08:14:07 +02:00
BECCA-Labs	053202f97a	fix: initialize rpc_device endpoint and device index before parsing	2026-06-13 16:13:44 -05:00
Kawrakow	5f917a64b3	Merge pull request #1958 from ikawrakow/ik/handle_think_no_space	2026-06-12 21:27:23 +02:00
Samuel Oliveira Alves	8a38025174	Refactor: Move spec outside server (#1949 ) * Refactor speculative decoding: move logic outside of server * remove duplicated tokens in mtp kv cache * narrow to only discard draft cells in MTP * revert mtp_speculative_gen_draft	2026-06-12 18:12:39 +02:00
Farmadupe	d1339249d7	Cleanup: Unify location of m-rope repacking for token and embd (#1924 ) * unify location of rope-position-array rewriting prior to ubatching * Reorder terms.	2026-06-12 08:27:50 +02:00
Simon Lundell	b1eb8bb0a1	server: gate llama_decode_stop() to the active decode (fix queued-cancel cascade) (#1941 ) With --parallel 1, a client disconnect/timeout on a queued request aborts the active decode of a different client (llama_decode: failed to decode, ret = -3 / "Decode process is cancelled by user"), releasing the slot with the request unfinished. To the active client the stream silently stalls and never returns, while the server reports healthy — easy to misdiagnose as a network/proxy wedge. Root cause: llama_decode_stop() signals a process-global stop flag that the active decode loop polls. examples/server/server.cpp calls it ungated from the request reader's connection-closed paths, so any reader closing (including a queued, not-yet-running task's) trips the global flag against whatever decode is currently active. Adjacent to #1576/#1673 ("clear sticky stop flag" + hybrid/recurrent ret=-3), which did not gate these call sites against non-active readers, so the queued-cancel-kills-active cascade still fires on current main. Fix (minimal gate): add server_response_reader::any_task_on_slot() and gate the three llama_decode_stop() sites on it, so the global stop is signalled only when one of THIS reader's tasks is on a slot (the active decode). A queued task's disconnect then only drops that queued task. Verified in production under heavy concurrent, frequently-cancelled load (hundreds of queued-task cancels, zero active-decode kills). Stdlib-only reproducer in the PR description. Caveat: any_task_on_slot() reads the slots vector from the reader thread — the same race class as the existing process-global flag; can be tightened to a per-context/per-task cancellation if preferred.	2026-06-12 08:25:44 +02:00
Marian M.	5fb707d19b	Update docs (#1956 ) * Update README.md Models, MTP, fit * Update parameters.md Disclaimer, terms, new flags, graph split list.	2026-06-12 08:24:22 +02:00
Kawrakow	175819b4fb	Style	2026-06-12 06:19:06 +00:00
Kawrakow	3dbc3241b9	Handle forced-open reasoning tag without trailing whitespace	2026-06-12 05:43:11 +00:00
Joel Farthing	8d91d3c3d9	common: gate empty-start reasoning extraction (#1955 ) Co-authored-by: Joel Farthing <262452229+joelfarthing@users.noreply.github.com>	2026-06-12 07:16:24 +02:00
Kawrakow	022bd00aab	Optimize Cohere2-MoE graph parallel (#1948 ) * Optimize Cohere2-MoE graph parallel * Minor	2026-06-11 07:26:42 +02:00
firecoperana	ca0c1c5f85	fix Qwen3.6 outputs blank <think></think> in response when thinking is off (#1951 ) Co-authored-by: firecoperana <firecoperana>	2026-06-11 07:26:07 +02:00
Kawrakow	c0d25e8fa1	Gemma4 E2B/E4B tweaks (#1947 ) * Gemma4 E2B/E4B tweaks * A few more named nodes	2026-06-10 15:28:54 +02:00
Joel Farthing	4a1e2eaa69	model: add Cohere2-MoE North Mini Code support (#1945 ) * Add Cohere2 MoE North Mini Code support * Fix Cohere2 MoE expert tensor emission * Enhance Cohere2-MoE support by modifying tensor handling and configuration logic * Fix Cohere2-MoE graph split reduce handling --------- Co-authored-by: Joel Farthing <262452229+joelfarthing@users.noreply.github.com>	2026-06-10 15:28:27 +02:00
Kawrakow	e6f8112f3b	Adjust CUDA FA kernel parameters for head size 512 on Turing (#1942 )	2026-06-10 07:49:21 +02:00
firecoperana	2a1148384c	server: fix double submits of infill (#1944 ) Co-authored-by: firecoperana <firecoperana>	2026-06-10 07:48:15 +02:00
Joel Farthing	71d5aa21f7	common: handle Laguna chat delimiters (#1943 ) * common: handle Laguna chat delimiters * common: limit tool parser changes to end-delimited content --------- Co-authored-by: Joel Farthing <262452229+joelfarthing@users.noreply.github.com>	2026-06-10 07:46:19 +02:00
Kawrakow	366e478cb6	Bug fixes (#1940 ) * Bug fixes * More	2026-06-10 07:45:49 +02:00
Kawrakow	2768b62515	Split mode graph for Laguna (#1939 )	2026-06-09 10:13:30 +02:00
Kawrakow	11c3546235	Support for alternative Gemma4 assistant (#1937 )	2026-06-09 09:30:12 +02:00
Kawrakow	a38d29232d	CPU FA: disable mask optimization (#1935 )	2026-06-09 09:13:19 +02:00
Joel Farthing	bbe1a511ee	model: add Poolside Laguna XS.2 support (#1911 ) * llama: register Laguna architecture * llama: add Laguna graph support * llama: place Laguna MoE tensors for cpu-moe * gguf: add Laguna metadata and tokenizer ids * convert: support Poolside Laguna XS.2 * model: align Laguna RoPE and graph semantics * model: align Laguna partial offload with review feedback * model: localize Laguna SWA YaRN defaults * model: localize Laguna SWA RoPE constants --------- Co-authored-by: Joel Farthing <262452229+joelfarthing@users.noreply.github.com>	2026-06-08 18:33:12 +02:00
Kawrakow	eea6a82b25	Fix bf16 graph reduce type (#1938 )	2026-06-08 16:51:05 +02:00
Kawrakow	1660459db5	CUDA FA: cover Gemma4-4B/2B assistant (#1934 )	2026-06-08 08:18:26 +02:00
Kawrakow	b50b0919d5	CPU FA: Check for empty attention mask (#1923 )	2026-06-08 07:54:57 +02:00
Joel Farthing	2f2ca7adb1	convert: support Gemma4UnifiedAssistantForCausalLM (#1925 ) Co-authored-by: Joel Farthing <262452229+joelfarthing@users.noreply.github.com>	2026-06-08 07:43:43 +02:00
Joel Farthing	3c0f7b2f47	Gemma4: allow missing shared-KV edge tensors (#1927 ) Co-authored-by: Joel Farthing <262452229+joelfarthing@users.noreply.github.com>	2026-06-08 07:25:19 +02:00
Farmadupe	6b9de3dbaa	Fix mrope application across chunk boundaries (Fixes #993 and #1902 -- part 2) (#1918 ) * (qwen3vl) Correct calculation for injection point of deepstack image embeddings. The injection point for deepstack embeddings used the hyperparameter n_embd_inp(), which caused the hidden state to be counted twice, causing an OOB array access. The correct accessor is n_embd(). * Fix m-rope when pipeline parallelism is enabled	2026-06-05 17:10:02 +02:00
Kawrakow	1b53a58bf9	Enable split mode graph for Gemma4-12B (#1922 )	2026-06-05 10:59:22 +02:00
Farmadupe	1520eda980	prompt cache: Fix assertion that prompt cache does not rewind to the middle of an image (#1913 )	2026-06-04 17:53:06 +02:00
Chip Bradford	19dcc1f7d1	CUDA : support head_dim 512 with gqa_ratio % 8 (unblocks Gemma 4 12B) (#1921 ) The MMA flash-attention dispatcher only instantiated ncols2 = 8 and 4 for head_dim 512, so any other GQA ratio hit GGML_ABORT. Gemma 4 12B's global attention layers use head_dim 512 with a 16:1 GQA ratio (16 query heads / 1 KV head), which aborts at load. Because MTP speculative decoding requires flash attention, this also blocks the Gemma 4 12B MTP drafter entirely. Instantiating ncols2 = 16 there is not viable: it exceeds the maximum dynamic shared memory on Ada (cudaFuncSetAttribute returns invalid argument). Instead, route gqa_ratio % 8 == 0 (covering 8 and 16) through the existing ncols2 = 8 kernel, which already iterates over Q-head groups (iter_z = ceil(gqa_ratio / ncols2)). gqa_ratio 8 and 4 behavior is unchanged; this mirrors the divisor dispatch already used for the 576x512 case below. Verified on RTX 4070 Ti SUPER (Ada, cc 8.9): Gemma 4 12B + MTP drafter now runs with flash attention; draft acceptance 43-95% by workload, 1.5-2.2x end-to-end speedup. The 26B-A4B drafter (gqa_ratio 8) is unaffected. Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-04 17:36:10 +02:00
Samuel Oliveira Alves	007d640098	Standardize speculative decoding arguments on the server (#1908 ) * refactor spec args * add shell-safe quoting of string-valued stage keys in speculative decoding	2026-06-04 15:44:57 +02:00
firecoperana	6c0180d702	server: enable mcp proxy (#1904 ) * update http lib * Add cors proxy --------- Co-authored-by: firecoperana <firecoperana>	2026-06-04 15:43:07 +02:00
firecoperana	074fc7dafd	webui: update llamacpp webui (#1903 ) update config ui: fix audio and video modality detection (#23756) When model props are fetched asynchronously from the server, modelPropsVersion is incremented to trigger reactivity, but only the vision effect was listening to it. webui: update ignore files ui: handle audio/vnd.wave as audio WAV file (#23754) Firefox on Linux uses this MIME type ui: exclude generated build dirs from prettier and eslint so lint errors stop being masked (#23910) webui: add custom CSS injection via config (#23904) * webui: add custom CSS injection via config register a customCSS setting in the Developer section under Custom JSON, syncable so it rides the existing ui-config pass through. inject the value into a single style element in the head, reactive on the setting. lets an operator theme a prebuilt binary through --ui-config without rebuilding, and lets a user set it from the settings panel. move the textContent write into a use: action on the head style node. the action is the idiomatic way to touch a node, so the no-dom-manipulating lint rule is satisfied without a disable. value stays text through textContent, never parsed as HTML. * Update tools/ui/src/lib/constants/settings-keys.ts Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> * ui: address review from @allozaur, rename custom config key to customJson with migration rename the custom config key to customJson across the type, the chat request builder, the settings save check and the custom tools reader, keeping the custom API param name unchanged. add a non destructive migration that copies the legacy custom key to customJson at startup. only render the head style tag when custom CSS is set. --------- Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> server: real-time reasoning interruption via control endpoint (#23971) Builds on the manual reasoning budget trigger from #23949. 
Adds a CONTROL task that mirrors the CANCEL path on the live slot and calls common_sampler_reasoning_budget_force to end thinking mid-generation. POST /v1/chat/completions/control with { id_slot, action }, opt-in reasoning_control arms the budget sampler on demand. Router and single model. Minimal WebUI button as a skeleton for further UI work. * ui: track reasoning phase via explicit streaming state Add isReasoning to the chat store, mirroring the isLoading pattern: per conversation map, private setter, public accessor and reactive export. Set from the stream callbacks, true on reasoning chunks, false on the first content chunk, reset on stream end and resynced on conversation switch. The skip button now keys off isReasoning so it shows only during the thinking phase, not the whole generation. * ui: extract control endpoint and action into constants Move the chat completion routes, the slots route and the reasoning control action out of chat.service into api-endpoints and a dedicated control-actions module. No behavior change, drops the magic strings so the control protocol has a single source of truth. * server: target reasoning control by completion id Address @ngxson review on the control endpoint. Switch from id_slot to the chat completion id to avoid a TOCTOU: the slot can be reassigned between the lookup and the control request, so matching the live completion (oaicompat_cmpl_id) is safe and a finished one simply matches nothing. Rename the action to reasoning_end, guard it on the reasoning_control flag of the target slot, and reduce the response to {success} with an optional message. * ui: target reasoning control by completion id Keep the streamed completion id on the message and post it back to the control endpoint instead of probing /slots. Drops the slot discovery and the TOCTOU that came with it. Action renamed to reasoning_end, response read as {success}. 
* server: address review from @ngxson Move the control fields into task_params and drop the redundant comments on the control path. * server: document the reasoning control endpoint * Update tools/ui/src/lib/types/database.d.ts Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> * ui: rename cmplId to completionId Per @allozaur review, clearer name for the streamed completion id. * ui: wire completion id capture through the agentic flow The webui streams through the agentic flow, which relayed onModel but not onCompletionId, so the completion id never reached the message and the control request was never sent. Relay it through the flow and its callbacks type, declare id on the chunk type, and log an explicit error when the button fires without a usable id. * ui: target reasoning control model from the message The model is a property of the completion, so read it from the streaming message like the id, not from the model dropdown which is unrelated UI state. Makes the request self-consistent by construction instead of just unlikely to drift. --------- Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> ui: Add Thinking mode toggle with reasoning effort levels + improvements for Chat Form Add Action UI (#23434) Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> * fix: Model tags ui: simplify network error handling (#23431) Previously, error-to-string conversion was split across two files, with one function converting errors into strings and another analyzing those strings to generate yet another string. Now the error handling for network fetches is centralised and uses HTTP error codes directly where possible to generate the human-readable error strings. It also fixes an issue where all JSON errors reported from the backend, such as "Invalid API key", were incorrectly turned into "Failed to connect to server" due to poor matching logic in the now-gone getErrorMessage function.
update html ui: Mermaid Diagrams in chat + interactive preview (#24032) webui: fix tool selector toggle/counter, key tools by stable identity (#24065) * webui: fix tool selector toggle/counter, key tools by stable identity Key the disabled set, counts and toggles by a stable per-tool key instead of bare function name, deduped from one canonical list. Per-tool checkboxes become presentational (single row handler, no nested button), category checkboxes drop the tristate (n/total carries partial). One getEnabledToolsForLLM keeps normalized MCP schemas and dedupes by name. * ui: use SvelteSet and SvelteMap for local tool collections to satisfy svelte/prefer-svelte-reactivity Co-authored-by: firecoperana <firecoperana>	2026-06-04 15:41:23 +02:00
Kawrakow	4406e637b5	Split mode graph for Mellum (#1920 )	2026-06-04 15:20:41 +02:00
Joel Farthing	dc51c6f9b2	Add Mellum2 architecture support (#1919 ) Co-authored-by: Joel Farthing <262452229+joelfarthing@users.noreply.github.com>	2026-06-04 14:28:02 +02:00
Farmadupe	e08ad51f15	Insert image pad markers for kimi K2.5 and K2.6 (#1912 )	2026-06-04 09:27:28 +02:00
Samuel Oliveira Alves	3f40e73c36	expand np guardrail for all mtp types (#1901 )	2026-05-30 16:19:53 +03:00
Kawrakow	8960c5ba5e	Add extra nodes when dealing with MLA and amb (#1899 )	2026-05-29 15:17:24 +03:00
Kawrakow	e75337fec3	quantize: add exception for Gemma4 (#1897 )	2026-05-29 10:54:21 +03:00
Kawrakow	6eff055a0c	GLM-5 MTP (again) (#1890 ) * wip: port MTP architecture Ports the Multi-Token Prediction (MTP) architecture to the older `llama.cpp` codebase used by `ikllama`. Changes include: - Updating `llama_batch` to support `mtp_params`. - Modifying `llama_decode_internal` (and `encode`) to handle MTP operations (Warmup, Update, Draft). - Adding public APIs for MTP state management (`llama_set_draft_input_hidden_state`). - Adapting the embedding extraction logic to skip MTP update passes. * Refactors `server_slot` to support generic speculative decoding (MTP or Draft Model). * core: enable hybrid outputs (logits + embeddings) for MTP support * fix(mtp): correct KV-cache slot finding for updates * fix(mtp): persist hidden states to prevent context corruption during drafting * refactor(mtp): clean unused code * fix(mtp): update server to new functions name * fix(mtp): fix graph and save hidden state * mtp: refactor integration, context params and kv cache search * mtp: fix hidden state extraction and speculative acceptance flow * server: fix MTP warmup for long prompts and reset token buffer * llama: refactor MTP operation state to context parameters * server: fix n_past calculation in MTP acceptance * llama: fix mtp enable flags * speculative: refactor MTP to use common_speculative interface * context: remove unused signatures * clip: fix deprecated enum-enum conversion warning * common: fix format string crash in help message * context: fix mtp activation logic * llamat: always use the extracted embedding * llama: get all embeddings to kv cache * llama: revert logit to not run mtp for not supported arch * llama: allocate all the n_outputs for MTP * wip * server-context: get only the last embedding for hidden state * ggml-backend: fix array of bounds in debug build * server-context: run mt kv update to each prompt batch * revert segmentation fault fixes * glm-mtp(feat): optimize graph embedding and recursive drafting * glm5-mtp(feat): add glm 5 mtp logic * 
glm-mtp: standardize the MTP graph * glm 5 mtp: apply post-layer cvec * glm 5 mtp: mark head as mandatory * get normed embeddings for glm 5 * Fix GLM5 MTP * GLM5 MTP: just reuse the layer attention implementation * Make MTP work with split mode graph --------- Co-authored-by: samuel <samueloliveira32df@gmail.com>	2026-05-28 18:14:12 +03:00
Kawrakow	3bf7e836c2	Allow Hadamard transform for head sizes that are not power of 2 (#1883 ) * Disable K Hadamard transform if K-head size is not a power of 2 * Allow Hadamard transform for head sizes that are not power of 2 * Give more details why Hadamard is not possible * Arghh	2026-05-27 18:29:32 +03:00
Kawrakow	d503b046f7	Fix GLM MTP with split mode graph (#1887 ) * Fix crash with GLM and MTP * Fix GLM MTP with split mode graph	2026-05-27 07:24:28 +03:00
Kawrakow	1f66f9912f	Fix crash with GLM and MTP (#1885 )	2026-05-27 07:24:05 +03:00
Kawrakow	d2da6da05c	Fix cache loading/saving for MLA models and split mode graph (#1884 )	2026-05-26 17:07:40 +03:00
Gearstickle	4fbd0c441b	fa: preserve early-termination, fix multi-slot correctness via union of masks (#1880 ) * fa: fix FlashQKV early-termination causing S=0 assertion with --parallel N>1 The backward-scan optimization in compute_helper/compute_helper_q checks only one mask position per k_step block on the last query row (q_step-1) to find where valid KV entries end. When q_step > 1 and different query rows have non-overlapping valid KV regions (multi-slot / --parallel N>1), the scan on the last row's mask can miss blocks that contain valid entries for earlier rows. This causes those rows to accumulate S=0, triggering the GGML_ASSERT(S > 0) in normalize_and_store_1row. Fix: remove the early-termination scan at all 4 sites and iterate all nk1/k_step blocks unconditionally. The mask already handles correctness: fully-masked blocks produce smax=-inf and skip V accumulation, so the performance cost is minimal for TG (small nq1) and acceptable for PP. Fixes #809 * fa: refactor multi-slot mask fix into mask_effective_nk1() helper Replace 4× inlined early-termination scans with a shared helper that computes the effective K boundary by scanning ALL query mask rows (union-of-masks). This is the minimal fix for multi-slot parallel inference where different slots have different sequence lengths. The helper returns the k_step-aligned boundary covering the longest active sequence across all rows, preserving single-slot performance (single row = same boundary as before). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> --------- Co-authored-by: Turbomen008 <Turbomen008@users.noreply.github.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-05-26 16:16:49 +03:00
Kawrakow	b4e1d916c5	Per GPU fit margin (#1872 )	2026-05-25 08:16:45 +03:00
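
The union-of-masks boundary described in commit 4fbd0c441b above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code from that commit: the function name `mask_effective_nk1_sketch`, its signature, the row-major float mask layout, and the use of `-INFINITY` to mark masked positions are all hypothetical choices made for the example.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical sketch (not the upstream helper): returns how many K positions
// must be processed so that no query row loses a valid KV entry.
// `mask` is assumed row-major with nq rows and nk columns; a value of
// -INFINITY marks a masked position, anything else counts as valid.
int mask_effective_nk1_sketch(const std::vector<float>& mask, int nq, int nk, int k_step) {
    int last_valid = -1;  // highest unmasked K index found in any row so far
    for (int q = 0; q < nq; ++q) {
        const float* row = mask.data() + static_cast<std::size_t>(q) * nk;
        // Scan backward, stopping once we reach indices already covered
        // by an earlier row: they cannot raise the boundary.
        for (int k = nk - 1; k > last_valid; --k) {
            if (row[k] != -INFINITY) {
                last_valid = k;
                break;
            }
        }
    }
    if (last_valid < 0) {
        return 0;  // every position in every row is masked
    }
    // Round up to a whole number of k_step blocks, capped at nk.
    const int blocks = (last_valid + 1 + k_step - 1) / k_step;
    return std::min(blocks * k_step, nk);
}
```

For example, with two query rows, nk = 8 and k_step = 4, where the first row is valid through index 5 and the last row only through index 1, a scan of the last row alone would stop after one block (4 positions), while the union over both rows yields 8, so the first row's entries at indices 4 and 5 are not skipped.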


4591 Commits