ik_llama.cpp

mirror of https://github.com/ikawrakow/ik_llama.cpp.git synced 2026-06-28 04:30:15 -05:00

Author	SHA1	Message	Date
gapeleon	4f220159b8	Fix (Gemma-4 Vision): Correct KQ mask fill for causal models in non-causal flash-attn mode (#1985) When llama_set_causal_attn(false) is called on a causal model (e.g. Gemma-4 during vision image decode), llama_set_inputs took the non-causal else-branch (designed for pure embedding models). That path wrote the F16 mask with stride n_tokens instead of n_kv, and iterated batch indices rather than KV cache cells. The result was that every image query row beyond the first was written at the wrong offset, leaving stale -inf values from previous decodes visible to the GPU kernel. Any conversation that had built up prior KV mask data would produce all-inf attention scores for most image tokens, collapsing softmax to NaN and aborting at sampling. Resolves #1984	2026-06-17 16:52:45 +02:00
Jun Yamog	064d23a6f8	Codex CLI Responses Compatibility (#1964) * responses: skip known unsupported Responses tool types from Codex CLI - Skip namespace, web_search, image_generation tools instead of HTTP 500 - Reject unknown non-function tool types with controlled error - Preserve function tool conversion logic unchanged Fixes Codex CLI 0.133.0 compatibility where it sends mixed tool types. * responses: harden codex compatibility coverage * responses: expose Codex model catalog metadata	2026-06-16 15:28:16 +02:00
Joel Farthing	d37d92b54c	chat: add Cohere2MoE North Code parser (#1968) Co-authored-by: Joel Farthing <262452229+joelfarthing@users.noreply.github.com>	2026-06-16 15:27:30 +02:00
Kawrakow	8420f91ae3	Merge pull request #1977 from ikawrakow/ik/dflash_fix_cpu Fix DFlash on the CPU	2026-06-16 15:26:23 +02:00
Kawrakow	6f45163a95	Fix DFlash on the CPU	2026-06-16 13:22:36 +00:00
Kawrakow	f9078e169b	Merge pull request #1970 from SamuelOliveirads/feat/dflash-implementation Add DFlash support	2026-06-16 15:07:55 +02:00
Kawrakow	11c9935ce8	Merge pull request #1893 from ikawrakow/ik/gemma4_mtmd_blindness Fix Gemma4 vision	2026-06-16 07:47:37 +02:00
SamuelOliveirads	ad24046b51	minor refactor in DFlash kv cache graph	2026-06-15 18:22:56 -03:00
Kawrakow	2f524850a1	Merge pull request #1973 from ikawrakow/ik/fattn_mma_gqa_16 CUDA FA: faster TG when GQA is 16 and head size is 128	2026-06-15 15:24:01 +02:00
Kawrakow	37ea89cabf	Merge pull request #1974 from Nexesenex/fix_muge_crash_minimax_m3 Fix Minimax M3 crash when -muge merges up/gate experts	2026-06-15 15:07:49 +02:00
Nexesenex	3c9680fd3c	Fix Minimax M3 crash when -muge merges up/gate experts The graph builder for Minimax M3 (build_minimaxm3.cpp) was not passing model.layers[il].ffn_up_gate_exps to llm_build_std_moe_ffn, unlike Minimax M2 and all other MoE model graph builders. When -muge (merge_up_gate_experts) is enabled, the merge creates a single ffn_up_gate_exps tensor with ffn_up_exps and ffn_gate_exps as views. Only the parent merged tensor gets the split 'extra' pointer set. Without passing it as up_gate_exps parameter, the function sees null split pointers for up/gate (the views) while split_down_exps is valid, causing the assertion at llama-build-context.cpp:1453 to fail.	2026-06-15 15:00:32 +02:00
Kawrakow	6be3a488d3	CUDA FA: faster TG when GQA is 16 and head size is 128	2026-06-15 11:46:02 +00:00
Kawrakow	f81673c7db	Merge pull request #1972 from ikawrakow/ik/minimaxm3_smgraph Split mode graph for MiniMax-M3	2026-06-15 13:44:19 +02:00
Kawrakow	e927adc4ad	Merge pull request #1969 from Farmadupe/resize_algo_fix Correct image resize algorithm for all qwens after qwen2vl and gemma4	2026-06-15 13:39:11 +02:00
Kawrakow	00d96744de	Merge pull request #1967 from Farmadupe/stb_image_resize2 Replace image resizers with avx2/neon simd impls from stb_img_resize2.h	2026-06-15 13:38:31 +02:00
Kawrakow	1dc4ea938a	Merge pull request #1962 from ikawrakow/ik/fix_1961 Fix #1961	2026-06-15 13:00:27 +02:00
Kawrakow	c24d50dd88	Split mode graph for MiniMax-M3	2026-06-15 08:41:34 +00:00
Kawrakow	567854aeab	Merge pull request #1963 from jkyamog/minimax-m3-support Add preliminary MiniMax-M3 support	2026-06-15 10:16:10 +02:00
Jun Yamog	c08d194edd	Use standard graph helpers for MiniMax-M3	2026-06-15 02:00:46 +00:00
Jun Yamog	c538210e6d	Add MiniMax-M3 chat template	2026-06-15 01:29:13 +00:00
SamuelOliveirads	6cae8c7ba2	clean logs	2026-06-14 21:07:57 -03:00
Thomas Green	19f08160ad	Correct image resize algorithm for all qwens after qwen2vl and gemma4	2026-06-14 21:57:11 +01:00
Thomas Green	574f22b3c7	Replace image resizers with avx2/neon simd impls from stb_img_resize2.h	2026-06-14 20:28:08 +01:00
SamuelOliveirads	0d75eee35a	remove duplicated code and unnecessary refactor	2026-06-14 16:02:02 -03:00
Kawrakow	4f1ec69ae5	Merge pull request #1965 from Nexesenex/fix_q8_0_graph_reduce_type CUDA: Fix Q8_0 graph reduce type	2026-06-14 16:32:48 +02:00
Nexesenex	0fdac83272	Fix Q8_0 graph reduce type Analogous to the BF16 fix in eea6a82b25, this adds proper Q8_0 type handling in ggml_cuda_op_add: - Add k_add_q8_0_f32 kernel: dequantize Q8_0, add F32, store F32 - Add k_add_q8_0_q8_0_f32 kernel: dequantize two Q8_0, add, store F32 - Add Q8_0+Q8_0/Q8_0+F32/F32+Q8_0 branches in the F32 dst (else) block, preventing Q8_0 data from falling through to the incorrect half cast - Expand Q8_0 dst branch to handle F32+Q8_0->Q8_0 (swapped args), not just Q8_0+F32->Q8_0	2026-06-14 16:13:17 +02:00
Jun Yamog	0df00b3b94	Add preliminary MiniMax-M3 support	2026-06-14 12:23:20 +00:00
Kawrakow	c73bfbe9ce	Fix #1961	2026-06-14 07:42:39 +00:00
Kawrakow	670a3f6f5b	Merge pull request #1960 from BeccaLabs/fix/rpc-device-init fix: initialize rpc_device endpoint and device index before parsing	2026-06-14 08:14:07 +02:00
SamuelOliveirads	3b1a0f88d5	Add logging for DFlash statistics and clean up workspace handling	2026-06-13 20:14:08 -03:00
BECCA-Labs	053202f97a	fix: initialize rpc_device endpoint and device index before parsing	2026-06-13 16:13:44 -05:00
SamuelOliveirads	3a1d46c4d1	Merge remote-tracking branch 'origin/main' into feat/dflash-implementation # Conflicts: # common/common.cpp # common/speculative.cpp # convert_hf_to_gguf.py # examples/server/server-context.cpp # examples/server/server-context.h # src/llama-arch.cpp # src/llama-arch.h # src/llama-model.cpp # src/llama.cpp	2026-06-13 17:27:52 -03:00
Kawrakow	5f917a64b3	Merge pull request #1958 from ikawrakow/ik/handle_think_no_space	2026-06-12 21:27:23 +02:00
Samuel Oliveira Alves	8a38025174	Refactor: Move spec outside server (#1949) * Refactor speculative decoding: move logic outside of server * remove duplicated tokens in mtp kv cache * narrow to only discard draft cells in MTP * revert mtp_speculative_gen_draft	2026-06-12 18:12:39 +02:00
Farmadupe	d1339249d7	Cleanup: Unify location of m-rope repacking for token and embd (#1924) * unify location of rope-position-array rewriting prior to ubatching * Reorder terms.	2026-06-12 08:27:50 +02:00
Simon Lundell	b1eb8bb0a1	server: gate llama_decode_stop() to the active decode (fix queued-cancel cascade) (#1941) With --parallel 1, a client disconnect/timeout on a queued request aborts the active decode of a different client (llama_decode: failed to decode, ret = -3 / "Decode process is cancelled by user"), releasing the slot with the request unfinished. To the active client the stream silently stalls and never returns, while the server reports healthy (easy to misdiagnose as a network/proxy wedge). Root cause: llama_decode_stop() signals a process-global stop flag that the active decode loop polls. examples/server/server.cpp calls it ungated from the request reader's connection-closed paths, so any reader closing (including a queued, not-yet-running task's) trips the global flag against whatever decode is currently active. Adjacent to #1576/#1673 ("clear sticky stop flag" + hybrid/recurrent ret=-3), which did not gate these call sites against non-active readers, so the queued-cancel-kills-active cascade still fires on current main. Fix (minimal gate): add server_response_reader::any_task_on_slot() and gate the three llama_decode_stop() sites on it, so the global stop is signalled only when one of THIS reader's tasks is on a slot (the active decode). A queued task's disconnect then only drops that queued task. Verified in production under heavy concurrent, frequently-cancelled load (hundreds of queued-task cancels, zero active-decode kills). Stdlib-only reproducer in the PR description. Caveat: any_task_on_slot() reads the slots vector from the reader thread, the same race class as the existing process-global flag; can be tightened to a per-context/per-task cancellation if preferred.	2026-06-12 08:25:44 +02:00
Marian M.	5fb707d19b	Update docs (#1956) * Update README.md Models, MTP, fit * Update parameters.md Disclaimer, terms, new flags, graph split list.	2026-06-12 08:24:22 +02:00
Kawrakow	175819b4fb	Style	2026-06-12 06:19:06 +00:00
Kawrakow	3dbc3241b9	Handle forced-open reasoning tag without trailing whitespace	2026-06-12 05:43:11 +00:00
Joel Farthing	8d91d3c3d9	common: gate empty-start reasoning extraction (#1955) Co-authored-by: Joel Farthing <262452229+joelfarthing@users.noreply.github.com>	2026-06-12 07:16:24 +02:00
Kawrakow	022bd00aab	Optimize Cohere2-MoE graph parallel (#1948) * Optimize Cohere2-MoE graph parallel * Minor	2026-06-11 07:26:42 +02:00
firecoperana	ca0c1c5f85	fix Qwen3.6 outputs blank <think></think> in response when thinking is off (#1951) Co-authored-by: firecoperana <firecoperana>	2026-06-11 07:26:07 +02:00
Kawrakow	c0d25e8fa1	Gemma4 E2B/E4B tweaks (#1947) * Gemma4 E2B/E4B tweaks * A few more named nodes	2026-06-10 15:28:54 +02:00
Joel Farthing	4a1e2eaa69	model: add Cohere2-MoE North Mini Code support (#1945) * Add Cohere2 MoE North Mini Code support * Fix Cohere2 MoE expert tensor emission * Enhance Cohere2-MoE support by modifying tensor handling and configuration logic * Fix Cohere2-MoE graph split reduce handling --------- Co-authored-by: Joel Farthing <262452229+joelfarthing@users.noreply.github.com>	2026-06-10 15:28:27 +02:00
Kawrakow	e6f8112f3b	Adjust CUDA FA kernel parameters for head size 512 on Turing (#1942)	2026-06-10 07:49:21 +02:00
firecoperana	2a1148384c	server: fix double submits of infill (#1944) Co-authored-by: firecoperana <firecoperana>	2026-06-10 07:48:15 +02:00
Joel Farthing	71d5aa21f7	common: handle Laguna chat delimiters (#1943) * common: handle Laguna chat delimiters * common: limit tool parser changes to end-delimited content --------- Co-authored-by: Joel Farthing <262452229+joelfarthing@users.noreply.github.com>	2026-06-10 07:46:19 +02:00
Kawrakow	366e478cb6	Bug fixes (#1940) * Bug fixes * More	2026-06-10 07:45:49 +02:00
Kawrakow	2768b62515	Split mode graph for Laguna (#1939)	2026-06-09 10:13:30 +02:00
Kawrakow	11c3546235	Support for alternative Gemma4 assistant (#1937)	2026-06-09 09:30:12 +02:00
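
The KQ mask fix in 4f220159b8 (top of this log) describes a row-stride mismatch: the mask was written with stride n_tokens while the kernel reads it with stride n_kv. The following is only an illustrative Python sketch of that failure mode, not the ik_llama.cpp code. The names fill_mask, read_row and softmax are hypothetical, the mask is a flat list of floats rather than an F16 tensor, every cell is treated as visible, and the raw mask row stands in for the attention scores.

```python
import math

NEG_INF = -math.inf

def fill_mask(n_tokens, n_kv, row_stride):
    """Write a flat mask that the reader interprets as n_tokens rows of n_kv cells.

    The buffer starts as all -inf to mimic stale values from earlier decodes.
    row_stride is the stride used by the writer (assumed <= n_kv here):
    n_kv is the correct layout, n_tokens reproduces the mismatch.
    """
    buf = [NEG_INF] * (n_tokens * n_kv)
    for q in range(n_tokens):
        for k in range(row_stride):
            buf[q * row_stride + k] = 0.0  # 0.0 means "may attend"
    return buf

def read_row(buf, n_kv, q):
    """Return query row q the way the consumer sees it (stride n_kv)."""
    return buf[q * n_kv:(q + 1) * n_kv]

def softmax(scores):
    """Plain max-subtracted softmax; a row of only -inf yields NaN."""
    m = max(scores)
    e = [math.exp(s - m) for s in scores]
    total = sum(e)
    return [x / total for x in e]

n_tokens, n_kv = 3, 8
good = fill_mask(n_tokens, n_kv, row_stride=n_kv)
bad = fill_mask(n_tokens, n_kv, row_stride=n_tokens)

for q in range(n_tokens):
    print(q, read_row(good, n_kv, q), read_row(bad, n_kv, q))

print(softmax(read_row(bad, n_kv, 2)))
```

In this toy layout the mismatched writer fills only the first nine cells of the buffer, so the first row looks correct while later rows keep their stale -inf entries, and a fully masked row turns into NaN under softmax.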


4628 Commits