ik_llama.cpp

mirror of https://github.com/ikawrakow/ik_llama.cpp.git synced 2026-07-01 07:50:16 -05:00

Author	SHA1	Message	Date
Kawrakow	b6bac1aedb	Auto-fit for dense models (#1504 ) * Auto-fit for dense models * Minor	2026-03-25 08:28:15 +01:00
Kawrakow	86f4f516e5	Auto-fit offloaded tensors to available VRAM (MoE models) (#1501 ) * WIP: automatically fit model in available VRAM * WIP * This seems pretty solid	2026-03-25 07:29:29 +01:00
Kawrakow	233225db8f	Take into account layer sizes for setting GPU layers (count 2) (#1498 ) * Also take into account KV cache * Take into account attn_wkv_b and mla = 3 compute buffers * WIP * Minor	2026-03-24 08:18:28 +01:00
Nexes the Elder	094f76ee86	Cleaner log for adjusted splits (#1494 ) * sweep-bench: add more skipped patterns to --minilog * cleaner log for adjusted splits * Add totalization for adjusted splits * Clean up semicolons * Addition for totalizer ^^ * Change accordingly to review * Forgotten leftover removed * 'total' instead of 'totalized'	2026-03-24 07:49:40 +01:00
Kawrakow	b7a2bde4cc	Take into account layer sizes for setting GPU layers (cont) (#1476 ) * Also take into account KV cache * Take into account attn_wkv_b and mla = 3 compute buffers	2026-03-23 17:46:53 +01:00
Kawrakow	3633a7cfca	Log HAVE_FANCY_SIMD via LLAMA_LOG_INFO (#1492 )	2026-03-23 08:43:29 +01:00
gapeleon	716ecd6457	Added split mode graph for Command-R/R+ models. (#1491 )	2026-03-23 08:10:41 +01:00
Kawrakow	0871ab2964	Take into account layer sizes for setting GPU layers (#1466 ) * Take into account layer sizes for setting GPU layers * Fix bug	2026-03-20 09:38:14 +01:00
Kawrakow	9b7db9bc3f	Better --n-cpu-moe (#1464 )	2026-03-19 06:57:01 +01:00
firecoperana	f9b7fe9749	llama: add --dry-run option (#1462 ) Co-authored-by: firecoperana <firecoperana>	2026-03-18 17:20:17 +01:00
Kawrakow	56477c7a9e	Mistral 4 support (#1450 ) * WIP: mistral4 * CPU FA * CUDA FA 320, 256	2026-03-18 07:32:39 +01:00
Kawrakow	54bcafee16	Allow using -rtr and -muge together (#1444 )	2026-03-16 18:26:26 +01:00
Kawrakow	d83b0172b1	Attempt to fix #1438 (#1439 )	2026-03-16 08:34:17 +01:00
Kawrakow	56f4e9e673	Fix the fix (#1433 )	2026-03-15 17:33:10 +01:00
Kawrakow	edf54621d2	Fix long max. context bottleneck (#1430 )	2026-03-15 17:18:27 +01:00
Kawrakow	10a8f5f8f1	Fix hybrid graph parallel + muge (#1426 )	2026-03-14 18:15:09 +01:00
dungquixote42	be2940f57a	Adaptive P sampler: update review logic, delete old code comments, put prep stage after logit bias (#1386 ) * simpler n_rewind logic, delete old comments * use more consistent names, add updt_w_cur to json schema * align comments * refactor review logic, update struct/variable names * revert cosmetic changes * check enable/disable in llama_prep_adaptive_p_impl() * delete extra whitespaces after statement * show target in debug prints * more concise debug print * delete old comments * update with loop instead of move() * comment out all adaptive p debug prints * more debug prints * move review() variables: common_sampler struct -> common_sampler_review() args * match n_unsent type * fix merge bugs, delete adaptive p references in buffer_and_check_string_ban() * restore accidental erasure * Revert "adaptive p: collect probability before logit bias" This reverts commit 1434878461c49d1a2a9047fc15d5e7b78421fd2a.	2026-03-14 12:34:12 +01:00
Kawrakow	7fab617684	Enable split mode graph for on-the-fly merged up/gate experts (#1413 ) * Split mode graph for on-the-fly merged ffn_up/gate_exps * Cleanup * Also handle merged bias	2026-03-13 08:11:46 +01:00
firecoperana	433531ddae	server : support multi-modal context checkpoints and prompt caching (#1398 ) * server : support multi-modal context checkpoints and prompt caching do not create checkpoint right after image processing improve mtmd check for slot ops fix context shift do not abort if template parse failed * change to debug message when detecting ban token --------- Co-authored-by: firecoperana <firecoperana>	2026-03-13 08:07:57 +01:00
Kawrakow	5713d3b38b	Support models with merged up/gate experts (#1408 ) * WIP: support pre-merged up/gate experts Haha, mainline has elected to arrange the merged tensors the other way around compared to what I had done in the on-the-fly merge. * Change the order of on-the-fly packed up/gate * OpenAI * CUDA TG * CPU	2026-03-12 09:25:57 +01:00
Kawrakow	cda15bf175	Discard very first compute graph for recurrent models (#1393 )	2026-03-10 09:41:47 +01:00
Kawrakow	f90b4c2f27	Full graph parallel for Qwen3.5 (dense and MoE) (#1388 ) * WIP * WIP * WIP * WIP * WIP * WIP * WIP Loads and starts running, crashes with illegal memory access in quantize_mmq_q8_1. This almost always indicates NaNs in the input to the MoE FFN part. * WIP * WIP Loads and runs, wrong results (very high PPL) Performance looks promising, around 25% better than previous sm graph. Needs f32 or bf16 graph reduce type. * WIP - still wrong * Fix after rebase * WIP * WIP * This seems to be working for dense Qwen3.5!!! * WIP: Qwen3-Next is not quite working * Some cleanup * Disable Qwen3-Next for now * Disable graph parallel when mmproj was specified * Read/write split recurrent state * That should not crash * Re-enable vision - it works now * Recurrent layers should now be counted for split cache	2026-03-10 09:08:24 +01:00
firecoperana	ab1d74074b	common : introduce composable PEG parser combinators for chat parsing and new jinja template engine (#1369 ) --------- Co-authored-by: Piotr Wilkin <piotr.wilkin@syndatis.com> common : add nemotron 3 parsing (#18077) common : add parser for ministral/mistral large 3/devstral 2 (#17713) common : default content to an empty string (#18485) chat: make tool description and parameters optional per OpenAI spec (#18478) Per the OpenAI API specification, both 'description' and 'parameters' fields in tool function definitions are optional. Previously, the parser would throw an exception if these fields were missing. Attempts to fix #17667 common : implement new jinja template engine (#18462) --------- Co-authored-by: Alde Rojas <hello@alde.dev> Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> jinja: correct member access rule (#18905) jinja : fix lexing of float literals with sign (#18901) jinja : add missing tojson filter for bool (#18900) jinja : attribute support for join, map and sort (#18883) jinja : fix object item order (and properly implement dictsort) (#18904) tests : add test-jinja -py option for cross-checking (#18906) Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> ci : run test-jinja -py on high perf [no ci] (#18916) jinja : fix undefined keys and attributes and int/float as bool (#18924) jinja: support none\|string (#18995) Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> jinja : implement mixed type object keys (#18955) --------- Co-authored-by: Xuan Son Nguyen <son@huggingface.co> jinja : undefined should be treated as sequence/iterable (return string/array) by filters/tests (#19147) `tojson` is not a supported `undefined` filter keep 
it DRY and fix some types jinja : do not pass empty tools and add some none filters (#19176) jinja : add unordered_map include to value.h [no ci] (#19205) jinja : add missing 'in' test to template engine (#19004) (#19239) The jinja template parser was missing the 'in' test from global_builtins(), causing templates using reject("in", ...), select("in", ...), or 'x is in(y)' to fail with "selectattr: unknown test 'in'". This broke tool-calling for Qwen3-Coder and any other model whose chat template uses the 'in' test. Added test_is_in supporting array, string, and object containment checks, mirroring the existing 'in' operator logic in runtime.cpp. Includes test cases for all three containment types plus reject/select filter usage. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> --------- Co-authored-by: Sid Mohan <sidmohan0@users.noreply.github.com> Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com> Co-authored-by: Xuan Son Nguyen <son@huggingface.co> Add Jinja support for "indent" string filter (#19529) Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> add vendor refactor chat server : support preserving reasoning_content in assistant message (#18994) chat : fix translategemma crash on common_chat_format_example (#19019) chat: fix language input for translategemma (#19052) Co-authored-by: Aldehir Rojas <hello@alde.dev> --------- Co-authored-by: Aldehir Rojas <hello@alde.dev> chat: fix case where template accepts type content only (#19419) mtmd : chat : Fix extra \n between text and media marker (#19595) Thanks to @tugot17 for detecting and reporting the issue. For vision models (e.g. LFM2.5-VL-1.6B and Qwen/Qwen3-VL-4B-Instruct) `llama-mtmd-cli` produces identical output to HF implementation. However `llama-server` doesn't. I traced it down to extra newline inserted after `<__media__>`. 
This happens in `to_json_oaicompat`, that treats media markers as text and joins all parts with `\n` separator. PR introduces new type `media_marker` and uses it for media markers. Extra logic is added to prevent insertion of newlines before and after media markers. With this change number of input tokens is identical to HF implementation and as a result the output is also identical. I explored other ways to address the issue * remove completely `\n` between text parts in `to_json_oaicompat` * merge text messages in server-common.cpp before sending them to `to_json_oaicompat` Please propose alternative ways of fixing this issue. Co-authored-by: Piotr Wilkin (ilintar) <piotr.wilkin@syndatis.com> --------- Co-authored-by: Piotr Wilkin (ilintar) <piotr.wilkin@syndatis.com> common : merge qwen3-coder and nemotron nano 3 parsers (#19765) common : fix improper trimming in XML parser on complete message (#19805) Co-authored-by: Jules LEIDELINGER <11395311+julio75012@users.noreply.github.com> jinja: correct stats for tojson and string filters (#19785) jinja : correct default size for string slices (#19913) common : handle unicode during partial json parsing (#16526) common : fix json schema with '\' in literals (#17307) add back qwen_coder_xml and mirothinker Co-authored-by: Aldehir Rojas <hello@alde.dev>	2026-03-09 11:03:33 +01:00
Kawrakow	fa0c29843d	Fix split mode graph with Qwen3.5-MoE/Qwen3-Next hybrid inference (#1368 )	2026-03-06 07:26:15 +01:00
Kawrakow	1ef4b5eddc	Disable split mode graph for recurrent/hybrid models when tensor overrides (#1366 )	2026-03-05 10:25:50 +01:00
dungquixote42	a903409a5e	fix adaptive p sampler rewinding too far back (#1359 ) * fix adaptive p sampler rewinding too far back * update comments * correct default value for total_weight, more comments * new variables/names * update comment for n_rewind * move null pointer check back to common_sampler_review() * refactor weighted_sum and total_weight to vector<pair>, better boundary check in llama_review_adaptive_p_impl()	2026-03-04 13:26:25 +01:00
Kawrakow	fd16a418de	Fix clang warnings on macOS (#1354 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2026-03-03 16:27:16 +01:00
Kawrakow	3735e88925	Remove unused tensors from delta-net (#1350 )	2026-03-02 16:02:40 +01:00
Kawrakow	d239dabcc6	Graph parallel for Qwen-3.5-MoE (#1347 ) * Graph parallel for Qwen3.5-MoE * Add --max-gpu to llama-bench * Fix graph reuse when not all GPUs participate in self-attention	2026-03-02 07:48:43 +01:00
Kawrakow	0ff3a43289	Bring back #1333 and #1335 (#1340 ) * Bring back fused delta net 3 * Remove autoregressive and chunking	2026-02-28 14:31:42 +01:00
Kawrakow	1922449b2c	Revert delta net 3 (#1339 ) * Revert "Simplify delta-net (#1335)" This reverts commit e5fc30244cf638852293390bfdbda856d6b0869e. * Revert "Fused delta net 3 (#1333)" This reverts commit 7b68353e0920c0c472bc28c708e38a6766490eb8.	2026-02-28 13:12:08 +01:00
Kawrakow	e5fc30244c	Simplify delta-net (#1335 ) * Simplify delta-net * Minor * Minor	2026-02-28 11:12:19 +01:00
Kawrakow	7b68353e09	Fused delta net 3 (#1333 ) * This is better than chunked * Keep the state in registers * Cleanup * Remove unused stuff * Minor * Make fused delta-net the default * Fix race	2026-02-27 15:02:56 +01:00
Kawrakow	1e6d36b1b4	Graph parallel for dense Qwen-3.5 models (#1331 ) * Graph parallel for dense Qwen-3.5 models * Cleanup	2026-02-27 07:03:25 +01:00
Kawrakow	0aa6f7e7cd	Adding support for dense Qwen-3.5 models (#1326 )	2026-02-26 08:51:01 +01:00
Kawrakow	2616efa296	Fused delta net 2 (#1320 ) * Revive fused delta-net * Add command line argument for fused delta net * Simplify/improve CUDA delta-net * Add -fdn to llama-bench * More CUDA fused delta net optimizations * CPU optimizations * Much faster fused delta-net on the CPU It seems it is faster than the chunked implementation! * Change meaning of fdn from bool flag to threshold value * Use eps = 1e-6 * Give some nodes a name * Don't re-apply L2 norm - it has already been done * This seems quite a bit better * More tweaks * Restore per context buffer size log Not everybody uses models split in 2000 parts, and those who do, actually want to see the biffer sizes.	2026-02-26 06:53:43 +01:00
firecoperana	3fac78c48b	server: enable checkpoint for recurrent models (#1310 ) * server: enable checkpoint for recurrent models create checkpoint after cancel fix ban string and rm context during rewind add checkpoint interval only save recurrent cache * save checkpoint during pp --------- Co-authored-by: firecoperana <firecoperana>	2026-02-26 06:51:18 +01:00
Kawrakow	c77ec4b8b8	Fused delta-net (#1315 ) * Revive fused delta-net * Add command line argument for fused delta net * Simplify/improve CUDA delta-net * Add -fdn to llama-bench * More CUDA fused delta net optimizations * CPU optimizations * Much faster fused delta-net on the CPU It seems it is faster than the chunked implementation! * Change meaning of fdn from bool flag to threshold value * Use eps = 1e-6 * Give some nodes a name	2026-02-25 14:12:48 +01:00
Nexes the Elder	0bf7043a7b	Display the size of the tensors overriden during the tensor loading (#1318 ) * Display the size of the tensors overriden during the tensor loading Ex: `Tensor blk.60.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.60.ffn_up_exps.weight buffer type overriden to CPU` become `Tensor blk.60.ffn_up_exps.weight (size = 668467200 bytes) buffer type overriden to CPU Tensor blk.60.ffn_gate_exps.weight (size = 668467200 bytes) buffer type overriden to CPU` And pass in debug the later displayed size of the unnamed buffer overrides. Ex : `llm_load_tensors: CPU buffer size = XXX.XX MiB` That double display is cluttering the screen without being very informative. * change bytes display to MiB. Co-authored-by: Kawrakow <iwankawrakow@gmail.com> --------- Co-authored-by: Kawrakow <iwankawrakow@gmail.com>	2026-02-25 07:36:27 +01:00
Nexes the Elder	170467e835	Llama-quantize: Partial requant feature (#1313 ) * Partial Requant feature for llama-quantize - Inspired by the recently portcopied --dry-run feature. - Allows to partially requantize a split quantized .gguf by requantizing only the missing splits in the destination directory. - Works both for GGUF which are split tensors by tensors, or by group of several tensors (though this one is not very much tested beyond 2 tensors by split). - Vibe coded. * Create output directory if it doesn't exist in llama-quantize * Create output directory if it doesn't exist in gguf-split * Add exit when directory fails to be created on Windows * Use std::filesystem * cleanup	2026-02-25 07:25:15 +01:00
dungquixote42	aaa545c3dc	adaptive p: collect probability before logit bias (#1314 )	2026-02-24 15:39:17 +01:00
Kawrakow	7065488135	Slightly better graph parallel for Qwen3-Next (#1307 ) * Make sure we pick the reduced tensor from the right GPU * Minor	2026-02-24 15:22:30 +01:00
Kawrakow	cfb6747776	llama-quantize: --dry-run option (#1309 )	2026-02-24 15:21:52 +01:00
Kawrakow	5dacb5355a	Graph parallel for Qwen3-Next (#1292 ) * WIP * This works, but is slower than split mode layer	2026-02-23 07:58:00 +01:00
Kawrakow	89b1e2b518	Better estimate for max. number of compute nodes (#1296 ) * Better estimate for max. number of compute nodes * Just in case	2026-02-22 18:16:49 +01:00
Samuel Oliveira Alves	09a88c9ae5	Add MTP decoding support for GLM-4.x MoE (#1270 ) * wip: port MTP architecture Ports the Multi-Token Prediction (MTP) architecture to the older `llama.cpp` codebase used by `ikllama`. Changes include: - Updating `llama_batch` to support `mtp_params`. - Modifying `llama_decode_internal` (and `encode`) to handle MTP operations (Warmup, Update, Draft). - Adding public APIs for MTP state management (`llama_set_draft_input_hidden_state`). - Adapting the embedding extraction logic to skip MTP update passes. * Refactors `server_slot` to support generic speculative decoding (MTP or Draft Model). * core: enable hybrid outputs (logits + embeddings) for MTP support * fix(mtp): correct KV-cache slot finding for updates * fix(mtp): persist hidden states to prevent context corruption during drafting * refactor(mtp): clean unused code * fix(mtp): update server to new functions name * fix(mtp): fix graph and save hidden state * mtp: refactor integration, context params and kv cache search * mtp: fix hidden state extraction and speculative acceptance flow * server: fix MTP warmup for long prompts and reset token buffer * llama: refactor MTP operation state to context parameters * server: fix n_past calculation in MTP acceptance * llama: fix mtp enable flags * speculative: refactor MTP to use common_speculative interface * context: remove unused signatures * clip: fix deprecated enum-enum conversion warning * common: fix format string crash in help message * context: fix mtp activation logic	2026-02-22 18:14:39 +01:00
firecoperana	66323b92f7	Qwen3.5-MoE: fix regenerating message error (#1295 ) Co-authored-by: firecoperana <firecoperana>	2026-02-21 18:24:12 +01:00
Kawrakow	13c3d83ce7	Qwen3.5-MoE support (#1288 ) * WIP: loads and runs, but not correct Very high PPL, empty TG. * This appears to work	2026-02-21 08:33:06 +01:00
dungquixote42	0f411b02e2	Fix adaptive p sampler bug with string ban (#1287 ) * adaptive p: upadte internal state only if not rewinding * adaptive p: conditional update for speculative decoding * adaptive p: refactor to rewind instead of update * adaptive p fix: better comments * fix rewind check * add record to handle multi-token rewind * better comment	2026-02-20 07:11:36 +01:00
Kawrakow	e30198a553	WIP: Qwen3Next (#1266 ) * qwen3next: add architecture support and recurrent-state fixes * qwen3next: optimize broadcast sub and single-seq ssm conv * cuda: build MoE row mapping on device in mul_mat_id * cuda: add guarded multi-seq fast path for ssm_conv * docs: update qwen3next perf report for cuda MoE/SSM tuning * cuda: reduce qwen3next moe/ssm sync overhead and refresh eval * qwen3next: split cpu/cuda eval builds and tune PP scheduling * qwen3next: harden seq-state flow and support optional dense FFN layers * qwen3next: trim delta-net graph overhead in chunking path * qwen3next: remove redundant v_conv cont in delta path * qwen3next: avoid extra cont on linear attention output * qwen3next: drop redundant cont before recurrent state flatten * qwen3next: keep recurrent state in 4d layout through delta path * qwen3next: add fused delta-net op and wire model path * tests: add backend-op coverage for ggml_delta_net * qwen3next: add runtime switch for fused delta-net path * docs: refresh qwen3next perf review and benchmark matrix * qwen3next: default fused delta-net off and document quality checks * qwen3next: add decode-only fused delta mode * qwen3next: make fused delta safe by default and fix fused tensor layout * qwen3next: warn when forcing fused decode mode * qwen3next: add fused-delta regression runner script * qwen3next: integrate fused regression into eval harness * qwen3next: clean up chunked delta-net shape handling * qwen3next: add absolute sanity guards to fused regression * qwen3next: add unified regression runner script * qwen3next: disable flash-attn for cpu-only contexts * docs: reconcile qwen3next status and remaining upstream gaps * common: add qwen3next fused-delta runtime flag * cuda: add qwen3next delta-net kernel dispatch override * docs: update qwen3next quality and serving baseline findings * qwen3next: keep fused delta on safe path and remove PR artifacts * qwen3next: align autoregressive delta-net decode layout * Revert 
"qwen3next: align autoregressive delta-net decode layout" This reverts commit 9241164a5ea9e032a2456fbf2dd0bf798b264fd7. * cuda: port solve-tri fast-paths for qwen3next delta-net * qwen3next: add fused-delta runtime flag and drop env toggle * qwen3next: make fused delta single-flag and default on * Account for GPU arch differences * Revert "cuda: build MoE row mapping on device in mul_mat_id" This reverts commit 89e9ecfa840b04e88699ab3803eb732cd78727f9. * qwen3next: drop non-essential MoE scheduling and split heuristics * qwen3next: avoid generic ggml_sub broadcast changes * llama: restore only_active_experts log message * Remove unnecessary hacks, disable fusion for now. * qwen3next: port hybrid recurrent state memory semantics * qwen3next: clean up recurrent state slot plumbing * qwen3next: fix hybrid V-cache layout plumbing * qwen3next: guard recurrent state slots against kv capacity * qwen3next: persist recurrent state in session data - serialize/restore qwen3next cache.s_l in state/session paths\n- bump session and sequence-state file versions for format change\n- fallback to single-token chunking for mixed repeated seq_id batches * qwen3next: drop unused fused-delta builder path - remove dead build_delta_net_fused lambda\n- remove unused llm_build_context::fused_delta member * qwen3next: remove unused fused-delta CLI/context plumbing - drop -fd/-no-fd options and related YAML dump field\n- remove fused_delta fields from public/internal context params\n- remove fused_delta assignment and logging in context init * ggml: remove unused DELTA_NET operator stack * Missing include * Reorder ops/unary ops So we don't change again the enum values of the mul mat ops * Minor * Discard unnecessary changes in llama-build-context.cpp * Minor * Revert "Discard unnecessary changes in llama-build-context.cpp" This reverts commit edadb80ed68c4c0831e9c22609a9a3af19be9735. 
* Increase GGML_SCHED_MAX_SPLITS - required for larger u-batches * Fix CPU concat in the TG case: 7.25 -> 10.5 t/s for Qwen3Next * Fix CPU sum_rows: 10.5 -> 13.6 t/s for Qwen3Next It was single-threaded and was taking ~25% of the computation time during TG. It is now down to 2%. Strangely enough, I measure 13.6 t/s with llama-bench, but if I let the model give me an actual response with llama-cli, I get close to 17 t/s. * Fix CPU scale: 13.6 -> 16.7 t/s for Qwen3Next For Qwen3Next there is a scale op on a largish tensor (548k elements) that has a single row for TG, so was done in a single thread. We now simply use blocks of 1024 elements. * Optimize CPU mul: 16.7 -> 17.6 t/s for Qwen3Next * CPU: fuse transpose -> cont -> sum_rows -> transpos: 17.6 -> 23.1 t/s for Qwen3Next * Optimize CPU repeat: 176 -> 200 t/s for Qwen3Next PP-512 * Multithreading for OP_SUB * Don't commit with timing trace on * Multithread neg and sigmoid * Be able to turn on/off fusion more easily (CPU) * Name the mul_mat ops so we know where the time goes * WIP * Much better PP on CUDA * CUDA: fuse transpose -> cont -> sum_rows -> transpose Needs non-coontiguous variant of sum_rows. On the CPU this gave 30+% improvement in TG performance, on CUDA ist is disapointing 6-7%. I guess, this is because Georgi's cont CPU implementation was so bad that skipping it made such a big difference. * CUDA: faster mul for special case relevant for Qwen3Next Worth 1% in TG * Fix CPU OP_CONT --------- Co-authored-by: yurko <yurko@local> Co-authored-by: Yurko <yurko@example.com> Co-authored-by: yurko <yurko@pop-os.tail5a1a6b.ts.net> Co-authored-by: Yurko Hoshko <YurkoHoshko@users.noreply.github.com>	2026-02-16 06:50:28 +01:00

311 Commits