Commit Graph

  • f96eaddba8
    Revert DFlash SWA optimization (#2039) main Kawrakow 2026-06-26 11:00:09 +02:00
  • 0440345ba9 Revert DFlash SWA optimization ik/revert_dflash_swa_opt Kawrakow 2026-06-26 08:58:50 +00:00
  • 1255b1e479
    Minor DFlash tweaks (#2034) Kawrakow 2026-06-26 10:31:03 +02:00
  • af62a37acd
    Prune examples/llava. Dead code. (#2025) Farmadupe 2026-06-26 07:48:48 +01:00
  • c713bd599b
    llama : fix CPU-only load crash on a CUDA build (device_mem out-of-bounds) (#2037) mb8565 2026-06-26 01:47:19 -05:00
  • 0ffdf509ab
    ggml : fix set_rows CPU crash when the destination is F32 (#2038) mb8565 2026-06-26 01:46:26 -05:00
  • a4e408611d Minor DFlash tweaks ik/dflash_tweaks Kawrakow 2026-06-25 15:10:16 +00:00
  • b84902d2ad
    Split mode graph for dense Qwen35 MTP (#2027) Kawrakow 2026-06-25 11:12:22 +02:00
  • d3e86a5431
    Free raw multimedia data from server_tokens after encoding, as it will never be read again (#2029) Farmadupe 2026-06-25 09:18:32 +01:00
  • bdf5c081dc
    DFlash: enable sliding-window attention for draft models (#2021) Joel Farthing 2026-06-25 02:06:54 -05:00
  • 4553cd0059
    cuda : fix MLA flash-attn vec decode for asymmetric K/V head sizes (#2031) mb8565 2026-06-25 01:56:17 -05:00
  • e1670f6c6c Merge remote-tracking branch 'origin/main' into ik/qwen35_mtp_smgraph ik/qwen35_mtp_smgraph Kawrakow 2026-06-24 16:32:10 +00:00
  • d5507e33ae
    Split mode graph for dense Gemma4 assistant (#2022) Kawrakow 2026-06-24 18:29:32 +02:00
  • 9acd6a4cb2 Split mode graph for dense Qwen35 MTP Kawrakow 2026-06-24 16:21:17 +00:00
  • 1f5828eaa4 It is better to use llama_context pointers as keys ik/g4_assistant_smgraph Kawrakow 2026-06-24 13:53:59 +00:00
  • de6c2dfdec Compiler warnings Kawrakow 2026-06-24 09:34:58 +00:00
  • bf23a7599c
    Avoid Gemma4 assistant strange tensor name warnings (#2023) Kawrakow 2026-06-24 11:23:22 +02:00
  • 9283af5ed8 Avoid Gemma4 assistant strange tensor name warnings ik/tensor_names Kawrakow 2026-06-24 09:20:41 +00:00
  • 118c82d8ef This works Kawrakow 2026-06-24 08:35:06 +00:00
  • 75a5f6d079 Per model CUDA contexts Kawrakow 2026-06-23 16:03:28 +00:00
  • 3530b65869 WIP: Split mode graph for Gemma4 assistant Kawrakow 2026-06-23 13:34:51 +00:00
  • 7cacf28eec
    Fix minor GGML discrepancies (#2016) Nexes the Elder 2026-06-24 09:09:33 +02:00
  • 8686ea708b
    chat: Cohere2MoE/North Code: parse unopened thinking under --reasoning off (follow-up to #1968) (#2012) Joel Farthing 2026-06-24 02:04:41 -05:00
  • 5a4fa17947
    Load glm-dsa indexer tensors as optional (ggml-org/llama.cpp#24770) (#2017) Yap Sok Ann 2026-06-24 14:04:09 +07:00
  • 997b289d93
    jinja: give each for-loop iteration a fresh scope (#2018) Yap Sok Ann 2026-06-24 13:58:36 +07:00
  • a7d35d51dc
    eval-callback : sum over the full tensor, not just the printed slice (#2019) mb8565 2026-06-24 01:57:19 -05:00
  • befbc0945b
    server: variance based checkpoint eviction (#2020) firecoperana 2026-06-24 01:54:07 -05:00
  • 3476dd6a40 server: variance based checkpoint eviction fcp/checkpoint_min_var firecoperana 2026-06-23 20:22:06 -05:00
  • 7ccf1d2095
    allow user to use THP for host allocations with GGML_CUDA_HOST_MALLOC_THP (#2010) Farmadupe 2026-06-23 14:13:41 +01:00
  • 2d3ecd5e19
    Fix minor CUDA discrepancies (part 2) (#2015) Nexes the Elder 2026-06-23 14:03:22 +02:00
  • 9eaf86a7c7
    Fix minor CUDA discrepancies (#2005) Nexes the Elder 2026-06-23 09:37:48 +02:00
  • 69a8336d08
    Add native MiniMax-M3 tool call parser (#2008) Jun Yamog 2026-06-23 19:36:02 +12:00
  • b2b4f66fa0
    tests: add Seed-OSS chat template fixture (#2014) Joel Farthing 2026-06-23 02:35:28 -05:00
  • b47b90d0be
    Add Laguna M.1 GGUF support (#2003) empty-quiver 2026-06-22 10:53:10 -04:00
  • 64fceb70bc
    DFlash: use persistent FA-ready K/V cache (#1997) Joel Farthing 2026-06-22 09:49:35 -05:00
  • 72440a19fc
    on-demand tensor reload (#1989) magikRUKKOLA 2026-06-22 14:36:34 +00:00
  • 6c00e87ac8
    cmake: drop ggml-blas.h from GGML_PUBLIC_HEADERS (#2007) a1batross 2026-06-21 10:49:09 +05:00
  • d47f484d29
    Force Gemma4 assistant to be loaded on last GPU (#1999) Kawrakow 2026-06-19 18:17:13 +02:00
  • 8369cf7412
    Allow graph reuse for Gemma4 MTP (#1996) Kawrakow 2026-06-19 18:16:53 +02:00
  • b21653a56f
    Fully remove any BLAS remnants (#2001) Kawrakow 2026-06-19 17:26:09 +02:00
  • 3cf0f5468f Also these ik/purge_blas Kawrakow 2026-06-19 15:24:24 +00:00
  • d30b35cb97 Fully remove any BLAS remnants Kawrakow 2026-06-19 15:14:27 +00:00
  • e734b76632 Force Gemma4 assistant to be loaded on last GPU ik/gemma4_mtp_last_device Kawrakow 2026-06-19 13:51:11 +00:00
  • d1692e1951 Allow graph reuse for Gemma4 MTP ik/gemma4_mtp_graph_reuse Kawrakow 2026-06-19 09:34:45 +00:00
  • 4bcfe5b872
    Add compatibility for llama.cpp Gemma4 assistant GGUFs (#1995) Kawrakow 2026-06-19 11:24:54 +02:00
  • 25d91dea44 Add compatibility for llama.cpp Gemma4 assistant GGUFs ik/compat_g4_assistant Kawrakow 2026-06-19 07:50:26 +00:00
  • d5c04c15fd
    clean redundancy in dflash graph and small logics (#1994) Samuel Oliveira Alves 2026-06-19 04:04:54 -03:00
  • 7321648844
    Fix Gemma4 MTP compute graph (#1993) Kawrakow 2026-06-19 09:00:44 +02:00
  • 0d59973e4a
    Fix MTP warmup for GLM models (#1992) Kawrakow 2026-06-19 08:59:55 +02:00
  • b3dfb7858c
    AVX VNNI auto-activation for MSVC ; HAVE_VNNI256 path for IQ4_XS_R8 and Qx_0 R4 quants. (#1991) Nexes the Elder 2026-06-18 18:05:19 +02:00
  • 67b0b22760 Fix Gemma4 MTP compute graph ik/fix_gemma4_mtp Kawrakow 2026-06-18 15:51:22 +00:00
  • 2c1dc8781b Fix MTP warmup for GLM models ik/glm_mtp_warmup Kawrakow 2026-06-18 13:15:10 +00:00
  • 3b81f63acd Update AUTHORS Kawrakow 2026-06-18 08:11:41 +00:00
  • 21f918c185
    faster ggml_cuda_host_malloc (#1988) Farmadupe 2026-06-18 09:01:34 +01:00
  • f5e5753c32
    Fix Qwen35 mtp warmup (#1987) Kawrakow 2026-06-18 09:03:40 +02:00
  • 71af16a6b7
    Fix DFlash performance with split mode graph (#1980) Kawrakow 2026-06-17 18:40:02 +02:00
  • dc81d79cb6 Provide API to get the model arch string ik/fix_qwen_mtp_warmup Kawrakow 2026-06-17 16:18:32 +00:00
  • 2ba9c2f404 Cleanup + remove unnecessary crippling performance by not using accept to sample draft token Kawrakow 2026-06-17 16:07:19 +00:00
  • ded03457a1 Fix Qwen35 MTP warmup Kawrakow 2026-06-17 15:42:27 +00:00
  • 4f220159b8
    Fix (Gemma-4 Vision): Correct KQ mask fill for causal models in non-causal flash-attn mode (#1985) gapeleon 2026-06-18 00:52:45 +10:00
  • 5b9c3bbc3b Fix DFlash performance with split mode graph ik/dflash_fix_smgraph Kawrakow 2026-06-17 05:46:05 +00:00
  • 71cf84c682 Use hidden state from prev token from qwen mtp SamuelOliveirads 2026-06-16 21:31:59 -03:00
  • 064d23a6f8
    Codex CLI Responses Compatibility (#1964) Jun Yamog 2026-06-17 01:28:16 +12:00
  • d37d92b54c
    chat: add Cohere2MoE North Code parser (#1968) Joel Farthing 2026-06-16 08:27:30 -05:00
  • 8420f91ae3
    Merge pull request #1977 from ikawrakow/ik/dflash_fix_cpu Kawrakow 2026-06-16 15:26:23 +02:00
  • 6f45163a95 Fix DFlash on the CPU ik/dflash_fix_cpu Kawrakow 2026-06-16 13:22:36 +00:00
  • f9078e169b
    Merge pull request #1970 from SamuelOliveirads/feat/dflash-implementation Kawrakow 2026-06-16 15:07:55 +02:00
  • 11c9935ce8
    Merge pull request #1893 from ikawrakow/ik/gemma4_mtmd_blindness Kawrakow 2026-06-16 07:47:37 +02:00
  • ad24046b51 minor refactor in DFlash kv cache graph SamuelOliveirads 2026-06-15 18:22:56 -03:00
  • 2f524850a1
    Merge pull request #1973 from ikawrakow/ik/fattn_mma_gqa_16 Kawrakow 2026-06-15 15:24:01 +02:00
  • 37ea89cabf
    Merge pull request #1974 from Nexesenex/fix_muge_crash_minimax_m3 Kawrakow 2026-06-15 15:07:49 +02:00
  • 3c9680fd3c Fix Minimax M3 crash when -muge merges up/gate experts Nexesenex 2026-06-15 14:36:14 +02:00
  • 6be3a488d3 CUDA FA: faster TG when GQA is 16 and head size is 128 ik/fattn_mma_gqa_16 Kawrakow 2026-06-15 11:46:02 +00:00
  • f81673c7db
    Merge pull request #1972 from ikawrakow/ik/minimaxm3_smgraph Kawrakow 2026-06-15 13:44:19 +02:00
  • e927adc4ad
    Merge pull request #1969 from Farmadupe/resize_algo_fix Kawrakow 2026-06-15 13:39:11 +02:00
  • 00d96744de
    Merge pull request #1967 from Farmadupe/stb_image_resize2 Kawrakow 2026-06-15 13:38:31 +02:00
  • 1dc4ea938a
    Merge pull request #1962 from ikawrakow/ik/fix_1961 Kawrakow 2026-06-15 13:00:27 +02:00
  • c24d50dd88 Split mode graph for MiniMax-M3 ik/minimaxm3_smgraph Kawrakow 2026-06-15 08:41:34 +00:00
  • 567854aeab
    Merge pull request #1963 from jkyamog/minimax-m3-support Kawrakow 2026-06-15 10:16:10 +02:00
  • c08d194edd Use standard graph helpers for MiniMax-M3 Jun Yamog 2026-06-15 01:57:09 +00:00
  • c538210e6d Add MiniMax-M3 chat template Jun Yamog 2026-06-15 01:29:13 +00:00
  • 6cae8c7ba2 clean logs SamuelOliveirads 2026-06-14 21:07:57 -03:00
  • 19f08160ad Correct image resize algorithm for all qwens after qwen2vl and gemma4 Thomas Green 2026-06-14 21:57:11 +01:00
  • 574f22b3c7 Replace image resizers with avx2/neon simd impls from stb_img_resize2.h Thomas Green 2026-05-31 06:12:36 +01:00
  • 0d75eee35a remove duplicated code and unnecessary refactor SamuelOliveirads 2026-06-14 16:02:02 -03:00
  • 4f1ec69ae5
    Merge pull request #1965 from Nexesenex/fix_q8_0_graph_reduce_type Kawrakow 2026-06-14 16:32:48 +02:00
  • 0fdac83272 Fix Q8_0 graph reduce type Nexesenex 2026-06-14 16:07:36 +02:00
  • 0df00b3b94 Add preliminary MiniMax-M3 support Jun Yamog 2026-06-14 12:23:20 +00:00
  • c73bfbe9ce Fix #1961 ik/fix_1961 Kawrakow 2026-06-14 07:42:39 +00:00
  • 670a3f6f5b
    Merge pull request #1960 from BeccaLabs/fix/rpc-device-init Kawrakow 2026-06-14 08:14:07 +02:00
  • 3b1a0f88d5 Add logging for DFlash statistics and clean up workspace handling SamuelOliveirads 2026-06-13 20:14:08 -03:00
  • 053202f97a fix: initialize rpc_device endpoint and device index before parsing BECCA-Labs 2026-06-13 16:13:44 -05:00
  • 3a1d46c4d1 Merge remote-tracking branch 'origin/main' into feat/dflash-implementation SamuelOliveirads 2026-06-13 17:27:52 -03:00
  • 5f917a64b3
    Merge pull request #1958 from ikawrakow/ik/handle_think_no_space Kawrakow 2026-06-12 21:27:23 +02:00
  • 8a38025174
    Refactor: Move spec outside server (#1949) Samuel Oliveira Alves 2026-06-12 13:12:39 -03:00
  • d1339249d7
    Cleanup: Unify location of m-rope repacking for token and embd (#1924) Farmadupe 2026-06-12 07:27:50 +01:00
  • b1eb8bb0a1
    server: gate llama_decode_stop() to the active decode (fix queued-cancel cascade) (#1941) Simon Lundell 2026-06-12 08:25:44 +02:00
  • 5fb707d19b
    Update docs (#1956) Marian M. 2026-06-12 09:24:22 +03:00
  • 175819b4fb Style ik/handle_think_no_space Kawrakow 2026-06-12 06:19:06 +00:00
  • 3dbc3241b9 Handle forced-open reasoning tag without trailing whitespace Kawrakow 2026-06-12 05:43:11 +00:00