9612 Commits

Author SHA1 Message Date
Aleksander Grygier
f7ca93d12c
ui: PWA support (#23871)
* feat: Add basic PWA support and service worker for offline caching

* feat: Vite PWA implementation WIP

* feat: Improve PWA icons generation

* feat: Add PWA workbox to server routes

* feat: Include `version.json` in static assets

* feat: Add HTTP cache headers for PWA static assets

* feat: Update app name for `apple-mobile-web-app-title`

* feat: Implement PWA versioning and automatic update detection

* chore: Update `.gitignore` files

* feat: Splash Screens

* feat: Add dark mode favicon support

* refactor: Cleanup

* fix: Use dark logo for dark splash screens

* refactor: Simplify favicons SVG code

* fix: Adjust caching and polling for reliable service worker updates

* fix: Add missing favicon entry

* fix: Align PWA service worker configuration with SvelteKit build structure

* fix: Replace hashed bundle paths with versioned static paths

* test: Add PWA tests

* ci: Add build output for unit tests

* refactor: Cleanup

* fix: Server build & release versioning

* chore: Update package-lock.json

* chore: Increase PWA cache size

* chore: Update packages

* feat: Update favicons

* refactor: Post-merge fix

* feat: support explicit build version for PWA cache busting

* fix: CI

* feat: Improve PWA Refresh Alert UI

* feat: Add toggleable build version display

* refactor: Cleanup

* feat: Add version mismatch detection and manual app reload

* refactor: replace dynamic imports with static

* refactor: Cleanup

* feat: Add safe space for `pwa-<size>.png` rendered icons

* fix: use relative paths for PWA assets to support base path deployment

* feat: add PWA mode detection via URL query parameter

* feat: Use ?cache=true for SW-cached PWA assets

* refactor: Build process cleanup

* refactor: Decouple PWA versioning and remove ?cache=true workaround

* chore: Update README logo

* feat: Include PWA Assets generation in build script

* refactor: `usePwa` hook for core layout

* fix: Relativize base vite plugin

* fix: remove unnecessary backslash escapes in test regexes

* test: update static asset paths for API Key test

* refactor: Move SvelteKit PWA Options config to constants

* ui: fix update notification never appearing

Keep the PWA hook object intact instead of destructuring needRefreshByStorage,
which freezes the reactive getter. Also exclude loading.html from PWA
precache to prevent 404 errors and broken SW installation.
2026-06-12 15:53:26 +02:00
Georgi Gerganov
02182fc5b9
fit : avoid including llama-ext.h in fit.h (#24506) b9611 2026-06-12 15:57:05 +03:00
Georgi Gerganov
f532be8fac sync : ggml b9610 2026-06-12 15:55:35 +03:00
Georgi Gerganov
e08c226a2c ggml : bump version to 0.15.1 (ggml/1541) 2026-06-12 15:55:35 +03:00
Adrien Gallouët
70b54e140c
vendor : update cpp-httplib to 0.47.0 (#24395)
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
b9608
2026-06-12 11:34:44 +02:00
Pascal
6471e3c090
UI/jpeg exif orientation (#24196)
* ui: bake jpeg exif orientation into uploaded images

stb_image in mtmd ignores exif metadata, so rotated smartphone photos
reach the model with raw pixel orientation. The webui now reads the
exif orientation tag at send time and feeds it into the existing
capImageDataURLSize canvas pass: the browser applies the rotation when
decoding, so capped images come out upright for free, and images under
the cap threshold get a single plain redraw when orientation > 1.

At most one re-encode ever happens per image. Upright jpegs with
capping disabled pass through untouched, bit perfect.

Adds jpeg-orientation.ts with a minimal exif parser working on a
bounded base64 prefix (both endianness, returns 1 on any malformed
input) and unit tests against handcrafted jpeg byte streams.

* ui: move jpeg exif constants into lib/constants

* ui: add browser test for jpeg orientation and capping

Covers capImageDataURLSize end to end in chromium with real Pillow
generated jpeg fixtures across exif orientations 1/3/5/6/8: upright
quadrant colors checked pixel-wise, expected dimensions with and
without capping, no orientation tag left in the output, and strict
passthrough when nothing needs rewriting.
2026-06-12 10:20:27 +02:00
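Note on the "UI/jpeg exif orientation" entry above: the message describes a minimal EXIF parser that works on a bounded base64 prefix, handles both byte orders, and returns 1 on any malformed input. The following is only a rough sketch of that idea, written in Python for illustration rather than the TypeScript of jpeg-orientation.ts. The function names and the 64 KiB prefix bound are assumptions, not taken from the commit, and the sketch ignores rarer JPEG details such as padding bytes between segments.

```python
import base64
import struct

ORIENTATION_TAG = 0x0112  # standard EXIF/TIFF tag id for Orientation


def _tiff_orientation(tiff: bytes) -> int:
    """Read the Orientation tag from IFD0 of a TIFF block (hypothetical helper)."""
    if tiff[:2] == b"II":
        e = "<"  # little-endian ("Intel")
    elif tiff[:2] == b"MM":
        e = ">"  # big-endian ("Motorola")
    else:
        return 1
    magic, ifd = struct.unpack(e + "HI", tiff[2:8])
    if magic != 42:
        return 1
    (count,) = struct.unpack(e + "H", tiff[ifd:ifd + 2])
    for i in range(count):
        off = ifd + 2 + 12 * i  # each IFD entry is 12 bytes
        tag, typ, n = struct.unpack(e + "HHI", tiff[off:off + 8])
        if tag == ORIENTATION_TAG:
            if typ != 3 or n != 1:  # must be a single SHORT
                return 1
            (value,) = struct.unpack(e + "H", tiff[off + 8:off + 10])
            return value if 1 <= value <= 8 else 1
    return 1


def jpeg_orientation(data: bytes) -> int:
    """Return the EXIF orientation (1..8), or 1 if absent or malformed."""
    try:
        if data[:2] != b"\xff\xd8":  # SOI marker
            return 1
        pos = 2
        while pos + 4 <= len(data):
            if data[pos] != 0xFF:
                return 1
            marker = data[pos + 1]
            (seg_len,) = struct.unpack(">H", data[pos + 2:pos + 4])
            if seg_len < 2:
                return 1
            body = data[pos + 4:pos + 2 + seg_len]
            if marker == 0xE1 and body[:6] == b"Exif\x00\x00":  # APP1 / EXIF
                return _tiff_orientation(body[6:])
            if marker == 0xDA:  # start of scan: no metadata follows
                return 1
            pos += 2 + seg_len
    except (struct.error, IndexError):
        return 1
    return 1


def orientation_from_base64(b64: str, max_chars: int = 64 * 1024) -> int:
    """Decode only a bounded prefix (assumed bound) before parsing."""
    prefix = b64[: max_chars - max_chars % 4]  # keep a multiple of 4 chars
    try:
        return jpeg_orientation(base64.b64decode(prefix))
    except ValueError:  # binascii.Error is a ValueError subclass
        return 1
```

Falling back to 1 (upright) on every failure path mirrors the behaviour the message describes: a parse failure can only cause a photo to pass through with its raw orientation, never block the upload.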
Ruixiang Wang
88a39274ec
spec: add EAGLE3 speculative decoding support (#18039)
* llama : enable layer input extraction

* spec: support eagle3

* eagle3: fix params bug

* eagle3: support Gemma4 eagle3 from RedHatAI

* eagle3: set sync when get features from target

Co-authored-by: tnhnyzc <115956684+tnhnyzc@users.noreply.github.com>

* eagle3 : fix ubatch handling in embd_layer_inp extraction and encoder

Co-authored-by: Doğaç Eldenk <dogacel@gmail.com>

* eagle3: adapt to upstream changes

* eagle3: fix rebase issues and adapt to upstream changes

* eagle3: exclude the eagle3 arch from test-llama-archs

* eagle3: fix editorconfig check failures

* eagle3: fix multi-seq issue in d2t vocab mapping

* cont : minor style / clean-up

* spec : remove `common_speculative_setup_draft_model()`

* llama : clean-up unused API

* eagle3: set d2t vocab mapping in decode graph

* cont : assert layer inputs are configured

* hparams : use n_embd_inp instead of n_embd_target_features

* eagle3: make output.weight optional and inherit from target model when needed

* hparams : generic norm-before-residual param

* llama-ext : consistent names

* cont : fix

* hparams : remove target_hidden_size

* cparams : rename output_layer_inp -> embeddings_layer_inp

* arch : reuse ATTN_NORM_2 instead of adding new hidden norm

* llama : clean-up names

* cont : add assert + comment

* Update conversion/llama.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: tnhnyzc <115956684+tnhnyzc@users.noreply.github.com>
Co-authored-by: Doğaç Eldenk <dogacel@gmail.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
b9606
2026-06-12 10:21:06 +03:00
ZihaoMu
85f99dca8b
ggml: support concat for scalar types at cuda backend (#24011)
* cuda: support concat for scalar types

* Update concat.cu

* fix metal ci issue
b9605
2026-06-12 09:32:44 +03:00
Neo Zhang
099ea76fb4
[SYCL] Fix CI build & release for SYCL backend (#24387)
* restore SYCL build and release, remove github cache

* modify for test only

* verify the ccache is used

* remove debug code change

* rm duplicate action, update key in ccache

* add action ccache-clear after building in both ubuntu and windows

* set %NUMBER_OF_PROCESSORS% in windows build
b9604
2026-06-12 09:30:24 +03:00
shaofeiqi
ba1df050f3
opencl: add q5_0/q5_1 gemm and gemv kernels for Adreno (#24319)
* opencl: add q5_0 adreno support

* opencl: add q5_1 adreno support

* opencl: cosmetic fix

---------

Co-authored-by: Li He <lih@qti.qualcomm.com>
b9603
2026-06-11 21:43:09 -07:00
wencan
1593d5684d
docker : support specifying the GCC version for CUDA (#24447) 2026-06-11 23:12:09 +02:00
Jeff Bolz
4c6595503f
vulkan: ifdef eMesaHoneykrisp (build fix) (#24479)
Fixes build/CI after #24306.
b9601
2026-06-11 13:22:17 -05:00
Georgi Gerganov
263cc04a54 sync : ggml 2026-06-11 19:34:19 +03:00
Georgi Gerganov
17e59d6209 ggml : bump version to 0.15.0 (ggml/1539) 2026-06-11 19:34:19 +03:00
Winston Ma
fdc3db9b65
vulkan: add fast path for contiguous buffer transfers (#23973) 2026-06-11 15:46:25 +02:00
Kevin Liu
1af154a76f
vulkan: use medium matmul tile on Asahi Linux (#24306)
* vulkan: use medium matmul tile on Asahi Linux

* vulkan: switch Apple detection to Honeykrisp driver id
2026-06-11 15:43:04 +02:00
Xuan-Son Nguyen
18ef86ecec
server: skip unused log lines on router mode (#24463) b9596 2026-06-11 11:36:35 +02:00
o7si
1bfbdb134e
vocab : adopt leading TemplateProcessing special token as BOS (#24428) 2026-06-11 10:37:23 +03:00
o7si
68f30663cf
vocab : refactor normalizer flags into options struct, add strip_accents (#24371)
* vocab : refactor normalizer flags into options struct, add strip_accents

* Update src/llama-vocab.h

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/llama-vocab.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
b9594
2026-06-11 10:36:50 +03:00
Aldehir Rojas
db94854ff5
server : skip checkpoints beyond pos_next (#24411)
* server : skip checkpoints beyond pos_next

* cont : update comment + TODO + ref

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-06-11 10:18:12 +03:00
Adrien Gallouët
ac4cddeb0d
vendor : update LibreSSL to 4.3.2 (#24397)
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
b9592
2026-06-10 22:28:03 +02:00
Gaurav Garg
e95dae18d6
Remove padding and multiple D2D copies for MTP (#24086)
* Make ggml_gated_delta_net take only the initial recurrent state (D, 1, n_seqs) and pass the snapshot count K as an op parameter instead of inferring it from state->ne[1].

Remove the padding hack and copy all emitted snapshots into the recurrent cache with a single strided ggml_cpy

* Make GDN changes in all backends. Address review comments.

* Fix CI build errors
b9591
2026-06-10 23:21:16 +05:30
Tarek Dakhran
d2462f8f7a
chat: fix LFM2/LFM2.5 ignoring json_schema (#24377)
The LFM2 specialized template handler only built a grammar for tool-calling,
silently ignoring json_schema from response_format.
b9590
2026-06-10 14:41:41 +02:00
Oliver Simons
fb83cc9a07
CUDA: Fix ssm_scan_f32 data-races (#24360)
* Add missing syncthreads before reusing cub_temp_storage

__syncthreads() is required before being allowed to reuse TempStorage
smem:
https://nvidia.github.io/cccl/unstable/cub/api/classcub_1_1BlockLoad.html#_CPPv4I0EN3cub9BlockLoad4LoadEv20RandomAccessIteratorRA14ItemsPerThread_1Ti

* Add one more missing __syncthreads

Could also double-buffer, but alternative is to simply ensure all
threads have read smem* before writing to it again in the next loop
iteration

* Remove unused smem from ssm_scan_f32
b9589
2026-06-10 14:27:08 +02:00
Sigbjørn Skjæret
039e20a2db
ci : bump komac version (#24396) 2026-06-10 09:45:20 +02:00
ddh0
d2e22ed975
speculative : fix "ngram-map-k4v" name in logging (#24253)
This is a non-functional change.

When using `--spec-type ngram-map-k4v`, the log messages at startup and
runtime say `ngram-map-k`. Added logic in the constructor of
`common_speculative_impl_ngram_map_k` to pass the correct
`COMMON_SPECULATIVE_TYPE_NGRAM_MAP_K4V` when `config.key_only` is
`false`.

After this change, the log messages use the correct name.
b9587
2026-06-10 09:31:35 +02:00
Rémy Mathieu
76da2450a4
webui: implement pinned conversations support (#21387)
* webui: implement pinned conversations support

* webui: linter/prettier pass

* Fix the unused handleMobileSidebarItemClick from the component.

* the search should find pinned conversations as well

Co-authored-by: Pascal <admin@serveurperso.com>

---------

Co-authored-by: Pascal <admin@serveurperso.com>
b9586
2026-06-09 21:33:22 +02:00
Aarnav Pai
d73cd07674
graph: Fix granite speech model inference by applying embedding scale when deepstack is not used (#24357)
* llama-graph : apply embedding scale when deepstack is not used

* nits: remove non-existent hunyuan-vl from the tests

* apply suggestion from @gabe-l-hart

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
b9585
2026-06-09 19:46:27 +02:00
Sigbjørn Skjæret
e25a32e98c
ci : fix windows release (#24369) b9584 2026-06-09 19:42:23 +03:00
Pascal
483609509d
ui: add opt-in run_javascript frontend tool (#24244)
* ui: add opt-in run_javascript frontend tool

Expose a run_javascript tool to the model, executed entirely in the
browser through the existing agentic loop. Code runs in a Web Worker
inside a sandboxed iframe with an opaque origin, isolated from the
WebUI and its API. Console output, errors and the return value are
fed back as the tool result. The parent enforces a hard timeout by
removing the iframe, which terminates the worker.

Disabled by default, toggle in Settings > Developer.

* ui: address review feedback from allozaur

Use the JsonSchemaType enum for the tool definition parameter types
instead of raw string literals, extending it with STRING and NUMBER.

Move the worker shim and the iframe harness html into their own files
so the service no longer carries inline source blobs.

Replace the remaining magic strings with constants: SANDBOX_EMPTY_OUTPUT
and SANDBOX_TRUNCATION_NOTICE, and reuse NEWLINE_SEPARATOR for joins.

* ui: move sandbox worker shim to a raw imported file

Replace the inline worker template string with a real sandbox-worker.js
imported as raw text, and build the iframe harness from it in
sandbox-harness.ts. The raw worker ships as a string, not a module, so
it is excluded from eslint and the typecheck program.
2026-06-09 18:02:31 +02:00
Saba Fallah
49f3542190
mtmd: build_vit batching (#24352) 2026-06-09 16:32:08 +02:00
Jeff Bolz
d6d0ce8215
vulkan: reduce iq1 shared memory usage for mul_mm (#24287) b9581 2026-06-09 13:27:38 +02:00
Ruben Ortlam
b4e3dc613b
vulkan: add v_dot2_f32_f16 support in matrix-matrix multiplication and Flash Attention (#24123)
* vulkan: add support for valve fp16 dot2 extension

* use macro for dot2 path choice

* properly check for the feature

* add dot_product abstraction to reduce preprocessor branching
b9580
2026-06-09 13:27:04 +02:00
Nick Towle
ae735b1314
ui: Fix excessive style recalculation on hover (#24243) 2026-06-09 12:52:20 +02:00
Xuan-Son Nguyen
9682e351b8
mtmd: refactor video subproc handling (#24316)
* mtmd: refactor video subproc handling

* Update tools/mtmd/mtmd-helper.cpp

Co-authored-by: Mikko Juola <mikjuo@gmail.com>

---------

Co-authored-by: Mikko Juola <mikjuo@gmail.com>
b9578
2026-06-09 13:15:12 +03:00
jacekpoplawski
1e912561dd
server: log prompts to directory (#22031)
* server: log prompts to directory

Add `--log-prompts-dir` to write each prompt to a separate text file in
the specified directory.

* Apply suggestion from @ngxson

---------

Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
b9577
2026-06-09 12:09:07 +02:00
Pascal
efbacf8d21
ui: fix mobile chat form overflow and bust stale bundle cache (#24158) 2026-06-09 11:12:58 +02:00
Pascal
26021699bc
ggml : add GGML_OP_COL2IM_1D (#24206)
* cpu: add GGML_OP_COL2IM_1D

Add the overlap-add (scatter-add) step of a 1D transposed convolution.
A ConvTranspose1d factorizes as a GEMM followed by col2im: a weight
pre-permuted to [IC, K*OC] is contracted against the [IC, T_in] input
with mul_mat to produce a column matrix [K*OC, T_in], and col2im_1d
scatters those columns back into the [T_out, OC] signal, with
T_out = (T_in - 1)*s0 + K - 2*p0.

Keeping the contraction as a plain mul_mat leaves the heavy work on the
optimized (and quantizable) matmul kernels, so col2im_1d only does the
cheap overlap-add.

CPU uses a gather formulation parallelized over output channels,
supporting F32, F16 and BF16 with an F32 accumulator.

* tests: add backend coverage for GGML_OP_COL2IM_1D

Add test_col2im_1d next to the conv_transpose_1d cases, covering F32,
F16 and BF16 across eight geometries: the canonical kernel = 2*stride
DAC upsampling shape, overlap, no overlap, cropping (p0 = 1 and
p0 = stride/2), kernel < stride with zeroed gaps, kernel not a
multiple of stride, and a single column unfold.

Perf mode gets three real vocoder stage shapes reporting memory
bandwidth. max_nmse_err relaxes to 5e-4 for F16 and BF16.

* cpu: harden GGML_OP_COL2IM_1D

ggml_col2im_1d validates s0, oc, p0 and input contiguity at graph
build time, before the oc division, protecting every backend at once.
The kernel asserts the contiguity its flat indexing assumes and its
doc states the full output length including the crop term.

The kernel parallelizes over the time axis: the split stays balanced
down to OC = 1, where the previous channel split was single threaded.
Values are bit identical on the three real vocoder chains, two out of
three improve.

* tests: extend the GGML_OP_COL2IM_1D grid

The eval grid grows to eleven geometries: OC = 1 (mono output stage),
K = 1 with stride > 1 (sparse scatter, every gap position zeroed) and
a crop down to T_out = 2 where all the gather bounds act at once.

* tests: add col2im_1d equivalence test

tests/test-col2im-1d.cpp proves mul_mat + col2im_1d matches the
native ggml_conv_transpose_1d on the CPU backend, F32 bit exact, F16
and BF16 through casts of the column matrix. test-backend-ops cannot
cover this for a CPU only op since the CPU backend is its own
reference there.

* rpc: bump protocol patch version for GGML_OP_COL2IM_1D

GGML_OP_COUNT goes from 96 to 97 with the new op, which trips the
static_assert in ggml-rpc.h. Bump RPC_PROTO_PATCH_VERSION since the
op is appended and no existing op code shifts.
b9575
2026-06-09 12:01:37 +03:00
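Note on the GGML_OP_COL2IM_1D entry above: the message states that a ConvTranspose1d factorizes as a GEMM followed by an overlap-add, with T_out = (T_in - 1)*s0 + K - 2*p0. The sketch below illustrates that factorization in plain Python using nested lists. It is not the ggml kernel: the function names, the `w[ic][oc][k]` weight indexing, and the `cols[k][oc][t_in]` column indexing are assumptions chosen for readability and do not reflect ggml's memory layout.

```python
def conv_transpose_1d(x, w, s0, p0):
    """Direct reference. x is [IC][T_in], w is [IC][OC][K] (assumed layout)."""
    ic_n, t_in = len(x), len(x[0])
    oc_n, k_n = len(w[0]), len(w[0][0])
    t_out = (t_in - 1) * s0 + k_n - 2 * p0
    y = [[0.0] * t_out for _ in range(oc_n)]
    for ic in range(ic_n):
        for ti in range(t_in):
            for oc in range(oc_n):
                for k in range(k_n):
                    t = ti * s0 + k - p0  # scatter position after cropping
                    if 0 <= t < t_out:
                        y[oc][t] += x[ic][ti] * w[ic][oc][k]
    return y


def gemm_columns(x, w):
    """GEMM step: cols[k][oc][ti] = sum over ic of w[ic][oc][k] * x[ic][ti]."""
    ic_n, t_in = len(x), len(x[0])
    oc_n, k_n = len(w[0]), len(w[0][0])
    return [[[sum(w[ic][oc][k] * x[ic][ti] for ic in range(ic_n))
              for ti in range(t_in)]
             for oc in range(oc_n)]
            for k in range(k_n)]


def col2im_1d(cols, s0, p0):
    """Overlap-add step in gather form: each output sample pulls its inputs."""
    k_n, oc_n, t_in = len(cols), len(cols[0]), len(cols[0][0])
    t_out = (t_in - 1) * s0 + k_n - 2 * p0
    y = [[0.0] * t_out for _ in range(oc_n)]
    for oc in range(oc_n):
        for t in range(t_out):
            acc = 0.0
            for k in range(k_n):
                num = t + p0 - k  # must equal ti * s0 for some valid ti
                if num >= 0 and num % s0 == 0 and num // s0 < t_in:
                    acc += cols[k][oc][num // s0]
            y[oc][t] = acc
    return y
```

For integer-valued inputs, `col2im_1d(gemm_columns(x, w), s0, p0)` and `conv_transpose_1d(x, w, s0, p0)` agree exactly. With general floating-point data the two paths sum in a different order, so small rounding differences are possible. The gather form writes each output sample exactly once, so the loop over outputs can be split freely across workers.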
fiesh
961e9a3e46
server : do not clear slots without unified KV cache (#24190)
* Always export idle slots to RAM

Without this, a slot's VRAM cache may not be written to RAM. If this
slot then happens to be busy later on, this triggers needless
preprocessing in another slot.

* cont : clean-up

---------

Co-authored-by: Christoph Weiss <weiss@wsoptics.de>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
b9574
2026-06-09 10:45:16 +03:00
Sigbjørn Skjæret
f0152efe40
models : fix plamo2 attention_key/value_length regression (#24317) b9573 2026-06-09 10:26:44 +03:00
Yash Raj Pandey
fd3271e0b4
ggml-cpu : fix rms_norm_back wrong output under in-place aliasing (#24305)
* ggml-cpu : fix rms_norm_back wrong output under in-place aliasing

* cont : clean-up comment

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
b9572
2026-06-09 10:24:27 +03:00
ravel7524
e3471b3e73
Remove case for GGML_TYPE_Q4_K in mmvq.cu (#23528) b9571 2026-06-09 07:46:23 +02:00
Reese Levine
3ac3c20c96
ggml-webgpu: Add clang-format job (#24308)
* Add clang-format job

* try local formatting
b9570
2026-06-08 20:54:24 -07:00
Masashi Yoshimura
1e1aca09da
ggml-webgpu: Improve prefill speeds for k-quants + refactor matmul for Q4/Q5/Q8 and k-quants (#24225)
* ggml-webgpu: Improve prefill speeds + refactor matmul for quants

* Fixes for editorconfig checker
2026-06-08 15:19:56 -07:00
Max Krasnyansky
7d2b45b4f7
mtp: support for gemma-4 E2B and E4B assistants (#24282)
* models: update converter to support smaller assistants

* models: add masked_embd tensors to gemma4-assist arch

* gemma-4: remove temp debug for conversion

* gemma-4-mtp: filter out masked_embedding tensors during conversion
b9568
2026-06-08 13:48:52 -07:00
Aldehir Rojas
42a0afd594
server : do not parse when flushing http headers (#24281) b9567 2026-06-08 13:32:41 -05:00
Pascal
a66d50588b
graph: guard iswa kq_mask on its own buffer (#24294)
A SWA-only draft head (e.g. StepFun MTP) leaves the base sub-cache
empty, so its kq_mask buffer stays null and asserts at load. Guard
each mask on its own buffer in set_input and can_reuse, base and swa.

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
b9566
2026-06-08 19:20:28 +02:00
Nikhil Jain
1705d434f6
[ggml-webgpu] Handle buffer overlap / buffer aliasing for concat operator (#24000)
* Only run webgpu CI on my fork

* Add webgpu only workflow

* handle buffer overlap case for concat operator

* restore build-webgpu.yml

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Run clang-format

* Update ggml/src/ggml-webgpu/wgsl-shaders/concat.wgsl

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Reese Levine <reeselevine1@gmail.com>
b9565
2026-06-08 08:07:31 -07:00
Nikhil Jain
3b3da01dc2
[ggml-webgpu] Implement 2D workgroups for scale, binary, and unary ops (#24044)
* Only run webgpu CI on my fork

* Add webgpu only workflow

* Implement 2d workgroups for more operations

* fix

* Fix type

* Move back to global_invocation_id
b9564
2026-06-08 08:07:15 -07:00
Xuan-Son Nguyen
3ebe862b5d
docker: install ffmpeg in the released image (#24302) b9563 2026-06-08 16:59:57 +02:00