402 Commits

Author SHA1 Message Date
Marian M.
5fb707d19b
Update docs (#1956)
* Update README.md

Models, MTP, fit

* Update parameters.md

Disclaimer, terms, new flags, graph split list.
2026-06-12 08:24:22 +02:00
Marian M.
b2e7f7f6cd
Update docs (#1800)
* Update README.md

- New model
- New features

* Update parameters.md

- Recent new parameters
2026-05-14 08:44:58 +03:00
Andrew Moryakov
45dfd80371
readme : link "Build for CPU" to AVX-512 build flags reference (#1735)
Adds a short note in README's "Build for CPU" section pointing to the
AVX-512 build flags reference in docs/build.md (added by #1729).

The vanilla `cmake -B build -DGGML_NATIVE=ON` example shown right above
silently falls back to the AVX2 path on AMD Zen4 / Intel Sapphire
Rapids+ hardware; users hitting "my Zen4 build is slow" tend to look at
the README first, so a single-paragraph cross-reference here saves them
from having to dig through docs/ to find the right knob.

No content moved — README still has its own short example, the new
paragraph just points at the deeper reference.
2026-05-04 15:35:24 +03:00
Kawrakow
fb07c1e6e5
Update README.md 2026-04-27 11:05:30 +02:00
mcm007
5720a4131a
Update docs (#1606)
* Update parameters.md

- list sm graph architectures
- gpu tips
- build options and parameters

* Update README.md

- Gemma4
2026-04-10 18:20:28 +02:00
Kawrakow
fd71191b2a
Update README.md 2026-04-04 08:32:37 +02:00
mcm007
d557d6c098
Update docs (#1574)
* Update README.md

- Model support
- KV cache improvements

* Update parameters.md

- KV Q4_0 improvements
- wgt, with notice
- mtmd-kq-type
2026-04-03 08:30:29 +02:00
mcm007
028fc79710
Update README.md and parameters docs (#1550)
* Update parameters.md with recent changes

* Update README.md with recent changes

- Hadamard for V cache
- AVX-VNNI optimizations
- Auto-fit
2026-03-29 18:52:08 +02:00
Kawrakow
b08b620c9f
Update README 2026-03-18 14:25:47 +01:00
Kawrakow
dea161f108
Update model support list in README 2026-03-18 07:34:37 +01:00
mcm007
bfef07d10b
Update README.md and parameters.md with recent improvements (#1423)
* Improve text formatting

* Update README.md with recent models and features

* Update parameters.md with recent additions

* Remove deprecated from parameters.md
2026-03-14 18:14:20 +01:00
Kawrakow
714329f4ca
Remove pre-merged up/gate notice from the README
No need for that after PRs #1408 and #1412
2026-03-12 17:29:36 +01:00
Kawrakow
fd4638f0e8
Update README with model compatibility warnings
Add warnings about models that are incompatible because they contain merged ffn_up_exps and ffn_gate_exps tensors.
2026-03-11 12:06:45 +01:00
mullecofo
f67fd9a452
Update README.md with build instructions for Windows (#1372)
* Fix compilation on clang-cl.exe

Fixes https://github.com/ikawrakow/ik_llama.cpp/issues/1169

See the bitwise arithmetic definitions here: https://clang.llvm.org/doxygen/avx512fintrin_8h_source.html

Clang (and GCC) supports a language feature called Vector Extensions.

To Clang, `__m512i` is not just a "struct" or a "bag of bits"; it is recognized by the compiler as a native vector type.
Because it is a native vector type, Clang automatically maps standard C operators to the corresponding hardware instructions.
When you write `a | b`, Clang sees that a and b are 512-bit integer vectors.
It implicitly understands that the bitwise OR operator (|) applies to these vectors.
It automatically generates the VPORQ (or VPORD) instruction without needing any helper function.
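A minimal sketch of the Vector Extensions behaviour described above, assuming GCC or Clang (`vector_size` is a compiler extension, not standard C++). Clang's own headers declare `__m512i` in essentially this way, as a 64-byte vector of `long long`; the names `v8i64` and `vec_or` are hypothetical and used only for illustration.

```cpp
// GCC/Clang only: `vector_size` is a compiler extension, not standard C++.
// 8 lanes x 64 bits = 512 bits, the same shape as __m512i.
typedef long long v8i64 __attribute__((vector_size(64)));

// No helper or overload is needed: the compiler treats v8i64 as a native
// vector type and applies `|` lane by lane. With -mavx512f this can lower
// to a single VPORQ; without it the compiler splits the work into
// narrower instructions.
inline v8i64 vec_or(v8i64 a, v8i64 b) {
    return a | b;
}
```

The same source compiles on any x86 target with GCC or Clang, which is why the `a | b` spelling works there and only breaks under MSVC.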

MSVC follows a stricter, more traditional C++ model regarding intrinsics.

In MSVC, `__m512i` is defined in the header files (`<immintrin.h>`) as a struct or union (e.g., `typedef struct __m512i { ... } __m512i`). To the MSVC compiler, it is essentially a user-defined data type, not a fundamental language primitive like `int` or `float`.
Standard C++ does not define what `|` means for a user-defined struct.
MSVC does not have the same "Vector Extensions" that automatically apply operators to these structs.
When you write `a | b` in MSVC, the compiler looks for a definition of `operator|` for the `__m512i` struct. Since the standard headers don't provide one, the compiler reports an error.
You must use the explicit intrinsic function provided by Intel/MSVC: `_mm512_or_si512(a, b)`.

To get the nice syntax `(a | b)` in MSVC, you have to manually "teach" the compiler what `|` means by defining the `operator|` overload yourself.
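The overload approach can be sketched portably. The sketch below assumes a hypothetical stand-in type `m512i_like` (a plain aggregate, like MSVC's array-based `__m512i`; the member name here is illustrative) so it compiles and runs without AVX-512 hardware or MSVC. The loop computes what `_mm512_or_si512` computes, a lane-wise bitwise OR; the MSVC-specific form is shown only in a comment.

```cpp
#include <cstdint>

// Hypothetical stand-in for MSVC's __m512i. MSVC declares the real type as a
// union of plain arrays, so the language attaches no operators to it. A plain
// struct reproduces that situation on any compiler.
struct m512i_like {
    std::int64_t m512i_i64[8];
};

// Without this overload, `a | b` on m512i_like fails to compile, just as
// `a | b` on __m512i does under MSVC. The loop mirrors the lane-wise OR
// that _mm512_or_si512 performs in a single instruction.
inline m512i_like operator|(const m512i_like& a, const m512i_like& b) {
    m512i_like r{};
    for (int i = 0; i < 8; ++i) {
        r.m512i_i64[i] = a.m512i_i64[i] | b.m512i_i64[i];
    }
    return r;
}

// With the real type, an MSVC-only overload would forward to the intrinsic.
// clang-cl also defines _MSC_VER but already supports `|` natively, hence
// the extra __clang__ check:
//
//   #if defined(_MSC_VER) && !defined(__clang__)
//   inline __m512i operator|(__m512i a, __m512i b) { return _mm512_or_si512(a, b); }
//   #endif
```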

* Update README.md with build instructions for Windows

The current README lacks any guide for Windows users, even though the build process on that platform is quite complicated.

* Update build.md with instruction about clang-cl.exe

Brings step-by-step build instruction for Windows

* Apply suggestions from code review

Co-authored-by: Kawrakow <iwankawrakow@gmail.com>

* Polish build.md for Windows usage

Added example of use for Windows

* Apply suggestions from code review

---------

Co-authored-by: Kawrakow <iwankawrakow@gmail.com>
2026-03-09 11:17:26 +01:00
firecoperana
ab1d74074b
common : introduce composable PEG parser combinators for chat parsing and new jinja template engine (#1369)
---------

Co-authored-by: Piotr Wilkin <piotr.wilkin@syndatis.com>

common : add nemotron 3 parsing (#18077)

common : add parser for ministral/mistral large 3/devstral 2 (#17713)

common : default content to an empty string (#18485)

chat: make tool description and parameters optional per OpenAI spec (#18478)

Per the OpenAI API specification, both 'description' and 'parameters'
fields in tool function definitions are optional. Previously, the parser
would throw an exception if these fields were missing.

Attempts to fix #17667

common : implement new jinja template engine (#18462)
---------

Co-authored-by: Alde Rojas <hello@alde.dev>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

jinja: correct member access rule (#18905)

jinja : fix lexing of float literals with sign (#18901)

jinja : add missing tojson filter for bool (#18900)

jinja : attribute support for join, map and sort (#18883)

jinja : fix object item order (and properly implement dictsort) (#18904)

tests : add test-jinja -py option for cross-checking (#18906)

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

ci : run test-jinja -py on high perf [no ci] (#18916)

jinja : fix undefined keys and attributes and int/float as bool (#18924)

jinja: support none|string (#18995)

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

jinja : implement mixed type object keys (#18955)

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>

jinja : undefined should be treated as sequence/iterable (return string/array) by filters/tests (#19147)

`tojson` is not a supported `undefined` filter

keep it DRY and fix some types

jinja : do not pass empty tools and add some none filters (#19176)

jinja : add unordered_map include to value.h [no ci] (#19205)

jinja : add missing 'in' test to template engine (#19004) (#19239)

The jinja template parser was missing the 'in' test from
global_builtins(), causing templates using reject("in", ...),
select("in", ...), or 'x is in(y)' to fail with
"selectattr: unknown test 'in'".

This broke tool-calling for Qwen3-Coder and any other model
whose chat template uses the 'in' test.

Added test_is_in supporting array, string, and object containment
checks, mirroring the existing 'in' operator logic in runtime.cpp.

Includes test cases for all three containment types plus
reject/select filter usage.
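The three containment cases described above (array element, substring, object key) can be sketched as follows. This is a hypothetical, heavily simplified value model, not the engine's real value type or its actual `test_is_in`: the needle is restricted to strings for brevity, and `reject_in` is an illustrative helper mirroring `reject("in", ...)`.

```cpp
#include <algorithm>
#include <map>
#include <string>
#include <variant>
#include <vector>

// Hypothetical simplified value model: string, array of strings, or object.
using JArray  = std::vector<std::string>;
using JObject = std::map<std::string, std::string>;
using JValue  = std::variant<std::string, JArray, JObject>;

// Sketch of the 'in' test: `needle is in(haystack)`.
bool is_in(const std::string& needle, const JValue& haystack) {
    if (const auto* s = std::get_if<std::string>(&haystack)) {
        return s->find(needle) != std::string::npos;                 // substring
    }
    if (const auto* a = std::get_if<JArray>(&haystack)) {
        return std::find(a->begin(), a->end(), needle) != a->end();  // element
    }
    const auto& o = std::get<JObject>(haystack);
    return o.count(needle) > 0;                                      // key
}

// Sketch of reject("in", haystack): keep items for which the test is false.
JArray reject_in(const JArray& items, const JValue& haystack) {
    JArray out;
    for (const auto& x : items) {
        if (!is_in(x, haystack)) out.push_back(x);
    }
    return out;
}
```

`select("in", ...)` would be the same loop with the condition inverted.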

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

---------

Co-authored-by: Sid Mohan <sidmohan0@users.noreply.github.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>

Add Jinja support for "indent" string filter (#19529)

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

add vendor

refactor chat

server : support preserving reasoning_content in assistant message (#18994)

chat : fix translategemma crash on common_chat_format_example (#19019)

chat: fix language input for translategemma (#19052)

Co-authored-by: Aldehir Rojas <hello@alde.dev>

---------

Co-authored-by: Aldehir Rojas <hello@alde.dev>

chat: fix case where template accepts type content only (#19419)

mtmd : chat : Fix extra \n between text and media marker (#19595)

Thanks to @tugot17 for detecting and reporting the issue.

For vision models (e.g. LFM2.5-VL-1.6B and Qwen/Qwen3-VL-4B-Instruct), `llama-mtmd-cli` produces output identical to the HF implementation.

However `llama-server` doesn't. I traced it down to extra newline
inserted after `<__media__>`.

This happens in `to_json_oaicompat`, that treats media markers as text
and joins all parts with `\n` separator.

PR introduces new type `media_marker` and uses it for media markers.
Extra logic is added to prevent insertion of newlines before and after
media markers.

With this change, the number of input tokens is identical to the HF
implementation, and as a result the output is also identical.
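The joining rule described above can be sketched as a small function. The `Part`/`PartType` types and `join_parts` are hypothetical illustrations, not the actual `to_json_oaicompat` code; the only behaviour taken from the description is that consecutive text parts are separated by `\n` while no newline is inserted before or after a media marker.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical content-part model; the PR adds a dedicated media_marker type.
enum class PartType { text, media_marker };
struct Part {
    PartType type;
    std::string value;
};

std::string join_parts(const std::vector<Part>& parts) {
    std::string out;
    for (std::size_t i = 0; i < parts.size(); ++i) {
        // Separate two consecutive text parts with '\n', but never insert
        // a newline before or after a media marker.
        if (i > 0 && parts[i - 1].type == PartType::text &&
            parts[i].type == PartType::text) {
            out += '\n';
        }
        out += parts[i].value;
    }
    return out;
}
```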

I explored other ways to address the issue
* remove completely `\n` between text parts in `to_json_oaicompat`
* merge text messages in server-common.cpp before sending them to `to_json_oaicompat`

Please propose alternative ways of fixing this issue.

Co-authored-by: Piotr Wilkin (ilintar) <piotr.wilkin@syndatis.com>

---------

Co-authored-by: Piotr Wilkin (ilintar) <piotr.wilkin@syndatis.com>

common : merge qwen3-coder and nemotron nano 3 parsers (#19765)

common : fix improper trimming in XML parser on complete message (#19805)

Co-authored-by: Jules LEIDELINGER <11395311+julio75012@users.noreply.github.com>

jinja: correct stats for tojson and string filters (#19785)

jinja : correct default size for string slices (#19913)

common : handle unicode during partial json parsing (#16526)

common : fix json schema with '\' in literals (#17307)

add back qwen_coder_xml and mirothinker

Co-authored-by: Aldehir Rojas <hello@alde.dev>
2026-03-09 11:03:33 +01:00
Kawrakow
542988773c
Update README with backend support notes
Clarify backend support and usage of quantized models in README.
2026-03-09 07:36:08 +01:00
Kawrakow
702e0765b8
Update README with clarification on '_XL' models
Clarified warning about Unsloth '_XL' models in README.
2026-02-27 16:22:10 +01:00
Kawrakow
cbf7fc7e2f
Update README with warning about '_XL' models from Unsloth
Added important note regarding quantized models from Unsloth.
2026-02-22 07:42:17 +01:00
mcm007
b2cb4512c5
Create parameters overview (#1269)
* raw parameters.md

* fix small typos in common.cpp

* Update build args in parameters.md

* Update parameters.md

- format as table
- sections

* Update README.md

- quickstart
- build and run

* Update parameters.md

other tools examples

* add PR links

* multiple updates to parameters.md

- description
- add jargon section
- add suggestions from feedbacks

* don't imply that only linux is supported in README.md

* add alias to parameters.md

* Update README.md with recent models and features

* Update parameters.md with latest features

* address suggestions

- no-ooae
- placeholder for common commands
- no-kv-offload
- llama-sweep-bench
- placeholder for unique parameters

* specify Linux distro in README.md
2026-02-20 07:20:56 +01:00
mcm007
f5fe33b7a9
Update README.md (#1263)
* Update README.md

Add new models and a few of the features, quants and improvements

* Update README.md

ministral3 and split mode "graph"
2026-02-14 09:02:33 +01:00
mcm007
dbcbfdb0ef
Ik llama swap in container step by step guide (#1249)
* Create README.md

* Add container files and llama-swap configs

* Update main README.md

* Build without GGML_IQK_FA_ALL_QUANTS

Otherwise the build fails with CUDA_DOCKER_ARCH=default

* Mention GGML_IQK_FA_ALL_QUANTS usage

* First step more explicit
2026-02-07 18:30:19 +02:00
Kawrakow
0486b5ad93 Update README.md 2025-07-23 19:38:54 +02:00
Anton Sokolchenko
9ee72225dc Function calling support for Kimi-K2 (#628)
* Implement function calling / tools for ik_llama.cpp for Kimi K2

* Implement basic tool choice

* Backport llama.cpp tool calls support

* Enhance function calls with improved chat parser and string utilities

- Add new chat.h/chat.cpp and chat-parser.h/chat-parser.cpp for better chat handling
- Improve function calls parsing with fallback to llama.cpp builder pattern
- Add string utility functions (starts_with, ends_with, find_partial_stop)
- Update README with function calls testing instructions
- Enhance Kimi K2 parser and function calls documentation
- Add comprehensive test suite for function calls
- Update CMakeLists.txt and Makefile for new components

* Enhance function calling with unified streaming and parser improvements

- Fix streaming content cleanup to prevent function syntax in output
- Unify content extraction patterns with llama.cpp approach
- Improve Kimi K2 parser robustness and partial content handling
- Add comprehensive test coverage for function call scenarios
- Optimize chat message parsing and diff computation

* Replace hardcoded values in kimi_k2_parser.hpp with named constants

- Add compile-time constants for all token format markers
- Add compile-time constants for XML format markers
- Add compile-time constants for simple format patterns
- Replace all hardcoded string literals with named constants
- Use compile-time length calculation to avoid manual counting
- Improve maintainability and reduce magic numbers throughout parser
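The named-constant approach above can be sketched with `constexpr std::string_view`, whose `size()` is evaluated at compile time. The marker spellings and the `kimi_markers`/`skip_marker` names are assumptions for illustration; the actual constants in `kimi_k2_parser.hpp` may be named and spelled differently.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical marker constants (spellings assumed for illustration).
namespace kimi_markers {
inline constexpr std::string_view kSectionBegin = "<|tool_calls_section_begin|>";
inline constexpr std::string_view kSectionEnd   = "<|tool_calls_section_end|>";
}  // namespace kimi_markers

// Length is derived by the compiler instead of hand-counted.
static_assert(kimi_markers::kSectionBegin.size() == 28);

// Skips a leading marker if present, using marker.size() rather than a
// magic number such as s.substr(28).
constexpr std::string_view skip_marker(std::string_view s, std::string_view marker) {
    return s.substr(0, marker.size()) == marker ? s.substr(marker.size()) : s;
}
```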

* Fix duplicate common_chat_parse definition

- Remove duplicate implementation from chat-parser.cpp
- Keep single implementation in chat.cpp following llama.cpp patterns
- Resolves linker error: multiple definition of common_chat_parse

* Fix JSON assertion failure in function call parsing

- Add proper validation that 'function' field is an object before accessing nested keys
- Handle missing 'arguments' field gracefully with default "{}"
- Prevents crash when parsing malformed tool call JSON structures

* Add comprehensive Qwen3 XML tool calling support with unit tests

- Implement Qwen3 XML parser with <tool_call>{"name": "func", "arguments": {...}}</tool_call> format
- Add model detection and routing for Qwen3 vs Kimi-K2 formats
- Create 8 comprehensive unit tests covering parsing, streaming, error handling
- Fix token format cleaning bug in kimi_k2_parser.hpp processing order
- Remove progressive parsing code and related utilities
- Add tool injection support for Qwen3 format in server utils
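A minimal sketch of locating Qwen3-style `<tool_call>...</tool_call>` blocks, assuming the payload between the tags is returned raw and JSON parsing happens afterwards. `extract_tool_calls` is a hypothetical helper, not the parser added in this PR; it stops at an unterminated block, whereas a streaming parser would wait for more input instead.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Collects the raw payload of every <tool_call>...</tool_call> block.
std::vector<std::string> extract_tool_calls(std::string_view text) {
    constexpr std::string_view open  = "<tool_call>";
    constexpr std::string_view close = "</tool_call>";
    std::vector<std::string> calls;
    std::size_t pos = 0;
    while ((pos = text.find(open, pos)) != std::string_view::npos) {
        const std::size_t start = pos + open.size();
        const std::size_t end = text.find(close, start);
        if (end == std::string_view::npos) break;  // unterminated block
        calls.emplace_back(text.substr(start, end - start));
        pos = end + close.size();
    }
    return calls;
}
```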

* Add DeepSeek R1 function calling support with comprehensive unit tests

- Implement complete DeepSeek R1 tool call parsing in common_chat_parser.cpp
- Add DeepSeek R1 model detection and tool injection in deepseek_r1_tools.hpp
- Update function_calls.hpp with DeepSeek R1 integration and content extraction
- Update documentation to reflect support for Kimi-K2, Qwen3, and DeepSeek R1 models
- Add comprehensive unit tests for DeepSeek R1 reasoning, tool calls, and integration
- Port exact implementation patterns from original llama.cpp for compatibility

Key features:
- Native DeepSeek R1 format: <|tool▁calls▁begin|>function<|tool▁sep|>name```json{}```<|tool▁call▁end|><|tool▁calls▁end|>
- Reasoning content extraction from <think>...</think> tags
- Multiple tool calls support with separate call blocks
- Model detection for deepseek-r1, deepseek_r1 naming patterns
- Integration with incremental parsing and streaming support
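Reasoning extraction from `<think>...</think>` tags can be sketched as a split into reasoning and visible content. `SplitMessage` and `split_think` are hypothetical names; treating an unclosed `<think>` as reasoning-in-progress is an assumption chosen to suit streaming, not necessarily what the ported implementation does.

```cpp
#include <string>
#include <string_view>

struct SplitMessage {
    std::string reasoning;
    std::string content;
};

// Text inside the first <think>...</think> pair becomes `reasoning`;
// everything outside it becomes `content`.
SplitMessage split_think(std::string_view text) {
    constexpr std::string_view open  = "<think>";
    constexpr std::string_view close = "</think>";
    const auto b = text.find(open);
    if (b == std::string_view::npos) return {"", std::string(text)};
    const auto body = b + open.size();
    const auto e = text.find(close, body);
    if (e == std::string_view::npos) {
        // Assumed streaming behaviour: an unclosed block is reasoning so far.
        return {std::string(text.substr(body)), std::string(text.substr(0, b))};
    }
    return {std::string(text.substr(body, e - body)),
            std::string(text.substr(0, b)) + std::string(text.substr(e + close.size()))};
}
```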

* Add partial parsing support for JSON and regex

- json-partial.h/cpp: JSON partial parsing functionality
- regex-partial.h/cpp: Regex partial parsing functionality

* Add format_chat integration tests for Qwen3 tool injection

- Add test_qwen3_format_chat_integration() to validate tool injection pipeline
- Test tool injection conditions and system message enhancement
- Verify JSON formatting and anti-preamble instructions
- Add comprehensive test documentation

Tests confirm tool injection works correctly - conversational preamble
issue is not in ik_llama.cpp but likely in UI configuration.

* Fix Qwen3 tool call parsing - pass model name to parser

Server was not passing model name to parse_chat_message_incremental(),
causing Qwen3 to fall back to Kimi-K2 parser and return tool calls
as content instead of proper tool_calls array.

* Fix non-streaming path to use model-specific parsing

Non-streaming responses were hardcoded to use Kimi-K2 format,
causing Qwen3 XML tool calls to be returned as content instead
of proper tool_calls array. Now uses same model detection as
streaming path for consistency.
2025-07-23 18:11:42 +02:00
Kawrakow
9513222ba5 Revert "Update README.md"
This reverts commit b48d71fec834c540fcd4c3b83a8c998aaf670b9a.
2025-07-22 15:22:46 +03:00
Kawrakow
c3cd543d77 Update README.md 2025-07-22 09:01:59 +02:00
saood06
638fb80e8a Minor readme update (#535)
* Condense CUDA implementations.

* move thing

* move thing

* move thing fix
2025-06-19 10:18:39 +03:00
saood06
ed868d928c Update News section of readme (#510)
* Convert existing News to new format

* Update with new ones

* Add more links and minor fix

* more minor fixes

* requested changes

* Add old PRs

* Add more old PRs

* Add all IQK quants
2025-06-13 07:56:40 +03:00
Kawrakow
537f72f9cc Update README.md 2025-05-12 15:48:37 +03:00
Kawrakow
b64cb29713 Update README.md
@saood06 Thanks!
2025-05-09 11:16:36 +03:00
Kawrakow
957a6e7911 Update README.md 2025-05-09 10:13:25 +03:00
Kawrakow
828758ec0d Update README.md 2025-05-07 18:59:01 +03:00
Kawrakow
6e7b28f7b0 Update README.md 2025-05-06 08:48:11 +03:00
Kawrakow
db0ed280f1 Update README.md 2025-05-04 12:06:47 +03:00
Kawrakow
7cb99f8078 Update README.md 2025-05-04 11:49:29 +03:00
Kawrakow
9303df7450 Update README.md (#352)
* Update README.md

* Edits

* Updates

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-30 15:11:29 +02:00
Ikko Eltociear Ashimine
79db2e243f docs: update README.md (#304) 2025-04-01 21:30:25 +02:00
Kawrakow
25ade24526 Update README.md 2024-08-12 15:16:00 +02:00
Kawrakow
74f2f50abf Update README.md
There have been a few minor improvements here and there, so I updated the AVX2 Bitnet performance values to match the current main branch.
2024-08-05 07:35:30 +02:00
Kawrakow
a14a9426ec Offload Bitnet token embeddings to the GPU (#1)
* bitnet: put token embeddings on the GPU

* Update README with the new CUDA/Metal performance

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-07-26 09:41:04 +02:00
Kawrakow
5626b09e4b Update README.md 2024-07-24 19:55:06 +02:00
Kawrakow
ddaae42194 Update README.md
Trying to avoid line breaks in table
2024-07-24 19:44:52 +02:00
Kawrakow
914b7ef460 Update README.md 2024-07-24 19:20:46 +02:00
Kawrakow
28b4229295 Correct spelling in README 2024-07-24 19:22:43 +03:00
Kawrakow
b84d0c1744 Update README.md
Adding some more details
2024-07-24 17:38:37 +02:00
Kawrakow
de43999de5 Update README.md
Adding MoE and Bitnet performance tables
2024-07-24 16:49:00 +02:00
Kawrakow
cd77618324 Update README.md
I hate it when tables look fine in the Preview but then end up with columns split into 2 lines when committed. That's what is happening here, so I removed the test column from the performance tables.
2024-07-24 11:18:50 +02:00
Kawrakow
4bb58ea8f8 Update README.md
Added performance comparison tables
2024-07-24 11:01:16 +02:00
Kawrakow
847588cc92 Update README.md 2024-07-23 18:05:05 +02:00
Kawrakow
97680f602c Update README.md 2024-07-23 12:23:06 +02:00
Abheek Gulati
d406a5fb51 readme : update UI list (#7943) 2024-06-18 09:57:41 +03:00