ik_llama.cpp

mirror of https://github.com/ikawrakow/ik_llama.cpp.git synced 2026-06-28 04:30:15 -05:00

Author	SHA1	Message	Date
Marian M.	5fb707d19b	Update docs (#1956) * Update README.md Models, MTP, fit * Update parameters.md Disclaimer, terms, new flags, graph split list.	2026-06-12 08:24:22 +02:00
Marian M.	b2e7f7f6cd	Update docs (#1800) * Update README.md - New model - New features * Update parameters.md - Recent new parameters	2026-05-14 08:44:58 +03:00
Andrew Moryakov	45dfd80371	readme : link "Build for CPU" to AVX-512 build flags reference (#1735) Adds a short note in README's "Build for CPU" section pointing to the AVX-512 build flags reference in docs/build.md (added by #1729). The vanilla `cmake -B build -DGGML_NATIVE=ON` example shown right above silently falls back to the AVX2 path on AMD Zen4 / Intel Sapphire Rapids+ hardware; users hitting "my Zen4 build is slow" tend to look at the README first, so a single-paragraph cross-reference here saves them from having to dig through docs/ to find the right knob. No content moved: README still has its own short example, the new paragraph just points at the deeper reference.	2026-05-04 15:35:24 +03:00
Kawrakow	fb07c1e6e5	Update README.md	2026-04-27 11:05:30 +02:00
mcm007	5720a4131a	Update docs (#1606) * Update parameters.md - list sm graph architectures - gpu tips - build options and parameters * Update README.md - Gemma4	2026-04-10 18:20:28 +02:00
Kawrakow	fd71191b2a	Update README.md	2026-04-04 08:32:37 +02:00
mcm007	d557d6c098	Update docs (#1574) * Update README.md - Model support - KV cache improvements * Update parameters.md - KV Q4_0 improvements - wgt, with notice - mtmd-kq-type	2026-04-03 08:30:29 +02:00
mcm007	028fc79710	Update README.md and parameters docs (#1550) * Update parameters.md with recent changes * Update README.md with recent changes - Hadamard for V cache - AVX-VNNI optimizations - Auto-fit	2026-03-29 18:52:08 +02:00
Kawrakow	b08b620c9f	Update README	2026-03-18 14:25:47 +01:00
Kawrakow	dea161f108	Update model support list in README	2026-03-18 07:34:37 +01:00
mcm007	bfef07d10b	Update README.md and parameters.md with recent improvements (#1423) * Improve text formatting * Update README.md with recent models and features * Update parameters.md with recent additions * Remove deprecated from parameters.md	2026-03-14 18:14:20 +01:00
Kawrakow	714329f4ca	Remove pre-merged up/gate notice from the README No need for that after PRs #1408 and #1412	2026-03-12 17:29:36 +01:00
Kawrakow	fd4638f0e8	Update README with model compatibility warnings Add warnings about incompatible models with merged ffn_up_exps and ffn_gate_exps tensors.	2026-03-11 12:06:45 +01:00
mullecofo	f67fd9a452	Update README.md with build instructions for Windows (#1372) * Fix compilation on clang-cl.exe Fixes https://github.com/ikawrakow/ik_llama.cpp/issues/1169 See bitwise arithmetic here: https://clang.llvm.org/doxygen/avx512fintrin_8h_source.html Clang (and GCC) support a language feature called Vector Extensions. To Clang, `__m512i` is not just a "struct" or a "bag of bits"; it is recognized by the compiler as a native vector type. Because it is a native vector type, Clang automatically maps standard C operators to the corresponding hardware instructions. When you write `a | b`, Clang sees that a and b are 512-bit integer vectors. It implicitly understands that the bitwise OR operator (|) applies to these vectors. It automatically generates the VPORQ (or VPORD) instruction without needing any helper function. MSVC follows a stricter, more traditional C++ model regarding intrinsics. In MSVC, __m512i is defined in the header files (<immintrin.h>) as a struct or union (e.g., typedef struct __m512i { ... } __m512i). To the MSVC compiler, it is essentially a user-defined data type, not a fundamental language primitive like int or float. Standard C++ does not define what `|` means for a user-defined struct. MSVC does not have the same "Vector Extensions" that automatically apply operators to these structs. When you write `a | b` in MSVC, the compiler looks for a definition of `operator|` for the __m512i struct. Since the standard headers don't provide one, the compiler throws an error. You must use the explicit intrinsic function provided by Intel/MSVC: _mm512_or_si512(a, b). To get the nice syntax `(a | b)` in MSVC, you have to manually "teach" the compiler what `|` means by defining the `operator|` overload yourself. 
* Update README.md with build instructions for Windows The current README lacks any guide for Windows users, although the build process on that platform is quite complicated * Update build.md with instructions about clang-cl.exe Brings step-by-step build instructions for Windows * Apply suggestions from code review Co-authored-by: Kawrakow <iwankawrakow@gmail.com> * Polish build.md for Windows usage Added a usage example for Windows * Apply suggestions from code review --------- Co-authored-by: Kawrakow <iwankawrakow@gmail.com>	2026-03-09 11:17:26 +01:00
firecoperana	ab1d74074b	common : introduce composable PEG parser combinators for chat parsing and new jinja template engine (#1369) --------- Co-authored-by: Piotr Wilkin <piotr.wilkin@syndatis.com> common : add nemotron 3 parsing (#18077) common : add parser for ministral/mistral large 3/devstral 2 (#17713) common : default content to an empty string (#18485) chat: make tool description and parameters optional per OpenAI spec (#18478) Per the OpenAI API specification, both 'description' and 'parameters' fields in tool function definitions are optional. Previously, the parser would throw an exception if these fields were missing. Attempts to fix #17667 common : implement new jinja template engine (#18462) --------- Co-authored-by: Alde Rojas <hello@alde.dev> Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> jinja: correct member access rule (#18905) jinja : fix lexing of float literals with sign (#18901) jinja : add missing tojson filter for bool (#18900) jinja : attribute support for join, map and sort (#18883) jinja : fix object item order (and properly implement dictsort) (#18904) tests : add test-jinja -py option for cross-checking (#18906) Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> ci : run test-jinja -py on high perf [no ci] (#18916) jinja : fix undefined keys and attributes and int/float as bool (#18924) jinja: support none|string (#18995) Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> jinja : implement mixed type object keys (#18955) --------- Co-authored-by: Xuan Son Nguyen <son@huggingface.co> jinja : undefined should be treated as sequence/iterable (return string/array) by filters/tests (#19147) `tojson` is not a supported `undefined` filter keep 
it DRY and fix some types jinja : do not pass empty tools and add some none filters (#19176) jinja : add unordered_map include to value.h [no ci] (#19205) jinja : add missing 'in' test to template engine (#19004) (#19239) The jinja template parser was missing the 'in' test from global_builtins(), causing templates using reject("in", ...), select("in", ...), or 'x is in(y)' to fail with "selectattr: unknown test 'in'". This broke tool-calling for Qwen3-Coder and any other model whose chat template uses the 'in' test. Added test_is_in supporting array, string, and object containment checks, mirroring the existing 'in' operator logic in runtime.cpp. Includes test cases for all three containment types plus reject/select filter usage. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> --------- Co-authored-by: Sid Mohan <sidmohan0@users.noreply.github.com> Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com> Co-authored-by: Xuan Son Nguyen <son@huggingface.co> Add Jinja support for "indent" string filter (#19529) Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> add vendor refactor chat server : support preserving reasoning_content in assistant message (#18994) chat : fix translategemma crash on common_chat_format_example (#19019) chat: fix language input for translategemma (#19052) Co-authored-by: Aldehir Rojas <hello@alde.dev> --------- Co-authored-by: Aldehir Rojas <hello@alde.dev> chat: fix case where template accepts type content only (#19419) mtmd : chat : Fix extra \n between text and media marker (#19595) Thanks to @tugot17 for detecting and reporting the issue. For vision models (e.g. LFM2.5-VL-1.6B and Qwen/Qwen3-VL-4B-Instruct) `llama-mtmd-cli` produces identical output to HF implementation. However `llama-server` doesn't. I traced it down to extra newline inserted after `<__media__>`. 
This happens in `to_json_oaicompat`, that treats media markers as text and joins all parts with `\n` separator. PR introduces new type `media_marker` and uses it for media markers. Extra logic is added to prevent insertion of newlines before and after media markers. With this change number of input tokens is identical to HF implementation and as a result the output is also identical. I explored other ways to address the issue * remove completely `\n` between text parts in `to_json_oaicompat` * merge text messages in server-common.cpp before sending them to `to_json_oaicompat` Please propose alternative ways of fixing this issue. Co-authored-by: Piotr Wilkin (ilintar) <piotr.wilkin@syndatis.com> --------- Co-authored-by: Piotr Wilkin (ilintar) <piotr.wilkin@syndatis.com> common : merge qwen3-coder and nemotron nano 3 parsers (#19765) common : fix improper trimming in XML parser on complete message (#19805) Co-authored-by: Jules LEIDELINGER <11395311+julio75012@users.noreply.github.com> jinja: correct stats for tojson and string filters (#19785) jinja : correct default size for string slices (#19913) common : handle unicode during partial json parsing (#16526) common : fix json schema with '\' in literals (#17307) add back qwen_coder_xml and mirothinker Co-authored-by: Aldehir Rojas <hello@alde.dev>	2026-03-09 11:03:33 +01:00
Kawrakow	542988773c	Update README with backend support notes Clarify backend support and usage of quantized models in README.	2026-03-09 07:36:08 +01:00
Kawrakow	702e0765b8	Update README with clarification on '_XL' models Clarified warning about Unsloth '_XL' models in README.	2026-02-27 16:22:10 +01:00
Kawrakow	cbf7fc7e2f	Update README with warning about '_XL' models from Unsloth Added important note regarding quantized models from Unsloth.	2026-02-22 07:42:17 +01:00
mcm007	b2cb4512c5	Create parameters overview (#1269) * raw parameters.md * fix small typos in common.cpp * Update build args in parameters.md * Update parameters.md - format as table - sections * Update README.md - quickstart - build and run * Update parameters.md other tools examples * add PR links * multiple updates to parameters.md - description - add jargon section - add suggestions from feedback * don't imply that only linux is supported in README.md * add alias to parameters.md * Update README.md with recent models and features * Update parameters.md with latest features * address suggestions - no-ooae - placeholder for common commands - no-kv-offload - llama-sweep-bench - placeholder for unique parameters * specify Linux distro in README.md	2026-02-20 07:20:56 +01:00
mcm007	f5fe33b7a9	Update README.md (#1263) * Update README.md Add new models and a few of the features, quants and improvements * Update README.md ministral3 and split mode "graph"	2026-02-14 09:02:33 +01:00
mcm007	dbcbfdb0ef	Ik llama swap in container step by step guide (#1249) * Create README.md * Add container files and llama-swap configs * Update main README.md * Build without GGML_IQK_FA_ALL_QUANTS Otherwise fails with CUDA_DOCKER_ARCH=default * Mention GGML_IQK_FA_ALL_QUANTS usage * First step more explicit	2026-02-07 18:30:19 +02:00
Kawrakow	0486b5ad93	Update README.md	2025-07-23 19:38:54 +02:00
Anton Sokolchenko	9ee72225dc	Function calling support for Kimi-K2 (#628) * Implement function calling / tools for ik_llama.cpp for Kimi K2 * Implement basic tool choice * Backport llama.cpp tool calls support * Enhance function calls with improved chat parser and string utilities - Add new chat.h/chat.cpp and chat-parser.h/chat-parser.cpp for better chat handling - Improve function calls parsing with fallback to llama.cpp builder pattern - Add string utility functions (starts_with, ends_with, find_partial_stop) - Update README with function calls testing instructions - Enhance Kimi K2 parser and function calls documentation - Add comprehensive test suite for function calls - Update CMakeLists.txt and Makefile for new components * Enhance function calling with unified streaming and parser improvements - Fix streaming content cleanup to prevent function syntax in output - Unify content extraction patterns with llama.cpp approach - Improve Kimi K2 parser robustness and partial content handling - Add comprehensive test coverage for function call scenarios - Optimize chat message parsing and diff computation * Replace hardcoded values in kimi_k2_parser.hpp with named constants - Add compile-time constants for all token format markers - Add compile-time constants for XML format markers - Add compile-time constants for simple format patterns - Replace all hardcoded string literals with named constants - Use compile-time length calculation to avoid manual counting - Improve maintainability and reduce magic numbers throughout parser * Fix duplicate common_chat_parse definition - Remove duplicate implementation from chat-parser.cpp - Keep single implementation in chat.cpp following llama.cpp patterns - Resolves linker error: multiple definition of common_chat_parse * Fix JSON assertion failure in function call parsing - Add proper validation that 'function' field is an object before accessing nested keys - Handle missing 'arguments' field gracefully with default "{}" - 
Prevents crash when parsing malformed tool call JSON structures * Add comprehensive Qwen3 XML tool calling support with unit tests - Implement Qwen3 XML parser with <tool_call>{"name": "func", "arguments": {...}}</tool_call> format - Add model detection and routing for Qwen3 vs Kimi-K2 formats - Create 8 comprehensive unit tests covering parsing, streaming, error handling - Fix token format cleaning bug in kimi_k2_parser.hpp processing order - Remove progressive parsing code and related utilities - Add tool injection support for Qwen3 format in server utils * Add DeepSeek R1 function calling support with comprehensive unit tests - Implement complete DeepSeek R1 tool call parsing in common_chat_parser.cpp - Add DeepSeek R1 model detection and tool injection in deepseek_r1_tools.hpp - Update function_calls.hpp with DeepSeek R1 integration and content extraction - Update documentation to reflect support for Kimi-K2, Qwen3, and DeepSeek R1 models - Add comprehensive unit tests for DeepSeek R1 reasoning, tool calls, and integration - Port exact implementation patterns from original llama.cpp for compatibility Key features: - Native DeepSeek R1 format: <｜tool▁calls▁begin｜>function<｜tool▁sep｜>name```json{}```<｜tool▁call▁end｜><｜tool▁calls▁end｜> - Reasoning content extraction from <think>...</think> tags - Multiple tool calls support with separate call blocks - Model detection for deepseek-r1, deepseek_r1 naming patterns - Integration with incremental parsing and streaming support * Add partial parsing support for JSON and regex - json-partial.h/cpp: JSON partial parsing functionality - regex-partial.h/cpp: Regex partial parsing functionality * Add format_chat integration tests for Qwen3 tool injection - Add test_qwen3_format_chat_integration() to validate tool injection pipeline - Test tool injection conditions and system message enhancement - Verify JSON formatting and anti-preamble instructions - Add comprehensive test documentation Tests confirm tool injection works 
correctly - conversational preamble issue is not in ik_llama.cpp but likely in UI configuration. * Fix Qwen3 tool call parsing - pass model name to parser Server was not passing model name to parse_chat_message_incremental(), causing Qwen3 to fall back to Kimi-K2 parser and return tool calls as content instead of proper tool_calls array. * Fix non-streaming path to use model-specific parsing Non-streaming responses were hardcoded to use Kimi-K2 format, causing Qwen3 XML tool calls to be returned as content instead of proper tool_calls array. Now uses same model detection as streaming path for consistency.	2025-07-23 18:11:42 +02:00
Kawrakow	9513222ba5	Revert "Update README.md" This reverts commit b48d71fec834c540fcd4c3b83a8c998aaf670b9a.	2025-07-22 15:22:46 +03:00
Kawrakow	c3cd543d77	Update README.md	2025-07-22 09:01:59 +02:00
saood06	638fb80e8a	Minor readme update (#535) * Condense CUDA implementations. * move thing * move thing * move thing fix	2025-06-19 10:18:39 +03:00
saood06	ed868d928c	Update News section of readme (#510) * Convert existing News to new format * Update with new ones * Add more links and minor fix * more minor fixes * requested changes * Add old PRs * Add more old PRs * Add all IQK quants	2025-06-13 07:56:40 +03:00
Kawrakow	537f72f9cc	Update README.md	2025-05-12 15:48:37 +03:00
Kawrakow	b64cb29713	Update README.md @saood06 Thanks!	2025-05-09 11:16:36 +03:00
Kawrakow	957a6e7911	Update README.md	2025-05-09 10:13:25 +03:00
Kawrakow	828758ec0d	Update README.md	2025-05-07 18:59:01 +03:00
Kawrakow	6e7b28f7b0	Update README.md	2025-05-06 08:48:11 +03:00
Kawrakow	db0ed280f1	Update README.md	2025-05-04 12:06:47 +03:00
Kawrakow	7cb99f8078	Update README.md	2025-05-04 11:49:29 +03:00
Kawrakow	9303df7450	Update README.md (#352) * Update README.md * Edits * Updates --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-04-30 15:11:29 +02:00
Ikko Eltociear Ashimine	79db2e243f	docs: update README.md (#304)	2025-04-01 21:30:25 +02:00
Kawrakow	25ade24526	Update README.md	2024-08-12 15:16:00 +02:00
Kawrakow	74f2f50abf	Update README.md There have been a few minor improvements here and there, so updated the AVX2 Bitnet performance values to current main branch.	2024-08-05 07:35:30 +02:00
Kawrakow	a14a9426ec	Offload Bitnet token embeddings to the GPU (#1) * bitnet: put token embeddings on the GPU * Update README with the new CUDA/Metal performance --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-07-26 09:41:04 +02:00
Kawrakow	5626b09e4b	Update README.md	2024-07-24 19:55:06 +02:00
Kawrakow	ddaae42194	Update README.md Trying to avoid line breaks in table	2024-07-24 19:44:52 +02:00
Kawrakow	914b7ef460	Update README.md	2024-07-24 19:20:46 +02:00
Kawrakow	28b4229295	Correct spelling in README	2024-07-24 19:22:43 +03:00
Kawrakow	b84d0c1744	Update README.md Adding some more details	2024-07-24 17:38:37 +02:00
Kawrakow	de43999de5	Update README.md Adding MoE and Bitnet performance tables	2024-07-24 16:49:00 +02:00
Kawrakow	cd77618324	Update README.md I hate it when tables look fine in the Preview but then end up with columns split into 2 lines when committed. That's what is happening here, so removed test column from the performance tables.	2024-07-24 11:18:50 +02:00
Kawrakow	4bb58ea8f8	Update README.md Added performance comparison tables	2024-07-24 11:01:16 +02:00
Kawrakow	847588cc92	Update README.md	2024-07-23 18:05:05 +02:00
Kawrakow	97680f602c	Update README.md	2024-07-23 12:23:06 +02:00
Abheek Gulati	d406a5fb51	readme : update UI list (#7943)	2024-06-18 09:57:41 +03:00

402 Commits