Merge remote-tracking branch 'origin/main' into feat/dflash-implementation

# Conflicts:
#	common/common.cpp
#	common/speculative.cpp
#	convert_hf_to_gguf.py
#	examples/server/server-context.cpp
#	examples/server/server-context.h
#	src/llama-arch.cpp
#	src/llama-arch.h
#	src/llama-model.cpp
#	src/llama.cpp
2026-06-28 04:30:15 -05:00 · 2026-06-13 17:27:52 -03:00
commit 3a1d46c4d1
parent 08e4590dcb 5f917a64b3
609 changed files with 75041 additions and 24366 deletions
--- a/.gitignore
+++ b/.gitignore
@@ -96,7 +96,7 @@ lcov-report/
 !/examples/sycl/*.sh

 # Server Web UI temporary files
-
+/examples/server/webui/node_modules
 /examples/server/webui_llamacpp/.svelte-kit
 /examples/server/webui_llamacpp/node_modules
 /examples/server/webui_llamacpp/build
--- a/README.md
+++ b/README.md
@@ -89,7 +89,7 @@ That's all! Open [http://127.0.0.1:8080](http://127.0.0.1:8080) in Browser start

 ### Model Support

-LlaMA-3-Nemotron [PR 377](https://github.com/ikawrakow/ik_llama.cpp/pull/377), Qwen3 [PR 355](https://github.com/ikawrakow/ik_llama.cpp/pull/355), GLM-4 [PR 344](https://github.com/ikawrakow/ik_llama.cpp/pull/344), Command-A [PR 341](https://github.com/ikawrakow/ik_llama.cpp/pull/341), bitnet-b1.58-2B-4T [PR 337](https://github.com/ikawrakow/ik_llama.cpp/pull/337), LLaMA-4 [PR 321](https://github.com/ikawrakow/ik_llama.cpp/pull/321), Gemma3 [PR 276](https://github.com/ikawrakow/ik_llama.cpp/pull/276),  DeepSeek-V3 [PR 176](https://github.com/ikawrakow/ik_llama.cpp/pull/176), Kimi-2 [PR 609](https://github.com/ikawrakow/ik_llama.cpp/pull/609), dots.llm1 [PR 573](https://github.com/ikawrakow/ik_llama.cpp/pull/573), Hunyuan [PR 565](https://github.com/ikawrakow/ik_llama.cpp/pull/565), GLM-4.5 [PR 668](https://github.com/ikawrakow/ik_llama.cpp/pull/668) (4.5/4.6/4.7/AIR), Ernie 4.5 MOE and 0.3B [PR 759](https://github.com/ikawrakow/ik_llama.cpp/pull/759), grok-2 [PR 782](https://github.com/ikawrakow/ik_llama.cpp/pull/782), Ling/Ring (Bailing-MoE2) [PR 833](https://github.com/ikawrakow/ik_llama.cpp/pull/833), Qwen3-VL [PR 883](https://github.com/ikawrakow/ik_llama.cpp/pull/883), SmolLM3 [PR 934](https://github.com/ikawrakow/ik_llama.cpp/pull/934), GigaChat3 [PR 995](https://github.com/ikawrakow/ik_llama.cpp/pull/995), ministral3 [PR 1030](https://github.com/ikawrakow/ik_llama.cpp/pull/1030), Mimo-V2-Flash [PR 1096](https://github.com/ikawrakow/ik_llama.cpp/pull/1096), GLM-4.7-Flash [PR 1168](https://github.com/ikawrakow/ik_llama.cpp/pull/1168), Seed-OSS [PR 1218](https://github.com/ikawrakow/ik_llama.cpp/pull/1218), Step-3.5-Flash [PR 1231](https://github.com/ikawrakow/ik_llama.cpp/pull/1231), GLM-5 [PR 1268](https://github.com/ikawrakow/ik_llama.cpp/pull/1268), Qwen3-Next [PR 1266](https://github.com/ikawrakow/ik_llama.cpp/pull/1266), Qwen3.5-MoE [PR 1288](https://github.com/ikawrakow/ik_llama.cpp/pull/1288) and dense Qwen-3.5 
[1326](https://github.com/ikawrakow/ik_llama.cpp/pull/1326), Mistral 4 [PR 1450](https://github.com/ikawrakow/ik_llama.cpp/pull/1450), Bonsai 1-bit [PR 1570](https://github.com/ikawrakow/ik_llama.cpp/pull/1570), Gemma4 [PR 1581](https://github.com/ikawrakow/ik_llama.cpp/pull/1581), Mimo-2.5 [PR 1723](https://github.com/ikawrakow/ik_llama.cpp/pull/1723)
+LlaMA-3-Nemotron [PR 377](https://github.com/ikawrakow/ik_llama.cpp/pull/377), Qwen3 [PR 355](https://github.com/ikawrakow/ik_llama.cpp/pull/355), GLM-4 [PR 344](https://github.com/ikawrakow/ik_llama.cpp/pull/344), Command-A [PR 341](https://github.com/ikawrakow/ik_llama.cpp/pull/341), bitnet-b1.58-2B-4T [PR 337](https://github.com/ikawrakow/ik_llama.cpp/pull/337), LLaMA-4 [PR 321](https://github.com/ikawrakow/ik_llama.cpp/pull/321), Gemma3 [PR 276](https://github.com/ikawrakow/ik_llama.cpp/pull/276),  DeepSeek-V3 [PR 176](https://github.com/ikawrakow/ik_llama.cpp/pull/176), Kimi-2 [PR 609](https://github.com/ikawrakow/ik_llama.cpp/pull/609), dots.llm1 [PR 573](https://github.com/ikawrakow/ik_llama.cpp/pull/573), Hunyuan [PR 565](https://github.com/ikawrakow/ik_llama.cpp/pull/565), GLM-4.5 [PR 668](https://github.com/ikawrakow/ik_llama.cpp/pull/668) (4.5/4.6/4.7/AIR), Ernie 4.5 MOE and 0.3B [PR 759](https://github.com/ikawrakow/ik_llama.cpp/pull/759), grok-2 [PR 782](https://github.com/ikawrakow/ik_llama.cpp/pull/782), Ling/Ring (Bailing-MoE2) [PR 833](https://github.com/ikawrakow/ik_llama.cpp/pull/833), Qwen3-VL [PR 883](https://github.com/ikawrakow/ik_llama.cpp/pull/883), SmolLM3 [PR 934](https://github.com/ikawrakow/ik_llama.cpp/pull/934), GigaChat3 [PR 995](https://github.com/ikawrakow/ik_llama.cpp/pull/995), ministral3 [PR 1030](https://github.com/ikawrakow/ik_llama.cpp/pull/1030), Mimo-V2-Flash [PR 1096](https://github.com/ikawrakow/ik_llama.cpp/pull/1096), GLM-4.7-Flash [PR 1168](https://github.com/ikawrakow/ik_llama.cpp/pull/1168), Seed-OSS [PR 1218](https://github.com/ikawrakow/ik_llama.cpp/pull/1218), Step-3.5-Flash [PR 1231](https://github.com/ikawrakow/ik_llama.cpp/pull/1231), GLM-5 [PR 1268](https://github.com/ikawrakow/ik_llama.cpp/pull/1268), Qwen3-Next [PR 1266](https://github.com/ikawrakow/ik_llama.cpp/pull/1266), Qwen3.5-MoE [PR 1288](https://github.com/ikawrakow/ik_llama.cpp/pull/1288) and dense Qwen-3.5 
[1326](https://github.com/ikawrakow/ik_llama.cpp/pull/1326), Mistral 4 [PR 1450](https://github.com/ikawrakow/ik_llama.cpp/pull/1450), Bonsai 1-bit [PR 1570](https://github.com/ikawrakow/ik_llama.cpp/pull/1570), Gemma4 [PR 1581](https://github.com/ikawrakow/ik_llama.cpp/pull/1581), Mimo-2.5 [PR 1723](https://github.com/ikawrakow/ik_llama.cpp/pull/1723), JetBrains Mellum2 [PR 1919](https://github.com/ikawrakow/ik_llama.cpp/pull/1919), Poolside Laguna XS.2 [PR 1911](https://github.com/ikawrakow/ik_llama.cpp/pull/1911), Cohere2-MoE North Mini Code [PR 1945](https://github.com/ikawrakow/ik_llama.cpp/pull/1945)

 ### Quantization

@@ -125,6 +125,7 @@ Implemented for Zen4, AVX2, ARM_NEON, Metal, CUDA [PR 682](https://github.com/ik

 * `IQ1_M` [PR 327](https://github.com/ikawrakow/ik_llama.cpp/pull/327), `IQ2_XS` [PR 312](https://github.com/ikawrakow/ik_llama.cpp/pull/312), `Q2_K, Q4_K, Q5_K, Q4_1, Q5_1` [PR 302](https://github.com/ikawrakow/ik_llama.cpp/pull/302), `Q4_0, Q5_0, Q6_0, Q3_K, Q6_K, IQ4_XS, IQ4_NL` [PR 295](https://github.com/ikawrakow/ik_llama.cpp/pull/295)
 * Low perplexity `Q4_0` KV cache [PR 1547](https://github.com/ikawrakow/ik_llama.cpp/pull/1547) [PR 1556](https://github.com/ikawrakow/ik_llama.cpp/pull/1556)
+* MTP: option to use re-quantized output tensor `--mtp-requantize-output-tensor new_type` [PR 1809](https://github.com/ikawrakow/ik_llama.cpp/pull/1809)

 #### Quantization performance improvements 

@@ -143,16 +144,16 @@ Implemented for Zen4, AVX2, ARM_NEON, Metal, CUDA [PR 682](https://github.com/ik
 * New split mode "graph" for multi GPU setups [PR 1022](https://github.com/ikawrakow/ik_llama.cpp/pull/1022)
 * Fused delta-net for Qwen3-Next and Qwen3.5-MoE [PR 1315](https://github.com/ikawrakow/ik_llama.cpp/pull/1315) [PR 1333](https://github.com/ikawrakow/ik_llama.cpp/pull/1333) [PR 1362](https://github.com/ikawrakow/ik_llama.cpp/pull/1362) [PR 1373](https://github.com/ikawrakow/ik_llama.cpp/pull/1373)
 * Hadamard transforms for K-cache and V-cache [PR 1033](https://github.com/ikawrakow/ik_llama.cpp/pull/1033) [PR 1034](https://github.com/ikawrakow/ik_llama.cpp/pull/1034) [PR 1527](https://github.com/ikawrakow/ik_llama.cpp/pull/1527)
-* Auto-fit offloaded tensors to available VRAM (MoE and dense models) [PR 1501](https://github.com/ikawrakow/ik_llama.cpp/pull/1501) [PR 1504](https://github.com/ikawrakow/ik_llama.cpp/pull/1504)
+* Auto-fit offloaded tensors to available VRAM (MoE and dense models) [PR 1501](https://github.com/ikawrakow/ik_llama.cpp/pull/1501) [PR 1504](https://github.com/ikawrakow/ik_llama.cpp/pull/1504), allows per GPU fit margin [PR 1872](https://github.com/ikawrakow/ik_llama.cpp/pull/1872)
 * Checkpoints for recurrent models [PR 1310](https://github.com/ikawrakow/ik_llama.cpp/pull/1310) [PR 1398](https://github.com/ikawrakow/ik_llama.cpp/pull/1398)
-* MTP decoding support for popular models like GLM-4.x MoE [1270](https://github.com/ikawrakow/ik_llama.cpp/pull/1270), Qwen 3.5/3.6 [1698](https://github.com/ikawrakow/ik_llama.cpp/pull/1698) [1745](https://github.com/ikawrakow/ik_llama.cpp/pull/1745), Gemma 4 [1744](https://github.com/ikawrakow/ik_llama.cpp/pull/1744)
+* MTP decoding support for popular models like GLM-4.x MoE [1270](https://github.com/ikawrakow/ik_llama.cpp/pull/1270), Qwen 3.5/3.6 [1698](https://github.com/ikawrakow/ik_llama.cpp/pull/1698) [1745](https://github.com/ikawrakow/ik_llama.cpp/pull/1745), Gemma 4 [1744](https://github.com/ikawrakow/ik_llama.cpp/pull/1744), GLM 5 [1890](https://github.com/ikawrakow/ik_llama.cpp/pull/1890)
 * Self speculative decoding, ngram [PR 1261](https://github.com/ikawrakow/ik_llama.cpp/pull/1261), suffix [PR 1646](https://github.com/ikawrakow/ik_llama.cpp/pull/1646)
 * String ban function for all completions [PR 1185](https://github.com/ikawrakow/ik_llama.cpp/pull/1185) [PR 1243](https://github.com/ikawrakow/ik_llama.cpp/pull/1243)
 * Expiring Logit Bias [PR 1731](https://github.com/ikawrakow/ik_llama.cpp/pull/1731)
 * OpenAI `/v1/responses` API endpoint [PR 1184](https://github.com/ikawrakow/ik_llama.cpp/pull/1184)
 * Function call support [PR 628](https://github.com/ikawrakow/ik_llama.cpp/pull/628)
 * jinja template support [PR 677](https://github.com/ikawrakow/ik_llama.cpp/pull/677)
-* Webui: New Features for Conversations, Settings, and Chat Messages [PR 618](https://github.com/ikawrakow/ik_llama.cpp/pull/618)
+* Webui: New Features for Conversations, Settings, and Chat Messages [PR 618](https://github.com/ikawrakow/ik_llama.cpp/pull/618), MCP [PR 1904](https://github.com/ikawrakow/ik_llama.cpp/pull/1904)
 * Dynamic control vector management endpoints [PR 1223](https://github.com/ikawrakow/ik_llama.cpp/pull/1223)
 * Legacy quants conversion schemes in `convert_hf_to_gguf.py` [PR 449](https://github.com/ikawrakow/ik_llama.cpp/pull/449), `Q6_0` in [PR 483](https://github.com/ikawrakow/ik_llama.cpp/pull/483)
 * Adaptive-P Sampler [PR 1100](https://github.com/ikawrakow/ik_llama.cpp/pull/1100) implemented as designed by it's author; supported on Webui
--- a/common/CMakeLists.txt
+++ b/common/CMakeLists.txt
@@ -71,6 +71,7 @@ add_library(${TARGET} STATIC
    train.cpp
    log.cpp
    log.h    
+    http.h
    ngram-cache.cpp
    ngram-cache.h
    ngram-map.cpp
--- a/common/chat-auto-parser-generator.cpp
+++ b/common/chat-auto-parser-generator.cpp
@@ -93,7 +93,8 @@ common_peg_arena autoparser::build_parser(const generation_params & inputs) cons
    }
    return build_chat_peg_parser([&](common_chat_peg_builder & p) {
        parser_build_context ctx(p, inputs);
-        bool                 extract_reasoning = inputs.reasoning_format != COMMON_REASONING_FORMAT_NONE;
+        bool                 extract_reasoning =
+            inputs.reasoning_format != COMMON_REASONING_FORMAT_NONE && (inputs.enable_thinking || !reasoning.start.empty());

        ctx.extracting_reasoning = extract_reasoning && reasoning.mode != reasoning_mode::NONE;
        ctx.content              = &content;
@@ -155,6 +156,16 @@ common_peg_parser analyze_content::build_parser(parser_build_context & ctx) cons
        }
        return p.content(p.until(start)) + start + p.content(p.until(end)) + end + p.end();
    }
+    if (is_end_delimited()) {
+        auto content = p.choice({
+            p.content(p.until(end)) + p.optspace(end),
+            p.content(p.rest()),
+        });
+        if (ctx.extracting_reasoning) {
+            return ctx.reasoning_parser + p.space() + content + p.end();
+        }
+        return content + p.end();
+    }
    return ctx.reasoning_parser + p.content(p.rest()) + p.end();
 }

@@ -216,7 +227,6 @@ common_peg_parser analyze_tools::build_tool_parser_json_native(parser_build_cont
        auto wrapped_content = ctx.content->build_optional_wrapped(ctx);
        return ctx.reasoning_parser + wrapped_content + tools_parser + p.end();
    }
-
    std::string tool_start = "{";
    if (!format.section_start.empty()) {
        tool_start = format.section_start;
@@ -224,7 +234,12 @@ common_peg_parser analyze_tools::build_tool_parser_json_native(parser_build_cont
        tool_start = format.per_call_start;
    }

-    return ctx.reasoning_parser + p.optional(p.content(p.until(tool_start))) + tools_parser + p.end();
+    if (!ctx.content || !ctx.content->is_end_delimited()) {
+        return ctx.reasoning_parser + p.optional(p.content(p.until(tool_start))) + tools_parser + p.end();
+    }
+
+    auto content_end = p.optional(p.optspace(ctx.content->end));
+    return ctx.reasoning_parser + p.space() + p.optional(p.content(p.until(tool_start))) + tools_parser + content_end + p.end();
 }

 common_peg_parser analyze_tools::build_func_parser(common_chat_peg_builder & p, const std::string & name,
@@ -333,7 +348,13 @@ common_peg_parser analyze_tools::build_tool_parser_tag_json(parser_build_context

    std::string trigger_marker       = !format.section_start.empty() ? format.section_start : format.per_call_start;
    auto        content_before_tools = trigger_marker.empty() ? p.eps() : p.until(trigger_marker);
-    return ctx.reasoning_parser + p.optional(p.content(content_before_tools)) + tool_calls + p.end();
+
+    if (!ctx.content || !ctx.content->is_end_delimited()) {
+        return ctx.reasoning_parser + p.optional(p.content(content_before_tools)) + tool_calls + p.end();
+    }
+
+    auto content_end = p.optional(p.optspace(ctx.content->end));
+    return ctx.reasoning_parser + p.space() + p.optional(p.content(content_before_tools)) + tool_calls + content_end + p.end();
 }

 common_peg_parser analyze_tools::build_tool_parser_tag_tagged(parser_build_context & ctx) const {
@@ -464,7 +485,13 @@ common_peg_parser analyze_tools::build_tool_parser_tag_tagged(parser_build_conte

    std::string trigger_marker       = !format.section_start.empty() ? format.section_start : format.per_call_start;
    auto        content_before_tools = trigger_marker.empty() ? p.eps() : p.until(trigger_marker);
-    return ctx.reasoning_parser + p.optional(p.content(content_before_tools)) + tool_calls + p.end();
+
+    if (!ctx.content || !ctx.content->is_end_delimited()) {
+        return ctx.reasoning_parser + p.optional(p.content(content_before_tools)) + tool_calls + p.end();
+    }
+
+    auto content_end = p.optional(p.optspace(ctx.content->end));
+    return ctx.reasoning_parser + p.space() + p.optional(p.content(content_before_tools)) + tool_calls + content_end + p.end();
 }

 }  // namespace autoparser
--- a/common/chat-auto-parser.h
+++ b/common/chat-auto-parser.h
@@ -101,6 +101,7 @@ enum class content_mode {
    PLAIN,                   // No content markers
    ALWAYS_WRAPPED,          // Content always wrapped with markers
    WRAPPED_WITH_REASONING,  // Content wrapped only when reasoning present
+    END_DELIMITED,           // Content is terminated by a marker but has no start marker
 };

 inline std::ostream & operator<<(std::ostream & os, const content_mode & mode) {
@@ -111,6 +112,8 @@ inline std::ostream & operator<<(std::ostream & os, const content_mode & mode) {
            return os << "ALWAYS_WRAPPED";
        case content_mode::WRAPPED_WITH_REASONING:
            return os << "WRAPPED_WITH_REASONING";
+        case content_mode::END_DELIMITED:
+            return os << "END_DELIMITED";
        default:
            return os << "UNKNOWN";
    }
@@ -286,6 +289,7 @@ struct analyze_content : analyze_base {
    common_peg_parser build_parser(parser_build_context & ctx) const override;

    bool is_always_wrapped() const;
+    bool is_end_delimited() const;
    common_peg_parser build_optional_wrapped(parser_build_context & ctx) const;
 };

--- a/common/chat-diff-analyzer.cpp
+++ b/common/chat-diff-analyzer.cpp
@@ -45,6 +45,28 @@ static std::vector<std::function<void(const common_chat_template & tmpl, autopar
              LOG_DBG(ANSI_ORANGE "[Patch: old Qwen/Deepseek thinking template]\n" ANSI_RESET);
          }
      },
+      // Poolside Laguna templates prefill <think> in the generation prompt, so generated
+      // reasoning starts immediately and is delimited only by </think>.
+      [](const common_chat_template & tmpl, autoparser & analysis) -> void {
+          if (tmpl.src.find("laguna_glm_thinking") != std::string::npos &&
+              tmpl.src.find("{{- \"<assistant>\\n\" -}}") != std::string::npos &&
+              tmpl.src.find("{{- '<think>' -}}") != std::string::npos) {
+              analysis.reasoning.mode  = reasoning_mode::TAG_BASED;
+              analysis.reasoning.start = "";
+              analysis.reasoning.end   = "</think>";
+              analysis.content.mode     = content_mode::END_DELIMITED;
+              analysis.content.end      = "</assistant>";
+              if (std::find(analysis.preserved_tokens.begin(), analysis.preserved_tokens.end(), "</think>") ==
+                  analysis.preserved_tokens.end()) {
+                  analysis.preserved_tokens.push_back("</think>");
+              }
+              if (std::find(analysis.preserved_tokens.begin(), analysis.preserved_tokens.end(), "</assistant>") ==
+                  analysis.preserved_tokens.end()) {
+                  analysis.preserved_tokens.push_back("</assistant>");
+              }
+              LOG_DBG(ANSI_ORANGE "[Patch: Poolside Laguna thinking template]\n" ANSI_RESET);
+          }
+      },
      // Granite 3.3, with separate reasoning and content markers
      [](const common_chat_template & tmpl, autoparser & analysis) -> void {
          if (tmpl.src.find("Write your thoughts between <think></think> and write your response between "
@@ -552,6 +574,10 @@ bool analyze_content::is_always_wrapped() const {
    return mode == content_mode::ALWAYS_WRAPPED && !start.empty() && !end.empty();
 }

+bool analyze_content::is_end_delimited() const {
+    return mode == content_mode::END_DELIMITED && !end.empty();
+}
+
 analyze_tools::analyze_tools(const common_chat_template & tmpl,
                             const jinja::caps &          caps,
                             const analyze_reasoning &    reasoning)
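
The Poolside Laguna patch in `common/chat-diff-analyzer.cpp` above sets an empty reasoning start, `</think>` as the reasoning end, and `END_DELIMITED` content terminated by `</assistant>`. The following is a minimal standalone sketch of what that combination means for a finished generation. It is not the PEG parser the repository builds. `split_output` and `split_end_delimited` are hypothetical names, and treating text without `</think>` as unfinished reasoning is an assumption of this sketch.

```cpp
#include <cassert>
#include <string>

// Hypothetical result type for this sketch; not part of the repository.
struct split_output {
    std::string reasoning;
    std::string content;
};

// Splits text generated after a prompt that already opened the reasoning block.
// Assumption: if reasoning_end never appears, the text is unfinished reasoning.
static split_output split_end_delimited(const std::string & generated,
                                        const std::string & reasoning_end,
                                        const std::string & content_end) {
    split_output out;

    const auto r = generated.find(reasoning_end);
    if (r == std::string::npos) {
        out.reasoning = generated;
        return out;
    }
    out.reasoning = generated.substr(0, r);

    // Skip whitespace between the reasoning end marker and the content.
    const auto begin = generated.find_first_not_of(" \t\n\r", r + reasoning_end.size());
    if (begin == std::string::npos) {
        return out;  // only whitespace follows the marker
    }

    // Content runs up to the end marker, or to the end of the text if it is missing.
    const auto c = generated.find(content_end, begin);
    out.content = generated.substr(begin, c == std::string::npos ? std::string::npos : c - begin);
    return out;
}
```

Called with `"</think>"` and `"</assistant>"`, the helper keeps the end marker out of the content, which is the visible effect of the `p.until(end)` branch in `analyze_content::build_parser`; the `p.rest()` branch corresponds to the case where the marker is missing.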
--- a/common/chat-peg-parser.cpp
+++ b/common/chat-peg-parser.cpp
@@ -785,7 +785,22 @@ common_peg_parser common_chat_peg_builder::prefix(const std::string & s, const s
    if (delimiter.empty()) {
        return literal(s);
    }
-    return literal(s.substr(0, s.rfind(delimiter)));
+    auto pos = s.rfind(delimiter);
+    if (pos == std::string::npos) {
+        // The generation prompt may force-open the reasoning block without the
+        // whitespace that surrounds the detected tag (e.g. a prompt ending in
+        // '<think>' while history renders '<think>\n'). Only strip when the
+        // prompt ends exactly with the trimmed tag, so prompts with trailing
+        // whitespace after the tag (e.g. '<think>\n') keep their behavior.
+        if (auto b = delimiter.find_first_not_of(" \t\n\r"); b != std::string::npos) {
+            auto e = delimiter.find_last_not_of (" \t\n\r");
+            auto trimmed = delimiter.substr(b, e - b + 1);
+            if (s.size() >= trimmed.size() && s.compare(s.size() - trimmed.size(), trimmed.size(), trimmed) == 0) {
+                pos = s.size() - trimmed.size();
+            }
+        }
+    }
+    return literal(s.substr(0, pos));
 }

 common_peg_parser common_chat_peg_builder::optspace(const std::string & tag) {
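
The fallback added to `common_chat_peg_builder::prefix` above can be exercised in isolation. The sketch below copies only the position logic into a hypothetical free function, `strip_trailing_delimiter`, which returns the kept string instead of wrapping it in `literal(...)`. The function name and the sample prompts are assumptions made for illustration.

```cpp
#include <cassert>
#include <string>

// Hypothetical standalone helper mirroring the position logic of prefix().
// Returns the part of s that precedes the last occurrence of delimiter, with a
// fallback for prompts that end in the delimiter minus its surrounding whitespace.
static std::string strip_trailing_delimiter(const std::string & s, const std::string & delimiter) {
    if (delimiter.empty()) {
        return s;
    }

    auto pos = s.rfind(delimiter);
    if (pos == std::string::npos) {
        const char * ws = " \t\n\r";
        const auto b = delimiter.find_first_not_of(ws);
        if (b != std::string::npos) {
            const auto e = delimiter.find_last_not_of(ws);
            const std::string trimmed = delimiter.substr(b, e - b + 1);
            // Strip only when s ends exactly with the trimmed tag.
            if (s.size() >= trimmed.size() &&
                s.compare(s.size() - trimmed.size(), trimmed.size(), trimmed) == 0) {
                pos = s.size() - trimmed.size();
            }
        }
    }

    // When pos is still npos, substr(0, npos) returns s unchanged.
    return s.substr(0, pos);
}
```

With a detected tag of `"<think>\n"`, a prompt ending in `"<think>"` now loses the tag through the fallback, a prompt ending in `"<think>\n"` is handled by the exact `rfind` match as before, and a prompt that contains neither form is returned whole.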
--- a/common/common.cpp
+++ b/common/common.cpp
@@ -124,7 +124,16 @@ static int32_t common_speculative_stage_effective_n_min(

 std::vector<common_speculative_stage_params> common_params_speculative::get_resolved_stages() const {
    if (!stages.empty()) {
-        return stages;
+        std::vector<common_speculative_stage_params> resolved;
+        resolved.reserve(stages.size());
+
+        for (const auto & stage : stages) {
+            if (stage.type != COMMON_SPECULATIVE_TYPE_NONE) {
+                resolved.push_back(stage);
+            }
+        }
+
+        return resolved;
    }

    if (type == COMMON_SPECULATIVE_TYPE_NONE) {
@@ -167,6 +176,9 @@ common_params_speculative common_params_speculative::with_stage_overrides(const
    if (stage.has_suffix_max_depth_override()) {
        result.suffix_max_depth = stage.suffix_max_depth;
    }
+    if (stage.has_suffix_corpus_override()) {
+        result.suffix_corpus = stage.suffix_corpus;
+    }

    result.n_max = std::max(result.n_max, 0);
    result.n_min = std::max(0, std::min(result.n_min, result.n_max));
@@ -186,10 +198,39 @@ bool common_params_speculative::has_stage_type(common_speculative_type stage_typ
    });
 }

+void common_params_speculative::remove_stage_type(common_speculative_type stage_type) {
+    stages.erase(std::remove_if(stages.begin(), stages.end(), [stage_type](const common_speculative_stage_params & stage) {
+        return stage.type == stage_type;
+    }), stages.end());
+
+    if (type == stage_type) {
+        const auto resolved = get_resolved_stages();
+        type = resolved.empty() ? COMMON_SPECULATIVE_TYPE_NONE : resolved.front().type;
+    }
+}
+
 bool common_params_speculative::has_composite_stage_chain() const {
    return get_resolved_stages().size() > 1;
 }

+bool common_params_speculative::needs_dft_model() const {
+    return has_stage_type(COMMON_SPECULATIVE_TYPE_DRAFT) ||
+        has_stage_type(COMMON_SPECULATIVE_TYPE_DFLASH) ||
+        (has_stage_type(COMMON_SPECULATIVE_TYPE_MTP) && has_dft());
+}
+
+void common_params_speculative::clear_dft() {
+    if (model_dft != nullptr) {
+        llama_free_model(model_dft);
+        model_dft = nullptr;
+    }
+
+    model.clear();
+    params.clear();
+    mparams_dft.path.clear();
+    cparams_dft = llama_context_default_params();
+}
+
 int32_t common_params_speculative::get_max_stage_n_max() const {
    const auto resolved = get_resolved_stages();
    if (resolved.empty()) {
@@ -619,28 +660,20 @@ static void common_speculative_finalize_stages(gpt_params & params) {
    auto & spec = params.speculative;

    if (!spec.stages.empty()) {
-        spec.type = spec.stages.front().type;
+        const auto resolved = spec.get_resolved_stages();
+        if (resolved.size() != spec.stages.size()) {
+            spec.stages = resolved;
+        }
+
+        spec.type = resolved.empty() ? COMMON_SPECULATIVE_TYPE_NONE : resolved.front().type;
        params.has_mtp = spec.has_stage_type(COMMON_SPECULATIVE_TYPE_MTP);
        return;
    }

-    const bool wants_mtp = params.has_mtp;
-    const bool wants_draft = spec.has_dft();
-
    if (spec.type != COMMON_SPECULATIVE_TYPE_NONE) {
        spec.stages.push_back({ .type = spec.type });
-
-        if (common_speculative_type_is_self_spec(spec.type)) {
-            if (wants_mtp) {
-                spec.stages.push_back({ .type = COMMON_SPECULATIVE_TYPE_MTP });
-            } else if (wants_draft) {
-                spec.stages.push_back({ .type = COMMON_SPECULATIVE_TYPE_DRAFT });
-            }
-        }
-    } else if (wants_mtp) {
+    } else if (params.has_mtp) {
        spec.stages.push_back({ .type = COMMON_SPECULATIVE_TYPE_MTP });
-    } else if (wants_draft) {
-        spec.stages.push_back({ .type = COMMON_SPECULATIVE_TYPE_DRAFT });
    }

    spec.type = spec.stages.empty() ? COMMON_SPECULATIVE_TYPE_NONE : spec.stages.front().type;
@@ -834,13 +867,16 @@ static std::string common_normalize_spec_stage_key(std::string key) {

    std::replace(key.begin(), key.end(), '-', '_');

-    if (key.rfind("spec_", 0) == 0) {
-        key.erase(0, 5);
-    }
-
    return key;
 }

+static std::invalid_argument common_speculative_legacy_option_error(
+        const std::string & arg,
+        const std::string & replacement) {
+    return std::invalid_argument(
+        "legacy speculative option '" + arg + "' is disabled; use " + replacement);
+}
+
 static void common_speculative_remove_explicit_stage(common_params_speculative & params, common_speculative_type type) {
    params.stages.erase(std::remove_if(params.stages.begin(), params.stages.end(), [type](const common_speculative_stage_params & stage) {
        return stage.type == type;
@@ -857,21 +893,21 @@ static void common_speculative_stage_apply_kv(
        const std::string & value_raw) {
    const std::string key = common_normalize_spec_stage_key(key_raw);

-    if (key == "draft" || key == "draft_max" || key == "draft_n" || key == "n_max") {
+    if (key == "n_max") {
        stage.n_max = std::stoi(value_raw);
        if (stage.n_max < 0) {
            throw std::invalid_argument("speculative stage n_max must be >= 0");
        }
        return;
    }
-    if (key == "draft_min" || key == "draft_n_min" || key == "n_min") {
+    if (key == "n_min") {
        stage.n_min = std::stoi(value_raw);
        if (stage.n_min < 0) {
            throw std::invalid_argument("speculative stage n_min must be >= 0");
        }
        return;
    }
-    if (key == "draft_p_min" || key == "p_min") {
+    if (key == "p_min") {
        stage.p_min = std::stof(value_raw);
        if (stage.p_min < 0.0f) {
            throw std::invalid_argument("speculative stage p_min must be >= 0");
@@ -906,7 +942,7 @@ static void common_speculative_stage_apply_kv(
        }
        return;
    }
-    if (key == "suffix_min_match_len" || key == "suffix_pattern_len") {
+    if (key == "suffix_min_match_len") {
        stage.suffix_min_match_len = std::stoi(value_raw);
        if (stage.suffix_min_match_len < 1) {
            throw std::invalid_argument("speculative stage suffix_min_match_len must be at least 1");
@@ -920,10 +956,100 @@ static void common_speculative_stage_apply_kv(
        }
        return;
    }
+    if (key == "suffix_corpus") {
+        stage.suffix_corpus = value_raw;
+        if (stage.suffix_corpus.empty()) {
+            throw std::invalid_argument("speculative stage suffix_corpus must not be empty");
+        }
+        return;
+    }

    throw std::invalid_argument("unknown speculative stage parameter: " + key_raw);
 }

+static std::vector<std::string> common_speculative_stage_split_kvs(const std::string & values) {
+    std::vector<std::string> result;
+    std::string current;
+    char quote = '\0';
+    bool escaped = false;
+
+    for (char ch : values) {
+        if (escaped) {
+            current += ch;
+            escaped = false;
+            continue;
+        }
+
+        if (ch == '\\') {
+            current += ch;
+            escaped = true;
+            continue;
+        }
+
+        if (quote != '\0') {
+            if (ch == quote) {
+                quote = '\0';
+            }
+            current += ch;
+            continue;
+        }
+
+        if ((ch == '\'' || ch == '"') && !current.empty() && current.back() == '=') {
+            quote = ch;
+            current += ch;
+            continue;
+        }
+
+        if (ch == ',') {
+            result.push_back(current);
+            current.clear();
+            continue;
+        }
+
+        current += ch;
+    }
+
+    if (quote != '\0') {
+        throw std::invalid_argument("invalid speculative stage option list: unterminated quote");
+    }
+
+    result.push_back(current);
+    return result;
+}
+
+static std::string common_speculative_stage_unescape_value(const std::string & value_raw) {
+    std::string value = value_raw;
+    if (value.size() >= 2) {
+        const char first = value.front();
+        const char last = value.back();
+        if ((first == '\'' && last == '\'') || (first == '"' && last == '"')) {
+            value = value.substr(1, value.size() - 2);
+        }
+    }
+
+    std::string result;
+    result.reserve(value.size());
+
+    for (size_t i = 0; i < value.size(); ++i) {
+        const char ch = value[i];
+        if (ch != '\\' || i + 1 >= value.size()) {
+            result += ch;
+            continue;
+        }
+
+        const char next = value[i + 1];
+        if (next == '\\' || next == ',' || next == '\'' || next == '"') {
+            result += next;
+            ++i;
+            continue;
+        }
+
+        result += ch;
+    }
+
+    return result;
+}
+
 static common_speculative_stage_params common_speculative_stage_from_arg(const std::string & value) {
    const auto spec_pos = value.find(':');
    const std::string type_name = value.substr(0, spec_pos);
@@ -938,15 +1064,13 @@ static common_speculative_stage_params common_speculative_stage_from_arg(const s
        return stage;
    }

-    std::stringstream ss(value.substr(spec_pos + 1));
-    std::string kv;
-    while (std::getline(ss, kv, ',')) {
+    for (const std::string & kv : common_speculative_stage_split_kvs(value.substr(spec_pos + 1))) {
        const auto eq_pos = kv.find('=');
        if (eq_pos == std::string::npos) {
            throw std::invalid_argument("invalid speculative stage option: " + kv);
        }

-        common_speculative_stage_apply_kv(stage, kv.substr(0, eq_pos), kv.substr(eq_pos + 1));
+        common_speculative_stage_apply_kv(stage, kv.substr(0, eq_pos), common_speculative_stage_unescape_value(kv.substr(eq_pos + 1)));
    }

    return stage;
@@ -1393,18 +1517,18 @@ bool gpt_params_find_arg(int argc, char ** argv, const std::string & arg, gpt_pa
    }
    if (arg == "--draft" || arg == "--draft-max" || arg == "--draft-n") {
        CHECK_ARG
-        params.speculative.n_max = std::stoi(argv[i]);
-        return true;
+        throw common_speculative_legacy_option_error(arg,
+            "the value inside the relevant repeated --spec-type entry, e.g. --spec-type mtp:n_max=" + std::string(argv[i]) + ",p_min=0.0 or --spec-type draft:n_max=" + std::string(argv[i]) + ",p_min=0.0");
    }
    if (arg == "--draft-min" || arg == "--draft-n-min") {
        CHECK_ARG
-        params.speculative.n_min = std::stoi(argv[i]);
-        return true;
+        throw common_speculative_legacy_option_error(arg,
+            "the value inside the relevant repeated --spec-type entry using the canonical key n_min, e.g. --spec-type ngram-mod:n_min=" + std::string(argv[i]));
    }
    if (arg == "--draft-p-min") {
        CHECK_ARG
-        params.speculative.p_min = std::stof(argv[i]);
-        return true;
+        throw common_speculative_legacy_option_error(arg,
+            "the value inside the relevant repeated --spec-type entry using the canonical key p_min, e.g. --spec-type mtp:p_min=" + std::string(argv[i]));
    }
    if (arg == "--recurrent-ckpt-mode") {
        CHECK_ARG
@@ -1459,91 +1583,46 @@ bool gpt_params_find_arg(int argc, char ** argv, const std::string & arg, gpt_pa
    }
    if (arg == "--spec-stage") {
        CHECK_ARG
-
-        if (params.speculative.stages.empty()) {
-            if (params.speculative.type != COMMON_SPECULATIVE_TYPE_NONE) {
-                throw std::invalid_argument("--spec-stage cannot be combined with --spec-type; use only --spec-stage for explicit stage chains");
-            }
-            if (params.has_mtp) {
-                throw std::invalid_argument("--spec-stage cannot be combined with -mtp/--multi-token-prediction; add the mtp fallback explicitly with --spec-stage mtp[:k=v,...]");
-            }
-        }
-
-        params.speculative.stages.push_back(common_speculative_stage_from_arg(argv[i]));
-        if (params.speculative.stages.size() == 1) {
-            params.speculative.type = params.speculative.stages.front().type;
-        }
-        params.has_mtp = params.speculative.has_stage_type(COMMON_SPECULATIVE_TYPE_MTP);
-        return true;
+        throw common_speculative_legacy_option_error(arg,
+            "repeated --spec-type SPEC[:k=v,...] entries, e.g. --spec-type ngram-mod:n_max=64,n_min=2,ngram_size_n=8 --spec-type mtp:n_max=1,p_min=0.0");
    }
    if (arg == "--spec-type") {
        CHECK_ARG
-        if (!params.speculative.stages.empty()) {
-            throw std::invalid_argument("--spec-type cannot be combined with --spec-stage; use only --spec-stage for explicit stage chains");
-        }
-
-        const auto stage = common_speculative_stage_from_arg(argv[i]);
-        const auto type = stage.type;
-        if (type == COMMON_SPECULATIVE_TYPE_NONE || type == COMMON_SPECULATIVE_TYPE_DFLASH || type == COMMON_SPECULATIVE_TYPE_MTP || common_speculative_type_is_self_spec(type)) {
-            params.speculative = params.speculative.with_stage_overrides(stage);
-            params.speculative.type = type;
-            if (type == COMMON_SPECULATIVE_TYPE_MTP) {
-                params.has_mtp = true;
-            }
-        } else {
-            throw std::invalid_argument("unknown speculative decoding type");
-        }
+        params.speculative.stages.push_back(common_speculative_stage_from_arg(argv[i]));
+        const auto resolved = params.speculative.get_resolved_stages();
+        params.speculative.type = resolved.empty() ? COMMON_SPECULATIVE_TYPE_NONE : resolved.front().type;
+        params.has_mtp = params.speculative.has_stage_type(COMMON_SPECULATIVE_TYPE_MTP);
        return true;
    }
    if (arg == "--spec-ngram-size-n") {
        CHECK_ARG
-        int value = std::stoi(argv[i]);
-        if (value < 1 || value > 1024) {
-            throw std::invalid_argument("ngram size N must be between 1 and 1024 inclusive");
-        }
-        params.speculative.ngram_size_n = value;
-        return true;
+        throw common_speculative_legacy_option_error(arg,
+            "the canonical stage key inside --spec-type, e.g. --spec-type ngram-mod:ngram_size_n=" + std::string(argv[i]));
    }
    if (arg == "--spec-ngram-size-m") {
        CHECK_ARG
-        int value = std::stoi(argv[i]);
-        if (value < 1 || value > 1024) {
-            throw std::invalid_argument("ngram size M must be between 1 and 1024 inclusive");
-        }
-        params.speculative.ngram_size_m = value;
-        return true;
+        throw common_speculative_legacy_option_error(arg,
+            "the canonical stage key inside --spec-type, e.g. --spec-type ngram-map-k4v:ngram_size_m=" + std::string(argv[i]));
    }
    if (arg == "--spec-ngram-min-hits") {
        CHECK_ARG
-        int value = std::stoi(argv[i]);
-        if (value < 1) {
-            throw std::invalid_argument("ngram min hits must be at least 1");
-        }
-        params.speculative.ngram_min_hits = value;
-        return true;
+        throw common_speculative_legacy_option_error(arg,
+            "the canonical stage key inside --spec-type, e.g. --spec-type ngram-map-k4v:ngram_min_hits=" + std::string(argv[i]));
    }
    if (arg == "--suffix-pattern-len") {
        CHECK_ARG
-        int value = std::stoi(argv[i]);
-        if (value < 1) {
-            throw std::invalid_argument("suffix pattern length must be at least 1");
-        }
-        params.speculative.suffix_min_match_len = value;
-        return true;
+        throw common_speculative_legacy_option_error(arg,
+            "the canonical stage key inside --spec-type, e.g. --spec-type suffix:suffix_min_match_len=" + std::string(argv[i]));
    }
    if (arg == "--suffix-max-depth") {
        CHECK_ARG
-        int value = std::stoi(argv[i]);
-        if (value < 1) {
-            throw std::invalid_argument("suffix max depth must be at least 1");
-        }
-        params.speculative.suffix_max_depth = value;
-        return true;
+        throw common_speculative_legacy_option_error(arg,
+            "the canonical stage key inside --spec-type, e.g. --spec-type suffix:suffix_max_depth=" + std::string(argv[i]));
    }
    if (arg == "--suffix-corpus") {
        CHECK_ARG
-        params.speculative.suffix_corpus = argv[i];
-        return true;
+        throw common_speculative_legacy_option_error(arg,
+            "the canonical stage key inside --spec-type, e.g. --spec-type suffix:suffix_corpus=" + std::string(argv[i]));
    }
    if (arg == "-a" || arg == "--alias") {
        CHECK_ARG
@ -1804,6 +1883,10 @@ bool gpt_params_find_arg(int argc, char ** argv, const std::string & arg, gpt_pa
    if (arg == "-amb" || arg == "--attention-max-batch") {
        CHECK_ARG
        params.attn_max_batch = std::stoi(argv[i]);
+        if (params.attn_max_batch > 0 && params.attn_max_batch < 128) {
+            LLAMA_LOG_WARN("amb = %d is too low. Changing to 128\n", params.attn_max_batch);
+            params.attn_max_batch = 128;
+        }
        return true;
    }
    if (arg == "-no-fmoe" || arg == "--no-fused-moe") {
@ -1988,17 +2071,12 @@ bool gpt_params_find_arg(int argc, char ** argv, const std::string & arg, gpt_pa
        return true;
    }
    if (arg == "-mtp" || arg == "--multi-token-prediction") {
-        if (!params.speculative.stages.empty()) {
-            throw std::invalid_argument("-mtp/--multi-token-prediction cannot be combined with --spec-stage; add the mtp fallback explicitly with --spec-stage mtp[:k=v,...]");
-        }
-
-        params.has_mtp = true;
-        return true;
+        throw common_speculative_legacy_option_error(arg,
+            "--spec-type mtp:n_max=1,p_min=0.0");
    }
    if (arg == "-no-mtp" || arg == "--no-multi-token-prediction") {
-        params.has_mtp = false;
-        common_speculative_remove_explicit_stage(params.speculative, COMMON_SPECULATIVE_TYPE_MTP);
-        return true;
+        throw common_speculative_legacy_option_error(arg,
+            "remove the mtp entry from repeated --spec-type arguments");
    }
    if (arg == "-draft" || arg == "--draft-params") {
        CHECK_ARG
@ -2409,6 +2487,10 @@ bool gpt_params_find_arg(int argc, char ** argv, const std::string & arg, gpt_pa
        params.webui = common_webui_from_name(std::string(argv[i]));
        return true;
    }
+    if (arg == "--webui-mcp-proxy" || arg == "--ui-mcp-proxy") {
+        params.webui_mcp_proxy = true;
+        return true;
+    }
    if (arg == "--api-key") {
        CHECK_ARG
        params.api_keys.push_back(argv[i]);
@ -3180,29 +3262,21 @@ void gpt_params_print_usage(int /*argc*/, char ** argv, const gpt_params & param
    options.push_back({ "*",           "-hfr,  --hf-repo REPO",         "Hugging Face model repository (default: unused)" });
    options.push_back({ "*",           "-hff,  --hf-file FILE",         "Hugging Face model file (default: unused)" });
    options.push_back({ "*",           "-hft,  --hf-token TOKEN",       "Hugging Face access token (default: value from HF_TOKEN environment variable)" });
-    options.push_back({ "*", "-mtp, --multi-token-prediction",          "legacy shortcut for enabling MTP when --spec-stage is not used (default: %s)", params.has_mtp ? "true" : "false" });
-    options.push_back({ "*", "-no-mtp, --no-multi-token-prediction",    "disable the legacy MTP shortcut or remove an explicit MTP stage (default: %s)", !params.has_mtp ? "true" : "false" });
-    options.push_back({ "*", "--draft-max, --draft, --draft-n N",
-                                                                        "global default number of tokens to draft for speculative decoding or for stages without an explicit n_max override (default: %d)", params.speculative.n_max });
-    options.push_back({ "*", "--draft-min, --draft-n-min N",   "global default minimum draft threshold or fallback threshold for stages without an explicit n_min override" });
-    options.push_back({ "*", "--draft-p-min P",                "global default minimum speculative decoding probability (greedy) for stages without an explicit p_min override (default: %.1f)", (double)params.speculative.p_min });
    options.push_back({ "*", "--recurrent-ckpt-mode MODE",    "checkpoint strategy for recurrent/hybrid speculative decoding\n"
                                                              "  auto         auto-select: per-step if CUDA full-GPU, gpu-fallback otherwise (default)\n"
                                                              "  per-step     save SSM state per draft step in VRAM; no re-decode on rejection\n"
                                                              "  gpu-fallback copy state to GPU buffer; re-decode on rejection\n"
                                                              "  cpu          serialise state via llama_state_seq; re-decode on rejection" });
-    options.push_back({ "*", "--spec-stage SPEC[:k=v,...]",    "explicit speculative stage. repeat once for a supported two-stage chain.\n"
-                                                              "examples: --spec-stage ngram-mod:n_max=64,n_min=2 --spec-stage mtp:n_max=1\n"
-                                                              "supported two-stage shape in this PR: self-spec first, then mtp or draft fallback" });
-    options.push_back({ "*", "--spec-type Name[:k=v,...] [none | dflash | mtp | ngram-cache | ngram-simple | ngram-map-k | ngram-map-k4v | ngram-mod | suffix]", "single-stage speculative selection when --spec-stage is not used (default: %d)\n", (int)params.speculative.type});
-    options.push_back({ "*", "--spec-ngram-size-n N", "ngram size N for ngram-simple/ngram-map speculative decoding, length of lookup n-gram (default: %d)\n",params.speculative.ngram_size_n });
-
-    options.push_back({ "*", "--spec-ngram-size-m N", "ngram size M for ngram-simple/ngram-map speculative decoding, length of draft m-gram (default: %d)\n", params.speculative.ngram_size_m });
-
-    options.push_back({ "*", "--spec-ngram-min-hits N", "minimum hits for ngram-map speculative decoding (default: %d)\n", params.speculative.ngram_min_hits });
-    options.push_back({ "*", "--suffix-pattern-len N",   "minimum context match length for suffix decoding (default: %d)", params.speculative.suffix_min_match_len });
-    options.push_back({ "*", "--suffix-max-depth N",     "suffix tree maximum depth for suffix decoding (default: %d)",    params.speculative.suffix_max_depth });
-    options.push_back({ "*", "--suffix-corpus PATH",     "corpus file to pre-warm the suffix tree: .json (array of strings or conversation messages) or .bin (raw int32 token IDs)" });
+    options.push_back({ "*", "--spec-type SPEC[:k=v,...]",      "canonical speculative stage entry; repeat for a supported two-stage chain.\n"
+                                                              "types: none, draft, dflash, mtp, ngram-cache, ngram-simple, ngram-map-k, ngram-map-k4v, ngram-mod, suffix\n"
+                                                              "canonical keys: n_max,n_min,p_min,cross_ctx,ngram_size_n,ngram_size_m,ngram_min_hits,suffix_min_match_len,suffix_max_depth,suffix_corpus\n"
+                                                              "for comma-bearing string values, quote the value inside the stage payload for normal shell use\n"
+                                                              "if argv is passed directly without shell unescaping, the parser also accepts escaped commas as \\,\n"
+                                                              "examples: --spec-type mtp:n_max=1,p_min=0.0\n"
+                                                              "          --model-draft draft.gguf --spec-type dflash:n_max=4,cross_ctx=512\n"
+                                                              "          --spec-type ngram-mod:n_max=64,n_min=2,ngram_size_n=8 --spec-type mtp:n_max=1,p_min=0.0\n"
+                                                              "          --spec-type \"suffix:n_max=16,n_min=2,suffix_min_match_len=5,suffix_max_depth=64,suffix_corpus='/tmp/spec,type-corpus.json'\"\n"
+                                                              "legacy --spec-stage, --draft-*, --spec-ngram-*, --suffix-* and -mtp flags are rejected" });
    options.push_back({ "*", "--spec-autotune",          "automatically tune speculative params to maximize tokens/sec" });

    options.push_back({ "retrieval" });
@ -3246,6 +3320,7 @@ void gpt_params_print_usage(int /*argc*/, char ** argv, const gpt_params & param
                                                            "- auto: default webui \n"
                                                            "- llamacpp: llamacpp webui \n"
                                                            "(default: auto)", });
+    options.push_back({ "server",      "       --ui-mcp-proxy, --webui-mcp-proxy",          "experimental: whether to enable MCP CORS proxy - do not enable in untrusted environments (default: disabled)" });
    options.push_back({ "server",      "       --api-key KEY",          "API key to use for authentication (default: none)" });
    options.push_back({ "server",      "       --api-key-file FNAME",   "path to file containing API keys (default: none)" });
    options.push_back({ "server",      "       --ssl-key-file FNAME",   "path to file a PEM-encoded SSL private key" });
@ -4024,14 +4099,7 @@ static std::pair<int, int> get_batch_ubatch(const gpt_params & params) {
    if (params.n_ctx > 0) {
        n_batch = std::min(n_batch, params.n_ctx);
    }
-    if (!params.mmproj.path.empty() && params.mmproj_use_gpu) {
-        // temporary fix for qwen mtmd (only when mmproj is on GPU)
-        n_batch = std::max(n_batch, n_ubatch);
-        n_ubatch = n_batch;
-        fprintf(stdout, "Adjust batch size for mtmd: u_batch = %d, batch = %d\n", n_ubatch, n_batch);
-    } else {
-        n_ubatch = std::min(n_batch, n_ubatch);
-    }
+    n_ubatch = std::min(n_batch, n_ubatch);
    return {n_batch, n_ubatch};
 }

@ -5121,7 +5189,7 @@ void yaml_dump_non_result_info(FILE * stream, const gpt_params & params, const l

    yaml_dump_string_multiline(stream, "in_prefix", params.input_prefix.c_str());
    fprintf(stream, "in_prefix_bos: %s # default: false\n", params.input_prefix_bos ? "true" : "false");
-    yaml_dump_string_multiline(stream, "in_suffix", params.input_prefix.c_str());
+    yaml_dump_string_multiline(stream, "in_suffix", params.input_suffix.c_str());
    fprintf(stream, "interactive: %s # default: false\n", params.interactive ? "true" : "false");
    fprintf(stream, "interactive_first: %s # default: false\n", params.interactive_first ? "true" : "false");
    fprintf(stream, "keep: %d # default: 0\n", params.n_keep);
--- a/common/common.h
+++ b/common/common.h
@ -171,6 +171,7 @@ struct common_speculative_stage_params {

    int32_t suffix_min_match_len = -1;
    int32_t suffix_max_depth = -1;
+    std::string suffix_corpus;

    bool has_n_max_override() const { return n_max >= 0; }
    bool has_n_min_override() const { return n_min >= 0; }
@ -181,6 +182,7 @@ struct common_speculative_stage_params {
    bool has_ngram_min_hits_override() const { return ngram_min_hits > 0; }
    bool has_suffix_min_match_len_override() const { return suffix_min_match_len >= 0; }
    bool has_suffix_max_depth_override() const { return suffix_max_depth >= 0; }
+    bool has_suffix_corpus_override() const { return !suffix_corpus.empty(); }
 };

 struct common_params_model {
@ -254,7 +256,10 @@ struct common_params_speculative {
    common_params_speculative with_stage_overrides(const common_speculative_stage_params & stage) const;
    bool has_stage_chain() const;
    bool has_stage_type(common_speculative_type stage_type) const;
+    void remove_stage_type(common_speculative_type stage_type);
    bool has_composite_stage_chain() const;
+    bool needs_dft_model() const;
+    void clear_dft();
    int32_t get_max_stage_n_max() const;
    int32_t get_min_usable_stage_n_min() const;

@ -505,6 +510,7 @@ struct gpt_params {

    // "advanced" endpoints are disabled by default for better security
    common_webui webui = COMMON_WEBUI_AUTO;
+    bool webui_mcp_proxy  = false;
    bool endpoint_slots   = true;
    bool endpoint_props   = false; // only control POST requests, not GET
    bool endpoint_metrics = false;
--- a/common/http.h
+++ b/common/http.h
@ -0,0 +1,99 @@
+#pragma once
+
+#include <cpp-httplib/httplib.h>
+
+struct common_http_url {
+    std::string scheme;
+    std::string user;
+    std::string password;
+    std::string host;
+    int port = 0;
+    std::string path;
+};
+
+static common_http_url common_http_parse_url(const std::string & url) {
+    common_http_url parts;
+    auto scheme_end = url.find("://");
+
+    if (scheme_end == std::string::npos) {
+        throw std::runtime_error("invalid URL: no scheme");
+    }
+    parts.scheme = url.substr(0, scheme_end);
+
+    if (parts.scheme != "http" && parts.scheme != "https") {
+        throw std::runtime_error("unsupported URL scheme: " + parts.scheme);
+    }
+
+    auto rest = url.substr(scheme_end + 3);
+    auto at_pos = rest.find('@');
+
+    if (at_pos != std::string::npos) {
+        auto auth = rest.substr(0, at_pos);
+        auto colon_pos = auth.find(':');
+        if (colon_pos != std::string::npos) {
+            parts.user = auth.substr(0, colon_pos);
+            parts.password = auth.substr(colon_pos + 1);
+        } else {
+            parts.user = auth;
+        }
+        rest = rest.substr(at_pos + 1);
+    }
+
+    auto slash_pos = rest.find('/');
+
+    if (slash_pos != std::string::npos) {
+        parts.host = rest.substr(0, slash_pos);
+        parts.path = rest.substr(slash_pos);
+    } else {
+        parts.host = rest;
+        parts.path = "/";
+    }
+
+    auto colon_pos = parts.host.find(':');
+
+    if (colon_pos != std::string::npos) {
+        parts.port = std::stoi(parts.host.substr(colon_pos + 1));
+        parts.host = parts.host.substr(0, colon_pos);
+    } else if (parts.scheme == "http") {
+        parts.port = 80;
+    } else if (parts.scheme == "https") {
+        parts.port = 443;
+    } else {
+        throw std::runtime_error("unsupported URL scheme: " + parts.scheme);
+    }
+
+    return parts;
+}
+
+static std::pair<httplib::Client, common_http_url> common_http_client(const std::string & url) {
+    common_http_url parts = common_http_parse_url(url);
+
+    if (parts.host.empty()) {
+        throw std::runtime_error("error: invalid URL format");
+    }
+
+#ifndef CPPHTTPLIB_OPENSSL_SUPPORT
+    if (parts.scheme == "https") {
+        throw std::runtime_error(
+            "HTTPS is not supported. Please rebuild with one of:\n"
+            "  -DLLAMA_BUILD_BORINGSSL=ON\n"
+            "  -DLLAMA_BUILD_LIBRESSL=ON\n"
+            "  -DLLAMA_OPENSSL=ON (default, requires OpenSSL dev files installed)"
+        );
+    }
+#endif
+
+    httplib::Client cli(parts.scheme + "://" + parts.host + ":" + std::to_string(parts.port));
+
+    if (!parts.user.empty()) {
+        cli.set_basic_auth(parts.user, parts.password);
+    }
+
+    cli.set_follow_location(true);
+
+    return { std::move(cli), std::move(parts) };
+}
+
+static std::string common_http_show_masked_url(const common_http_url & parts) {
+    return parts.scheme + "://" + (parts.user.empty() ? "" : "****:****@") + parts.host + parts.path;
+}
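Reviewer note on `common/http.h`: the sketch below restates the parsing steps of `common_http_parse_url` and the masking in `common_http_show_masked_url` without the cpp-httplib dependency, so the behaviour can be checked in isolation. The names `url_parts`, `parse_url` and `masked_url` are hypothetical stand-ins, not part of the PR. The sketch keeps the same order of operations as the diff (userinfo is split on the first `@` before the path is separated), and like the diff it omits the port from the masked form.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical stand-in for common_http_url (same fields, port default-initialized).
struct url_parts {
    std::string scheme;
    std::string user;
    std::string password;
    std::string host;
    int         port = 0;
    std::string path;
};

// Mirrors the steps in the diff: scheme, optional user[:password]@, host[:port], path.
static url_parts parse_url(const std::string & url) {
    url_parts parts;

    const auto scheme_end = url.find("://");
    if (scheme_end == std::string::npos) {
        throw std::runtime_error("invalid URL: no scheme");
    }
    parts.scheme = url.substr(0, scheme_end);
    if (parts.scheme != "http" && parts.scheme != "https") {
        throw std::runtime_error("unsupported URL scheme: " + parts.scheme);
    }

    std::string rest = url.substr(scheme_end + 3);

    // Optional credentials before the first '@'.
    const auto at_pos = rest.find('@');
    if (at_pos != std::string::npos) {
        const std::string auth = rest.substr(0, at_pos);
        const auto colon_pos = auth.find(':');
        if (colon_pos != std::string::npos) {
            parts.user     = auth.substr(0, colon_pos);
            parts.password = auth.substr(colon_pos + 1);
        } else {
            parts.user = auth;
        }
        rest = rest.substr(at_pos + 1);
    }

    // Everything from the first '/' onward is the path; default to "/".
    const auto slash_pos = rest.find('/');
    if (slash_pos != std::string::npos) {
        parts.host = rest.substr(0, slash_pos);
        parts.path = rest.substr(slash_pos);
    } else {
        parts.host = rest;
        parts.path = "/";
    }

    // Explicit port wins; otherwise fall back to the scheme default.
    const auto colon_pos = parts.host.find(':');
    if (colon_pos != std::string::npos) {
        parts.port = std::stoi(parts.host.substr(colon_pos + 1));
        parts.host = parts.host.substr(0, colon_pos);
    } else {
        parts.port = parts.scheme == "http" ? 80 : 443;
    }

    return parts;
}

// Credentials are replaced by a fixed mask; the port is not printed.
static std::string masked_url(const url_parts & parts) {
    return parts.scheme + "://" + (parts.user.empty() ? "" : "****:****@") + parts.host + parts.path;
}
```

Because the `@` search runs over the whole remainder, a URL without credentials whose path contains `@` would be split at that character. Separating host and path before looking for userinfo would avoid this, if such URLs are expected.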
--- a/common/sampling.cpp
+++ b/common/sampling.cpp
@ -24,7 +24,7 @@ struct common_sampler * common_sampler_init(const struct llama_model * model, co
    result->grammar = nullptr;
    result->rbudget = nullptr;

-    struct llama_grammar* grmr;
+    struct llama_grammar* grmr = nullptr;
    const std::string & grammar_str = common_grammar_value(params.grammar);
    if (grammar_str.compare(0, 11, "%llguidance") == 0) {
 #ifdef LLAMA_USE_LLGUIDANCE
--- a/common/spec-tuner.cpp
+++ b/common/spec-tuner.cpp
@ -357,20 +357,15 @@ void spec_tuner::print_best() const {

    {
        std::ostringstream oss;
-        oss << "Autotune reuse: ";
+        oss << "Autotune reuse: --spec-type " << common_speculative_type_to_str(spec_type);
+        bool first_kv = true;
        for (const auto & coord : coords) {
            bool is_int = (coord.name != "p_min");
-            if      (coord.name == "n_max")             oss << "--draft-max ";
-            else if (coord.name == "p_min")             oss << "--draft-p-min ";
-            else if (coord.name == "n_min")             oss << "--draft-min ";
-            else if (coord.name == "ngram_size_n")      oss << "--spec-ngram-size-n ";
-            else if (coord.name == "ngram_size_m")      oss << "--spec-ngram-size-m ";
-            else if (coord.name == "ngram_min_hits")    oss << "--spec-ngram-min-hits ";
-            else if (coord.name == "suffix_min_match_len") oss << "--suffix-pattern-len ";
-            else                                        oss << "--" << coord.name << " ";
+            oss << (first_kv ? ':' : ',') << coord.name << '=';
+            first_kv = false;

-            if (is_int) oss << (int)coord.arms[coord.best_idx].value << " ";
-            else oss << std::fixed << std::setprecision(2) << coord.arms[coord.best_idx].value << " ";
+            if (is_int) oss << (int)coord.arms[coord.best_idx].value;
+            else oss << std::fixed << std::setprecision(2) << coord.arms[coord.best_idx].value;
        }
        LOG_INF("%s\n", oss.str().c_str());
    }
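Reviewer note on `spec_tuner::print_best`: the new output is a single `--spec-type NAME:k=v,...` token rather than a list of legacy flags. A minimal sketch of that formatting follows, assuming the tuned values arrive as name/value pairs. The helper `format_spec_type_arg` and its pair-based input are hypothetical; the real code iterates `coords` and reads `coord.arms[coord.best_idx].value`.

```cpp
#include <iomanip>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: builds "--spec-type NAME:k=v,k=v" from tuned values.
// As in the diff, only p_min is printed as a float (two decimals); every
// other key is truncated to int.
static std::string format_spec_type_arg(
        const std::string & type_name,
        const std::vector<std::pair<std::string, double>> & best) {
    std::ostringstream oss;
    oss << "--spec-type " << type_name;

    bool first_kv = true;
    for (const auto & kv : best) {
        // ':' introduces the payload, ',' separates subsequent keys.
        oss << (first_kv ? ':' : ',') << kv.first << '=';
        first_kv = false;

        if (kv.first == "p_min") {
            oss << std::fixed << std::setprecision(2) << kv.second;
        } else {
            oss << static_cast<int>(kv.second);
        }
    }
    return oss.str();
}
```

The resulting string can be pasted directly after the binary name, which is the point of the change: the autotune hint now round-trips through the same parser as a hand-written `--spec-type` entry.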
--- a/common/speculative-impl.h
+++ b/common/speculative-impl.h
@ -1128,6 +1128,7 @@ struct common_speculative_state_suffix : public common_speculative_state {
 struct common_speculative {
    std::vector<common_speculative_config> configs; // resolved stage config for each implementation
    std::vector<std::unique_ptr<common_speculative_state>> impls; // list of implementations to use and their states
+    common_speculative_checkpoint checkpoint;
    common_speculative_state * curr_impl = nullptr; // current implementation in use (for stats)
    std::unique_ptr<spec_tuner> tuner;
    int last_n_drafted = 0;
@ -1522,6 +1523,7 @@ void common_speculative_free(common_speculative * spec) {
        return;
    }

+    spec->checkpoint.clear();
    delete spec;
 }

--- a/common/speculative.cpp
+++ b/common/speculative.cpp
@ -53,6 +53,18 @@ const std::map<std::string, enum common_speculative_type> common_speculative_typ
    {"suffix",        COMMON_SPECULATIVE_TYPE_SUFFIX}
 };

+void common_speculative_checkpoint::clear() {
+    valid = false;
+    per_step_enabled = false;
+    n_past = 0;
+    sampled = LLAMA_TOKEN_NULL;
+
+    if (sampler != nullptr) {
+        common_sampler_free(sampler);
+        sampler = nullptr;
+    }
+}
+
 struct common_speculative_config {
    common_speculative_stage_params stage;
    common_speculative_type type;
@ -70,10 +82,10 @@ static bool common_speculative_are_compatible(
    const llama_vocab * vocab_tgt = llama_model_get_vocab(model_tgt);
    const llama_vocab * vocab_dft = llama_model_get_vocab(model_dft);

-    const bool vocab_type_tgt = llama_vocab_type(vocab_tgt);
+    const auto vocab_type_tgt = llama_vocab_type(vocab_tgt);
    LOG_DBG("%s: vocab_type tgt: %d\n", __func__, vocab_type_tgt);

-    const bool vocab_type_dft = llama_vocab_type(vocab_dft);
+    const auto vocab_type_dft = llama_vocab_type(vocab_dft);
    LOG_DBG("%s: vocab_type dft: %d\n", __func__, vocab_type_dft);

    if (vocab_type_tgt != vocab_type_dft) {
@ -261,6 +273,17 @@ static void dflash_append_target_features(
    int32_t n_rows);
 static void dflash_clear_target_features(common_speculative_state_dflash & state);
 static void mtp_invalidate_cached_drafts(common_speculative_state_mtp & state);
+static bool common_speculative_checkpoint_save(
+    common_speculative_checkpoint & ckpt,
+    llama_model * model,
+    llama_context * ctx,
+    common_sampler * sampler_src,
+    const common_params_sampling & sparams,
+    llama_seq_id seq_id,
+    llama_pos n_past,
+    llama_token sampled,
+    int max_tokens,
+    int ckpt_mode);

 static std::vector<llama_token> mtp_speculative_gen_draft(
    common_speculative_state_mtp & state,
@ -583,6 +606,251 @@ bool common_speculative_ensure_sequence_hidden(
    return common_speculative_capture_output_hidden(spec, ctx, -1, seq_id, pos);
 }

+common_speculative_draft_result common_speculative_draft_ex(
+        common_speculative * spec,
+        llama_context * ctx,
+        common_params_speculative & params,
+        const llama_tokens & prompt_tgt,
+        llama_token id_last,
+        llama_pos draft_base_pos,
+        llama_seq_id draft_seq_id) {
+    common_speculative_draft_result result = {};
+
+    if (common_speculative_has_type(spec, COMMON_SPECULATIVE_TYPE_MTP)) {
+        if (!common_speculative_ensure_sequence_hidden(spec, ctx, draft_seq_id, draft_base_pos - 1)) {
+            LOG_ERR("%s: seq_id=%d MTP hidden state is empty during speculation\n",
+                    __func__, (int) draft_seq_id);
+            return result;
+        }
+    }
+
+    result.tokens = common_speculative_draft(
+        spec,
+        params,
+        prompt_tgt,
+        id_last,
+        draft_base_pos,
+        draft_seq_id);
+    result.type = spec != nullptr && spec->curr_impl != nullptr
+        ? spec->curr_impl->type
+        : COMMON_SPECULATIVE_TYPE_NONE;
+
+    return result;
+}
+
+static bool common_speculative_has_target_features(const common_speculative * spec) {
+    return common_speculative_has_type(spec, COMMON_SPECULATIVE_TYPE_MTP) ||
+        common_speculative_has_type(spec, COMMON_SPECULATIVE_TYPE_DFLASH);
+}
+
+bool common_speculative_load_draft_model(
+        common_params_speculative & params,
+        const gpt_params         & params_base) {
+    if (!params.has_dft()) {
+        return true;
+    }
+
+    gpt_params params_dft;
+    params_dft.devices          = params.devices;
+    params_dft.model            = params.model;
+    params_dft.main_gpu         = params_base.main_gpu;
+    params_dft.n_gpu_layers     = params.n_gpu_layers;
+    params_dft.rpc_servers      = params_base.rpc_servers;
+    params_dft.cache_type_k     = params.cache_type_k.empty() ? params_base.cache_type_k : params.cache_type_k;
+    params_dft.cache_type_v     = params.cache_type_v.empty() ? params_base.cache_type_v : params.cache_type_v;
+    params_dft.flash_attn       = params_base.flash_attn;
+    params_dft.k_cache_hadamard = params_base.k_cache_hadamard;
+    params_dft.v_cache_hadamard = params_base.v_cache_hadamard;
+
+    if (params.has_stage_type(COMMON_SPECULATIVE_TYPE_DFLASH)) {
+        params_dft.split_mode = params_base.split_mode;
+        for (size_t i = 0; i < std::size(params_dft.tensor_split); ++i) {
+            params_dft.tensor_split[i] = params_base.tensor_split[i];
+        }
+        params_dft.attn_max_batch = params_base.attn_max_batch;
+        params_dft.graph_reuse = params_base.graph_reuse;
+        params_dft.split_mode_graph_scheduling = params_base.split_mode_graph_scheduling;
+        params_dft.scheduler_async = params_base.scheduler_async;
+        params_dft.max_extra_alloc_MiB = params_base.max_extra_alloc_MiB;
+        params_dft.reduce_type = params_base.reduce_type;
+    }
+
+    if (!params.params.empty()) {
+        auto [argc, argv] = parse_command_line("llama-server " + params.params);
+        if (!gpt_params_parse(argc, argv, params_dft)) {
+            gpt_params_print_usage(argc, argv, params_dft);
+            free_command_line(argc, argv);
+            return false;
+        }
+        free_command_line(argc, argv);
+    }
+
+    LOG_INF("%s: loading draft model '%s'\n", __func__, params_dft.model.c_str());
+
+    if (params_dft.n_ctx == 0) {
+        params_dft.n_ctx = params.n_ctx;
+    }
+    if (params.has_stage_type(COMMON_SPECULATIVE_TYPE_DFLASH) && params_dft.n_gpu_layers < 0) {
+        params_dft.n_gpu_layers = params_base.n_gpu_layers;
+    }
+    params_dft.n_ctx = params_dft.n_ctx == 0 ? params_base.n_ctx / params_base.n_parallel : params_dft.n_ctx;
+    params_dft.n_parallel = 1;
+    params_dft.n_batch = params_dft.n_ctx;
+
+    params.mparams_dft.path = params_dft.model;
+
+    llama_model_params mparams_dft = common_model_params_to_llama(params_dft);
+    llama_model * loaded_model = llama_model_load_from_file(params_dft.model.c_str(), mparams_dft);
+    if (loaded_model == nullptr) {
+        LOG_ERR("%s: failed to load draft model '%s'\n", __func__, params_dft.model.c_str());
+        return false;
+    }
+
+    params.model_dft = loaded_model;
+    params.cparams_dft = common_context_params_to_llama(params_dft);
+    return true;
+}
+
+bool common_speculative_prepare_mtp_runtime(
+        common_params_speculative & params,
+        const gpt_params         & params_base,
+        const llama_model        * model,
+        bool                       has_external_mtp) {
+    if (!params.has_stage_type(COMMON_SPECULATIVE_TYPE_MTP)) {
+        return false;
+    }
+
+    if (llama_model_n_nextn_layer(model) == 0 && !has_external_mtp) {
+        LOG_WRN("%s: MTP speculative stage requested, but model has 0 NextN layers. Removing MTP from the configured stage chain.\n",
+                __func__);
+        params.remove_stage_type(COMMON_SPECULATIVE_TYPE_MTP);
+        if (!params.needs_dft_model()) {
+            params.clear_dft();
+        }
+        return false;
+    }
+
+    if (!has_external_mtp) {
+        gpt_params params_mtp = params_base;
+        params_mtp.pooling_type = LLAMA_POOLING_TYPE_NONE;
+        params.cparams_dft = common_context_params_to_llama(params_mtp);
+    }
+
+    params.cparams_dft.mtp         = true;
+    params.cparams_dft.mtp_op_type = MTP_OP_WARMUP;
+    params.cparams_dft.embeddings  = true;
+
+    return true;
+}
+
+common_speculative_init_status common_speculative_try_init(
+        common_params_speculative & params,
+        llama_context             * ctx_tgt,
+        common_speculative      ** out_spec) {
+    if (out_spec != nullptr) {
+        *out_spec = nullptr;
+    }
+
+    if (!params.has_stage_chain()) {
+        return COMMON_SPECULATIVE_INIT_SKIPPED;
+    }
+
+    common_speculative * spec = common_speculative_init(params, ctx_tgt);
+    if (spec != nullptr) {
+        if (out_spec != nullptr) {
+            *out_spec = spec;
+        }
+        return COMMON_SPECULATIVE_INIT_READY;
+    }
+
+    const llama_model * model = ctx_tgt != nullptr ? llama_get_model(ctx_tgt) : nullptr;
+    if (model != nullptr && llama_model_has_recurrent(model)) {
+        return COMMON_SPECULATIVE_INIT_ERR_RECURRENT;
+    }
+    if (params.has_stage_type(COMMON_SPECULATIVE_TYPE_MTP)) {
+        return COMMON_SPECULATIVE_INIT_ERR_MTP;
+    }
+    return COMMON_SPECULATIVE_INIT_ERR_GENERIC;
+}
+
+void common_speculative_prepare_startup(
+        gpt_params & params_base,
+        bool         allow_parallel_mtp) {
+    auto & params = params_base.speculative;
+
+    if (!allow_parallel_mtp && params_base.n_parallel > 1 && params.has_stage_type(COMMON_SPECULATIVE_TYPE_MTP)) {
+        LOG_WRN("%s: MTP is not supported with parallel slots yet, removing the MTP stage to avoid cross-slot corruption. n_parallel=%d, stage_chain=%s\n",
+                __func__, params_base.n_parallel, common_speculative_stage_chain_to_str(params).c_str());
+        params.remove_stage_type(COMMON_SPECULATIVE_TYPE_MTP);
+    }
+
+    if (!params.needs_dft_model()) {
+        params.clear_dft();
+    }
+
+    params_base.has_mtp = params.has_stage_type(COMMON_SPECULATIVE_TYPE_MTP);
+}
+
+bool common_speculative_finalize_startup(
+        gpt_params        & params_base,
+        const llama_model * model) {
+    auto & params = params_base.speculative;
+
+    if (!params.needs_dft_model()) {
+        params.clear_dft();
+    }
+
+    if (params.has_dft()) {
+        LLAMA_LOG_INFO("\n\n==================================loading DRAFT model==================================\n\n");
+        if (!common_speculative_load_draft_model(params, params_base)) {
+            return false;
+        }
+    }
+
+    params_base.has_mtp = params.has_stage_type(COMMON_SPECULATIVE_TYPE_MTP);
+    const bool has_external_mtp = params_base.has_mtp &&
+        llama_model_is_gemma4_mtp_assistant(params.model_dft);
+
+    params_base.has_mtp = common_speculative_prepare_mtp_runtime(
+        params,
+        params_base,
+        model,
+        has_external_mtp);
+    if (params_base.has_mtp) {
+        params_base.pooling_type = LLAMA_POOLING_TYPE_NONE;
+    }
+
+    return true;
+}
+
+bool common_speculative_before_draft(
+        common_speculative * spec,
+        llama_model * model,
+        llama_context * ctx,
+        common_sampler * sampler_src,
+        const common_params_sampling & sparams,
+        llama_seq_id seq_id,
+        llama_pos n_past,
+        llama_token sampled,
+        int max_tokens,
+        int ckpt_mode) {
+    if (spec == nullptr) {
+        return false;
+    }
+
+    return common_speculative_checkpoint_save(
+        spec->checkpoint,
+        model,
+        ctx,
+        sampler_src,
+        sparams,
+        seq_id,
+        n_past,
+        sampled,
+        max_tokens,
+        ckpt_mode);
+}
+
 int32_t common_speculative_on_target_seq_batch(
        common_speculative * spec,
        llama_context * ctx_tgt,
@ -794,6 +1062,234 @@ bool common_speculative_commit_accepted_output(
        hidden_rows);
 }

+static bool common_speculative_checkpoint_save(
+        common_speculative_checkpoint & ckpt,
+        llama_model * model,
+        llama_context * ctx,
+        common_sampler * sampler_src,
+        const common_params_sampling & sparams,
+        llama_seq_id seq_id,
+        llama_pos n_past,
+        llama_token sampled,
+        int max_tokens,
+        int ckpt_mode) {
+    ckpt.clear();
+    ckpt.n_past = n_past;
+    ckpt.sampled = sampled;
+
+    const int actual_mode = llama_spec_ckpt_init(ctx, ckpt_mode, max_tokens);
+    if (actual_mode == LLAMA_SPEC_CKPT_NONE) {
+        return false;
+    }
+    ckpt.per_step_enabled = (actual_mode == LLAMA_SPEC_CKPT_PER_STEP);
+
+    ckpt.valid = llama_spec_ckpt_save(ctx, seq_id);
+    if (!ckpt.valid) {
+        llama_spec_ckpt_discard(ctx);
+        return false;
+    }
+
+    ckpt.sampler = common_sampler_init(model, sparams);
+    if (ckpt.sampler == nullptr) {
+        common_speculative_checkpoint_discard(ckpt, ctx);
+        return false;
+    }
+
+    if (sampler_src != nullptr) {
+        common_sampler_clone(sampler_src, ckpt.sampler);
+    }
+
+    return true;
+}
+
+const common_speculative_checkpoint * common_speculative_get_checkpoint(const common_speculative * spec) {
+    return spec != nullptr ? &spec->checkpoint : nullptr;
+}
+
+void common_speculative_checkpoint_discard(
+        common_speculative_checkpoint & ckpt,
+        llama_context * ctx) {
+    ckpt.clear();
+    llama_spec_ckpt_discard(ctx);
+}
+
+void common_speculative_checkpoint_restore(
+        common_speculative_checkpoint & ckpt,
+        common_speculative * spec,
+        llama_context * ctx,
+        common_sampler * sampler_dst,
+        llama_seq_id seq_id,
+        common_speculative_type spec_type_used,
+        llama_token sampled_before,
+        const std::vector<llama_token> & ids,
+        int n_draft,
+        const std::vector<float> & mtp_hidden_state_pre,
+        int32_t mtp_n_past_base) {
+    if (!ckpt.valid) {
+        return;
+    }
+
+    if (ckpt.per_step_enabled) {
+        const int step = (int) ids.size() - 1;
+        llama_spec_ckpt_restore(ctx, seq_id, ckpt.n_past, step);
+
+        if (ckpt.sampler != nullptr && sampler_dst != nullptr) {
+            common_sampler_clone(ckpt.sampler, sampler_dst);
+        }
+        if (sampler_dst != nullptr) {
+            for (llama_token id : ids) {
+                common_sampler_accept(sampler_dst, ctx, id, true);
+            }
+        }
+
+        if (common_speculative_has_target_features(spec) && !mtp_hidden_state_pre.empty()) {
+            if (!common_speculative_commit_accepted_hidden_rows(
+                    spec,
+                    spec_type_used,
+                    seq_id,
+                    mtp_n_past_base,
+                    sampled_before,
+                    ids,
+                    mtp_hidden_state_pre)) {
+                common_speculative_clear_sequence_hidden(spec, seq_id);
+            } else if (spec_type_used != COMMON_SPECULATIVE_TYPE_MTP) {
+                LOG_DBG("%s: seq_id=%d synced MTP target hidden state from accepted-prefix rows after per-step restore\n",
+                        __func__, (int) seq_id);
+            }
+        }
+
+        LOG_DBG("%s: seq_id=%d per-step restore: step=%d (rejected %d drafts)\n",
+                __func__, (int) seq_id, step, (int) (n_draft - (ids.size() - 1)));
+    } else {
+        llama_spec_ckpt_restore(ctx, seq_id, ckpt.n_past, 0);
+
+        if (ckpt.sampler != nullptr && sampler_dst != nullptr) {
+            common_sampler_clone(ckpt.sampler, sampler_dst);
+        }
+
+        if (!ids.empty()) {
+            const int n_re = (int) ids.size();
+            llama_batch re_batch = llama_batch_init(n_re, 0, 1);
+            common_batch_add(re_batch, ckpt.sampled, ckpt.n_past, { seq_id }, n_re == 1);
+            for (int j = 0; j < n_re - 1; ++j) {
+                common_batch_add(re_batch, ids[j], ckpt.n_past + 1 + j, { seq_id }, j == n_re - 2);
+            }
+
+            if (common_speculative_has_type(spec, COMMON_SPECULATIVE_TYPE_MTP)) {
+                for (int j = 0; j < re_batch.n_tokens; ++j) {
+                    re_batch.logits[j] = true;
+                }
+                llama_set_embeddings(ctx, true);
+            }
+
+            const int ret = llama_decode(ctx, re_batch);
+            if (ret != 0) {
+                LOG_ERR("%s: seq_id=%d failed to re-decode accepted tokens after checkpoint restore: %d\n",
+                        __func__, (int) seq_id, ret);
+            }
+
+            if (common_speculative_has_target_features(spec)) {
+                std::vector<int32_t> redecoded_indices(n_re);
+                for (int j = 0; j < n_re; ++j) {
+                    redecoded_indices[j] = j;
+                }
+
+                if (!common_speculative_commit_accepted_output(
+                        spec,
+                        ctx,
+                        spec_type_used,
+                        seq_id,
+                        ckpt.n_past,
+                        sampled_before,
+                        ids,
+                        redecoded_indices)) {
+                    common_speculative_clear_sequence_hidden(spec, seq_id);
+                }
+            }
+
+            if (sampler_dst != nullptr) {
+                for (llama_token id : ids) {
+                    common_sampler_accept(sampler_dst, ctx, id, true);
+                }
+            }
+
+            llama_batch_free(re_batch);
+            LOG_DBG("%s: seq_id=%d spec checkpoint restored: re-decoded %d tokens (rejected %d drafts)\n",
+                    __func__, (int) seq_id, n_re, (int) (n_draft - (ids.size() - 1)));
+        }
+    }
+
+    common_speculative_checkpoint_discard(ckpt, ctx);
+}
+
+void common_speculative_commit(
+        common_speculative * spec,
+        llama_context * ctx,
+        common_sampler * sampler_dst,
+        llama_seq_id seq_id,
+        llama_token sampled_before,
+        const std::vector<llama_token> & ids,
+        int n_draft,
+        llama_pos pos_base,
+        const std::vector<int32_t> & accepted_output_indices) {
+    GGML_ASSERT(spec != nullptr);
+    GGML_ASSERT(!ids.empty());
+
+    common_speculative_checkpoint & ckpt = spec->checkpoint;
+    const common_speculative_type spec_type_used = spec->curr_impl != nullptr
+        ? spec->curr_impl->type
+        : COMMON_SPECULATIVE_TYPE_NONE;
+    const bool any_rejected = (int) ids.size() - 1 < n_draft;
+    std::vector<float> mtp_hidden_state_pre;
+
+    common_speculative_accept(spec, ids.size() - 1);
+
+    if (common_speculative_has_target_features(spec) &&
+            any_rejected &&
+            ckpt.valid &&
+            !accepted_output_indices.empty()) {
+        if (!common_speculative_copy_output_hidden_rows(spec, ctx, accepted_output_indices, mtp_hidden_state_pre)) {
+            mtp_hidden_state_pre.clear();
+        }
+    }
+
+    if (any_rejected && ckpt.valid) {
+        common_speculative_checkpoint_restore(
+            ckpt,
+            spec,
+            ctx,
+            sampler_dst,
+            seq_id,
+            spec_type_used,
+            sampled_before,
+            ids,
+            n_draft,
+            mtp_hidden_state_pre,
+            pos_base);
+        return;
+    }
+
+    if (common_speculative_has_target_features(spec) && !accepted_output_indices.empty()) {
+        if (!common_speculative_commit_accepted_output(
+                spec,
+                ctx,
+                spec_type_used,
+                seq_id,
+                pos_base,
+                sampled_before,
+                ids,
+                accepted_output_indices)) {
+            common_speculative_clear_sequence_hidden(spec, seq_id);
+        } else if (spec_type_used != COMMON_SPECULATIVE_TYPE_MTP) {
+            LOG_DBG("%s: seq_id=%d synced MTP target hidden state from accepted-prefix rows\n",
+                    __func__, (int) seq_id);
+        }
+    }
+
+    llama_kv_cache_seq_rm(ctx, seq_id, pos_base + (llama_pos) (ids.size() - 1), -1);
+    common_speculative_checkpoint_discard(ckpt, ctx);
+}
+
 void common_speculative_print_stats(const common_speculative * spec, double slot_tps, int n_decoded, int n_past, common_params_speculative * active_params) {
    if (spec == nullptr) {
        return;
@ -1592,6 +2088,50 @@ void common_speculative_clear_sequence_hidden(common_speculative * spec, llama_s
    }
 }

+void common_speculative_clear_sequence(
+        common_speculative * spec,
+        llama_seq_id seq_id,
+        bool clear_companion_ctx) {
+    if (spec != nullptr) {
+        spec->checkpoint.clear();
+        spec->curr_impl = nullptr;
+        spec->last_n_drafted = 0;
+        spec->t_step_start_us = 0;
+    }
+
+    common_speculative_clear_sequence_hidden(spec, seq_id);
+
+    if (clear_companion_ctx) {
+        if (auto * ctx_mtp = common_speculative_get_companion_ctx(spec); ctx_mtp != nullptr) {
+            llama_kv_cache_clear(ctx_mtp);
+        }
+    }
+}
+
+bool common_speculative_trim_sequence(
+        common_speculative * spec,
+        llama_context * ctx,
+        llama_seq_id seq_id,
+        llama_pos pos_begin) {
+    const bool target_trimmed = llama_kv_cache_seq_rm(ctx, seq_id, pos_begin, -1);
+    if (auto * ctx_mtp = common_speculative_get_companion_ctx(spec); ctx_mtp != nullptr) {
+        return target_trimmed && llama_kv_cache_seq_rm(ctx_mtp, seq_id, pos_begin, -1);
+    }
+
+    return target_trimmed;
+}
+
+void common_speculative_clear_sequence_kv(
+        common_speculative * spec,
+        llama_context * ctx,
+        llama_seq_id seq_id) {
+    common_speculative_clear_sequence(spec, seq_id);
+    llama_kv_cache_seq_rm(ctx, seq_id, -1, -1);
+    if (auto * ctx_mtp = common_speculative_get_companion_ctx(spec); ctx_mtp != nullptr) {
+        llama_kv_cache_seq_rm(ctx_mtp, seq_id, -1, -1);
+    }
+}
+
 llama_context * common_speculative_get_companion_ctx(common_speculative * spec) {
    if (auto * mtp_state = common_speculative_get_mtp_state(spec); mtp_state != nullptr) {
        return mtp_state->ctx_mtp;
@ -1858,13 +2398,10 @@ std::vector<llama_token> mtp_speculative_gen_draft(
    // This prevents cache state corruption where two cells map to the same logical position.
    // If the state contained in `last` had a valid token id and probability, it means that we
    // have previously run an "accept" batch, where the token sampled from the main model was included.
-    // In that case, we need to discard all tokens that we ran here to get the KV cache to the correct state.
-    //   => for i0 = 1 we discard from n_past
-    // But if we did not have a valid last token_id, it means the first token we run was sampled from the
-    // main model. Hence we want to keep this token in the KV cache and discard all other tokens.
-    //   => for i0 = 0 we discard from n_past + 1
+    // Even in that case, the token at `n_past` is already committed and must remain in the KV cache,
+    // so we only discard the speculative tail starting at `n_past + 1`.
    if (n_decode > 0) {
-        llama_kv_cache_seq_rm(ctx, seq_id, n_past + 1 - i0, n_past + n_decode + 2);
+        llama_kv_cache_seq_rm(ctx, seq_id, n_past + 1, n_past + n_decode + 2);
    }

    return drafts;
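
Review note on `common_speculative_checkpoint_restore` above: in the non-per-step branch, the accepted prefix is re-decoded as one batch that starts with the checkpointed `sampled` token at `ckpt.n_past` and omits the last entry of `ids`. The following is a minimal Python model of that batch layout and of the rejected-draft count used in the debug logs. It is a sketch for reasoning about positions only; `build_redecode_batch` and `n_rejected` are hypothetical helper names, not part of the codebase, and the `all_logits` flag stands in for the MTP case where every row requests output.

```python
def build_redecode_batch(sampled, n_past, ids, all_logits=False):
    """Return (token, position, wants_logits) triples for the re-decode batch.

    Mirrors the C++ loop: the checkpointed `sampled` token goes first at
    `n_past`, followed by every accepted id except the last one.
    """
    n_re = len(ids)
    if n_re == 0:
        return []  # the C++ code skips re-decoding when ids is empty
    tokens = [sampled] + list(ids[:-1])
    batch = []
    for j, tok in enumerate(tokens):
        is_last = (j == n_re - 1)  # only the final row needs logits by default
        batch.append((tok, n_past + j, all_logits or is_last))
    return batch


def n_rejected(n_draft, ids):
    """Drafts rejected by the target: ids holds the accepted drafts plus one sampled token."""
    return n_draft - (len(ids) - 1)
```

For example, with `sampled=7`, `n_past=10`, and `ids=[1, 2, 3]`, the batch holds tokens `7, 1, 2` at positions `10, 11, 12`, and only the last row requests logits.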
--- a/common/speculative.h
+++ b/common/speculative.h
@ -7,6 +7,14 @@

 struct common_speculative;

+enum common_speculative_init_status {
+    COMMON_SPECULATIVE_INIT_SKIPPED,
+    COMMON_SPECULATIVE_INIT_READY,
+    COMMON_SPECULATIVE_INIT_ERR_RECURRENT,
+    COMMON_SPECULATIVE_INIT_ERR_MTP,
+    COMMON_SPECULATIVE_INIT_ERR_GENERIC,
+};
+
 using common_speculative_feature_kind = llama_spec_feature_kind;
 using common_speculative_feature_row_view = llama_spec_feature_row_view;
 using common_speculative_feature_view = llama_spec_feature_view;
@ -14,6 +22,21 @@ using common_speculative_feature_view = llama_spec_feature_view;
 static constexpr common_speculative_feature_kind COMMON_SPECULATIVE_FEATURE_NONE = LLAMA_SPEC_FEATURE_NONE;
 static constexpr common_speculative_feature_kind COMMON_SPECULATIVE_FEATURE_HIDDEN_STATE = LLAMA_SPEC_FEATURE_HIDDEN_STATE;

+struct common_speculative_checkpoint {
+    bool valid = false;
+    bool per_step_enabled = false;
+    llama_pos n_past = 0;
+    llama_token sampled = LLAMA_TOKEN_NULL;
+    common_sampler * sampler = nullptr;
+
+    void clear();
+};
+
+struct common_speculative_draft_result {
+    llama_tokens tokens;
+    common_speculative_type type = COMMON_SPECULATIVE_TYPE_NONE;
+};
+
 // comma separated list of all types
 std::string common_speculative_type_name_str();

@ -31,6 +54,29 @@ common_speculative * common_speculative_init(
        common_params_speculative & params,
        llama_context             * ctx_tgt);

+common_speculative_init_status common_speculative_try_init(
+        common_params_speculative & params,
+        llama_context             * ctx_tgt,
+        common_speculative      ** out_spec);
+
+void common_speculative_prepare_startup(
+        gpt_params & params_base,
+        bool         allow_parallel_mtp = true);
+
+bool common_speculative_finalize_startup(
+        gpt_params        & params_base,
+        const llama_model * model);
+
+bool common_speculative_load_draft_model(
+        common_params_speculative & params,
+        const gpt_params         & params_base);
+
+bool common_speculative_prepare_mtp_runtime(
+        common_params_speculative & params,
+        const gpt_params         & params_base,
+        const llama_model        * model,
+        bool                       has_external_mtp);
+
 void common_speculative_free(common_speculative * spec);

 // optionally call once at the beginning of a new generation
@ -46,9 +92,30 @@ llama_tokens common_speculative_draft(
                            llama_pos     draft_base_pos = -1,
                            llama_seq_id  draft_seq_id = 0);

+common_speculative_draft_result common_speculative_draft_ex(
+                     common_speculative * spec,
+                     llama_context * ctx,
+                     common_params_speculative & params,
+                     const llama_tokens & prompt,
+                            llama_token   id_last,
+                            llama_pos     draft_base_pos = -1,
+                            llama_seq_id  draft_seq_id = 0);
+
 // informs the speculative decoder that n_accepted tokens were accepted by the target model
 void common_speculative_accept(common_speculative * spec, uint16_t n_accepted);

+bool common_speculative_before_draft(
+    common_speculative * spec,
+    llama_model * model,
+    llama_context * ctx,
+    common_sampler * sampler_src,
+    const common_params_sampling & sparams,
+    llama_seq_id seq_id,
+    llama_pos n_past,
+    llama_token sampled,
+    int max_tokens,
+    int ckpt_mode);
+
 bool common_speculative_ensure_sequence_hidden(
    common_speculative * spec,
    llama_context * ctx,
@ -87,10 +154,56 @@ bool common_speculative_commit_accepted_output(
    const std::vector<llama_token> & ids,
    const std::vector<int32_t> & output_indices);

+const common_speculative_checkpoint * common_speculative_get_checkpoint(const common_speculative * spec);
+
+void common_speculative_checkpoint_discard(
+    common_speculative_checkpoint & ckpt,
+    llama_context * ctx);
+
+void common_speculative_checkpoint_restore(
+    common_speculative_checkpoint & ckpt,
+    common_speculative * spec,
+    llama_context * ctx,
+    common_sampler * sampler_dst,
+    llama_seq_id seq_id,
+    common_speculative_type spec_type_used,
+    llama_token sampled_before,
+    const std::vector<llama_token> & ids,
+    int n_draft,
+    const std::vector<float> & mtp_hidden_state_pre,
+    int32_t mtp_n_past_base);
+
+void common_speculative_commit(
+    common_speculative * spec,
+    llama_context * ctx,
+    common_sampler * sampler_dst,
+    llama_seq_id seq_id,
+    llama_token sampled_before,
+    const std::vector<llama_token> & ids,
+    int n_draft,
+    llama_pos pos_base,
+    const std::vector<int32_t> & accepted_output_indices);
+
 bool common_speculative_has_sequence_hidden(const common_speculative * spec, llama_seq_id seq_id);

 void common_speculative_clear_sequence_hidden(common_speculative * spec, llama_seq_id seq_id);

+void common_speculative_clear_sequence(
+    common_speculative * spec,
+    llama_seq_id seq_id,
+    bool clear_companion_ctx = false);
+
+bool common_speculative_trim_sequence(
+    common_speculative * spec,
+    llama_context * ctx,
+    llama_seq_id seq_id,
+    llama_pos pos_begin);
+
+void common_speculative_clear_sequence_kv(
+    common_speculative * spec,
+    llama_context * ctx,
+    llama_seq_id seq_id);
+
 llama_context * common_speculative_get_companion_ctx(common_speculative * spec);

 int32_t common_speculative_on_target_seq_batch(
--- a/common/suffix-tree.cpp
+++ b/common/suffix-tree.cpp
@ -209,7 +209,7 @@ static bool suffix_corpus_check_limit(const std::string & path, size_t n_tokens,
        return true;
    }

-    LOG_ERR("load_corpus: refusing suffix corpus '%s' - estimated insert work %llu exceeds limit %llu (tokens=%zu, depth=%d); reduce corpus size or --suffix-max-depth\n",
+    LOG_ERR("load_corpus: refusing suffix corpus '%s' - estimated insert work %llu exceeds limit %llu (tokens=%zu, depth=%d); reduce corpus size or lower suffix_max_depth inside --spec-type suffix:suffix_max_depth=...\n",
            path.c_str(),
            (unsigned long long) estimated_work,
            (unsigned long long) SUFFIX_CORPUS_MAX_INSERT_WORK,
--- a/convert_hf_to_gguf.py
+++ b/convert_hf_to_gguf.py
@ -2519,6 +2519,144 @@ class DFlashDraftModel(Qwen3Model):
        return tensors


+@Model.register("MellumForCausalLM")
+class MellumModel(Model):
+    model_arch = gguf.MODEL_ARCH.MELLUM
+
+    def set_vocab(self):
+        tokenizer_path = self.dir_model / "tokenizer.json"
+        with open(tokenizer_path, "r", encoding="utf-8") as f:
+            tokenizer_json = json.load(f)
+
+        from tokenizers import Tokenizer
+        tokenizer = Tokenizer.from_file(str(tokenizer_path))
+
+        class TokenizerShim:
+            def encode(self, text: str) -> list[int]:
+                return tokenizer.encode(text).ids
+
+        vocab: dict[str, int] = tokenizer_json["model"]["vocab"]
+        vocab_size = self.hparams.get("vocab_size", len(vocab))
+        assert max(vocab.values()) < vocab_size
+
+        tokpre = self.get_vocab_base_pre(TokenizerShim())
+        reverse_vocab = {id_: encoded_tok for encoded_tok, id_ in vocab.items()}
+        added_vocab = {
+            item["content"]: item
+            for item in tokenizer_json.get("added_tokens", [])
+            if isinstance(item.get("content"), str)
+        }
+
+        tokens: list[str] = []
+        toktypes: list[int] = []
+        for i in range(vocab_size):
+            if i not in reverse_vocab:
+                tokens.append(f"[PAD{i}]")
+                toktypes.append(gguf.TokenType.UNUSED)
+                continue
+
+            token = reverse_vocab[i]
+            added_token = added_vocab.get(token)
+            if added_token is not None:
+                if added_token.get("special", False) or self.does_token_look_special(token):
+                    toktypes.append(gguf.TokenType.CONTROL)
+                else:
+                    token = token.replace("\u2581", " ")
+                    toktypes.append(gguf.TokenType.USER_DEFINED)
+            else:
+                toktypes.append(gguf.TokenType.NORMAL)
+            tokens.append(token)
+
+        self.gguf_writer.add_tokenizer_model("gpt2")
+        self.gguf_writer.add_tokenizer_pre(tokpre)
+        self.gguf_writer.add_token_list(tokens)
+        self.gguf_writer.add_token_types(toktypes)
+
+        special_vocab = gguf.SpecialVocab(self.dir_model, load_merges=True)
+        special_vocab.add_to_gguf(self.gguf_writer)
+
+    def set_gguf_parameters(self):
+        super().set_gguf_parameters()
+        if self.hparams.get("num_local_experts") is None and (n_experts := self.hparams.get("num_experts")) is not None:
+            self.gguf_writer.add_expert_count(n_experts)
+
+        if (moe_intermediate_size := self.hparams.get("moe_intermediate_size")) is not None:
+            self.gguf_writer.add_expert_feed_forward_length(moe_intermediate_size)
+            logger.info(f"gguf: expert feed forward length = {moe_intermediate_size}")
+
+        use_sliding_window = self.hparams.get("use_sliding_window")
+        sliding_window = self.hparams.get("sliding_window")
+        if (use_sliding_window is True or use_sliding_window is None) and sliding_window is not None:
+            self.gguf_writer.add_sliding_window(sliding_window)
+            logger.info(f"gguf: sliding window = {sliding_window}")
+            self.gguf_writer.add_sliding_window_pattern([t == "sliding_attention" for t in self.hparams["layer_types"]])
+            logger.info(f"gguf: sliding window pattern length = {len(self.hparams['layer_types'])}")
+
+        rope_parameters = self.hparams.get("rope_parameters", {})
+        if full_attention_rope := rope_parameters.get("full_attention"):
+            if rope_theta := full_attention_rope.get("rope_theta"):
+                self.gguf_writer.add_rope_freq_base(rope_theta)
+                logger.info(f"gguf: rope freq base = {rope_theta}")
+
+            if full_attention_rope.get("rope_type") == "yarn":
+                self.gguf_writer.add_rope_scaling_type(gguf.RopeScalingType.YARN)
+
+                if factor := full_attention_rope.get("factor"):
+                    self.gguf_writer.add_rope_scaling_factor(factor)
+                if original_context_length := full_attention_rope.get("original_max_position_embeddings"):
+                    self.gguf_writer.add_rope_scaling_orig_ctx_len(original_context_length)
+                if attention_factor := full_attention_rope.get("attention_factor"):
+                    self.gguf_writer.add_rope_scaling_yarn_attn_factor(attention_factor)
+                if beta_fast := full_attention_rope.get("beta_fast"):
+                    self.gguf_writer.add_rope_scaling_yarn_beta_fast(beta_fast)
+                if beta_slow := full_attention_rope.get("beta_slow"):
+                    self.gguf_writer.add_rope_scaling_yarn_beta_slow(beta_slow)
+
+        if sliding_attention_rope := rope_parameters.get("sliding_attention"):
+            if rope_theta_swa := sliding_attention_rope.get("rope_theta"):
+                self.gguf_writer.add_rope_freq_base_swa(rope_theta_swa)
+                logger.info(f"gguf: rope freq base swa = {rope_theta_swa}")
+
+    _experts: list[dict[str, Tensor]] | None = None
+
+    def modify_tensors(self, data_torch: Tensor, name: str, bid: int | None) -> Iterable[tuple[str, Tensor]]:
+        if "experts" in name:
+            n_experts = self.find_hparam(["num_local_experts", "num_experts"])
+            assert bid is not None
+
+            if self._experts is None:
+                self._experts = [{} for _ in range(self.block_count)]
+
+            self._experts[bid][name] = data_torch
+
+            if len(self._experts[bid]) >= n_experts * 3:
+                tensors: list[tuple[str, Tensor]] = []
+
+                for w_name in ["down_proj", "gate_proj", "up_proj"]:
+                    datas: list[Tensor] = []
+
+                    for xid in range(n_experts):
+                        ename = f"model.layers.{bid}.mlp.experts.{xid}.{w_name}.weight"
+                        datas.append(self._experts[bid][ename])
+                        del self._experts[bid][ename]
+
+                    data_torch = torch.stack(datas, dim=0)
+                    merged_name = f"model.layers.{bid}.mlp.experts.{w_name}.weight"
+                    tensors.append((self.map_tensor_name(merged_name), data_torch))
+                return tensors
+            return []
+
+        return [(self.map_tensor_name(name), data_torch)]
+
+    def prepare_tensors(self):
+        super().prepare_tensors()
+
+        if self._experts is not None:
+            experts = [k for d in self._experts for k in d.keys()]
+            if len(experts) > 0:
+                raise ValueError(f"Unprocessed experts: {experts}")
+
+
@Model.register("Ernie4_5_ForCausalLM", "Ernie4_5ForCausalLM")
 class Ernie4_5Model(Model):
    model_arch = gguf.MODEL_ARCH.ERNIE4_5
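
Review note on `MellumModel.modify_tensors` above: per-expert weights are buffered per block until all `n_experts * 3` tensors have arrived, then merged in the fixed order `down_proj`, `gate_proj`, `up_proj`. The sketch below models that buffering with plain Python lists in place of `torch.stack`, so it runs without PyTorch. `collect_expert` is a hypothetical name used for illustration only, and the tensor values are placeholders.

```python
def collect_expert(pending, name, tensor, bid, n_experts):
    """Buffer one expert tensor; return merged (name, stacked) pairs once complete.

    `pending` is the per-block dict of tensors seen so far. A list of tensors
    in expert order stands in for torch.stack(datas, dim=0).
    """
    pending[name] = tensor
    if len(pending) < n_experts * 3:
        return []  # still waiting for more expert tensors in this block

    merged = []
    for w_name in ("down_proj", "gate_proj", "up_proj"):
        stacked = [
            pending.pop(f"model.layers.{bid}.mlp.experts.{xid}.{w_name}.weight")
            for xid in range(n_experts)
        ]
        merged.append((f"model.layers.{bid}.mlp.experts.{w_name}.weight", stacked))
    return merged
```

Because each merged tensor is popped from `pending`, a block that was fully merged leaves an empty dict behind, which is what the `prepare_tensors` check for unprocessed experts relies on.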
--- a/convert_hf_to_gguf_update.py
+++ b/convert_hf_to_gguf_update.py
@ -105,6 +105,7 @@ models = [
    {"name": "kimi-k2",        "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/moonshotai/Kimi-K2-Base", "chkhsh": "81212dc7cdb7e0c1074ca62c5aeab0d43c9f52b8a737be7b12a777c953027890", },
    {"name": "grok-2",         "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/alvarobartt/grok-2-tokenizer", "chkhsh": "66b8d4e19ab16c3bfd89bce5d785fb7e0155e8648708a1f42077cb9fe002c273"},
    {"name": "minimax-m2",     "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/MiniMaxAI/MiniMax-M2", },
+    {"name": "mellum2",        "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/JetBrains/Mellum2-12B-A2.5B-Base", },
 ]


--- a/docs/autoparser.md
+++ b/docs/autoparser.md
@ -69,6 +69,7 @@ Three outcomes for reasoning-prefill handling (in `generate_parser()`):
 | `PLAIN`                  | No content markers                                             |
 | `ALWAYS_WRAPPED`         | Content always wrapped: `<response>...</response>`             |
 | `WRAPPED_WITH_REASONING` | Content wrapped only when reasoning is present                 |
+| `END_DELIMITED`          | Content has no start marker but ends at a marker               |

 **`tool_format`**: Classification of tool call structure.

@ -357,6 +358,7 @@ A workaround array in `common/chat-diff-analyzer.cpp` applies post-hoc patches a
 3. **Cohere Command R+** — source contains `<|CHATBOT_TOKEN|>`: sets `ALWAYS_WRAPPED` content mode if no content start is already set
 4. **Functionary 3.1** — source contains `set has_code_interpreter`: forces `PLAIN` content, specific `per_call_start/end`, clears preserved tokens to only keep Functionary-specific markers
 5. **DeepSeek-R1-Distill-Qwen** — source contains `tool▁calls▁begin` markers: overrides tool section/per-call markers with the correct Unicode block characters
+6. **Poolside Laguna** — source contains `laguna_glm_thinking` and the Laguna generation prompt pattern: sets delimiter-style reasoning ending at `</think>` and `END_DELIMITED` content ending at `</assistant>`

 ### Parser Building

@ -380,6 +382,7 @@ Note: The start marker may be empty either because the analyzer detected delimit
 | Tools present                          | Dispatches to `analyze_tools::build_parser()`                                   |
 | `ALWAYS_WRAPPED` with reasoning        | `reasoning + start + content(until(end)) + end + end()`                         |
 | `ALWAYS_WRAPPED` without reasoning     | `content(until(start)) + start + content(until(end)) + end + end()`             |
+| `END_DELIMITED`                        | `reasoning + content(until(end) or rest()) + optional end marker + end()`       |
 | Default (PLAIN)                        | `reasoning + content(rest()) + end()`                                           |

 #### Tool Parsers (`analyze_tools::build_parser`)
@ -392,7 +395,7 @@ Dispatches by `format.mode`:
 - `build_json_tools_nested_keys()` — nested: `{"function": {"name": "X", "arguments": {...}}}`
 - `build_json_tools_flat_keys()` — flat: `{"name": "X", "arguments": {...}}`

-Handles content wrappers, array wrapping (`tools_array_wrapped`), parallel calls, and `parameter_order`.
+Handles content wrappers, array wrapping (`tools_array_wrapped`), parallel calls, and `parameter_order`. If content is `END_DELIMITED`, the content end marker is also accepted after parsed tool calls.

 **`build_tool_parser_tag_json()`**: For each tool function:

@ -417,7 +420,7 @@ For closing: uses `function.close` if present; otherwise uses `peek(per_call_end
 All three tool parsers return:

 ```text
-reasoning + optional(content(until(trigger_marker))) + tool_calls + end()
+reasoning + optional(content(until(trigger_marker))) + tool_calls + optional(content_end) + end()
 ```

 Each returned parser is wrapped by `wrap_for_generation_prompt()`, which prepends a literal for any boilerplate prefix of the generation prompt (the portion before the reasoning start marker).
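
Review note on the `END_DELIMITED` row added above: the content rule `content(until(end) or rest()) + optional end marker + end()` can be read as "take everything up to the end marker if it appears, otherwise take the rest of the input". A minimal Python sketch of that reading follows. `parse_end_delimited` is a hypothetical helper, not the analyzer's API, and treating trailing text after the marker as an error is an assumption about what the final `end()` implies.

```python
def parse_end_delimited(text, end_marker):
    """Return the content of an END_DELIMITED response.

    Content has no start marker; it runs until `end_marker` or, if the
    marker is absent, to the end of the input.
    """
    content, sep, trailing = text.partition(end_marker)
    if sep and trailing:
        # Assumption: end() must match right after the optional end marker.
        raise ValueError("unexpected text after content end marker")
    return content
```

With the Laguna marker from the workaround list, `parse_end_delimited("Hello</assistant>", "</assistant>")` and `parse_end_delimited("Hello", "</assistant>")` both yield `Hello`.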
--- a/docs/parameters.md
+++ b/docs/parameters.md
@ -1,6 +1,6 @@
 # Parameters Documentation

-Overview of the most common command-line parameters in `ik_llama.cpp` and some info how to use them.
+Overview of the most common command-line parameters in `ik_llama.cpp`, with notes on how to use them. It is not exhaustive and may omit some available options.

 ## Table of Contents

@ -58,6 +58,10 @@ Some often used terms.
 | t/s | Token/second, measures PP and TG. |
 | full gpu | All processes offloaded to the GPU. |
 | hybrid cpu/gpu | Partial offload to the GPU. |
+| RAG | Retrieval-Augmented Generation. Provides external documents to the LLM for information lookup. |
+| MCP | Model Context Protocol, an [open standard](https://en.wikipedia.org/wiki/Model_Context_Protocol) for how artificial intelligence (AI) systems such as large language models (LLMs) integrate and share data with external tools, systems, and data sources. |
+| AI agent | Tool/program that uses an LLM to achieve a goal/task via a series of planning steps, actions, tool calls, etc. `Coding agents` are specialized in software goals. |
+| Agent harness | The tools and infrastructure around the LLM in an AI agent. `AI Agent = LLM + Agent harness` |

 ## General Parameters

@ -65,7 +69,8 @@ Some often used terms.
 | - | - | - | - |
 | `-h, --help, --usage` | Print usage and exit | - | - |
 | `--fit` | Automatically fit to available VRAM | off | Loads as many tensors to the GPU(s) as available VRAM will permit. [PR 1501](https://github.com/ikawrakow/ik_llama.cpp/pull/1501) [PR 1504](https://github.com/ikawrakow/ik_llama.cpp/pull/1504) |
-| `--fit-margin N` | Safety VRAM margin in MiB when using `--fit` | 1024 | Increase this value in case of CUDA OOM when loading the model. Decrease to less than 1024 if the model loads successfully and you feel that too much VRAM has been left unused |  
+| `--fit-margin N` | Safety VRAM margin in MiB when using `--fit` | 1024 | Increase this value in case of CUDA OOM when loading the model. Decrease to less than 1024 if the model loads successfully and you feel that too much VRAM has been left unused |
+| `--gpu-fit-margin GPU1,M1,...` | Per-GPU fit margin | - | Set the fit margin per GPU when auto-fitting the model. [PR 1872](https://github.com/ikawrakow/ik_llama.cpp/pull/1872) |
 | `-wgt, --worst-graph-tokens N` | Number of tokens to use for worst-case graph | - | Control compute buffer sizes for large batches. Provided "as is" for users that understand the limitations, please don't open issues when using this. [PR 1560](https://github.com/ikawrakow/ik_llama.cpp/pull/1560) |
 | `-t, --threads N` | Number of threads to use during generation | 4 | Try to match the number of physical CPU cores. Avoid odd numbers (e.g. 1,3,...). |
 | `-tb, --threads-batch N` | Number of threads to use during batch and prompt processing | Same as `--threads` | Same as `--threads` When doing full GPU offload, use a lower number (e.g. 2) |
@ -80,7 +85,7 @@ Some often used terms.
 | `--minilog` | Print important information | - | For `llama-server`, log request message for completions/response/anthropic and response. The prompt in the json format and the text response are saved in the log file and printed to the console. [PR 1477](https://github.com/ikawrakow/ik_llama.cpp/pull/1477) |
 | `-fa, --flash-attn` | Enables Flash Attention | on | auto / on / off Improves t/s and reduces memory usage. |
 | `--no-fa, --no-flash-attn` | Disable Flash Attention |  | Alternative parameter to turn of FA. See `--flash-attn` |
-| `-mla, --mla-use` | Enable MLA | 3 | 0 / 1 / 2 / 3 For DeepSeek models, and other recent models that are using MLA. [PR 188](https://github.com/ikawrakow/ik_llama.cpp/pull/188) [PR 205](https://github.com/ikawrakow/ik_llama.cpp/pull/205) [PR 235](https://github.com/ikawrakow/ik_llama.cpp/pull/235) [PR 243](https://github.com/ikawrakow/ik_llama.cpp/pull/243) [PR 252](https://github.com/ikawrakow/ik_llama.cpp/pull/252) [PR 253](https://github.com/ikawrakow/ik_llama.cpp/pull/253) [PR 273](https://github.com/ikawrakow/ik_llama.cpp/pull/273) [PR 386](https://github.com/ikawrakow/ik_llama.cpp/pull/386) [PR 497](https://github.com/ikawrakow/ik_llama.cpp/pull/497) [PR 943](https://github.com/ikawrakow/ik_llama.cpp/pull/943)|
+| `-mla, --mla-use` | Enable MLA | 3 | 0 / 1 / 2 / 3 For DeepSeek models, and other recent models that are using MLA. [PR 188](https://github.com/ikawrakow/ik_llama.cpp/pull/188) [PR 205](https://github.com/ikawrakow/ik_llama.cpp/pull/205) [PR 235](https://github.com/ikawrakow/ik_llama.cpp/pull/235) [PR 243](https://github.com/ikawrakow/ik_llama.cpp/pull/243) [PR 252](https://github.com/ikawrakow/ik_llama.cpp/pull/252) [PR 253](https://github.com/ikawrakow/ik_llama.cpp/pull/253) [PR 273](https://github.com/ikawrakow/ik_llama.cpp/pull/273) [PR 386](https://github.com/ikawrakow/ik_llama.cpp/pull/386) [PR 497](https://github.com/ikawrakow/ik_llama.cpp/pull/497) [PR 943](https://github.com/ikawrakow/ik_llama.cpp/pull/943) [PR 1821](https://github.com/ikawrakow/ik_llama.cpp/pull/1821) |
 | `-amb, --attention-max-batch` | Max batch size for attention computations | 0 | Specifies the maximum K*Q size in MB we want to tolerate. [PR 237](https://github.com/ikawrakow/ik_llama.cpp/pull/237) |
 | `-fmoe or --fused-moe` | Fused MoE ffn_up and ffn_gate | - | Speedup for MoE models. [PR 229](https://github.com/ikawrakow/ik_llama.cpp/pull/229) |
 | `--no-fmoe, --no-fused-moe` | Disable fused MoE | Enabled | See `--fused-moe` |
@ -100,6 +105,7 @@ Some often used terms.
 | `--no-warmup` | Skip warming up the model with an empty run | - |  |
 | `--mlock` | Force system to keep model in RAM rather than swapping or compressing | - |  |
 | `--no-mmap` | Do not memory-map model (slower load but may reduce pageouts) | - |  |
+| `--ui-mcp-proxy, --webui-mcp-proxy` | Experimental: enable the MCP CORS proxy; do not enable in untrusted environments | disabled | Adds CORS proxy support on the llama-server backend side. Required for external MCP servers to work with the llama.cpp web UI. [PR 1904](https://github.com/ikawrakow/ik_llama.cpp/pull/1904) |
 | `--defer-experts` | Defer expert mmap residency on Linux to reduce model load time | false | Using this flag, expert tensor pages are faulted in on demand rather than being eagerly loaded during initialization. This allows us to reduce cold-start latency, thus improving the load time of MoE models, particularly on systems where users are running models off of storage. [PR 1634](https://github.com/ikawrakow/ik_llama.cpp/pull/1634) |
 | `-rtr, --run-time-repack` | Repack tensors if interleaved variant is available | - | May improve performance on some systems. [PR 147](https://github.com/ikawrakow/ik_llama.cpp/pull/147) |
 | `--ctx-checkpoints` | set the number of checkpoints per slot | - | enable checkpoint for recurrent models Qwen3-Next and Qwen3.5-MoE. [PR 1310](https://github.com/ikawrakow/ik_llama.cpp/pull/1310) |
@ -120,21 +126,13 @@ Check the details [here](./speculative.md).
 | `-ctkd, --cache-type-k-draft TYPE` | KV cache data type for K for the draft model | - | For draft model, see: `-ctk` |
 | `-ctvd, --cache-type-v-draft TYPE` | KV cache data type for V for the draft model | - | For draft model, see: `-ctk` |
 | `-draft, --draft-params` | Comma-separated list of draft model parameters | - |  |
-| `--spec-ngram-size-n N` | ngram size N for ngram-simple/ngram-map speculative decoding, length of lookup n-gram| 12 | [PR 1261](https://github.com/ikawrakow/ik_llama.cpp/pull/1261) |
-| `--spec-ngram-size-m N` | ngram size M for ngram-simple/ngram-map speculative decoding, length of draft m-gram | 48 | [PR 1261](https://github.com/ikawrakow/ik_llama.cpp/pull/1261) |
-| `--spec-ngram-min-hits N` | minimum hits for ngram-map speculative decoding | 1 | [PR 1261](https://github.com/ikawrakow/ik_llama.cpp/pull/1261) |
-| `--spec-type Name` | Comma-separated list of draft model parameters | - | none / ngram - cache / ngram - simple / ngram - map - k / ngram - map - k4v / ngram - mod / suffix [PR 1261](https://github.com/ikawrakow/ik_llama.cpp/pull/1261) [PR 1646](https://github.com/ikawrakow/ik_llama.cpp/pull/1646) |
-| `--spec-stage SPEC[:k=v,...]` | Add an explicit speculative stage; repeat once for a supported two-stage chain | - | Supported two-stage shape: self-spec first, then `mtp` or `draft` fallback. [PR 1789](https://github.com/ikawrakow/ik_llama.cpp/pull/1789) |
-| `-mtp, --multi-token-prediction` |  | - | MTP decoding [PR 1270](https://github.com/ikawrakow/ik_llama.cpp/pull/1270) [1698](https://github.com/ikawrakow/ik_llama.cpp/pull/1698) |
-| `-no-mtp, --no-multi-token-prediction` |  | - | MTP decoding [PR 1270](https://github.com/ikawrakow/ik_llama.cpp/pull/1270) [1698](https://github.com/ikawrakow/ik_llama.cpp/pull/1698) |
-| `--draft-max` |  | - | MTP decoding [PR 1270](https://github.com/ikawrakow/ik_llama.cpp/pull/1270) [1698](https://github.com/ikawrakow/ik_llama.cpp/pull/1698) |
-| `--draft-p-min` |  | - | MTP decoding [PR 1270](https://github.com/ikawrakow/ik_llama.cpp/pull/1270) [1698](https://github.com/ikawrakow/ik_llama.cpp/pull/1698) |
+| `--spec-type SPEC[:k=v,...]` | Canonical speculative stage entry; repeat to configure the supported two-stage chain | - | Types: `none`, `draft`, `mtp`, `ngram-cache`, `ngram-simple`, `ngram-map-k`, `ngram-map-k4v`, `ngram-mod`, `suffix`. Canonical keys: `n_max`, `n_min`, `p_min`, `ngram_size_n`, `ngram_size_m`, `ngram_min_hits`, `suffix_min_match_len`, `suffix_max_depth`, `suffix_corpus`. String values may escape commas as `\,` or quote the value inside the stage payload. Example: `--spec-type ngram-mod:n_max=64,n_min=2,ngram_size_n=8 --spec-type mtp:n_max=1,p_min=0.0` |
 | `--spec-autotune` | Automatically tune speculative params to maximize tokens/sec | - | Automatically determines the near-optimal arguments for the type of speculation being performed [PR 1595](https://github.com/ikawrakow/ik_llama.cpp/pull/1595) |
 | `--recurrent-ckpt-mode MODE` | Checkpoint strategy for recurrent/hybrid speculative decoding | auto | One of: - `auto` auto-select: per-step if CUDA full-GPU, gpu-fallback otherwise - `per-step` save SSM state per draft step in VRAM; no re-decode on rejection - `gpu-fallback` copy state to GPU buffer; re-decode on rejection - `cpu` serialise state via llama_state_seq; re-decode on rejection [PR 1669](https://github.com/ikawrakow/ik_llama.cpp/pull/1669) [PR 1774](https://github.com/ikawrakow/ik_llama.cpp/pull/1774) |

 Notes:

- `--spec-type` cannot be combined with `--spec-stage`.
+- Legacy `--spec-stage`, `--draft-*`, `--spec-ngram-*`, `--suffix-*`, and `-mtp` flags are rejected with replacement guidance.
 - Explicit stage chains currently support at most two stages.
 - Supported self-spec stage names are `ngram-cache`, `ngram-simple`, `ngram-map-k`, `ngram-map-k4v`, `ngram-mod`, and `suffix`.
 - Composite stage chains disable speculative autotune.
@ -163,7 +161,7 @@ Good overview on [kalomaze/llm_samplers_explained.md](https://gist.github.com/ka
 | `--sampling-seq SEQUENCE` | Simplified sequence for samplers | dkfypmxntw | Same as `--samplers`, just shorter format. |
 | `--banned-string-file` | File path of the list of banned strings on each line |  |  |
 | `--banned-n` | Number of tokens banned in the phrase during rewind. | -1 | -1 means all tokens [PR 1185](https://github.com/ikawrakow/ik_llama.cpp/pull/1185) |
-| `--expiring-logit-bias-file FILENAME` | Load bias states from a custom file format | - | [PR 1731](https://github.com/ikawrakow/ik_llama.cpp/pull/1731) |
+| `--expiring-logit-bias-file FILENAME` | Load bias states from a custom file format | - | [PR 1731](https://github.com/ikawrakow/ik_llama.cpp/pull/1731) [PR 1770](https://github.com/ikawrakow/ik_llama.cpp/pull/1770) |

 ## Prompt Template

@ -201,7 +199,8 @@ MLA models already have the cache compressed, it doesn't really makes sense to c
 | `-nkvo, --no-kv-offload` | Disable KV offload | - | Keep KV on CPU. |
 | `-ctk, --cache-type-k TYPE` | KV cache data type for K | f16 | Reduces K size in KV which improves speed and reduces memory requirements, but may reduce output quality. |
 | `-ctv, --cache-type-v TYPE` | KV cache data type for V | f16 | See: `-ctk` |
-| `--mtmd-kq-type type` | Define the type used for the `K*Q` matrix multiplication | - | Use une of `f16`/`bf16` instead of `f32` to improve speed up multimodal |
+| `-mtprot, --mtp-requantize-output-tensor type` | Use an output tensor requantized to `type` for MTP | - | Improves TG performance when using MTP. Requantizes the output tensor on the fly while loading the model; see [PR 1809](https://github.com/ikawrakow/ik_llama.cpp/pull/1809) for details and `--extra-output-tensor` ([PR 1810](https://github.com/ikawrakow/ik_llama.cpp/pull/1810)) for the offline requantization alternative. |
+| `--mtmd-kq-type type` | Define the type used for the `K*Q` matrix multiplication | - | Use one of `f16`/`bf16` instead of `f32` to improve multimodal speed |
 | `--no-context-shift` | Disable context-shift | - |  |
 | `--context-shift` | Set context-shift | on | auto / on / off / 0 / 1 [PR 973](https://github.com/ikawrakow/ik_llama.cpp/pull/973) |

@ -318,7 +317,7 @@ python3 gguf-py/scripts/gguf_dump.py /models/Qwen_Qwen3-0.6B-IQ4_NL.gguf

 - `-ngl`, `-ot`, `--cpu-moe`, `--n-cpu-moe N`
   - For MoE models, use a number greater than the number of model layers with `-ngl`. If unsure, use a large number like `-ngl 999`.
-   - It's good to explicitly put up/down/gate onto the GPU for speedups. 
+   - It's good to explicitly put up/down/gate onto the GPU for speedups.
   - Up/Gate shouldn't be on separate GPU devices because it might cause a bit of a deadlock.
   - For models with shared experts (like GPT-OSS), they should end up on GPU.
   - In some quants the layers aren't uniform so it can be better to skip larger layers if more smaller blocks will fit without empty space where nothing fits.
@ -328,7 +327,7 @@ python3 gguf-py/scripts/gguf_dump.py /models/Qwen_Qwen3-0.6B-IQ4_NL.gguf
   - In general, in a single GPU + CPU system, you just do something like this:

   `-ngl 999` To put all layers in VRAM by default
-   
+
   `-ot "blk.(?:[0-9]|[1-7][0-9]|[8][0-7]).ffn._exps.=CPU"` To create exceptions and put back in ram anything that has "ffn" and "_exps" in its name, and that sits in layers called "blk.n", where "n" (the lawyer number) is any match between 0 and 9, or between 1 to 7 + 0 to 9 (aka a number between 10 and 79), or 8 + 0 to 7 (aka a number between 80 and 87).
   Basically a complicated way of saying put all experts from layer 0 to 87 in ram. Experts from layer 88 to 93 (there's 93 layers in qwen3vl 235b) can sit in VRAM still. (Thats all I can load on a 5090).

@ -342,7 +341,7 @@ C. Other tips
   - If you are not happy with the allocations done by `--fit` across GPUs, use `-ts` to manually tweak.
   - Look for `ReBAR`/`Resizable BAR` support for your Motherboard, CPU, BIOS/UEFI and GPU. Then for the "patched driver" for your GPUs to enable GPU to GPU direct communication.

-### Common GPU configurations and popular models 
+### Common GPU configurations and popular models

 WIP

@ -368,7 +367,7 @@ WIP
 | `-grt, --graph-reduce-type` | Type for data exchange between GPUs | f32 | q8_0 / bf16 / f16 / f32 Reduce the data transferred between GPUs [PR 1154](https://github.com/ikawrakow/ik_llama.cpp/pull/1154) |
 | `-smgs, --split-mode-graph-scheduling` | Force Split Mode Graph Scheduling | 0 | [PR 1068](https://github.com/ikawrakow/ik_llama.cpp/pull/1068) |
 | `--max-gpu N` | Define (and use) a maximum number of GPUs per layer with split mode "graph" |  | This is of interest when there are more than 2 GPUs available, but using all of them leads to a lower performance than using just 2 (or using the default split mode "layer") [PR 1051](https://github.com/ikawrakow/ik_llama.cpp/pull/1051) |
-| `-cuda, --cuda-params` | Comma-separated list of cuda parameters | - | Powerful way to tweak Fusion, GPU offload threshold, and MMQ-ID threshold. [PR 910](https://github.com/ikawrakow/ik_llama.cpp/pull/910) |
+| `-cuda, --cuda-params` | Comma-separated list of cuda parameters | - | Powerful way to tweak Fusion, GPU offload threshold, and MMQ-ID threshold. [PR 910](https://github.com/ikawrakow/ik_llama.cpp/pull/910) [PR 1813](https://github.com/ikawrakow/ik_llama.cpp/pull/1813) |

 ## Model Options

@ -378,9 +377,7 @@ WIP
 | `--override-kv KEY=TYPE:VALUE` | Override model metadata by key | - | Advanced option to override model metadata by key. May be specified multiple times. types: int, float, bool, str. Example: `--override-kv tokenizer.ggml.add_bos_token=bool:false` |
 | `-m, --model FNAME` | Model path | models/$filename | Mandatory, the GGUF model file to be served. |
 | `-md, --model-draft FNAME` | Draft model for speculative decoding | unused | Required when an explicit `draft` stage is used. |
-| `--draft-max, --draft, --draft-n N` | Global speculative draft cap, or fallback value for stages without an explicit `n_max` override | 16 | Also used by single-stage MTP and draft-model speculation. |
-| `--draft-min, --draft-n-min N` | Global minimum speculative draft threshold, or fallback value for stages without an explicit `n_min` override | 0 |  |
-| `--draft-p-min P` | Global minimum speculative decoding probability (greedy), or fallback value for stages without an explicit `p_min` override | 0.8 |  |
+| `--spec-type SPEC[:k=v,...]` | Canonical speculative stage entry; repeat for the supported two-stage chain | none | Use stage-local keys like `n_max`, `n_min`, `p_min`, `ngram_size_n`, `ngram_size_m`, `ngram_min_hits`, `suffix_min_match_len`, `suffix_max_depth`, and `suffix_corpus`. |

 ### Request-Level Speculative Overrides

@ -453,6 +450,7 @@ llama-imatrix -m /models/model-bf16.gguf -f /models/calibration_data_v5_rc.txt -
 | - | - | - | - |
 | `--layer-similarity or -lsim` | Collect statistics about activations change caused by a layer using cosine similarity | - | [PR 328](https://github.com/ikawrakow/ik_llama.cpp/pull/328) |
 | `--hide-imatrix` | Store "top_secret" in the imatrix data file name | - | And in calibration dataset fields, and zeros in the batch size and number of chunks used to compute the imatrix. [PR 329](https://github.com/ikawrakow/ik_llama.cpp/pull/329) |
+| `--output-draft FNAME` | Paired draft output file | derived from `--output` | [PR 1803](https://github.com/ikawrakow/ik_llama.cpp/pull/1803) |

 Notes:
 - Use `convert_imatrix_gguf_to_dat.py` to convert the "new" GGUF imatrix files to the format supported here. [PR 1405](https://github.com/ikawrakow/ik_llama.cpp/pull/1405)
@ -479,6 +477,7 @@ llama-gguf-split --split --split-max-size 1G --no-tensor-first-split /models/mod
 | `--partial-requant` | quantize only missing split files in the split quantized .gguf destination directory | - | - |
 | `--symmetric-q40` | Use [-7:7] range for Q4_0 quantization (turns off imatrix) | - | This is useful for some models that have been trained to int4 using this specific quantization range (e.g., Kimi-2.6) [PR 1677](https://github.com/ikawrakow/ik_llama.cpp/pull/1677) |
 | `--slow-iq2ks` | Use the original very slow IQ2_KS quantization method | - | Alternative to the compile-time option [PR 1677](https://github.com/ikawrakow/ik_llama.cpp/pull/1677) |
+| `--extra-output-tensor ggml_type` | Requantize the output tensor and add a copy of that type. | - | [PR 1810](https://github.com/ikawrakow/ik_llama.cpp/pull/1810). See `--mtp-requantize-output-tensor type` for the on-the-fly alternative. |

 ### Build Arguments

@ -531,7 +530,7 @@ WIP

 ## Graph parallel models

-Models architectures [supported](https://github.com/ikawrakow/ik_llama.cpp/blob/90de8e31db79fb3503da5e20db0d3e46726a2117/src/llama.cpp#L1986) by `--split-mode graph`
+Model architectures [supported](https://github.com/ikawrakow/ik_llama.cpp/blob/022bd00aab9ec8428c4811275de89796c677d278/src/llama.cpp#L3056) by `--split-mode graph`

 ```
 LLM_ARCH_LLAMA,
@ -552,4 +551,9 @@ LLM_ARCH_STEP35,
 LLM_ARCH_QWEN35,
 LLM_ARCH_QWEN35MOE,
 LLM_ARCH_GEMMA4,
+LLM_ARCH_DEEPSEEK2,
+LLM_ARCH_GLM_DSA,
+LLM_ARCH_MISTRAL4,
+LLM_ARCH_MELLUM,
+LLM_ARCH_LAGUNA,
 ```
--- a/docs/speculative.md
+++ b/docs/speculative.md
@ -33,18 +33,18 @@ An example to use this approach can be the rewriting of source code by a LLM.
 This implementation looks for the last n-gram in history that matches the current n-gram and creates a draft using the m tokens following the matched n-gram. It is the simplest self-speculative approach with minimal overhead.

 ```
-llama-server [...] --spec-type ngram-simple --draft-max 64
+llama-server [...] --spec-type ngram-simple:n_max=64
 ```

 #### n-gram Map Key (`ngram-map-k`)

-This implementation looks for the current n-gram of size n (called the _key_) in the token history. If the key n-gram is followed by the same m tokens (called the _mgram_) multiple times, it creates a draft using these m tokens. This approach requires a minimum number of occurrences (argument `--spec-ngram-min-hits`, default is 1) before generating drafts.
+This implementation looks for the current n-gram of size n (called the _key_) in the token history. If the key n-gram is followed by the same m tokens (called the _mgram_) multiple times, it creates a draft using these m tokens. This approach requires a minimum number of occurrences (stage key `ngram_min_hits`, default is 1) before generating drafts.

 The number of accepted tokens is stored for each used n-gram.

 **Example:**
 ```
-llama-server [...] --spec-type ngram-map-k --draft-max 64
+llama-server [...] --spec-type ngram-map-k:n_max=64,ngram_min_hits=1
 ```

 #### n-gram Map Key-4-Values (`ngram-map-k4v`)
@ -55,7 +55,7 @@ The number of accepted tokens is stored for each used n-gram.

 **Example:** Server options to be used if there are a lot of longer repetitions.
 ```
-llama-server [...] --spec-type ngram-map-k4v --spec-ngram-size-n 8 --spec-ngram-size-m 8 --spec-ngram-min-hits 2 --draft-max 64
+llama-server [...] --spec-type ngram-map-k4v:n_max=64,ngram_size_n=8,ngram_size_m=8,ngram_min_hits=2
 ```

 ### n-gram Mod (`ngram-mod`)
@ -80,9 +80,9 @@ Currently, a single hash pool is shared across all server slots, so different re
 # notes:
 # - small `n` are not recommended
 # - MoEs require long drafts
-# - dense models: can reduce `--draft-min` and `--draft-max`
+# - dense models: can reduce `n_min` and `n_max`

-llama-server ... --spec-type ngram-mod --spec-ngram-size-n 24 --draft-min 48 --draft-max 64
+llama-server ... --spec-type ngram-mod:n_max=64,n_min=48,ngram_size_n=24
 ```

 Applications:
@ -103,57 +103,78 @@ Example Video:

 ## Command-Line Options

-If a draft model is combined with a draftless decoding the draftless decoding has higher precedence.
+The canonical startup surface is repeated `--spec-type SPEC[:k=v,...]`. Legacy `--spec-stage`, `--draft-*`, `--spec-ngram-*`, `--suffix-*`, and `-mtp` flags are rejected with replacement guidance.

-```
--draft, --draft-n, --draft-max N       number of tokens to draft for speculative decoding (default: 16)
-                                        (env: LLAMA_ARG_DRAFT_MAX)
--draft-min, --draft-n-min N            minimum number of draft tokens to use for speculative decoding
-                                        (default: 0)
-                                        (env: LLAMA_ARG_DRAFT_MIN)
-[...]
--spec-type [none|ngram-cache|ngram-simple|ngram-map-k|ngram-map-k4v|ngram-mod]
-                                        type of speculative decoding to use when no draft model is provided
-                                        (default: none)
--spec-ngram-size-n N                   ngram size N for ngram-simple/ngram-map speculative decoding, length
-                                        of lookup n-gram (default: 12)
--spec-ngram-size-m N                   ngram size M for ngram-simple/ngram-map speculative decoding, length
-                                        of draft m-gram (default: 48)
--spec-ngram-min-hits N                 minimum hits for ngram-map speculative decoding (default: 1)
-```
+### `--spec-type SPEC[:k=v,...]`

-### `--spec-type TYPE`
-
-Specifies a type of speculative decoding without draft model.
+Each `--spec-type` entry defines one speculative stage. Repeat it to configure the supported two-stage path.

 | Type | Description |
 |------|-------------|
-| `none` | No speculative decoding (default) |
+| `none` | No speculative decoding |
+| `draft` | Draft-model speculative decoding; pair with `-md/--model-draft` |
+| `mtp` | Embedded or assistant-backed MTP |
 | `ngram-cache` | Use n-gram cache lookup |
 | `ngram-simple` | Use simple n-gram pattern matching |
-| `ngram-map-k` | Use n-gram pattern matching with n-gram-keys |
-| `ngram-map-k4v` | Use n-gram pattern matching with n-gram-keys and up to four m-gram values (experimental) |
-| `ngram-mod` | Use basic ngram hasher for speculative decoding with shared pool |
+| `ngram-map-k` | Use n-gram pattern matching with n-gram keys |
+| `ngram-map-k4v` | Use n-gram pattern matching with n-gram keys and up to four m-gram values |
+| `ngram-mod` | Use the shared n-gram hasher |
+| `suffix` | Use suffix-tree speculative decoding |
+
+Canonical stage keys:
+
+| Key | Meaning |
+|-----|---------|
+| `n_max` | Maximum drafted tokens for that stage |
+| `n_min` | Minimum usable drafted tokens for that stage |
+| `p_min` | Minimum speculative probability threshold |
+| `ngram_size_n` | Lookup n-gram size |
+| `ngram_size_m` | Draft m-gram size |
+| `ngram_min_hits` | Minimum matching hits for n-gram map stages |
+| `suffix_min_match_len` | Minimum suffix context match length |
+| `suffix_max_depth` | Maximum suffix-tree depth |
+| `suffix_corpus` | Optional suffix corpus file for pre-warming |
+
+String-valued stage keys such as `suffix_corpus` need shell-safe quoting when the value contains commas. From a normal shell, quote the value inside the stage payload so the parser sees the comma as part of the string value.
+
+Example shell-safe form:

-**Example:** Server-instance used to refactor source code.
 ```bash
-./llama-server [...] --spec-type ngram-simple
+./llama-server [...] \
+    --spec-type "suffix:n_max=16,n_min=2,suffix_min_match_len=5,suffix_max_depth=64,suffix_corpus='/tmp/spec,type-corpus.json'"
 ```

-### `--spec-ngram-size-n N`
+If you are constructing `argv` directly without shell unescaping, the parser also accepts escaped commas as `\,`.

-Sets the size N of the lookup n-gram for n-gram map based speculative decoding.
-The n-gram size N determines how many tokens in a row to look back when searching for matching patterns.
+Examples:

-### `--spec-ngram-size-m M`
+```bash
+# Single-stage MTP
+./llama-server [...] --spec-type mtp:n_max=1,p_min=0.0

-Sets the size M of the draft m-gram for n-gram map based speculative decoding.
-The m-gram size determines how many tokens to draft when a match is found.
-Larger values can provide more speedup but may reduce acceptance rate.
+# Single-stage ngram-mod
+./llama-server [...] --spec-type ngram-mod:n_max=64,n_min=48,ngram_size_n=24

-### `--spec-ngram-min-hits H`
+# Draft-model speculation
+./llama-server [...] --model-draft draft.gguf --spec-type draft:n_max=4,p_min=0.0

-This option defines how often a key has to appear in the token history to be used as a draft (default is 1).
+# Two-stage self-spec -> MTP fallback
+./llama-server [...] \
+    --spec-type ngram-mod:n_max=64,n_min=2,ngram_size_n=8 \
+    --spec-type mtp:n_max=1,p_min=0.0
+
+# Suffix stage with pre-warmed corpus
+./llama-server [...] \
+    --spec-type suffix:n_max=16,n_min=2,suffix_min_match_len=5,suffix_max_depth=64,suffix_corpus=/path/to/corpus.json
+
+# Suffix stage with a comma-bearing corpus path from a normal shell
+./llama-server [...] \
+    --spec-type "suffix:n_max=16,n_min=2,suffix_min_match_len=5,suffix_max_depth=64,suffix_corpus='/tmp/spec,type-corpus.json'"
+```
+
+### `--spec-autotune`
+
+Autotunes the active stage parameters and reports the best configuration back as a canonical `--spec-type ...` snippet.

 ## Statistics
 Each speculative decoding implementation prints statistics.
@ -180,4 +201,3 @@ statistics ngram_map_k: #calls(b,g,a) = 6 1690 26, #gen drafts = 26, #acc drafts
 - `#gen tokens`: number of tokens generated by this implementation (including rejected tokens)
 - `#acc tokens`: number of tokens accepted by the main model
 - `dur(b,g,a): durations of begin (new prompt), generation and accumulation (process acceptance).
-
--- a/examples/imatrix/imatrix.cpp
+++ b/examples/imatrix/imatrix.cpp
@ -1232,7 +1232,7 @@ int main(int argc, char ** argv) {
    }

    if (!use_paired_gemma4_mtp && llama_model_is_gemma4_mtp_assistant(model) && !params.process_output) {
-        fprintf(stderr, "%s: warning: standalone Gemma 4 assistant imatrix does not exercise the assistant layers. Use '-m <target> -md <assistant> -mtp' for meaningful calibration.\n", __func__);
+        fprintf(stderr, "%s: warning: standalone Gemma 4 assistant imatrix does not exercise the assistant layers. Use '-m <target> -md <assistant> --spec-type mtp:n_max=1,p_min=0.0' for meaningful calibration.\n", __func__);
    }

    const int n_ctx_train = llama_n_ctx_train(model);
--- a/examples/mtmd/mtmd-helper.cpp
+++ b/examples/mtmd/mtmd-helper.cpp
@ -183,7 +183,7 @@ static int32_t mtmd_helper_decode_image_chunk_impl(
    }

    const llama_model * model = llama_get_model(lctx);
-    int n_mmproj_embd = llama_model_n_embd_inp(model);
+    int n_mmproj_embd = llama_model_n_embd(model);
    int n_pos_per_embd = mtmd_decode_use_mrope(ctx) ? 4 : 1;

    int32_t n_tokens = mtmd_input_chunk_get_n_tokens(chunk);
--- a/examples/mtmd/mtmd.cpp
+++ b/examples/mtmd/mtmd.cpp
@ -307,6 +307,11 @@ struct mtmd_context {
            img_end = "<image|>";
            //image_preproc = std::make_unique<mtmd_image_preprocessor_dyn_size>(ctx_v);
        }
+        else if (proj == PROJECTOR_TYPE_KIMIK25) {
+            // template renders: <|media_begin|>image<|media_content|> <pad/embeddings> <|media_end|>
+            img_beg = "<|media_begin|>image<|media_content|>";
+            img_end = "<|media_end|>";
+        }
    }

    void init_audio() {
--- a/examples/server/README.md
+++ b/examples/server/README.md
@ -210,10 +210,10 @@ model:
  -m,    --model FNAME            model path (default: models/$filename with filename from --hf-file
                                  or --model-url if set, otherwise models/7B/ggml-model-f16.gguf)
  -md,   --model-draft FNAME      draft model for speculative decoding (default: unused)
-         --spec-stage SPEC[:k=v,...]
-                                  explicit speculative stage. repeat once for a supported two-stage chain
-                                  examples: --spec-stage ngram-mod:n_max=64,n_min=2 --spec-stage mtp:n_max=1
-                                  supported two-stage shape: self-spec first, then mtp or draft fallback
+         --spec-type SPEC[:k=v,...]
+                                  canonical speculative stage entry; repeat for a supported two-stage chain
+                                  examples: --spec-type mtp:n_max=1,p_min=0.0
+                                            --spec-type ngram-mod:n_max=64,n_min=2,ngram_size_n=8 --spec-type mtp:n_max=1,p_min=0.0
  -mu,   --model-url MODEL_URL    model download url (default: unused)
  -hfr,  --hf-repo REPO           Hugging Face model repository (default: unused)
  -hff,  --hf-file FILE           Hugging Face model file (default: unused)
@ -966,15 +966,15 @@ To know the `id` of the adapter, use GET `/lora-adapters`

 ### Composite speculative decoding

-Use `--spec-stage` for explicit stage chains. The currently supported two-stage shape is self-spec first, then `mtp` or `draft` fallback.
+Use repeated `--spec-type SPEC[:k=v,...]` entries for explicit stage chains. The currently supported two-stage shape is self-spec first, then `mtp` or `draft` fallback.

 Example with `ngram-mod` plus MTP fallback:

 ```bash
 ./build/bin/llama-server \
  --model /models/target-mtp.gguf \
-  --spec-stage ngram-mod:n_max=64,n_min=2,ngram_size_n=8 \
-  --spec-stage mtp:n_max=1,p_min=0.0
+  --spec-type ngram-mod:n_max=64,n_min=2,ngram_size_n=8 \
+  --spec-type mtp:n_max=1,p_min=0.0
 ```

 Example with `ngram-mod` plus draft-model fallback:
@ -983,14 +983,13 @@ Example with `ngram-mod` plus draft-model fallback:
 ./build/bin/llama-server \
  --model /models/target.gguf \
  --model-draft /models/draft.gguf \
-  --spec-stage ngram-mod:n_max=64,n_min=2,ngram_size_n=8 \
-  --spec-stage draft:n_max=4,p_min=0.0
+  --spec-type ngram-mod:n_max=64,n_min=2,ngram_size_n=8 \
+  --spec-type draft:n_max=4,p_min=0.0
 ```

 Notes:

- Use `--spec-type` when you want a single self-spec stage only.
- `--spec-type` cannot be combined with `--spec-stage`.
+- Use `--spec-type` for both single-stage and two-stage startup configuration.
 - Explicit stage chains currently support at most two stages.

 ### Change system prompt on runtime
--- a/examples/server/public/index.html.gz
+++ b/examples/server/public/index.html.gz
--- a/examples/server/public_llamacpp/index_llamacpp.html
+++ b/examples/server/public_llamacpp/index_llamacpp.html
--- a/examples/server/public_llamacpp/index_llamacpp.html.gz
+++ b/examples/server/public_llamacpp/index_llamacpp.html.gz
--- a/examples/server/server-common.cpp
+++ b/examples/server/server-common.cpp
@ -1250,7 +1250,7 @@ const mtmd::input_chunk_ptr& server_tokens::find_chunk(size_t idx) const {
    if (it != map_idx_to_media.end()) {
        return it->second;
    }
-    throw std::runtime_error("Chunk not found");
+    throw std::runtime_error("Chunk not found, or idx is not the first token of a chunk");
 }

 void server_tokens::push_back(llama_token tok) {
@ -1295,7 +1295,7 @@ void server_tokens::push_back(server_tokens& tokens) {
        // Assert if we are copying MTMD chunks to a server_tokens that does not have mtmd.
        // We could also just check, but this will prevent silently dropping MTMD data.
        GGML_ASSERT(has_mtmd);
-        for (auto it = tokens.map_idx_to_media.begin(); it != tokens.map_idx_to_media.end(); ) {
+        for (auto it = tokens.map_idx_to_media.begin(); it != tokens.map_idx_to_media.end(); it++) {
            auto* chunk = tokens.map_idx_to_media[it->first].get();
            mtmd::input_chunk_ptr new_chunk(mtmd_input_chunk_copy(chunk));
            map_idx_to_media[start_idx + it->first] = std::move(new_chunk);
@ -1369,18 +1369,10 @@ void server_tokens::keep_first(size_t n) {
        if (n == tokens.size()) {
            return; // nothing to do
        }
-        // we throw an error if we try to remove a token in the middle of an image
-        // for ex. with input of 5 text tokens and 2 images:
-        //    [0] [1] [2] [3] [4] [img0] [img0] [img0] [img1] [img1]
-        // n  1   2   3   4   5   6      7      8      9      10
-        // allowed to resize      ^                    ^
-        // disallowed to resize          ^      ^             ^
-        if (n > 0) {
-            llama_token last_token = tokens[n - 1];
-            // make sure we never remove tokens in the middle of an image
-            if (last_token == LLAMA_TOKEN_NULL) {
-                find_chunk(n - 1); // will throw an error if the token is not begin-of-chunk
-            }
+        // It is an internal error if the longest common prefix ends in the middle of an image
+        llama_token first_removed_token = tokens[n];
+        if (first_removed_token == LLAMA_TOKEN_NULL) {
+            find_chunk(n); // will throw an error if the token is not begin-of-chunk
        }
        // remove all image chunks that are not used anymore
        for (auto it = map_idx_to_media.begin(); it != map_idx_to_media.end(); ) {
--- a/examples/server/server-context.cpp
+++ b/examples/server/server-context.cpp
@@ -136,12 +136,6 @@ static bool server_slot_prompt_batch_overlaps(

    return slot.prompt_batch_i0 < batch_i1 && batch_i0 < slot.prompt_batch_i1;
 }
-
-static bool params_use_gemma4_external_mtp(const gpt_params & params_base) {
-    return params_base.has_mtp &&
-        llama_model_is_gemma4_mtp_assistant(params_base.speculative.model_dft);
-}
-
 struct server_mtp_warmup {
    llama_context * ctx_tgt;
    server_slot   * slot;
@@ -164,74 +158,12 @@ static bool server_response_needs_chat_parse(oaicompat_type oaicompat) {
        oaicompat == OAICOMPAT_TYPE_RESP;
 }

-void server_speculative_checkpoint::clear() {
-    valid = false;
-    per_step_enabled = false;
-    n_past = 0;
-    sampled = LLAMA_TOKEN_NULL;
-
-    if (sampler != nullptr) {
-        common_sampler_free(sampler);
-        sampler = nullptr;
-    }
-}
-
-static void discard_speculative_checkpoint(server_slot & slot, llama_context * ctx) {
-    slot.spec_ckpt.clear();
-    llama_spec_ckpt_discard(ctx);
-}
-
-static bool save_speculative_checkpoint(server_slot & slot, llama_model * model, llama_context * ctx, int ckpt_mode) {
-    slot.spec_ckpt.clear();
-    const int32_t n_pre_spec_tokens = slot.cache_tokens.n_tokens() - (int32_t)(slot.drafted.size() + 1);
-    slot.spec_ckpt.n_past = slot.cache_tokens.pos_next(n_pre_spec_tokens);
-    slot.spec_ckpt.sampled = slot.sampled;
-
-    const int max_tokens = (int)slot.drafted.size() + 1;
-    const int actual_mode = llama_spec_ckpt_init(ctx, ckpt_mode, max_tokens);
-    if (actual_mode == LLAMA_SPEC_CKPT_NONE) {
-        return false;
-    }
-    slot.spec_ckpt.per_step_enabled = (actual_mode == LLAMA_SPEC_CKPT_PER_STEP);
-
-    slot.spec_ckpt.valid = llama_spec_ckpt_save(ctx, slot.id);
-    if (!slot.spec_ckpt.valid) {
-        llama_spec_ckpt_discard(ctx);
-        return false;
-    }
-
-    slot.spec_ckpt.sampler = common_sampler_init(model, slot.sparams);
-    if (slot.spec_ckpt.sampler == nullptr) {
-        discard_speculative_checkpoint(slot, ctx);
-        return false;
-    }
-
-    common_sampler_clone(slot.ctx_sampling, slot.spec_ckpt.sampler);
-    return true;
-}
-
-static void server_remove_speculative_stage(common_params_speculative & spec, common_speculative_type type) {
-    spec.stages.erase(std::remove_if(spec.stages.begin(), spec.stages.end(), [type](const common_speculative_stage_params & stage) {
-        return stage.type == type;
-    }), spec.stages.end());
-
-    if (spec.type == type) {
-        spec.type = COMMON_SPECULATIVE_TYPE_NONE;
-        const auto resolved = spec.get_resolved_stages();
-        spec.type = resolved.empty() ? COMMON_SPECULATIVE_TYPE_NONE : resolved.front().type;
-    }
-}
-
-static bool server_speculative_has_mtp(const common_params_speculative & spec) {
-    return spec.has_stage_type(COMMON_SPECULATIVE_TYPE_MTP);
-}
-
 static bool server_speculative_has_dflash(const common_params_speculative & spec) {
    return spec.has_stage_type(COMMON_SPECULATIVE_TYPE_DFLASH);
 }

 static bool server_speculative_has_target_features(const common_params_speculative & spec) {
-    return server_speculative_has_mtp(spec) || server_speculative_has_dflash(spec);
+    return spec.has_stage_type(COMMON_SPECULATIVE_TYPE_MTP) || server_speculative_has_dflash(spec);
 }

 static bool server_speculative_same_stage_types(
@@ -262,7 +194,8 @@ static void server_reject_dead_speculative_request_overrides(const json & data)
        json_value_ptr(data, "speculative.ngram_size_m") != nullptr ||
        json_value_ptr(data, "speculative.ngram_min_hits") != nullptr ||
        json_value_ptr(data, "speculative.suffix_min_match_len") != nullptr ||
-        json_value_ptr(data, "speculative.suffix_max_depth") != nullptr) {
+        json_value_ptr(data, "speculative.suffix_max_depth") != nullptr ||
+        json_value_ptr(data, "speculative.suffix_corpus") != nullptr) {
        throw std::runtime_error("Error: structural speculative overrides are startup-only; per-request overrides only support speculative.n_max, speculative.n_min, speculative.p_min, and speculative.stages");
    }
 }
@@ -322,11 +255,8 @@ server_context::~server_context() {
        if (slot.ctx_sampling != nullptr) {
            common_sampler_free(slot.ctx_sampling);
        }
-        slot.spec_ckpt.clear();
        common_speculative_free(slot.spec);
        slot.spec = nullptr;
-        slot.ctx_dft = nullptr;
-        llama_batch_free(slot.batch_spec);
    }

    if (ctx) {
@@ -340,16 +270,7 @@ server_context::~server_context() {
    }
    // Free multimodal
    mtmd_free(mctx);
-    // Free draft model and context if they exist
-    if (ctx_draft) {
-        llama_free(ctx_draft);
-        ctx_draft = nullptr;
-    }
-    if (model_draft) {
-        llama_free_model(model_draft);
-        model_draft = nullptr;
-    }
-
+    params_base.speculative.clear_dft();
    llama_batch_free(batch);
 }

@@ -372,18 +293,7 @@ bool server_context::load_model(const gpt_params& params_) {
    add_bos_token = llama_should_add_bos_token(model);
    has_eos_token = llama_add_eos_token(model) != 1;

-    if (params_base.has_mtp && params_base.n_parallel > 1) {
-        LOG_WARNING("MTP is not supported with parallel slots yet, disabling MTP to avoid cross-slot corruption.\n", {
-            {"n_parallel", params_base.n_parallel},
-        });
-        params_base.has_mtp = false;
-        if (params_base.speculative.type == COMMON_SPECULATIVE_TYPE_MTP) {
-            params_base.speculative.type = COMMON_SPECULATIVE_TYPE_NONE;
-        }
-        params_base.speculative.model.clear();
-        params_base.speculative.params.clear();
-        params_base.speculative.model_dft = nullptr;
-    }
+    common_speculative_prepare_startup(params_base, false);

    if (server_speculative_has_dflash(params_base.speculative) && params_base.n_parallel > 1) {
        LOG_ERROR("DFlash is currently limited to a single server slot (-np 1).\n", {
@@ -391,9 +301,8 @@ bool server_context::load_model(const gpt_params& params_) {
        });
        return false;
    }
-
-    bool has_draft_model = !params_base.speculative.model.empty() || !params_base.speculative.params.empty();
-    std::string& mmproj_path = params_base.mmproj.path;
+    const bool has_draft_model = params_base.speculative.has_dft();
+    std::string & mmproj_path = params_base.mmproj.path;
    if (!mmproj_path.empty()) {
        mtmd_context_params mparams = mtmd_context_params_default();
        mparams.use_gpu = params_base.mmproj_use_gpu;
@@ -407,10 +316,10 @@ bool server_context::load_model(const gpt_params& params_) {
        mparams.image_max_tokens = params_base.image_max_tokens;
        mctx = mtmd_init_from_file(mmproj_path.c_str(), model, mparams);
        if (mctx == nullptr) {
-            LOG_ERROR("failed to load multimodal model, '%s'\n", mmproj_path.c_str());
+            LOG_ERROR("failed to load multimodal model, %s\n", mmproj_path.c_str());
            return false;
        }
-        LOG_INFO("loaded multimodal model, '%s'\n", mmproj_path.c_str());
+        LOG_INFO("loaded multimodal model, %s\n", mmproj_path.c_str());

        //if (params.n_cache_reuse) {
        //    params_base.n_cache_reuse = 0;
@@ -421,86 +330,22 @@ bool server_context::load_model(const gpt_params& params_) {
            LOG_ERROR("%s\n", "err: speculative decode is not supported by multimodal");
            return false;
        }
-    const auto spec_stages = params_base.speculative.get_resolved_stages();
-    const bool multimodal_spec_supported = spec_stages.empty() ||
-        (spec_stages.size() == 1 && spec_stages.front().type == COMMON_SPECULATIVE_TYPE_MTP);
-    if (!multimodal_spec_supported) {
+
+        const auto spec_stages = params_base.speculative.get_resolved_stages();
+        const bool multimodal_spec_supported = spec_stages.empty() ||
+            (spec_stages.size() == 1 && spec_stages.front().type == COMMON_SPECULATIVE_TYPE_MTP);
+        if (!multimodal_spec_supported) {
            params_base.speculative.type = COMMON_SPECULATIVE_TYPE_NONE;
            params_base.speculative.stages.clear();
            params_base.has_mtp = false;
            SRV_WRN("%s\n", "speculative decoding is not supported by multimodal, it will be disabled");
        }
    }
-    // Load draft model for speculative decoding if specified
-    if (has_draft_model) {
-        LLAMA_LOG_INFO("\n\n==================================loading DRAFT model==================================\n\n");
-
-        gpt_params params_dft;
-        params_dft.devices      = params_base.speculative.devices;
-        params_dft.model        = params_base.speculative.model;
-        params_dft.main_gpu     = params_base.main_gpu;
-        params_dft.n_gpu_layers = params_base.speculative.n_gpu_layers;
-        params_dft.rpc_servers  = params_base.rpc_servers;
-        params_dft.cache_type_k = params_base.speculative.cache_type_k.empty() ? params_base.cache_type_k : params_base.speculative.cache_type_k;
-        params_dft.cache_type_v = params_base.speculative.cache_type_v.empty() ? params_base.cache_type_v : params_base.speculative.cache_type_v;
-        params_dft.flash_attn   = params_base.flash_attn;
-        params_dft.k_cache_hadamard = params_base.k_cache_hadamard;
-        params_dft.v_cache_hadamard = params_base.v_cache_hadamard;
-        if (server_speculative_has_dflash(params_base.speculative)) {
-            params_dft.split_mode = params_base.split_mode;
-            for (size_t i = 0; i < std::size(params_dft.tensor_split); ++i) {
-                params_dft.tensor_split[i] = params_base.tensor_split[i];
-            }
-            params_dft.attn_max_batch = params_base.attn_max_batch;
-            params_dft.graph_reuse = params_base.graph_reuse;
-            params_dft.split_mode_graph_scheduling = params_base.split_mode_graph_scheduling;
-            params_dft.scheduler_async = params_base.scheduler_async;
-            params_dft.max_extra_alloc_MiB = params_base.max_extra_alloc_MiB;
-            params_dft.reduce_type = params_base.reduce_type;
-        }
-        if (!params_base.speculative.params.empty()) {
-            auto [argc, argv] = parse_command_line("llama-server " + params_base.speculative.params);
-            if (!gpt_params_parse(argc, argv, params_dft)) {
-                gpt_params_print_usage(argc, argv, params_dft);
-                free_command_line(argc, argv);
-                return false;
-            };
-            free_command_line(argc, argv);
-        }
-        LOG_INFO("", { {"model", params_dft.model} });
-        if (params_dft.n_ctx == 0) {
-            params_dft.n_ctx = params_base.speculative.n_ctx;
-        }
-        if (server_speculative_has_dflash(params_base.speculative) && params_dft.n_gpu_layers < 0) {
-            params_dft.n_gpu_layers = params_base.n_gpu_layers;
-        }
-        params_dft.n_ctx = params_dft.n_ctx == 0 ? params_base.n_ctx / params_base.n_parallel : params_dft.n_ctx;
-        params_dft.n_parallel = 1;
-        params_dft.n_batch = params_dft.n_ctx;
-
-        params_base.speculative.mparams_dft.path = params_dft.model; //
-
-        llama_model_params mparams_dft = common_model_params_to_llama(params_dft);
-
-        llama_model * model_dft = llama_model_load_from_file(params_dft.model.c_str(), mparams_dft);
-        if (model_dft == nullptr) {
-            LOG_ERROR("failed to load draft model", { {"model", params_base.speculative.model} });
-            return false;
-        }
-
-        cparams_dft = common_context_params_to_llama(params_dft);
-
-        params_base.speculative.model_dft = model_dft;
-        params_base.speculative.cparams_dft = cparams_dft;

+    if (!common_speculative_finalize_startup(params_base, model)) {
+        return false;
    }
-    if (server_speculative_has_mtp(params_base.speculative) &&
-        llama_model_n_nextn_layer(model) == 0 &&
-        !params_use_gemma4_external_mtp(params_base)) {
-        LOG_WARNING("WARNING: MTP speculative stage requested, but model has 0 NextN layers. MTP will be disabled.\n", {});
-        params_base.has_mtp = false;
-        server_remove_speculative_stage(params_base.speculative, COMMON_SPECULATIVE_TYPE_MTP);
-    }
+
    return true;
 }

@@ -509,6 +354,20 @@ void server_context::init() {

    LOG_INFO("initializing slots", { {"n_slots", params_base.n_parallel} });

+    if (params_base.has_mtp) {
+        SRV_INF("%s\n", "MTP needs embeddings on decode, enabling");
+        llama_set_embeddings(ctx, true);
+    }
+
+    const bool requested_spec = params_base.speculative.has_stage_chain();
+    bool can_spec = true;
+    if (!params_base.dry_run) {
+        can_spec = common_speculative_is_compat(ctx);
+    }
+    if (!can_spec && requested_spec) {
+        SRV_WRN("%s", "speculative decoding not supported by this context\n");
+    }
+
    for (int i = 0; i < params_base.n_parallel; i++) {
        server_slot slot;

@@ -552,69 +411,27 @@ void server_context::init() {

        slot.params.speculative = params_base.speculative;
        slot.sparams = params_base.sparams;
-
-        const bool wants_mtp_stage = server_speculative_has_mtp(params_base.speculative);
-        if (wants_mtp_stage) {
-            const bool has_external_mtp = params_use_gemma4_external_mtp(params_base);
-
-            if (llama_model_n_nextn_layer(model) > 0 || has_external_mtp) {
-                params_base.pooling_type = LLAMA_POOLING_TYPE_NONE;
-
-                if (!has_external_mtp) {
-                    params_base.speculative.cparams_dft = common_context_params_to_llama(params_base);
-                }
-
-                params_base.speculative.cparams_dft.mtp          = true;
-                params_base.speculative.cparams_dft.mtp_op_type  = MTP_OP_WARMUP;
-                params_base.speculative.cparams_dft.embeddings   = true;
-
-                slot.has_mtp = true;
-                slot.params.speculative.cparams_dft = params_base.speculative.cparams_dft;
-
-                slot.batch_spec = llama_batch_init(slot.params.speculative.get_max_stage_n_max() + 1, 0, 1);
-                SLT_DBG(slot, "batch_spec contains %d tokens\n", slot.batch_spec.n_tokens);
-
-                SRV_INF("%s\n", "MTP needs embeddings on decode, enabling");
-                llama_set_embeddings(ctx, true);
-            }
-            else {
-                SRV_WRN("%s\n", "MTP speculative stage requested, but model has 0 NextN layers. Removing MTP from the configured stage chain.");
-                params_base.has_mtp = false;
-                server_remove_speculative_stage(params_base.speculative, COMMON_SPECULATIVE_TYPE_MTP);
-                slot.params.speculative = params_base.speculative;
-                slot.has_mtp = false;
-            }
-        }
-
-        const bool requested_spec = !params_base.speculative.get_resolved_stages().empty();
-
-        bool can_spec = true;
-        if (!params_base.dry_run) {
-            can_spec = common_speculative_is_compat(ctx);
-        }
-        if (!can_spec) {
-            SRV_WRN("%s", "speculative decoding not supported by this context\n");
-        }
        // try speculative decoding
        if (can_spec && requested_spec) {
-            slot.spec = common_speculative_init(params_base.speculative, slot.ctx);
-            if (slot.spec) {
-                if (mctx && !slot.has_mtp) {
+            switch (common_speculative_try_init(params_base.speculative, slot.ctx, &slot.spec)) {
+            case COMMON_SPECULATIVE_INIT_READY:
+                if (mctx && !slot.uses_mtp()) {
                    SRV_ERR("%s\n", "speculative decoding is not supported with multimodal");
                    return;
                }
                SLT_INF(slot, "%s", "speculative decoding context initialized\n");
-            } else {
-                if (llama_model_has_recurrent(model)) {
-                    SRV_ERR("%s", "failed to initialize recurrent speculative context\n");
-                    throw std::runtime_error("recurrent speculative context initialization failed");
-                } else if (slot.has_mtp) {
-                    SRV_ERR("%s", "failed to initialize MTP speculative context\n");
-                    throw std::runtime_error("MTP speculative context initialization failed");
-                } else {
-                    SRV_ERR("%s", "failed to initialize speculative decoding context\n");
-                    throw std::runtime_error("speculative decoding context initialization failed");
-                }
+                break;
+            case COMMON_SPECULATIVE_INIT_ERR_RECURRENT:
+                SRV_ERR("%s", "failed to initialize recurrent speculative context\n");
+                throw std::runtime_error("recurrent speculative context initialization failed");
+            case COMMON_SPECULATIVE_INIT_ERR_MTP:
+                SRV_ERR("%s", "failed to initialize MTP speculative context\n");
+                throw std::runtime_error("MTP speculative context initialization failed");
+            case COMMON_SPECULATIVE_INIT_ERR_GENERIC:
+                SRV_ERR("%s", "failed to initialize speculative decoding context\n");
+                throw std::runtime_error("speculative decoding context initialization failed");
+            case COMMON_SPECULATIVE_INIT_SKIPPED:
+                break;
            }
        }

@@ -735,9 +552,7 @@ void server_slot::reset() {
    prompt_batch_i1 = -1;
    n_sent_text = 0;
    drafted.clear();
-    drafted_spec_type = COMMON_SPECULATIVE_TYPE_NONE;
    i_batch_dft.clear();
-    spec_ckpt.clear();
    spec_prompt_warmup_failed = false;
    n_sent_token_probs = 0;
    infill = false;
@@ -756,7 +571,7 @@ void server_slot::reset() {
    image_just_processed = false;
    do_checkpoint = false;
    if (spec != nullptr) {
-        common_speculative_clear_sequence_hidden(spec, id);
+        common_speculative_clear_sequence(spec, id);
    }

    positional_bans.clear();
@@ -791,7 +606,11 @@ void server_slot::reset() {
 }

 bool server_slot::need_embd() const {
-    return embedding || has_mtp;
+    return embedding || uses_mtp();
+}
+
+bool server_slot::uses_mtp() const {
+    return params.speculative.has_stage_type(COMMON_SPECULATIVE_TYPE_MTP);
 }

 bool server_slot::has_budget(gpt_params& global_params) {
@@ -827,7 +646,7 @@ void server_slot::add_token_string(const completion_token_output& token) {
 }

 bool server_slot::can_speculate() const {
-    return !spec_prompt_warmup_failed && (!!spec || has_mtp);
+    return !spec_prompt_warmup_failed && (!!spec || uses_mtp());
 }

 int server_slot::get_n_draft_max() const {
@@ -1367,6 +1186,10 @@ bool server_context::launch_slot_with_task(server_slot& slot, server_task& task)
    // speculative decoding parameters
    try {
        slot.params.speculative = defaults.speculative;
+        const bool has_flat_n_max = json_value_ptr(data, "speculative.n_max") != nullptr;
+        const bool has_flat_n_min = json_value_ptr(data, "speculative.n_min") != nullptr;
+        const bool has_flat_p_min = json_value_ptr(data, "speculative.p_min") != nullptr;
+
        slot.params.speculative.n_max = json_value(data, "speculative.n_max", params_base.speculative.n_max);
        slot.params.speculative.n_min = json_value(data, "speculative.n_min", params_base.speculative.n_min);
        slot.params.speculative.p_min = json_value(data, "speculative.p_min", params_base.speculative.p_min);
@@ -1374,6 +1197,20 @@ bool server_context::launch_slot_with_task(server_slot& slot, server_task& task)
        server_reject_dead_speculative_request_overrides(data);

        const json stages = json_value(data, "speculative.stages", json());
+        if (stages.is_null() && !slot.params.speculative.stages.empty()) {
+            for (auto & stage : slot.params.speculative.stages) {
+                if (has_flat_n_max) {
+                    stage.n_max = -1;
+                }
+                if (has_flat_n_min) {
+                    stage.n_min = -1;
+                }
+                if (has_flat_p_min) {
+                    stage.p_min = -1.0f;
+                }
+            }
+        }
+
        if (!stages.is_null()) {
            if (!stages.is_array()) {
                throw std::runtime_error("Error: speculative.stages must be an array");
@@ -1412,11 +1249,11 @@ bool server_context::launch_slot_with_task(server_slot& slot, server_task& task)

        if (slot.can_speculate() &&
            llama_model_has_recurrent(model) &&
-            slot.params.speculative.n_max > params_base.speculative.n_max) {
+            slot.params.speculative.get_max_stage_n_max() > params_base.speculative.get_max_stage_n_max()) {
            send_error(task,
-                    "Error: speculative.n_max=" + std::to_string(slot.params.speculative.n_max) +
-                    " exceeds the recurrent speculative startup limit of " + std::to_string(params_base.speculative.n_max) +
-                    "; restart the server with a higher --draft-max to reserve checkpoint capacity",
+                "Error: speculative n_max=" + std::to_string(slot.params.speculative.get_max_stage_n_max()) +
+                " exceeds the recurrent speculative startup limit of " + std::to_string(params_base.speculative.get_max_stage_n_max()) +
+                "; restart the server with a higher n_max inside the configured --spec-type stages to reserve checkpoint capacity",
                    ERROR_TYPE_INVALID_REQUEST);
            return false;
        }
@@ -1425,7 +1262,7 @@ bool server_context::launch_slot_with_task(server_slot& slot, server_task& task)
            throw std::runtime_error("Error: per-request speculative stages must match the server startup stage types; only stage parameter overrides are supported");
        }

-        if (slot.params.speculative.has_stage_type(COMMON_SPECULATIVE_TYPE_MTP) && !slot.has_mtp) {
+        if (slot.params.speculative.has_stage_type(COMMON_SPECULATIVE_TYPE_MTP) && !params_base.has_mtp) {
            throw std::runtime_error("Error: MTP speculative stage requested, but the server was not started with MTP support");
        }

@@ -2205,10 +2042,7 @@ void server_context::kv_cache_clear() {
            continue;
        }

-        common_speculative_clear_sequence_hidden(slot.spec, slot.id);
-        if (auto * ctx_companion = common_speculative_get_companion_ctx(slot.spec); ctx_companion != nullptr) {
-            llama_kv_cache_clear(ctx_companion);
-        }
+        common_speculative_clear_sequence(slot.spec, slot.id, true);
    }
    clean_kv_cache = false;
 }
@@ -2759,7 +2593,7 @@ void server_context::apply_server_biases(server_slot& slot) {
    }
 }

-void server_context::request_completion(int id_task, int id_multi, json data, bool infill, bool embedding, server_tokens&& inputs) {
+void server_context::request_completion(int id_task, int id_multi, json data, bool infill, bool embedding, server_tokens & inputs) {
    server_task task;
    task.id = id_task;
    task.id_multi = id_multi;
@@ -2768,7 +2602,7 @@ void server_context::request_completion(int id_task, int id_multi, json data, bo
    task.infill = infill;
    task.embedding = embedding;
    task.type = SERVER_TASK_TYPE_COMPLETION;
-    task.tokens = std::move(inputs);
+    task.tokens = inputs.clone();
    // when a completion task's prompt array is not a singleton, we split it into multiple requests
    // otherwise, it's a single-prompt task, we actually queue it
    // if there's numbers in the prompt array it will be treated as an array of tokens
@@ -2807,7 +2641,8 @@ void server_context::request_cancel(int id_task) {
 }

 void server_context::split_multiprompt_task(int id_multi, server_task& multiprompt_task) {
-    const int prompt_count = multiprompt_task.data.at("prompt").size();
+    auto prompts = multiprompt_task.data.at("prompt");
+    const int prompt_count = prompts.size();
    if (prompt_count <= 1) {
        send_error(multiprompt_task, "error while handling multiple prompts");
        return;
@@ -2825,11 +2660,11 @@ void server_context::split_multiprompt_task(int id_multi, server_task& multiprom
    // add subtasks
    for (int i = 0; i < prompt_count; i++) {
        json subtask_data = multiprompt_task.data;
-        subtask_data["prompt"] = subtask_data.at("prompt")[i];
+        subtask_data["prompt"] = prompts[i];

        // subtasks inherit everything else (infill mode, embedding mode, etc.)
        request_completion(subtask_ids[i], id_multi, subtask_data, multiprompt_task.infill, multiprompt_task.embedding,
-            std::move(multiprompt_task.tokens));
+            multiprompt_task.tokens);
    }
 }

@@ -3666,33 +3501,27 @@ void server_context::add_sampled_tokens() {
        //       perform the speculative drafting for all sequences at the same time in a single batch
        const int n_draft_max_pre = slot.get_n_draft_max();
        if (n_draft_max_pre > 0) {
-            if (mctx && !slot.has_mtp) {
+            if (mctx && !slot.uses_mtp()) {
                // we should never reach this, as speculative is automatically disabled if mmproj is loaded
                GGML_ABORT("not supported by multimodal");
            }

            static const llama_tokens empty_prompt;
-            const llama_tokens & cached_text_tokens = slot.has_mtp && !slot.params.speculative.has_composite_stage_chain()
+            const llama_tokens & cached_text_tokens = slot.uses_mtp() && !slot.params.speculative.has_composite_stage_chain()
                ? empty_prompt
                : slot.cache_tokens.get_text_tokens();

            auto & params_spec = slot.params.speculative;
-            const llama_pos draft_base_pos = slot.has_mtp ? slot.cache_tokens.pos_next() : -1;
-
-            if (slot.has_mtp) {
-                if (!common_speculative_ensure_sequence_hidden(slot.spec, ctx, slot.id, draft_base_pos - 1)) {
-                    LOG_ERROR("MTP hidden state is empty during speculation", {});
-                }
-            }
-
-            llama_tokens draft = common_speculative_draft(
+            const llama_pos draft_base_pos = slot.uses_mtp() ? slot.cache_tokens.pos_next() : -1;
+            common_speculative_draft_result draft_result = common_speculative_draft_ex(
                slot.spec,
+                ctx,
                params_spec,
                cached_text_tokens,
                slot.sampled,
                draft_base_pos,
                slot.id);
-            slot.drafted_spec_type = common_speculative_current_type(slot.spec);
+            llama_tokens & draft = draft_result.tokens;

            const int n_draft_max = slot.get_n_draft_max();

@@ -3717,7 +3546,6 @@ void server_context::add_sampled_tokens() {
                // fallback to normal decoding
                slot.i_batch = slot.i_batch_dft[0];
                slot.drafted.clear();
-                slot.drafted_spec_type = COMMON_SPECULATIVE_TYPE_NONE;
                slot.i_batch_dft.clear();
            } else {
                // keep track of total number of drafted tokens tested
@@ -3734,7 +3562,6 @@ void server_context::add_sampled_tokens() {
        }
        else {
            // no speculative decoding
-            slot.drafted_spec_type = COMMON_SPECULATIVE_TYPE_NONE;
            slot.i_batch = batch.n_tokens;

            common_batch_add(batch, slot.sampled, slot.cache_tokens.pos_next(), { slot.id }, true);
@@ -4077,15 +3904,10 @@ void server_context::batch_pending_prompt(const int32_t n_ubatch, const int32_t
                slot.cache_tokens.keep_first(slot.n_past);
                int p0 = (int)system_tokens.size() + slot.n_past;
                p0 = system_tokens.size() + slot.cache_tokens.pos_next();
-                auto * ctx_companion = slot.spec ? common_speculative_get_companion_ctx(slot.spec) : nullptr;
-                const bool target_trimmed = llama_kv_cache_seq_rm(ctx, slot.id, p0, -1);
-                const bool companion_trimmed = ctx_companion == nullptr || llama_kv_cache_seq_rm(ctx_companion, slot.id, p0, -1);
-                if (!target_trimmed || !companion_trimmed) {
+                const bool trimmed = common_speculative_trim_sequence(slot.spec, ctx, slot.id, p0);
+                if (!trimmed) {
                    // could not partially delete (likely using a non-Transformer model)
-                    llama_kv_cache_seq_rm(ctx, slot.id, -1, -1);
-                    if (ctx_companion != nullptr) {
-                        llama_kv_cache_seq_rm(ctx_companion, slot.id, -1, -1);
-                    }
+                    common_speculative_clear_sequence_kv(slot.spec, ctx, slot.id);

                    p0 = (int)system_tokens.size();
                    if (p0 != 0) {
@@ -4122,7 +3944,7 @@ void server_context::batch_pending_prompt(const int32_t n_ubatch, const int32_t
                    llama_pos p1 = slot.cache_tokens.pos_next() + slot.n_past_prompt - slot.n_past; // add offset to prompt
                    server_mtp_warmup mtp_media_warmup {
                        ctx,
-                        slot.has_mtp && slot.spec ? &slot : nullptr,
+                        slot.uses_mtp() && slot.spec ? &slot : nullptr,
                    };
                    mtmd_helper_eval_batch_callback mtp_media_callback =
                        mtp_media_warmup.slot ? server_mtp_media_warmup_callback : nullptr;
@@ -4268,113 +4090,6 @@ void server_context::extend_context(const int32_t n_tokens) {
    }
 }

-// Restore recurrent state and re-decode accepted tokens after speculative-decode rejection.
-static void restore_speculative_checkpoint(
-        server_slot & slot, llama_context * ctx, llama_model * model,
-        common_speculative_type spec_type_used,
-    llama_token sampled_before,
-    const std::vector<llama_token> & ids, int n_draft,
-    const std::vector<float> & spec_feature_rows_pre, int32_t spec_n_past_base) {
-    if (slot.spec_ckpt.per_step_enabled) {
-        const int step = (int)ids.size() - 1;
-        llama_spec_ckpt_restore(ctx, slot.id, slot.spec_ckpt.n_past, step);
-
-        if (slot.spec_ckpt.sampler) {
-            common_sampler_clone(slot.spec_ckpt.sampler, slot.ctx_sampling);
-        }
-        for (llama_token id : ids) {
-            common_sampler_accept(slot.ctx_sampling, ctx, id, true);
-        }
-
-        // Update speculative target features using rows collected before checkpoint restore.
-        if (server_speculative_has_target_features(slot.params.speculative) && !spec_feature_rows_pre.empty()) {
-            if (!common_speculative_commit_accepted_hidden_rows(
-                    slot.spec,
-                    spec_type_used,
-                    slot.id,
-                    spec_n_past_base,
-                    sampled_before,
-                    ids,
-                    spec_feature_rows_pre)) {
-                common_speculative_clear_sequence_hidden(slot.spec, slot.id);
-            } else if (spec_type_used != COMMON_SPECULATIVE_TYPE_MTP) {
-                SLT_DBG(slot, "%s", "synced MTP target hidden state from accepted-prefix rows after per-step restore");
-            }
-        }
-
-        SLT_DBG(slot, "per-step restore: step=%d (rejected %d drafts)\n",
-            step, (int)(n_draft - (ids.size() - 1)));
-    } else {
-        // Restore pre-speculation recurrent state then re-decode accepted tokens.
-        llama_spec_ckpt_restore(ctx, slot.id, slot.spec_ckpt.n_past, 0);
-
-        if (slot.spec_ckpt.sampler) {
-            common_sampler_clone(slot.spec_ckpt.sampler, slot.ctx_sampling);
-        }
-
-        if (!ids.empty()) {
-            // Re-decode to advance recurrent state to the accepted position.
-            const int n_re = (int)ids.size();
-            llama_batch re_batch = llama_batch_init(n_re, 0, 1);
-            common_batch_add(re_batch, slot.spec_ckpt.sampled, slot.spec_ckpt.n_past, { slot.id }, n_re == 1);
-            for (int j = 0; j < n_re - 1; j++) {
-                common_batch_add(re_batch, ids[j], slot.spec_ckpt.n_past + 1 + j, { slot.id }, j == n_re - 2);
-            }
-
-            if (slot.has_mtp) {
-                for (int j = 0; j < re_batch.n_tokens; j++) {
-                    re_batch.logits[j] = true;
-                }
-                llama_set_embeddings(ctx, true);
-            }
-
-            const int ret = llama_decode(ctx, re_batch);
-            if (ret != 0) {
-                SLT_ERR(slot, "failed to re-decode accepted tokens after checkpoint restore: %d\n", ret);
-            }
-            if (server_speculative_has_target_features(slot.params.speculative)) {
-                const int n_accepted = (int)ids.size();
-                std::vector<int32_t> redecoded_indices(n_accepted);
-                for (int j = 0; j < n_accepted; ++j) {
-                    redecoded_indices[j] = j;
-                }
-
-                server_dflash_contract_log_accept(
-                    slot,
-                    spec_type_used,
-                    "restore",
-                    true,
-                    n_draft,
-                    ids,
-                    slot.spec_ckpt.n_past,
-                    redecoded_indices);
-
-                if (!common_speculative_commit_accepted_output(
-                        slot.spec,
-                        ctx,
-                        spec_type_used,
-                        slot.id,
-                        slot.spec_ckpt.n_past,
-                        sampled_before,
-                        ids,
-                        redecoded_indices)) {
-                    common_speculative_clear_sequence_hidden(slot.spec, slot.id);
-                }
-            }
-
-            for (llama_token id : ids) {
-                common_sampler_accept(slot.ctx_sampling, ctx, id, true);
-            }
-
-            llama_batch_free(re_batch);
-            SLT_DBG(slot, "spec checkpoint restored: re-decoded %d tokens (rejected %d drafts)\n",
-                n_re, (int)(n_draft - (ids.size() - 1)));
-        }
-    }
-
-    discard_speculative_checkpoint(slot, ctx);
-}
-
 void server_context::speculative_decoding_accept() {
    for (auto& slot : slots) {
        if (slot.state != SLOT_STATE_PROCESSING || slot.i_batch_dft.empty()) {
@ -4382,7 +4097,6 @@ void server_context::speculative_decoding_accept() {
        }

        const llama_token sampled_before = slot.sampled;
-        const common_speculative_type spec_type_used = slot.drafted_spec_type;
        size_t n_draft = slot.drafted.size();

        slot.ctx_sampling->to_generated_text = &slot.generated_text;
@ -4412,28 +4126,15 @@ void server_context::speculative_decoding_accept() {
            continue;
        }

-        const bool any_rejected = (ids.size() - 1) < n_draft;
-        int32_t spec_n_past_base = 0;
-        std::vector<float> spec_feature_rows_pre;
        std::vector<int32_t> accepted_output_indices;
        if (server_speculative_has_target_features(slot.params.speculative)) {
-            const int32_t n_pre_spec_tokens = slot.cache_tokens.n_tokens() - (int32_t)(slot.drafted.size() + 1);
-            spec_n_past_base = slot.cache_tokens.pos_next(n_pre_spec_tokens);
-
            if (!ids.empty()) {
                accepted_output_indices.assign(slot.i_batch_dft.begin(), slot.i_batch_dft.begin() + ids.size());
            }
-
-            if (any_rejected && slot.spec_ckpt.valid && !accepted_output_indices.empty()) {
-                if (!common_speculative_copy_output_hidden_rows(slot.spec, ctx, accepted_output_indices, spec_feature_rows_pre)) {
-                    spec_feature_rows_pre.clear();
-                }
-            }
        }

        slot.i_batch_dft.clear();
        slot.drafted.clear();
-        slot.drafted_spec_type = COMMON_SPECULATIVE_TYPE_NONE;

        slot.n_past += ids.size();
        slot.n_decoded += ids.size();
@ -4443,11 +4144,9 @@ void server_context::speculative_decoding_accept() {
        // update how many tokens out of those tested were accepted
        slot.n_draft_accepted += ids.size() - 1;

-        // inform the speculative decoding about the number of accepted tokens
-        common_speculative_accept(slot.spec, ids.size() - 1);
-
        // rollback to the state before sampling the draft tokens
        slot.cache_tokens.keep_first(slot.cache_tokens.n_tokens() - n_draft);
+        const llama_pos spec_pos_base = slot.cache_tokens.pos_next();

        // add accepted tokens to the prompt
        for (auto it = ids.begin(); it != ids.end() - 1; ++it) {
@ -4456,39 +4155,34 @@ void server_context::speculative_decoding_accept() {
        slot.sampled = ids.back(); // last accepted token
        slot.n_past = slot.cache_tokens.n_tokens();

-        // for recurrent/hybrid models: if any drafts were rejected, restore recurrent state
-        if (any_rejected && slot.spec_ckpt.valid) {
-            restore_speculative_checkpoint(slot, ctx, model, spec_type_used, sampled_before, ids, n_draft, spec_feature_rows_pre, spec_n_past_base);
-        } else {
-            if (server_speculative_has_target_features(slot.params.speculative) && !accepted_output_indices.empty()) {
-                server_dflash_contract_log_accept(
-                        slot,
-                        spec_type_used,
-                        "direct",
-                        false,
-                        n_draft,
-                        ids,
-                        spec_n_past_base,
-                        accepted_output_indices);
+        const common_speculative_type spec_type_used = common_speculative_current_type(slot.spec);
+        const bool any_rejected = (ids.size() - 1) < n_draft;
+        const common_speculative_checkpoint * ckpt = common_speculative_get_checkpoint(slot.spec);
+        const bool will_restore = any_rejected && ckpt != nullptr && ckpt->valid;

-                if (!common_speculative_commit_accepted_output(
-                        slot.spec,
-                        ctx,
-                        spec_type_used,
-                        slot.id,
-                        spec_n_past_base,
-                        sampled_before,
-                        ids,
-                        accepted_output_indices)) {
-                    common_speculative_clear_sequence_hidden(slot.spec, slot.id);
-                } else if (spec_type_used != COMMON_SPECULATIVE_TYPE_MTP) {
-                    SLT_DBG(slot, "%s", "synced MTP target hidden state from accepted-prefix rows");
-                }
-            }
-            llama_kv_cache_seq_rm(ctx, slot.id, slot.cache_tokens.pos_next(slot.n_past), -1);
-            discard_speculative_checkpoint(slot, ctx);
+        if (server_speculative_has_target_features(slot.params.speculative) && !accepted_output_indices.empty()) {
+            server_dflash_contract_log_accept(
+                    slot,
+                    spec_type_used,
+                    will_restore ? "restore" : "direct",
+                    any_rejected,
+                    n_draft,
+                    ids,
+                    spec_pos_base,
+                    accepted_output_indices);
        }

+        common_speculative_commit(
+            slot.spec,
+            ctx,
+            slot.ctx_sampling,
+            slot.id,
+            sampled_before,
+            ids,
+            n_draft,
+            spec_pos_base,
+            accepted_output_indices);
+
        for (size_t i = 0; i < ids.size(); ++i) {
            completion_token_output result;

@ -4911,7 +4605,7 @@ void server_context::process_batch_tokens(int32_t & n_batch) {

            if (slot.n_decoded == 0 && slot.can_speculate()) {
                static const llama_tokens empty_prompt;
-                const llama_tokens & spec_prompt = slot.has_mtp && !slot.params.speculative.has_composite_stage_chain()
+                const llama_tokens & spec_prompt = slot.uses_mtp() && !slot.params.speculative.has_composite_stage_chain()
                    ? empty_prompt
                    : slot.cache_tokens.get_text_tokens();
                common_speculative_begin(slot.spec, spec_prompt);
@ -4935,7 +4629,7 @@ void server_context::process_batch_tokens(int32_t & n_batch) {
            completion_token_output result;
            const int tok_idx = slot.i_batch - i;

-            if (slot.has_mtp && slot.n_decoded == 0) {
+            if (slot.uses_mtp() && slot.n_decoded == 0) {
                (void) common_speculative_capture_output_hidden(slot.spec, ctx, tok_idx, slot.id, slot.n_past);
            }

@ -5076,10 +4770,25 @@ void server_context::update_slots() {
            if (slot.state != SLOT_STATE_PROCESSING || slot.i_batch_dft.empty()) {
                continue;
            }
-            if (save_speculative_checkpoint(slot, model, ctx, ckpt_mode)) {
-                const char * mode_name = slot.spec_ckpt.per_step_enabled ? "per-step" : "shadow/cpu";
+            const int32_t n_pre_spec_tokens = slot.cache_tokens.n_tokens() - (int32_t) (slot.drafted.size() + 1);
+            const llama_pos n_past_pre_spec = slot.cache_tokens.pos_next(n_pre_spec_tokens);
+            const int max_tokens = (int) slot.drafted.size() + 1;
+            if (common_speculative_before_draft(
+                    slot.spec,
+                    model,
+                    ctx,
+                    slot.ctx_sampling,
+                    slot.sparams,
+                    slot.id,
+                    n_past_pre_spec,
+                    slot.sampled,
+                    max_tokens,
+                    ckpt_mode)) {
+                const common_speculative_checkpoint * ckpt = common_speculative_get_checkpoint(slot.spec);
+                GGML_ASSERT(ckpt != nullptr);
+                const char * mode_name = ckpt->per_step_enabled ? "per-step" : "shadow/cpu";
                SLT_DBG(slot, "spec checkpoint saved (mode=%s), n_past_pre_spec=%d\n",
-                    mode_name, slot.spec_ckpt.n_past);
+                    mode_name, ckpt->n_past);
            } else {
                SLT_WRN(slot, "%s", "failed to save spec checkpoint\n");
            }
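
Review note: the accept path above derives `any_rejected` and `will_restore` from the size of `ids` and from checkpoint validity. The sketch below restates that bookkeeping in isolation. `accept_summary` and `summarize_accept` are hypothetical names used only for illustration and do not appear in the patch; plain `int` stands in for `llama_token`, and `ids` is assumed to hold the accepted drafts followed by one freshly sampled token.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical summary of one verification round (not a type from the patch).
struct accept_summary {
    std::size_t n_accepted_drafts; // drafts the target model agreed with
    std::size_t n_rejected_drafts; // drafts that have to be rolled back
    bool        any_rejected;      // at least one draft was rejected
    bool        will_restore;      // a rejection happened AND a valid checkpoint exists
};

// ids = accepted draft tokens + one newly sampled token, so ids.size() - 1
// drafts were accepted. Assumes n_draft >= ids.size() - 1.
inline accept_summary summarize_accept(const std::vector<int> & ids, std::size_t n_draft, bool ckpt_valid) {
    accept_summary s{};
    s.n_accepted_drafts = ids.empty() ? 0 : ids.size() - 1;
    s.n_rejected_drafts = n_draft - s.n_accepted_drafts;
    s.any_rejected      = s.n_accepted_drafts < n_draft;
    s.will_restore      = s.any_rejected && ckpt_valid;
    return s;
}
```

With four drafts and `ids` of size three, two drafts are accepted and two rejected, so a restore is attempted only when the checkpoint is valid.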
--- a/examples/server/server-context.h
+++ b/examples/server/server-context.h
@ -22,16 +22,6 @@ enum slot_command {
    SLOT_COMMAND_RELEASE,
 };

-struct server_speculative_checkpoint {
-    bool valid = false;
-    bool per_step_enabled = false; // per-step SSM checkpoints active
-    llama_pos n_past = 0;
-    llama_token sampled = LLAMA_TOKEN_NULL;
-    common_sampler * sampler = nullptr; // saved sampler state
-
-    void clear();
-};
-
 struct server_slot {
    int id;
    int id_task = -1;
@ -39,9 +29,6 @@ struct server_slot {

    struct slot_params params;

-    llama_batch batch_spec = {};
-    llama_context * ctx_dft = nullptr;
-
    bool released = false;
    slot_state state = SLOT_STATE_IDLE;
    slot_command command = SLOT_COMMAND_NONE;
@ -138,7 +125,6 @@ struct server_slot {
    // sampling
    llama_token sampled; // in speculative mode, this is the last accepted token
    llama_tokens drafted;
-    common_speculative_type drafted_spec_type = COMMON_SPECULATIVE_TYPE_NONE;

    json json_schema;

@ -173,13 +159,7 @@ struct server_slot {
    // expiring logit bias
    std::vector<common_sampler::elb_state> prev_elb_states;

-    bool has_mtp = false;
-
-    // saves recurrent state before a speculative batch so it can be restored on rejection
-    server_speculative_checkpoint spec_ckpt;
-
    bool spec_prompt_warmup_failed = false;
-
    // speculative decoding stats
    int32_t n_draft_total = 0;      // Total draft tokens generated
    int32_t n_draft_accepted = 0;   // Draft tokens actually accepted
@ -199,6 +179,7 @@ struct server_slot {
    void reset();

    bool need_embd() const;
+    bool uses_mtp() const;

    bool has_budget(gpt_params& global_params);

@ -270,11 +251,6 @@ struct server_context {
    // multimodal
    mtmd_context* mctx = nullptr;

-    // For speculative decoding
-    llama_model* model_draft = nullptr;
-    llama_context* ctx_draft = nullptr;
-    llama_context_params cparams_dft;
-
    int32_t n_ctx; // total context for all clients / slots

    // system prompt
@ -354,7 +330,7 @@ struct server_context {

    void apply_server_biases(server_slot& slot);

-    void request_completion(int id_task, int id_multi, json data, bool infill, bool embedding, server_tokens&& inputs);
+    void request_completion(int id_task, int id_multi, json data, bool infill, bool embedding, server_tokens & inputs);

    void request_cancel(int id_task);

--- a/examples/server/server-cors-proxy.h
+++ b/examples/server/server-cors-proxy.h
@ -0,0 +1,173 @@
+#pragma once
+
+#include "common.h"
+#include "http.h"
+#include <string>
+#include <unordered_set>
+#include <list>
+#include <map>
+#include <algorithm>
+#include <cctype>
+#include <stdexcept>
+
+static std::string to_lower_copy(const std::string & value) {
+    std::string lowered(value.size(), '\0');
+    std::transform(value.begin(), value.end(), lowered.begin(), [](unsigned char c) { return std::tolower(c); });
+    return lowered;
+}
+
+static httplib::Request prepare_proxy_req_header(const std::string & method,
+    const std::string & scheme,
+    const std::string & host,
+    int port,
+    const std::string & path,
+    const std::map<std::string, std::string> & headers,
+    const std::string & body,
+    const httplib::FormFiles & files) {
+        httplib::Request  req;
+        bool has_files = !files.empty();
+        req.form.files = files;
+        std::string effective_body = body;
+        std::string override_content_type;
+        req.method = method;
+        req.path = path;
+        for (const auto & [key, value] : headers) {
+            const auto lowered = to_lower_copy(key);
+            if (lowered == "accept-encoding") {
+                // disable Accept-Encoding to avoid compressed responses
+                continue;
+            }
+            if (lowered == "transfer-encoding") {
+                // the body is already decoded
+                continue;
+            }
+            if (lowered == "content-length") {
+                // let httplib calculate Content-Length from the actual body
+                continue;
+            }
+            if (lowered == "content-type") {
+                if (has_files) {
+                    // we set our own Content-Type with the new boundary
+                    continue;
+                }
+                // when no files but the original request was multipart,
+                // the body is now JSON, so correct the Content-Type
+                if (value.find("multipart/form-data") != std::string::npos) {
+                    override_content_type = "application/json; charset=utf-8";
+                    continue;
+                }
+            }
+            if (lowered == "host") {
+                bool is_default_port = (scheme == "https" && port == 443) || (scheme == "http" && port == 80);
+                req.set_header(key, is_default_port ? host : host + ":" + std::to_string(port));
+            } else {
+                req.set_header(key, value);
+            }
+        }
+        req.body = effective_body;
+        if (!override_content_type.empty()) {
+            req.set_header("Content-Type", override_content_type);
+        }
+        //req.response_handler = response_handler;
+        //req.content_receiver = content_receiver;
+    
+    return req;
+}
+
+static std::string get_param(const httplib::Params & params, const std::string & key, const std::string & def = "") {
+    auto it = params.find(key);
+    if (it != params.end()) {
+        return it->second;
+    }
+    return def;
+}
+
+static void proxy_request(const httplib::Request & req,
+    httplib::Response & res,
+    const std::string & method) {
+    std::string target_url = get_param(req.params, "url");
+    common_http_url parsed_url = common_http_parse_url(target_url);
+    if (parsed_url.host.empty()) {
+        throw std::runtime_error("invalid target URL: missing host");
+    }
+
+    if (parsed_url.path.empty()) {
+        parsed_url.path = "/";
+    }
+
+    if (!parsed_url.password.empty()) {
+        throw std::runtime_error("authentication in target URL is not supported");
+    }
+
+    if (parsed_url.scheme != "http" && parsed_url.scheme != "https") {
+        throw std::runtime_error("unsupported URL scheme in target URL: " + parsed_url.scheme);
+    }
+
+    SRV_INF("proxying %s request to %s://%s:%i%s\n", method.c_str(), parsed_url.scheme.c_str(), parsed_url.host.c_str(), parsed_url.port, parsed_url.path.c_str());
+    std::map<std::string, std::string> headers;
+    for (auto [key, value] : req.headers) {
+        auto new_key = key;
+        if (string_starts_with(new_key, "x-proxy-header-")) {
+            string_replace_all(new_key, "x-proxy-header-", "");
+        }
+        headers[new_key] = value;
+    }
+
+    httplib::Request proxy_req = prepare_proxy_req_header(method,
+        parsed_url.scheme,
+        parsed_url.host,
+        parsed_url.port,
+        parsed_url.path,
+        headers,
+        req.body,
+        req.form.files);
+
+    // Make the proxied request
+    httplib::Result proxy_res;
+    
+    if (parsed_url.scheme == "https") {
+#ifdef CPPHTTPLIB_OPENSSL_SUPPORT
+        httplib::SSLClient cli(parsed_url.host, parsed_url.port);
+        // set timeouts, follow redirects as needed
+        cli.set_connection_timeout(600);
+        cli.set_read_timeout(600);
+        cli.set_write_timeout(600);
+        cli.set_follow_location(true);
+        proxy_res = cli.send(proxy_req);
+#else
+        res.status = 501;
+        res.set_content("HTTPS not supported (build with OpenSSL)", "text/plain");
+        return;
+#endif
+    } else {
+        httplib::Client cli(parsed_url.host, parsed_url.port);
+        cli.set_connection_timeout(600);
+        cli.set_read_timeout(600);
+        cli.set_write_timeout(600);
+        proxy_res = cli.send(std::move(proxy_req));
+    }
+
+    if (!proxy_res) {
+        std::string error_data = "Proxy failed: " + httplib::to_string(proxy_res.error());
+        json final_response{ {"error", error_data} };
+        res.set_content(safe_json_to_str(final_response), "application/json; charset=utf-8");
+        res.status = 500; // transport-level failure: there is no upstream status code to forward
+        return;
+    }
+
+    res.status = proxy_res->status;
+    res.set_content(proxy_res->body, proxy_res->get_header_value("Content-Type"));
+    for (const auto & h : proxy_res->headers) {
+        // skip hop-by-hop headers; header names are case-insensitive
+        const auto lowered = to_lower_copy(h.first);
+        if (lowered != "transfer-encoding" && lowered != "connection") res.set_header(h.first, h.second);
+    }
+}
+
+static void proxy_handler_get(const httplib::Request & req, httplib::Response & res) {
+    proxy_request(req, res, "GET");
+}
+
+static void proxy_handler_post(const httplib::Request & req, httplib::Response & res) {
+    proxy_request(req, res, "POST");
+}
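
Review note: the header handling in `proxy_request` and `prepare_proxy_req_header` comes down to two small rules: strip the `x-proxy-header-` prefix from forwarded header names, and omit the port from `Host` when it is the scheme's default. The sketch below isolates both. `lower_copy`, `strip_proxy_prefix` and `host_header_value` are hypothetical helpers, not functions from the patch. Unlike the patch, which uses `string_replace_all`, this variant removes only a leading prefix and matches it case-insensitively; that is a deliberate substitution, not a description of the patch's behavior.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Lowercase copy, in the spirit of to_lower_copy() from the patch.
inline std::string lower_copy(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return s;
}

// Hypothetical helper: drop a leading "x-proxy-header-" (any case) from a header name.
inline std::string strip_proxy_prefix(const std::string & key) {
    static const std::string prefix = "x-proxy-header-";
    if (key.size() >= prefix.size() && lower_copy(key.substr(0, prefix.size())) == prefix) {
        return key.substr(prefix.size());
    }
    return key;
}

// Host header value: the port is omitted when it is the default for the scheme.
inline std::string host_header_value(const std::string & scheme, const std::string & host, int port) {
    const bool is_default_port = (scheme == "https" && port == 443) || (scheme == "http" && port == 80);
    return is_default_port ? host : host + ":" + std::to_string(port);
}
```

Anchoring the match at the start of the name avoids altering a header that merely contains the prefix somewhere in the middle.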
--- a/examples/server/server.cpp
+++ b/examples/server/server.cpp
@ -2,6 +2,7 @@
 #include "server-context.h"
 #include "server-common.h"
 #include "server-chat.h"
+#include "server-cors-proxy.h"
 #include "chat.h"

 #include "common.h"
@ -329,6 +330,18 @@ struct server_response_reader {
        return !cancelled && received_count < id_tasks.size();
    }

+    // cancel-cascade fix: true only if one of THIS reader's tasks is on a
+    // slot (the active decode). Used to gate llama_decode_stop() so a queued/
+    // deferred task's disconnect cannot abort another task's active decode via
+    // the process-global stop_internal_decode flag. Best-effort cross-thread
+    // read (slots are not resized at runtime; same race class as the global).
+    bool any_task_on_slot() const {
+        for (const auto & slot : ctx_server.slots) {
+            if (slot.is_processing() && id_tasks.count(slot.id_task)) return true;
+        }
+        return false;
+    }
+
    // return nullptr if should_stop() is true before receiving a result
    // note: if one error is received, it will stop further processing and return error result
    server_task_result_ptr next(const std::function<bool()>& should_stop) {
@ -1020,7 +1033,8 @@ int main(int argc, char ** argv) {
                {"vision", ctx_server.chat_params.allow_image},
                {"audio",  ctx_server.chat_params.allow_audio},
            } },
-            { "n_ctx",                       ctx_server.n_ctx }
+            { "n_ctx",                       ctx_server.n_ctx },
+            { "cors_proxy_enabled",          ctx_server.params_base.webui_mcp_proxy},

        };

@ -1125,7 +1139,7 @@ int main(int argc, char ** argv) {
                // non-stream, wait for the results
                auto all_results = rd->wait_for_all(is_connection_closed);
                if (all_results.is_terminated) {
-                    llama_decode_stop(); // send a signal to stop decode process
+                    if (rd->any_task_on_slot()) llama_decode_stop(); // cancel-cascade fix: stop only if OUR task is the active decode
                    return; // connection is closed
                }
                else if (all_results.error) {
@ -1139,8 +1153,8 @@ int main(int argc, char ** argv) {
                        arr.push_back(res->to_json());
                    }
                    // if single request, return single object instead of array
-                    res_ok(res, arr.size() == 1 ? arr[0] : arr);              
-                }                       
+                    res_ok(res, arr.size() == 1 ? arr[0] : arr);
+                }
            }
            else {
                // in streaming mode, the first error must be treated as non-stream response
@ -1148,7 +1162,7 @@ int main(int argc, char ** argv) {
                // ref: https://github.com/ggml-org/llama.cpp/pull/16486#discussion_r2419657309
                server_task_result_ptr first_result = rd->next(is_connection_closed);
                if (first_result == nullptr) {
-                    llama_decode_stop(); // send a signal to stop decode process
+                    if (rd->any_task_on_slot()) llama_decode_stop(); // cancel-cascade fix: stop only if OUR task is the active decode
                    return; // connection is closed
                }
                else if (first_result->is_error()) {
@ -1356,10 +1370,11 @@ int main(int argc, char ** argv) {
    const auto handle_infill = [&ctx_server, &handle_completions_impl](const httplib::Request & req, httplib::Response & res) {
        log_prompt(ctx_server.params_base, json::parse(req.body));
        json data = json::parse(req.body);
-        const int id_task = ctx_server.queue_tasks.get_new_id();
-        server_tokens token; // dummy tokens
-        ctx_server.queue_results.add_waiting_task_id(id_task);
-        ctx_server.request_completion(id_task, -1, data, true, false, std::move(token));
+        // avoid double submits: handle_completions_impl() below already creates and queues the task
+        //const int id_task = ctx_server.queue_tasks.get_new_id();
+        //server_tokens token; // dummy tokens
+        //ctx_server.queue_results.add_waiting_task_id(id_task);
+        //ctx_server.request_completion(id_task, -1, data, true, false, token);
        std::vector<raw_buffer> files; // dummy
        handle_completions_impl(
            SERVER_TASK_TYPE_INFILL,
@ -1477,7 +1492,7 @@ int main(int argc, char ** argv) {

        // collect results
        if (all_results.is_terminated) {
-            llama_decode_stop();
+            if (rd.any_task_on_slot()) llama_decode_stop(); // cancel-cascade fix: stop only if OUR task is the active decode
            return; // connection is closed
        }
        else if (all_results.error) {
@ -2108,6 +2123,16 @@ int main(int argc, char ** argv) {
 	}
 #endif
    }
+
+    // CORS proxy (EXPERIMENTAL, only used by the Web UI for MCP)
+    if (params.webui_mcp_proxy) {
+        SRV_WRN("%s", "-----------------\n");
+        SRV_WRN("%s", "CORS proxy is enabled, do not expose server to untrusted environments\n");
+        SRV_WRN("%s", "This feature is EXPERIMENTAL and may be removed or changed in future versions\n");
+        SRV_WRN("%s", "-----------------\n");
+        svr->Get("/cors-proxy", proxy_handler_get);
+        svr->Post("/cors-proxy", proxy_handler_post);
+    }
    //
    // Start the server
    //
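
Review note: the cancel-cascade gate added in `server_response_reader::any_task_on_slot()` can be read as a plain membership test over the slots. The sketch below restates it with stand-in types. `slot_view` is hypothetical and carries only the two fields the check needs, and `std::set<int>` is an assumption about the container behind `id_tasks`.

```cpp
#include <set>
#include <vector>

// Hypothetical, minimal view of a server slot: which task it runs and whether it is decoding.
struct slot_view {
    int  id_task;
    bool processing;
};

// True only if one of the reader's own tasks currently occupies a processing slot.
// A queued or deferred task therefore never qualifies, so its disconnect would not
// trigger the process-global decode stop.
inline bool any_task_on_slot(const std::vector<slot_view> & slots, const std::set<int> & id_tasks) {
    for (const auto & slot : slots) {
        if (slot.processing && id_tasks.count(slot.id_task) > 0) {
            return true;
        }
    }
    return false;
}
```

A caller would guard the stop signal with this predicate, as the patched handlers do with `if (rd->any_task_on_slot()) llama_decode_stop();`.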
--- a/examples/server/webui/dist/index.html
+++ b/examples/server/webui/dist/index.html
--- a/examples/server/webui/package-lock.json
+++ b/examples/server/webui/package-lock.json
@ -6603,20 +6603,6 @@
      "dev": true,
      "license": "ISC"
    },
-    "node_modules/yaml": {
-      "version": "2.7.0",
-      "resolved": "https://registry.npmjs.org/yaml/-/yaml-2.7.0.tgz",
-      "integrity": "sha512-+hSoy/QHluxmC9kCIJyL/uyFmLmc+e5CFR5Wa+bpIhIj85LVb9ZH2nVnqrHoSvKogwODv0ClqZkmiSSaIH5LTA==",
-      "license": "ISC",
-      "optional": true,
-      "peer": true,
-      "bin": {
-        "yaml": "bin.mjs"
-      },
-      "engines": {
-        "node": ">= 14"
-      }
-    },
    "node_modules/yocto-queue": {
      "version": "0.1.0",
      "resolved": "https://registry.npmjs.org/yocto-queue/-/yocto-queue-0.1.0.tgz",
--- a/examples/server/webui_llamacpp/.prettierignore
+++ b/examples/server/webui_llamacpp/.prettierignore
@ -7,3 +7,12 @@ bun.lockb

 # Miscellaneous
 /static/
+dist/
+.svelte-kit/
+build/
+
+# Build output
+/dist/
+/build/
+/.svelte-kit/
+test-results
--- a/examples/server/webui_llamacpp/.storybook/decorators/ModeWatcherDecorator.svelte
+++ b/examples/server/webui_llamacpp/.storybook/decorators/ModeWatcherDecorator.svelte
--- a/examples/server/webui_llamacpp/.storybook/decorators/TooltipProviderDecorator.svelte
+++ b/examples/server/webui_llamacpp/.storybook/decorators/TooltipProviderDecorator.svelte
@ -1,5 +1,5 @@
 <script lang="ts">
-	import * as Tooltip from '../src/lib/components/ui/tooltip';
+	import * as Tooltip from '../../src/lib/components/ui/tooltip';

 	interface Props {
 		children: any;
--- a/examples/server/webui_llamacpp/.storybook/main.ts
+++ b/examples/server/webui_llamacpp/.storybook/main.ts
@ -1,17 +1,24 @@
 import type { StorybookConfig } from '@storybook/sveltekit';
+import { dirname, resolve } from 'path';
+import { fileURLToPath } from 'url';
+
+const __dirname = dirname(fileURLToPath(import.meta.url));

 const config: StorybookConfig = {
-	stories: ['../src/**/*.mdx', '../src/**/*.stories.@(js|ts|svelte)'],
+	stories: ['../tests/stories/**/*.mdx', '../tests/stories/**/*.stories.@(js|ts|svelte)'],
 	addons: [
 		'@storybook/addon-svelte-csf',
 		'@chromatic-com/storybook',
-		'@storybook/addon-docs',
+		'@storybook/addon-vitest',
 		'@storybook/addon-a11y',
-		'@storybook/addon-vitest'
+		'@storybook/addon-docs'
 	],
-	framework: {
-		name: '@storybook/sveltekit',
-		options: {}
+	framework: '@storybook/sveltekit',
+	viteFinal: async (config) => {
+		config.server = config.server || {};
+		config.server.fs = config.server.fs || {};
+		config.server.fs.allow = [...(config.server.fs.allow || []), resolve(__dirname, '../tests')];
+		return config;
 	}
 };
 export default config;
--- a/examples/server/webui_llamacpp/.storybook/preview.ts
+++ b/examples/server/webui_llamacpp/.storybook/preview.ts
@ -1,28 +1,28 @@
 import type { Preview } from '@storybook/sveltekit';
 import '../src/app.css';
-import ModeWatcherDecorator from './ModeWatcherDecorator.svelte';
-import TooltipProviderDecorator from './TooltipProviderDecorator.svelte';
+import ModeWatcherDecorator from './decorators/ModeWatcherDecorator.svelte';
+import TooltipProviderDecorator from './decorators/TooltipProviderDecorator.svelte';

 const preview: Preview = {
 	parameters: {
-        controls: {
+		controls: {
 			matchers: {
 				color: /(background|color)$/i,
 				date: /Date$/i
 			}
 		},

-        backgrounds: {
-			disable: true
+		backgrounds: {
+			disabled: true
 		},

-        a11y: {
-            // 'todo' - show a11y violations in the test UI only
-            // 'error' - fail CI on a11y violations
-            // 'off' - skip a11y checks entirely
-            test: 'todo'
-        }
-    },
+		a11y: {
+			// 'todo' - show a11y violations in the test UI only
+			// 'error' - fail CI on a11y violations
+			// 'off' - skip a11y checks entirely
+			test: 'todo'
+		}
+	},
 	decorators: [
 		(story) => ({
 			Component: ModeWatcherDecorator,
--- a/examples/server/webui_llamacpp/.storybook/vitest.setup.ts
+++ b/examples/server/webui_llamacpp/.storybook/vitest.setup.ts
@ -1,4 +1,4 @@
-import * as a11yAddonAnnotations from "@storybook/addon-a11y/preview";
+import * as a11yAddonAnnotations from '@storybook/addon-a11y/preview';
 import { setProjectAnnotations } from '@storybook/sveltekit';
 import * as previewAnnotations from './preview';
 import { beforeAll } from 'vitest';
--- a/examples/server/webui_llamacpp/README.md
+++ b/examples/server/webui_llamacpp/README.md
@ -1,66 +1,688 @@
-# llama.cpp Web UI
+# llama-ui

-A modern, feature-rich web interface for llama.cpp built with SvelteKit. This UI provides an intuitive chat interface with advanced file handling, conversation management, and comprehensive model interaction capabilities.
+A modern, feature-rich web interface for llama-server built with SvelteKit. This UI provides an intuitive chat interface with advanced file handling, conversation management, and comprehensive model interaction capabilities.
+
+Llama UI supports two server operation modes:
+
+- **MODEL mode** - Single model operation (standard llama-server)
+- **ROUTER mode** - Multi-model operation with dynamic model loading/unloading
+
+---
+
+## Table of Contents
+
+- [Features](#features)
+- [Getting Started](#getting-started)
+- [Tech Stack](#tech-stack)
+- [Build Pipeline](#build-pipeline)
+- [Architecture](#architecture)
+- [Data Flows](#data-flows)
+- [Architectural Patterns](#architectural-patterns)
+- [Testing](#testing)
+
+---

 ## Features

- **Modern Chat Interface** - Clean, responsive design with dark/light mode
- **File Attachments** - Support for images, text files, PDFs, and audio with rich previews and drag-and-drop support
- **Conversation Management** - Create, edit, branch, and search conversations
- **Advanced Markdown** - Code highlighting, math formulas (KaTeX), and content blocks
- **Reasoning Content** - Support for models with thinking blocks
 - **Keyboard Shortcuts** - Keyboard navigation (Shift+Ctrl/Cmd+O for new chat, Shift+Ctrl/Cmd+E for edit conversation, Shift+Ctrl/Cmd+D for delete conversation, Ctrl/Cmd+K for search, Ctrl/Cmd+V for paste, Ctrl/Cmd+B for opening/collapsing sidebar)
- **Request Tracking** - Monitor processing with slots endpoint integration
- **UI Testing** - Storybook component library with automated tests
+### Chat Interface

-## Development
+- **Streaming responses** with real-time updates
+- **Reasoning content** - Support for models with thinking/reasoning blocks
+- **Dark/light theme** with system preference detection
+- **Responsive design** for desktop and mobile

-Install dependencies:
+### File Attachments
+
+- **Images** - JPEG, PNG, GIF, WebP, SVG (with PNG conversion)
+- **Documents** - PDF (text extraction or image conversion for vision models)
+- **Audio** - MP3, WAV for audio-capable models
+- **Text files** - Source code, markdown, and other text formats
+- **Drag-and-drop** and paste support with rich previews
+
+### Conversation Management
+
+- **Branching** - Branch conversations at any point by editing messages or regenerating responses, then navigate between branches
+- **Regeneration** - Regenerate responses with optional model switching (ROUTER mode)
+- **Import/Export** - JSON format for backup and sharing
+- **Search** - Find conversations by title or content
+
+### Advanced Rendering
+
+- **Syntax highlighting** - Code blocks with language detection
+- **Math formulas** - KaTeX rendering for LaTeX expressions
+- **Markdown** - Full GFM support with tables, lists, and more
+
+### Multi-Model Support (ROUTER mode)
+
+- **Model selector** with Loaded/Available groups
+- **Automatic loading** - Models load on selection
+- **Modality validation** - Prevents sending images to non-vision models
+- **LRU unloading** - Server auto-manages model cache
+
+### Keyboard Shortcuts
+
+| Shortcut           | Action               |
+| ------------------ | -------------------- |
+| `Shift+Ctrl/Cmd+O` | New chat             |
+| `Shift+Ctrl/Cmd+E` | Edit conversation    |
+| `Shift+Ctrl/Cmd+D` | Delete conversation  |
+| `Ctrl/Cmd+K`       | Search conversations |
+| `Ctrl/Cmd+B`       | Toggle sidebar       |
+
+### Developer Experience
+
+- **Request tracking** - Monitor token generation with `/slots` endpoint
+- **Storybook** - Component library with visual testing
+- **Hot reload** - Instant updates during development
+
+---
+
+## Getting Started
+
+### Prerequisites
+
+- **Node.js** 18+ (20+ recommended)
+- **npm** 9+
+- **llama-server** running locally (for API access)
+
+### 1. Install Dependencies

 ```bash
+cd tools/ui
 npm install
 ```

-Start the development server + Storybook:
+### 2. Start llama-server
+
+In a separate terminal, start the backend server:
+
+```bash
+# Single model (MODEL mode)
+./llama-server -m model.gguf
+
+# Multi-model (ROUTER mode)
+./llama-server --models-dir /path/to/models
+```
+
+### 3. Start Development Servers

 ```bash
 npm run dev
 ```

-This will start both the SvelteKit dev server and Storybook on port 6006.
+This starts:

-## Building
+- **Vite dev server** at `http://localhost:5173` - The main UI frontend app
+- **Storybook** at `http://localhost:6006` - Component documentation

-Create a production build:
+The Vite dev server proxies API requests to `SERVER_ORIGIN`, falling back to llama-server's default port `8080`:
+
+```typescript
+// vite.config.ts proxy configuration
+proxy: {
+	'/v1': SERVER_ORIGIN,
+	'/props': SERVER_ORIGIN,
+	'/models': SERVER_ORIGIN,
+	'/tools': SERVER_ORIGIN,
+	'/slots': SERVER_ORIGIN,
+	'/cors-proxy': SERVER_ORIGIN
+},
+```
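The fallback described above can be sketched as follows. This is a minimal illustration, not the project's actual `vite.config.ts`: it assumes the origin is read from an environment variable named `SERVER_ORIGIN`, that the default host is `localhost`, and the helpers `resolveServerOrigin` and `buildProxyTable` are hypothetical names introduced here.

```typescript
// Assumption: default origin is localhost on llama-server's default port 8080.
const DEFAULT_SERVER_ORIGIN = 'http://localhost:8080';

// Paths taken from the proxy configuration shown above.
const PROXIED_PATHS = ['/v1', '/props', '/models', '/tools', '/slots', '/cors-proxy'];

// Hypothetical helper: use the configured origin, or fall back to the default.
function resolveServerOrigin(env: Record<string, string | undefined>): string {
	const configured = (env['SERVER_ORIGIN'] ?? '').trim();
	return configured.length > 0 ? configured : DEFAULT_SERVER_ORIGIN;
}

// Hypothetical helper: map every proxied path to the same origin.
function buildProxyTable(origin: string): Record<string, string> {
	const table: Record<string, string> = {};
	for (const path of PROXIED_PATHS) {
		table[path] = origin;
	}
	return table;
}

const proxyTable = buildProxyTable(resolveServerOrigin({}));
console.log(proxyTable['/v1']); // http://localhost:8080
```

Keeping the path list in one place means a new backend endpoint only needs one extra entry to be reachable from the dev server.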
+
+### Development Workflow
+
+1. Open `http://localhost:5173` in your browser
+2. Make changes to `.svelte`, `.ts`, or `.css` files
+3. Changes hot-reload instantly
+4. Use Storybook at `http://localhost:6006` for isolated component development
+
+---
+
+## Tech Stack
+
+| Layer             | Technology                      | Purpose                                                  |
+| ----------------- | ------------------------------- | -------------------------------------------------------- |
+| **Framework**     | SvelteKit + Svelte 5            | Reactive UI with runes (`$state`, `$derived`, `$effect`) |
+| **UI Components** | shadcn-svelte + bits-ui         | Accessible, customizable component library               |
+| **Styling**       | TailwindCSS 4                   | Utility-first CSS with design tokens                     |
+| **Database**      | IndexedDB (Dexie)               | Client-side storage for conversations and messages       |
+| **Build**         | Vite                            | Fast bundling with static adapter                        |
+| **Testing**       | Playwright + Vitest + Storybook | E2E, unit, and visual testing                            |
+| **Markdown**      | remark + rehype                 | Markdown processing with KaTeX and syntax highlighting   |
+
+### Key Dependencies
+
+```json
+{
+	"svelte": "^5.0.0",
+	"bits-ui": "^2.8.11",
+	"dexie": "^4.0.11",
+	"pdfjs-dist": "^5.4.54",
+	"highlight.js": "^11.11.1",
+	"rehype-katex": "^7.0.1"
+}
+```
+
+---
+
+## Build Pipeline
+
+### Development Build
+
+```bash
+npm run dev
+```
+
+Runs Vite in development mode with:
+
+- Hot Module Replacement (HMR)
+- Source maps
+- Proxy to llama-server
+
+### Production Build

 ```bash
 npm run build
 ```

-The build outputs static files to `../public` directory for deployment with llama.cpp server.
+The build process:

-## Testing
+1. **Vite Build** - Bundles all TypeScript, Svelte, and CSS
+2. **Static Adapter** - Outputs to `../../build/tools/ui/dist` (llama-server's static file directory)
+3. **Post-Build Script** - Cleans up intermediate files
+4. **Custom Plugin** - Creates `index.html` with:
+   - Inlined favicon as base64
+   - GZIP compression (level 9)
+   - Deterministic output (zeroed timestamps)

-Run the test suite:
-
-```bash
-# E2E tests
-npm run test:e2e
-
-# Unit tests
-npm run test:unit
-
-# UI tests
-npm run test:ui
-
-# All tests
-npm run test
+```text
+tools/ui/        →  build  →  build/tools/ui/dist/
+├── src/                                 ├── index.html  (served by llama-server)
+├── static/                              └── (favicon inlined)
+└── ...
 ```

+### SvelteKit Configuration
+
+```javascript
+// svelte.config.js
+adapter: adapter({
+  pages: '../../build/tools/ui/dist',      // Output directory
+  assets: '../../build/tools/ui/dist',     // Static assets
+  fallback: 'index.html',  // SPA fallback
+  strict: true
+}),
+output: {
+  bundleStrategy: 'inline' // Single-file bundle
+}
+```
+
+### Integration with llama-server
+
+llama-ui is embedded directly into the llama-server binary:
+
+1. `npm run build` outputs `index.html` to `build/tools/ui/dist/`
+2. llama-server compiles this into the binary at build time
+3. When accessing `/`, llama-server serves the bundled HTML
+
+This results in a **single portable binary** with the full Llama UI included.
+
+---
+
 ## Architecture

- **Framework**: SvelteKit with Svelte 5 runes
- **Components**: ShadCN UI + bits-ui design system
- **Database**: IndexedDB with Dexie for local storage
- **Build**: Static adapter for deployment with llama.cpp server
- **Testing**: Playwright (E2E) + Vitest (unit) + Storybook (components)
+Llama UI follows a layered architecture with unidirectional data flow:
+
+```text
+Routes → Components → Hooks → Stores → Services → Storage/API
+```
+
+### High-Level Architecture
+
+See: [`docs/architecture/high-level-architecture-simplified.md`](docs/architecture/high-level-architecture-simplified.md)
+
+```mermaid
+flowchart TB
+    subgraph Routes["📍 Routes"]
+        R1["/ (Welcome)"]
+        R2["/chat/[id]"]
+        RL["+layout.svelte"]
+    end
+
+    subgraph Components["🧩 Components"]
+        C_Sidebar["ChatSidebar"]
+        C_Screen["ChatScreen"]
+        C_Form["ChatForm"]
+        C_Messages["ChatMessages"]
+        C_ModelsSelector["ModelsSelector"]
+        C_Settings["ChatSettings"]
+    end
+
+    subgraph Stores["🗄️ Stores"]
+        S1["chatStore"]
+        S2["conversationsStore"]
+        S3["modelsStore"]
+        S4["serverStore"]
+        S5["settingsStore"]
+    end
+
+    subgraph Services["⚙️ Services"]
+        SV1["ChatService"]
+        SV2["ModelsService"]
+        SV3["PropsService"]
+        SV4["DatabaseService"]
+    end
+
+    subgraph Storage["💾 Storage"]
+        ST1["IndexedDB"]
+        ST2["LocalStorage"]
+    end
+
+    subgraph APIs["🌐 llama-server"]
+        API1["/v1/chat/completions"]
+        API2["/props"]
+        API3["/models/*"]
+    end
+
+    R1 & R2 --> C_Screen
+    RL --> C_Sidebar
+    C_Screen --> C_Form & C_Messages & C_Settings
+    C_Screen --> S1 & S2
+    C_ModelsSelector --> S3 & S4
+    S1 --> SV1 & SV4
+    S3 --> SV2 & SV3
+    SV4 --> ST1
+    SV1 --> API1
+    SV2 --> API3
+    SV3 --> API2
+```
+
+### Layer Breakdown
+
+#### Routes (`src/routes/`)
+
+- **`/`** - Welcome screen, creates new conversation
+- **`/chat/[id]`** - Active chat interface
+- **`+layout.svelte`** - Sidebar, navigation, global initialization
+
+#### Components (`src/lib/components/`)
+
+Components are organized in `app/` (application-specific) and `ui/` (shadcn-svelte primitives).
+
+**Chat Components** (`app/chat/`):
+
+| Component          | Responsibility                                                              |
+| ------------------ | --------------------------------------------------------------------------- |
+| `ChatScreen/`      | Main chat container, coordinates message list, input form, and attachments  |
+| `ChatForm/`        | Message input textarea with file upload, paste handling, keyboard shortcuts |
+| `ChatMessages/`    | Message list with branch navigation, regenerate/continue/edit actions       |
+| `ChatAttachments/` | File attachment previews, drag-and-drop, PDF/image/audio handling           |
+| `ChatSettings/`    | Parameter sliders (temperature, top-p, etc.) with server default sync       |
+| `ChatSidebar/`     | Conversation list, search, import/export, navigation                        |
+
+**Dialog Components** (`app/dialogs/`):
+
+| Component                       | Responsibility                                           |
+| ------------------------------- | -------------------------------------------------------- |
+| `DialogChatSettings`            | Full-screen settings configuration                       |
+| `DialogModelInformation`        | Model details (context size, modalities, parallel slots) |
+| `DialogChatAttachmentPreview`   | Full preview for images, PDFs (text or page view), code  |
+| `DialogConfirmation`            | Generic confirmation for destructive actions             |
+| `DialogConversationTitleUpdate` | Edit conversation title                                  |
+
+**Server/Model Components** (`app/server/`, `app/models/`):
+
+| Component           | Responsibility                                            |
+| ------------------- | --------------------------------------------------------- |
+| `ServerErrorSplash` | Error display when server is unreachable                  |
+| `ModelsSelector`    | Model dropdown with Loaded/Available groups (ROUTER mode) |
+
+**Shared UI Components** (`app/misc/`):
+
+| Component                        | Responsibility                                                   |
+| -------------------------------- | ---------------------------------------------------------------- |
+| `MarkdownContent`                | Markdown rendering with KaTeX, syntax highlighting, copy buttons |
+| `SyntaxHighlightedCode`          | Code blocks with language detection and highlighting             |
+| `ActionButton`, `ActionDropdown` | Reusable action buttons and menus                                |
+| `BadgeModality`, `BadgeInfo`     | Status and capability badges                                     |
+
+#### Hooks (`src/lib/hooks/`)
+
+- **`useModelChangeValidation`** - Validates model switch against conversation modalities
+- **`useProcessingState`** - Tracks streaming progress and token generation
+
+#### Stores (`src/lib/stores/`)
+
+| Store                | Responsibility                                            |
+| -------------------- | --------------------------------------------------------- |
+| `chatStore`          | Message sending, streaming, abort control, error handling |
+| `conversationsStore` | CRUD for conversations, message branching, navigation     |
+| `modelsStore`        | Model list, selection, loading/unloading (ROUTER)         |
+| `serverStore`        | Server properties, role detection, modalities             |
+| `settingsStore`      | User preferences, parameter sync with server defaults     |
+
+#### Services (`src/lib/services/`)
+
+| Service                | Responsibility                                   |
+| ---------------------- | ------------------------------------------------ |
+| `ChatService`          | API calls to `/v1/chat/completions`, SSE parsing |
+| `ModelsService`        | `/models`, `/models/load`, `/models/unload`      |
+| `PropsService`         | `/props`, `/props?model=`                        |
+| `DatabaseService`      | IndexedDB operations via Dexie                   |
+| `ParameterSyncService` | Syncs settings with server defaults              |
+
+---
+
+## Data Flows
+
+### MODEL Mode (Single Model)
+
+See: [`docs/flows/data-flow-simplified-model-mode.md`](docs/flows/data-flow-simplified-model-mode.md)
+
+```mermaid
+sequenceDiagram
+    participant User
+    participant UI
+    participant Stores
+    participant DB as IndexedDB
+    participant API as llama-server
+
+    Note over User,API: Initialization
+    UI->>Stores: initialize()
+    Stores->>DB: load conversations
+    Stores->>API: GET /props
+    API-->>Stores: server config
+    Stores->>API: GET /v1/models
+    API-->>Stores: single model (auto-selected)
+
+    Note over User,API: Chat Flow
+    User->>UI: send message
+    Stores->>DB: save user message
+    Stores->>API: POST /v1/chat/completions (stream)
+    loop streaming
+        API-->>Stores: SSE chunks
+        Stores-->>UI: reactive update
+    end
+    Stores->>DB: save assistant message
+```
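The streaming loop in the diagram depends on turning raw SSE text into content deltas. The sketch below shows one way to do that, assuming the OpenAI-compatible stream shape (`data: {json}` lines carrying `choices[0].delta.content`, ended by `data: [DONE]`). `parseSseBuffer` and `SseParseResult` are hypothetical names, not the actual `ChatService` API.

```typescript
interface SseParseResult {
	deltas: string[]; // content fragments found in complete lines
	rest: string; // incomplete trailing line, to be prepended to the next chunk
	done: boolean; // true once the [DONE] sentinel was seen
}

// Hypothetical helper: parse whatever complete lines the buffer holds.
function parseSseBuffer(buffer: string): SseParseResult {
	const lines = buffer.split('\n');
	// The last element is either an unfinished line or an empty string.
	const rest = lines.pop() ?? '';
	const deltas: string[] = [];
	let done = false;

	for (const raw of lines) {
		const line = raw.trim();
		if (!line.startsWith('data:')) continue; // skip blank separators and comments
		const payload = line.slice('data:'.length).trim();
		if (payload === '[DONE]') {
			done = true;
			break;
		}
		try {
			const parsed = JSON.parse(payload);
			const content = parsed?.choices?.[0]?.delta?.content;
			if (typeof content === 'string') deltas.push(content);
		} catch {
			// Ignore malformed payloads rather than aborting the stream.
		}
	}
	return { deltas, rest, done };
}

// Usage: a JSON payload split across two network chunks.
const chunkOne = parseSseBuffer('data: {"choices":[{"delta":{"content":"Hel"}}]}\n\ndata: {"choi');
const chunkTwo = parseSseBuffer(
	chunkOne.rest + 'ces":[{"delta":{"content":"lo"}}]}\n\ndata: [DONE]\n\n'
);
console.log(chunkOne.deltas.concat(chunkTwo.deltas).join('')); // Hello
```

Carrying `rest` forward matters because a network chunk can end in the middle of a line; parsing only complete lines avoids JSON errors on partial payloads.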
+
+### ROUTER Mode (Multi-Model)
+
+See: [`docs/flows/data-flow-simplified-router-mode.md`](docs/flows/data-flow-simplified-router-mode.md)
+
+```mermaid
+sequenceDiagram
+    participant User
+    participant UI
+    participant Stores
+    participant API as llama-server
+
+    Note over User,API: Initialization
+    Stores->>API: GET /props
+    API-->>Stores: {role: "router"}
+    Stores->>API: GET /models
+    API-->>Stores: models[] with status
+
+    Note over User,API: Model Selection
+    User->>UI: select model
+    alt model not loaded
+        Stores->>API: POST /models/load
+        loop poll status
+            Stores->>API: GET /models
+        end
+        Stores->>API: GET /props?model=X
+    end
+    Stores->>Stores: validate modalities
+
+    Note over User,API: Chat Flow
+    Stores->>API: POST /v1/chat/completions {model: X}
+    loop streaming
+        API-->>Stores: SSE chunks + model info
+    end
+```
+
+### Detailed Flow Diagrams
+
+| Flow          | Description                                | File                                                        |
+| ------------- | ------------------------------------------ | ----------------------------------------------------------- |
+| Chat          | Message lifecycle, streaming, regeneration | [`chat-flow.md`](docs/flows/chat-flow.md)                   |
+| Models        | Loading, unloading, modality caching       | [`models-flow.md`](docs/flows/models-flow.md)               |
+| Server        | Props fetching, role detection             | [`server-flow.md`](docs/flows/server-flow.md)               |
+| Conversations | CRUD, branching, import/export             | [`conversations-flow.md`](docs/flows/conversations-flow.md) |
+| Database      | IndexedDB schema, operations               | [`database-flow.md`](docs/flows/database-flow.md)           |
+| Settings      | Parameter sync, user overrides             | [`settings-flow.md`](docs/flows/settings-flow.md)           |
+
+---
+
+## Architectural Patterns
+
+### 1. Reactive State with Svelte 5 Runes
+
+All stores use Svelte 5's fine-grained reactivity:
+
+```typescript
+// Store with reactive state
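+class ChatStore {
+	#isLoading = $state(false);
+	#currentResponse = $state('');
+
+	// Derived values auto-update. `$derived` must initialize a class field
+	// (or a variable); it cannot be called inside a getter body.
+	isStreaming = $derived(this.#isLoading && this.#currentResponse.length > 0);
+
+	// Public read-only views of the private state
+	get isLoading() {
+		return this.#isLoading;
+	}
+	get currentResponse() {
+		return this.#currentResponse;
+	}
+}
+
+const chatStore = new ChatStore();
+
+// Exported reactive accessors
+export const isLoading = () => chatStore.isLoading;
+export const currentResponse = () => chatStore.currentResponse;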
+```
+
+### 2. Unidirectional Data Flow
+
+Data flows in one direction, making state predictable:
+
+```mermaid
+flowchart LR
+    subgraph UI["UI Layer"]
+        A[User Action] --> B[Component]
+    end
+
+    subgraph State["State Layer"]
+        B --> C[Store Method]
+        C --> D[State Update]
+    end
+
+    subgraph IO["I/O Layer"]
+        C --> E[Service]
+        E --> F[API / IndexedDB]
+        F -.->|Response| D
+    end
+
+    D -->|Reactive| B
+```
+
+Components dispatch actions to stores, stores coordinate with services for I/O, and state updates reactively propagate back to the UI.
+
+### 3. Per-Conversation State
+
+Enables concurrent streaming across multiple conversations:
+
+```typescript
+class ChatStore {
+	chatLoadingStates = new Map<string, boolean>();
+	chatStreamingStates = new Map<string, { response: string; messageId: string }>();
+	abortControllers = new Map<string, AbortController>();
+}
+```
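To illustrate why keying state by conversation ID allows concurrent streams, here is a small self-contained sketch. The class `PerConversationState` and its methods are hypothetical and omit the Svelte reactivity; only the Map-per-conversation idea comes from the snippet above.

```typescript
class PerConversationState {
	private loading = new Map<string, boolean>();
	private controllers = new Map<string, AbortController>();

	// Begin a stream for one conversation and return its abort signal.
	start(conversationId: string): AbortSignal {
		// Cancel an earlier stream of the same conversation only.
		this.controllers.get(conversationId)?.abort();
		const controller = new AbortController();
		this.controllers.set(conversationId, controller);
		this.loading.set(conversationId, true);
		return controller.signal;
	}

	// Stop one conversation without touching the others.
	stop(conversationId: string): void {
		this.controllers.get(conversationId)?.abort();
		this.controllers.delete(conversationId);
		this.loading.set(conversationId, false);
	}

	isLoading(conversationId: string): boolean {
		return this.loading.get(conversationId) ?? false;
	}
}

const demoState = new PerConversationState();
const signalA = demoState.start('conv-a');
const signalB = demoState.start('conv-b');
demoState.stop('conv-a');

console.log(signalA.aborted, signalB.aborted); // true false
console.log(demoState.isLoading('conv-a'), demoState.isLoading('conv-b')); // false true
```

A single global `isLoading` flag or a single `AbortController` could not express this: stopping one conversation would also cancel the other.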
+
+### 4. Message Branching with Tree Structure
+
+Conversations are stored as a tree, not a linear list:
+
+```typescript
+interface DatabaseMessage {
+	id: string;
+	parent: string | null; // Points to parent message
+	children: string[]; // List of child message IDs
+	// ...
+}
+
+interface DatabaseConversation {
+	currentNode: string; // Currently viewed branch tip
+	// ...
+}
+```
+
+Navigation between branches updates `currentNode` without losing history.
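As a sketch of how a branch tip resolves to the visible message list, the function below walks `parent` pointers from `currentNode` back to the root. `MessageNode` is a hypothetical, trimmed-down version of `DatabaseMessage`, and `getActivePath` is an illustrative name rather than a function confirmed by the source.

```typescript
// Hypothetical minimal shape; the real interface carries more fields.
interface MessageNode {
	id: string;
	parent: string | null;
	children: string[];
}

// Walk parent pointers from the branch tip to the root,
// then reverse so the result reads root -> tip.
function getActivePath(messages: Map<string, MessageNode>, currentNode: string): string[] {
	const path: string[] = [];
	let cursor: string | null = currentNode;
	while (cursor !== null) {
		const node: MessageNode | undefined = messages.get(cursor);
		if (!node) break; // dangling reference: stop instead of throwing
		path.push(node.id);
		cursor = node.parent;
	}
	return path.reverse();
}

// A root message with two sibling branches (a1 and a2); a1 has a reply b1.
const demoMessages = new Map<string, MessageNode>([
	['root', { id: 'root', parent: null, children: ['a1', 'a2'] }],
	['a1', { id: 'a1', parent: 'root', children: ['b1'] }],
	['a2', { id: 'a2', parent: 'root', children: [] }],
	['b1', { id: 'b1', parent: 'a1', children: [] }]
]);

console.log(getActivePath(demoMessages, 'b1').join(' > ')); // root > a1 > b1
console.log(getActivePath(demoMessages, 'a2').join(' > ')); // root > a2
```

Switching branches only changes which tip is passed in; no message is deleted or rewritten, which is why history survives navigation.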
+
+### 5. Layered Service Architecture
+
+Stores handle state; services handle I/O:
+
+```text
+┌─────────────────┐
+│     Stores      │  Business logic, state management
+├─────────────────┤
+│    Services     │  API calls, database operations
+├─────────────────┤
+│   Storage/API   │  IndexedDB, LocalStorage, HTTP
+└─────────────────┘
+```
+
+### 6. Server Role Abstraction
+
+Single codebase handles both MODEL and ROUTER modes:
+
+```typescript
+// serverStore.ts
+get isRouterMode() {
+  return this.role === ServerRole.ROUTER;
+}
+
+// Components conditionally render based on mode
+{#if isRouterMode()}
+  <ModelsSelector />
+{/if}
+```
+
+### 7. Modality Validation
+
+Prevents sending attachments to incompatible models:
+
+```typescript
+// useModelChangeValidation hook
+const validate = (modelId: string) => {
+	const modelModalities = modelsStore.getModelModalities(modelId);
+	const conversationModalities = conversationsStore.usedModalities;
+
+	// Check if model supports all used modalities
+	if (conversationModalities.hasImages && !modelModalities.vision) {
+		return { valid: false, reason: 'Model does not support images' };
+	}
+	// ...
+};
+```
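A fuller, standalone version of this check might look like the sketch below. The `audio` and `hasAudio` fields, the result type, and the function name `validateModalities` are assumptions added for illustration; the source only shows the image case.

```typescript
// Hypothetical shapes: only `vision` and `hasImages` appear in the snippet above.
interface ModelModalities {
	vision: boolean;
	audio: boolean;
}

interface UsedModalities {
	hasImages: boolean;
	hasAudio: boolean;
}

type ValidationResult = { valid: true } | { valid: false; reason: string };

// A model is acceptable only if it supports every modality already used
// in the conversation.
function validateModalities(model: ModelModalities, used: UsedModalities): ValidationResult {
	if (used.hasImages && !model.vision) {
		return { valid: false, reason: 'Model does not support images' };
	}
	if (used.hasAudio && !model.audio) {
		return { valid: false, reason: 'Model does not support audio' };
	}
	return { valid: true };
}

const textOnlyModel: ModelModalities = { vision: false, audio: false };
const demoResult = validateModalities(textOnlyModel, { hasImages: true, hasAudio: false });
console.log(demoResult.valid); // false
```

Returning a reason alongside the flag lets the UI explain why a model switch was blocked instead of silently ignoring the selection.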
+
+### 8. Persistent Storage Strategy
+
+Data is persisted across sessions using two storage mechanisms:
+
+```mermaid
+flowchart TB
+    subgraph Browser["Browser Storage"]
+        subgraph IDB["IndexedDB (Dexie)"]
+            C[Conversations]
+            M[Messages]
+        end
+        subgraph LS["LocalStorage"]
+            S[Settings Config]
+            O[User Overrides]
+            T[Theme Preference]
+        end
+    end
+
+    subgraph Stores["Svelte Stores"]
+        CS[conversationsStore] --> C
+        CS --> M
+        SS[settingsStore] --> S
+        SS --> O
+        SS --> T
+    end
+```
+
+- **IndexedDB**: Conversations and messages (large, structured data)
+- **LocalStorage**: Settings, user parameter overrides, theme (small key-value data)
+- **Memory only**: Server props, model list (fetched fresh on each session)
+
+---
+
+## Testing
+
+### Test Types
+
+| Type          | Tool               | Location         | Command               |
+| ------------- | ------------------ | ---------------- | --------------------- |
+| **Unit**      | Vitest             | `tests/unit/`    | `npm run test:unit`   |
+| **UI/Visual** | Storybook + Vitest | `tests/stories/` | `npm run test:ui`     |
+| **E2E**       | Playwright         | `tests/e2e/`     | `npm run test:e2e`    |
+| **Client**    | Vitest             | `tests/client/`  | `npm run test:client` |
+
+### Running Tests
+
+```bash
+# All tests
+npm run test
+
+# Individual test suites
+npm run test:e2e      # End-to-end (requires llama-server)
+npm run test:client   # Client-side unit tests
+npm run test:unit     # Unit tests
+npm run test:ui       # Storybook visual tests
+```
+
+### Storybook Development
+
+```bash
+npm run storybook     # Start Storybook dev server on :6006
+npm run build-storybook  # Build static Storybook
+```
+
+### Linting and Formatting
+
+```bash
+npm run lint          # Check code style
+npm run format        # Auto-format with Prettier
+npm run check         # TypeScript type checking
+```
+
+---
+
+## Project Structure
+
+```text
+tools/ui/
+├── src/
+│   ├── lib/
+│   │   ├── components/   # UI components (app/, ui/)
+│   │   ├── hooks/        # Svelte hooks
+│   │   ├── stores/       # State management
+│   │   ├── services/     # API and database services
+│   │   ├── types/        # TypeScript interfaces
+│   │   └── utils/        # Utility functions
+│   ├── routes/           # SvelteKit routes
+│   └── styles/           # Global styles
+├── static/               # Static assets
+├── tests/                # Test files
+├── docs/                 # Architecture diagrams
+│   ├── architecture/     # High-level architecture
+│   └── flows/            # Feature-specific flows
+└── .storybook/           # Storybook configuration
+```
+
+---
+
+## Related Documentation
+
+- [llama.cpp Server README](../server/README.md) - Full server documentation
+- [Multimodal Documentation](../../docs/multimodal.md) - Image and audio support
+- [Function Calling](../../docs/function-calling.md) - Tool use capabilities
--- a/examples/server/webui_llamacpp/e2e/demo.test.ts
+++ b/examples/server/webui_llamacpp/e2e/demo.test.ts
@ -1,6 +0,0 @@
-import { expect, test } from '@playwright/test';
-
-test('home page has expected h1', async ({ page }) => {
-	await page.goto('/');
-	await expect(page.locator('h1')).toBeVisible();
-});
--- a/examples/server/webui_llamacpp/eslint.config.js
+++ b/examples/server/webui_llamacpp/eslint.config.js
@ -20,14 +20,17 @@ export default ts.config(
 	prettier,
 	...svelte.configs.prettier,
 	{
-		languageOptions: {
-			globals: { ...globals.browser, ...globals.node }
-		},
+		languageOptions: { globals: { ...globals.browser, ...globals.node } },
 		rules: {
 			// typescript-eslint strongly recommend that you do not use the no-undef lint rule on TypeScript projects.
 			// see: https://typescript-eslint.io/troubleshooting/faqs/eslint/#i-get-errors-from-the-no-undef-rule-about-global-variables-not-being-defined-even-though-there-are-no-typescript-errors
 			'no-undef': 'off',
-			'svelte/no-at-html-tags': 'off'
+			'svelte/no-at-html-tags': 'off',
+			// This app uses hash-based routing (#/) where resolve() from $app/paths does not apply
+			'svelte/no-navigation-without-resolve': 'off',
+
+			// Enforce empty line at end of file
+			'eol-last': 'error'
 		}
 	},
 	{
@ -42,8 +45,8 @@ export default ts.config(
 		}
 	},
 	{
-		// Exclude Storybook files from main ESLint rules
-		ignores: ['.storybook/**/*']
+		// Exclude generated build output and Storybook files from ESLint
+		ignores: ['dist/**', 'build/**', '.svelte-kit/**', 'test-results/**', '.storybook/**/*']
 	},
 	storybook.configs['flat/recommended']
 );
--- a/examples/server/webui_llamacpp/package-lock.json
+++ b/examples/server/webui_llamacpp/package-lock.json
--- a/examples/server/webui_llamacpp/package.json
+++ b/examples/server/webui_llamacpp/package.json
@ -1,95 +1,98 @@
 {
-  "name": "webui",
-  "private": true,
-  "version": "1.0.0",
-  "type": "module",
-  "scripts": {
-    "dev": "bash scripts/dev.sh",
-    "build": "vite build",
-    "preview": "vite preview",
-    "prepare": "svelte-kit sync || echo ''",
-    "check": "svelte-kit sync && svelte-check --tsconfig ./tsconfig.json",
-    "check:watch": "svelte-kit sync && svelte-check --tsconfig ./tsconfig.json --watch",
-    "reset": "rm -rf .svelte-kit node_modules",
-    "format": "prettier --write .",
-    "lint": "prettier --check . && eslint .",
-    "test": "npm run test:ui -- --run && npm run test:client -- --run && npm run test:server -- --run && npm run test:e2e",
-    "test:e2e": "playwright test",
-    "test:client": "vitest --project=client",
-    "test:server": "vitest --project=server",
-    "test:ui": "vitest --project=ui",
-    "test:unit": "vitest",
-    "storybook": "storybook dev -p 6006",
-    "build-storybook": "storybook build",
-    "cleanup": "rm -rf .svelte-kit build node_modules test-results"
-  },
-  "devDependencies": {
-    "@chromatic-com/storybook": "^4.1.3",
-    "@eslint/compat": "^1.2.5",
-    "@eslint/js": "^9.18.0",
-    "@internationalized/date": "^3.8.2",
-    "@lucide/svelte": "^0.515.0",
-    "@playwright/test": "^1.49.1",
-    "@storybook/addon-a11y": "^10.0.8",
-    "@storybook/addon-docs": "^10.0.8",
-    "@storybook/addon-svelte-csf": "^5.0.10",
-    "@storybook/addon-vitest": "^10.0.8",
-    "@storybook/sveltekit": "^10.0.8",
-    "@sveltejs/adapter-static": "^3.0.8",
-    "@sveltejs/kit": "^2.22.0",
-    "@sveltejs/vite-plugin-svelte": "^6.0.0",
-    "@tailwindcss/forms": "^0.5.9",
-    "@tailwindcss/typography": "^0.5.15",
-    "@tailwindcss/vite": "^4.0.0",
-    "@types/node": "^22",
-    "@vitest/browser": "^3.2.3",
-    "bits-ui": "^2.8.11",
-    "clsx": "^2.1.1",
-    "dexie": "^4.0.11",
-    "eslint": "^9.18.0",
-    "eslint-config-prettier": "^10.0.1",
-    "eslint-plugin-storybook": "^10.0.8",
-    "eslint-plugin-svelte": "^3.0.0",
-    "fflate": "^0.8.2",
-    "globals": "^16.0.0",
-    "http-server": "^14.1.1",
-    "mdast": "^3.0.0",
-    "mdsvex": "^0.12.3",
-    "playwright": "^1.53.0",
-    "prettier": "^3.4.2",
-    "prettier-plugin-svelte": "^3.3.3",
-    "prettier-plugin-tailwindcss": "^0.6.11",
-    "rehype-katex": "^7.0.1",
-    "remark-math": "^6.0.0",
-    "sass": "^1.93.3",
-    "storybook": "^10.0.8",
-    "svelte": "^5.0.0",
-    "svelte-check": "^4.0.0",
-    "tailwind-merge": "^3.3.1",
-    "tailwind-variants": "^1.0.0",
-    "tailwindcss": "^4.0.0",
-    "tw-animate-css": "^1.3.5",
-    "typescript": "^5.0.0",
-    "typescript-eslint": "^8.20.0",
-    "unified": "^11.0.5",
-    "uuid": "^13.0.0",
-    "vite": "^7.0.4",
-    "vite-plugin-devtools-json": "^0.2.0",
-    "vitest": "^3.2.3",
-    "vitest-browser-svelte": "^0.1.0"
-  },
-  "dependencies": {
-    "highlight.js": "^11.11.1",
-    "mode-watcher": "^1.1.0",
-    "pdfjs-dist": "^5.4.54",
-    "rehype-highlight": "^7.0.2",
-    "rehype-stringify": "^10.0.1",
-    "remark": "^15.0.1",
-    "remark-breaks": "^4.0.0",
-    "remark-gfm": "^4.0.1",
-    "remark-html": "^16.0.1",
-    "remark-rehype": "^11.1.2",
-    "svelte-sonner": "^1.0.5",
-    "unist-util-visit": "^5.0.0"
-  }
+	"name": "llama-ui",
+	"private": true,
+	"version": "1.0.0",
+	"type": "module",
+	"scripts": {
+		"dev": "bash scripts/dev.sh",
+		"build": "vite build",
+		"preview": "vite preview",
+		"prepare": "svelte-kit sync || echo ''",
+		"check": "svelte-kit sync && svelte-check --tsconfig ./tsconfig.json",
+		"check:watch": "svelte-kit sync && svelte-check --tsconfig ./tsconfig.json --watch",
+		"reset": "rm -rf .svelte-kit node_modules",
+		"format": "prettier --write .",
+		"lint": "prettier --check . && eslint .",
+		"test": "npm run test:ui -- --run && npm run test:client -- --run && npm run test:unit -- --run && npm run test:e2e",
+		"test:e2e": "playwright test",
+		"test:client": "vitest --project=client",
+		"test:unit": "vitest --project=unit",
+		"test:ui": "vitest --project=ui",
+		"storybook": "storybook dev -p 6006",
+		"build-storybook": "storybook build",
+		"cleanup": "rm -rf .svelte-kit build node_modules test-results"
+	},
+	"devDependencies": {
+		"@chromatic-com/storybook": "^5.0.0",
+		"@eslint/compat": "^1.2.5",
+		"@eslint/js": "^9.18.0",
+		"@internationalized/date": "^3.10.1",
+		"@lucide/svelte": "^0.515.0",
+		"@playwright/test": "^1.49.1",
+		"@storybook/addon-a11y": "^10.2.4",
+		"@storybook/addon-docs": "^10.2.4",
+		"@storybook/addon-svelte-csf": "^5.0.10",
+		"@storybook/addon-vitest": "^10.2.4",
+		"@storybook/sveltekit": "^10.2.4",
+		"@sveltejs/adapter-static": "^3.0.10",
+		"@sveltejs/kit": "^2.48.4",
+		"@sveltejs/vite-plugin-svelte": "^6.2.1",
+		"@tailwindcss/forms": "^0.5.9",
+		"@tailwindcss/typography": "^0.5.15",
+		"@tailwindcss/vite": "^4.0.0",
+		"@types/node": "^24",
+		"@vitest/browser": "^3.2.3",
+		"@vitest/coverage-v8": "^3.2.3",
+		"bits-ui": "^2.14.4",
+		"clsx": "^2.1.1",
+		"dexie": "^4.0.11",
+		"eslint": "^9.18.0",
+		"eslint-config-prettier": "^10.0.1",
+		"eslint-plugin-storybook": "^10.2.4",
+		"eslint-plugin-svelte": "^3.0.0",
+		"fflate": "^0.8.2",
+		"globals": "^16.0.0",
+		"http-server": "^14.1.1",
+		"mdast": "^3.0.0",
+		"mdsvex": "^0.12.3",
+		"playwright": "^1.56.1",
+		"prettier": "^3.4.2",
+		"prettier-plugin-svelte": "^3.3.3",
+		"prettier-plugin-tailwindcss": "^0.6.11",
+		"rehype-katex": "^7.0.1",
+		"remark-math": "^6.0.0",
+		"sass": "^1.93.3",
+		"storybook": "^10.2.4",
+		"svelte": "^5.38.2",
+		"svelte-check": "^4.0.0",
+		"tailwind-merge": "^3.3.1",
+		"tailwind-variants": "^3.2.2",
+		"tailwindcss": "^4.0.0",
+		"tw-animate-css": "^1.3.5",
+		"typescript": "^5.0.0",
+		"typescript-eslint": "^8.20.0",
+		"unified": "^11.0.5",
+		"uuid": "^13.0.0",
+		"vite": "^7.2.2",
+		"vite-plugin-devtools-json": "^0.2.0",
+		"vitest": "^3.2.3",
+		"vitest-browser-svelte": "^0.1.0"
+	},
+	"dependencies": {
+		"@modelcontextprotocol/sdk": "^1.25.1",
+		"highlight.js": "^11.11.1",
+		"mermaid": "^11.15.0",
+		"mode-watcher": "^1.1.0",
+		"pdfjs-dist": "^5.4.54",
+		"rehype-highlight": "^7.0.2",
+		"rehype-stringify": "^10.0.1",
+		"remark": "^15.0.1",
+		"remark-breaks": "^4.0.0",
+		"remark-gfm": "^4.0.1",
+		"remark-html": "^16.0.1",
+		"remark-rehype": "^11.1.2",
+		"svelte-sonner": "^1.0.5",
+		"unist-util-visit": "^5.0.0",
+		"zod": "^4.2.1"
+	}
 }
--- a/examples/server/webui_llamacpp/playwright.config.ts
+++ b/examples/server/webui_llamacpp/playwright.config.ts
@ -2,10 +2,10 @@ import { defineConfig } from '@playwright/test';

 export default defineConfig({
 	webServer: {
-		command: 'npm run build && http-server ../public -p 8181',
+		command: 'npm run build && npx http-server ./dist -p 8181',
 		port: 8181,
 		timeout: 120000,
 		reuseExistingServer: false
 	},
-	testDir: 'e2e'
+	testDir: 'tests/e2e'
 });
--- a/examples/server/webui_llamacpp/scripts/dev.sh
+++ b/examples/server/webui_llamacpp/scripts/dev.sh
@ -1,34 +1,38 @@
 #!/bin/bash

-# Development script for llama.cpp webui
-# 
-# This script starts the webui development servers (Storybook and Vite).
+# Development script for llama-ui
+#
+# This script starts the llama-ui development servers (Storybook and Vite).
 # Note: You need to start llama-server separately.
 #
 # Usage:
 #   bash scripts/dev.sh
 #   npm run dev

-cd ../../../
+cd ../../
+
+# Ensure node_modules are installed
+if [ ! -d "tools/ui/node_modules" ]; then
+    echo "📦 Installing npm dependencies..."
+    cd tools/ui && npm install && cd ../../
+fi

 # Check and install git hooks if missing
 check_and_install_hooks() {
    local hooks_missing=false
-    
+
    # Check for required hooks
-    if [ ! -f ".git/hooks/pre-commit" ] || [ ! -f ".git/hooks/pre-push" ] || [ ! -f ".git/hooks/post-push" ]; then
+    if [ ! -f ".git/hooks/pre-commit" ] || [ ! -f ".git/hooks/pre-push" ]; then
        hooks_missing=true
    fi
-    
+
    if [ "$hooks_missing" = true ]; then
        echo "🔧 Git hooks missing, installing them..."
-        cd tools/server/webui
-        if bash scripts/install-git-hooks.sh; then
+        if bash "$(dirname "$0")/git-hooks/install.sh"; then
            echo "✅ Git hooks installed successfully"
        else
            echo "⚠️  Failed to install git hooks, continuing anyway..."
        fi
-        cd ../../../
    else
        echo "✅ Git hooks already installed"
    fi
@ -48,8 +52,10 @@ trap cleanup SIGINT SIGTERM

 echo "🚀 Starting development servers..."
 echo "📝 Note: Make sure to start llama-server separately if needed"
-cd tools/server/webui
-storybook dev -p 6006 --ci & vite dev --host 0.0.0.0 &
+cd tools/ui
+# Use --insecure-http-parser to handle malformed HTTP responses from llama-server
+# (some responses have both Content-Length and Transfer-Encoding headers)
+storybook dev -p 6006 --ci & NODE_OPTIONS="--insecure-http-parser" vite dev --host 0.0.0.0 &

 # Wait for all background processes
 wait
--- a/examples/server/webui_llamacpp/scripts/install-git-hooks.sh
+++ b/examples/server/webui_llamacpp/scripts/install-git-hooks.sh
@ -1,202 +0,0 @@
-#!/bin/bash
-
-# Script to install pre-commit and pre-push hooks for webui
-# Pre-commit: formats code and runs checks
-# Pre-push: builds the project, stashes unstaged changes
-
-REPO_ROOT=$(git rev-parse --show-toplevel)
-PRE_COMMIT_HOOK="$REPO_ROOT/.git/hooks/pre-commit"
-PRE_PUSH_HOOK="$REPO_ROOT/.git/hooks/pre-push"
-
-echo "Installing pre-commit and pre-push hooks for webui..."
-
-# Create the pre-commit hook
-cat > "$PRE_COMMIT_HOOK" << 'EOF'
-#!/bin/bash
-
-# Check if there are any changes in the webui directory
-if git diff --cached --name-only | grep -q "^tools/server/webui/"; then
-    echo "Formatting and checking webui code..."
-    
-    # Change to webui directory and run format
-    cd tools/server/webui
-    
-    # Check if npm is available and package.json exists
-    if [ ! -f "package.json" ]; then
-        echo "Error: package.json not found in tools/server/webui"
-        exit 1
-    fi
-    
-    # Run the format command
-    npm run format
-
-    # Check if format command succeeded
-    if [ $? -ne 0 ]; then
-        echo "Error: npm run format failed"
-        exit 1
-    fi
-
-    # Run the lint command
-    npm run lint
-    
-    # Check if lint command succeeded
-    if [ $? -ne 0 ]; then
-        echo "Error: npm run lint failed"
-        exit 1
-    fi
-
-    # Run the check command
-    npm run check
-    
-    # Check if check command succeeded
-    if [ $? -ne 0 ]; then
-        echo "Error: npm run check failed"
-        exit 1
-    fi
-
-    # Go back to repo root
-    cd ../../..
-    
-    echo "✅ Webui code formatted and checked successfully"
-fi
-
-exit 0
-EOF
-
-# Create the pre-push hook
-cat > "$PRE_PUSH_HOOK" << 'EOF'
-#!/bin/bash
-
-# Check if there are any webui changes that need building
-WEBUI_CHANGES=$(git diff --name-only @{push}..HEAD | grep "^tools/server/webui/" || true)
-
-if [ -n "$WEBUI_CHANGES" ]; then
-    echo "Webui changes detected, checking if build is up-to-date..."
-    
-    # Change to webui directory
-    cd tools/server/webui
-    
-    # Check if npm is available and package.json exists
-    if [ ! -f "package.json" ]; then
-        echo "Error: package.json not found in tools/server/webui"
-        exit 1
-    fi
-    
-    # Check if build output exists and is newer than source files
-    BUILD_FILE="../public/index.html.gz"
-    NEEDS_BUILD=false
-    
-    if [ ! -f "$BUILD_FILE" ]; then
-        echo "Build output not found, building..."
-        NEEDS_BUILD=true
-    else
-        # Check if any source files are newer than the build output
-        if find src -newer "$BUILD_FILE" -type f | head -1 | grep -q .; then
-            echo "Source files are newer than build output, rebuilding..."
-            NEEDS_BUILD=true
-        fi
-    fi
-    
-    if [ "$NEEDS_BUILD" = true ]; then
-        echo "Building webui..."
-        
-        # Stash any unstaged changes to avoid conflicts during build
-        echo "Checking for unstaged changes..."
-        if ! git diff --quiet || ! git diff --cached --quiet --diff-filter=A; then
-            echo "Stashing unstaged changes..."
-            git stash push --include-untracked -m "Pre-push hook: stashed unstaged changes"
-            STASH_CREATED=$?
-        else
-            echo "No unstaged changes to stash"
-            STASH_CREATED=1
-        fi
-        
-        # Run the build command
-        npm run build
-        
-        # Check if build command succeeded
-        if [ $? -ne 0 ]; then
-            echo "Error: npm run build failed"
-            if [ $STASH_CREATED -eq 0 ]; then
-                echo "You can restore your unstaged changes with: git stash pop"
-            fi
-            exit 1
-        fi
-
-        # Go back to repo root
-        cd ../../..
-        
-        # Check if build output was created/updated
-        if [ -f "tools/server/public/index.html.gz" ]; then
-            # Add the build output and commit it
-            git add tools/server/public/index.html.gz
-            if ! git diff --cached --quiet; then
-                echo "Committing updated build output..."
-                git commit -m "chore: update webui build output"
-                echo "✅ Build output committed successfully"
-            else
-                echo "Build output unchanged"
-            fi
-        else
-            echo "Error: Build output not found after build"
-            if [ $STASH_CREATED -eq 0 ]; then
-                echo "You can restore your unstaged changes with: git stash pop"
-            fi
-            exit 1
-        fi
-        
-        if [ $STASH_CREATED -eq 0 ]; then
-            echo "✅ Build completed. Your unstaged changes have been stashed."
-            echo "They will be automatically restored after the push."
-            # Create a marker file to indicate stash was created by pre-push hook
-            touch .git/WEBUI_PUSH_STASH_MARKER
-        fi
-    else
-        echo "✅ Build output is up-to-date"
-    fi
-    
-    echo "✅ Webui ready for push"
-fi
-
-exit 0
-EOF
-
-# Create the post-push hook (for restoring stashed changes after push)
-cat > "$REPO_ROOT/.git/hooks/post-push" << 'EOF'
-#!/bin/bash
-
-# Check if we have a stash marker from the pre-push hook
-if [ -f .git/WEBUI_PUSH_STASH_MARKER ]; then
-    echo "Restoring your unstaged changes after push..."
-    git stash pop
-    rm -f .git/WEBUI_PUSH_STASH_MARKER
-    echo "✅ Your unstaged changes have been restored."
-fi
-
-exit 0
-EOF
-
-# Make all hooks executable
-chmod +x "$PRE_COMMIT_HOOK"
-chmod +x "$PRE_PUSH_HOOK"
-chmod +x "$REPO_ROOT/.git/hooks/post-push"
-
-if [ $? -eq 0 ]; then
-    echo "✅ Git hooks installed successfully!"
-    echo "   Pre-commit: $PRE_COMMIT_HOOK"
-    echo "   Pre-push:   $PRE_PUSH_HOOK"
-    echo "   Post-push:  $REPO_ROOT/.git/hooks/post-push"
-    echo ""
-    echo "The hooks will automatically:"
-    echo "  • Format and check webui code before commits (pre-commit)"
-    echo "  • Build webui code before pushes (pre-push)"
-    echo "  • Stash unstaged changes during build process"
-    echo "  • Restore your unstaged changes after the push"
-    echo ""
-    echo "To test the hooks:"
-    echo "  • Make a change to a file in the webui directory and commit it (triggers format/check)"
-    echo "  • Push your commits to trigger the build process"
-else
-    echo "❌ Failed to make hooks executable"
-    exit 1
-fi
--- a/examples/server/webui_llamacpp/scripts/post-build.sh
+++ b/examples/server/webui_llamacpp/scripts/post-build.sh
@ -1,3 +0,0 @@
-rm -rf ../public_llamacpp/_app;
-rm ../public_llamacpp/favicon.svg;
-rm ../public_llamacpp/index_llamacpp.html;
--- a/examples/server/webui_llamacpp/scripts/vite-plugin-llama-cpp-build.ts
+++ b/examples/server/webui_llamacpp/scripts/vite-plugin-llama-cpp-build.ts
@ -0,0 +1,80 @@
+import {
+	readFileSync,
+	writeFileSync,
+	existsSync,
+	readdirSync,
+	copyFileSync,
+	rmSync,
+	unlinkSync
+} from 'fs';
+import { resolve } from 'path';
+import type { Plugin } from 'vite';
+import * as fflate from 'fflate';
+
+const GUIDE_FOR_FRONTEND = `
+<!--
+  This is a static build of the frontend.
+  It is automatically generated by the build process.
+  Do not edit this file directly.
+  To make changes, refer to the "Web UI" section in the README.
+-->
+`.trim();
+
+const OUTPUT_DIR = process.env.LLAMA_UI_OUT_DIR ?? './dist';
+const MAX_BUNDLE_SIZE = 3 * 1024 * 1024;
+
+export function llamaCppBuildPlugin() {
+	return {
+		name: 'llamacpp:build',
+		apply: 'build' as const,
+		closeBundle() {
+			// Give the SvelteKit adapter time to finish writing to ../public_llamacpp
+			setTimeout(() => {
+				try {
+					const indexPath = resolve('../public_llamacpp/index_llamacpp.html');
+					const gzipPath = resolve('../public_llamacpp/index_llamacpp.html.gz');
+
+					if (!existsSync(indexPath)) {
+						return;
+					}
+
+					let content = readFileSync(indexPath, 'utf-8');
+
+					const faviconPath = resolve('static/favicon.svg');
+					if (existsSync(faviconPath)) {
+						const faviconContent = readFileSync(faviconPath, 'utf-8');
+						const faviconBase64 = Buffer.from(faviconContent).toString('base64');
+						const faviconDataUrl = `data:image/svg+xml;base64,${faviconBase64}`;
+
+						content = content.replace(/href="[^"]*favicon\.svg"/g, `href="${faviconDataUrl}"`);
+
+						console.log('✓ Inlined favicon.svg as base64 data URL');
+					}
+
+					content = content.replace(/\r/g, '');
+					content = GUIDE_FOR_FRONTEND + '\n' + content;
+
+					const compressed = fflate.gzipSync(Buffer.from(content, 'utf-8'), { level: 9 });
+
+					compressed[0x4] = 0;
+					compressed[0x5] = 0;
+					compressed[0x6] = 0;
+					compressed[0x7] = 0;
+					compressed[0x9] = 0;
+
+					if (compressed.byteLength > MAX_BUNDLE_SIZE) {
+						throw new Error(
+							`Bundle size is too large (${Math.ceil(compressed.byteLength / 1024)} KB).\n` +
+								`Please reduce the size of the frontend or increase MAX_BUNDLE_SIZE in scripts/vite-plugin-llama-cpp-build.ts.\n`
+						);
+					}
+
+					writeFileSync(gzipPath, compressed);
+					console.log('✓ Created index_llamacpp.html.gz');
+				} catch (error) {
+					console.error('Failed to create gzip file:', error);
+				}
+			}, 100);
+		}
+	};
+}
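The five byte writes in the plugin above line up with the fixed gzip member header described in RFC 1952: offsets 4 to 7 hold MTIME and offset 9 holds the OS field, so zeroing them presumably keeps the `.gz` output byte-identical across machines and build times. A minimal sketch of the same idea, assuming Node's built-in `node:zlib` in place of `fflate`; the names `deterministicGzip` and `roundTrip` are hypothetical and not part of this commit:

```typescript
import { gzipSync, gunzipSync } from 'node:zlib';

// Hypothetical helper, not part of the commit: gzip a string and clear the
// header fields that can vary between machines and build times.
export function deterministicGzip(text: string): Uint8Array {
	const compressed = gzipSync(Buffer.from(text, 'utf-8'), { level: 9 });

	// RFC 1952 header layout: ID1 ID2 CM FLG MTIME(4 bytes) XFL OS
	compressed.fill(0, 4, 8); // MTIME: modification time, offsets 4..7
	compressed[9] = 0; // OS: operating system that produced the archive

	return compressed;
}

// The payload still decompresses: the trailing CRC32 covers the uncompressed
// data only, so editing MTIME and OS does not invalidate the archive.
export function roundTrip(text: string): string {
	return gunzipSync(deterministicGzip(text)).toString('utf-8');
}
```

Only header metadata is touched, so a server can still send the file with `Content-Encoding: gzip` unchanged.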
--- a/examples/server/webui_llamacpp/src/app.css
+++ b/examples/server/webui_llamacpp/src/app.css
@ -1,5 +1,7 @@
@import 'tailwindcss';
-
+@source '.';
+@plugin '@tailwindcss/forms';
+@plugin '@tailwindcss/typography';
@import 'tw-animate-css';

@custom-variant dark (&:is(.dark *));
@ -14,11 +16,11 @@
 	--popover-foreground: oklch(0.145 0 0);
 	--primary: oklch(0.205 0 0);
 	--primary-foreground: oklch(0.985 0 0);
-	--secondary: oklch(0.97 0 0);
+	--secondary: oklch(0.95 0 0);
 	--secondary-foreground: oklch(0.205 0 0);
 	--muted: oklch(0.97 0 0);
 	--muted-foreground: oklch(0.556 0 0);
-	--accent: oklch(0.97 0 0);
+	--accent: oklch(0.95 0 0);
 	--accent-foreground: oklch(0.205 0 0);
 	--destructive: oklch(0.577 0.245 27.325);
 	--border: oklch(0.875 0 0);
@ -37,9 +39,23 @@
 	--sidebar-accent-foreground: oklch(0.205 0 0);
 	--sidebar-border: oklch(0.922 0 0);
 	--sidebar-ring: oklch(0.708 0 0);
-	--code-background: oklch(0.975 0 0);
+	--code-background: oklch(0.985 0 0);
 	--code-foreground: oklch(0.145 0 0);
+	--font-mono:
+		ui-monospace, SFMono-Regular, 'SF Mono', Monaco, 'Cascadia Code', 'Roboto Mono', Consolas,
+		'Liberation Mono', Menlo, monospace;
 	--layer-popover: 1000000;
+
+	--chat-form-area-height: 8rem;
+	--chat-form-area-offset: 2rem;
+	--max-message-height: max(24rem, min(80dvh, calc(100dvh - var(--chat-form-area-height) - 12rem)));
+}
+
+@media (min-width: 640px) {
+	:root {
+		--chat-form-area-height: 24rem;
+		--chat-form-area-offset: 12rem;
+	}
 }

 .dark {
@ -51,7 +67,7 @@
 	--popover-foreground: oklch(0.985 0 0);
 	--primary: oklch(0.922 0 0);
 	--primary-foreground: oklch(0.205 0 0);
-	--secondary: oklch(0.269 0 0);
+	--secondary: oklch(0.29 0 0);
 	--secondary-foreground: oklch(0.985 0 0);
 	--muted: oklch(0.269 0 0);
 	--muted-foreground: oklch(0.708 0 0);
@ -66,7 +82,7 @@
 	--chart-3: oklch(0.769 0.188 70.08);
 	--chart-4: oklch(0.627 0.265 303.9);
 	--chart-5: oklch(0.645 0.246 16.439);
-	--sidebar: oklch(0.205 0 0);
+	--sidebar: oklch(0.2 0 0);
 	--sidebar-foreground: oklch(0.985 0 0);
 	--sidebar-primary: oklch(0.488 0.243 264.376);
 	--sidebar-primary-foreground: oklch(0.985 0 0);
@ -120,8 +136,50 @@
 	* {
 		@apply border-border outline-ring/50;
 	}
+
 	body {
 		@apply bg-background text-foreground;
+		scrollbar-width: thin;
+		scrollbar-gutter: stable;
+		overflow: hidden; /* Works around a double scrollbar that appears when Mermaid diagrams render */
+	}
+
+	/* Global scrollbar styling - visible only on hover */
+	* {
+		scrollbar-width: thin;
+		scrollbar-color: transparent transparent;
+		transition: scrollbar-color 0.2s ease;
+	}
+
+	*:hover {
+		scrollbar-color: color-mix(in oklch, var(--muted-foreground) 30%, transparent) transparent;
+	}
+
+	*::-webkit-scrollbar {
+		width: 6px;
+		height: 6px;
+	}
+
+	*::-webkit-scrollbar-track {
+		background: transparent;
+	}
+
+	*::-webkit-scrollbar-thumb {
+		background: transparent;
+		border-radius: 3px;
+		transition: background 0.2s ease;
+	}
+
+	*:hover::-webkit-scrollbar-thumb {
+		background: color-mix(in oklch, var(--muted-foreground) 30%, transparent);
+	}
+
+	*::-webkit-scrollbar-thumb:hover {
+		background: color-mix(in oklch, var(--muted-foreground) 50%, transparent);
+	}
+
+	:where(code, pre, kbd, samp) {
+		font-family: var(--font-mono);
 	}
 }
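The `--max-message-height` expression in the hunk above nests three constraints: a 24rem floor, an 80dvh cap, and the space left after the chat form area plus a fixed 12rem. A rough worked sketch of that arithmetic, assuming 1rem = 16px and that `100dvh` equals the viewport height in pixels; `maxMessageHeightPx` is a hypothetical helper, not code from this commit:

```typescript
// Hypothetical helper mirroring
// max(24rem, min(80dvh, calc(100dvh - var(--chat-form-area-height) - 12rem)))
export function maxMessageHeightPx(
	viewportHeightPx: number,
	chatFormAreaHeightRem: number,
	remPx = 16 // assumption: default root font size
): number {
	const floor = 24 * remPx; // 24rem lower bound
	const cap = 0.8 * viewportHeightPx; // 80dvh upper bound
	// 100dvh minus the form area minus the fixed 12rem
	const available = viewportHeightPx - chatFormAreaHeightRem * remPx - 12 * remPx;

	return Math.max(floor, Math.min(cap, available));
}

// Below 640px the form area is 8rem; from 640px up it is 24rem.
export const narrowExample = maxMessageHeightPx(900, 8); // 900 - 128 - 192 = 580
export const wideExample = maxMessageHeightPx(900, 24); // 900 - 384 - 192 = 324, raised to the 384 floor
```

On a 900px-tall viewport the wider layout falls back to the 24rem floor, because the larger form area leaves less room than the floor allows.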

--- a/examples/server/webui_llamacpp/src/app.d.ts
+++ b/examples/server/webui_llamacpp/src/app.d.ts
@ -4,42 +4,59 @@
 // Import chat types from dedicated module

 import type {
+	// API types
 	ApiChatCompletionRequest,
 	ApiChatCompletionResponse,
 	ApiChatCompletionStreamChunk,
+	ApiChatCompletionToolCall,
+	ApiChatCompletionToolCallDelta,
 	ApiChatMessageData,
 	ApiChatMessageContentPart,
 	ApiContextSizeError,
 	ApiErrorResponse,
 	ApiLlamaCppServerProps,
-	ApiProcessingState
-} from '$lib/types/api';
-
-import type {
+	ApiModelDataEntry,
+	ApiModelListResponse,
+	ApiProcessingState,
+	ApiRouterModelMeta,
+	ApiRouterModelsLoadRequest,
+	ApiRouterModelsLoadResponse,
+	ApiRouterModelsStatusRequest,
+	ApiRouterModelsStatusResponse,
+	ApiRouterModelsListResponse,
+	ApiRouterModelsUnloadRequest,
+	ApiRouterModelsUnloadResponse,
+	// Chat types
+	ChatAttachmentDisplayItem,
 	ChatMessageType,
 	ChatRole,
 	ChatUploadedFile,
 	ChatMessageSiblingInfo,
 	ChatMessagePromptProgress,
-	ChatMessageTimings
-} from '$lib/types/chat';
-
-import type {
+	ChatMessageTimings,
+	// Database types
 	DatabaseConversation,
 	DatabaseMessage,
 	DatabaseMessageExtra,
 	DatabaseMessageExtraAudioFile,
+	DatabaseMessageExtraVideoFile,
 	DatabaseMessageExtraImageFile,
 	DatabaseMessageExtraTextFile,
 	DatabaseMessageExtraPdfFile,
-	DatabaseMessageExtraLegacyContext
-} from '$lib/types/database';
-
-import type {
+	DatabaseMessageExtraLegacyContext,
+	ExportedConversation,
+	ExportedConversations,
+	// Model types
+	ModelModalities,
+	ModelOption,
+	// Settings types
+	SettingsChatServiceOptions,
 	SettingsConfigValue,
 	SettingsFieldConfig,
 	SettingsConfigType
-} from '$lib/types/settings';
+} from '$lib/types';
+
+import { ServerRole, ServerModelStatus, ModelModality } from '$lib/enums';

 declare global {
 	// namespace App {
@ -51,33 +68,66 @@ declare global {
 	// }

 	export {
+		// API types
 		ApiChatCompletionRequest,
 		ApiChatCompletionResponse,
 		ApiChatCompletionStreamChunk,
+		ApiChatCompletionToolCall,
+		ApiChatCompletionToolCallDelta,
 		ApiChatMessageData,
 		ApiChatMessageContentPart,
 		ApiContextSizeError,
 		ApiErrorResponse,
 		ApiLlamaCppServerProps,
+		ApiModelDataEntry,
+		ApiModelListResponse,
 		ApiProcessingState,
-		ChatMessageData,
+		ApiRouterModelMeta,
+		ApiRouterModelsLoadRequest,
+		ApiRouterModelsLoadResponse,
+		ApiRouterModelsStatusRequest,
+		ApiRouterModelsStatusResponse,
+		ApiRouterModelsListResponse,
+		ApiRouterModelsUnloadRequest,
+		ApiRouterModelsUnloadResponse,
+		// Chat types
+		ChatAttachmentDisplayItem,
 		ChatMessagePromptProgress,
 		ChatMessageSiblingInfo,
 		ChatMessageTimings,
 		ChatMessageType,
 		ChatRole,
 		ChatUploadedFile,
+		// Database types
 		DatabaseConversation,
 		DatabaseMessage,
 		DatabaseMessageExtra,
 		DatabaseMessageExtraAudioFile,
+		DatabaseMessageExtraVideoFile,
 		DatabaseMessageExtraImageFile,
 		DatabaseMessageExtraTextFile,
 		DatabaseMessageExtraPdfFile,
 		DatabaseMessageExtraLegacyContext,
+		ExportedConversation,
+		ExportedConversations,
+		// Enum types
+		ModelModality,
+		ServerRole,
+		ServerModelStatus,
+		// Model types
+		ModelModalities,
+		ModelOption,
+		// Settings types
+		SettingsChatServiceOptions,
 		SettingsConfigValue,
 		SettingsFieldConfig,
-		SettingsConfigType,
-		SettingsChatServiceOptions
+		SettingsConfigType
 	};
 }
+
+declare global {
+	interface Window {
+		idxThemeStyle?: number;
+		idxCodeBlock?: number;
+	}
+}
--- a/examples/server/webui_llamacpp/src/demo.spec.ts
+++ b/examples/server/webui_llamacpp/src/demo.spec.ts
@ -1,7 +0,0 @@
-import { describe, it, expect } from 'vitest';
-
-describe('sum test', () => {
-	it('adds 1 + 2 to equal 3', () => {
-		expect(1 + 2).toBe(3);
-	});
-});
--- a/examples/server/webui_llamacpp/src/lib/actions/fade-in-view.svelte.ts
+++ b/examples/server/webui_llamacpp/src/lib/actions/fade-in-view.svelte.ts
@ -0,0 +1,47 @@
+import { isElementInViewport } from '$lib/utils/viewport';
+
+/**
+ * Svelte action that fades in an element when it enters the viewport.
+ * Uses IntersectionObserver for efficient viewport detection.
+ *
+ * If skipIfVisible is set and the element is already visible in the viewport
+ * when the action attaches (e.g. a markdown block promoted from unstable
+ * during streaming), the fade is skipped entirely to avoid a flash.
+ */
+export function fadeInView(
+	node: HTMLElement,
+	options: { duration?: number; y?: number; skipIfVisible?: boolean } = {}
+) {
+	const { duration = 300, y = 0, skipIfVisible = false } = options;
+
+	if (skipIfVisible && isElementInViewport(node)) {
+		return;
+	}
+
+	node.style.opacity = '0';
+	node.style.transform = `translateY(${y}px)`;
+	node.style.transition = `opacity ${duration}ms ease-out, transform ${duration}ms ease-out`;
+
+	$effect(() => {
+		const observer = new IntersectionObserver(
+			(entries) => {
+				for (const entry of entries) {
+					if (entry.isIntersecting) {
+						requestAnimationFrame(() => {
+							node.style.opacity = '1';
+							node.style.transform = 'translateY(0)';
+						});
+						observer.disconnect();
+					}
+				}
+			},
+			{ threshold: 0.05 }
+		);
+
+		observer.observe(node);
+
+		return () => {
+			observer.disconnect();
+		};
+	});
+}
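`fadeInView` imports `isElementInViewport` from `$lib/utils/viewport`, which does not appear in this part of the diff. As a rough idea of what such a check might do, here is a DOM-free sketch under the assumption that "in the viewport" means any overlap between the element's bounding rectangle and the visible area; `RectLike` and `isRectInViewport` are hypothetical names, and the real helper may behave differently:

```typescript
// Hypothetical shape: the subset of DOMRect fields the check needs.
export interface RectLike {
	top: number;
	left: number;
	bottom: number;
	right: number;
}

// Returns true when the rectangle overlaps the viewport at all.
export function isRectInViewport(
	rect: RectLike,
	viewportWidth: number,
	viewportHeight: number
): boolean {
	return (
		rect.bottom > 0 && // not entirely above the viewport
		rect.right > 0 && // not entirely to the left
		rect.top < viewportHeight && // not entirely below
		rect.left < viewportWidth // not entirely to the right
	);
}
```

In a browser this could be called as `isRectInViewport(node.getBoundingClientRect(), window.innerWidth, window.innerHeight)` before deciding whether to skip the fade.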
--- a/examples/server/webui_llamacpp/src/lib/components/app/actions/ActionIcon.svelte
+++ b/examples/server/webui_llamacpp/src/lib/components/app/actions/ActionIcon.svelte
@ -0,0 +1,60 @@
+<script lang="ts">
+	import { Button, type ButtonVariant, type ButtonSize } from '$lib/components/ui/button';
+	import * as Tooltip from '$lib/components/ui/tooltip';
+	import type { Component } from 'svelte';
+	import { TooltipSide } from '$lib/enums';
+
+	interface Props {
+		ariaLabel?: string;
+		class?: string;
+		disabled?: boolean;
+		icon: Component;
+		iconSize?: string;
+		onclick: (e?: MouseEvent) => void;
+		size?: ButtonSize;
+		stopPropagationOnClick?: boolean;
+		tooltip: string;
+		variant?: ButtonVariant;
+		tooltipSide?: TooltipSide;
+	}
+
+	let {
+		icon,
+		tooltip,
+		variant = 'ghost',
+		size = 'sm',
+		class: className = '',
+		disabled = false,
+		iconSize = 'h-3 w-3',
+		tooltipSide = TooltipSide.TOP,
+		stopPropagationOnClick = false,
+		onclick,
+		ariaLabel
+	}: Props = $props();
+</script>
+
+<Tooltip.Root>
+	<Tooltip.Trigger>
+		<Button
+			{variant}
+			{size}
+			{disabled}
+			onclick={(e: MouseEvent) => {
+				if (stopPropagationOnClick) e.stopPropagation();
+
+				onclick?.(e);
+			}}
+			class="h-6 w-6 p-0 {className} flex hover:bg-transparent data-[state=open]:bg-transparent!"
+			aria-label={ariaLabel || tooltip}
+		>
+			{#if icon}
+				{@const IconComponent = icon}
+				<IconComponent class={iconSize} />
+			{/if}
+		</Button>
+	</Tooltip.Trigger>
+
+	<Tooltip.Content side={tooltipSide}>
+		<p>{tooltip}</p>
+	</Tooltip.Content>
+</Tooltip.Root>
--- a/examples/server/webui_llamacpp/src/lib/components/app/actions/ActionIconCopyToClipboard.svelte
+++ b/examples/server/webui_llamacpp/src/lib/components/app/actions/ActionIconCopyToClipboard.svelte
@ -0,0 +1,17 @@
+<script lang="ts">
+	import { Copy } from '@lucide/svelte';
+	import { copyToClipboard } from '$lib/utils';
+	import ActionIcon from './ActionIcon.svelte';
+
+	export let ariaLabel: string = 'Copy to clipboard';
+	export let canCopy: boolean = true;
+	export let text: string;
+</script>
+
+<ActionIcon
+	icon={Copy}
+	tooltip={ariaLabel}
+	iconSize="h-4 w-4"
+	disabled={!canCopy}
+	onclick={() => canCopy && copyToClipboard(text)}
+/>
--- a/examples/server/webui_llamacpp/src/lib/components/app/actions/index.ts
+++ b/examples/server/webui_llamacpp/src/lib/components/app/actions/index.ts
@ -0,0 +1,13 @@
+/**
+ *
+ * ACTIONS
+ *
+ * Small interactive components for user actions.
+ *
+ */
+
+/** Styled icon button for action triggers with tooltip. */
+export { default as ActionIcon } from './ActionIcon.svelte';
+
+/** Copy-to-clipboard icon button with clipboard logic. */
+export { default as ActionIconCopyToClipboard } from './ActionIconCopyToClipboard.svelte';
--- a/examples/server/webui_llamacpp/src/lib/components/app/badges/BadgeInfo.svelte
+++ b/examples/server/webui_llamacpp/src/lib/components/app/badges/BadgeInfo.svelte
@ -0,0 +1,26 @@
+<script lang="ts">
+	import type { Snippet } from 'svelte';
+
+	interface Props {
+		children: Snippet;
+		class?: string;
+		icon?: Snippet;
+		onclick?: () => void;
+	}
+
+	let { children, class: className = '', icon, onclick }: Props = $props();
+</script>
+
+<button
+	class={[
+		'inline-flex cursor-pointer items-center gap-1 rounded-sm bg-muted-foreground/15 px-1.5 py-0.75',
+		className
+	]}
+	{onclick}
+>
+	{#if icon}
+		{@render icon()}
+	{/if}
+
+	{@render children()}
+</button>
--- a/examples/server/webui_llamacpp/src/lib/components/app/badges/BadgesModality.svelte
+++ b/examples/server/webui_llamacpp/src/lib/components/app/badges/BadgesModality.svelte
@ -0,0 +1,36 @@
+<script lang="ts">
+	import { Eye, Mic, Video } from '@lucide/svelte';
+	import { ModelModality } from '$lib/enums';
+
+	interface Props {
+		modalities: ModelModality[];
+		class?: string;
+	}
+
+	let { modalities, class: className = '' }: Props = $props();
+</script>
+
+{#each modalities as modality (modality)}
+	{#if modality === ModelModality.VISION || modality === ModelModality.AUDIO || modality === ModelModality.VIDEO}
+		<span
+			class={[
+				'inline-flex items-center gap-1 rounded-md bg-muted px-2 py-1 text-xs font-medium',
+				className
+			]}
+		>
+			{#if modality === ModelModality.VISION}
+				<Eye class="h-3 w-3" />
+
+				Vision (Image)
+			{:else if modality === ModelModality.VIDEO}
+				<Video class="h-3 w-3" />
+
+				Vision (Video)
+			{:else}
+				<Mic class="h-3 w-3" />
+
+				Audio
+			{/if}
+		</span>
+	{/if}
+{/each}
--- a/examples/server/webui_llamacpp/src/lib/components/app/badges/index.ts
+++ b/examples/server/webui_llamacpp/src/lib/components/app/badges/index.ts
@ -0,0 +1,13 @@
+/**
+ *
+ * BADGES & INDICATORS
+ *
+ * Small visual indicators for status and metadata.
+ *
+ */
+
+/** Generic info badge with optional icon and click handler. */
+export { default as BadgeInfo } from './BadgeInfo.svelte';
+
+/** Badges indicating model modalities (image vision, video vision, audio). */
+export { default as BadgesModality } from './BadgesModality.svelte';
--- a/examples/server/webui_llamacpp/src/lib/components/app/chat/ChatAttachments/ChatAttachmentPreview.svelte
+++ b/examples/server/webui_llamacpp/src/lib/components/app/chat/ChatAttachments/ChatAttachmentPreview.svelte
@ -1,273 +0,0 @@
-<script lang="ts">
-	import { FileText, Image, Music, FileIcon, Eye } from '@lucide/svelte';
-	import { FileTypeCategory, MimeTypeApplication } from '$lib/enums/files';
-	import { convertPDFToImage } from '$lib/utils/pdf-processing';
-	import { Button } from '$lib/components/ui/button';
-	import { getFileTypeCategory } from '$lib/utils/file-type';
-
-	interface Props {
-		// Either an uploaded file or a stored attachment
-		uploadedFile?: ChatUploadedFile;
-		attachment?: DatabaseMessageExtra;
-		// For uploaded files
-		preview?: string;
-		name?: string;
-		type?: string;
-		textContent?: string;
-	}
-
-	let { uploadedFile, attachment, preview, name, type, textContent }: Props = $props();
-
-	let displayName = $derived(uploadedFile?.name || attachment?.name || name || 'Unknown File');
-
-	let displayPreview = $derived(
-		uploadedFile?.preview || (attachment?.type === 'imageFile' ? attachment.base64Url : preview)
-	);
-
-	let displayType = $derived(
-		uploadedFile?.type ||
-			(attachment?.type === 'imageFile'
-				? 'image'
-				: attachment?.type === 'textFile'
-					? 'text'
-					: attachment?.type === 'audioFile'
-						? attachment.mimeType || 'audio'
-						: attachment?.type === 'pdfFile'
-							? MimeTypeApplication.PDF
-							: type || 'unknown')
-	);
-
-	let displayTextContent = $derived(
-		uploadedFile?.textContent ||
-			(attachment?.type === 'textFile'
-				? attachment.content
-				: attachment?.type === 'pdfFile'
-					? attachment.content
-					: textContent)
-	);
-
-	let isAudio = $derived(
-		getFileTypeCategory(displayType) === FileTypeCategory.AUDIO || displayType === 'audio'
-	);
-
-	let isImage = $derived(
-		getFileTypeCategory(displayType) === FileTypeCategory.IMAGE || displayType === 'image'
-	);
-
-	let isPdf = $derived(displayType === MimeTypeApplication.PDF);
-
-	let isText = $derived(
-		getFileTypeCategory(displayType) === FileTypeCategory.TEXT || displayType === 'text'
-	);
-
-	let IconComponent = $derived(() => {
-		if (isImage) return Image;
-		if (isText || isPdf) return FileText;
-		if (isAudio) return Music;
-
-		return FileIcon;
-	});
-
-	let pdfViewMode = $state<'text' | 'pages'>('pages');
-
-	let pdfImages = $state<string[]>([]);
-
-	let pdfImagesLoading = $state(false);
-
-	let pdfImagesError = $state<string | null>(null);
-
-	async function loadPdfImages() {
-		if (!isPdf || pdfImages.length > 0 || pdfImagesLoading) return;
-
-		pdfImagesLoading = true;
-		pdfImagesError = null;
-
-		try {
-			let file: File | null = null;
-
-			if (uploadedFile?.file) {
-				file = uploadedFile.file;
-			} else if (attachment?.type === 'pdfFile') {
-				// Check if we have pre-processed images
-				if (attachment.images && Array.isArray(attachment.images)) {
-					pdfImages = attachment.images;
-					return;
-				}
-
-				// Convert base64 back to File for processing
-				if (attachment.base64Data) {
-					const base64Data = attachment.base64Data;
-					const byteCharacters = atob(base64Data);
-					const byteNumbers = new Array(byteCharacters.length);
-					for (let i = 0; i < byteCharacters.length; i++) {
-						byteNumbers[i] = byteCharacters.charCodeAt(i);
-					}
-					const byteArray = new Uint8Array(byteNumbers);
-					file = new File([byteArray], displayName, { type: MimeTypeApplication.PDF });
-				}
-			}
-
-			if (file) {
-				pdfImages = await convertPDFToImage(file);
-			} else {
-				throw new Error('No PDF file available for conversion');
-			}
-		} catch (error) {
-			pdfImagesError = error instanceof Error ? error.message : 'Failed to load PDF images';
-		} finally {
-			pdfImagesLoading = false;
-		}
-	}
-
-	export function reset() {
-		pdfImages = [];
-		pdfImagesLoading = false;
-		pdfImagesError = null;
-		pdfViewMode = 'pages';
-	}
-
-	$effect(() => {
-		if (isPdf && pdfViewMode === 'pages') {
-			loadPdfImages();
-		}
-	});
-</script>
-
-<div class="space-y-4">
-	<div class="flex items-center justify-end gap-6">
-		{#if isPdf}
-			<div class="flex items-center gap-2">
-				<Button
-					variant={pdfViewMode === 'text' ? 'default' : 'outline'}
-					size="sm"
-					onclick={() => (pdfViewMode = 'text')}
-					disabled={pdfImagesLoading}
-				>
-					<FileText class="mr-1 h-4 w-4" />
-
-					Text
-				</Button>
-
-				<Button
-					variant={pdfViewMode === 'pages' ? 'default' : 'outline'}
-					size="sm"
-					onclick={() => {
-						pdfViewMode = 'pages';
-						loadPdfImages();
-					}}
-					disabled={pdfImagesLoading}
-				>
-					{#if pdfImagesLoading}
-						<div
-							class="mr-1 h-4 w-4 animate-spin rounded-full border-2 border-current border-t-transparent"
-						></div>
-					{:else}
-						<Eye class="mr-1 h-4 w-4" />
-					{/if}
-
-					Pages
-				</Button>
-			</div>
-		{/if}
-	</div>
-
-	<div class="flex-1 overflow-auto">
-		{#if isImage && displayPreview}
-			<div class="flex items-center justify-center">
-				<img
-					src={displayPreview}
-					alt={displayName}
-					class="max-h-full rounded-lg object-contain shadow-lg"
-				/>
-			</div>
-		{:else if isPdf && pdfViewMode === 'pages'}
-			{#if pdfImagesLoading}
-				<div class="flex items-center justify-center p-8">
-					<div class="text-center">
-						<div
-							class="mx-auto mb-4 h-8 w-8 animate-spin rounded-full border-4 border-primary border-t-transparent"
-						></div>
-
-						<p class="text-muted-foreground">Converting PDF to images...</p>
-					</div>
-				</div>
-			{:else if pdfImagesError}
-				<div class="flex items-center justify-center p-8">
-					<div class="text-center">
-						<FileText class="mx-auto mb-4 h-16 w-16 text-muted-foreground" />
-
-						<p class="mb-4 text-muted-foreground">Failed to load PDF images</p>
-
-						<p class="text-sm text-muted-foreground">{pdfImagesError}</p>
-
-						<Button class="mt-4" onclick={() => (pdfViewMode = 'text')}>View as Text</Button>
-					</div>
-				</div>
-			{:else if pdfImages.length > 0}
-				<div class="max-h-[70vh] space-y-4 overflow-auto">
-					{#each pdfImages as image, index (image)}
-						<div class="text-center">
-							<p class="mb-2 text-sm text-muted-foreground">Page {index + 1}</p>
-
-							<img
-								src={image}
-								alt="PDF Page {index + 1}"
-								class="mx-auto max-w-full rounded-lg shadow-lg"
-							/>
-						</div>
-					{/each}
-				</div>
-			{:else}
-				<div class="flex items-center justify-center p-8">
-					<div class="text-center">
-						<FileText class="mx-auto mb-4 h-16 w-16 text-muted-foreground" />
-
-						<p class="mb-4 text-muted-foreground">No PDF pages available</p>
-					</div>
-				</div>
-			{/if}
-		{:else if (isText || (isPdf && pdfViewMode === 'text')) && displayTextContent}
-			<div
-				class="max-h-[60vh] overflow-auto rounded-lg bg-muted p-4 font-mono text-sm break-words whitespace-pre-wrap"
-			>
-				{displayTextContent}
-			</div>
-		{:else if isAudio}
-			<div class="flex items-center justify-center p-8">
-				<div class="w-full max-w-md text-center">
-					<Music class="mx-auto mb-4 h-16 w-16 text-muted-foreground" />
-
-					{#if attachment?.type === 'audioFile'}
-						<audio
-							controls
-							class="mb-4 w-full"
-							src="data:{attachment.mimeType};base64,{attachment.base64Data}"
-						>
-							Your browser does not support the audio element.
-						</audio>
-					{:else if uploadedFile?.preview}
-						<audio controls class="mb-4 w-full" src={uploadedFile.preview}>
-							Your browser does not support the audio element.
-						</audio>
-					{:else}
-						<p class="mb-4 text-muted-foreground">Audio preview not available</p>
-					{/if}
-
-					<p class="text-sm text-muted-foreground">
-						{displayName}
-					</p>
-				</div>
-			</div>
-		{:else}
-			<div class="flex items-center justify-center p-8">
-				<div class="text-center">
-					{#if IconComponent}
-						<IconComponent class="mx-auto mb-4 h-16 w-16 text-muted-foreground" />
-					{/if}
-
-					<p class="mb-4 text-muted-foreground">Preview not available for this file type</p>
-				</div>
-			</div>
-		{/if}
-	</div>
-</div>
--- a/examples/server/webui_llamacpp/src/lib/components/app/chat/ChatAttachments/ChatAttachmentThumbnailFile.svelte
+++ b/examples/server/webui_llamacpp/src/lib/components/app/chat/ChatAttachments/ChatAttachmentThumbnailFile.svelte
@ -1,129 +0,0 @@
-<script lang="ts">
-	import { RemoveButton } from '$lib/components/app';
-	import { formatFileSize, getFileTypeLabel, getPreviewText } from '$lib/utils/file-preview';
-	import { FileTypeCategory, MimeTypeText } from '$lib/enums/files';
-
-	interface Props {
-		class?: string;
-		id: string;
-		onClick?: (event?: MouseEvent) => void;
-		onRemove?: (id: string) => void;
-		name: string;
-		readonly?: boolean;
-		size?: number;
-		textContent?: string;
-		type: string;
-	}
-
-	let {
-		class: className = '',
-		id,
-		onClick,
-		onRemove,
-		name,
-		readonly = false,
-		size,
-		textContent,
-		type
-	}: Props = $props();
-</script>
-
-{#if type === MimeTypeText.PLAIN || type === FileTypeCategory.TEXT}
-	{#if readonly}
-		<!-- Readonly mode (ChatMessage) -->
-		<button
-			class="cursor-pointer rounded-lg border border-border bg-muted p-3 transition-shadow hover:shadow-md {className} w-full max-w-2xl"
-			onclick={onClick}
-			aria-label={`Preview ${name}`}
-			type="button"
-		>
-			<div class="flex items-start gap-3">
-				<div class="flex min-w-0 flex-1 flex-col items-start text-left">
-					<span class="w-full truncate text-sm font-medium text-foreground">{name}</span>
-
-					{#if size}
-						<span class="text-xs text-muted-foreground">{formatFileSize(size)}</span>
-					{/if}
-
-					{#if textContent && type === 'text'}
-						<div class="relative mt-2 w-full">
-							<div
-								class="overflow-hidden font-mono text-xs leading-relaxed break-words whitespace-pre-wrap text-muted-foreground"
-							>
-								{getPreviewText(textContent)}
-							</div>
-
-							{#if textContent.length > 150}
-								<div
-									class="pointer-events-none absolute right-0 bottom-0 left-0 h-6 bg-gradient-to-t from-muted to-transparent"
-								></div>
-							{/if}
-						</div>
-					{/if}
-				</div>
-			</div>
-		</button>
-	{:else}
-		<!-- Non-readonly mode (ChatForm) -->
-		<button
-			class="group relative rounded-lg border border-border bg-muted p-3 {className} {textContent
-				? 'max-h-24 max-w-72'
-				: 'max-w-36'} cursor-pointer text-left"
-			onclick={onClick}
-		>
-			<div class="absolute top-2 right-2 opacity-0 transition-opacity group-hover:opacity-100">
-				<RemoveButton {id} {onRemove} />
-			</div>
-
-			<div class="pr-8">
-				<span class="mb-3 block truncate text-sm font-medium text-foreground">{name}</span>
-
-				{#if textContent}
-					<div class="relative">
-						<div
-							class="overflow-hidden font-mono text-xs leading-relaxed break-words whitespace-pre-wrap text-muted-foreground"
-							style="max-height: 3rem; line-height: 1.2em;"
-						>
-							{getPreviewText(textContent)}
-						</div>
-
-						{#if textContent.length > 150}
-							<div
-								class="pointer-events-none absolute right-0 bottom-0 left-0 h-4 bg-gradient-to-t from-muted to-transparent"
-							></div>
-						{/if}
-					</div>
-				{/if}
-			</div>
-		</button>
-	{/if}
-{:else}
-	<button
-		class="group flex items-center gap-3 rounded-lg border border-border bg-muted p-3 {className} relative"
-		onclick={onClick}
-	>
-		<div
-			class="flex h-8 w-8 items-center justify-center rounded bg-primary/10 text-xs font-medium text-primary"
-		>
-			{getFileTypeLabel(type)}
-		</div>
-
-		<div class="flex flex-col gap-1">
-			<span
-				class="max-w-24 truncate text-sm font-medium text-foreground group-hover:pr-6 md:max-w-32"
-			>
-				{name}
-			</span>
-
-			{#if size}
-				<span class="text-left text-xs text-muted-foreground">{formatFileSize(size)}</span>
-			{/if}
-		</div>
-
-		{#if !readonly}
-			<div class="absolute top-2 right-2 opacity-0 transition-opacity group-hover:opacity-100">
-				<RemoveButton {id} {onRemove} />
-			</div>
-		{/if}
-	</button>
-{/if}
--- a/examples/server/webui_llamacpp/src/lib/components/app/chat/ChatAttachments/ChatAttachmentsList.svelte
+++ b/examples/server/webui_llamacpp/src/lib/components/app/chat/ChatAttachments/ChatAttachmentsList.svelte
@ -1,278 +0,0 @@
-<script lang="ts">
-	import { ChatAttachmentThumbnailImage, ChatAttachmentThumbnailFile } from '$lib/components/app';
-	import { Button } from '$lib/components/ui/button';
-	import { ChevronLeft, ChevronRight } from '@lucide/svelte';
-	import { FileTypeCategory } from '$lib/enums/files';
-	import { getFileTypeCategory } from '$lib/utils/file-type';
-	import { DialogChatAttachmentPreview, DialogChatAttachmentsViewAll } from '$lib/components/app';
-	import type { ChatAttachmentDisplayItem, ChatAttachmentPreviewItem } from '$lib/types/chat';
-
-	interface Props {
-		class?: string;
-		style?: string;
-		// For ChatMessage - stored attachments
-		attachments?: DatabaseMessageExtra[];
-		readonly?: boolean;
-		// For ChatForm - pending uploads
-		onFileRemove?: (fileId: string) => void;
-		uploadedFiles?: ChatUploadedFile[];
-		// Image size customization
-		imageClass?: string;
-		imageHeight?: string;
-		imageWidth?: string;
-		// Limit display to single row with "+ X more" button
-		limitToSingleRow?: boolean;
-	}
-
-	let {
-		class: className = '',
-		style = '',
-		attachments = [],
-		readonly = false,
-		onFileRemove,
-		uploadedFiles = $bindable([]),
-		// Default to small size for form previews
-		imageClass = '',
-		imageHeight = 'h-24',
-		imageWidth = 'w-auto',
-		limitToSingleRow = false
-	}: Props = $props();
-
-	let displayItems = $derived(getDisplayItems());
-
-	let canScrollLeft = $state(false);
-	let canScrollRight = $state(false);
-	let isScrollable = $state(false);
-	let previewDialogOpen = $state(false);
-	let previewItem = $state<ChatAttachmentPreviewItem | null>(null);
-	let scrollContainer: HTMLDivElement | undefined = $state();
-	let showViewAll = $derived(limitToSingleRow && displayItems.length > 0 && isScrollable);
-	let viewAllDialogOpen = $state(false);
-
-	function getDisplayItems(): ChatAttachmentDisplayItem[] {
-		const items: ChatAttachmentDisplayItem[] = [];
-
-		// Add uploaded files (ChatForm)
-		for (const file of uploadedFiles) {
-			items.push({
-				id: file.id,
-				name: file.name,
-				size: file.size,
-				preview: file.preview,
-				type: file.type,
-				isImage: getFileTypeCategory(file.type) === FileTypeCategory.IMAGE,
-				uploadedFile: file,
-				textContent: file.textContent
-			});
-		}
-
-		// Add stored attachments (ChatMessage)
-		for (const [index, attachment] of attachments.entries()) {
-			if (attachment.type === 'imageFile') {
-				items.push({
-					id: `attachment-${index}`,
-					name: attachment.name,
-					preview: attachment.base64Url,
-					type: 'image',
-					isImage: true,
-					attachment,
-					attachmentIndex: index
-				});
-			} else if (attachment.type === 'textFile') {
-				items.push({
-					id: `attachment-${index}`,
-					name: attachment.name,
-					type: 'text',
-					isImage: false,
-					attachment,
-					attachmentIndex: index,
-					textContent: attachment.content
-				});
-			} else if (attachment.type === 'context') {
-				// Legacy format from old webui - treat as text file
-				items.push({
-					id: `attachment-${index}`,
-					name: attachment.name,
-					type: 'text',
-					isImage: false,
-					attachment,
-					attachmentIndex: index,
-					textContent: attachment.content
-				});
-			} else if (attachment.type === 'audioFile') {
-				items.push({
-					id: `attachment-${index}`,
-					name: attachment.name,
-					type: attachment.mimeType || 'audio',
-					isImage: false,
-					attachment,
-					attachmentIndex: index
-				});
-			} else if (attachment.type === 'pdfFile') {
-				items.push({
-					id: `attachment-${index}`,
-					name: attachment.name,
-					type: 'application/pdf',
-					isImage: false,
-					attachment,
-					attachmentIndex: index,
-					textContent: attachment.content
-				});
-			}
-		}
-
-		return items.reverse();
-	}
-
-	function openPreview(item: ChatAttachmentDisplayItem, event?: MouseEvent) {
-		event?.stopPropagation();
-		event?.preventDefault();
-
-		previewItem = {
-			uploadedFile: item.uploadedFile,
-			attachment: item.attachment,
-			preview: item.preview,
-			name: item.name,
-			type: item.type,
-			size: item.size,
-			textContent: item.textContent
-		};
-		previewDialogOpen = true;
-	}
-
-	function scrollLeft(event?: MouseEvent) {
-		event?.stopPropagation();
-		event?.preventDefault();
-
-		if (!scrollContainer) return;
-
-		scrollContainer.scrollBy({ left: scrollContainer.clientWidth * -0.67, behavior: 'smooth' });
-	}
-
-	function scrollRight(event?: MouseEvent) {
-		event?.stopPropagation();
-		event?.preventDefault();
-
-		if (!scrollContainer) return;
-
-		scrollContainer.scrollBy({ left: scrollContainer.clientWidth * 0.67, behavior: 'smooth' });
-	}
-
-	function updateScrollButtons() {
-		if (!scrollContainer) return;
-
-		const { scrollLeft, scrollWidth, clientWidth } = scrollContainer;
-
-		canScrollLeft = scrollLeft > 0;
-		canScrollRight = scrollLeft < scrollWidth - clientWidth - 1;
-		isScrollable = scrollWidth > clientWidth;
-	}
-
-	$effect(() => {
-		if (scrollContainer && displayItems.length) {
-			scrollContainer.scrollLeft = 0;
-
-			setTimeout(() => {
-				updateScrollButtons();
-			}, 0);
-		}
-	});
-</script>
-
-{#if displayItems.length > 0}
-	<div class={className} {style}>
-		<div class="relative">
-			<button
-				class="absolute top-1/2 left-4 z-10 flex h-6 w-6 -translate-y-1/2 items-center justify-center rounded-full bg-foreground/15 shadow-md backdrop-blur-xs transition-opacity hover:bg-foreground/35 {canScrollLeft
-					? 'opacity-100'
-					: 'pointer-events-none opacity-0'}"
-				onclick={scrollLeft}
-				aria-label="Scroll left"
-			>
-				<ChevronLeft class="h-4 w-4" />
-			</button>
-
-			<div
-				class="scrollbar-hide flex items-start gap-3 overflow-x-auto"
-				bind:this={scrollContainer}
-				onscroll={updateScrollButtons}
-			>
-				{#each displayItems as item (item.id)}
-					{#if item.isImage && item.preview}
-						<ChatAttachmentThumbnailImage
-							class="flex-shrink-0 cursor-pointer {limitToSingleRow ? 'first:ml-4 last:mr-4' : ''}"
-							id={item.id}
-							name={item.name}
-							preview={item.preview}
-							{readonly}
-							onRemove={onFileRemove}
-							height={imageHeight}
-							width={imageWidth}
-							{imageClass}
-							onClick={(event) => openPreview(item, event)}
-						/>
-					{:else}
-						<ChatAttachmentThumbnailFile
-							class="flex-shrink-0 cursor-pointer {limitToSingleRow ? 'first:ml-4 last:mr-4' : ''}"
-							id={item.id}
-							name={item.name}
-							type={item.type}
-							size={item.size}
-							{readonly}
-							onRemove={onFileRemove}
-							textContent={item.textContent}
-							onClick={(event) => openPreview(item, event)}
-						/>
-					{/if}
-				{/each}
-			</div>
-
-			<button
-				class="absolute top-1/2 right-4 z-10 flex h-6 w-6 -translate-y-1/2 items-center justify-center rounded-full bg-foreground/15 shadow-md backdrop-blur-xs transition-opacity hover:bg-foreground/35 {canScrollRight
-					? 'opacity-100'
-					: 'pointer-events-none opacity-0'}"
-				onclick={scrollRight}
-				aria-label="Scroll right"
-			>
-				<ChevronRight class="h-4 w-4" />
-			</button>
-		</div>
-
-		{#if showViewAll}
-			<div class="mt-2 -mr-2 flex justify-end px-4">
-				<Button
-					type="button"
-					variant="ghost"
-					size="sm"
-					class="h-6 text-xs text-muted-foreground hover:text-foreground"
-					onclick={() => (viewAllDialogOpen = true)}
-				>
-					View all
-				</Button>
-			</div>
-		{/if}
-	</div>
-{/if}
-
-{#if previewItem}
-	<DialogChatAttachmentPreview
-		bind:open={previewDialogOpen}
-		uploadedFile={previewItem.uploadedFile}
-		attachment={previewItem.attachment}
-		preview={previewItem.preview}
-		name={previewItem.name}
-		type={previewItem.type}
-		size={previewItem.size}
-		textContent={previewItem.textContent}
-	/>
-{/if}
-
-<DialogChatAttachmentsViewAll
-	bind:open={viewAllDialogOpen}
-	{uploadedFiles}
-	{attachments}
-	{readonly}
-	{onFileRemove}
-	imageHeight="h-64"
-	{imageClass}
-/>
--- a/examples/server/webui_llamacpp/src/lib/components/app/chat/ChatAttachments/ChatAttachmentsList/ChatAttachmentsList.svelte
+++ b/examples/server/webui_llamacpp/src/lib/components/app/chat/ChatAttachments/ChatAttachmentsList/ChatAttachmentsList.svelte
@ -0,0 +1,119 @@
+<script lang="ts">
+	import {
+		ChatAttachmentsListItem,
+		DialogChatAttachmentsPreview,
+		DialogMcpResourcePreview,
+		HorizontalScrollCarousel
+	} from '$lib/components/app';
+	import type { DatabaseMessageExtraMcpResource } from '$lib/types';
+	import { getAttachmentDisplayItems, isMcpPrompt, isMcpResource } from '$lib/utils';
+
+	interface Props {
+		class?: string;
+		style?: string;
+		// For ChatMessage - stored attachments
+		attachments?: DatabaseMessageExtra[];
+		readonly?: boolean;
+		// For ChatForm - pending uploads
+		onFileRemove?: (fileId: string) => void;
+		uploadedFiles?: ChatUploadedFile[];
+		// Image size customization
+		imageClass?: string;
+		imageHeight?: string;
+		imageWidth?: string;
+		// Limit display to a single row rendered in a horizontal scroll carousel
+		limitToSingleRow?: boolean;
+		// For vision modality check
+		activeModelId?: string;
+	}
+
+	let {
+		class: className = '',
+		style = '',
+		attachments = [],
+		readonly = false,
+		onFileRemove,
+		uploadedFiles = $bindable([]),
+		// Default to small size for form previews
+		imageClass = '',
+		imageHeight = 'h-24',
+		imageWidth = 'w-auto',
+		limitToSingleRow = false,
+		activeModelId
+	}: Props = $props();
+
+	let carouselRef: HorizontalScrollCarousel | undefined = $state();
+	let mcpResourcePreviewOpen = $state(false);
+	let mcpResourcePreviewExtra = $state<DatabaseMessageExtraMcpResource | null>(null);
+	let previewFocusIndex = $state(0);
+	let viewAllDialogOpen = $state(false);
+
+	let displayItems = $derived(getAttachmentDisplayItems({ uploadedFiles, attachments }));
+
+	function openPreview(item: ChatAttachmentDisplayItem, event?: MouseEvent) {
+		event?.stopPropagation();
+		event?.preventDefault();
+
+		// Find the index of the clicked item among non-MCP attachments
+		const nonMcpItems = displayItems.filter((i) => !isMcpPrompt(i) && !isMcpResource(i));
+		const index = nonMcpItems.findIndex((i) => i.id === item.id);
+
+		previewFocusIndex = index >= 0 ? index : 0;
+		viewAllDialogOpen = true;
+	}
+
+	function openMcpResourcePreview(extra: DatabaseMessageExtraMcpResource) {
+		mcpResourcePreviewExtra = extra;
+		mcpResourcePreviewOpen = true;
+	}
+
+	$effect(() => {
+		if (carouselRef && displayItems.length) {
+			carouselRef.resetScroll();
+		}
+	});
+</script>
+
+{#snippet attachmentitem(item: ChatAttachmentDisplayItem)}
+	<ChatAttachmentsListItem
+		{imageClass}
+		{imageHeight}
+		{imageWidth}
+		{item}
+		{limitToSingleRow}
+		{onFileRemove}
+		onMcpResourcePreview={openMcpResourcePreview}
+		onPreview={(i: ChatAttachmentDisplayItem, event?: MouseEvent) => openPreview(i, event)}
+		{readonly}
+	/>
+{/snippet}
+
+{#if displayItems.length > 0}
+	<div class={className} {style}>
+		{#if limitToSingleRow}
+			<HorizontalScrollCarousel bind:this={carouselRef}>
+				{#each displayItems as item (item.id)}
+					{@render attachmentitem(item)}
+				{/each}
+			</HorizontalScrollCarousel>
+		{:else}
+			<div class="flex flex-wrap items-start justify-end gap-3">
+				{#each displayItems as item (item.id)}
+					{@render attachmentitem(item)}
+				{/each}
+			</div>
+		{/if}
+	</div>
+{/if}
+
+<DialogChatAttachmentsPreview
+	{activeModelId}
+	{attachments}
+	bind:open={viewAllDialogOpen}
+	{previewFocusIndex}
+	{uploadedFiles}
+/>
+
+{#if mcpResourcePreviewExtra}
+	<DialogMcpResourcePreview extra={mcpResourcePreviewExtra} bind:open={mcpResourcePreviewOpen} />
+{/if}
--- a/examples/server/webui_llamacpp/src/lib/components/app/chat/ChatAttachments/ChatAttachmentsList/ChatAttachmentsListItem/ChatAttachmentsListItem.svelte
+++ b/examples/server/webui_llamacpp/src/lib/components/app/chat/ChatAttachments/ChatAttachmentsList/ChatAttachmentsListItem/ChatAttachmentsListItem.svelte
@ -0,0 +1,132 @@
+<script lang="ts">
+	import {
+		ChatAttachmentsListItemMcpPrompt,
+		ChatAttachmentsListItemMcpResource,
+		ChatAttachmentsListItemThumbnailImage,
+		ChatAttachmentsListItemThumbnailFile
+	} from '$lib/components/app';
+	import { AttachmentType } from '$lib/enums';
+	import type {
+		ChatAttachmentDisplayItem,
+		DatabaseMessageExtraMcpPrompt,
+		DatabaseMessageExtraMcpResource,
+		MCPResourceAttachment
+	} from '$lib/types';
+	import { isMcpPrompt, isMcpResource, isPdfFile } from '$lib/utils';
+
+	interface Props {
+		class?: string;
+		imageClass?: string;
+		imageHeight?: string;
+		imageWidth?: string;
+		item: ChatAttachmentDisplayItem;
+		limitToSingleRow?: boolean;
+		onFileRemove?: (fileId: string) => void;
+		onMcpResourcePreview?: (extra: DatabaseMessageExtraMcpResource) => void;
+		onPreview?: (item: ChatAttachmentDisplayItem) => void;
+		readonly?: boolean;
+	}
+
+	let {
+		class: className = '',
+		imageClass = '',
+		imageHeight = 'h-24',
+		imageWidth = 'w-auto',
+		item,
+		limitToSingleRow = false,
+		onFileRemove,
+		onMcpResourcePreview,
+		onPreview,
+		readonly = false
+	}: Props = $props();
+
+	const scrollClasses = $derived(limitToSingleRow ? 'first:ml-4 last:mr-4' : '');
+
+	function toMcpResourceAttachment(
+		extra: DatabaseMessageExtraMcpResource,
+		id: string
+	): MCPResourceAttachment {
+		return {
+			id,
+			resource: {
+				uri: extra.uri,
+				name: extra.name,
+				title: extra.name,
+				serverName: extra.serverName
+			}
+		};
+	}
+</script>
+
+{#if isMcpPrompt(item)}
+	{@const mcpPrompt =
+		item.attachment?.type === AttachmentType.MCP_PROMPT
+			? (item.attachment as DatabaseMessageExtraMcpPrompt)
+			: item.uploadedFile?.mcpPrompt
+				? {
+						type: AttachmentType.MCP_PROMPT as const,
+						name: item.name,
+						serverName: item.uploadedFile.mcpPrompt.serverName,
+						promptName: item.uploadedFile.mcpPrompt.promptName,
+						content: item.textContent ?? '',
+						arguments: item.uploadedFile.mcpPrompt.arguments
+					}
+				: null}
+	{#if mcpPrompt}
+		<ChatAttachmentsListItemMcpPrompt
+			class="max-w-[300px] min-w-[200px] flex-shrink-0 {className} {scrollClasses}"
+			prompt={mcpPrompt}
+			{readonly}
+			isLoading={item.isLoading}
+			loadError={item.loadError}
+			onRemove={onFileRemove ? () => onFileRemove(item.id) : undefined}
+		/>
+	{/if}
+{:else if isMcpResource(item)}
+	{@const mcpResource = item.attachment as DatabaseMessageExtraMcpResource}
+
+	<ChatAttachmentsListItemMcpResource
+		class="flex-shrink-0 {className} {scrollClasses}"
+		attachment={toMcpResourceAttachment(mcpResource, item.id)}
+		onclick={() => onMcpResourcePreview?.(mcpResource)}
+	/>
+{:else if item.isImage && item.preview}
+	<ChatAttachmentsListItemThumbnailImage
+		class="flex-shrink-0 cursor-pointer {className} {scrollClasses}"
+		id={item.id}
+		name={item.name}
+		preview={item.preview}
+		{readonly}
+		onRemove={onFileRemove}
+		height={imageHeight}
+		width={imageWidth}
+		{imageClass}
+		onclick={() => onPreview?.(item)}
+	/>
+{:else if isPdfFile(item.attachment, item.uploadedFile)}
+	<ChatAttachmentsListItemThumbnailFile
+		class="flex-shrink-0 cursor-pointer {className} {scrollClasses}"
+		id={item.id}
+		name={item.name}
+		size={item.size}
+		{readonly}
+		onRemove={onFileRemove}
+		textContent={item.textContent}
+		attachment={item.attachment}
+		uploadedFile={item.uploadedFile}
+		onclick={() => onPreview?.(item)}
+	/>
+{:else}
+	<ChatAttachmentsListItemThumbnailFile
+		class="flex-shrink-0 cursor-pointer {className} {scrollClasses}"
+		id={item.id}
+		name={item.name}
+		size={item.size}
+		{readonly}
+		onRemove={onFileRemove}
+		textContent={item.textContent}
+		attachment={item.attachment}
+		uploadedFile={item.uploadedFile}
+		onclick={() => onPreview?.(item)}
+	/>
+{/if}
--- a/examples/server/webui_llamacpp/src/lib/components/app/chat/ChatAttachments/ChatAttachmentsList/ChatAttachmentsListItem/ChatAttachmentsListItemMcpPrompt.svelte
+++ b/examples/server/webui_llamacpp/src/lib/components/app/chat/ChatAttachments/ChatAttachmentsList/ChatAttachmentsListItem/ChatAttachmentsListItemMcpPrompt.svelte
@ -0,0 +1,41 @@
+<script lang="ts">
+	import { ChatMessageMcpPromptContent, ActionIcon } from '$lib/components/app';
+	import { X } from '@lucide/svelte';
+	import type { DatabaseMessageExtraMcpPrompt } from '$lib/types';
+	import { McpPromptVariant } from '$lib/enums';
+
+	interface Props {
+		class?: string;
+		isLoading?: boolean;
+		loadError?: string;
+		onRemove?: () => void;
+		prompt: DatabaseMessageExtraMcpPrompt;
+		readonly?: boolean;
+	}
+
+	let {
+		class: className = '',
+		isLoading = false,
+		loadError,
+		onRemove,
+		prompt,
+		readonly = false
+	}: Props = $props();
+</script>
+
+<div class="group relative {className}">
+	<ChatMessageMcpPromptContent
+		{isLoading}
+		{loadError}
+		{prompt}
+		variant={McpPromptVariant.ATTACHMENT}
+	/>
+
+	{#if !readonly && onRemove}
+		<div
+			class="absolute top-10 right-2 flex items-center justify-center opacity-0 transition-opacity group-hover:opacity-100"
+		>
+			<ActionIcon icon={X} tooltip="Remove" stopPropagationOnClick onclick={() => onRemove?.()} />
+		</div>
+	{/if}
+</div>
--- a/examples/server/webui_llamacpp/src/lib/components/app/chat/ChatAttachments/ChatAttachmentsList/ChatAttachmentsListItem/ChatAttachmentsListItemMcpResource.svelte
+++ b/examples/server/webui_llamacpp/src/lib/components/app/chat/ChatAttachments/ChatAttachmentsList/ChatAttachmentsListItem/ChatAttachmentsListItemMcpResource.svelte
@ -0,0 +1,89 @@
+<script lang="ts">
+	import { Loader2, AlertCircle } from '@lucide/svelte';
+	import { mcpStore } from '$lib/stores/mcp.svelte';
+	import type { MCPResourceAttachment } from '$lib/types';
+	import * as Tooltip from '$lib/components/ui/tooltip';
+	import { ActionIcon } from '$lib/components/app';
+	import { X } from '@lucide/svelte';
+	import { getResourceIcon, getResourceDisplayName } from '$lib/utils';
+
+	interface Props {
+		attachment: MCPResourceAttachment;
+		class?: string;
+		onclick?: () => void;
+		onRemove?: (attachmentId: string) => void;
+	}
+
+	let { attachment, class: className, onclick, onRemove }: Props = $props();
+
+	const ResourceIcon = $derived(
+		getResourceIcon(attachment.resource.mimeType, attachment.resource.uri)
+	);
+	const serverName = $derived(mcpStore.getServerDisplayName(attachment.resource.serverName));
+	const favicon = $derived(mcpStore.getServerFavicon(attachment.resource.serverName));
+
+	function getStatusClass(attachment: MCPResourceAttachment): string {
+		if (attachment.error) return 'border-red-500/50 bg-red-500/10';
+		// Loading and idle attachments share the default styling returned below.
+
+		return 'border-border/50 bg-muted/30';
+	}
+</script>
+
+<Tooltip.Root>
+	<Tooltip.Trigger>
+		<button
+			class={[
+				'flex flex-shrink-0 items-center gap-1.5 rounded-md border px-2 py-0.75 text-sm transition-colors',
+				getStatusClass(attachment),
+				onclick && 'cursor-pointer hover:bg-muted/50',
+				className
+			]}
+			disabled={!onclick}
+			{onclick}
+			type="button"
+		>
+			{#if attachment.loading}
+				<Loader2 class="h-3 w-3 animate-spin text-muted-foreground" />
+			{:else if attachment.error}
+				<AlertCircle class="h-3 w-3 text-red-500" />
+			{:else}
+				<ResourceIcon class="h-3 w-3 text-muted-foreground" />
+			{/if}
+
+			<span class="max-w-[150px] truncate text-xs">
+				{getResourceDisplayName(attachment.resource)}
+			</span>
+
+			{#if onRemove}
+				<ActionIcon
+					class="-my-2 -mr-1.5 bg-transparent"
+					icon={X}
+					iconSize="h-2 w-2"
+					onclick={() => onRemove?.(attachment.id)}
+					stopPropagationOnClick
+					tooltip="Remove"
+				/>
+			{/if}
+		</button>
+	</Tooltip.Trigger>
+
+	<Tooltip.Content>
+		<div class="flex items-center gap-1 text-xs">
+			{#if favicon}
+				<img
+					alt={attachment.resource.serverName}
+					class="h-3 w-3 shrink-0 rounded-sm"
+					onerror={(e) => {
+						(e.currentTarget as HTMLImageElement).style.display = 'none';
+					}}
+					src={favicon}
+				/>
+			{/if}
+
+			<span class="truncate">
+				{serverName}
+			</span>
+		</div>
+	</Tooltip.Content>
+</Tooltip.Root>
--- a/examples/server/webui_llamacpp/src/lib/components/app/chat/ChatAttachments/ChatAttachmentsList/ChatAttachmentsListItem/ChatAttachmentsListItemThumbnailFile.svelte
+++ b/examples/server/webui_llamacpp/src/lib/components/app/chat/ChatAttachments/ChatAttachmentsList/ChatAttachmentsListItem/ChatAttachmentsListItemThumbnailFile.svelte
@ -0,0 +1,184 @@
+<script lang="ts">
+	import { X, Music, Video } from '@lucide/svelte';
+	import {
+		formatFileSize,
+		getFileTypeLabel,
+		getPreviewText,
+		isPdfFile,
+		isAudioFile,
+		isVideoFile,
+		isTextFile
+	} from '$lib/utils';
+	import { ActionIcon } from '$lib/components/app';
+	import { AttachmentType } from '$lib/enums';
+
+	interface Props {
+		attachment?: DatabaseMessageExtra;
+		class?: string;
+		id: string;
+		onclick?: (event: MouseEvent) => void;
+		onRemove?: (id: string) => void;
+		name: string;
+		readonly?: boolean;
+		size?: number;
+		textContent?: string;
+		// Either uploaded file or stored attachment
+		uploadedFile?: ChatUploadedFile;
+	}
+
+	let {
+		attachment,
+		class: className = '',
+		id,
+		onclick,
+		onRemove,
+		name,
+		readonly = false,
+		size,
+		textContent,
+		uploadedFile
+	}: Props = $props();
+
+	let isPdf = $derived(isPdfFile(attachment, uploadedFile));
+	let isAudio = $derived(isAudioFile(attachment, uploadedFile));
+	let isVideo = $derived(isVideoFile(attachment, uploadedFile));
+	let isPdfWithContent = $derived(isPdf && !!textContent);
+
+	let isText = $derived(isTextFile(attachment, uploadedFile));
+	let isTextWithContent = $derived(isText && !!textContent);
+
+	let fileTypeLabel = $derived.by(() => {
+		if (uploadedFile?.type) {
+			return getFileTypeLabel(uploadedFile.type);
+		}
+
+		if (attachment) {
+			if ('mimeType' in attachment && attachment.mimeType) {
+				return getFileTypeLabel(attachment.mimeType);
+			}
+
+			if (attachment.type) {
+				return getFileTypeLabel(attachment.type);
+			}
+		}
+
+		return getFileTypeLabel(name);
+	});
+
+	let pdfProcessingMode = $derived.by(() => {
+		if (attachment?.type === AttachmentType.PDF) {
+			const pdfAttachment = attachment as DatabaseMessageExtraPdfFile;
+
+			return pdfAttachment.processedAsImages ? 'Sent as Image' : 'Sent as Text';
+		}
+
+		return null;
+	});
+</script>
+
+{#snippet textPreview(content: string)}
+	<div class="relative">
+		<div
+			class="font-mono text-xs leading-relaxed break-words whitespace-pre-wrap text-muted-foreground {!readonly
+				? 'max-h-[3rem] leading-[1.2em]'
+				: ''}"
+		>
+			{getPreviewText(content)}
+		</div>
+
+		{#if content.length > 150}
+			<div
+				class="pointer-events-none absolute right-0 bottom-0 left-0 bg-gradient-to-t from-muted to-transparent {readonly
+					? 'h-6'
+					: 'h-4'}"
+			></div>
+		{/if}
+	</div>
+{/snippet}
+
+{#snippet removeButton()}
+	<div class="absolute top-2 right-2 opacity-0 transition-opacity group-hover:opacity-100">
+		<ActionIcon icon={X} tooltip="Remove" stopPropagationOnClick onclick={() => onRemove?.(id)} />
+	</div>
+{/snippet}
+
+{#snippet fileIcon()}
+	<div
+		class="flex h-8 w-8 items-center justify-center rounded bg-primary/10 text-xs font-medium text-primary"
+	>
+		{#if isAudio}
+			<Music class="h-4 w-4 text-white/70" />
+		{:else if isVideo}
+			<Video class="h-4 w-4 text-white/70" />
+		{:else}
+			{fileTypeLabel}
+		{/if}
+	</div>
+{/snippet}
+
+{#snippet info(text: string | undefined)}
+	{#if text}
+		<span class="text-xs text-muted-foreground">{text}</span>
+	{/if}
+{/snippet}
+
+{#if isTextWithContent || isPdfWithContent}
+	<button
+		aria-label={readonly ? `Preview ${name}` : undefined}
+		class="rounded-lg border border-border bg-muted p-3 {className} cursor-pointer {readonly
+			? 'w-full max-w-2xl transition-shadow hover:shadow-md'
+			: `group relative text-left ${textContent ? 'max-h-24 max-w-72' : 'max-w-36'}`} overflow-hidden"
+		{onclick}
+		type="button"
+	>
+		{#if !readonly}
+			{@render removeButton()}
+		{/if}
+
+		<div class={[!readonly && 'pr-8', 'overflow-hidden']}>
+			{#if readonly}
+				<div class="flex items-start gap-3">
+					<div class="flex min-w-0 flex-1 flex-col items-start text-left">
+						<span class="w-full truncate text-sm font-medium text-foreground">{name}</span>
+
+						{@render info(pdfProcessingMode || (size ? formatFileSize(size) : undefined))}
+
+						{#if textContent}
+							{@render textPreview(textContent)}
+						{/if}
+					</div>
+				</div>
+			{:else}
+				<span class="mb-3 block truncate text-sm font-medium text-foreground">{name}</span>
+
+				{#if textContent}
+					{@render textPreview(textContent)}
+				{/if}
+			{/if}
+		</div>
+	</button>
+{:else}
+	<button
+		class="group flex items-center gap-3 rounded-lg border border-border bg-muted p-3 {className} relative"
+		{onclick}
+		type="button"
+	>
+		{@render fileIcon()}
+
+		<div class="flex flex-col items-start gap-0.5">
+			<span
+				class="max-w-24 truncate text-sm font-medium text-foreground {readonly
+					? ''
+					: 'group-hover:pr-6'} md:max-w-32"
+			>
+				{name}
+			</span>
+
+			{@render info(pdfProcessingMode || (size ? formatFileSize(size) : undefined))}
+		</div>
+
+		{#if !readonly}
+			{@render removeButton()}
+		{/if}
+	</button>
+{/if}
--- a/examples/server/webui_llamacpp/src/lib/components/app/chat/ChatAttachments/ChatAttachmentsList/ChatAttachmentsListItem/ChatAttachmentsListItemThumbnailImage.svelte
+++ b/examples/server/webui_llamacpp/src/lib/components/app/chat/ChatAttachments/ChatAttachmentsList/ChatAttachmentsListItem/ChatAttachmentsListItemThumbnailImage.svelte
@ -1,62 +1,65 @@
 <script lang="ts">
-	import { RemoveButton } from '$lib/components/app';
+	import { ActionIcon } from '$lib/components/app';
+	import { X } from '@lucide/svelte';

 	interface Props {
+		class?: string;
+		height?: string;
 		id: string;
+		imageClass?: string;
+		onclick?: (event?: MouseEvent) => void;
+		onRemove?: (id: string) => void;
 		name: string;
 		preview: string;
 		readonly?: boolean;
-		onRemove?: (id: string) => void;
-		onClick?: (event?: MouseEvent) => void;
-		class?: string;
-		// Customizable size props
 		width?: string;
-		height?: string;
-		imageClass?: string;
 	}

 	let {
+		class: className = '',
+		height = 'h-16',
 		id,
+		imageClass = '',
+		onclick,
+		onRemove,
 		name,
 		preview,
 		readonly = false,
-		onRemove,
-		onClick,
-		class: className = '',
-		// Default to small size for form previews
-		width = 'w-auto',
-		height = 'h-16',
-		imageClass = ''
+		width = 'w-auto'
 	}: Props = $props();
 </script>

-<div class="group relative overflow-hidden rounded-lg border border-border bg-muted {className}">
-	{#if onClick}
+{#snippet image()}
+	<img src={preview} alt={name} class="{height} {width} cursor-pointer object-cover {imageClass}" />
+{/snippet}
+
+<div
+	class="group relative overflow-hidden rounded-lg bg-muted shadow-lg dark:border dark:border-muted {className}"
+>
+	{#if onclick}
 		<button
-			type="button"
-			class="block h-full w-full rounded-lg focus:ring-2 focus:ring-primary focus:ring-offset-2 focus:outline-none"
-			onclick={onClick}
 			aria-label="Preview {name}"
+			class="block h-full w-full rounded-lg focus:ring-2 focus:ring-primary focus:ring-offset-2 focus:outline-none"
+			{onclick}
+			type="button"
 		>
-			<img
-				src={preview}
-				alt={name}
-				class="{height} {width} cursor-pointer object-cover {imageClass}"
-			/>
+			{@render image()}
 		</button>
 	{:else}
-		<img
-			src={preview}
-			alt={name}
-			class="{height} {width} cursor-pointer object-cover {imageClass}"
-		/>
+		{@render image()}
 	{/if}

 	{#if !readonly}
 		<div
 			class="absolute top-1 right-1 flex items-center justify-center opacity-0 transition-opacity group-hover:opacity-100"
 		>
-			<RemoveButton {id} {onRemove} class="text-white" />
+			<ActionIcon
+				class="text-white"
+				icon={X}
+				onclick={() => onRemove?.(id)}
+				stopPropagationOnClick
+				tooltip="Remove"
+			/>
 		</div>
 	{/if}
 </div>
--- a/examples/server/webui_llamacpp/src/lib/components/app/chat/ChatAttachments/ChatAttachmentsPreview.svelte
+++ b/examples/server/webui_llamacpp/src/lib/components/app/chat/ChatAttachments/ChatAttachmentsPreview.svelte
@ -0,0 +1,212 @@
+<script lang="ts">
+	import {
+		ChatAttachmentsPreviewCurrentItem,
+		ChatAttachmentsPreviewFileInfo,
+		ChatAttachmentsPreviewNavButtons,
+		ChatAttachmentsPreviewThumbnailStrip
+	} from '$lib/components/app';
+	import { modelsStore } from '$lib/stores/models.svelte';
+	import {
+		createBase64DataUrl,
+		formatFileSize,
+		getAttachmentDisplayItems,
+		getLanguageFromFilename,
+		isAudioFile,
+		isVideoFile,
+		isImageFile,
+		isMcpPrompt,
+		isMcpResource,
+		isPdfFile,
+		isTextFile
+	} from '$lib/utils';
+
+	interface PreviewItem {
+		id: string;
+		name: string;
+		size?: number;
+		preview?: string;
+		uploadedFile?: ChatUploadedFile;
+		attachment?: DatabaseMessageExtra;
+		textContent?: string;
+		isImage: boolean;
+		isAudio: boolean;
+		isVideo: boolean;
+	}
+
+	interface Props {
+		uploadedFiles?: ChatUploadedFile[];
+		attachments?: DatabaseMessageExtra[];
+		activeModelId?: string;
+		class?: string;
+		previewFocusIndex?: number;
+	}
+
+	let {
+		uploadedFiles = [],
+		attachments = [],
+		activeModelId,
+		class: className = '',
+		previewFocusIndex = 0
+	}: Props = $props();
+
+	let allItems = $derived(
+		getAttachmentDisplayItems({ uploadedFiles, attachments })
+			.filter((item) => !isMcpPrompt(item) && !isMcpResource(item))
+			.map(
+				(item): PreviewItem => ({
+					...item,
+					isImage: isImageFile(item.attachment, item.uploadedFile),
+					isAudio: isAudioFile(item.attachment, item.uploadedFile),
+					isVideo: isVideoFile(item.attachment, item.uploadedFile)
+				})
+			)
+	);
+
+	let currentIndex = $state(0);
+
+	$effect(() => {
+		if (previewFocusIndex >= 0 && previewFocusIndex < allItems.length) {
+			currentIndex = previewFocusIndex;
+		}
+	});
+
+	$effect(() => {
+		const handler = (e: Event) => {
+			// `detail` carries the direction: negative steps back, anything else steps forward
+			const delta = (e as CustomEvent<number>).detail;
+			if (delta < 0) {
+				prev();
+			} else {
+				next();
+			}
+		};
+
+		document.addEventListener('chat-attachments-nav', handler);
+
+		return () => document.removeEventListener('chat-attachments-nav', handler);
+	});
+
+	$effect(() => {
+		const index = currentIndex;
+		setTimeout(() => {
+			const thumbnail = document.querySelector(`[data-thumbnail-index="${index}"]`);
+
+			thumbnail?.scrollIntoView({ behavior: 'smooth', inline: 'center', block: 'nearest' });
+		}, 0);
+	});
+
+	let currentItem = $derived(allItems[currentIndex] ?? null);
+	let displayName = $derived(
+		currentItem?.name ||
+			currentItem?.uploadedFile?.name ||
+			currentItem?.attachment?.name ||
+			'Unknown File'
+	);
+	let isAudio = $derived(
+		currentItem ? isAudioFile(currentItem.attachment, currentItem.uploadedFile) : false
+	);
+	let isVideo = $derived(
+		currentItem ? isVideoFile(currentItem.attachment, currentItem.uploadedFile) : false
+	);
+	let isImage = $derived(
+		currentItem ? isImageFile(currentItem.attachment, currentItem.uploadedFile) : false
+	);
+	let isPdf = $derived(
+		currentItem ? isPdfFile(currentItem.attachment, currentItem.uploadedFile) : false
+	);
+	let isText = $derived(
+		currentItem ? isTextFile(currentItem.attachment, currentItem.uploadedFile) : false
+	);
+
+	let displayPreview = $derived(
+		currentItem?.uploadedFile?.preview ||
+			(isImage && currentItem?.attachment && 'base64Url' in currentItem.attachment
+				? currentItem.attachment.base64Url
+				: currentItem?.preview)
+	);
+
+	let displayTextContent = $derived(
+		currentItem?.uploadedFile?.textContent ||
+			(currentItem?.attachment && 'content' in currentItem.attachment
+				? currentItem.attachment.content
+				: currentItem?.textContent)
+	);
+
+	let language = $derived(getLanguageFromFilename(displayName));
+
+	let fileSize = $derived(currentItem?.size ? formatFileSize(currentItem.size) : '');
+
+	let hasVisionModality = $derived(
+		currentItem && activeModelId ? modelsStore.modelSupportsVision(activeModelId) : false
+	);
+
+	let audioSrc = $derived(
+		isAudio && currentItem
+			? (currentItem.uploadedFile?.preview ??
+					(currentItem.attachment &&
+					'mimeType' in currentItem.attachment &&
+					'base64Data' in currentItem.attachment
+						? createBase64DataUrl(
+								currentItem.attachment.mimeType,
+								currentItem.attachment.base64Data
+							)
+						: null))
+			: null
+	);
+
+	let videoSrc = $derived(
+		isVideo && currentItem
+			? (currentItem.uploadedFile?.preview ??
+					(currentItem.attachment &&
+					'mimeType' in currentItem.attachment &&
+					'base64Data' in currentItem.attachment
+						? createBase64DataUrl(
+								currentItem.attachment.mimeType,
+								currentItem.attachment.base64Data
+							)
+						: null))
+			: null
+	);
+
+	export function prev() {
+		currentIndex = currentIndex > 0 ? currentIndex - 1 : allItems.length - 1;
+	}
+
+	export function next() {
+		currentIndex = currentIndex < allItems.length - 1 ? currentIndex + 1 : 0;
+	}
+
+	function onNavigate(index: number) {
+		currentIndex = index;
+	}
+</script>
+
+<div class="{className} flex flex-col text-white">
+	<div class="relative flex min-h-0 flex-1 items-center justify-center overflow-hidden">
+		<ChatAttachmentsPreviewNavButtons onPrev={prev} onNext={next} show={allItems.length > 1} />
+
+		<div class="flex h-full w-full flex-col items-center justify-start overflow-auto py-4">
+			{#if currentItem}
+				<ChatAttachmentsPreviewFileInfo {displayName} {fileSize} />
+
+				<ChatAttachmentsPreviewCurrentItem
+					{currentItem}
+					{isImage}
+					{isAudio}
+					{isVideo}
+					{isPdf}
+					{isText}
+					{displayPreview}
+					{displayTextContent}
+					{audioSrc}
+					{videoSrc}
+					{language}
+					{hasVisionModality}
+					{activeModelId}
+				/>
+			{/if}
+
+			<ChatAttachmentsPreviewThumbnailStrip items={allItems} {currentIndex} {onNavigate} />
+		</div>
+	</div>
+</div>
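Review note on the wrap-around navigation in `ChatAttachmentsPreview.svelte`: `prev()` and `next()` wrap at both ends of `allItems`, and the `chat-attachments-nav` listener applies the same rule. A minimal sketch of that rule as a pure function follows. The name `stepIndex` is hypothetical and not part of this commit, and the empty-list guard is an assumption: the committed `prev()` would set the index to `-1` when `allItems` is empty, which the `?? null` fallback on `currentItem` then absorbs.

```typescript
// Hypothetical helper, not part of the commit: wrap-around stepping over `length` items.
function stepIndex(current: number, delta: number, length: number): number {
	// Assumption: an empty list pins the index at 0 instead of producing -1.
	if (length <= 0) return 0;

	// Add `length` before the final remainder so a negative step wraps to the end.
	return (((current + delta) % length) + length) % length;
}
```

With a helper like this, `prev()` would reduce to `currentIndex = stepIndex(currentIndex, -1, allItems.length)` and `next()` to the same call with `+1`.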
--- a/examples/server/webui_llamacpp/src/lib/components/app/chat/ChatAttachments/ChatAttachmentsPreview/ChatAttachmentsPreviewCurrentItem/ChatAttachmentsPreviewCurrentItem.svelte
+++ b/examples/server/webui_llamacpp/src/lib/components/app/chat/ChatAttachments/ChatAttachmentsPreview/ChatAttachmentsPreviewCurrentItem/ChatAttachmentsPreviewCurrentItem.svelte
@ -0,0 +1,74 @@
+<script lang="ts">
+	import type { ChatAttachmentDisplayItem } from '$lib/types';
+	import { Image, Music, Video, FileText, FileIcon } from '@lucide/svelte';
+	import ChatAttachmentsPreviewCurrentItemPdf from './ChatAttachmentsPreviewCurrentItemPdf.svelte';
+	import ChatAttachmentsPreviewCurrentItemImage from './ChatAttachmentsPreviewCurrentItemImage.svelte';
+	import ChatAttachmentsPreviewCurrentItemAudio from './ChatAttachmentsPreviewCurrentItemAudio.svelte';
+	import ChatAttachmentsPreviewCurrentItemVideo from './ChatAttachmentsPreviewCurrentItemVideo.svelte';
+	import ChatAttachmentsPreviewCurrentItemText from './ChatAttachmentsPreviewCurrentItemText.svelte';
+	import ChatAttachmentsPreviewCurrentItemUnavailable from './ChatAttachmentsPreviewCurrentItemUnavailable.svelte';
+
+	interface Props {
+		currentItem: ChatAttachmentDisplayItem | null;
+		isImage: boolean;
+		isAudio: boolean;
+		isVideo: boolean;
+		isPdf: boolean;
+		isText: boolean;
+		displayPreview: string | undefined;
+		displayTextContent: string | undefined;
+		audioSrc: string | null;
+		videoSrc: string | null;
+		language: string;
+		hasVisionModality: boolean;
+		activeModelId?: string;
+	}
+
+	let {
+		currentItem,
+		isImage,
+		isAudio,
+		isVideo,
+		isPdf,
+		isText,
+		displayPreview,
+		displayTextContent,
+		audioSrc,
+		videoSrc,
+		language,
+		hasVisionModality,
+		activeModelId
+	}: Props = $props();
+
+	let IconComponent = $derived(
+		isImage ? Image : isText || isPdf ? FileText : isAudio ? Music : isVideo ? Video : FileIcon
+	);
+
+	let isUnavailable = $derived(
+		!isPdf && !isImage && !(isText && displayTextContent) && !isAudio && !isVideo
+	);
+</script>
+
+{#if currentItem}
+	{#key currentItem.id}
+		{#if isPdf}
+			<ChatAttachmentsPreviewCurrentItemPdf
+				{currentItem}
+				displayName={currentItem.name}
+				{displayTextContent}
+				{hasVisionModality}
+				{activeModelId}
+			/>
+		{:else if isImage}
+			<ChatAttachmentsPreviewCurrentItemImage {currentItem} {displayPreview} />
+		{:else if isText && displayTextContent}
+			<ChatAttachmentsPreviewCurrentItemText {displayTextContent} {language} />
+		{:else if isAudio}
+			<ChatAttachmentsPreviewCurrentItemAudio {currentItem} {audioSrc} />
+		{:else if isVideo}
+			<ChatAttachmentsPreviewCurrentItemVideo {currentItem} {videoSrc} />
+		{:else if isUnavailable}
+			<ChatAttachmentsPreviewCurrentItemUnavailable {IconComponent} />
+		{/if}
+	{/key}
+{/if}
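Review note on the branch order in `ChatAttachmentsPreviewCurrentItem.svelte`: the `{#if}` chain checks PDF first, then image, then text (only when text content exists), then audio, then video, and falls through to the "unavailable" view. The sketch below restates that precedence as a pure function so it can be read or tested in isolation. `PreviewKind`, `PreviewFlags`, and `pickPreviewKind` are hypothetical names introduced for illustration; the component itself uses the template chain directly.

```typescript
// Hypothetical types and helper, not part of the commit.
type PreviewKind = 'pdf' | 'image' | 'text' | 'audio' | 'video' | 'unavailable';

interface PreviewFlags {
	isPdf: boolean;
	isImage: boolean;
	isText: boolean;
	isAudio: boolean;
	isVideo: boolean;
	// Stands in for the truthiness check on `displayTextContent`.
	hasTextContent: boolean;
}

// Mirrors the order of the {#if}/{:else if} branches in the template.
function pickPreviewKind(flags: PreviewFlags): PreviewKind {
	if (flags.isPdf) return 'pdf';
	if (flags.isImage) return 'image';
	if (flags.isText && flags.hasTextContent) return 'text';
	if (flags.isAudio) return 'audio';
	if (flags.isVideo) return 'video';
	return 'unavailable';
}
```

One consequence of this order is that a text file with no extracted content lands on the "unavailable" view rather than rendering an empty code block.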
--- a/examples/server/webui_llamacpp/src/lib/components/app/chat/ChatAttachments/ChatAttachmentsPreview/ChatAttachmentsPreviewCurrentItem/ChatAttachmentsPreviewCurrentItemAudio.svelte
+++ b/examples/server/webui_llamacpp/src/lib/components/app/chat/ChatAttachments/ChatAttachmentsPreview/ChatAttachmentsPreviewCurrentItem/ChatAttachmentsPreviewCurrentItemAudio.svelte
@ -0,0 +1,26 @@
+<script lang="ts">
+	import { Music } from '@lucide/svelte';
+
+	interface Props {
+		currentItem: { name?: string } | null;
+		audioSrc: string | null;
+	}
+
+	let { currentItem, audioSrc }: Props = $props();
+</script>
+
+<div class="flex flex-1 items-center justify-center p-8">
+	<div class="w-full max-w-md text-center">
+		<Music class="mx-auto mb-4 h-16 w-16 text-white/50" />
+
+		{#if audioSrc}
+			<audio controls class="mb-4 w-full" src={audioSrc}>
+				Your browser does not support the audio element.
+			</audio>
+		{:else}
+			<p class="mb-4 text-white/70">Audio preview not available</p>
+		{/if}
+
+		<p class="text-sm text-white/50">{currentItem?.name || 'Audio'}</p>
+	</div>
+</div>
--- a/examples/server/webui_llamacpp/src/lib/components/app/chat/ChatAttachments/ChatAttachmentsPreview/ChatAttachmentsPreviewCurrentItem/ChatAttachmentsPreviewCurrentItemImage.svelte
+++ b/examples/server/webui_llamacpp/src/lib/components/app/chat/ChatAttachments/ChatAttachmentsPreview/ChatAttachmentsPreviewCurrentItem/ChatAttachmentsPreviewCurrentItemImage.svelte
@ -0,0 +1,18 @@
+<script lang="ts">
+	interface Props {
+		currentItem: { name?: string } | null;
+		displayPreview: string | undefined;
+	}
+
+	let { currentItem, displayPreview }: Props = $props();
+</script>
+
+{#if displayPreview}
+	<div class="flex flex-1 items-center justify-center">
+		<img
+			src={displayPreview}
+			alt={currentItem?.name || 'preview'}
+			class="max-h-[80vh] max-w-[80vw] rounded-lg object-contain shadow-lg"
+		/>
+	</div>
+{/if}
--- a/examples/server/webui_llamacpp/src/lib/components/app/chat/ChatAttachments/ChatAttachmentsPreview/ChatAttachmentsPreviewCurrentItem/ChatAttachmentsPreviewCurrentItemPdf.svelte
+++ b/examples/server/webui_llamacpp/src/lib/components/app/chat/ChatAttachments/ChatAttachmentsPreview/ChatAttachmentsPreviewCurrentItem/ChatAttachmentsPreviewCurrentItemPdf.svelte
@ -0,0 +1,174 @@
+<script lang="ts">
+	import type { ChatAttachmentDisplayItem } from '$lib/types';
+	import { FileText, Eye, Info } from '@lucide/svelte';
+	import { Button } from '$lib/components/ui/button';
+	import * as Alert from '$lib/components/ui/alert';
+	import { SyntaxHighlightedCode } from '$lib/components/app';
+	import { getLanguageFromFilename } from '$lib/utils';
+	import { convertPDFToImage } from '$lib/utils/browser-only';
+	import { PdfViewMode } from '$lib/enums';
+
+	interface Props {
+		currentItem: ChatAttachmentDisplayItem | null;
+		displayName: string;
+		displayTextContent: string | undefined;
+		hasVisionModality: boolean;
+		activeModelId?: string;
+	}
+
+	let { currentItem, displayName, displayTextContent, hasVisionModality, activeModelId }: Props =
+		$props();
+
+	let pdfViewMode = $state<PdfViewMode>(PdfViewMode.PAGES);
+	let pdfImages = $state<string[]>([]);
+	let pdfImagesLoading = $state(false);
+	let pdfImagesError = $state<string | null>(null);
+
+	let language = $derived(getLanguageFromFilename(displayName));
+
+	async function loadPdfImages() {
+		if (pdfImages.length > 0 || pdfImagesLoading || !currentItem) return;
+
+		pdfImagesLoading = true;
+		pdfImagesError = null;
+
+		try {
+			let file: File | null = null;
+
+			if (currentItem.uploadedFile?.file) {
+				file = currentItem.uploadedFile.file;
+			} else if (currentItem.attachment) {
+				// Check if we have pre-processed images
+				if (
+					'images' in currentItem.attachment &&
+					currentItem.attachment.images &&
+					Array.isArray(currentItem.attachment.images) &&
+					currentItem.attachment.images.length > 0
+				) {
+					pdfImages = currentItem.attachment.images;
+					return;
+				}
+
+				// Convert base64 back to File for processing
+				if ('base64Data' in currentItem.attachment && currentItem.attachment.base64Data) {
+					const base64Data = currentItem.attachment.base64Data;
+					const byteCharacters = atob(base64Data);
+					// Decode straight into a typed array; no intermediate number[] is needed
+					const byteArray = new Uint8Array(byteCharacters.length);
+					for (let i = 0; i < byteCharacters.length; i++) {
+						byteArray[i] = byteCharacters.charCodeAt(i);
+					}
+					file = new File([byteArray], displayName, { type: 'application/pdf' });
+				}
+			}
+
+			if (file) {
+				pdfImages = await convertPDFToImage(file);
+			} else {
+				throw new Error('No PDF file available for conversion');
+			}
+		} catch (error) {
+			pdfImagesError = error instanceof Error ? error.message : 'Failed to load PDF images';
+		} finally {
+			pdfImagesLoading = false;
+		}
+	}
+
+	$effect(() => {
+		if (pdfViewMode === PdfViewMode.PAGES) {
+			loadPdfImages();
+		}
+	});
+</script>
+
+<div class="mb-4 flex items-center justify-end gap-2">
+	<Button
+		variant={pdfViewMode === PdfViewMode.TEXT ? 'default' : 'outline'}
+		size="sm"
+		onclick={() => (pdfViewMode = PdfViewMode.TEXT)}
+		disabled={pdfImagesLoading}
+	>
+		<FileText class="mr-1 h-4 w-4" />
+		Text
+	</Button>
+
+	<Button
+		variant={pdfViewMode === PdfViewMode.PAGES ? 'default' : 'outline'}
+		size="sm"
+		onclick={() => (pdfViewMode = PdfViewMode.PAGES)}
+		disabled={pdfImagesLoading}
+	>
+		{#if pdfImagesLoading}
+			<div
+				class="mr-1 h-4 w-4 animate-spin rounded-full border-2 border-current border-t-transparent"
+			></div>
+		{:else}
+			<Eye class="mr-1 h-4 w-4" />
+		{/if}
+		Pages
+	</Button>
+</div>
+
+{#if !hasVisionModality && activeModelId && currentItem}
+	<Alert.Root class="mb-4 max-w-4xl">
+		<Info class="h-4 w-4" />
+		<Alert.Title>Preview only</Alert.Title>
+		<Alert.Description>
+			<span class="inline-flex">
+				The selected model does not support vision. Only the extracted
+				<!-- svelte-ignore a11y_click_events_have_key_events -->
+				<!-- svelte-ignore a11y_no_static_element_interactions -->
+				<span
+					class="mx-1 cursor-pointer underline"
+					onclick={() => (pdfViewMode = PdfViewMode.TEXT)}
+				>
+					text
+				</span>
+				will be sent to the model.
+			</span>
+		</Alert.Description>
+	</Alert.Root>
+{/if}
+
+{#if pdfImagesLoading}
+	<div class="flex flex-1 items-center justify-center p-8">
+		<div class="text-center">
+			<div
+				class="mx-auto mb-4 h-8 w-8 animate-spin rounded-full border-4 border-white border-t-transparent"
+			></div>
+			<p class="text-white/70">Converting PDF to images...</p>
+		</div>
+	</div>
+{:else if pdfImagesError}
+	<div class="flex flex-1 items-center justify-center p-8">
+		<div class="text-center">
+			<FileText class="mx-auto mb-4 h-16 w-16 text-white/50" />
+			<p class="mb-4 text-white/70">Failed to load PDF images</p>
+			<p class="text-sm text-white/50">{pdfImagesError}</p>
+		</div>
+	</div>
+{:else if pdfImages.length > 0}
+	{#each pdfImages as image, index (image)}
+		<p class="mb-2 text-sm text-white/50">Page {index + 1}</p>
+		<img src={image} alt="PDF Page {index + 1}" class="mx-auto max-w-[85vw] rounded-lg shadow-lg" />
+		<div class="h-4"></div>
+	{/each}
+{:else}
+	<div class="flex flex-1 items-center justify-center p-8">
+		<div class="text-center">
+			<FileText class="mx-auto mb-4 h-16 w-16 text-white/50" />
+			<p class="text-white/70">No PDF pages available</p>
+		</div>
+	</div>
+{/if}
+
+{#if pdfViewMode === PdfViewMode.TEXT && displayTextContent}
+	<div class="px-4 pb-4">
+		<SyntaxHighlightedCode
+			class="max-w-4xl"
+			code={displayTextContent}
+			{language}
+			maxHeight="none"
+		/>
+	</div>
+{/if}
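Review note on the base64 fallback in `loadPdfImages()`: when no `uploadedFile` is present, the stored `base64Data` is decoded with `atob` and turned back into bytes before being wrapped in a `File`. A minimal sketch of the decoding step alone is shown below. `base64ToBytes` is a hypothetical name, not a helper from this commit, and the sketch assumes the input is plain base64 without a `data:` URL prefix.

```typescript
// Hypothetical helper, not part of the commit: decode base64 text into raw bytes.
function base64ToBytes(base64Data: string) {
	// atob yields a "binary string" in which every character code is one byte (0-255).
	const binary = atob(base64Data);
	const bytes = new Uint8Array(binary.length);

	for (let i = 0; i < binary.length; i++) {
		bytes[i] = binary.charCodeAt(i);
	}

	return bytes;
}
```

In the component, the resulting bytes would then be passed to `new File([bytes], displayName, { type: 'application/pdf' })`, as the committed code does with its own byte array.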
--- a/examples/server/webui_llamacpp/src/lib/components/app/chat/ChatAttachments/ChatAttachmentsPreview/ChatAttachmentsPreviewCurrentItem/ChatAttachmentsPreviewCurrentItemText.svelte
+++ b/examples/server/webui_llamacpp/src/lib/components/app/chat/ChatAttachments/ChatAttachmentsPreview/ChatAttachmentsPreviewCurrentItem/ChatAttachmentsPreviewCurrentItemText.svelte
@ -0,0 +1,21 @@
+<script lang="ts">
+	import { SyntaxHighlightedCode } from '$lib/components/app';
+
+	interface Props {
+		displayTextContent: string | undefined;
+		language: string;
+	}
+
+	let { displayTextContent, language }: Props = $props();
+</script>
+
+{#if displayTextContent}
+	<div class="px-4 pb-4">
+		<SyntaxHighlightedCode
+			class="max-w-4xl"
+			code={displayTextContent}
+			{language}
+			maxHeight="none"
+		/>
+	</div>
+{/if}
--- a/examples/server/webui_llamacpp/src/lib/components/app/chat/ChatAttachments/ChatAttachmentsPreview/ChatAttachmentsPreviewCurrentItem/ChatAttachmentsPreviewCurrentItemUnavailable.svelte
+++ b/examples/server/webui_llamacpp/src/lib/components/app/chat/ChatAttachments/ChatAttachmentsPreview/ChatAttachmentsPreviewCurrentItem/ChatAttachmentsPreviewCurrentItemUnavailable.svelte
@ -0,0 +1,17 @@
+<script lang="ts">
+	import type { Component } from 'svelte';
+
+	interface Props {
+		IconComponent: Component;
+	}
+
+	let { IconComponent }: Props = $props();
+</script>
+
+<div class="flex flex-1 items-center justify-center p-8">
+	<div class="text-center">
+		<IconComponent class="mx-auto mb-4 h-16 w-16 text-white/50" />
+
+		<p class="text-white/70">Preview not available for this file type</p>
+	</div>
+</div>
--- a/examples/server/webui_llamacpp/src/lib/components/app/chat/ChatAttachments/ChatAttachmentsPreview/ChatAttachmentsPreviewCurrentItem/ChatAttachmentsPreviewCurrentItemVideo.svelte
+++ b/examples/server/webui_llamacpp/src/lib/components/app/chat/ChatAttachments/ChatAttachmentsPreview/ChatAttachmentsPreviewCurrentItem/ChatAttachmentsPreviewCurrentItemVideo.svelte
@ -0,0 +1,27 @@
+<script lang="ts">
+	import { Video } from '@lucide/svelte';
+
+	interface Props {
+		currentItem: { name?: string } | null;
+		videoSrc: string | null;
+	}
+
+	let { currentItem, videoSrc }: Props = $props();
+</script>
+
+<div class="flex flex-1 items-center justify-center p-8">
+	<div class="w-full max-w-md text-center">
+		<Video class="mx-auto mb-4 h-16 w-16 text-white/50" />
+
+		{#if videoSrc}
+			<video controls class="mb-4 w-full" src={videoSrc}>
+				<track kind="captions" src="" />
+				Your browser does not support the video element.
+			</video>
+		{:else}
+			<p class="mb-4 text-white/70">Video preview not available</p>
+		{/if}
+
+		<p class="text-sm text-white/50">{currentItem?.name || 'Video'}</p>
+	</div>
+</div>
--- a/examples/server/webui_llamacpp/src/lib/components/app/chat/ChatAttachments/ChatAttachmentsPreview/ChatAttachmentsPreviewFileInfo.svelte
+++ b/examples/server/webui_llamacpp/src/lib/components/app/chat/ChatAttachments/ChatAttachmentsPreview/ChatAttachmentsPreviewFileInfo.svelte
@ -0,0 +1,16 @@
+<script lang="ts">
+	interface Props {
+		displayName: string;
+		fileSize: string;
+	}
+
+	let { displayName, fileSize }: Props = $props();
+</script>
+
+<div class="sticky top-0 z-[20] mb-4 rounded-lg bg-black/5 px-4 py-2 text-center backdrop-blur-md">
+	<p class="font-medium text-white">{displayName}</p>
+
+	{#if fileSize}
+		<p class="text-xs text-white/60">{fileSize}</p>
+	{/if}
+</div>
--- a/examples/server/webui_llamacpp/src/lib/components/app/chat/ChatAttachments/ChatAttachmentsPreview/ChatAttachmentsPreviewNavButtons.svelte
+++ b/examples/server/webui_llamacpp/src/lib/components/app/chat/ChatAttachments/ChatAttachmentsPreview/ChatAttachmentsPreviewNavButtons.svelte
@ -0,0 +1,34 @@
+<script lang="ts">
+	import { ChevronLeft, ChevronRight } from '@lucide/svelte';
+	import { Button } from '$lib/components/ui/button';
+
+	interface Props {
+		onPrev: () => void;
+		onNext: () => void;
+		show: boolean;
+	}
+
+	let { onPrev, onNext, show }: Props = $props();
+</script>
+
+{#if show}
+	<Button
+		variant="secondary"
+		size="icon"
+		class="absolute top-1/2 left-4 z-10 h-8 w-8 -translate-y-1/2 rounded-full bg-background/5 p-0 text-white!"
+		onclick={onPrev}
+		aria-label="Previous"
+	>
+		<ChevronLeft class="size-4" />
+	</Button>
+
+	<Button
+		variant="secondary"
+		size="icon"
+		class="absolute top-1/2 right-4 z-10 h-8 w-8 -translate-y-1/2 rounded-full bg-background/5 p-0 text-white!"
+		onclick={onNext}
+		aria-label="Next"
+	>
+		<ChevronRight class="size-4" />
+	</Button>
+{/if}
--- a/examples/server/webui_llamacpp/src/lib/components/app/chat/ChatAttachments/ChatAttachmentsPreview/ChatAttachmentsPreviewThumbnailStrip.svelte
+++ b/examples/server/webui_llamacpp/src/lib/components/app/chat/ChatAttachments/ChatAttachmentsPreview/ChatAttachmentsPreviewThumbnailStrip.svelte
@ -0,0 +1,66 @@
+<script lang="ts">
+	import { Music, Video, FileText } from '@lucide/svelte';
+	import { HorizontalScrollCarousel } from '$lib/components/app/misc';
+
+	interface PreviewItem {
+		id: string;
+		name: string;
+		isImage: boolean;
+		isAudio: boolean;
+		isVideo: boolean;
+		preview?: string;
+	}
+
+	interface Props {
+		items: PreviewItem[];
+		currentIndex: number;
+		onNavigate: (index: number) => void;
+	}
+
+	let { items, currentIndex, onNavigate }: Props = $props();
+
+	function getFileExtension(name: string): string {
+		const parts = name.split('.');
+		if (parts.length > 1) {
+			return parts.pop()?.toUpperCase() ?? '';
+		}
+		return '';
+	}
+</script>
+
+{#if items.length > 1}
+	<div class="sticky bottom-0 z-10 mt-4 flex-shrink-0">
+		<HorizontalScrollCarousel class="max-w-full">
+			{#each items as item, index (item.id)}
+				<button
+					data-thumbnail-index={index}
+					class={[
+						'relative flex-shrink-0 cursor-pointer overflow-hidden rounded border-2 bg-black/80 backdrop-blur-sm transition-all hover:opacity-90',
+						index === currentIndex ? 'border-white' : 'border-transparent opacity-60',
+						'[&:not(:first-child)]:last:mr-4 [&:not(:last-child)]:first:ml-4'
+					]}
+					onclick={() => onNavigate(index)}
+					aria-label={`Go to ${item.name}`}
+				>
+					{#if item.isImage && item.preview}
+						<img src={item.preview} alt={item.name} class="h-12 w-12 object-cover" />
+					{:else}
+						<div
+							class="bg-foreground-muted/50 flex h-12 w-12 flex-col items-center justify-center gap-0.5 py-1"
+						>
+							{#if item.isAudio}
+								<Music class="h-4 w-4 text-white/70" />
+							{:else if item.isVideo}
+								<Video class="h-4 w-4 text-white/70" />
+							{:else}
+								<FileText class="h-4 w-4 text-white/70" />
+							{/if}
+
+							<span class="font-mono text-[9px] text-white/60">{getFileExtension(item.name)}</span>
+						</div>
+					{/if}
+				</button>
+			{/each}
+		</HorizontalScrollCarousel>
+	</div>
+{/if}
--- a/examples/server/webui_llamacpp/src/lib/components/app/chat/ChatAttachments/ChatAttachmentsViewAll.svelte
+++ b/examples/server/webui_llamacpp/src/lib/components/app/chat/ChatAttachments/ChatAttachmentsViewAll.svelte
@ -1,190 +0,0 @@
-<script lang="ts">
-	import {
-		ChatAttachmentThumbnailImage,
-		ChatAttachmentThumbnailFile,
-		DialogChatAttachmentPreview
-	} from '$lib/components/app';
-	import { FileTypeCategory } from '$lib/enums/files';
-	import { getFileTypeCategory } from '$lib/utils/file-type';
-	import type { ChatAttachmentDisplayItem, ChatAttachmentPreviewItem } from '$lib/types/chat';
-
-	interface Props {
-		uploadedFiles?: ChatUploadedFile[];
-		attachments?: DatabaseMessageExtra[];
-		readonly?: boolean;
-		onFileRemove?: (fileId: string) => void;
-		imageHeight?: string;
-		imageWidth?: string;
-		imageClass?: string;
-	}
-
-	let {
-		uploadedFiles = [],
-		attachments = [],
-		readonly = false,
-		onFileRemove,
-		imageHeight = 'h-24',
-		imageWidth = 'w-auto',
-		imageClass = ''
-	}: Props = $props();
-
-	let previewDialogOpen = $state(false);
-	let previewItem = $state<ChatAttachmentPreviewItem | null>(null);
-
-	let displayItems = $derived(getDisplayItems());
-	let imageItems = $derived(displayItems.filter((item) => item.isImage));
-	let fileItems = $derived(displayItems.filter((item) => !item.isImage));
-
-	function getDisplayItems(): ChatAttachmentDisplayItem[] {
-		const items: ChatAttachmentDisplayItem[] = [];
-
-		for (const file of uploadedFiles) {
-			items.push({
-				id: file.id,
-				name: file.name,
-				size: file.size,
-				preview: file.preview,
-				type: file.type,
-				isImage: getFileTypeCategory(file.type) === FileTypeCategory.IMAGE,
-				uploadedFile: file,
-				textContent: file.textContent
-			});
-		}
-
-		for (const [index, attachment] of attachments.entries()) {
-			if (attachment.type === 'imageFile') {
-				items.push({
-					id: `attachment-${index}`,
-					name: attachment.name,
-					preview: attachment.base64Url,
-					type: 'image',
-					isImage: true,
-					attachment,
-					attachmentIndex: index
-				});
-			} else if (attachment.type === 'textFile') {
-				items.push({
-					id: `attachment-${index}`,
-					name: attachment.name,
-					type: 'text',
-					isImage: false,
-					attachment,
-					attachmentIndex: index,
-					textContent: attachment.content
-				});
-			} else if (attachment.type === 'context') {
-				// Legacy format from old webui - treat as text file
-				items.push({
-					id: `attachment-${index}`,
-					name: attachment.name,
-					type: 'text',
-					isImage: false,
-					attachment,
-					attachmentIndex: index,
-					textContent: attachment.content
-				});
-			} else if (attachment.type === 'audioFile') {
-				items.push({
-					id: `attachment-${index}`,
-					name: attachment.name,
-					type: attachment.mimeType || 'audio',
-					isImage: false,
-					attachment,
-					attachmentIndex: index
-				});
-			} else if (attachment.type === 'pdfFile') {
-				items.push({
-					id: `attachment-${index}`,
-					name: attachment.name,
-					type: 'application/pdf',
-					isImage: false,
-					attachment,
-					attachmentIndex: index,
-					textContent: attachment.content
-				});
-			}
-		}
-
-		return items.reverse();
-	}
-
-	function openPreview(item: (typeof displayItems)[0], event?: Event) {
-		if (event) {
-			event.preventDefault();
-			event.stopPropagation();
-		}
-
-		previewItem = {
-			uploadedFile: item.uploadedFile,
-			attachment: item.attachment,
-			preview: item.preview,
-			name: item.name,
-			type: item.type,
-			size: item.size,
-			textContent: item.textContent
-		};
-		previewDialogOpen = true;
-	}
-</script>
-
-<div class="space-y-4">
-	<div class="min-h-0 flex-1 space-y-6 overflow-y-auto px-1">
-		{#if fileItems.length > 0}
-			<div>
-				<h3 class="mb-3 text-sm font-medium text-foreground">Files ({fileItems.length})</h3>
-				<div class="flex flex-wrap items-start gap-3">
-					{#each fileItems as item (item.id)}
-						<ChatAttachmentThumbnailFile
-							class="cursor-pointer"
-							id={item.id}
-							name={item.name}
-							type={item.type}
-							size={item.size}
-							{readonly}
-							onRemove={onFileRemove}
-							textContent={item.textContent}
-							onClick={(event) => openPreview(item, event)}
-						/>
-					{/each}
-				</div>
-			</div>
-		{/if}
-
-		{#if imageItems.length > 0}
-			<div>
-				<h3 class="mb-3 text-sm font-medium text-foreground">Images ({imageItems.length})</h3>
-				<div class="flex flex-wrap items-start gap-3">
-					{#each imageItems as item (item.id)}
-						{#if item.preview}
-							<ChatAttachmentThumbnailImage
-								class="cursor-pointer"
-								id={item.id}
-								name={item.name}
-								preview={item.preview}
-								{readonly}
-								onRemove={onFileRemove}
-								height={imageHeight}
-								width={imageWidth}
-								{imageClass}
-								onClick={(event) => openPreview(item, event)}
-							/>
-						{/if}
-					{/each}
-				</div>
-			</div>
-		{/if}
-	</div>
-</div>
-
-{#if previewItem}
-	<DialogChatAttachmentPreview
-		bind:open={previewDialogOpen}
-		uploadedFile={previewItem.uploadedFile}
-		attachment={previewItem.attachment}
-		preview={previewItem.preview}
-		name={previewItem.name}
-		type={previewItem.type}
-		size={previewItem.size}
-		textContent={previewItem.textContent}
-	/>
-{/if}
--- a/examples/server/webui_llamacpp/src/lib/components/app/chat/ChatForm/ChatForm.svelte
+++ b/examples/server/webui_llamacpp/src/lib/components/app/chat/ChatForm/ChatForm.svelte
@ -1,121 +1,251 @@
 <script lang="ts">
-	import { afterNavigate } from '$app/navigation';
 	import {
 		ChatAttachmentsList,
 		ChatFormActions,
 		ChatFormFileInputInvisible,
-		ChatFormHelperText,
-		ChatFormTextarea
+		ChatFormMcpResourcesList,
+		ChatFormPickers,
+		ChatFormTextarea,
+		DialogMcpResourcesBrowser
 	} from '$lib/components/app';
-	import { INPUT_CLASSES } from '$lib/constants/input-classes';
+	import {
+		CLIPBOARD_CONTENT_QUOTE_PREFIX,
+		INPUT_CLASSES,
+		SETTING_CONFIG_DEFAULT,
+		INITIAL_FILE_SIZE,
+		PROMPT_CONTENT_SEPARATOR,
+		PROMPT_TRIGGER_PREFIX,
+		RESOURCE_TRIGGER_PREFIX
+	} from '$lib/constants';
+	import {
+		ContentPartType,
+		FileExtensionText,
+		KeyboardKey,
+		MimeTypeText,
+		SpecialFileType
+	} from '$lib/enums';
 	import { config } from '$lib/stores/settings.svelte';
-	import { FileTypeCategory, MimeTypeApplication } from '$lib/enums/files';
+	import { modelOptions, selectedModelId } from '$lib/stores/models.svelte';
+	import { isRouterMode } from '$lib/stores/server.svelte';
+	import { chatStore } from '$lib/stores/chat.svelte';
+	import { mcpStore } from '$lib/stores/mcp.svelte';
+	import { mcpHasResourceAttachments } from '$lib/stores/mcp-resources.svelte';
+	import { conversationsStore, activeMessages } from '$lib/stores/conversations.svelte';
+	import type { GetPromptResult, MCPPromptInfo, MCPResourceInfo, PromptMessage } from '$lib/types';
+	import { isIMEComposing, parseClipboardContent, uuid } from '$lib/utils';
 	import {
 		AudioRecorder,
 		convertToWav,
 		createAudioFile,
 		isAudioRecordingSupported
-	} from '$lib/utils/audio-recording';
+	} from '$lib/utils/browser-only';
 	import { onMount } from 'svelte';
-	import {
-		FileExtensionAudio,
-		FileExtensionImage,
-		FileExtensionPdf,
-		FileExtensionText,
-		MimeTypeAudio,
-		MimeTypeImage,
-		MimeTypeText
-	} from '$lib/enums/files';
-	import { isIMEComposing } from '$lib/utils/is-ime-composing';

 	interface Props {
+		// Data
+		attachments?: DatabaseMessageExtra[];
+		uploadedFiles?: ChatUploadedFile[];
+		value?: string;
+
+		// UI State
 		class?: string;
 		disabled?: boolean;
 		isLoading?: boolean;
-		onFileRemove?: (fileId: string) => void;
-		onFileUpload?: (files: File[]) => void;
-		onSend?: (message: string, files?: ChatUploadedFile[]) => Promise<boolean>;
+		placeholder?: string;
+		showMcpPromptButton?: boolean;
+		showAddButton?: boolean;
+		showModelSelector?: boolean;
+
+		// Event Handlers
+		onAttachmentRemove?: (index: number) => void;
+		onFilesAdd?: (files: File[]) => void;
 		onStop?: () => void;
-		showHelperText?: boolean;
-		uploadedFiles?: ChatUploadedFile[];
+		onSubmit?: () => void;
+		onSystemPromptClick?: (draft: { message: string; files: ChatUploadedFile[] }) => void;
+		onUploadedFileRemove?: (fileId: string) => void;
+		onUploadedFilesChange?: (files: ChatUploadedFile[]) => void;
+		onValueChange?: (value: string) => void;
 	}

 	let {
-		class: className,
+		attachments = [],
+		class: className = '',
 		disabled = false,
 		isLoading = false,
-		onFileRemove,
-		onFileUpload,
-		onSend,
+		placeholder = 'Type a message...',
+		showMcpPromptButton = false,
+		showAddButton = true,
+		showModelSelector = true,
+		uploadedFiles = $bindable([]),
+		value = $bindable(''),
+		onAttachmentRemove,
+		onFilesAdd,
 		onStop,
-		showHelperText = true,
-		uploadedFiles = $bindable([])
+		onSubmit,
+		onSystemPromptClick,
+		onUploadedFileRemove,
+		onUploadedFilesChange,
+		onValueChange
 	}: Props = $props();

+	// Component References
 	let audioRecorder: AudioRecorder | undefined;
-	let currentConfig = $derived(config());
-	let fileAcceptString = $state<string | undefined>(undefined);
+	let chatFormActionsRef: ChatFormActions | undefined = $state(undefined);
 	let fileInputRef: ChatFormFileInputInvisible | undefined = $state(undefined);
-	let isRecording = $state(false);
-	let message = $state('');
-	let pasteLongTextToFileLength = $derived(Number(currentConfig.pasteLongTextToFileLen) || 2500);
-	let previousIsLoading = $state(isLoading);
-	let recordingSupported = $state(false);
+	let pickersRef: { handleKeydown: (event: KeyboardEvent) => boolean } | undefined =
+		$state(undefined);
 	let textareaRef: ChatFormTextarea | undefined = $state(undefined);

-	function getAcceptStringForFileType(fileType: FileTypeCategory): string {
-		switch (fileType) {
-			case FileTypeCategory.IMAGE:
-				return [...Object.values(FileExtensionImage), ...Object.values(MimeTypeImage)].join(',');
-			case FileTypeCategory.AUDIO:
-				return [...Object.values(FileExtensionAudio), ...Object.values(MimeTypeAudio)].join(',');
-			case FileTypeCategory.PDF:
-				return [...Object.values(FileExtensionPdf), ...Object.values(MimeTypeApplication)].join(
-					','
-				);
-			case FileTypeCategory.TEXT:
-				return [...Object.values(FileExtensionText), MimeTypeText.PLAIN].join(',');
-			default:
-				return '';
+	// Audio Recording State
+	let isRecording = $state(false);
+	let recordingSupported = $state(false);
+
+	// Picker State
+	let isPromptPickerOpen = $state(false);
+	let promptSearchQuery = $state('');
+	let isInlineResourcePickerOpen = $state(false);
+	let resourceSearchQuery = $state('');
+
+	// Resource Dialog State
+	let isResourceDialogOpen = $state(false);
+	let preSelectedResourceUri = $state<string | undefined>(undefined);
+
+	let currentConfig = $derived(config());
+
+	let pasteLongTextToFileLength = $derived.by(() => {
+		const n = Number(currentConfig.pasteLongTextToFileLen);
+		return Number.isNaN(n) ? Number(SETTING_CONFIG_DEFAULT.pasteLongTextToFileLen) : n;
+	});
+
+	let isRouter = $derived(isRouterMode());
+	let conversationModel = $derived(
+		chatStore.getConversationModel(activeMessages() as DatabaseMessage[])
+	);
+	let activeModelId = $derived.by(() => {
+		const options = modelOptions();
+
+		if (!isRouter) {
+			return options.length > 0 ? options[0].model : null;
 		}
+
+		const selectedId = selectedModelId();
+		if (selectedId) {
+			const model = options.find((m) => m.id === selectedId);
+			if (model) return model.model;
+		}
+
+		if (conversationModel) {
+			const model = options.find((m) => m.model === conversationModel);
+			if (model) return model.model;
+		}
+
+		return null;
+	});
+
+	let hasModelSelected = $derived(!isRouter || !!conversationModel || !!selectedModelId());
+	let hasLoadingAttachments = $derived(uploadedFiles.some((f) => f.isLoading));
+	let hasAttachments = $derived(
+		(attachments && attachments.length > 0) || (uploadedFiles && uploadedFiles.length > 0)
+	);
+	let canSubmit = $derived(value.trim().length > 0 || hasAttachments);
+
+	onMount(() => {
+		recordingSupported = isAudioRecordingSupported();
+		audioRecorder = new AudioRecorder();
+	});
+
+	export function focus() {
+		textareaRef?.focus();
+	}
+
+	export function resetTextareaHeight() {
+		textareaRef?.resetHeight();
+	}
+
+	export function openModelSelector() {
+		chatFormActionsRef?.openModelSelector();
+	}
+
+	export function checkModelSelected(): boolean {
+		if (!hasModelSelected) {
+			chatFormActionsRef?.openModelSelector();
+			return false;
+		}
+		return true;
 	}

 	function handleFileSelect(files: File[]) {
-		onFileUpload?.(files);
+		onFilesAdd?.(files);
 	}

-	function handleFileUpload(fileType?: FileTypeCategory) {
-		if (fileType) {
-			fileAcceptString = getAcceptStringForFileType(fileType);
+	function handleFileUpload() {
+		fileInputRef?.click();
+	}
+
+	function handleFileRemove(fileId: string) {
+		if (fileId.startsWith('attachment-')) {
+			const index = parseInt(fileId.replace('attachment-', ''), 10);
+			if (!isNaN(index) && index >= 0 && index < attachments.length) {
+				onAttachmentRemove?.(index);
+			}
 		} else {
-			fileAcceptString = undefined;
+			onUploadedFileRemove?.(fileId);
+		}
+	}
+
+	function handleInput() {
+		const perChatOverrides = conversationsStore.getAllMcpServerOverrides();
+		const hasServers = mcpStore.hasEnabledServers(perChatOverrides);
+
+		if (value.startsWith(PROMPT_TRIGGER_PREFIX) && hasServers) {
+			isPromptPickerOpen = true;
+			promptSearchQuery = value.slice(1);
+			isInlineResourcePickerOpen = false;
+			resourceSearchQuery = '';
+		} else if (
+			value.startsWith(RESOURCE_TRIGGER_PREFIX) &&
+			hasServers &&
+			mcpStore.hasResourcesCapability(perChatOverrides)
+		) {
+			isInlineResourcePickerOpen = true;
+			resourceSearchQuery = value.slice(1);
+			isPromptPickerOpen = false;
+			promptSearchQuery = '';
+		} else {
+			isPromptPickerOpen = false;
+			promptSearchQuery = '';
+			isInlineResourcePickerOpen = false;
+			resourceSearchQuery = '';
+		}
+	}
+
+	function handleKeydown(event: KeyboardEvent) {
+		if (pickersRef?.handleKeydown(event)) {
+			return;
 		}

-		// Use setTimeout to ensure the accept attribute is applied before opening dialog
-		setTimeout(() => {
-			fileInputRef?.click();
-		}, 10);
-	}
+		if (event.key === KeyboardKey.ESCAPE && isPromptPickerOpen) {
+			isPromptPickerOpen = false;
+			promptSearchQuery = '';
+			return;
+		}

-	async function handleKeydown(event: KeyboardEvent) {
-		if (event.key === 'Enter' && !event.shiftKey && !isIMEComposing(event)) {
-			event.preventDefault();
+		if (event.key === KeyboardKey.ESCAPE && isInlineResourcePickerOpen) {
+			isInlineResourcePickerOpen = false;
+			resourceSearchQuery = '';
+			return;
+		}

-			if ((!message.trim() && uploadedFiles.length === 0) || disabled || isLoading) return;
+		if (event.key === KeyboardKey.ENTER && !event.shiftKey && !isIMEComposing(event)) {
+			const isModifier = event.ctrlKey || event.metaKey;
+			const sendOnEnter = currentConfig.sendOnEnter !== false;

-			const messageToSend = message.trim();
-			const filesToSend = [...uploadedFiles];
+			if (sendOnEnter || isModifier) {
+				event.preventDefault();

-			message = '';
-			uploadedFiles = [];
+				if (!canSubmit || disabled || hasLoadingAttachments) return;

-			textareaRef?.resetHeight();
-
-			const success = await onSend?.(messageToSend, filesToSend);
-
-			if (!success) {
-				message = messageToSend;
-				uploadedFiles = filesToSend;
+				onSubmit?.();
 			}
 		}
 	}
@ -130,12 +260,62 @@

 		if (files.length > 0) {
 			event.preventDefault();
-			onFileUpload?.(files);
+			onFilesAdd?.(files);
 			return;
 		}

 		const text = event.clipboardData.getData(MimeTypeText.PLAIN);

+		if (text.startsWith(CLIPBOARD_CONTENT_QUOTE_PREFIX)) {
+			const parsed = parseClipboardContent(text);
+
+			if (parsed.textAttachments.length > 0 || parsed.mcpPromptAttachments.length > 0) {
+				event.preventDefault();
+				value = parsed.message;
+				onValueChange?.(parsed.message);
+
+				// Handle text attachments as files
+				if (parsed.textAttachments.length > 0) {
+					const attachmentFiles = parsed.textAttachments.map(
+						(att) =>
+							new File([att.content], att.name, {
+								type: MimeTypeText.PLAIN
+							})
+					);
+					onFilesAdd?.(attachmentFiles);
+				}
+
+				// Handle MCP prompt attachments as ChatUploadedFile with mcpPrompt data
+				if (parsed.mcpPromptAttachments.length > 0) {
+					const mcpPromptFiles: ChatUploadedFile[] = parsed.mcpPromptAttachments.map((att) => ({
+						id: uuid(),
+						name: att.name,
+						size: att.content.length,
+						type: SpecialFileType.MCP_PROMPT,
+						file: new File([att.content], `${att.name}${FileExtensionText.TXT}`, {
+							type: MimeTypeText.PLAIN
+						}),
+						isLoading: false,
+						textContent: att.content,
+						mcpPrompt: {
+							serverName: att.serverName,
+							promptName: att.promptName,
+							arguments: att.arguments
+						}
+					}));
+
+					uploadedFiles = [...uploadedFiles, ...mcpPromptFiles];
+					onUploadedFilesChange?.(uploadedFiles);
+				}
+
+				setTimeout(() => {
+					textareaRef?.focus();
+				}, 10);
+
+				return;
+			}
+		}
+
 		if (
 			text.length > 0 &&
 			pasteLongTextToFileLength > 0 &&
@ -147,10 +327,117 @@
 				type: MimeTypeText.PLAIN
 			});

-			onFileUpload?.([textFile]);
+			onFilesAdd?.([textFile]);
 		}
 	}

+	function handlePromptLoadStart(
+		placeholderId: string,
+		promptInfo: MCPPromptInfo,
+		args?: Record<string, string>
+	) {
+		// Only clear the value if the prompt was triggered by typing '/'
+		if (value.startsWith(PROMPT_TRIGGER_PREFIX)) {
+			value = '';
+			onValueChange?.('');
+		}
+		isPromptPickerOpen = false;
+		promptSearchQuery = '';
+
+		const promptName = promptInfo.title || promptInfo.name;
+		const placeholder: ChatUploadedFile = {
+			id: placeholderId,
+			name: promptName,
+			size: INITIAL_FILE_SIZE,
+			type: SpecialFileType.MCP_PROMPT,
+			file: new File([], 'loading'),
+			isLoading: true,
+			mcpPrompt: {
+				serverName: promptInfo.serverName,
+				promptName: promptInfo.name,
+				arguments: args ? { ...args } : undefined
+			}
+		};
+
+		uploadedFiles = [...uploadedFiles, placeholder];
+		onUploadedFilesChange?.(uploadedFiles);
+		textareaRef?.focus();
+	}
+
+	function handlePromptLoadComplete(placeholderId: string, result: GetPromptResult) {
+		const promptText = (result.messages ?? [])
+			.map((msg: PromptMessage) => {
+				if (typeof msg.content === 'string') {
+					return msg.content;
+				}
+
+				if (msg.content.type === ContentPartType.TEXT) {
+					return msg.content.text;
+				}
+
+				return '';
+			})
+			.filter(Boolean)
+			.join(PROMPT_CONTENT_SEPARATOR);
+
+		uploadedFiles = uploadedFiles.map((f) =>
+			f.id === placeholderId
+				? {
+						...f,
+						isLoading: false,
+						textContent: promptText,
+						size: promptText.length,
+						file: new File([promptText], `${f.name}${FileExtensionText.TXT}`, {
+							type: MimeTypeText.PLAIN
+						})
+					}
+				: f
+		);
+		onUploadedFilesChange?.(uploadedFiles);
+	}
+
+	function handlePromptLoadError(placeholderId: string, error: string) {
+		uploadedFiles = uploadedFiles.map((f) =>
+			f.id === placeholderId ? { ...f, isLoading: false, loadError: error } : f
+		);
+		onUploadedFilesChange?.(uploadedFiles);
+	}
+
+	function handlePromptPickerClose() {
+		isPromptPickerOpen = false;
+		promptSearchQuery = '';
+		textareaRef?.focus();
+	}
+
+	function handleInlineResourcePickerClose() {
+		isInlineResourcePickerOpen = false;
+		resourceSearchQuery = '';
+		textareaRef?.focus();
+	}
+
+	function handleInlineResourceSelect() {
+		if (value.startsWith(RESOURCE_TRIGGER_PREFIX)) {
+			value = '';
+			onValueChange?.('');
+		}
+
+		isInlineResourcePickerOpen = false;
+		resourceSearchQuery = '';
+		textareaRef?.focus();
+	}
+
+	function handleBrowseResources() {
+		isInlineResourcePickerOpen = false;
+		resourceSearchQuery = '';
+
+		if (value.startsWith(RESOURCE_TRIGGER_PREFIX)) {
+			value = '';
+			onValueChange?.('');
+		}
+
+		isResourceDialogOpen = true;
+	}
+
 	async function handleMicClick() {
 		if (!audioRecorder || !recordingSupported) {
 			console.warn('Audio recording not supported');
@ -158,16 +445,15 @@
 		}

 		if (isRecording) {
+			isRecording = false;
 			try {
 				const audioBlob = await audioRecorder.stopRecording();
 				const wavBlob = await convertToWav(audioBlob);
 				const audioFile = createAudioFile(wavBlob);

-				onFileUpload?.([audioFile]);
-				isRecording = false;
+				onFilesAdd?.([audioFile]);
 			} catch (error) {
 				console.error('Failed to stop recording:', error);
-				isRecording = false;
 			}
 		} else {
 			try {
@ -178,89 +464,109 @@
 			}
 		}
 	}
-
-	function handleStop() {
-		onStop?.();
-	}
-
-	async function handleSubmit(event: SubmitEvent) {
-		event.preventDefault();
-		if ((!message.trim() && uploadedFiles.length === 0) || disabled || isLoading) return;
-
-		const messageToSend = message.trim();
-		const filesToSend = [...uploadedFiles];
-
-		message = '';
-		uploadedFiles = [];
-
-		textareaRef?.resetHeight();
-
-		const success = await onSend?.(messageToSend, filesToSend);
-
-		if (!success) {
-			message = messageToSend;
-			uploadedFiles = filesToSend;
-		}
-	}
-
-	onMount(() => {
-		setTimeout(() => textareaRef?.focus(), 10);
-		recordingSupported = isAudioRecordingSupported();
-		audioRecorder = new AudioRecorder();
-	});
-
-	afterNavigate(() => {
-		setTimeout(() => textareaRef?.focus(), 10);
-	});
-
-	$effect(() => {
-		if (previousIsLoading && !isLoading) {
-			setTimeout(() => textareaRef?.focus(), 10);
-		}
-
-		previousIsLoading = isLoading;
-	});
 </script>

-<ChatFormFileInputInvisible
-	bind:this={fileInputRef}
-	bind:accept={fileAcceptString}
-	onFileSelect={handleFileSelect}
-/>
+<ChatFormFileInputInvisible bind:this={fileInputRef} onFileSelect={handleFileSelect} />

 <form
-	onsubmit={handleSubmit}
-	class="{INPUT_CLASSES} border-radius-bottom-none mx-auto max-w-[48rem] overflow-hidden rounded-3xl backdrop-blur-md {className}"
+	class="relative {className}"
+	onsubmit={(event) => {
+		event.preventDefault();
+
+		if (!canSubmit || disabled || hasLoadingAttachments) return;
+
+		onSubmit?.();
+	}}
 >
-	<ChatAttachmentsList
-		bind:uploadedFiles
-		{onFileRemove}
-		limitToSingleRow
-		class="py-5"
-		style="scroll-padding: 1rem;"
+	<ChatFormPickers
+		bind:this={pickersRef}
+		{isPromptPickerOpen}
+		{promptSearchQuery}
+		{isInlineResourcePickerOpen}
+		{resourceSearchQuery}
+		onPromptPickerClose={handlePromptPickerClose}
+		onInlineResourcePickerClose={handleInlineResourcePickerClose}
+		onInlineResourceSelect={handleInlineResourceSelect}
+		onPromptLoadStart={handlePromptLoadStart}
+		onPromptLoadComplete={handlePromptLoadComplete}
+		onPromptLoadError={handlePromptLoadError}
+		onInlineResourceBrowse={handleBrowseResources}
 	/>

 	<div
-		class="flex-column relative min-h-[48px] items-center rounded-3xl px-5 py-3 shadow-sm transition-all focus-within:shadow-md"
-		onpaste={handlePaste}
+		class="{INPUT_CLASSES} overflow-hidden rounded-3xl backdrop-blur-md {disabled
+			? 'cursor-not-allowed opacity-60'
+			: ''}"
+		data-slot="input-area"
 	>
-		<ChatFormTextarea
-			bind:this={textareaRef}
-			bind:value={message}
-			onKeydown={handleKeydown}
-			{disabled}
+		<ChatAttachmentsList
+			{attachments}
+			bind:uploadedFiles
+			onFileRemove={handleFileRemove}
+			limitToSingleRow
+			class="py-5"
+			style="scroll-padding: 1rem;"
+			activeModelId={activeModelId ?? undefined}
 		/>

-		<ChatFormActions
-			canSend={message.trim().length > 0 || uploadedFiles.length > 0}
-			{disabled}
-			{isLoading}
-			{isRecording}
-			onFileUpload={handleFileUpload}
-			onMicClick={handleMicClick}
-			onStop={handleStop}
-		/>
+		<div
+			class="flex-column relative min-h-[48px] items-center rounded-3xl py-2 pb-2.25 shadow-sm transition-all focus-within:shadow-md md:!py-3"
+			onpaste={handlePaste}
+		>
+			<ChatFormTextarea
+				class="px-5 py-1.5 md:pt-0"
+				bind:this={textareaRef}
+				bind:value
+				onKeydown={handleKeydown}
+				onInput={() => {
+					handleInput();
+					onValueChange?.(value);
+				}}
+				{disabled}
+				{placeholder}
+			/>
+
+			{#if mcpHasResourceAttachments()}
+				<ChatFormMcpResourcesList
+					class="mb-3"
+					onResourceClick={(uri) => {
+						preSelectedResourceUri = uri;
+						isResourceDialogOpen = true;
+					}}
+				/>
+			{/if}
+
+			<ChatFormActions
+				class="px-3"
+				bind:this={chatFormActionsRef}
+				canSend={canSubmit}
+				{disabled}
+				{isLoading}
+				isReasoning={chatStore.isReasoning}
+				{isRecording}
+				{showAddButton}
+				{showModelSelector}
+				{uploadedFiles}
+				onFileUpload={handleFileUpload}
+				onMicClick={handleMicClick}
+				{onStop}
+				onSystemPromptClick={() => onSystemPromptClick?.({ message: value, files: uploadedFiles })}
+				onMcpPromptClick={showMcpPromptButton ? () => (isPromptPickerOpen = true) : undefined}
+				onMcpResourcesClick={() => (isResourceDialogOpen = true)}
+			/>
+		</div>
 	</div>
 </form>

-<ChatFormHelperText show={showHelperText} />
+<DialogMcpResourcesBrowser
+	bind:open={isResourceDialogOpen}
+	preSelectedUri={preSelectedResourceUri}
+	onAttach={(resource: MCPResourceInfo) => {
+		mcpStore.attachResource(resource.uri);
+	}}
+	onOpenChange={(newOpen: boolean) => {
+		if (!newOpen) {
+			preSelectedResourceUri = undefined;
+		}
+	}}
+/>
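Review note: `handleInput` in the ChatForm diff above decides which inline picker is open from the first character of the textarea value. The same decision can be sketched as a pure function, which makes the three branches easier to check in isolation. This is a minimal sketch, not the component's code: `resolvePickerState`, `PickerState` and `PickerContext` are hypothetical names, `'/'` is taken from the comment in `handlePromptLoadStart`, and `'@'` for `RESOURCE_TRIGGER_PREFIX` is an assumption because the diff does not show that constant's value.

```typescript
// Hypothetical stand-ins for constants the component imports.
// '/' matches the comment in handlePromptLoadStart; '@' is an assumption.
const PROMPT_TRIGGER_PREFIX = '/';
const RESOURCE_TRIGGER_PREFIX = '@';

interface PickerState {
	isPromptPickerOpen: boolean;
	promptSearchQuery: string;
	isInlineResourcePickerOpen: boolean;
	resourceSearchQuery: string;
}

// Stand-in for the mcpStore checks made in handleInput.
interface PickerContext {
	hasServers: boolean;
	hasResourcesCapability: boolean;
}

const CLOSED: PickerState = {
	isPromptPickerOpen: false,
	promptSearchQuery: '',
	isInlineResourcePickerOpen: false,
	resourceSearchQuery: ''
};

// Mirrors the branch order of handleInput: the prompt picker is checked first,
// then the resource picker, otherwise both pickers are closed and cleared.
function resolvePickerState(value: string, ctx: PickerContext): PickerState {
	if (value.startsWith(PROMPT_TRIGGER_PREFIX) && ctx.hasServers) {
		return {
			...CLOSED,
			isPromptPickerOpen: true,
			promptSearchQuery: value.slice(PROMPT_TRIGGER_PREFIX.length)
		};
	}

	if (
		value.startsWith(RESOURCE_TRIGGER_PREFIX) &&
		ctx.hasServers &&
		ctx.hasResourcesCapability
	) {
		return {
			...CLOSED,
			isInlineResourcePickerOpen: true,
			resourceSearchQuery: value.slice(RESOURCE_TRIGGER_PREFIX.length)
		};
	}

	return { ...CLOSED };
}
```

The component slices with a literal `1`; the sketch slices by the prefix length, which is equivalent only while both prefixes are single characters.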
--- a/examples/server/webui_llamacpp/src/lib/components/app/chat/ChatForm/ChatFormActions/ChatFormActionAdd/ChatFormActionAddButton.svelte
+++ b/examples/server/webui_llamacpp/src/lib/components/app/chat/ChatForm/ChatFormActions/ChatFormActionAdd/ChatFormActionAddButton.svelte
@ -0,0 +1,33 @@
+<script lang="ts">
+	import { Plus } from '@lucide/svelte';
+	import { Button } from '$lib/components/ui/button';
+	import * as Tooltip from '$lib/components/ui/tooltip';
+	import { ATTACHMENT_TOOLTIP_TEXT } from '$lib/constants';
+
+	interface Props {
+		disabled?: boolean;
+		onclick?: (e: MouseEvent) => void;
+	}
+
+	let { disabled = false, onclick }: Props = $props();
+</script>
+
+<Tooltip.Root>
+	<Tooltip.Trigger class="w-full">
+		<Button
+			class="file-upload-button h-8 w-8 rounded-full p-0"
+			{disabled}
+			{onclick}
+			variant="secondary"
+			type="button"
+		>
+			<span class="sr-only">{ATTACHMENT_TOOLTIP_TEXT}</span>
+
+			<Plus class="h-4 w-4" />
+		</Button>
+	</Tooltip.Trigger>
+
+	<Tooltip.Content>
+		<p>{ATTACHMENT_TOOLTIP_TEXT}</p>
+	</Tooltip.Content>
+</Tooltip.Root>
--- a/examples/server/webui_llamacpp/src/lib/components/app/chat/ChatForm/ChatFormActions/ChatFormActionAdd/ChatFormActionAddDropdown.svelte
+++ b/examples/server/webui_llamacpp/src/lib/components/app/chat/ChatForm/ChatFormActions/ChatFormActionAdd/ChatFormActionAddDropdown.svelte
@ -0,0 +1,179 @@
+<script lang="ts">
+	import { Plus, File, MessageSquare, Zap, FolderOpen } from '@lucide/svelte';
+	import * as DropdownMenu from '$lib/components/ui/dropdown-menu';
+	import * as Tooltip from '$lib/components/ui/tooltip';
+	import { buttonVariants } from '$lib/components/ui/button';
+	import { cn } from '$lib/components/ui/utils';
+	import {
+		ATTACHMENT_FILE_ITEMS,
+		ATTACHMENT_TOOLTIP_TEXT,
+		TOOLTIP_DELAY_DURATION
+	} from '$lib/constants';
+	import {
+		ChatFormActionAddToolsSubmenu,
+		ChatFormActionAddMcpServersSubmenu
+	} from '$lib/components/app';
+	import { useAttachmentMenu } from '$lib/hooks/use-attachment-menu.svelte';
+
+	interface Props {
+		class?: string;
+		disabled?: boolean;
+		hasAudioModality?: boolean;
+		hasVideoModality?: boolean;
+		hasVisionModality?: boolean;
+		hasMcpPromptsSupport?: boolean;
+		hasMcpResourcesSupport?: boolean;
+		onFileUpload?: () => void;
+		onSystemPromptClick?: () => void;
+		onMcpPromptClick?: () => void;
+		onMcpSettingsClick?: () => void;
+		onMcpResourcesClick?: () => void;
+	}
+
+	let {
+		class: className = '',
+		disabled = false,
+		hasAudioModality = false,
+		hasVideoModality = false,
+		hasVisionModality = false,
+		hasMcpPromptsSupport = false,
+		hasMcpResourcesSupport = false,
+		onFileUpload,
+		onSystemPromptClick,
+		onMcpPromptClick,
+		onMcpSettingsClick,
+		onMcpResourcesClick
+	}: Props = $props();
+
+	let dropdownOpen = $state(false);
+
+	function handleMcpSettingsClick() {
+		dropdownOpen = false;
+		onMcpSettingsClick?.();
+	}
+
+	const attachmentMenu = useAttachmentMenu(
+		() => ({
+			hasVisionModality,
+			hasAudioModality,
+			hasVideoModality,
+			hasMcpPromptsSupport,
+			hasMcpResourcesSupport
+		}),
+		() => ({ onFileUpload, onSystemPromptClick, onMcpPromptClick, onMcpResourcesClick }),
+		() => {
+			dropdownOpen = false;
+		}
+	);
+</script>
+
+<div class="flex items-center gap-1 {className}">
+	<DropdownMenu.Root bind:open={dropdownOpen}>
+		<Tooltip.Root>
+			<Tooltip.Trigger>
+				{#snippet child({ props })}
+					<DropdownMenu.Trigger
+						{...props}
+						class={cn(
+							buttonVariants({ variant: 'secondary' }),
+							'file-upload-button h-8 w-8 cursor-pointer rounded-full p-0'
+						)}
+						{disabled}
+					>
+						<span class="sr-only">{ATTACHMENT_TOOLTIP_TEXT}</span>
+
+						<Plus class="h-4 w-4" />
+					</DropdownMenu.Trigger>
+				{/snippet}
+			</Tooltip.Trigger>
+
+			<Tooltip.Content>
+				<p>{ATTACHMENT_TOOLTIP_TEXT}</p>
+			</Tooltip.Content>
+		</Tooltip.Root>
+
+		<DropdownMenu.Content align="start" class="w-48">
+			<DropdownMenu.Sub>
+				<DropdownMenu.SubTrigger class="flex cursor-pointer items-center gap-2">
+					<File class="h-4 w-4" />
+
+					<span>Add files</span>
+				</DropdownMenu.SubTrigger>
+
+				<DropdownMenu.SubContent class="w-48">
+					{#each ATTACHMENT_FILE_ITEMS as item (item.id)}
+						{@const enabled = attachmentMenu.isItemEnabled(item.enabledWhen)}
+						{#if enabled}
+							<DropdownMenu.Item
+								class="{item.class ?? ''} flex cursor-pointer items-center gap-2"
+								onclick={() => attachmentMenu.callbacks[item.action]()}
+							>
+								<item.icon class="h-4 w-4" />
+
+								<span>{item.label}</span>
+							</DropdownMenu.Item>
+						{:else if item.disabledTooltip}
+							<Tooltip.Root delayDuration={TOOLTIP_DELAY_DURATION}>
+								<Tooltip.Trigger tabindex={-1}>
+									{#snippet child({ props })}
+										<div {...props} class="cursor-default">
+											<DropdownMenu.Item
+												class="{item.class ?? ''} flex items-center gap-2"
+												disabled
+											>
+												<item.icon class="h-4 w-4" />
+
+												<span>{item.label}</span>
+											</DropdownMenu.Item>
+										</div>
+									{/snippet}
+								</Tooltip.Trigger>
+
+								<Tooltip.Content side="right">
+									<p>{item.disabledTooltip}</p>
+								</Tooltip.Content>
+							</Tooltip.Root>
+						{/if}
+					{/each}
+				</DropdownMenu.SubContent>
+			</DropdownMenu.Sub>
+
+			<DropdownMenu.Item
+				class="flex cursor-pointer items-center gap-2"
+				onclick={onSystemPromptClick}
+			>
+				<MessageSquare class="h-4 w-4" />
+
+				<span>System Message</span>
+			</DropdownMenu.Item>
+
+			<ChatFormActionAddToolsSubmenu />
+
+			<ChatFormActionAddMcpServersSubmenu onMcpSettingsClick={handleMcpSettingsClick} />
+
+			{#if hasMcpPromptsSupport}
+				<DropdownMenu.Separator />
+
+				<DropdownMenu.Item
+					class="flex cursor-pointer items-center gap-2"
+					onclick={onMcpPromptClick}
+				>
+					<Zap class="h-4 w-4" />
+
+					<span>MCP Prompt</span>
+				</DropdownMenu.Item>
+			{/if}
+
+			{#if hasMcpResourcesSupport}
+				<DropdownMenu.Item
+					class="flex cursor-pointer items-center gap-2"
+					onclick={onMcpResourcesClick}
+				>
+					<FolderOpen class="h-4 w-4" />
+
+					<span>MCP Resources</span>
+				</DropdownMenu.Item>
+			{/if}
+		</DropdownMenu.Content>
+	</DropdownMenu.Root>
+</div>
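Review note: the dropdown above calls `useAttachmentMenu` with three functions (a capabilities getter, a callbacks getter, and a close handler) and then uses `attachmentMenu.isItemEnabled(item.enabledWhen)` and `attachmentMenu.callbacks[item.action]()`. The hook's source is not part of this section, so the following is only a guess at the shape such a helper might take. `createAttachmentMenu` is a hypothetical name, and treating `enabledWhen` as an optional capability key and closing the menu before running the handler are both assumptions.

```typescript
// Capability flags and handlers as passed by the dropdown in the diff.
interface MenuCapabilities {
	hasVisionModality: boolean;
	hasAudioModality: boolean;
	hasVideoModality: boolean;
	hasMcpPromptsSupport: boolean;
	hasMcpResourcesSupport: boolean;
}

interface MenuCallbacks {
	onFileUpload?: () => void;
	onSystemPromptClick?: () => void;
	onMcpPromptClick?: () => void;
	onMcpResourcesClick?: () => void;
}

type CapabilityKey = keyof MenuCapabilities;
type ActionKey = keyof MenuCallbacks;

// Hypothetical helper, NOT the real useAttachmentMenu implementation.
// Getters are taken instead of plain values so that each call reads the
// latest props rather than a snapshot captured at creation time.
function createAttachmentMenu(
	getCapabilities: () => MenuCapabilities,
	getCallbacks: () => MenuCallbacks,
	onClose: () => void
) {
	// Assumption: an item without enabledWhen is always enabled.
	function isItemEnabled(enabledWhen?: CapabilityKey): boolean {
		return enabledWhen === undefined || getCapabilities()[enabledWhen];
	}

	// Assumption: close the menu first, then run the handler if one was given.
	function run(action: ActionKey): void {
		onClose();
		getCallbacks()[action]?.();
	}

	const callbacks: Record<ActionKey, () => void> = {
		onFileUpload: () => run('onFileUpload'),
		onSystemPromptClick: () => run('onSystemPromptClick'),
		onMcpPromptClick: () => run('onMcpPromptClick'),
		onMcpResourcesClick: () => run('onMcpResourcesClick')
	};

	return { isItemEnabled, callbacks };
}
```

Wrapping every action this way would let the dropdown and the mobile sheet share one definition of "run the handler and dismiss the menu", which may be why both components pass only a close function as the third argument.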
--- a/examples/server/webui_llamacpp/src/lib/components/app/chat/ChatForm/ChatFormActions/ChatFormActionAdd/ChatFormActionAddMcpServersSubmenu.svelte
+++ b/examples/server/webui_llamacpp/src/lib/components/app/chat/ChatForm/ChatFormActions/ChatFormActionAdd/ChatFormActionAddMcpServersSubmenu.svelte
@ -0,0 +1,150 @@
+<script lang="ts">
+	import { Settings, Plus } from '@lucide/svelte';
+	import { Switch } from '$lib/components/ui/switch';
+	import * as DropdownMenu from '$lib/components/ui/dropdown-menu';
+	import { McpLogo, DropdownMenuSearchable, McpServerIdentity } from '$lib/components/app';
+	import { conversationsStore } from '$lib/stores/conversations.svelte';
+	import { mcpStore } from '$lib/stores/mcp.svelte';
+	import { HealthCheckStatus } from '$lib/enums';
+	import type { MCPServerSettingsEntry } from '$lib/types';
+	import { goto } from '$app/navigation';
+	import { ROUTES } from '$lib/constants/routes';
+
+	interface Props {
+		onMcpSettingsClick?: () => void;
+	}
+
+	let { onMcpSettingsClick }: Props = $props();
+
+	let mcpSearchQuery = $state('');
+	let allMcpServers = $derived(mcpStore.getServersSorted());
+	let mcpServers = $derived(allMcpServers.filter((s) => s.enabled));
+	let hasMcpServers = $derived(mcpServers.length > 0);
+	// let hasAnyMcpServers = $derived(allMcpServers.length > 0);
+	let filteredMcpServers = $derived.by(() => {
+		const query = mcpSearchQuery.toLowerCase().trim();
+		if (!query) return mcpServers;
+		return mcpServers.filter((s) => {
+			const name = getServerLabel(s).toLowerCase();
+			const url = s.url.toLowerCase();
+			return name.includes(query) || url.includes(query);
+		});
+	});
+
+	function getServerLabel(server: MCPServerSettingsEntry): string {
+		return mcpStore.getServerLabel(server);
+	}
+
+	function isServerEnabledForChat(serverId: string): boolean {
+		return conversationsStore.isMcpServerEnabledForChat(serverId);
+	}
+
+	async function toggleServerForChat(serverId: string) {
+		await conversationsStore.toggleMcpServerForChat(serverId);
+	}
+
+	function handleMcpSubMenuOpen(open: boolean) {
+		if (open) {
+			mcpSearchQuery = '';
+			mcpStore.runHealthChecksForServers(allMcpServers);
+		}
+	}
+
+	function handleMcpSettingsClick() {
+		onMcpSettingsClick?.();
+
+		goto(`${hasMcpServers ? '' : '?add'}${ROUTES.MCP_SERVERS}`);
+	}
+</script>
+
+<DropdownMenu.Root>
+	<DropdownMenu.Sub onOpenChange={handleMcpSubMenuOpen}>
+		<DropdownMenu.SubTrigger class="flex cursor-pointer items-center gap-2">
+			<McpLogo class="h-4 w-4" />
+
+			<span>MCP Servers</span>
+		</DropdownMenu.SubTrigger>
+
+		<DropdownMenu.SubContent class="w-72 pt-0">
+			{#if hasMcpServers}
+				<DropdownMenuSearchable
+					placeholder="Search servers..."
+					bind:searchValue={mcpSearchQuery}
+					emptyMessage="No servers found"
+					isEmpty={filteredMcpServers.length === 0}
+				>
+					<div class="max-h-64 overflow-y-auto">
+						{#each filteredMcpServers as server (server.id)}
+							{@const healthState = mcpStore.getHealthCheckState(server.id)}
+							{@const hasError = healthState.status === HealthCheckStatus.ERROR}
+							{@const isEnabledForChat = isServerEnabledForChat(server.id)}
+							{@const displayName = getServerLabel(server)}
+							{@const faviconUrl = mcpStore.getServerFavicon(server.id)}
+
+							<button
+								type="button"
+								class="flex w-full items-center justify-between gap-2 rounded-sm px-2 py-2 text-left transition-colors hover:bg-accent disabled:cursor-not-allowed disabled:opacity-50"
+								onclick={() => !hasError && toggleServerForChat(server.id)}
+								disabled={hasError}
+							>
+								<div class="flex min-w-0 flex-1 items-center gap-2">
+									<div class="min-w-0 flex-1">
+										<McpServerIdentity
+											{displayName}
+											{faviconUrl}
+											iconClass="h-4 w-4"
+											iconRounded="rounded-sm"
+											showVersion={false}
+											nameClass="text-sm"
+										/>
+									</div>
+
+									{#if hasError}
+										<span
+											class="shrink-0 rounded bg-destructive/15 px-1.5 py-0.5 text-xs text-destructive"
+										>
+											Error
+										</span>
+									{/if}
+								</div>
+
+								<Switch
+									checked={isEnabledForChat}
+									disabled={hasError}
+									onclick={(e) => e.stopPropagation()}
+									onCheckedChange={() => toggleServerForChat(server.id)}
+								/>
+							</button>
+						{/each}
+					</div>
+
+					{#snippet footer()}
+						<DropdownMenu.Item
+							class="flex cursor-pointer items-center gap-2"
+							onclick={handleMcpSettingsClick}
+						>
+							<Settings class="h-4 w-4" />
+
+							<span>Manage MCP Servers</span>
+						</DropdownMenu.Item>
+					{/snippet}
+				</DropdownMenuSearchable>
+			{:else}
+				<div class="px-2 py-3 text-center text-sm text-muted-foreground">
+					No MCP servers configured
+				</div>
+
+				<DropdownMenu.Separator />
+
+				<DropdownMenu.Item
+					class="flex cursor-pointer items-center gap-2"
+					onclick={handleMcpSettingsClick}
+				>
+					<Plus class="h-4 w-4" />
+
+					<span>Add MCP Servers</span>
+				</DropdownMenu.Item>
+			{/if}
+		</DropdownMenu.SubContent>
+	</DropdownMenu.Sub>
+</DropdownMenu.Root>
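Review note: `filteredMcpServers` in the submenu above first restricts the list to enabled servers and then matches the trimmed, lower-cased query against either the display label or the URL. A standalone sketch of that filter follows. `ServerEntry`, `getServerLabel` and `filterServers` are hypothetical stand-ins: the real label comes from `mcpStore.getServerLabel`, whose logic is not shown here, so falling back from `name` to `url` is an assumption.

```typescript
// Hypothetical, trimmed-down shape of a server entry.
interface ServerEntry {
	id: string;
	url: string;
	name?: string;
	enabled: boolean;
}

// Assumption: use the configured name when present, else the URL.
function getServerLabel(server: ServerEntry): string {
	return server.name?.trim() || server.url;
}

// Keeps only enabled servers, then applies a case-insensitive substring
// match on label or URL. An empty query returns all enabled servers.
function filterServers(servers: ServerEntry[], query: string): ServerEntry[] {
	const enabled = servers.filter((s) => s.enabled);
	const q = query.toLowerCase().trim();
	if (!q) return enabled;

	return enabled.filter((s) => {
		const label = getServerLabel(s).toLowerCase();
		const url = s.url.toLowerCase();
		return label.includes(q) || url.includes(q);
	});
}
```

Because disabled servers are dropped before the query is applied, a search can never surface a server that the user has turned off globally, which matches the `hasMcpServers` check guarding the searchable list.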
--- a/examples/server/webui_llamacpp/src/lib/components/app/chat/ChatForm/ChatFormActions/ChatFormActionAdd/ChatFormActionAddSheet.svelte
+++ b/examples/server/webui_llamacpp/src/lib/components/app/chat/ChatForm/ChatFormActions/ChatFormActionAdd/ChatFormActionAddSheet.svelte
@ -0,0 +1,297 @@
+<script lang="ts">
+	import type { Snippet } from 'svelte';
+	import * as Tooltip from '$lib/components/ui/tooltip';
+	import * as Sheet from '$lib/components/ui/sheet';
+	import * as Collapsible from '$lib/components/ui/collapsible';
+	import { File, MessageSquare, Zap, FolderOpen } from '@lucide/svelte';
+	import { Switch } from '$lib/components/ui/switch';
+	import { Checkbox } from '$lib/components/ui/checkbox';
+	import { TOOLTIP_DELAY_DURATION } from '$lib/constants';
+	import { ATTACHMENT_FILE_ITEMS } from '$lib/constants/attachment-menu';
+	import { useAttachmentMenu } from '$lib/hooks/use-attachment-menu.svelte';
+	import { useToolsPanel } from '$lib/hooks/use-tools-panel.svelte';
+	import { conversationsStore } from '$lib/stores/conversations.svelte';
+	import { mcpStore } from '$lib/stores/mcp.svelte';
+	import { McpLogo } from '$lib/components/app';
+	import { PencilRuler, ChevronDown, ChevronRight } from '@lucide/svelte';
+	import { HealthCheckStatus } from '$lib/enums';
+
+	interface Props {
+		class?: string;
+		disabled?: boolean;
+		hasAudioModality?: boolean;
+		hasVideoModality?: boolean;
+		hasVisionModality?: boolean;
+		hasMcpPromptsSupport?: boolean;
+		hasMcpResourcesSupport?: boolean;
+		onFileUpload?: () => void;
+		onSystemPromptClick?: () => void;
+		onMcpPromptClick?: () => void;
+		onMcpResourcesClick?: () => void;
+		trigger: Snippet<[{ disabled: boolean; onclick?: () => void }]>;
+	}
+
+	let {
+		class: className = '',
+		disabled = false,
+		hasAudioModality = false,
+		hasVisionModality = false,
+		hasVideoModality = false,
+		hasMcpPromptsSupport = false,
+		hasMcpResourcesSupport = false,
+		onFileUpload,
+		onSystemPromptClick,
+		onMcpPromptClick,
+		onMcpResourcesClick,
+		trigger
+	}: Props = $props();
+
+	let sheetOpen = $state(false);
+	let filesExpanded = $state(true);
+	let toolsExpanded = $state(false);
+	let mcpExpanded = $state(false);
+
+	const attachmentMenu = useAttachmentMenu(
+		() => ({
+			hasVisionModality,
+			hasAudioModality,
+			hasVideoModality,
+			hasMcpPromptsSupport,
+			hasMcpResourcesSupport
+		}),
+		() => ({ onFileUpload, onSystemPromptClick, onMcpPromptClick, onMcpResourcesClick }),
+		() => {
+			sheetOpen = false;
+		}
+	);
+
+	const toolsPanel = useToolsPanel();
+
+	const sheetItemClass =
+		'flex w-full items-center gap-3 rounded-md px-3 py-2.5 text-left text-sm transition-colors hover:bg-accent active:bg-accent disabled:cursor-not-allowed disabled:opacity-50';
+
+	const sheetItemRowClass =
+		'flex w-full items-center justify-between gap-2 rounded-md px-3 py-2 text-left text-sm transition-colors hover:bg-accent';
+
+	function getEnabledMcpServers() {
+		return mcpStore.getServersSorted().filter((s) => s.enabled);
+	}
+</script>
+
+<div class="flex items-center gap-1 {className}">
+	<Sheet.Root bind:open={sheetOpen}>
+		{@render trigger({ disabled, onclick: () => (sheetOpen = true) })}
+
+		<Sheet.Content side="bottom" class="max-h-[85vh] gap-0 overflow-y-auto">
+			<Sheet.Header>
+				<Sheet.Title>Add to chat</Sheet.Title>
+
+				<Sheet.Description class="sr-only">
+					Add files, system prompt or configure MCP servers
+				</Sheet.Description>
+			</Sheet.Header>
+
+			<div class="flex flex-col gap-1 px-1.5 pb-2">
+				<Collapsible.Root open={filesExpanded} onOpenChange={(open) => (filesExpanded = open)}>
+					<Collapsible.Trigger class={sheetItemClass}>
+						{#if filesExpanded}
+							<ChevronDown class="h-4 w-4 shrink-0" />
+						{:else}
+							<ChevronRight class="h-4 w-4 shrink-0" />
+						{/if}
+
+						<File class="h-4 w-4 shrink-0" />
+
+						<span class="flex-1">Add files</span>
+					</Collapsible.Trigger>
+
+					<Collapsible.Content>
+						<div class="flex flex-col gap-0.5 pl-4">
+							{#each ATTACHMENT_FILE_ITEMS as item (item.id)}
+								{@const enabled = attachmentMenu.isItemEnabled(item.enabledWhen)}
+								{#if enabled}
+									<button
+										type="button"
+										class={sheetItemClass}
+										onclick={() => attachmentMenu.callbacks[item.action]()}
+									>
+										<item.icon class="h-4 w-4 shrink-0" />
+
+										<span>{item.label}</span>
+									</button>
+								{:else if item.disabledTooltip}
+									<Tooltip.Root delayDuration={TOOLTIP_DELAY_DURATION}>
+										<Tooltip.Trigger>
+											<button type="button" class={sheetItemClass} disabled>
+												<item.icon class="h-4 w-4 shrink-0" />
+
+												<span>{item.label}</span>
+											</button>
+										</Tooltip.Trigger>
+
+										<Tooltip.Content side="right">
+											<p>{item.disabledTooltip}</p>
+										</Tooltip.Content>
+									</Tooltip.Root>
+								{/if}
+							{/each}
+						</div>
+					</Collapsible.Content>
+				</Collapsible.Root>
+
+				<Collapsible.Root open={mcpExpanded} onOpenChange={(open) => (mcpExpanded = open)}>
+					<Collapsible.Trigger class={sheetItemClass}>
+						{#if mcpExpanded}
+							<ChevronDown class="h-4 w-4 shrink-0" />
+						{:else}
+							<ChevronRight class="h-4 w-4 shrink-0" />
+						{/if}
+
+						<McpLogo class="inline h-4 w-4 shrink-0" />
+
+						<span class="flex-1">MCP Servers</span>
+
+						<span class="text-xs text-muted-foreground">
+							{getEnabledMcpServers().length} server{getEnabledMcpServers().length !== 1 ? 's' : ''}
+						</span>
+					</Collapsible.Trigger>
+
+					<Collapsible.Content>
+						<div class="flex flex-col gap-0.5 pl-4">
+							{#each getEnabledMcpServers() as server (server.id)}
+								{@const healthState = mcpStore.getHealthCheckState(server.id)}
+								{@const hasError = healthState.status === HealthCheckStatus.ERROR}
+								{@const displayName = mcpStore.getServerLabel(server)}
+								{@const faviconUrl = mcpStore.getServerFavicon(server.id)}
+								{@const isEnabled = conversationsStore.isMcpServerEnabledForChat(server.id)}
+
+								<button
+									type="button"
+									class={sheetItemRowClass}
+									onclick={() => !hasError && conversationsStore.toggleMcpServerForChat(server.id)}
+									disabled={hasError}
+								>
+									<div class="flex min-w-0 flex-1 items-center gap-2">
+										{#if faviconUrl}
+											<img
+												src={faviconUrl}
+												alt=""
+												class="h-4 w-4 shrink-0 rounded-sm"
+												onerror={(e) => {
+													(e.currentTarget as HTMLImageElement).style.display = 'none';
+												}}
+											/>
+										{/if}
+
+										<span class="min-w-0 truncate text-sm">{displayName}</span>
+									</div>
+
+									{#if hasError}
+										<span
+											class="shrink-0 rounded bg-destructive/15 px-1.5 py-0.5 text-xs text-destructive"
+										>
+											Error
+										</span>
+									{:else}
+										<Switch
+											checked={isEnabled} onclick={(e) => e.stopPropagation()}
+											onCheckedChange={() => conversationsStore.toggleMcpServerForChat(server.id)}
+										/>
+									{/if}
+								</button>
+							{/each}
+
+							{#if getEnabledMcpServers().length === 0}
+								<div class="px-3 py-2 text-center text-sm text-muted-foreground">
+									No MCP servers configured
+								</div>
+							{/if}
+						</div>
+					</Collapsible.Content>
+				</Collapsible.Root>
+
+				{#if toolsPanel.totalToolCount > 0}
+					<Collapsible.Root open={toolsExpanded} onOpenChange={(open) => (toolsExpanded = open)}>
+						<Collapsible.Trigger class={sheetItemClass}>
+							{#if toolsExpanded}
+								<ChevronDown class="h-4 w-4 shrink-0" />
+							{:else}
+								<ChevronRight class="h-4 w-4 shrink-0" />
+							{/if}
+
+							<PencilRuler class="inline h-4 w-4 shrink-0" />
+
+							<span class="flex-1">Tools</span>
+
+							<span class="text-xs text-muted-foreground">
+								{toolsPanel.totalToolCount} tool{toolsPanel.totalToolCount !== 1 ? 's' : ''}
+							</span>
+						</Collapsible.Trigger>
+
+						<Collapsible.Content>
+							<div class="flex flex-col gap-0.5 pl-4">
+								{#each toolsPanel.activeGroups as group (group.label)}
+									{@const checked = toolsPanel.isGroupChecked(group)}
+									{@const enabledCount = toolsPanel.getEnabledToolCount(group)}
+									{@const favicon = toolsPanel.getFavicon(group)}
+
+									<button
+										type="button"
+										class={sheetItemRowClass}
+										onclick={() => toolsPanel.toggleGroupByLabel(group.label)}
+									>
+										{#if favicon}
+											<img
+												src={favicon}
+												alt=""
+												class="h-4 w-4 shrink-0 rounded-sm"
+												onerror={(e) => {
+													(e.currentTarget as HTMLImageElement).style.display = 'none';
+												}}
+											/>
+										{/if}
+
+										<span class="min-w-0 flex-1 truncate text-sm font-medium">{group.label}</span>
+
+										<span class="shrink-0 text-xs text-muted-foreground">
+											{enabledCount}/{group.tools.length}
+										</span>
+
+										<Checkbox
+											{checked}
+											class="h-4 w-4 shrink-0"
+											onclick={(e) => e.stopPropagation()}
+											onCheckedChange={() => toolsPanel.toggleGroupByLabel(group.label)}
+										/>
+									</button>
+								{/each}
+							</div>
+						</Collapsible.Content>
+					</Collapsible.Root>
+				{/if}
+
+				<button type="button" class={sheetItemClass} onclick={onSystemPromptClick}>
+					<MessageSquare class="h-4 w-4 shrink-0" />
+
+					<span>System Message</span>
+				</button>
+
+				{#if hasMcpPromptsSupport}
+					<button type="button" class={sheetItemClass} onclick={onMcpPromptClick}>
+						<Zap class="h-4 w-4 shrink-0" />
+
+						<span>MCP Prompt</span>
+					</button>
+				{/if}
+
+				{#if hasMcpResourcesSupport}
+					<button type="button" class={sheetItemClass} onclick={onMcpResourcesClick}>
+						<FolderOpen class="h-4 w-4 shrink-0" />
+
+						<span>MCP Resources</span>
+					</button>
+				{/if}
+			</div>
+		</Sheet.Content>
+	</Sheet.Root>
+</div>
--- a/examples/server/webui_llamacpp/src/lib/components/app/chat/ChatForm/ChatFormActions/ChatFormActionAdd/ChatFormActionAddToolsSubmenu.svelte
+++ b/examples/server/webui_llamacpp/src/lib/components/app/chat/ChatForm/ChatFormActions/ChatFormActionAdd/ChatFormActionAddToolsSubmenu.svelte
@ -0,0 +1,157 @@
+<script lang="ts">
+	import { PencilRuler, ChevronDown, ChevronRight, Loader2, Info, Check } from '@lucide/svelte';
+	import { Checkbox } from '$lib/components/ui/checkbox';
+	import * as Collapsible from '$lib/components/ui/collapsible';
+	import * as DropdownMenu from '$lib/components/ui/dropdown-menu';
+	import * as Tooltip from '$lib/components/ui/tooltip';
+	import { toolsStore } from '$lib/stores/tools.svelte';
+	import { CLI_FLAGS } from '$lib/constants';
+	import { mcpStore } from '$lib/stores/mcp.svelte';
+	import { useToolsPanel } from '$lib/hooks/use-tools-panel.svelte';
+
+	const toolsPanel = useToolsPanel();
+	const hasMcpServersAvailable = $derived(mcpStore.getServersSorted().length > 0);
+</script>
+
+<DropdownMenu.Sub onOpenChange={(open) => open && toolsPanel.handleOpen()}>
+	<DropdownMenu.SubTrigger class="flex cursor-pointer items-center gap-2">
+		<PencilRuler class="h-4 w-4" />
+
+		<span>Tools</span>
+	</DropdownMenu.SubTrigger>
+
+	<DropdownMenu.SubContent class="w-72 p-0">
+		{#if toolsPanel.totalToolCount === 0}
+			{#if toolsStore.loading}
+				<div class="px-3 py-4 text-center text-sm text-muted-foreground">
+					<Loader2 class="mx-auto mb-1 h-4 w-4 animate-spin" />
+
+					Loading tools...
+				</div>
+			{:else if toolsStore.isToolsEndpointUnreachable}
+				<div class="grid gap-2.5 px-3 py-4 text-sm text-muted-foreground">
+					<span class="flex gap-2">
+						<Info class="mt-0.5 h-4 w-4 shrink-0" />
+
+						<span>
+							Run llama-server with the <code>{CLI_FLAGS.TOOLS}</code> flag to enable
+
+							<strong>Built-in Tools</strong>.
+						</span>
+					</span>
+
+					<span class="flex gap-2">
+						<Info class="mt-0.5 h-4 w-4 shrink-0" />
+
+						<span>
+							{hasMcpServersAvailable ? 'Enable' : 'Add'} MCP Server(s) to access
+
+							<strong>MCP Tools</strong>.
+						</span>
+					</span>
+				</div>
+			{:else if toolsStore.error}
+				<div class="px-3 py-4 text-center text-sm text-muted-foreground">Failed to load tools</div>
+			{:else if toolsPanel.noToolsInfoMessage}
+				<div class="flex gap-2 px-3 py-4 text-sm text-muted-foreground">
+					<Info class="mt-0.5 h-4 w-4 shrink-0" />
+
+					<span>{toolsPanel.noToolsInfoMessage}</span>
+				</div>
+			{:else}
+				<div class="px-3 py-4 text-center text-sm text-muted-foreground">No tools available</div>
+			{/if}
+		{:else}
+			<div class="max-h-80 overflow-y-auto p-2 pr-1">
+				{#each toolsPanel.activeGroups as group (group.label)}
+					{@const isExpanded = toolsPanel.expandedGroups.has(group.label)}
+					{@const checked = toolsPanel.isGroupChecked(group)}
+					{@const favicon = toolsPanel.getFavicon(group)}
+
+					<Collapsible.Root
+						open={isExpanded}
+						onOpenChange={() => toolsPanel.toggleGroupExpanded(group.label)}
+					>
+						<div class="flex items-center gap-1">
+							<Collapsible.Trigger
+								class="flex min-w-0 flex-1 items-center gap-2 rounded px-2 py-1.5 text-sm hover:bg-muted/50"
+							>
+								{#if isExpanded}
+									<ChevronDown class="h-3.5 w-3.5 shrink-0" />
+								{:else}
+									<ChevronRight class="h-3.5 w-3.5 shrink-0" />
+								{/if}
+
+								<span class="inline-flex min-w-0 items-center gap-1.5 font-medium">
+									{#if favicon}
+										<img
+											src={favicon}
+											alt=""
+											class="h-4 w-4 shrink-0 rounded-sm"
+											onerror={(e) => {
+												(e.currentTarget as HTMLImageElement).style.display = 'none';
+											}}
+										/>
+									{/if}
+
+									<span class="truncate">{group.label}</span>
+								</span>
+
+								<span class="ml-auto shrink-0 text-xs text-muted-foreground">
+									{toolsPanel.getEnabledToolCount(group)}/{group.tools.length}
+								</span>
+							</Collapsible.Trigger>
+
+							<Tooltip.Root>
+								<Tooltip.Trigger>
+									{#snippet child({ props })}
+										<Checkbox
+											{...props}
+											{checked}
+											onCheckedChange={() => toolsPanel.toggleGroupByLabel(group.label)}
+											class="mr-2 h-4 w-4 shrink-0"
+										/>
+									{/snippet}
+								</Tooltip.Trigger>
+
+								<Tooltip.Content side="right">
+									<p>
+										{checked ? 'Disable' : 'Enable'}
+										{group.tools.length} tool{group.tools.length !== 1 ? 's' : ''}
+									</p>
+								</Tooltip.Content>
+							</Tooltip.Root>
+						</div>
+
+						<Collapsible.Content>
+							<div class="ml-4 flex flex-col gap-0.5 border-l border-border/50 pl-2">
+								{#each group.tools as entry (entry.key)}
+									{@const enabled = toolsStore.isToolEnabled(entry.key)}
+									<button
+										type="button"
+										class="flex w-full items-center gap-2 rounded px-2 py-1.5 text-left text-sm transition-colors hover:bg-muted/50"
+										onclick={() => toolsStore.toggleTool(entry.key)}
+									>
+										<span
+											data-slot="checkbox"
+											data-state={enabled ? 'checked' : 'unchecked'}
+											class="flex size-4 shrink-0 items-center justify-center rounded-[4px] border border-input data-[state=checked]:border-primary data-[state=checked]:bg-primary data-[state=checked]:text-primary-foreground"
+										>
+											{#if enabled}
+												<Check class="size-3.5" />
+											{/if}
+										</span>
+
+										<span class="min-w-0 flex-1 truncate font-mono text-[12px]">
+											{entry.definition.function.name}
+										</span>
+									</button>
+								{/each}
+							</div>
+						</Collapsible.Content>
+					</Collapsible.Root>
+				{/each}
+			</div>
+		{/if}
+	</DropdownMenu.SubContent>
+</DropdownMenu.Sub>
--- a/examples/server/webui_llamacpp/src/lib/components/app/chat/ChatForm/ChatFormActions/ChatFormActionAdd/ChatFormActionsAdd.svelte
+++ b/examples/server/webui_llamacpp/src/lib/components/app/chat/ChatForm/ChatFormActions/ChatFormActionAdd/ChatFormActionsAdd.svelte
@ -0,0 +1,66 @@
+<script lang="ts">
+	import { isMobile } from '$lib/stores/viewport.svelte';
+	import ChatFormActionAddDropdown from './ChatFormActionAddDropdown.svelte';
+	import ChatFormActionAddSheet from './ChatFormActionAddSheet.svelte';
+	import ChatFormActionAddButton from './ChatFormActionAddButton.svelte';
+
+	interface Props {
+		disabled?: boolean;
+		hasAudioModality?: boolean;
+		hasVideoModality?: boolean;
+		hasMcpPromptsSupport?: boolean;
+		hasMcpResourcesSupport?: boolean;
+		hasVisionModality?: boolean;
+		onFileUpload?: () => void;
+		onMcpPromptClick?: () => void;
+		onMcpResourcesClick?: () => void;
+		onMcpSettingsClick?: () => void;
+		onSystemPromptClick?: () => void;
+	}
+
+	let {
+		disabled = false,
+		hasAudioModality = false,
+		hasVideoModality = false,
+		hasMcpPromptsSupport = false,
+		hasMcpResourcesSupport = false,
+		hasVisionModality = false,
+		onFileUpload,
+		onMcpPromptClick,
+		onMcpResourcesClick,
+		onMcpSettingsClick,
+		onSystemPromptClick
+	}: Props = $props();
+</script>
+
+{#if isMobile.current}
+	<ChatFormActionAddSheet
+		{disabled}
+		{hasAudioModality}
+		{hasVideoModality}
+		{hasVisionModality}
+		{hasMcpPromptsSupport}
+		{hasMcpResourcesSupport}
+		{onFileUpload}
+		{onMcpPromptClick}
+		{onMcpResourcesClick} {onSystemPromptClick}
+	>
+		{#snippet trigger({ disabled, onclick })}
+			<ChatFormActionAddButton {disabled} {onclick} />
+		{/snippet}
+	</ChatFormActionAddSheet>
+{:else}
+	<ChatFormActionAddDropdown
+		{disabled}
+		{hasAudioModality}
+		{hasVideoModality}
+		{hasVisionModality}
+		{hasMcpPromptsSupport}
+		{hasMcpResourcesSupport}
+		{onFileUpload}
+		{onMcpPromptClick}
+		{onMcpResourcesClick}
+		{onMcpSettingsClick}
+		{onSystemPromptClick}
+	/>
+{/if}
--- a/examples/server/webui_llamacpp/src/lib/components/app/chat/ChatForm/ChatFormActions/ChatFormActionModels.svelte
+++ b/examples/server/webui_llamacpp/src/lib/components/app/chat/ChatForm/ChatFormActions/ChatFormActionModels.svelte
@ -0,0 +1,193 @@
+<script lang="ts">
+	import { chatStore } from '$lib/stores/chat.svelte';
+	import {
+		modelsStore,
+		modelOptions,
+		selectedModelId,
+		selectedModelName
+	} from '$lib/stores/models.svelte';
+	import { isRouterMode, serverError } from '$lib/stores/server.svelte';
+	import { ModelsSelectorDropdown, ModelsSelectorSheet } from '$lib/components/app';
+	import { isMobile } from '$lib/stores/viewport.svelte';
+	import { activeMessages } from '$lib/stores/conversations.svelte';
+
+	interface Props {
+		disabled?: boolean;
+		forceForegroundText?: boolean;
+		hasAudioModality?: boolean;
+		hasVideoModality?: boolean;
+		hasVisionModality?: boolean;
+		hasModelSelected?: boolean;
+		isSelectedModelInCache?: boolean;
+		submitTooltip?: string;
+		useGlobalSelection?: boolean;
+	}
+
+	let {
+		disabled = false,
+		forceForegroundText = false,
+		hasAudioModality = $bindable(false),
+		hasVideoModality = $bindable(false),
+		hasVisionModality = $bindable(false),
+		hasModelSelected = $bindable(false),
+		isSelectedModelInCache = $bindable(true),
+		submitTooltip = $bindable(''),
+		useGlobalSelection = false
+	}: Props = $props();
+
+	let isRouter = $derived(isRouterMode());
+	let isOffline = $derived(!!serverError());
+
+	let conversationModel = $derived(
+		chatStore.getConversationModel(activeMessages() as DatabaseMessage[])
+	);
+
+	let lastSyncedConversationModel: string | null = null;
+
+	let selectorModel = $derived.by(() => {
+		const storeModel = selectedModelName();
+		if (storeModel && storeModel !== conversationModel) {
+			return storeModel;
+		}
+
+		if (conversationModel) {
+			return conversationModel;
+		}
+
+		return null;
+	});
+
+	$effect(() => {
+		if (conversationModel && conversationModel !== lastSyncedConversationModel) {
+			if (modelOptions().some((m) => m.model === conversationModel)) {
+				modelsStore.selectedModelName = conversationModel;
+				modelsStore.selectModelByName(conversationModel);
+			} else {
+				modelsStore.selectedModelName = null;
+				modelsStore.clearSelection();
+			}
+			lastSyncedConversationModel = conversationModel;
+		} else if (
+			isRouter &&
+			!modelsStore.selectedModelId &&
+			modelsStore.loadedModelIds.length > 0 &&
+			activeMessages().length > 0 &&
+			!conversationModel
+		) {
+			lastSyncedConversationModel = null;
+			const first = modelOptions().find((m) => modelsStore.loadedModelIds.includes(m.model));
+			if (first) modelsStore.selectModelById(first.id);
+		}
+	});
+
+	let activeModelId = $derived.by(() => {
+		const options = modelOptions();
+
+		if (!isRouter) {
+			return options.length > 0 ? options[0].model : null;
+		}
+
+		const selectedId = selectedModelId();
+
+		if (selectedId) {
+			const model = options.find((m) => m.id === selectedId);
+
+			if (model) return model.model;
+		}
+
+		if (conversationModel) {
+			const model = options.find((m) => m.model === conversationModel);
+
+			if (model) return model.model;
+		}
+
+		return null;
+	});
+
+	let modelPropsVersion = $state(0); // Used to trigger reactivity after fetch
+
+	$effect(() => {
+		if (activeModelId) {
+			const cached = modelsStore.getModelProps(activeModelId);
+
+			if (!cached) {
+				modelsStore.fetchModelProps(activeModelId).then(() => {
+					modelPropsVersion++;
+				});
+			}
+		}
+	});
+
+	$effect(() => {
+		void modelPropsVersion;
+
+		hasAudioModality = activeModelId ? modelsStore.modelSupportsAudio(activeModelId) : false;
+	});
+
+	$effect(() => {
+		void modelPropsVersion;
+
+		hasVideoModality = activeModelId ? modelsStore.modelSupportsVideo(activeModelId) : false;
+	});
+
+	$effect(() => {
+		void modelPropsVersion;
+
+		hasVisionModality = activeModelId ? modelsStore.modelSupportsVision(activeModelId) : false;
+	});
+
+	$effect(() => {
+		hasModelSelected = !isRouter || !!conversationModel || !!selectedModelId();
+	});
+
+	$effect(() => {
+		if (!isRouter) {
+			isSelectedModelInCache = true;
+		} else if (conversationModel) {
+			isSelectedModelInCache = modelOptions().some((option) => option.model === conversationModel);
+		} else {
+			const currentModelId = selectedModelId();
+
+			if (!currentModelId) {
+				isSelectedModelInCache = false;
+			} else {
+				isSelectedModelInCache = modelOptions().some((option) => option.id === currentModelId);
+			}
+		}
+	});
+
+	$effect(() => {
+		if (!hasModelSelected) {
+			submitTooltip = 'Please select a model first';
+		} else if (!isSelectedModelInCache) {
+			submitTooltip = 'Selected model is not available, please select another';
+		} else {
+			submitTooltip = '';
+		}
+	});
+
+	let selectorModelRef: ModelsSelectorDropdown | ModelsSelectorSheet | undefined =
+		$state(undefined);
+
+	export function open() {
+		selectorModelRef?.open();
+	}
+</script>
+
+{#if isMobile.current}
+	<ModelsSelectorSheet
+		disabled={disabled || isOffline}
+		bind:this={selectorModelRef}
+		currentModel={selectorModel}
+		{forceForegroundText}
+		{useGlobalSelection}
+	/>
+{:else}
+	<ModelsSelectorDropdown
+		disabled={disabled || isOffline}
+		bind:this={selectorModelRef}
+		currentModel={selectorModel}
+		{forceForegroundText}
+		{useGlobalSelection}
+	/>
+{/if}
--- a/examples/server/webui_llamacpp/src/lib/components/app/chat/ChatForm/ChatFormActions/ChatFormActionRecord.svelte
+++ b/examples/server/webui_llamacpp/src/lib/components/app/chat/ChatForm/ChatFormActions/ChatFormActionRecord.svelte
@ -0,0 +1,52 @@
+<script lang="ts">
+	import { Mic, Square } from '@lucide/svelte';
+	import { Button } from '$lib/components/ui/button';
+	import * as Tooltip from '$lib/components/ui/tooltip';
+
+	interface Props {
+		class?: string;
+		disabled?: boolean;
+		hasAudioModality?: boolean;
+		isLoading?: boolean;
+		isRecording?: boolean;
+		onMicClick?: () => void;
+	}
+
+	let {
+		class: className = '',
+		disabled = false,
+		hasAudioModality = false,
+		isLoading = false,
+		isRecording = false,
+		onMicClick
+	}: Props = $props();
+</script>
+
+<div class="flex items-center gap-1 {className}">
+	<Tooltip.Root>
+		<Tooltip.Trigger>
+			<Button
+				class="h-8 w-8 rounded-full p-0 {isRecording
+					? 'animate-pulse bg-red-500 text-white hover:bg-red-600'
+					: ''}"
+				disabled={disabled || isLoading || !hasAudioModality}
+				onclick={onMicClick}
+				type="button"
+			>
+				<span class="sr-only">{isRecording ? 'Stop recording' : 'Start recording'}</span>
+
+				{#if isRecording}
+					<Square class="h-4 w-4 animate-pulse fill-white" />
+				{:else}
+					<Mic class="h-4 w-4" />
+				{/if}
+			</Button>
+		</Tooltip.Trigger>
+
+		{#if !hasAudioModality}
+			<Tooltip.Content>
+				<p>Current model does not support audio</p>
+			</Tooltip.Content>
+		{/if}
+	</Tooltip.Root>
+</div>
--- a/examples/server/webui_llamacpp/src/lib/components/app/chat/ChatForm/ChatFormActions/ChatFormActionSubmit.svelte
+++ b/examples/server/webui_llamacpp/src/lib/components/app/chat/ChatForm/ChatFormActions/ChatFormActionSubmit.svelte
@ -0,0 +1,46 @@
+<script lang="ts">
+	import { ArrowUp } from '@lucide/svelte';
+	import { Button } from '$lib/components/ui/button';
+	import * as Tooltip from '$lib/components/ui/tooltip';
+
+	interface Props {
+		canSend?: boolean;
+		disabled?: boolean;
+		showErrorState?: boolean;
+		tooltipLabel?: string;
+	}
+
+	let { canSend = false, disabled = false, showErrorState = false, tooltipLabel }: Props = $props();
+
+	let isDisabled = $derived(!canSend || disabled);
+</script>
+
+{#snippet submitButton(props = {})}
+	<Button
+		type="submit"
+		disabled={isDisabled}
+		class={[
+			'h-8 w-8 rounded-full p-0',
+			showErrorState &&
+				'bg-red-400/10 text-red-400 hover:bg-red-400/20 hover:text-red-400 disabled:opacity-100'
+		]}
+		{...props}
+	>
+		<span class="sr-only">Send</span>
+		<ArrowUp class="h-12 w-12" />
+	</Button>
+{/snippet}
+
+{#if tooltipLabel}
+	<Tooltip.Root>
+		<Tooltip.Trigger>
+			{@render submitButton()}
+		</Tooltip.Trigger>
+
+		<Tooltip.Content>
+			<p>{tooltipLabel}</p>
+		</Tooltip.Content>
+	</Tooltip.Root>
+{:else}
+	{@render submitButton()}
+{/if}
--- a/examples/server/webui_llamacpp/src/lib/components/app/chat/ChatForm/ChatFormActions/ChatFormActions.svelte
+++ b/examples/server/webui_llamacpp/src/lib/components/app/chat/ChatForm/ChatFormActions/ChatFormActions.svelte
@ -0,0 +1,177 @@
+<script lang="ts">
+	import { Square, SkipForward } from '@lucide/svelte';
+	import { Button } from '$lib/components/ui/button';
+	import { ChatService } from '$lib/services';
+	import {
+		ChatFormActionsAdd,
+		ChatFormActionModels,
+		ChatFormActionRecord,
+		ChatFormActionSubmit,
+		ChatFormReasoningToggle
+	} from '$lib/components/app';
+	import { FileTypeCategory } from '$lib/enums';
+	import { mcpStore } from '$lib/stores/mcp.svelte';
+	import { config } from '$lib/stores/settings.svelte';
+	import { conversationsStore } from '$lib/stores/conversations.svelte';
+	import { getFileTypeCategory } from '$lib/utils';
+	import { goto } from '$app/navigation';
+	import { ROUTES } from '$lib/constants/routes';
+
+	interface Props {
+		canSend?: boolean;
+		canSubmit?: boolean;
+		class?: string;
+		disabled?: boolean;
+		isLoading?: boolean;
+		isReasoning?: boolean;
+		isRecording?: boolean;
+		showAddButton?: boolean;
+		showModelSelector?: boolean;
+		uploadedFiles?: ChatUploadedFile[];
+		onFileUpload?: () => void;
+		onMicClick?: () => void;
+		onStop?: () => void;
+		onSystemPromptClick?: () => void;
+		onMcpPromptClick?: () => void;
+		onMcpResourcesClick?: () => void;
+	}
+
+	let {
+		canSend = false,
+		canSubmit = false,
+		class: className = '',
+		disabled = false,
+		isLoading = false,
+		isReasoning = false,
+		isRecording = false,
+		showAddButton = true,
+		showModelSelector = true,
+		uploadedFiles = [],
+		onFileUpload,
+		onMicClick,
+		onStop,
+		onSystemPromptClick,
+		onMcpPromptClick,
+		onMcpResourcesClick
+	}: Props = $props();
+
+	let currentConfig = $derived(config());
+
+	let hasMcpPromptsSupport = $derived.by(() => {
+		const perChatOverrides = conversationsStore.getAllMcpServerOverrides();
+
+		return mcpStore.hasPromptsCapability(perChatOverrides);
+	});
+
+	let hasMcpResourcesSupport = $derived.by(() => {
+		const perChatOverrides = conversationsStore.getAllMcpServerOverrides();
+
+		return mcpStore.hasResourcesCapability(perChatOverrides);
+	});
+
+	let hasAudioModality = $state(false);
+	let hasVideoModality = $state(false);
+	let hasVisionModality = $state(false);
+	let hasModelSelected = $state(false);
+	let isSelectedModelInCache = $state(true);
+	let submitTooltip = $state('');
+
+	let hasAudioAttachments = $derived(
+		uploadedFiles.some((file) => getFileTypeCategory(file.type) === FileTypeCategory.AUDIO)
+	);
+	let shouldShowRecordButton = $derived(
+		hasAudioModality && !canSubmit && !hasAudioAttachments && currentConfig.autoMicOnEmpty
+	);
+
+	let selectorModelRef: ChatFormActionModels | undefined = $state(undefined);
+
+	export function openModelSelector() {
+		selectorModelRef?.open();
+	}
+	// The streaming assistant message carries both the completion id and the model that
+	// produced it; targeting reasoning control from the same source keeps them consistent.
+	let activeMessage = $derived(
+		conversationsStore.activeMessages[conversationsStore.activeMessages.length - 1]
+	);
+</script>
+
+<div
+	class="flex w-full items-center gap-3 {className} {showAddButton ? '' : 'justify-end'}"
+	style="container-type: inline-size"
+>
+	{#if showAddButton}
+		<div class="mr-auto flex items-center gap-3">
+			<ChatFormActionsAdd
+				{disabled}
+				{hasAudioModality}
+				{hasVideoModality}
+				{hasVisionModality}
+				{hasMcpPromptsSupport}
+				{hasMcpResourcesSupport}
+				{onFileUpload}
+				{onSystemPromptClick}
+				{onMcpPromptClick}
+				{onMcpResourcesClick}
+				onMcpSettingsClick={() => goto(ROUTES.MCP_SERVERS)}
+			/>
+		</div>
+	{/if}
+
+	<div class="flex items-center gap-2">
+		<ChatFormReasoningToggle />
+
+		{#if showModelSelector}
+			<ChatFormActionModels
+				{disabled}
+				bind:this={selectorModelRef}
+				bind:hasAudioModality
+				bind:hasVideoModality
+				bind:hasVisionModality
+				bind:hasModelSelected
+				bind:isSelectedModelInCache
+				bind:submitTooltip
+				forceForegroundText
+				useGlobalSelection
+			/>
+		{/if}
+	</div>
+
+	{#if isReasoning}
+		<Button
+			type="button"
+			variant="secondary"
+			onclick={() =>
+				ChatService.stopReasoning(activeMessage?.completionId ?? '', activeMessage?.model)}
+			class="group h-8 w-8 rounded-full p-0"
+			title="Skip reasoning"
+		>
+			<span class="sr-only">Skip reasoning</span>
+
+			<SkipForward class="h-4 w-4 stroke-muted-foreground group-hover:stroke-foreground" />
+		</Button>
+	{/if}
+
+	{#if isLoading && !canSubmit}
+		<Button
+			type="button"
+			variant="secondary"
+			onclick={onStop}
+			class="group h-8 w-8 rounded-full p-0 hover:bg-destructive/10!"
+		>
+			<span class="sr-only">Stop</span>
+
+			<Square
+				class="h-8 w-8 fill-muted-foreground stroke-muted-foreground group-hover:fill-destructive group-hover:stroke-destructive hover:fill-destructive hover:stroke-destructive"
+			/>
+		</Button>
+	{:else if shouldShowRecordButton}
+		<ChatFormActionRecord {disabled} {hasAudioModality} {isLoading} {isRecording} {onMicClick} />
+	{:else}
+		<ChatFormActionSubmit
+			canSend={canSend && (showModelSelector ? hasModelSelected && isSelectedModelInCache : true)}
+			{disabled}
+			tooltipLabel={submitTooltip}
+			showErrorState={showModelSelector && hasModelSelected && !isSelectedModelInCache}
+		/>
+	{/if}
+</div>
--- a/examples/server/webui_llamacpp/src/lib/components/app/chat/ChatForm/ChatFormActions/ChatFormReasoningEffortSubmenu.svelte
+++ b/examples/server/webui_llamacpp/src/lib/components/app/chat/ChatForm/ChatFormActions/ChatFormReasoningEffortSubmenu.svelte
@ -0,0 +1,132 @@
+<script lang="ts">
+	import { Check, Info, Lightbulb, LightbulbOff } from '@lucide/svelte';
+	import * as DropdownMenu from '$lib/components/ui/dropdown-menu';
+	import * as Tooltip from '$lib/components/ui/tooltip';
+	import { ReasoningEffort, MessageRole } from '$lib/enums';
+	import { REASONING_EFFORT_TOKENS } from '$lib/constants/reasoning-effort-tokens';
+	import { REASONING_EFFORT_LEVELS } from '$lib/constants/reasoning-effort';
+	import type { ReasoningEffortLevel } from '$lib/types';
+	import {
+		modelsStore,
+		checkModelSupportsThinking,
+		supportsThinking,
+		propsCacheVersion,
+		loadedModelIds
+	} from '$lib/stores/models.svelte';
+	import { chatStore } from '$lib/stores/chat.svelte';
+	import { conversationsStore, activeMessages } from '$lib/stores/conversations.svelte';
+	import { isRouterMode } from '$lib/stores/server.svelte';
+	import type { DatabaseMessage } from '$lib/types/database';
+
+	let thinkingEnabled = $derived(conversationsStore.getThinkingEnabled());
+	let currentEffort = $derived(conversationsStore.getReasoningEffort());
+	let isOff = $derived(!thinkingEnabled);
+	let subOpen = $state(false);
+
+	// Get conversation model from message history
+	let conversationModel = $derived(
+		chatStore.getConversationModel(activeMessages() as DatabaseMessage[])
+	);
+
+	let modelSupportsThinkingFromMessages = $derived.by(() => {
+		const modelId = isRouterMode() ? modelsStore.selectedModelName || conversationModel : null;
+		if (!modelId) return false;
+
+		const messages = conversationsStore.activeMessages;
+
+		return messages.some(
+			(m: DatabaseMessage) =>
+				m.role === MessageRole.ASSISTANT && m.model === modelId && !!m.reasoningContent
+		);
+	});
+
+	let modelSupportsThinking = $derived.by(() => {
+		loadedModelIds();
+		propsCacheVersion();
+
+		if (isRouterMode()) {
+			const modelId = modelsStore.selectedModelName || conversationModel;
+			return checkModelSupportsThinking(modelId ?? '') || modelSupportsThinkingFromMessages;
+		}
+
+		return supportsThinking() || modelSupportsThinkingFromMessages;
+	});
+
+	function isSelected(item: ReasoningEffortLevel): boolean {
+		if (item.isOff) return isOff;
+
+		return thinkingEnabled && currentEffort === item.value;
+	}
+
+	function handleSelection(item: ReasoningEffortLevel) {
+		if (item.isOff) {
+			conversationsStore.setThinkingEnabled(false);
+		} else {
+			conversationsStore.setThinkingEnabled(true);
+			conversationsStore.setReasoningEffort(item.value as ReasoningEffort);
+		}
+		subOpen = false;
+	}
+</script>
+
+{#if modelSupportsThinking}
+	<DropdownMenu.Sub bind:open={subOpen}>
+		<DropdownMenu.SubTrigger
+			class="flex cursor-pointer items-center gap-2 rounded-md px-2.5 py-1.5 text-sm transition-colors outline-none hover:bg-accent focus:bg-accent"
+		>
+			{#if thinkingEnabled}
+				<Lightbulb class="h-4 w-4 shrink-0 text-amber-400" />
+			{:else}
+				<LightbulbOff class="h-4 w-4 shrink-0 text-muted-foreground" />
+			{/if}
+
+			<span class="flex-1">Thinking</span>
+
+			{#if thinkingEnabled}
+				<span class="text-xs text-muted-foreground">{currentEffort}</span>
+			{:else}
+				<span class="text-xs text-muted-foreground">off</span>
+			{/if}
+		</DropdownMenu.SubTrigger>
+
+		<DropdownMenu.SubContent
+			class="w-60 rounded-xl bg-popover p-3 text-popover-foreground shadow-md outline-none data-[side=bottom]:slide-in-from-top-2 data-[side=left]:slide-in-from-right-2 data-[side=right]:slide-in-from-left-2 data-[side=top]:slide-in-from-bottom-2 data-[state=closed]:animate-out data-[state=closed]:fade-out-0 data-[state=closed]:zoom-out-95 data-[state=open]:animate-in data-[state=open]:fade-in-0 data-[state=open]:zoom-in-95"
+		>
+			{#each REASONING_EFFORT_LEVELS as level (level.value)}
+				<button
+					type="button"
+					class="flex w-full cursor-pointer items-center gap-2 rounded-lg px-2.5 py-2 text-left text-sm transition-colors hover:bg-accent"
+					class:bg-accent={isSelected(level)}
+					onclick={() => handleSelection(level)}
+				>
+					{#if isSelected(level)}
+						<Check class="h-4 w-4 shrink-0 text-foreground" />
+					{:else}
+						<div class="h-4 w-4 shrink-0"></div>
+					{/if}
+
+					<span class="flex-1">{level.label}</span>
+
+					{#if !level.isOff}
+						<span class="text-[11px] text-muted-foreground opacity-60">
+							{REASONING_EFFORT_TOKENS[level.value] === -1
+								? 'Unlimited'
+								: `Max ${REASONING_EFFORT_TOKENS[level.value].toLocaleString()} tokens`}
+						</span>
+					{/if}
+
+					{#if level.hasInfo}
+						<Tooltip.Root>
+							<Tooltip.Trigger>
+								<Info class="h-3.5 w-3.5 shrink-0 text-muted-foreground" />
+							</Tooltip.Trigger>
+							<Tooltip.Content side="left">
+								<p>Maximum thinking effort with extended context usage</p>
+							</Tooltip.Content>
+						</Tooltip.Root>
+					{/if}
+				</button>
+			{/each}
+		</DropdownMenu.SubContent>
+	</DropdownMenu.Sub>
+{/if}
--- a/examples/server/webui_llamacpp/src/lib/components/app/chat/ChatForm/ChatFormActions/ChatFormReasoningToggle.svelte
+++ b/examples/server/webui_llamacpp/src/lib/components/app/chat/ChatForm/ChatFormActions/ChatFormReasoningToggle.svelte
@@ -0,0 +1,145 @@
+<script lang="ts">
+	import { Lightbulb, LightbulbOff, Check, Info } from '@lucide/svelte';
+	import * as DropdownMenu from '$lib/components/ui/dropdown-menu';
+	import * as Tooltip from '$lib/components/ui/tooltip';
+	import { ReasoningEffort, MessageRole } from '$lib/enums';
+	import { REASONING_EFFORT_TOKENS } from '$lib/constants/reasoning-effort-tokens';
+	import { REASONING_EFFORT_LEVELS } from '$lib/constants/reasoning-effort';
+	import type { ReasoningEffortLevel } from '$lib/types';
+	import {
+		modelsStore,
+		checkModelSupportsThinking,
+		supportsThinking,
+		propsCacheVersion,
+		loadedModelIds
+	} from '$lib/stores/models.svelte';
+	import { chatStore } from '$lib/stores/chat.svelte';
+	import { conversationsStore, activeMessages } from '$lib/stores/conversations.svelte';
+	import { isRouterMode } from '$lib/stores/server.svelte';
+	import type { DatabaseMessage } from '$lib/types/database';
+
+	let thinkingEnabled = $derived(conversationsStore.getThinkingEnabled());
+	let currentEffort = $derived(conversationsStore.getReasoningEffort());
+	let isOff = $derived(!thinkingEnabled);
+	let tooltipText = $derived(thinkingEnabled ? `${currentEffort} Reasoning` : 'Disabled Reasoning');
+	let subOpen = $state(false);
+
+	// Get conversation model from message history
+	let conversationModel = $derived(
+		chatStore.getConversationModel(activeMessages() as DatabaseMessage[])
+	);
+
+	// Fallback: if model props aren't available, check if any assistant messages
+	// for this model in the active conversation have reasoning content.
+	let modelSupportsThinkingFromMessages = $derived.by(() => {
+		const modelId = isRouterMode() ? modelsStore.selectedModelName || conversationModel : null;
+		if (!modelId) return false;
+		const messages = conversationsStore.activeMessages;
+		return messages.some(
+			(m: DatabaseMessage) =>
+				m.role === MessageRole.ASSISTANT && m.model === modelId && !!m.reasoningContent
+		);
+	});
+
+	// Check if model supports thinking. Primary: chat template from /props.
+	// Fallback: message history (reasoning content in assistant messages).
+	let modelSupportsThinking = $derived.by(() => {
+		loadedModelIds();
+		propsCacheVersion();
+
+		if (isRouterMode()) {
+			const modelId = modelsStore.selectedModelName || conversationModel;
+			return checkModelSupportsThinking(modelId ?? '') || modelSupportsThinkingFromMessages;
+		}
+
+		// In non-router mode, use the built-in supportsThinking
+		return supportsThinking() || modelSupportsThinkingFromMessages;
+	});
+
+	// Check if current item is selected
+	function isSelected(item: ReasoningEffortLevel): boolean {
+		if (item.isOff) {
+			return isOff;
+		}
+		return thinkingEnabled && currentEffort === item.value;
+	}
+
+	function handleSelection(item: ReasoningEffortLevel) {
+		if (item.isOff) {
+			conversationsStore.setThinkingEnabled(false);
+		} else {
+			conversationsStore.setThinkingEnabled(true);
+			conversationsStore.setReasoningEffort(item.value as ReasoningEffort);
+		}
+		subOpen = false;
+	}
+</script>
+
+{#if modelSupportsThinking}
+	<DropdownMenu.Root bind:open={subOpen}>
+		<Tooltip.Root>
+			<Tooltip.Trigger>
+				<DropdownMenu.Trigger
+					class={[
+						'flex h-6 w-6 cursor-pointer items-center justify-center rounded-full p-0 transition-colors focus:outline-none focus-visible:ring-2 focus-visible:ring-ring focus-visible:ring-offset-2',
+						thinkingEnabled ? 'bg-amber-400/10 hover:bg-amber-400/20' : 'bg-muted'
+					]}
+					aria-label={`${tooltipText}. Click to configure.`}
+				>
+					{#if thinkingEnabled}
+						<Lightbulb class="h-3 w-3 text-amber-400" />
+					{:else}
+						<LightbulbOff class="h-3 w-3 text-muted-foreground" />
+					{/if}
+				</DropdownMenu.Trigger>
+			</Tooltip.Trigger>
+
+			<Tooltip.Content>
+				<p class="capitalize">{tooltipText}</p>
+			</Tooltip.Content>
+		</Tooltip.Root>
+
+		<DropdownMenu.Content
+			align="start"
+			class="w-60 rounded-xl bg-popover p-3 text-popover-foreground shadow-md outline-none"
+		>
+			<div class="mb-2 px-2.5 text-sm font-medium">Reasoning effort</div>
+
+			{#each REASONING_EFFORT_LEVELS as level (level.value)}
+				<button
+					type="button"
+					class="flex w-full cursor-pointer items-center gap-2 rounded-lg px-2.5 py-2 text-left text-sm transition-colors hover:bg-accent"
+					class:bg-accent={isSelected(level)}
+					onclick={() => handleSelection(level)}
+				>
+					{#if isSelected(level)}
+						<Check class="h-4 w-4 shrink-0 text-foreground" />
+					{:else}
+						<div class="h-4 w-4 shrink-0"></div>
+					{/if}
+
+					<span class="flex-1">{level.label}</span>
+
+					{#if !level.isOff}
+						<span class="text-[11px] text-muted-foreground opacity-60">
+							{REASONING_EFFORT_TOKENS[level.value] === -1
+								? 'Unlimited'
+								: `Max ${REASONING_EFFORT_TOKENS[level.value].toLocaleString()} tokens`}
+						</span>
+					{/if}
+
+					{#if level.hasInfo}
+						<Tooltip.Root>
+							<Tooltip.Trigger>
+								<Info class="h-3.5 w-3.5 shrink-0 text-muted-foreground" />
+							</Tooltip.Trigger>
+							<Tooltip.Content side="left">
+								<p>Maximum reasoning effort with extended context usage</p>
+							</Tooltip.Content>
+						</Tooltip.Root>
+					{/if}
+				</button>
+			{/each}
+		</DropdownMenu.Content>
+	</DropdownMenu.Root>
+{/if}
--- a/examples/server/webui_llamacpp/src/lib/components/app/chat/ChatForm/ChatFormFileInputInvisible.svelte
+++ b/examples/server/webui_llamacpp/src/lib/components/app/chat/ChatForm/ChatFormFileInputInvisible.svelte
@@ -1,31 +1,21 @@
 <script lang="ts">
-	import { generateModalityAwareAcceptString } from '$lib/utils/modality-file-validation';
-
 	interface Props {
-		accept?: string;
 		class?: string;
 		multiple?: boolean;
 		onFileSelect?: (files: File[]) => void;
 	}

-	let {
-		accept = $bindable(),
-		class: className = '',
-		multiple = true,
-		onFileSelect
-	}: Props = $props();
+	let { class: className = '', multiple = true, onFileSelect }: Props = $props();

 	let fileInputElement: HTMLInputElement | undefined;

-	// Use modality-aware accept string by default, but allow override
-	let finalAccept = $derived(accept ?? generateModalityAwareAcceptString());
-
 	export function click() {
 		fileInputElement?.click();
 	}

 	function handleFileSelect(event: Event) {
 		const input = event.target as HTMLInputElement;
+
 		if (input.files) {
 			onFileSelect?.(Array.from(input.files));
 		}
@@ -36,7 +26,6 @@
 	bind:this={fileInputElement}
 	type="file"
 	{multiple}
-	accept={finalAccept}
 	onchange={handleFileSelect}
 	class="hidden {className}"
 />
--- a/examples/server/webui_llamacpp/src/lib/components/app/chat/ChatForm/ChatFormHelperText.svelte
+++ b/examples/server/webui_llamacpp/src/lib/components/app/chat/ChatForm/ChatFormHelperText.svelte
@@ -1,17 +0,0 @@
-<script lang="ts">
-	interface Props {
-		class?: string;
-		show?: boolean;
-	}
-
-	let { class: className = '', show = true }: Props = $props();
-</script>
-
-{#if show}
-	<div class="mt-4 flex items-center justify-center {className}">
-		<p class="text-xs text-muted-foreground">
-			Press <kbd class="rounded bg-muted px-1 py-0.5 font-mono text-xs">Enter</kbd> to send,
-			<kbd class="rounded bg-muted px-1 py-0.5 font-mono text-xs">Shift + Enter</kbd> for new line
-		</p>
-	</div>
-{/if}
--- a/examples/server/webui_llamacpp/src/lib/components/app/chat/ChatForm/ChatFormMcpResourcesList.svelte
+++ b/examples/server/webui_llamacpp/src/lib/components/app/chat/ChatForm/ChatFormMcpResourcesList.svelte
@@ -0,0 +1,44 @@
+<script lang="ts">
+	import { mcpStore } from '$lib/stores/mcp.svelte';
+	import {
+		mcpResourceAttachments,
+		mcpHasResourceAttachments
+	} from '$lib/stores/mcp-resources.svelte';
+	import {
+		ChatAttachmentsListItemMcpResource,
+		HorizontalScrollCarousel
+	} from '$lib/components/app';
+
+	interface Props {
+		class?: string;
+		onResourceClick?: (uri: string) => void;
+	}
+
+	let { class: className, onResourceClick }: Props = $props();
+
+	const attachments = $derived(mcpResourceAttachments());
+	const hasAttachments = $derived(mcpHasResourceAttachments());
+
+	function handleRemove(attachmentId: string) {
+		mcpStore.removeResourceAttachment(attachmentId);
+	}
+
+	function handleResourceClick(uri: string) {
+		onResourceClick?.(uri);
+	}
+</script>
+
+{#if hasAttachments}
+	<div class={className}>
+		<HorizontalScrollCarousel gapSize="2">
+			{#each attachments as attachment, i (attachment.id)}
+				<ChatAttachmentsListItemMcpResource
+					class={i === 0 ? 'ml-3' : ''}
+					{attachment}
+					onRemove={handleRemove}
+					onclick={() => handleResourceClick(attachment.resource.uri)}
+				/>
+			{/each}
+		</HorizontalScrollCarousel>
+	</div>
+{/if}
--- a/examples/server/webui_llamacpp/src/lib/components/app/chat/ChatForm/ChatFormModelSelector.svelte
+++ b/examples/server/webui_llamacpp/src/lib/components/app/chat/ChatForm/ChatFormModelSelector.svelte
@@ -1,352 +0,0 @@
-<script lang="ts">
-	import { onMount, tick } from 'svelte';
-	import { ChevronDown, Loader2 } from '@lucide/svelte';
-	import { cn } from '$lib/components/ui/utils';
-	import { portalToBody } from '$lib/utils/portal-to-body';
-	import {
-		fetchModels,
-		modelOptions,
-		modelsError,
-		modelsLoading,
-		modelsUpdating,
-		selectModel,
-		selectedModelId
-	} from '$lib/stores/models.svelte';
-	import type { ModelOption } from '$lib/types/models';
-
-	interface Props {
-		class?: string;
-	}
-
-	let { class: className = '' }: Props = $props();
-
-	let options = $derived(modelOptions());
-	let loading = $derived(modelsLoading());
-	let updating = $derived(modelsUpdating());
-	let error = $derived(modelsError());
-	let activeId = $derived(selectedModelId());
-
-	let isMounted = $state(false);
-	let isOpen = $state(false);
-	let container: HTMLDivElement | null = null;
-	let triggerButton = $state<HTMLButtonElement | null>(null);
-	let menuRef = $state<HTMLDivElement | null>(null);
-	let menuPosition = $state<{
-		top: number;
-		left: number;
-		width: number;
-		placement: 'top' | 'bottom';
-		maxHeight: number;
-	} | null>(null);
-	let lockedWidth: number | null = null;
-
-	onMount(async () => {
-		try {
-			await fetchModels();
-		} catch (error) {
-			console.error('Unable to load models:', error);
-		} finally {
-			isMounted = true;
-		}
-	});
-
-	function handlePointerDown(event: PointerEvent) {
-		if (!container) return;
-
-		const target = event.target as Node | null;
-
-		if (target && !container.contains(target) && !(menuRef && menuRef.contains(target))) {
-			closeMenu();
-		}
-	}
-
-	function handleKeydown(event: KeyboardEvent) {
-		if (event.key === 'Escape') {
-			closeMenu();
-		}
-	}
-
-	function handleResize() {
-		if (isOpen) {
-			updateMenuPosition();
-		}
-	}
-
-	async function handleSelect(value: string | undefined) {
-		if (!value) return;
-
-		const option = options.find((item) => item.id === value);
-		if (!option) {
-			console.error('Model is no longer available');
-			return;
-		}
-
-		try {
-			await selectModel(option.id);
-		} catch (error) {
-			console.error('Failed to switch model:', error);
-		}
-	}
-
-	const VIEWPORT_GUTTER = 8;
-	const MENU_OFFSET = 6;
-	const MENU_MAX_WIDTH = 320;
-
-	async function openMenu() {
-		if (loading || updating) return;
-
-		isOpen = true;
-		await tick();
-		updateMenuPosition();
-		requestAnimationFrame(() => updateMenuPosition());
-	}
-
-	function toggleOpen() {
-		if (loading || updating) return;
-
-		if (isOpen) {
-			closeMenu();
-		} else {
-			void openMenu();
-		}
-	}
-
-	function closeMenu() {
-		if (!isOpen) return;
-
-		isOpen = false;
-		menuPosition = null;
-		lockedWidth = null;
-	}
-
-	async function handleOptionSelect(optionId: string) {
-		try {
-			await handleSelect(optionId);
-		} finally {
-			closeMenu();
-		}
-	}
-
-	$effect(() => {
-		if (loading || updating) {
-			closeMenu();
-		}
-	});
-
-	$effect(() => {
-		const optionCount = options.length;
-
-		if (!isOpen || optionCount <= 0) return;
-
-		queueMicrotask(() => updateMenuPosition());
-	});
-
-	function updateMenuPosition() {
-		if (!isOpen || !triggerButton || !menuRef) return;
-
-		const triggerRect = triggerButton.getBoundingClientRect();
-		const viewportWidth = window.innerWidth;
-		const viewportHeight = window.innerHeight;
-
-		if (viewportWidth === 0 || viewportHeight === 0) return;
-
-		const scrollWidth = menuRef.scrollWidth;
-		const scrollHeight = menuRef.scrollHeight;
-
-		const availableWidth = Math.max(0, viewportWidth - VIEWPORT_GUTTER * 2);
-		const constrainedMaxWidth = Math.min(MENU_MAX_WIDTH, availableWidth || MENU_MAX_WIDTH);
-		const safeMaxWidth =
-			constrainedMaxWidth > 0 ? constrainedMaxWidth : Math.min(MENU_MAX_WIDTH, viewportWidth);
-		const desiredMinWidth = Math.min(160, safeMaxWidth || 160);
-
-		let width = lockedWidth;
-		if (width === null) {
-			const naturalWidth = Math.min(scrollWidth, safeMaxWidth);
-			const baseWidth = Math.max(triggerRect.width, naturalWidth, desiredMinWidth);
-			width = Math.min(baseWidth, safeMaxWidth || baseWidth);
-			lockedWidth = width;
-		} else {
-			width = Math.min(Math.max(width, desiredMinWidth), safeMaxWidth || width);
-		}
-
-		if (width > 0) {
-			menuRef.style.width = `${width}px`;
-		}
-
-		const availableBelow = Math.max(
-			0,
-			viewportHeight - VIEWPORT_GUTTER - triggerRect.bottom - MENU_OFFSET
-		);
-		const availableAbove = Math.max(0, triggerRect.top - VIEWPORT_GUTTER - MENU_OFFSET);
-		const viewportAllowance = Math.max(0, viewportHeight - VIEWPORT_GUTTER * 2);
-		const fallbackAllowance = Math.max(1, viewportAllowance > 0 ? viewportAllowance : scrollHeight);
-
-		function computePlacement(placement: 'top' | 'bottom') {
-			const available = placement === 'bottom' ? availableBelow : availableAbove;
-			const allowedHeight =
-				available > 0 ? Math.min(available, fallbackAllowance) : fallbackAllowance;
-			const maxHeight = Math.min(scrollHeight, allowedHeight);
-			const height = Math.max(0, maxHeight);
-
-			let top: number;
-			if (placement === 'bottom') {
-				const rawTop = triggerRect.bottom + MENU_OFFSET;
-				const minTop = VIEWPORT_GUTTER;
-				const maxTop = viewportHeight - VIEWPORT_GUTTER - height;
-				if (maxTop < minTop) {
-					top = minTop;
-				} else {
-					top = Math.min(Math.max(rawTop, minTop), maxTop);
-				}
-			} else {
-				const rawTop = triggerRect.top - MENU_OFFSET - height;
-				const minTop = VIEWPORT_GUTTER;
-				const maxTop = viewportHeight - VIEWPORT_GUTTER - height;
-				if (maxTop < minTop) {
-					top = minTop;
-				} else {
-					top = Math.max(Math.min(rawTop, maxTop), minTop);
-				}
-			}
-
-			return { placement, top, height, maxHeight };
-		}
-
-		const belowMetrics = computePlacement('bottom');
-		const aboveMetrics = computePlacement('top');
-
-		let metrics = belowMetrics;
-		if (scrollHeight > belowMetrics.maxHeight && aboveMetrics.maxHeight > belowMetrics.maxHeight) {
-			metrics = aboveMetrics;
-		}
-
-		menuRef.style.maxHeight = metrics.maxHeight > 0 ? `${Math.round(metrics.maxHeight)}px` : '';
-
-		let left = triggerRect.right - width;
-		const maxLeft = viewportWidth - VIEWPORT_GUTTER - width;
-		if (maxLeft < VIEWPORT_GUTTER) {
-			left = VIEWPORT_GUTTER;
-		} else {
-			if (left > maxLeft) {
-				left = maxLeft;
-			}
-			if (left < VIEWPORT_GUTTER) {
-				left = VIEWPORT_GUTTER;
-			}
-		}
-
-		menuPosition = {
-			top: Math.round(metrics.top),
-			left: Math.round(left),
-			width: Math.round(width),
-			placement: metrics.placement,
-			maxHeight: Math.round(metrics.maxHeight)
-		};
-	}
-
-	function getDisplayOption(): ModelOption | undefined {
-		if (activeId) {
-			return options.find((option) => option.id === activeId);
-		}
-
-		return options[0];
-	}
-</script>
-
-<svelte:window onresize={handleResize} />
-
-<svelte:document onpointerdown={handlePointerDown} onkeydown={handleKeydown} />
-
-<div
-	class={cn('relative z-10 flex max-w-[200px] min-w-[120px] flex-col items-end gap-1', className)}
-	bind:this={container}
->
-	{#if loading && options.length === 0 && !isMounted}
-		<div class="flex items-center gap-2 text-xs text-muted-foreground">
-			<Loader2 class="h-4 w-4 animate-spin" />
-			Loading models…
-		</div>
-	{:else if options.length === 0}
-		<p class="text-xs text-muted-foreground">No models available.</p>
-	{:else}
-		{@const selectedOption = getDisplayOption()}
-
-		<div class="relative w-full">
-			<button
-				type="button"
-				class={cn(
-					'flex w-full items-center justify-end gap-2 rounded-md px-2 py-1 text-sm text-muted-foreground transition hover:text-foreground focus:outline-none focus-visible:ring-2 focus-visible:ring-ring focus-visible:ring-offset-2 disabled:cursor-not-allowed disabled:opacity-60',
-					isOpen ? 'text-foreground' : ''
-				)}
-				aria-haspopup="listbox"
-				aria-expanded={isOpen}
-				onclick={toggleOpen}
-				bind:this={triggerButton}
-				disabled={loading || updating}
-			>
-				<span class="max-w-[160px] truncate text-right font-medium">
-					{selectedOption?.name || 'Select model'}
-				</span>
-
-				{#if updating}
-					<Loader2 class="h-3.5 w-3.5 animate-spin text-muted-foreground" />
-				{:else}
-					<ChevronDown
-						class={cn(
-							'h-4 w-4 text-muted-foreground transition-transform',
-							isOpen ? 'rotate-180 text-foreground' : ''
-						)}
-					/>
-				{/if}
-			</button>
-
-			{#if isOpen}
-				<div
-					bind:this={menuRef}
-					use:portalToBody
-					class={cn(
-						'fixed z-[1000] overflow-hidden rounded-md border bg-popover shadow-lg transition-opacity',
-						menuPosition ? 'opacity-100' : 'pointer-events-none opacity-0'
-					)}
-					role="listbox"
-					style:top={menuPosition ? `${menuPosition.top}px` : undefined}
-					style:left={menuPosition ? `${menuPosition.left}px` : undefined}
-					style:width={menuPosition ? `${menuPosition.width}px` : undefined}
-					data-placement={menuPosition?.placement ?? 'bottom'}
-				>
-					<div
-						class="overflow-y-auto py-1"
-						style:max-height={menuPosition && menuPosition.maxHeight > 0
-							? `${menuPosition.maxHeight}px`
-							: undefined}
-					>
-						{#each options as option (option.id)}
-							<button
-								type="button"
-								class={cn(
-									'flex w-full flex-col items-start gap-0.5 px-3 py-2 text-left text-sm transition hover:bg-muted focus:bg-muted focus:outline-none',
-									option.id === selectedOption?.id ? 'bg-accent text-accent-foreground' : ''
-								)}
-								role="option"
-								aria-selected={option.id === selectedOption?.id}
-								onclick={() => handleOptionSelect(option.id)}
-							>
-								<span class="block w-full truncate font-medium" title={option.name}>
-									{option.name}
-								</span>
-
-								{#if option.description}
-									<span class="text-xs text-muted-foreground">{option.description}</span>
-								{/if}
-							</button>
-						{/each}
-					</div>
-				</div>
-			{/if}
-		</div>
-	{/if}
-
-	{#if error}
-		<p class="text-xs text-destructive">{error}</p>
-	{/if}
-</div>
--- a/examples/server/webui_llamacpp/src/lib/components/app/chat/ChatForm/ChatFormPickers/ChatFormPicker/ChatFormPickerItemHeader.svelte
+++ b/examples/server/webui_llamacpp/src/lib/components/app/chat/ChatForm/ChatFormPickers/ChatFormPicker/ChatFormPickerItemHeader.svelte
@@ -0,0 +1,55 @@
+<script lang="ts">
+	import type { Snippet } from 'svelte';
+	import type { MCPServerSettingsEntry } from '$lib/types';
+	import { mcpStore } from '$lib/stores/mcp.svelte';
+
+	interface Props {
+		server: MCPServerSettingsEntry | undefined;
+		serverLabel: string;
+		title: string;
+		description?: string;
+		titleExtra?: Snippet;
+		subtitle?: Snippet;
+	}
+
+	let { server, serverLabel, title, description, titleExtra, subtitle }: Props = $props();
+
+	let faviconUrl = $derived(server ? mcpStore.getServerFavicon(server.id) : null);
+</script>
+
+<div class="min-w-0 flex-1">
+	<div class="mb-0.5 flex items-center gap-1.5 text-xs text-muted-foreground">
+		{#if faviconUrl}
+			<img
+				src={faviconUrl}
+				alt=""
+				class="h-3 w-3 shrink-0 rounded-sm"
+				onerror={(e) => {
+					(e.currentTarget as HTMLImageElement).style.display = 'none';
+				}}
+			/>
+		{/if}
+
+		<span>{serverLabel}</span>
+	</div>
+
+	<div class="flex items-center gap-2">
+		<span class="font-medium">
+			{title}
+		</span>
+
+		{#if titleExtra}
+			{@render titleExtra()}
+		{/if}
+	</div>
+
+	{#if description}
+		<p class="mt-0.5 truncate text-sm text-muted-foreground">
+			{description}
+		</p>
+	{/if}
+
+	{#if subtitle}
+		{@render subtitle()}
+	{/if}
+</div>