From 51331f49737240ec2582d4c1cede8bcf104402eb Mon Sep 17 00:00:00 2001
From: Alex
Date: Sat, 9 May 2026 01:36:38 -0400
Subject: [PATCH] Fix two speculative-decoding crashes that prevent any usage (#1760)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

This patch addresses two latent bugs in examples/speculative/speculative.cpp
that prevent llama-speculative.exe from running with greedy sampling (temp=0)
or from producing rejection-sampling output (temp>0):

1. Line 191: `params.sparams.grammar = { COMMON_GRAMMAR_TYPE_NONE, "" };`
   invokes `common_grammar(type, grammar)`, which asserts
   `type != NONE || !grammar.empty()`. Both sides of that condition are
   false for the intended-to-be-empty grammar, so every speculative run
   hits a hard `GGML_ASSERT` in common/sampling.h:63 immediately after
   model load.
   Fix: default-construct via `common_grammar{}` to bypass the field-init
   constructor.

2. Lines 293-294: `GGML_ASSERT(dist_tgt.sorted)` and
   `GGML_ASSERT(dist_dft.sorted)` fire whenever the sampler has not set
   the .sorted flag, which is the case on most modern sampler paths. The
   next ~10 lines re-sort both distributions by id explicitly, so the
   assertions check a property the code does not depend on.
   Fix: replace the asserts with an explanatory comment.

After both fixes, `llama-speculative.exe` runs to completion. The
acceptance-rate measurement at temp=0 still looks suspicious (0% across
same-family draft/target pairs), but that is a separate issue and out of
scope for this PR.

Tested with Qwen3-0.6B-IQ4_XS drafting for Qwen3-1.7B-IQ4_XS, both base
models from `bartowski/Qwen_Qwen3-*-GGUF`, on Windows with an ik_llama.cpp
build at HEAD of windows-mingw-default-win10 (which is itself a follow-up
to PR #1755).
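
For reviewers, the two failure modes described above can be sketched in
isolation. This is a minimal illustration and not the project's code: every
`sketch_*` name is a hypothetical stand-in, the precondition is copied from
the description of `common_grammar` above rather than from
common/sampling.h, and the residual step assumes the usual speculative
sampling rule (clamp `p_tgt - p_dft` at zero, then renormalize) with both
distributions covering the same token ids.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// --- Bug 1: hypothetical stand-in for the grammar type ---
enum sketch_grammar_type { SKETCH_GRAMMAR_NONE, SKETCH_GRAMMAR_USER };

struct sketch_grammar {
    sketch_grammar_type type = SKETCH_GRAMMAR_NONE;
    std::string         text;

    sketch_grammar() = default; // empty grammar, no precondition checked

    // Precondition as quoted in the commit message.
    static bool precondition(sketch_grammar_type t, const std::string & g) {
        return t != SKETCH_GRAMMAR_NONE || !g.empty();
    }

    sketch_grammar(sketch_grammar_type t, std::string g)
        : type(t), text(std::move(g)) {
        assert(precondition(type, text)); // { NONE, "" } would abort here
    }
};

// --- Bug 2: residual distribution after an explicit sort by id ---
struct sketch_token_prob { int id; float p; };

// Sorts both inputs by token id itself, so it does not matter whether the
// caller's data was sorted (or flagged as sorted) beforehand.
std::vector<sketch_token_prob> sketch_residual(std::vector<sketch_token_prob> tgt,
                                               std::vector<sketch_token_prob> dft) {
    auto by_id = [](const sketch_token_prob & a, const sketch_token_prob & b) {
        return a.id < b.id;
    };
    std::sort(tgt.begin(), tgt.end(), by_id);
    std::sort(dft.begin(), dft.end(), by_id);

    const std::size_t n = std::min(tgt.size(), dft.size());
    tgt.resize(n);

    float sum_probs = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        tgt[i].p = std::max(0.0f, tgt[i].p - dft[i].p); // clamp at zero
        sum_probs += tgt[i].p;
    }
    if (sum_probs > 0.0f) {
        for (auto & t : tgt) { t.p /= sum_probs; }       // renormalize
    }
    return tgt;
}
```

In this sketch, `sketch_grammar{}` takes the default constructor and never
evaluates the precondition, which mirrors why `common_grammar{}` avoids the
assert. `sketch_residual` gives the same result for any input ordering,
which is the reason a check on a `.sorted` flag adds nothing once the code
sorts by id itself.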
---
 examples/speculative/speculative.cpp | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/examples/speculative/speculative.cpp b/examples/speculative/speculative.cpp
index 663e0420..714a989a 100644
--- a/examples/speculative/speculative.cpp
+++ b/examples/speculative/speculative.cpp
@@ -188,7 +188,7 @@ int main(int argc, char ** argv) {
     // draft sequence data
     std::vector<seq_draft> drafts(n_seq_dft);
 
-    params.sparams.grammar = { COMMON_GRAMMAR_TYPE_NONE, ""}; // the draft samplers will copy the target sampler's grammar
+    params.sparams.grammar = common_grammar{}; // the draft samplers will copy the target sampler's grammar
     if (params.sparams.temp == 0) {
         params.sparams.temp = -1.0f; // force greedy sampling with probs for the draft model
     }
@@ -290,8 +290,8 @@ int main(int argc, char ** argv) {
                         drafts[s].active = false;
 
                         // calculate residual probability
-                        GGML_ASSERT(dist_tgt.sorted);
-                        GGML_ASSERT(dist_dft.sorted);
+                        // (the .sorted flag is unreliable across modern sampling
+                        // paths; we re-sort below regardless, so it doesn't matter.)
                         float sum_probs = 0.0f;
 
                         // sort dist by id