Mirror of https://github.com/ggml-org/llama.cpp.git (synced 2026-06-27 23:50:20 -05:00)
* unicode,test: add Qwen3.5 non-backtracking tokenizer handler and regression tests
- Add unicode_regex_split_custom_qwen35() to [src/unicode.cpp](src/unicode.cpp), a non-backtracking handler for Qwen3.5's [\p{L}\p{M}]+ regex (letters + combining marks).
- Register the handler in the custom tokenizer dispatch table to prevent stack overflows on long inputs (fixes #21919).
- Add [models/ggml-vocab-qwen35.gguf](models/ggml-vocab-qwen35.gguf) (test vocab), [models/ggml-vocab-qwen35.gguf.inp](models/ggml-vocab-qwen35.gguf.inp) (test cases), and [models/ggml-vocab-qwen35.gguf.out](models/ggml-vocab-qwen35.gguf.out) (expected output) for regression testing.
- Update [tests/CMakeLists.txt](tests/CMakeLists.txt) to include the new test entry.
This mirrors the Qwen2 fix (commit 0d049d6), adapted to Qwen3.5's regex. It ensures robust Unicode tokenization and prevents std::regex stack overflows.
Closes #21919.
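The idea behind a non-backtracking handler for `[\p{L}\p{M}]+` can be sketched as a single forward scan over code points. This is a minimal illustration, not the code in src/unicode.cpp: `is_letter`, `is_mark`, and `split_letter_mark_runs` are hypothetical names, the classifiers cover only ASCII/Latin-1 letters and the U+0300..U+036F combining marks, and every other code point is emitted as its own piece, whereas the real pre-tokenizer regex has further alternatives.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Toy classifiers (assumption): a real handler would consult full Unicode
// category tables. These cover only ASCII/Latin-1 letters and the
// combining diacritical marks block U+0300..U+036F.
static bool is_letter(char32_t c) {
    return (c >= U'A' && c <= U'Z') || (c >= U'a' && c <= U'z') ||
           (c >= 0xC0 && c <= 0xFF && c != 0xD7 && c != 0xF7);
}

static bool is_mark(char32_t c) {
    return c >= 0x0300 && c <= 0x036F;
}

// Returns (offset, length) pairs in code points. A maximal run of letters
// and marks becomes one piece; any other code point becomes a piece of
// length 1. The scan is iterative, so stack usage does not grow with the
// length of the input.
static std::vector<std::pair<std::size_t, std::size_t>>
split_letter_mark_runs(const std::u32string & text) {
    std::vector<std::pair<std::size_t, std::size_t>> runs;
    std::size_t pos = 0;
    while (pos < text.size()) {
        std::size_t end = pos;
        while (end < text.size() && (is_letter(text[end]) || is_mark(text[end]))) {
            ++end;
        }
        if (end == pos) {
            ++end; // neither letter nor mark: emit a single code point
        }
        runs.emplace_back(pos, end - pos);
        pos = end;
    }
    return runs;
}
```

Because each code point is visited once and nothing recurses, a run of a million letters costs the same stack depth as a run of one, which is the property the commit relies on to avoid the std::regex overflow.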
* fix: enhance regex handling for Qwen3.5 tokenizer to include accent marks
* cont : remove trailing whitespace
---------
Co-authored-by: Kabir <kabir@example.com>
Co-authored-by: Alde Rojas <hello@alde.dev>
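The test-case file shown below separates inputs with a `__ggml_vocab_test__` line. A minimal sketch of splitting such a file into cases follows; it assumes the separator always sits on its own line and that a case may itself be empty or span several lines. `split_vocab_tests` is a hypothetical helper for illustration, not the function the llama.cpp test harness uses.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Splits the contents of a vocab test input file into individual cases.
// Assumption: cases are delimited by "\n__ggml_vocab_test__\n".
static std::vector<std::string> split_vocab_tests(const std::string & contents) {
    static const std::string sep = "\n__ggml_vocab_test__\n";
    std::vector<std::string> cases;
    std::size_t start = 0;
    while (start < contents.size()) {
        const std::size_t hit = contents.find(sep, start);
        if (hit == std::string::npos) {
            // No trailing separator: the remainder is the last case.
            cases.push_back(contents.substr(start));
            break;
        }
        // An empty slice is kept: empty input is a legitimate test case.
        cases.push_back(contents.substr(start, hit - start));
        start = hit + sep.size();
    }
    return cases;
}
```

Keeping empty and whitespace-only slices matters here, since several cases in the file consist only of spaces, tabs, or newlines.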
121 lines
2.6 KiB
Plaintext
ied 4 ½ months
__ggml_vocab_test__
Äpfel
__ggml_vocab_test__

__ggml_vocab_test__

__ggml_vocab_test__

__ggml_vocab_test__

__ggml_vocab_test__

__ggml_vocab_test__


__ggml_vocab_test__



__ggml_vocab_test__




__ggml_vocab_test__


__ggml_vocab_test__
Hello world
__ggml_vocab_test__
Hello world
__ggml_vocab_test__
Hello World
__ggml_vocab_test__
Hello World
__ggml_vocab_test__
Hello World!
__ggml_vocab_test__
Hello, world!
__ggml_vocab_test__
Hello, world!
__ggml_vocab_test__
this is 🦙.cpp
__ggml_vocab_test__
w048 7tuijk dsdfhu
__ggml_vocab_test__
нещо на Български
__ggml_vocab_test__
កាន់តែពិសេសអាចខលចេញ
__ggml_vocab_test__
🚀 (normal) 😶🌫️ (multiple emojis concatenated) ✅ (only emoji that has its own token)
__ggml_vocab_test__
Hello
__ggml_vocab_test__
Hello
__ggml_vocab_test__
Hello
__ggml_vocab_test__
Hello
__ggml_vocab_test__
Hello
__ggml_vocab_test__
Hello
Hello
__ggml_vocab_test__
(
__ggml_vocab_test__

=
__ggml_vocab_test__
' era
__ggml_vocab_test__
Hello, y'all! How are you 😁 ?我想在apple工作1314151天~
__ggml_vocab_test__
!!!!!!
__ggml_vocab_test__
3
__ggml_vocab_test__
33
__ggml_vocab_test__
333
__ggml_vocab_test__
3333
__ggml_vocab_test__
33333
__ggml_vocab_test__
333333
__ggml_vocab_test__
3333333
__ggml_vocab_test__
33333333
__ggml_vocab_test__
333333333
__ggml_vocab_test__
Cửa Việt
__ggml_vocab_test__
discards
__ggml_vocab_test__











🚀 (normal) 😶🌫️ (multiple emojis concatenated) ✅ 🦙🦙 3 33 333 3333 33333 333333 3333333 33333333 3.3 3..3 3...3 កាន់តែពិសេសអាច😁 ?我想在apple工作1314151天~ ------======= нещо на Български ''''''```````""""......!!!!!!?????? I've been 'told he's there, 'RE you sure? 'M not sure I'll make it, 'D you like some tea? We'Ve a'lL
__ggml_vocab_test__
é
__ggml_vocab_test__
résumé
__ggml_vocab_test__
àààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààà
__ggml_vocab_test__
Vieết Nam
__ggml_vocab_test__