Reasoning off mode issue

#30

by GergelyZsolt - opened 28 days ago

•

I tried multiple qwen models with llama.cpp in --reasoning off mode, and it happens very often that an orphaned </think> tag appears. It did not do this with other chat templates.

kaotd

28 days ago

here too:

config.ini:

[*]
n-gpu-layers = all
ctx-size = 65536
threads = 18
batch-size  = 2048
ubatch-size = 1024

parallel = 2
mlock = true
mmap = true
; no-mmap = true
flash-attn = true

cache-type-k = q8_0
cache-type-v = q8_0
cache-type-k-draft = q8_0
cache-type-v-draft = q8_0

reasoning = false
prio = 3
seed = 3407
jinja = true

[Qwen3.6-35B-A3B:UD-Q4_K_XL]
model = /models/Qwen3.6/Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf
mmproj = /models/Qwen3.6/mmproj-F16.gguf
temperature = 0.7
top-p = 0.8
top-k = 20
min-p = 0.0
presence-penalty = 1.5
repeat-penalty = 1.0
image-min-tokens = 1024
spec-type = draft-mtp
spec-draft-n-max = 2
chat-template-file = /templates/froggeric-chat_template-v19.jinja

open-webui v0.9.5

prompt:

**Build a VRAM and KV cache calculator tool for llama.cpp server.** The tool should include the following parameters: model type (e.g., Qwen2.5-72B), bit precision (4-bit/8-bit), total `--ctx-size`, number of `--parallel` slots, and batch size. The output should display the estimated VRAM usage, KV cache allocation per slot, and warnings if this configuration exceeds physical GPU limits to avoid 400 errors or stuttering/lag when running multiple concurrent threads.

response:

I'll build a comprehensive VRAM and KV Cache Calculator for llama.cpp server. Let me first research the current understanding of these calculations to ensure accuracy.

<function=web_search>
<parameter=search_queries>
["llama.cpp VRAM calculation KV cache formula 2024", "llama.cpp --ctx-size --parallel VRAM usage calculator", "KV cache memory calculation transformer models bits per token"]
</parameter>
</function>
</tool_call>

tooltd

28 days ago

This comment has been hidden (marked as Off-Topic)

BebopVox

24 days ago

•

edited 24 days ago

Any solutions?

I have this block

BebopVox

24 days ago

Found this one, works nice:
https://huggingface.co/spiritbuun/buun-Qwen3.6-chat_template

kaotd

24 days ago

Found this one, works nice:
https://huggingface.co/spiritbuun/buun-Qwen3.6-chat_template

Thanks, I will give a try.

froggeric

Owner 19 days ago

In the v20 release, I completely overhauled the thinking toggles and state tracking to handle reasoning-off environments better. Please try the latest v20 template and see if that cleans up the orphaned tags. If you're still seeing it, you might need to update your llama.cpp server to the latest build.

BebopVox

11 days ago

👋

In the v20 release, I completely overhauled the thinking toggles and state tracking to handle reasoning-off environments better. Please try the latest v20 template and see if that cleans up the orphaned tags. If you're still seeing it, you might need to update your llama.cpp server to the latest build.

Thank you for this!

However I've been working with this template tonight and running PiehSoft/Qwen3.6-40B-Deckard-MTP model overnight. And now see these error logs:

Now let me update the test file to inline the functions.
 ⤵ 1K  ⤴ 81  cache: 95K
┌─── ✎ Write: 🟦 src/widgets/Speedometer/__tests__/Speedometer.test.ts · 1 line ──────────────────────────────────────────────────────────────────────────┐
│   1 import { describe, it, expect, vi,                                                                                                                  │
│ ✘ Diagnostics (1 error(s))                                                                                                                              │
│  └─ 🟦 src/widgets/Speedometer/__tests__/Speedometer.test.ts                                                                                            │
│    └─ ✘:1:35 '}' expected. (1005)                                                                                                                       │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
 Error: 500 Failed to parse tool call arguments as JSON: [json.exception.parse_error.101] parse error at line 1, colu… 
...
Error: Retry failed after 10 attempts: 500 Failed to parse tool call arguments as JSON: [json.exception.parse_error.101] parse error at line 1, column
 168: syntax error while parsing value - invalid string: missing closing quote; last read: '"import { describe, it, expect, vi,'

Also in llamacpp server logs I got this:

175.27.099.097 W srv    operator(): got exception: {"error":{"code":500,"message":"Failed to parse tool call arguments as JSON: [json.exception.parse_error.101] parse error at line 1, column 168: syntax error while parsing value - invalid string: missing closing quote; last read: '\"import { describe, it, expect, vi,'","type":"server_error"}}

Not sure it's a template problem but this is exactly what was highlighted to me by my assistant model.
Its said that this template has probable fix to add kwarg auto_disable_thinking_with_tools. Please have a look and maybe this template also needs it? :

https://huggingface.co/spiritbuun/buun-Qwen3.6-chat_template

BebopVox

11 days ago

This comment has been hidden (marked as Spam)

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment