toolcall inside thinking

#25

by snapo - opened May 18

May 18

I often get a tool call inside the thinking tag, even though I use your profile.

services:
  llama-server:
    image: ghcr.io/ggml-org/llama.cpp:full-cuda13-b9209
    container_name: llama-server
    restart: unless-stopped
    ports:
      - "16384:8080"
    volumes:
      - ./models:/models:ro
    command: >
      --server
      --model /models/Qwen3.6-27B-Q4_K_M-uc-mtp-v2.gguf
      --alias "Qwen3.6 27B"
      --temp 0.6
      --top-p 0.95
      --min-p 0.00
      --top-k 20
      --port 8080
      --host 0.0.0.0
      --fit off
      --ctx-size 200000
      --presence-penalty 0.0
      --repeat-penalty 1.0
      --jinja
      --chat-template-file /models/Qwen3.6-11.jinja
      --mmproj /models/Qwen3.6-27B-Q4_K_M-MTP-mmproj-f16-uc-v2.gguf
      --webui
      --spec-draft-p-min 0.75
      --spec-type draft-mtp
      --spec-draft-n-max 3
      --chat-template-kwargs '{"preserve_thinking": true}'
      --reasoning-budget 8192
      --reasoning-budget-message "... thinking budget exceeded, let's answer now.\n"
      --split-mode tensor
    user: "1000:1000"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    environment:
      - NVIDIA_VISIBLE_DEVICES=all

I tested it with your Jinja templates for Qwen 3.6, from template 11 up to template 18, and I still face this issue.
Is this a problem with opencode or a problem with the template?
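
One way to narrow down whether the client or the template is at fault is to look at a single assistant message as returned by the server, before any client touches it. The following is a minimal sketch, not a confirmed diagnosis procedure. It assumes an OpenAI-style message dict with `content`, `reasoning_content`, and `tool_calls` fields, and it assumes the model marks tool calls with a literal `<tool_call>` tag; neither assumption is confirmed in this thread. The helper name `classify_message` is hypothetical.

```python
# Assumption: tool calls are marked with this literal tag in the raw model text.
TOOL_MARKER = "<tool_call>"


def classify_message(message: dict) -> str:
    """Classify where a tool call ended up in one assistant message.

    Assumes an OpenAI-style message dict; the "reasoning_content" field
    name is an assumption about the server's response format.
    """
    reasoning = message.get("reasoning_content") or ""
    content = message.get("content") or ""
    tool_calls = message.get("tool_calls") or []

    if TOOL_MARKER in reasoning:
        # The markup is sitting in the reasoning text instead of being parsed.
        return "tool call leaked into reasoning"
    if TOOL_MARKER in content:
        # The markup reached the visible answer without being parsed.
        return "tool call left unparsed in content"
    if tool_calls:
        return "tool call parsed correctly"
    return "no tool call"
```

If the server response already shows the first or second result, the client is unlikely to be the cause, because the markup was misplaced before the client received it.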

herstrabol

May 19

@snapo
could you test my v1.0?
https://huggingface.co/StableQuant/Qwen-Templates-Rebuild-Project

snapo

May 19

@herstrabol

Because your template completely kills the thinking process. I want the thinking process to occur.

Your template behaves exactly as if I had turned thinking off.
I need thinking within the 8k thinking budget I set, because that is what makes the model so extremely good. But I don't want it to think for 65k tokens, which is why I am limiting it.

Additionally, your template produces tags that are not recognized as thinking/reasoning tags, as seen in the screenshot. I have no clue whether the tag is think or thinking; maybe it is model-specific.

snapo changed discussion status to closed May 19

snapo changed discussion status to open May 19

herstrabol

May 22

@snapo, it was fixed in v1.1.
v1.1.2 is now out, which also fixes the error case where some apps call the wrong tool.
Please test it and let me know.

snapo

27 days ago

Just want to say that the froggeric v19 templates for Qwen 3.6 solved it. I have not had a single problem since.

sanjxz

22 days ago

edited 22 days ago

Apparently the latest Claude Code / ik_llama.cpp broke this, and the normal Qwen template as well:

It outputs the tool call directly in the main text.
llama-server -m "G:\xlam2\Qwen3.6-35B-A3B-APEX-I-Balanced.gguf" -ngl 999 --chat-template-kwargs "{\"preserve_thinking\":true}" --parallel-tool-calls --prompt-cache prompt.cache --no-mmap -c 230000 --fit --fit-margin 2048 --grouped-expert-routing --cache-type-k q8_0 --cache-type-v q8_0 --k-cache-hadamard --v-cache-hadamard -np 1 -fa on --mlock -t 8 -tb 16 --merge-qkv -b 1280 -ub 1280 -muge --jinja --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0 --reasoning-budget -1 --repeat-penalty 1.0 --presence-penalty 0.0 --alias 'qwen-3-apex' --cache-ram 8192 --ctx-checkpoints 64 --ctx-checkpoints-interval 4096 -sm layer --host 127.0.0.1 --port 8080 --chat-template-file G:\xlam2\chat_template.jinja

UPDATE: This is definitely a Claude Code (CC) update issue. Pi Code works as expected.

froggeric

Owner 20 days ago

Qwen models inherently tend to bleed tool calls into thinking blocks. In the v20 release, I updated the system prompt instructions to be much stricter about closing the thinking block before emitting a tool call. Give the new template a try and let me know if it helps.
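
To check whether a given raw generation shows this bleed, the ordering of the tags can be inspected directly. The following is a rough sketch, not part of any template release. The tag strings `<think>`, `</think>`, and `<tool_call>` are assumptions about the model's markup, and the function name and the `think_prefilled` parameter are hypothetical. Only the first tool call in the text is examined.

```python
# Assumed tag strings; adjust them if the model or template uses other markup.
THINK_OPEN = "<think>"
THINK_CLOSE = "</think>"
TOOL_OPEN = "<tool_call>"


def tool_call_inside_thinking(raw: str, think_prefilled: bool = False) -> bool:
    """Return True if the first tool call starts while a thinking block is open.

    think_prefilled: set to True when the chat template already emitted the
    opening think tag in the prompt, so the raw output starts mid-thinking.
    """
    tool_pos = raw.find(TOOL_OPEN)
    if tool_pos == -1:
        return False  # no tool call at all

    before = raw[:tool_pos]
    open_pos = before.rfind(THINK_OPEN)    # -1 if no opening tag precedes the call
    close_pos = before.rfind(THINK_CLOSE)  # -1 if no closing tag precedes the call

    if open_pos > close_pos:
        # The most recent think tag before the call is an opening tag.
        return True
    # No tags before the call: thinking is open only if the prompt prefilled it.
    return think_prefilled and open_pos == -1 and close_pos == -1
```

For example, `tool_call_inside_thinking("<think>plan <tool_call>{}</tool_call></think>")` reports a bleed, while the same text with `</think>` placed before `<tool_call>` does not.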

sanjxz

20 days ago

Thank you! That update fixed my issue. CC works now.
