Aquileo | froggeric/Qwen-Fixed-Chat-Templates · preserve_thinking: false or true

#37
by vikuloff - opened

Hi,

I’m testing the v20 chat template with vLLM for a Qwen Code / agentic coding setup, and I’d like to clarify the recommended default-chat-template-kwargs configuration.

My current vLLM launch uses:

--reasoning-parser qwen3
--enable-auto-tool-choice
--tool-call-parser qwen3_coder
--default-chat-template-kwargs '{"enable_thinking": true, "preserve_thinking": true}'

With v20, I noticed the new logic around thinking preservation, deep agent loops, auto_disable_thinking_with_tools, and payload truncation options such as max_tool_arg_chars and max_tool_response_chars.

For an agentic coding workflow with Qwen Code in vLLM, what would you recommend as the optimal combination of default-chat-template-kwargs?

In particular:

  1. Should preserve_thinking be set to true or false when using v20 with vLLM?
  2. For tool-heavy agent loops, is it recommended to enable auto_disable_thinking_with_tools?
  3. What values would you suggest for max_tool_arg_chars and max_tool_response_chars?
  4. Would this be a reasonable default for vLLM agents?
{
  "enable_thinking": true,
  "preserve_thinking": false,
  "auto_disable_thinking_with_tools": true,
  "max_tool_response_chars": 12000,
  "max_tool_arg_chars": 6000
}

Or would you recommend a different configuration for best tool-calling reliability and coding-agent performance?
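To keep the flag value well-formed when trying different combinations, the kwargs can be serialized and shell-quoted programmatically. This is a minimal sketch: the helper name `kwargs_flag` is hypothetical, and the values are simply the ones proposed above, not a verified recommendation.

```python
import json
import shlex

# Values copied from the configuration proposed above; they are the
# question being asked, not a confirmed recommendation.
PROPOSED_KWARGS = {
    "enable_thinking": True,
    "preserve_thinking": False,
    "auto_disable_thinking_with_tools": True,
    "max_tool_response_chars": 12000,
    "max_tool_arg_chars": 6000,
}


def kwargs_flag(kwargs: dict) -> str:
    """Return the vLLM flag followed by a shell-quoted JSON object."""
    # json.dumps emits lowercase true/false, as JSON requires.
    payload = json.dumps(kwargs)
    # shlex.quote wraps the JSON in single quotes so the shell passes it as one argument.
    return "--default-chat-template-kwargs " + shlex.quote(payload)


print(kwargs_flag(PROPOSED_KWARGS))
```

The printed line can be pasted into the launch command in place of the hand-written flag.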

Thanks!

One more data point. With my current configuration:

{
  "enable_thinking": true,
  "preserve_thinking": true
}

I hit a hang (stalled execution) in the OpenClaw agent almost immediately after switching to v20 today.

For context, this is how I launch vLLM:

docker run -d \
  --restart always \
  --name vllm-qwen \
  --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -v ~/vllm-templates:/templates \
  --env HF_TOKEN="$HF_TOKEN" \
  --env PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
  --env VLLM_API_KEY="key" \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai:latest \
  Qwen/Qwen3.6-27B-FP8 \
  --port 8000 \
  --max-model-len 222400 \
  --chat-template /templates/chat_template.jinja \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}' \
  --compilation-config '{"max_cudagraph_capture_size":16,"mode":"VLLM_COMPILE"}' \
  --gpu-memory-utilization 0.975 \
  --enable-prefix-caching \
  --performance-mode interactivity \
  --attention-backend flashinfer \
  --kv-cache-dtype bfloat16 \
  --default-chat-template-kwargs '{"enable_thinking": true, "preserve_thinking": true}' \
  --max-num-seqs 1 \
  --override-generation-config '{"temperature":0.6,"top_p":0.95,"top_k":20,"min_p":0.0,"presence_penalty":0.0,"repetition_penalty":1.0}'
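To compare preserve_thinking settings without restarting the container, the kwargs can also be sent per request. The sketch below assumes that vLLM's OpenAI-compatible server accepts a `chat_template_kwargs` field in the request body and that it takes precedence over the launch-time default; it also assumes the server is reachable on localhost. The function names are hypothetical. The port, model name, and API key are taken from the command above.

```python
import json
import urllib.request

# Assumption: the container is reachable on localhost; port 8000 comes from the docker command.
BASE_URL = "http://localhost:8000/v1/chat/completions"
MODEL = "Qwen/Qwen3.6-27B-FP8"


def build_payload(messages: list, preserve_thinking: bool) -> dict:
    """Build a chat completion request that sets the template kwargs for this call only."""
    return {
        "model": MODEL,
        "messages": messages,
        # Assumption: per-request kwargs override --default-chat-template-kwargs.
        "chat_template_kwargs": {
            "enable_thinking": True,
            "preserve_thinking": preserve_thinking,
        },
    }


def send(payload: dict, api_key: str = "key", timeout: float = 120.0) -> dict:
    """POST the payload; the timeout turns a stalled request into an exception."""
    request = urllib.request.Request(
        BASE_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Since preserve_thinking concerns earlier assistant turns, a meaningful comparison would replay the same multi-turn history (including tool calls) once with `True` and once with `False`.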

Also, it's worth pointing out a discrepancy: the README states that preserve_thinking is on by default, but the Jinja file defaults it to false.
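One way to check which default the template applies is to list every line of the Jinja file that mentions the variable. This is a rough sketch with a hypothetical helper name; the `SAMPLE` text is a made-up snippet for illustration only, not the contents of the v20 template.

```python
import re


def find_kwarg_lines(template_text: str, name: str) -> list[tuple[int, str]]:
    """Return (line_number, stripped_line) for each line that mentions `name` as a whole word."""
    pattern = re.compile(r"\b" + re.escape(name) + r"\b")
    return [
        (number, line.strip())
        for number, line in enumerate(template_text.splitlines(), start=1)
        if pattern.search(line)
    ]


# Made-up snippet for illustration; NOT the real v20 template.
SAMPLE = """{%- if preserve_thinking is not defined %}
{%- set preserve_thinking = false %}
{%- endif %}
{{- messages[0].content }}"""

for number, line in find_kwarg_lines(SAMPLE, "preserve_thinking"):
    print(number, line)
```

Against the real file, the text would be read from the template path instead, for example with `pathlib.Path(...).read_text()`.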
