preserve_thinking: false or true

#37

by vikuloff - opened 16 days ago

Hi,

I’m testing the v20 chat template with vLLM for a Qwen Code / agentic coding setup, and I’d like to clarify the recommended default-chat-template-kwargs configuration.

My current vLLM launch uses:

--reasoning-parser qwen3
--enable-auto-tool-choice
--tool-call-parser qwen3_coder
--default-chat-template-kwargs '{"enable_thinking": true, "preserve_thinking": true}'

With v20, I noticed the new logic around thinking preservation, deep agent loops, auto_disable_thinking_with_tools, and payload truncation options such as max_tool_arg_chars and max_tool_response_chars.

For an agentic coding workflow with Qwen Code in vLLM, what would you recommend as the optimal combination of default-chat-template-kwargs?

In particular:

Should preserve_thinking be set to true or false when using v20 with vLLM?
For tool-heavy agent loops, is it recommended to enable auto_disable_thinking_with_tools?
What values would you suggest for max_tool_arg_chars and max_tool_response_chars?
Would this be a reasonable default for vLLM agents?

{
  "enable_thinking": true,
  "preserve_thinking": false,
  "auto_disable_thinking_with_tools": true,
  "max_tool_response_chars": 12000,
  "max_tool_arg_chars": 6000
}

Or would you recommend a different configuration for best tool-calling reliability and coding-agent performance?

Thanks!

vikuloff

15 days ago

•

edited 15 days ago

One more data point: on my current configuration

{
"enable_thinking": true,
"preserve_thinking": true
}

after switching to v20 today, I almost immediately hit a hang / stalled execution in the OpenClaw agent.

For context, this is how I launch vLLM:

docker run -d \
  --restart always \
  --name vllm-qwen \
  --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -v ~/vllm-templates:/templates \
  --env HF_TOKEN="$HF_TOKEN" \
  --env PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
  --env VLLM_API_KEY="key" \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai:latest \
  Qwen/Qwen3.6-27B-FP8 \
  --port 8000 \
  --max-model-len 222400 \
  --chat-template /templates/chat_template.jinja \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}' \
  --compilation-config '{"max_cudagraph_capture_size":16,"mode":"VLLM_COMPILE"}' \
  --gpu-memory-utilization 0.975 \
  --enable-prefix-caching \
  --performance-mode interactivity \
  --attention-backend flashinfer \
  --kv-cache-dtype bfloat16 \
  --default-chat-template-kwargs '{"enable_thinking": true, "preserve_thinking": true}' \
  --max-num-seqs 1 \
  --override-generation-config '{"temperature":0.6,"top_p":0.95,"top_k":20,"min_p":0.0,"presence_penalty":0.0,"repetition_penalty":1.0}'

crusaderky

12 days ago

Also it's worth pointing out that the README states that preserve_thinking is on by default, but the jinja file makes it default to false

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment