Instructions to use froggeric/Qwen-Fixed-Chat-Templates with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- MLX
How to use froggeric/Qwen-Fixed-Chat-Templates with MLX:
# Download the model from the Hub pip install huggingface_hub[hf_xet] huggingface-cli download --local-dir Qwen-Fixed-Chat-Templates froggeric/Qwen-Fixed-Chat-Templates
- Notebooks
- Google Colab
- Kaggle
- Local Apps Settings
- LM Studio
preserve_thinking: false or true
Hi,
I’m testing the v20 chat template with vLLM for a Qwen Code / agentic coding setup, and I’d like to clarify the recommended default-chat-template-kwargs configuration.
My current vLLM launch uses:
--reasoning-parser qwen3
--enable-auto-tool-choice
--tool-call-parser qwen3_coder
--default-chat-template-kwargs '{"enable_thinking": true, "preserve_thinking": true}'
With v20, I noticed the new logic around thinking preservation, deep agent loops, auto_disable_thinking_with_tools, and payload truncation options such as max_tool_arg_chars and max_tool_response_chars.
For an agentic coding workflow with Qwen Code in vLLM, what would you recommend as the optimal combination of default-chat-template-kwargs?
In particular:
- Should
preserve_thinkingbe set totrueorfalsewhen using v20 with vLLM? - For tool-heavy agent loops, is it recommended to enable
auto_disable_thinking_with_tools? - What values would you suggest for
max_tool_arg_charsandmax_tool_response_chars? - Would this be a reasonable default for vLLM agents?
{
"enable_thinking": true,
"preserve_thinking": false,
"auto_disable_thinking_with_tools": true,
"max_tool_response_chars": 12000,
"max_tool_arg_chars": 6000
}
Or would you recommend a different configuration for best tool-calling reliability and coding-agent performance?
Thanks!
One more data point: on my current configuration
{
"enable_thinking": true,
"preserve_thinking": true
}
after switching to v20 today, I almost immediately hit a hang / stalled execution in the OpenClaw agent.
For context, this is how I launch vLLM:
docker run -d \
--restart always \
--name vllm-qwen \
--gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-v ~/vllm-templates:/templates \
--env HF_TOKEN="$HF_TOKEN" \
--env PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
--env VLLM_API_KEY="key" \
-p 8000:8000 \
--ipc=host \
vllm/vllm-openai:latest \
Qwen/Qwen3.6-27B-FP8 \
--port 8000 \
--max-model-len 222400 \
--chat-template /templates/chat_template.jinja \
--reasoning-parser qwen3 \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}' \
--compilation-config '{"max_cudagraph_capture_size":16,"mode":"VLLM_COMPILE"}' \
--gpu-memory-utilization 0.975 \
--enable-prefix-caching \
--performance-mode interactivity \
--attention-backend flashinfer \
--kv-cache-dtype bfloat16 \
--default-chat-template-kwargs '{"enable_thinking": true, "preserve_thinking": true}' \
--max-num-seqs 1 \
--override-generation-config '{"temperature":0.6,"top_p":0.95,"top_k":20,"min_p":0.0,"presence_penalty":0.0,"repetition_penalty":1.0}'
Also it's worth pointing out that the README states that preserve_thinking is on by default, but the jinja file makes it default to false