Aquileo | deepseek-ai/DeepSeek-V4-Flash · Add chat template

Add chat template

PR #16, opened by Rocketknight1 (HF Staff)

This chat template looks good to me in pre-testing, but we might want to wait until the model is fully merged into Transformers for final testing + merging!

Is this expected?

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-V4-Flash", revision="refs/pr/16")

tool_calls = [{"type": "function", "function": {"name": "dummy", "arguments": "{}"}}]
messages1 = [
    {"role": "user", "content": "dummy"},
    {"role": "assistant", "content": "", "tool_calls": tool_calls},
]
messages2 = messages1 + [{"role": "tool", "name": "dummy", "content": "dummy"}]

s1 = tok.apply_chat_template(messages1, tokenize=False)
s2 = tok.apply_chat_template(messages2, tokenize=False, add_generation_prompt=True)

print("s1:", repr(s1))
print("s2:", repr(s2))
print("s2 starts with s1?", s2.startswith(s1))

Output:

s1: '<|begin▁of▁sentence|><|User|>dummy<|Assistant|><think></think>\n\n<|DSML|tool_calls>\n<|DSML|invoke name="dummy">\n<|DSML|parameter name="arguments" string="true">{}</|DSML|parameter>\n</|DSML|invoke>\n</|DSML|tool_calls><|end▁of▁sentence|>'
s2: '<|begin▁of▁sentence|><|User|>dummy<|Assistant|></think>\n\n<|DSML|tool_calls>\n<|DSML|invoke name="dummy">\n<|DSML|parameter name="arguments" string="true">{}</|DSML|parameter>\n</|DSML|invoke>\n</|DSML|tool_calls><|end▁of▁sentence|><|User|><tool_result>dummy</tool_result><|Assistant|><think>'
s2 starts with s1? False

@qgallouedec I think this is intended, or at least it matches the example tests; I mostly did "test-driven" development here. I realize that <|Assistant|></think>, rather than something like <|Assistant|><think>\n</think>, looks like a bug, but the same pattern appears in the expected test outputs when drop_thinking=True (which is the default): https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash/blob/main/encoding/tests/test_output_2.txt
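To see exactly where the two renderings part ways, a small helper that reports the first differing character can be handy. This is only a diagnostic sketch: `first_divergence` is a hypothetical helper, not part of Transformers, and the two strings are the openings of s1 and s2 copied from the output above.

```python
def first_divergence(a: str, b: str) -> int:
    """Return the index of the first character where a and b differ.

    If one string is a prefix of the other, this returns the length
    of the shorter string.
    """
    limit = min(len(a), len(b))
    for i in range(limit):
        if a[i] != b[i]:
            return i
    return limit


# Openings of the two renderings, copied from the output above.
s1_head = '<|begin▁of▁sentence|><|User|>dummy<|Assistant|><think></think>\n\n'
s2_head = '<|begin▁of▁sentence|><|User|>dummy<|Assistant|></think>\n\n'

i = first_divergence(s1_head, s2_head)
print("shared prefix:", repr(s1_head[:i]))
print("s1 continues :", repr(s1_head[i:i + 8]))
print("s2 continues :", repr(s2_head[i:i + 8]))
```

The shared prefix ends one character into the tag that follows <|Assistant|>: s1 continues with the opening <think> tag, while s2 goes straight to </think>. That is consistent with the drop_thinking behaviour described above, where the thinking opener of an earlier assistant turn is not rendered once later messages follow.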

Hi! Sharing some downstream evidence that landing this template (or any tool-supporting variant) is load-bearing for the agent ecosystem.

We just shipped day-0 DeepSeek-V4-Flash support in rapid-mlx (Apple Silicon MLX backend) by vendoring Blaizzy's mlx-lm PR #1192 and tested both the 2-bit DQ and 8-bit mlx-community quants on a Mac Studio M3 Ultra:

| Suite | 2-bit DQ | 8-bit |
|---|---|---|
| Plain chat | ✅ works | ✅ works |
| Decode (tok/s) | 56 | 31 |
| Stress (8 scenarios) | 7/8 PASS | 7/8 PASS |
| Tool calling (30-scenario eval) | 0/30 | 0/30 |
| Hermes/OpenClaude agent integration | failing on tool tests | failing on tool tests |

The reason for the 0/30 is exactly what's being discussed here: the chat_template.jinja that mlx-community's quants ship today only handles system/user/assistant — no tool role rendering, no tools array iteration, no <tool_call> markers. So tools passed via the OpenAI-compatible API are silently dropped before the model ever sees them.
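A quick way to spot a template with this limitation is to scan its source for the three things listed above. The sketch below is a rough text heuristic, not how mlx_lm or Transformers decide tool support; `template_tool_support` and both toy templates are hypothetical and exist only for illustration.

```python
import re


def template_tool_support(template_source: str) -> dict:
    """Heuristically report which tool-related features a chat template mentions."""
    return {
        # A standalone `tools` variable; does not match `tool_calls`.
        "tools_variable": bool(re.search(r"\btools\b", template_source)),
        # A quoted 'tool' role, as in m['role'] == 'tool'.
        "tool_role": bool(re.search(r"""['"]tool['"]""", template_source)),
        # Assistant-side tool_calls rendering.
        "tool_calls_field": "tool_calls" in template_source,
    }


# Toy template that only handles plain chat turns.
chat_only = (
    "{% for m in messages %}"
    "{% if m['role'] == 'user' %}<|User|>{{ m['content'] }}{% endif %}"
    "{% endfor %}"
)

# Toy template that references a tools array and a tool role.
with_tools = (
    "{% if tools %}{{ tools | tojson }}{% endif %}"
    "{% for m in messages %}"
    "{% if m['role'] == 'tool' %}<tool_result>{{ m['content'] }}</tool_result>{% endif %}"
    "{% endfor %}"
)

print(template_tool_support(chat_only))
print(template_tool_support(with_tools))
```

A template for which every entry is False cannot surface tools to the model, whatever the API request contains. A True entry only means the name is mentioned, so it is a necessary check rather than a sufficient one.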

So whichever variant of this PR lands first (the current draft or the trl-internal-testing alternative @qgallouedec referenced) is genuinely unblocking — the model itself is clearly capable, it just isn't being told tools exist. Happy to re-run our evals and report numbers once a final template is merged.

A data point from the MLX side, strongly in support of landing this.

I tested how this template behaves on the heavily-quantized MLX build (mlx-community/DeepSeek-V4-Flash-2bit-DQ, ~96 GB, running under mlx_lm on a 128 GB Mac Studio).

Without a tool template — the current state of that conversion — mlx_lm.server logs "model does not support tool calling" and silently drops the tools array, so tool calling is fully broken out of the box. This matches the "load-bearing for the agent ecosystem" observation in this thread.

With this DSML template (ported into the conversion) + a DSML tool parser, I benchmarked the 2-bit model on two tool-calling suites (greedy, thinking off):

| Suite | no template | Hermes <tool_call> workaround | native DSML (this template) |
|---|---|---|---|
| jdhodges (40) | 0 (tools dropped) | 33/40 (82 %) | 39/40 (98 %) |
| Veerman (12) | 0 | 6/12 (50 %) | 9/12 (75 %) |
| parallel multi-tool cases | | 3/8 | 8/8 |

Two takeaways:

  1. Even the 2-bit build is an excellent tool-caller in its native format — 98 % on jdhodges, matching the best full-size local model I have on this rig (a 35B-A3B at 98 %). For a 2-bit checkpoint that's a strong argument that the template is what's gating tool use, not the model.

  2. Native DSML materially beats a generic Hermes <tool_call> workaround (82 % → 98 %), and the gap is almost entirely parallel calls. DSML's single <|DSML|tool_calls> block containing multiple <|DSML|invoke> is what lets the model emit parallel calls reliably; when I coaxed it into emitting separate Hermes <tool_call> blocks instead, it produced only the first of N every time (3/8 on the parallel cases). I initially mistook that for a quantization ceiling — it wasn't; it was a format mismatch. So this template doesn't just make tool calls "more correct," it unlocks parallel calling.
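To make the "one block, many invokes" shape concrete, here is a minimal regex-based sketch that pulls every invoke out of a single tool_calls block. It is not the parser submitted to mlx-lm; the tag layout is taken from the rendered s1 output earlier in this thread, the tool names in the sample are made up, and the meaning of the string="..." attribute is left uninterpreted because nothing here documents it.

```python
import re

DSML = "|DSML|"  # marker core, as it appears in the rendered output in this thread
_D = re.escape(DSML)

BLOCK_RE = re.compile(rf"<{_D}tool_calls>(.*?)</{_D}tool_calls>", re.DOTALL)
INVOKE_RE = re.compile(rf'<{_D}invoke name="([^"]+)">(.*?)</{_D}invoke>', re.DOTALL)
PARAM_RE = re.compile(
    rf'<{_D}parameter name="([^"]+)" string="([^"]*)">(.*?)</{_D}parameter>',
    re.DOTALL,
)


def parse_dsml_tool_calls(text: str) -> list:
    """Return one dict per invoke, in the order the model emitted them."""
    calls = []
    for block in BLOCK_RE.finditer(text):
        for invoke in INVOKE_RE.finditer(block.group(1)):
            params = {
                name: {"string_attr": flag, "raw": raw}
                for name, flag, raw in PARAM_RE.findall(invoke.group(2))
            }
            calls.append({"name": invoke.group(1), "parameters": params})
    return calls


# Hypothetical model output: two parallel calls inside ONE tool_calls block.
sample = (
    f"<{DSML}tool_calls>\n"
    f'<{DSML}invoke name="get_weather">\n'
    f'<{DSML}parameter name="city" string="true">Paris</{DSML}parameter>\n'
    f"</{DSML}invoke>\n"
    f'<{DSML}invoke name="get_time">\n'
    f'<{DSML}parameter name="zone" string="true">UTC</{DSML}parameter>\n'
    f"</{DSML}invoke>\n"
    f"</{DSML}tool_calls>"
)

for call in parse_dsml_tool_calls(sample):
    print(call["name"], call["parameters"])
```

Because all invokes live in one block, the parser gets the full set of parallel calls from a single match, rather than depending on the model opening and closing a separate wrapper per call.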

For MLX specifically: mlx-lm ships no parser for DSML, so the template alone isn't sufficient there. I wrote one and submitted it upstream: ml-explore/mlx-lm#1337. One gotcha for anyone wiring DSML into mlx-lm: it matches tool-call markers by exact token-id sequence, and on this tokenizer the marker's trailing > merges with the next byte (in ...tool_calls>\n, the >\n becomes a single token), so a marker that includes the trailing > silently never matches. I anchor on the <|DSML|tool_calls prefix instead; the |DSML| core is a special token, so it's a stable anchor. (Related generic fix for the same issue in the json_tools parser: #1335 / #1336.)
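The token-merge gotcha is easier to see with a toy example. The sketch below uses made-up token ids (every id is hypothetical, and this is not mlx-lm's matching code) to show why an exact token-id match on the full marker fails while a match on the prefix succeeds.

```python
def find_subsequence(haystack: list, needle: list) -> int:
    """Return the start index of needle inside haystack, or -1 if absent."""
    n = len(needle)
    for i in range(len(haystack) - n + 1):
        if haystack[i:i + n] == needle:
            return i
    return -1


# Hypothetical token ids, for illustration only.
LT = 28           # "<"
DSML_CORE = 100   # the special-token core of the marker
TOOL_CALLS = 201  # "tool_calls"
GT = 29           # ">" on its own
GT_NEWLINE = 512  # ">\n" merged into a single token

# Marker as tokenized in isolation: the trailing ">" is its own token.
full_marker = [LT, DSML_CORE, TOOL_CALLS, GT]
# Marker without the trailing ">".
prefix_marker = [LT, DSML_CORE, TOOL_CALLS]

# Generated stream: the ">" has merged with the following newline.
generated = [7, 8, LT, DSML_CORE, TOOL_CALLS, GT_NEWLINE, 9]

print("full marker found at  :", find_subsequence(generated, full_marker))
print("prefix marker found at:", find_subsequence(generated, prefix_marker))
```

The full marker is never found because the stream contains the merged token where the marker expects a bare ">", while the prefix matches regardless of what the tokenizer does with the closing character.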

Net: strong +1 to landing this template — it's the enabler, and it's worth propagating into the mlx-community conversions too.

Methodology + per-case data: https://github.com/snagnever/macstudio-local-llm/blob/deepseek-v4-tool-dsml/docs/benchmark-plans/2026-05-30-deepseek-v4-flash-tool-template.md
