Add chat template
This chat template looks good to me in pre-testing, but we might want to wait until the model is fully merged into Transformers for final testing + merging!
is this expected?
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-V4-Flash", revision="refs/pr/16")
tool_calls = [{"type": "function", "function": {"name": "dummy", "arguments": "{}"}}]
messages1 = [
{"role": "user", "content": "dummy"},
{"role": "assistant", "content": "", "tool_calls": tool_calls},
]
messages2 = messages1 + [{"role": "tool", "name": "dummy", "content": "dummy"}]
s1 = tok.apply_chat_template(messages1, tokenize=False)
s2 = tok.apply_chat_template(messages2, tokenize=False, add_generation_prompt=True)
print("s1:", repr(s1))
print("s2:", repr(s2))
print("s2 starts with s1?", s2.startswith(s1))
s1: '<|begin▁of▁sentence|><|User|>dummy<|Assistant|><think></think>\n\n<|DSML|tool_calls>\n<|DSML|invoke name="dummy">\n<|DSML|parameter name="arguments" string="true">{}</|DSML|parameter>\n</|DSML|invoke>\n</|DSML|tool_calls><|end▁of▁sentence|>'
s2: '<|begin▁of▁sentence|><|User|>dummy<|Assistant|></think>\n\n<|DSML|tool_calls>\n<|DSML|invoke name="dummy">\n<|DSML|parameter name="arguments" string="true">{}</|DSML|parameter>\n</|DSML|invoke>\n</|DSML|tool_calls><|end▁of▁sentence|><|User|><tool_result>dummy</tool_result><|Assistant|><think>'
s2 starts with s1? False
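To narrow down why the prefix check fails, it helps to find the first character at which the two renderings diverge. The sketch below does that on the opening segments of the two strings printed above. `first_divergence` is a hypothetical helper name, not part of Transformers.

```python
import os

def first_divergence(a: str, b: str) -> int:
    """Return the length of the longest common character prefix of a and b."""
    # os.path.commonprefix compares character by character, so it works on any strings.
    return len(os.path.commonprefix([a, b]))

# Opening segments copied from the s1/s2 output above.
s1_head = '<|begin▁of▁sentence|><|User|>dummy<|Assistant|><think></think>\n\n<|DSML|tool_calls>'
s2_head = '<|begin▁of▁sentence|><|User|>dummy<|Assistant|></think>\n\n<|DSML|tool_calls>'

i = first_divergence(s1_head, s2_head)
print("common prefix:", repr(s1_head[:i]))
print("s1 continues :", repr(s1_head[i:i + 16]))
print("s2 continues :", repr(s2_head[i:i + 16]))
```

The shared prefix ends one character into the tag that follows `<|Assistant|>`: the two-message rendering keeps the opening `<think>`, while the three-message rendering emits only `</think>` for the same assistant turn.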
this seems to work better: https://huggingface.co/trl-internal-testing/tiny-DeepseekV4ForCausalLM/discussions/6/files
@qgallouedec I think this is intended, or at least it matched the example tests; I mostly did "test-driven" development here. I realize <|Assistant|></think> instead of something like <|Assistant|><think>\n</think> looks like a bug, but this pattern also appears in the expected test outputs when drop_thinking=True (which is the default): https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash/blob/main/encoding/tests/test_output_2.txt
Hi! Sharing some downstream evidence that landing this template (or any tool-supporting variant) is load-bearing for the agent ecosystem.
We just shipped day-0 DeepSeek-V4-Flash support in rapid-mlx (Apple Silicon MLX backend) by vendoring Blaizzy's mlx-lm PR #1192 and tested both the 2-bit DQ and 8-bit mlx-community quants on a Mac Studio M3 Ultra:
| Suite | 2-bit DQ | 8-bit |
|---|---|---|
| Plain chat | ✅ works | ✅ works |
| Decode (tok/s) | 56 | 31 |
| Stress (8 scenarios) | 7/8 PASS | 7/8 PASS |
| Tool calling (30-scenario eval) | 0/30 | 0/30 |
| Hermes/OpenClaude agent integration | failing on tool tests | failing on tool tests |
The reason for the 0/30 is exactly what's being discussed here: the chat_template.jinja that mlx-community's quants ship today only handles system/user/assistant — no tool role rendering, no tools array iteration, no <tool_call> markers. So tools passed via the OpenAI-compatible API are silently dropped before the model ever sees them.
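A quick way to spot such a template before running an eval is to scan its source for any reference to tools. The check below is a rough heuristic under stated assumptions: `template_mentions_tools` is a hypothetical helper, the two template strings are toy examples rather than the shipped `chat_template.jinja`, and a match only shows that the template mentions tools, not that it renders them correctly.

```python
import re

# Identifiers a Jinja chat template would normally touch if it rendered tools:
# the `tools` variable, a `tool_calls` field, or a comparison against the 'tool' role.
_TOOL_PATTERNS = [
    re.compile(r"\btools\b"),
    re.compile(r"\btool_calls\b"),
    re.compile(r"""['"]tool['"]"""),
]

def template_mentions_tools(template_src: str) -> bool:
    """Heuristic: True if the template source references tools in any of the ways above."""
    return any(p.search(template_src) for p in _TOOL_PATTERNS)

# Toy template that only handles user/assistant turns (illustrative, not a real one).
chat_only = (
    "{% for m in messages %}"
    "{% if m['role'] == 'user' %}<|User|>{{ m['content'] }}"
    "{% elif m['role'] == 'assistant' %}<|Assistant|>{{ m['content'] }}"
    "{% endif %}{% endfor %}"
)
# Same template with a minimal loop over the tools array appended.
with_tools = chat_only + (
    "{% if tools %}{% for t in tools %}{{ t['function']['name'] }}{% endfor %}{% endif %}"
)

print(template_mentions_tools(chat_only))
print(template_mentions_tools(with_tools))
```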
So whichever variant of this PR lands first (the current draft or the trl-internal-testing alternative @qgallouedec referenced) is genuinely unblocking — the model itself is clearly capable, it just isn't being told tools exist. Happy to re-run our evals and report numbers once a final template is merged.
A data point from the MLX side, strongly in support of landing this.
I tested how this template behaves on the heavily-quantized MLX build (mlx-community/DeepSeek-V4-Flash-2bit-DQ, ~96 GB, running under mlx_lm on a 128 GB Mac Studio).
Without a tool template — the current state of that conversion — mlx_lm.server logs "model does not support tool calling" and silently drops the tools array, so tool calling is fully broken out of the box. This matches the "load-bearing for the agent ecosystem" observation in this thread.
With this DSML template (ported into the conversion) + a DSML tool parser, I benchmarked the 2-bit model on two tool-calling suites (greedy, thinking off):
| Suite | no template | Hermes <tool_call> workaround | native DSML (this template) |
|---|---|---|---|
| jdhodges (40) | 0 (tools dropped) | 33/40 (82 %) | 39/40 (98 %) |
| Veerman (12) | 0 | 6/12 (50 %) | 9/12 (75 %) |
| parallel multi-tool cases | — | 3/8 | 8/8 |
Two takeaways:
Even the 2-bit build is an excellent tool-caller in its native format — 98 % on jdhodges, matching the best full-size local model I have on this rig (a 35B-A3B at 98 %). For a 2-bit checkpoint that's a strong argument that the template is what's gating tool use, not the model.
Native DSML materially beats a generic Hermes <tool_call> workaround (82 % → 98 %), and the gap is almost entirely parallel calls. DSML's single <|DSML|tool_calls> block containing multiple <|DSML|invoke> is what lets the model emit parallel calls reliably; when I coaxed it into emitting separate Hermes <tool_call> blocks instead, it produced only the first of N every time (3/8 on the parallel cases). I initially mistook that for a quantization ceiling; it wasn't, it was a format mismatch. So this template doesn't just make tool calls "more correct," it unlocks parallel calling.
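To make the "one block, several invocations" shape concrete, here is a minimal regex-based sketch that pulls every invocation out of a single DSML block. It is an illustration only, not the parser submitted upstream: the marker spellings are copied from the rendered template output earlier in this thread, the tool names `get_weather` and `get_time` are made up, and the `string` attribute is matched but otherwise ignored because its semantics are not spelled out here.

```python
import re

# Marker spellings copied from the rendered template output earlier in the thread.
BLOCK_RE = re.compile(r"<\|DSML\|tool_calls>(.*?)</\|DSML\|tool_calls>", re.DOTALL)
INVOKE_RE = re.compile(r'<\|DSML\|invoke name="([^"]+)">(.*?)</\|DSML\|invoke>', re.DOTALL)
PARAM_RE = re.compile(
    r'<\|DSML\|parameter name="([^"]+)" string="(true|false)">(.*?)</\|DSML\|parameter>',
    re.DOTALL,
)

def parse_dsml_tool_calls(text: str) -> list[dict]:
    """Return one dict per invoke element, in order of appearance."""
    calls = []
    for block in BLOCK_RE.findall(text):
        for name, body in INVOKE_RE.findall(block):
            # Parameter values are kept as raw text; the string attribute is not interpreted.
            params = {p_name: value for p_name, _string_attr, value in PARAM_RE.findall(body)}
            calls.append({"name": name, "parameters": params})
    return calls

# Hypothetical model output with two parallel calls inside one block.
sample = (
    '<|DSML|tool_calls>\n'
    '<|DSML|invoke name="get_weather">\n'
    '<|DSML|parameter name="city" string="true">Paris</|DSML|parameter>\n'
    '</|DSML|invoke>\n'
    '<|DSML|invoke name="get_time">\n'
    '<|DSML|parameter name="tz" string="true">UTC</|DSML|parameter>\n'
    '</|DSML|invoke>\n'
    '</|DSML|tool_calls>'
)

calls = parse_dsml_tool_calls(sample)
for call in calls:
    print(call)
```

Both invocations come out of the one block, which is the property the parallel cases depend on.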
For MLX specifically: mlx-lm ships no parser for DSML, so the template alone isn't sufficient there. I wrote one and submitted it upstream: ml-explore/mlx-lm#1337. One gotcha for anyone wiring DSML into mlx-lm: it matches tool-call markers by exact token-id sequence, and the marker's trailing > merges with the next byte on this tokenizer (in ...tool_calls>\n, the >\n becomes a single token), so a marker that includes the closing > silently never matches. I anchor on the <|DSML|tool_calls prefix instead; the |DSML| core is a special token, so it's a stable anchor. (Related generic fix for the same issue in the json_tools parser: #1335 / #1336.)
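The merge problem can be reproduced with a toy tokenizer. The sketch below is not mlx-lm code and does not use the real vocabulary: `VOCAB`, `tokenize`, and `contains` are invented for illustration, with a greedy longest-match tokenizer standing in for the real one. It only shows why an id-sequence match on the full marker fails once `>` and the following newline share a token, while a match on the prefix still succeeds.

```python
# Toy vocabulary (an assumption): it contains both ">" and the merged piece ">\n".
VOCAB = ["|DSML|", "tool_calls", ">\n", "<", ">", "\n"]
TOKEN_ID = {piece: i for i, piece in enumerate(VOCAB)}

def tokenize(text: str) -> list[int]:
    """Greedy longest-match tokenizer over VOCAB (a stand-in for a real tokenizer)."""
    ids, pos = [], 0
    pieces = sorted(VOCAB, key=len, reverse=True)  # try longer pieces first
    while pos < len(text):
        for piece in pieces:
            if text.startswith(piece, pos):
                ids.append(TOKEN_ID[piece])
                pos += len(piece)
                break
        else:
            raise ValueError(f"no token for {text[pos]!r}")
    return ids

def contains(haystack: list[int], needle: list[int]) -> bool:
    """True if needle occurs as a contiguous run of ids inside haystack."""
    n = len(needle)
    return any(haystack[i:i + n] == needle for i in range(len(haystack) - n + 1))

generated = tokenize("<|DSML|tool_calls>\n")      # what the model emits
full_marker = tokenize("<|DSML|tool_calls>")      # marker including the closing ">"
prefix_marker = tokenize("<|DSML|tool_calls")     # marker without the closing ">"

print("full marker found:  ", contains(generated, full_marker))
print("prefix marker found:", contains(generated, prefix_marker))
```

In this toy setup the full marker ends in the id for `>`, but the generated sequence ends in the id for `>\n`, so only the prefix marker matches.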
Net: strong +1 to landing this template — it's the enabler, and it's worth propagating into the mlx-community conversions too.
Methodology + per-case data: https://github.com/snagnever/macstudio-local-llm/blob/deepseek-v4-tool-dsml/docs/benchmark-plans/2026-05-30-deepseek-v4-flash-tool-template.md