Error when deploying this model with sglang
When deploying this model with sglang, an error occurs:
"
DeepGemm is enabled but the scale_fmt of checkpoint is not ue8m0. This might cause accuracy degradation on Blackwell.
nvidia-smi not found. Ensure NVIDIA drivers are installed and accessible. Falling back to torch.cuda.mem_get_info(). Reported total GPU memory per device (MiB): [60386], using min: 60386 MiB.
Disabling overlap schedule since mamba no_buffer is not compatible with overlap schedule, try to use --disable-radix-cache if overlap schedule is necessary
[2026-05-28 17:57:06 TP0] DeepGemm is enabled but the scale_fmt of checkpoint is not ue8m0. This might cause accuracy degradation on Blackwell.
[2026-05-28 17:57:06 TP0] Init torch distributed begin.
[2026-05-28 17:57:06 TP0] Init torch distributed ends. elapsed=0.04 s, mem usage=0.01 GB
[2026-05-28 17:57:07 TP0] Load weight begin. avail mem=49.19 GB
[2026-05-28 17:57:07 TP0] Using ModelOptModelLoader due to ModelOpt quantization config.
[2026-05-28 17:57:07 TP0] ModelOptModelLoader: Loading base model...
[2026-05-28 17:57:07 TP0] Model is already quantized, loading directly...
[2026-05-28 17:57:07 TP0] Detected nvfp4 checkpoint. Please note that the format is experimental and subject to change.
[2026-05-28 17:57:07 TP0] Multimodal attention backend not set. Use triton_attn.
[2026-05-28 17:57:07 TP0] Using triton_attn as multimodal attention backend.
[transformers] torch_dtype is deprecated! Use dtype instead!
[2026-05-28 17:57:07 TP0] using attn output gate!
Multi-thread loading shards: 0% Completed | 0/1 [00:00<?, ?it/s][2026-05-28 17:57:36 TP0] Parameter model.layers.0.linear_attn.in_proj_ba.input_scale not found in params_dict
[rank0]: Traceback (most recent call last):
[rank0]: File "<frozen runpy>", line 198, in _run_module_as_main
[rank0]: File "<frozen runpy>", line 88, in _run_code
[rank0]: File "/data/.sglang/lib/python3.12/site-packages/sglang/bench_one_batch.py", line 982, in <module>
[rank0]: main(server_args, bench_args)
[rank0]: File "/data/.sglang/lib/python3.12/site-packages/sglang/bench_one_batch.py", line 944, in main
[rank0]: work_func(server_args, port_args, bench_args, 0, 0)
[rank0]: File "/data/.sglang/lib/python3.12/site-packages/sglang/bench_one_batch.py", line 834, in latency_test
[rank0]: model_runner, tokenizer = load_model(server_args, port_args, gpu_id, tp_rank)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/data/.sglang/lib/python3.12/site-packages/sglang/bench_one_batch.py", line 316, in load_model
[rank0]: model_runner = ModelRunner(**runner_kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/data/.sglang/lib/python3.12/site-packages/sglang/srt/model_executor/model_runner.py", line 508, in __init__
[rank0]: self.initialize(pre_model_load_memory)
[rank0]: File "/data/.sglang/lib/python3.12/site-packages/sglang/srt/model_executor/model_runner.py", line 625, in initialize
[rank0]: self.load_model()
[rank0]: File "/data/.sglang/lib/python3.12/site-packages/sglang/srt/model_executor/model_runner.py", line 1380, in load_model
[rank0]: self.model = self.loader.load_model(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/data/.sglang/lib/python3.12/site-packages/sglang/srt/model_loader/loader.py", line 2933, in load_model
[rank0]: return super().load_model(
[rank0]: ^^^^^^^^^^^^^^^^^^^
[rank0]: File "/data/.sglang/lib/python3.12/site-packages/sglang/srt/model_loader/loader.py", line 704, in load_model
[rank0]: self.load_weights_and_postprocess(
[rank0]: File "/data/.sglang/lib/python3.12/site-packages/sglang/srt/model_loader/loader.py", line 713, in load_weights_and_postprocess
[rank0]: model.load_weights(weights)
[rank0]: File "/data/.sglang/lib/python3.12/site-packages/sglang/srt/models/qwen3_5.py", line 1560, in load_weights
[rank0]: weight_loader(param, loaded_weight, shard_id)
[rank0]: File "/data/.sglang/lib/python3.12/site-packages/sglang/srt/models/qwen3_5.py", line 354, in weight_loader
[rank0]: return original_weight_loader(param, loaded_weight, loaded_shard_id)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/data/.sglang/lib/python3.12/site-packages/sglang/srt/layers/linear.py", line 719, in weight_loader
[rank0]: assert param_data.shape == loaded_weight.shape
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: AssertionError
Multi-thread loading shards: 0% Completed | 0/1 [00:27<?, ?it/s]
[rank0]:[W528 17:57:37.447275596 ProcessGroupNCCL.cpp:1575] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
"
The command is as follows:
"
python3 -m sglang.bench_one_batch \
--model-path /mnt/nas/models/vLLM/Qwen3.6-27B-Text-NVFP4-MTP-sakamakismile/ \
--trust-remote-code \
--tp 1
"
ChatGPT said it's because of a mismatch between the architecture of the checkpoint and SGLang's Qwen3.5 model definition.
But when we deployed mmangkad/Qwen3.6-27B-NVFP4, which is supposed to have the same structure as sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP in the text part, it worked.
So can anyone give some advice on this? Thanks a lot!
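One way to check whether the two checkpoints really share the same structure is to compare which modules each one leaves unquantized. The sketch below assumes both checkpoint directories contain a ModelOpt-style `hf_quant_config.json` with a `quantization.exclude_modules` list; that file name and key layout are assumptions about these particular repos, and the helper names are hypothetical.

```python
import json
import sys
from pathlib import Path


def load_exclude_modules(checkpoint_dir):
    """Return the set of module patterns a checkpoint excludes from quantization.

    Assumes a ModelOpt-style hf_quant_config.json (an assumption, not verified
    for these repos). Returns an empty set if the file or key is missing.
    """
    config_path = Path(checkpoint_dir) / "hf_quant_config.json"
    if not config_path.is_file():
        return set()
    with config_path.open(encoding="utf-8") as f:
        config = json.load(f)
    return set(config.get("quantization", {}).get("exclude_modules") or [])


def diff_exclude_modules(dir_a, dir_b):
    """Return (only_in_a, only_in_b) as sorted lists of exclude patterns."""
    a = load_exclude_modules(dir_a)
    b = load_exclude_modules(dir_b)
    return sorted(a - b), sorted(b - a)


# Usage: python compare_quant.py <checkpoint_dir_a> <checkpoint_dir_b>
if __name__ == "__main__" and len(sys.argv) == 3:
    only_a, only_b = diff_exclude_modules(sys.argv[1], sys.argv[2])
    print("excluded only in first checkpoint: ", only_a)
    print("excluded only in second checkpoint:", only_b)
```

If a pattern covering `linear_attn` shows up in only one of the two lists, the checkpoints were probably quantized with different layer selections, which could explain why only one of them loads.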
When we print "param_data.shape" and "loaded_weight.shape", the output is:
torch.Size([48, 5120])
torch.Size([48, 2560])
Maybe it's caused by a compatibility issue with NVFP4 in SGLang?
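The factor of two between the shapes is what one would expect if the checkpoint stores `in_proj_ba` as packed FP4 (two 4-bit values per byte) while SGLang allocated that parameter unquantized; the earlier "input_scale not found in params_dict" message points the same way. This is a hypothesis, not a confirmed diagnosis. A minimal sketch to check it follows, reading only the safetensors header with the standard library. The function names are hypothetical.

```python
import json
import struct


def read_safetensors_header(path):
    """Read tensor metadata from a .safetensors file without loading weights.

    The file starts with an 8-byte little-endian length, followed by a JSON
    header mapping tensor names to dtype, shape, and data offsets.
    """
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len).decode("utf-8"))
    header.pop("__metadata__", None)  # optional free-form metadata entry
    return header


def find_tensors(header, substring):
    """Return {name: (dtype, shape)} for tensor names containing substring."""
    return {
        name: (info["dtype"], tuple(info["shape"]))
        for name, info in sorted(header.items())
        if substring in name
    }


def looks_fp4_packed(expected_shape, loaded_shape):
    """True if loaded_shape equals expected_shape with the last dim halved."""
    return (
        len(expected_shape) == len(loaded_shape)
        and tuple(expected_shape[:-1]) == tuple(loaded_shape[:-1])
        and expected_shape[-1] == 2 * loaded_shape[-1]
    )


# The two shapes printed from linear.py match the packed-FP4 pattern.
print(looks_fp4_packed((48, 5120), (48, 2560)))  # prints True
```

Calling `find_tensors(read_safetensors_header(path), "linear_attn.in_proj_ba")` on the weight file of each checkpoint should show whether this layer is stored as a `U8` weight with scale tensors in one and as a `BF16` weight in the other. If so, the difference lies in which layers were quantized rather than in NVFP4 support in general.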