Error when deploying this model with SGLang

#16

by MyJerry1996 - opened 28 days ago

Discussion

MyJerry1996

28 days ago

When deploying this model with SGLang, the following error occurs:
"
DeepGemm is enabled but the scale_fmt of checkpoint is not ue8m0. This might cause accuracy degradation on Blackwell.
nvidia-smi not found. Ensure NVIDIA drivers are installed and accessible. Falling back to torch.cuda.mem_get_info(). Reported total GPU memory per device (MiB): [60386], using min: 60386 MiB.
Disabling overlap schedule since mamba no_buffer is not compatible with overlap schedule, try to use --disable-radix-cache if overlap schedule is necessary
[2026-05-28 17:57:06 TP0] DeepGemm is enabled but the scale_fmt of checkpoint is not ue8m0. This might cause accuracy degradation on Blackwell.
[2026-05-28 17:57:06 TP0] Init torch distributed begin.
[2026-05-28 17:57:06 TP0] Init torch distributed ends. elapsed=0.04 s, mem usage=0.01 GB
[2026-05-28 17:57:07 TP0] Load weight begin. avail mem=49.19 GB
[2026-05-28 17:57:07 TP0] Using ModelOptModelLoader due to ModelOpt quantization config.
[2026-05-28 17:57:07 TP0] ModelOptModelLoader: Loading base model...
[2026-05-28 17:57:07 TP0] Model is already quantized, loading directly...
[2026-05-28 17:57:07 TP0] Detected nvfp4 checkpoint. Please note that the format is experimental and subject to change.
[2026-05-28 17:57:07 TP0] Multimodal attention backend not set. Use triton_attn.
[2026-05-28 17:57:07 TP0] Using triton_attn as multimodal attention backend.
[transformers] torch_dtype is deprecated! Use dtype instead!
[2026-05-28 17:57:07 TP0] using attn output gate!
Multi-thread loading shards: 0% Completed | 0/1 [00:00<?, ?it/s]
[2026-05-28 17:57:36 TP0] Parameter model.layers.0.linear_attn.in_proj_ba.input_scale not found in params_dict
[rank0]: Traceback (most recent call last):
[rank0]:   File "<frozen runpy>", line 198, in _run_module_as_main
[rank0]:   File "<frozen runpy>", line 88, in _run_code
[rank0]:   File "/data/.sglang/lib/python3.12/site-packages/sglang/bench_one_batch.py", line 982, in <module>
[rank0]:     main(server_args, bench_args)
[rank0]:   File "/data/.sglang/lib/python3.12/site-packages/sglang/bench_one_batch.py", line 944, in main
[rank0]:     work_func(server_args, port_args, bench_args, 0, 0)
[rank0]:   File "/data/.sglang/lib/python3.12/site-packages/sglang/bench_one_batch.py", line 834, in latency_test
[rank0]:     model_runner, tokenizer = load_model(server_args, port_args, gpu_id, tp_rank)
[rank0]:                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/data/.sglang/lib/python3.12/site-packages/sglang/bench_one_batch.py", line 316, in load_model
[rank0]:     model_runner = ModelRunner(**runner_kwargs)
[rank0]:                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/data/.sglang/lib/python3.12/site-packages/sglang/srt/model_executor/model_runner.py", line 508, in __init__
[rank0]:     self.initialize(pre_model_load_memory)
[rank0]:   File "/data/.sglang/lib/python3.12/site-packages/sglang/srt/model_executor/model_runner.py", line 625, in initialize
[rank0]:     self.load_model()
[rank0]:   File "/data/.sglang/lib/python3.12/site-packages/sglang/srt/model_executor/model_runner.py", line 1380, in load_model
[rank0]:     self.model = self.loader.load_model(
[rank0]:                  ^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/data/.sglang/lib/python3.12/site-packages/sglang/srt/model_loader/loader.py", line 2933, in load_model
[rank0]:     return super().load_model(
[rank0]:            ^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/data/.sglang/lib/python3.12/site-packages/sglang/srt/model_loader/loader.py", line 704, in load_model
[rank0]:     self.load_weights_and_postprocess(
[rank0]:   File "/data/.sglang/lib/python3.12/site-packages/sglang/srt/model_loader/loader.py", line 713, in load_weights_and_postprocess
[rank0]:     model.load_weights(weights)
[rank0]:   File "/data/.sglang/lib/python3.12/site-packages/sglang/srt/models/qwen3_5.py", line 1560, in load_weights
[rank0]:     weight_loader(param, loaded_weight, shard_id)
[rank0]:   File "/data/.sglang/lib/python3.12/site-packages/sglang/srt/models/qwen3_5.py", line 354, in weight_loader
[rank0]:     return original_weight_loader(param, loaded_weight, loaded_shard_id)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/data/.sglang/lib/python3.12/site-packages/sglang/srt/layers/linear.py", line 719, in weight_loader
[rank0]:     assert param_data.shape == loaded_weight.shape
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: AssertionError
Multi-thread loading shards: 0% Completed | 0/1 [00:27<?, ?it/s]
[rank0]:[W528 17:57:37.447275596 ProcessGroupNCCL.cpp:1575] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
"

The command was:
"
python3 -m sglang.bench_one_batch \
    --model-path /mnt/nas/models/vLLM/Qwen3.6-27B-Text-NVFP4-MTP-sakamakismile/ \
    --trust-remote-code \
    --tp 1
"

ChatGPT said it's because of a mismatch between the checkpoint's architecture and SGLang's Qwen3.5 model definition.

But when we deployed mmangkad/Qwen3.6-27B-NVFP4, which is supposed to have the same structure as sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP in the text part, it worked.
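One way to test the "same structure" assumption is to compare the tensor names and shapes stored in the two checkpoints. The sketch below is only an illustration: `read_tensor_shapes` and `diff_inventories` are hypothetical helper names, the shard paths are placeholders, and it assumes the `safetensors` package (with PyTorch) is installed. It reads only shard headers, so no weights are loaded.

```python
from typing import Dict, List, Optional, Tuple

Shape = Tuple[int, ...]


def read_tensor_shapes(safetensors_path: str) -> Dict[str, Shape]:
    """Map tensor name -> shape for one .safetensors shard (no tensor data is read)."""
    # Imported lazily so the comparison helper below works without safetensors.
    from safetensors import safe_open

    shapes: Dict[str, Shape] = {}
    with safe_open(safetensors_path, framework="pt") as f:
        for name in f.keys():
            shapes[name] = tuple(f.get_slice(name).get_shape())
    return shapes


def diff_inventories(
    working: Dict[str, Shape],
    failing: Dict[str, Shape],
    needle: str = "in_proj_ba",
) -> List[Tuple[str, Optional[Shape], Optional[Shape]]]:
    """List tensors whose name contains `needle` and whose shape differs
    between the two checkpoints (None means the tensor is absent)."""
    rows = []
    for name in sorted(set(working) | set(failing)):
        if needle not in name:
            continue
        a, b = working.get(name), failing.get(name)
        if a != b:
            rows.append((name, a, b))
    return rows


# Usage (placeholder paths, adjust to the local shard files):
#   working = read_tensor_shapes("path/to/working-checkpoint/model.safetensors")
#   failing = read_tensor_shapes("path/to/failing-checkpoint/model.safetensors")
#   for name, a, b in diff_inventories(working, failing):
#       print(name, a, b)
```

If the two checkpoints use different name prefixes for the text layers, every tensor will show up as a difference, so the names may need to be normalized before comparing.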

Can anyone give some advice on this? Thanks a lot!

MyJerry1996

28 days ago

When we print param_data.shape and loaded_weight.shape, we get:
torch.Size([48, 5120])
torch.Size([48, 2560])
Maybe it's caused by an NVFP4 compatibility issue in SGLang?
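One possible reading of these shapes, offered as a hypothesis rather than a confirmed diagnosis: NVFP4 stores two 4-bit values per byte, so a quantized weight with 5120 input features is saved with a last dimension of 2560. Assuming the shapes are printed in the order named above, that would fit SGLang allocating in_proj_ba as an unquantized [48, 5120] parameter while the checkpoint supplies a packed FP4 tensor. It would also be consistent with the log line saying in_proj_ba.input_scale was not found in params_dict. The helper names below are made up for illustration and the arithmetic is plain Python.

```python
from typing import Tuple

FP4_VALUES_PER_BYTE = 2  # two 4-bit values fit into one uint8
NVFP4_BLOCK_SIZE = 16    # elements that share one block scale factor


def packed_weight_shape(logical: Tuple[int, int]) -> Tuple[int, int]:
    """Shape of the packed uint8 weight for a logical [out, in] matrix."""
    out_features, in_features = logical
    if in_features % FP4_VALUES_PER_BYTE:
        raise ValueError("in_features must be even to pack two FP4 values per byte")
    return (out_features, in_features // FP4_VALUES_PER_BYTE)


def block_scale_shape(logical: Tuple[int, int]) -> Tuple[int, int]:
    """Shape of the per-block scale tensor for a logical [out, in] matrix."""
    out_features, in_features = logical
    if in_features % NVFP4_BLOCK_SIZE:
        raise ValueError("in_features must be a multiple of the block size")
    return (out_features, in_features // NVFP4_BLOCK_SIZE)


def looks_like_fp4_packing(expected: Tuple[int, ...], loaded: Tuple[int, ...]) -> bool:
    """True if `loaded` is `expected` with the last dimension halved."""
    return (
        len(expected) == 2
        and len(loaded) == 2
        and expected[0] == loaded[0]
        and expected[1] == loaded[1] * FP4_VALUES_PER_BYTE
    )


expected = (48, 5120)  # param_data.shape from the assertion
loaded = (48, 2560)    # loaded_weight.shape from the assertion

print(packed_weight_shape(expected))             # (48, 2560)
print(block_scale_shape(expected))               # (48, 320)
print(looks_like_fp4_packing(expected, loaded))  # True
```

If this reading is right, the thing to check is whether the checkpoint's quantization config lists linear_attn.in_proj_ba among the layers excluded from quantization, and whether the working checkpoint does.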
