Aquileo | sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP · Error when deploying this model with sglang

#16
by MyJerry1996 - opened

When deploying this model with SGLang, the following error occurs:
"
DeepGemm is enabled but the scale_fmt of checkpoint is not ue8m0. This might cause accuracy degradation on Blackwell.
nvidia-smi not found. Ensure NVIDIA drivers are installed and accessible. Falling back to torch.cuda.mem_get_info(). Reported total GPU memory per device (MiB): [60386], using min: 60386 MiB.
Disabling overlap schedule since mamba no_buffer is not compatible with overlap schedule, try to use --disable-radix-cache if overlap schedule is necessary
[2026-05-28 17:57:06 TP0] DeepGemm is enabled but the scale_fmt of checkpoint is not ue8m0. This might cause accuracy degradation on Blackwell.
[2026-05-28 17:57:06 TP0] Init torch distributed begin.
[2026-05-28 17:57:06 TP0] Init torch distributed ends. elapsed=0.04 s, mem usage=0.01 GB
[2026-05-28 17:57:07 TP0] Load weight begin. avail mem=49.19 GB
[2026-05-28 17:57:07 TP0] Using ModelOptModelLoader due to ModelOpt quantization config.
[2026-05-28 17:57:07 TP0] ModelOptModelLoader: Loading base model...
[2026-05-28 17:57:07 TP0] Model is already quantized, loading directly...
[2026-05-28 17:57:07 TP0] Detected nvfp4 checkpoint. Please note that the format is experimental and subject to change.
[2026-05-28 17:57:07 TP0] Multimodal attention backend not set. Use triton_attn.
[2026-05-28 17:57:07 TP0] Using triton_attn as multimodal attention backend.
[transformers] torch_dtype is deprecated! Use dtype instead!
[2026-05-28 17:57:07 TP0] using attn output gate!
Multi-thread loading shards: 0% Completed | 0/1 [00:00<?, ?it/s]
[2026-05-28 17:57:36 TP0] Parameter model.layers.0.linear_attn.in_proj_ba.input_scale not found in params_dict
[rank0]: Traceback (most recent call last):
[rank0]: File "<frozen runpy>", line 198, in _run_module_as_main
[rank0]: File "<frozen runpy>", line 88, in _run_code
[rank0]: File "/data/.sglang/lib/python3.12/site-packages/sglang/bench_one_batch.py", line 982, in <module>
[rank0]: main(server_args, bench_args)
[rank0]: File "/data/.sglang/lib/python3.12/site-packages/sglang/bench_one_batch.py", line 944, in main
[rank0]: work_func(server_args, port_args, bench_args, 0, 0)
[rank0]: File "/data/.sglang/lib/python3.12/site-packages/sglang/bench_one_batch.py", line 834, in latency_test
[rank0]: model_runner, tokenizer = load_model(server_args, port_args, gpu_id, tp_rank)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/data/.sglang/lib/python3.12/site-packages/sglang/bench_one_batch.py", line 316, in load_model
[rank0]: model_runner = ModelRunner(**runner_kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/data/.sglang/lib/python3.12/site-packages/sglang/srt/model_executor/model_runner.py", line 508, in __init__
[rank0]: self.initialize(pre_model_load_memory)
[rank0]: File "/data/.sglang/lib/python3.12/site-packages/sglang/srt/model_executor/model_runner.py", line 625, in initialize
[rank0]: self.load_model()
[rank0]: File "/data/.sglang/lib/python3.12/site-packages/sglang/srt/model_executor/model_runner.py", line 1380, in load_model
[rank0]: self.model = self.loader.load_model(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/data/.sglang/lib/python3.12/site-packages/sglang/srt/model_loader/loader.py", line 2933, in load_model
[rank0]: return super().load_model(
[rank0]: ^^^^^^^^^^^^^^^^^^^
[rank0]: File "/data/.sglang/lib/python3.12/site-packages/sglang/srt/model_loader/loader.py", line 704, in load_model
[rank0]: self.load_weights_and_postprocess(
[rank0]: File "/data/.sglang/lib/python3.12/site-packages/sglang/srt/model_loader/loader.py", line 713, in load_weights_and_postprocess
[rank0]: model.load_weights(weights)
[rank0]: File "/data/.sglang/lib/python3.12/site-packages/sglang/srt/models/qwen3_5.py", line 1560, in load_weights
[rank0]: weight_loader(param, loaded_weight, shard_id)
[rank0]: File "/data/.sglang/lib/python3.12/site-packages/sglang/srt/models/qwen3_5.py", line 354, in weight_loader
[rank0]: return original_weight_loader(param, loaded_weight, loaded_shard_id)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/data/.sglang/lib/python3.12/site-packages/sglang/srt/layers/linear.py", line 719, in weight_loader
[rank0]: assert param_data.shape == loaded_weight.shape
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: AssertionError
Multi-thread loading shards: 0% Completed | 0/1 [00:27<?, ?it/s]
[rank0]:[W528 17:57:37.447275596 ProcessGroupNCCL.cpp:1575] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
"

The command is as follows:
"
python3 -m sglang.bench_one_batch \
  --model-path /mnt/nas/models/vLLM/Qwen3.6-27B-Text-NVFP4-MTP-sakamakismile/ \
  --trust-remote-code \
  --tp 1
"

ChatGPT said it's because of a mismatch between the checkpoint's architecture and SGLang's Qwen3.5 model definition.

But when we deployed mmangkad/Qwen3.6-27B-NVFP4, which is supposed to have the same structure as sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP in the text part, it worked.

Can anyone give some advice on this? Thanks a lot!

When we print param_data.shape and loaded_weight.shape at the failing assertion, it shows:
torch.Size([48, 5120])
torch.Size([48, 2560])
Maybe it's caused by a compatibility issue with NVFP4 in SGLang?
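
The two shapes printed above differ by exactly a factor of two in the last dimension. NVFP4 weights are stored packed, two 4-bit values per byte. One possible reading, which is an assumption and not something verified against the checkpoint, is that the checkpoint holds in_proj_ba in packed NVFP4 form while SGLang allocated that parameter unquantized. The missing input_scale warning in the log would be consistent with that. Here is a minimal sketch of the shape arithmetic; the helper names are hypothetical and not part of SGLang:

```python
def packed_fp4_shape(shape):
    """Shape of the byte tensor that stores an FP4 weight, two values per byte."""
    *lead, last = shape
    if last % 2:
        raise ValueError("last dimension must be even to pack two FP4 values per byte")
    return (*lead, last // 2)


def looks_like_fp4_packing(param_shape, loaded_shape):
    """True if loaded_shape equals param_shape with the last dimension halved."""
    param_shape, loaded_shape = tuple(param_shape), tuple(loaded_shape)
    if not param_shape or param_shape[-1] % 2:
        return False
    return packed_fp4_shape(param_shape) == loaded_shape


param_shape = (48, 5120)   # param_data.shape from the post
loaded_shape = (48, 2560)  # loaded_weight.shape from the post

print(packed_fp4_shape(param_shape))                      # (48, 2560)
print(looks_like_fp4_packing(param_shape, loaded_shape))  # True
```

If this is what is happening, checking the dtype of model.layers.0.linear_attn.in_proj_ba.weight in both checkpoints would show whether one quantizes this layer and the other leaves it in higher precision.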