Qwen3.6-27B-NVFP4 is slower than official FP8 on Blackwell; possible fallback / FLA path mismatch on vLLM

by garrussun - opened Apr 23

Apr 23

Hi, thanks for sharing this model.

I tested sakamakismile/Qwen3.6-27B-NVFP4 on Blackwell and wanted to report a behavior that looks abnormal compared to the official FP8 baseline.

Environment:
· GPU:
RTX 5090
RTX PRO 6000 Blackwell
· vLLM tested:
0.19.1
0.19.1rc1.dev328+g18013df6a.cu130 (nightly-based image)
· Container/runtime:
vllm/vllm-openai:latest
custom nightly-based image with pandas added
· Model under test:
sakamakismile/Qwen3.6-27B-NVFP4
· Baseline for comparison:
Official Qwen3.6-27B-FP8

What I observed:
Throughput is clearly worse than official FP8
On my setup, this NVFP4 model is slower than the official FP8 model, both in single-GPU and TP=2 tests.

That is unexpected, since I would at least expect comparable or better serving throughput if the intended NVFP4 path is being used correctly.

On vLLM 0.19.1, I get a suspicious FLA warning

With the stable vLLM 0.19.1, I see warnings like:
/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fla/ops/utils.py:113:
UserWarning: Input tensor shape suggests potential format mismatch:
seq_len (20) < num_heads (24/48).
This may indicate the inputs were passed in head-first format [B, H, T, ...]
when head_first=False was specified.
Please verify your input tensor format matches the expected shape [B, T, H, ...].

This appears on both:
TP=2
single GPU

So it does not look like a TP-only communication issue.

On the older nightly build, the warning disappears, but performance is still poor
When I switch back to the older nightly-based vLLM image, the warning is gone, but throughput is still clearly below the official FP8 baseline.

That makes me suspect one of the following:
· some fallback path is being taken silently
· the intended NVFP4 fast path is not actually active
· there is an attention / FLA compatibility issue on Blackwell
· or the model is loading but not executing on the expected optimized path

Questions:

Is this throughput expected for this model on Blackwell?
Is there a known requirement beyond the model card quick-start to ensure the intended NVFP4 path is actually active?
Is the seq_len < num_heads FLA warning on vLLM 0.19.1 something you have also seen?
Should this model currently be used only with a specific nightly commit / patched vLLM path?

If useful, I can also provide:
· exact launch commands
· single-GPU vs TP=2 comparisons
· FP8 baseline numbers
· full logs

Thanks.

livepeer-ren

Apr 23

What is your throughput? I see 55+ tok/s for a single request on a 5090, and 700+ in total with multiple requests. I am using the vllm/vllm-openai:cu130-nightly Docker image:
services:
  vllm:
    image: vllm/vllm-openai:cu130-nightly

garrussun

Apr 24

Similar to yours, but I get 70+ (RTX PRO 6000, decode, single request) with the official FP8 model. Have you tried that? I think NVFP4 should be faster than FP8 if it is working normally.

sakamakismile

Owner Apr 25

Hi @garrussun — thanks a lot for the detailed report, this is exactly the
kind of signal we need.

Quick triage from our side:

1. The FLA warning is a false alarm.
seq_len(20) < num_heads is a layout heuristic in
vllm/model_executor/layers/fla/ops/utils.py that misfires on Qwen3.6's
hybrid (linear-attn + full-attn) blocks. It does not mean a wrong path
was taken — output is correct. Recent nightlies silenced it but the
behavior is the same. So this is not the cause of the slowdown.
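
To make the false alarm concrete, here is a minimal sketch of the kind of shape heuristic the warning text describes. It is reconstructed from the warning message only and is not vLLM's actual code; the function name `looks_head_first` is hypothetical.

```python
def looks_head_first(shape, head_first=False):
    """Return True if a [B, T, H, ...] tensor shape would trip the heuristic.

    Hypothetical re-creation based on the warning text: when the caller
    says head_first=False, the second dim is treated as seq_len and the
    third as num_heads, and the check fires whenever seq_len < num_heads.
    """
    if head_first or len(shape) < 3:
        return False
    seq_len, num_heads = shape[1], shape[2]
    return seq_len < num_heads


# A short prompt (20 tokens) with 24 heads trips the check even though
# the layout really is [B, T, H, D].
print(looks_head_first((1, 20, 24, 128)))
# A long prompt with the same layout does not.
print(looks_head_first((1, 4096, 24, 128)))
```

Under this reading, any request shorter than the head count triggers the warning regardless of whether the layout is correct, which is consistent with it appearing on both single-GPU and TP=2 runs.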

2. NVFP4 < FP8 on single-request decode is plausible (but fixable).
On Blackwell, NVFP4 only beats FP8 cleanly when:

· the GEMM is the bottleneck (large batch / prefill / long context), or
· the dequant + scale path is fused properly.

On short single-request decode, the non-quantized parts (norms, SSM/gating,
lm_head, sampling) dominate, and vLLM 0.19.x still has rough edges in the
NVFP4 dispatch for hybrid models — we've seen silent partial fallback to
the FP8/BF16 LinearMethod when the per-tensor scale metadata isn't matched.
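
A rough Amdahl-style sketch of why the non-quantized parts matter. Everything here is an illustrative assumption, not a measurement: `f` is the fraction of FP8 step time spent in weight-bound quantized GEMMs, and 4.5 bits per NVFP4 weight assumes 4-bit values plus one 8-bit scale per 16-element block.

```python
FP8_BITS = 8.0
NVFP4_BITS = 4.0 + 8.0 / 16.0  # 4-bit values + one 8-bit block scale per 16 weights


def nvfp4_speedup_over_fp8(f):
    """Best-case speedup if only a fraction f of FP8 step time shrinks.

    Assumes the quantized-GEMM part scales with bytes of weights read and
    everything else (norms, gating, lm_head, sampling) is unchanged.
    """
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must be between 0 and 1")
    ratio = NVFP4_BITS / FP8_BITS
    return 1.0 / ((1.0 - f) + f * ratio)


for f in (1.0, 0.6, 0.3):
    print(f"f={f:.1f}: at most {nvfp4_speedup_over_fp8(f):.2f}x")
```

Even with a perfect kernel, a small `f` leaves little headroom, so any partial fallback can push NVFP4 below FP8.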

3. Reproducing on our side.
We run this stack on 7x RTX PRO 6000 Blackwell daily (same family as yours)
and currently get >2k tok/s aggregate on a 397B NVFP4 sibling, so a 27B
NVFP4 falling behind FP8 on the same hardware is unexpected and worth
chasing. We'll re-bench Qwen3.6-27B-NVFP4 vs official FP8 on RTX PRO 6000,
single-req and concurrency=8/16, and post numbers here within a day or two.

Could you share:

· exact vllm serve launch command (TP, max-model-len, --kv-cache-dtype,
--quantization flag if any),
· decode tok/s for both models under the same client load,
· the first 200 lines of vLLM startup log (we want to see which
LinearMethod / quant_method it picks per layer).

If it's a vLLM dispatch bug we'll file upstream; if it's a metadata bug in
our checkpoint we'll re-export. Either way you won't be left hanging.
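
If it helps with the log request, a small sketch for pulling the relevant lines out of a saved startup log. The keyword list is a guess at what is worth looking for, not a list of strings vLLM is known to print, and `scan_startup_log` is a hypothetical helper.

```python
import re

# Assumed keywords; adjust to whatever your log actually contains.
KEYWORDS = re.compile(r"LinearMethod|quant_method|Kernel|fallback", re.IGNORECASE)


def scan_startup_log(lines, limit=200):
    """Return (line_number, text) for matching lines among the first `limit`."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if number > limit:
            break
        if KEYWORDS.search(line):
            hits.append((number, line.rstrip("\n")))
    return hits


def scan_file(path, limit=200):
    """Convenience wrapper that reads the log from a file path."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return scan_startup_log(handle, limit=limit)
```

Calling `scan_file("vllm_startup.log")` on a captured log (the filename is just an example) gives a short list that is easier to paste into a reply than the full 200 lines.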

TonoKen3 Lna-Lab

sakamakismile

Owner Apr 25

Hi @garrussun — circling back. We rebuilt the model in modelopt NVFP4 format (the original repo is compressed-tensors, which is known to take a slower fallback path on SM120 in vLLM 0.19.x).

The new sibling repo is here:

→ https://huggingface.co/sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP

It's text-only (vision tower stripped, ~4 GB smaller) and ships with a working MTP draft head (Discussion #7), so on Blackwell you should now get:

· native NVFP4 GEMM via FlashInferCutlassNvFp4LinearKernel (visible in the vLLM startup log),
· speculative decoding throughput on top.

Local numbers on RTX PRO 6000, vLLM 0.19.1rc1, single request, 2000-token completions across 6 domains (English/Japanese/code/math/philosophy):

· mean 85.1 tok/s, peak 91.2 tok/s (math/logic),
· 79.6% MTP acceptance over the run.

That puts it in the same ballpark as the official FP8 you cited (and sometimes ahead, depending on the prompt). If you can rerun your 5090/PRO 6000 benchmark against this repo I'd really like to see the comparison from your side.
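
For a like-for-like rerun, a minimal single-request timing sketch against vLLM's OpenAI-compatible `/v1/completions` endpoint. It measures end-to-end tokens per second (prefill included), not pure decode, and it is not the harness used for the numbers above. The base URL and the function names are assumptions.

```python
import json
import time
import urllib.request


def tok_per_s(completion_tokens, elapsed_s):
    """Tokens generated per second of wall-clock time."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return completion_tokens / elapsed_s


def measure(base_url, model, prompt, max_tokens=2000):
    """Send one non-streaming completion request and return tok/s."""
    body = json.dumps({
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0,
    }).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/v1/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    start = time.perf_counter()
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    elapsed = time.perf_counter() - start
    # The server reports how many tokens it actually generated.
    return tok_per_s(payload["usage"]["completion_tokens"], elapsed)
```

For example, `measure("http://localhost:8000", "sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP", "Explain KV caching.")` assumes the server is listening on port 8000. Running the same call against the FP8 server with identical prompts keeps the client load the same for both models.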

About the FLA seq_len < num_heads warning you reported: that's a layout heuristic in vllm/model_executor/layers/fla/ops/utils.py misfiring on Qwen3.6's hybrid full + linear-attention blocks. It's noise, not the cause of the slowdown.

— sakamakismile

livepeer-ren

Apr 25

•

edited Apr 25

That is a great update. I am running it on a single 5090 and it peaks at 120 tok/s. Any chance of shipping this update to the uncensored model you've just dropped as well? Stripping the image weights is crucial to fit it on a 5090.

jpsequeira

Apr 25

Hey, would this be runnable in SGLang?

livepeer-ren

Apr 25

•

edited Apr 25

I've been trying for 2 hours with no luck. I couldn't get SGLang to stay out of multimodal mode, and the text-only model was somehow still invoking some image-related tensors.

sakamakismile

Owner Apr 26

Closing the loop on this — your diagnosis was correct. I rebuilt in modelopt format with the MTP head restored:

→ sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP

Verified on RTX PRO 6000 Blackwell + vLLM 0.19.1rc1 with num_speculative_tokens=3:

Prompt	tok/s
Short (50 tok)	132.5
Medium (350 tok)	105.5
Long-form (700 tok)	106.5

FlashInferCutlassNvFp4LinearKernel is selected by vLLM at startup (visible in init logs), MTP per-position acceptance ~87/72/61 % at n=3 (mean accept length ~3.0/4.0). Comparable to or above official FP8.
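
For readers relating per-position acceptance to tokens per step, a small arithmetic sketch. The thread does not say whether the ~87/72/61% figures are marginal rates (fraction of steps where position k was accepted) or conditional rates (given that position k-1 was accepted), so both readings are shown. The function names are hypothetical, and the extra 1.0 is the token the target model always emits per step.

```python
def tokens_per_step_marginal(rates):
    """Rates are P(position k accepted) measured over all steps."""
    return 1.0 + sum(rates)


def tokens_per_step_conditional(rates):
    """Rates are P(position k accepted | position k-1 accepted)."""
    total, survive = 1.0, 1.0
    for rate in rates:
        survive *= rate  # probability the draft chain reaches this position
        total += survive
    return total


rates = [0.87, 0.72, 0.61]  # approximate figures quoted above
print(round(tokens_per_step_marginal(rates), 2))
print(round(tokens_per_step_conditional(rates), 2))
```

Either way the step yields roughly three tokens, which matches the order of magnitude of the quoted mean accept length.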

Thanks for catching this cleanly — credited in the new repo's README.

— Tonoken3 / Lna-Lab
