🪽How to Run MTP Models: Multi-Token Prediction Guide

MTP, or Multi-Token Prediction, speeds up inference by letting a model predict multiple upcoming tokens at once instead of generating one token per step. It enables faster inference without accuracy loss and is especially effective on GPUs. In this guide, you’ll learn how to use MTP models like Gemma 4 or Qwen3.6 on your local device.

MTP predicts several future tokens, then the main model verifies them in parallel. This reduces the number of forward passes needed during generation, making output faster. Because only verified tokens are kept, output quality remains unchanged while decoding work is significantly reduced.
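As a rough illustration of the draft-then-verify idea, the toy shell function below keeps only the leading drafted tokens that match what the main model would have produced. The function name accept_prefix and the sample tokens are hypothetical. Real decoders compare token IDs inside the inference engine, and the main model also supplies its own token at the first mismatch, which is omitted here.

```shell
# Toy sketch: keep the longest prefix of drafted tokens that the main model agrees with.
accept_prefix() {
  draft=$1      # space-separated tokens proposed by the MTP draft
  verified=$2   # space-separated tokens the main model would produce
  accepted=""
  set -- $verified            # load the verified tokens into $1, $2, ...
  for tok in $draft; do
    # Stop at the first drafted token that the main model disagrees with.
    [ "$#" -gt 0 ] && [ "$tok" = "$1" ] || break
    accepted="${accepted:+$accepted }$tok"
    shift
  done
  printf '%s\n' "$accepted"
}

accept_prefix "the cat sat on" "the cat ran to"   # prints: the cat
```

Two of the four drafted tokens are accepted here, so this step yields two tokens from a single verification pass rather than one.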

When running GGUFs, MTP can make generation ~1.4× to 2.2× faster. Dense models like Gemma-4-31B benefit most, reaching >1.4× speedup over the baseline without MTP, which is especially useful for local inference. Gains are smaller on devices with lower memory bandwidth, such as older Macs. You can run MTP models directly in Unsloth Studio’s UI or in llama.cpp.

MTP uses more memory than standard decoding, so plan for ~2 GB of additional RAM/VRAM headroom.


We found that --spec-draft-n-max 2 is the best starting point. However, do not assume 2 is optimal, as performance is hardware-dependent: try every value from 1 through 6 and use whichever is fastest on your system. Unsloth Studio automatically applies MTP settings optimized for your specific hardware (Mac, CPU, GPU, etc.); you can still change them later.
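One way to try each draft length is sketched below. It only prints one llama-server command per value, reusing the flags and repo name from this guide's Gemma 4 example. The helper name mtp_commands is hypothetical, and the MODEL value should be replaced with the model you actually use.

```shell
# Repo name copied from the guide's example command; swap in your own model.
MODEL="unsloth/gemma-4-31B-it-GGUF"

# Print one llama-server command per draft length so each can be run and timed in turn.
mtp_commands() {
  for n in 1 2 3 4 5 6; do
    printf 'llama-server -hf %s --spec-type draft-mtp --spec-draft-n-max %s\n' "$MODEL" "$n"
  done
}

mtp_commands
```

Run each printed command in turn, send the same prompt every time, and keep the value that gives the highest generation speed on your hardware.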

Gemma 4 MTP

Google DeepMind trained MTP separately from the original Gemma 4 models, including for the QAT variants. Unlike Qwen, Google released dedicated MTP variants under the assistant name. For best results, we upload only 3 precision options: 8-bit, BF16 and F16. You can access all the MTP models here.

We uploaded GGUFs with an mtp- prefix to each repo, so the command below just works (it uses the 8-bit one):

llama-server -hf unsloth/gemma-4-31B-it-GGUF --spec-type draft-mtp --spec-draft-n-max 4

Table: MTP hardware requirements (units = total memory: RAM + VRAM, or unified memory)

| Gemma 4 variant | 4-bit | 8-bit | BF16 / FP16 |
| --- | --- | --- | --- |
| E2B | 5 GB | 6–9 GB | 11 GB |
| E4B | 6.5–7 GB | 10–13 GB | 17 GB |
| 12B Unified | 8–9 GB | 14–15 GB | 26 GB |
| 26B A4B | 17–18 GB | 29–31 GB | 53 GB |
| 31B | 18–21 GB | 35–39 GB | 63 GB |

To run the Gemma 4 MTP models, follow the steps either for Unsloth Studio or llama.cpp.


Qwen3.6 MTP

Qwen trained MTP directly into the Qwen3.6 and Qwen3.5 models. This enables Qwen3.6 27B MTP to reach 160 tokens/s and Qwen3.6 35B-A3B to reach 240 tokens/s on an RTX 6000 GPU. GGUF uploads:

Table: MTP hardware requirements (units = total memory: RAM + VRAM, or unified memory)

| Qwen3.6 | 3-bit | 4-bit | 6-bit | 8-bit | BF16 |
| --- | --- | --- | --- | --- | --- |
| 27B | 16 GB | 19 GB | 25 GB | 31 GB | 56 GB |
| 35B-A3B | 18 GB | 24 GB | 31 GB | 39 GB | 71 GB |

Below are graphs of inference throughput for MTP vs. no MTP:

We also uploaded MTP GGUFs for the Qwen3.5 model family, including 0.8B, 2B, 4B, 9B, 27B, 35B-A3B, 122B-A10B and 397B-A17B. llama.cpp is continually improving MTP performance, so expect it to get faster over time!

To run the Qwen MTP models, follow the steps either for Unsloth Studio or llama.cpp.

🦥 Unsloth Studio MTP Guide

Unsloth Studio automatically applies MTP settings optimized for your specific hardware (Mac, CPU, GPU, etc.); you can still change them later.

1. Install Unsloth

Run in your terminal:

macOS, Linux, WSL:

Windows PowerShell:

2. Launch Unsloth

macOS, Linux, WSL and Windows:

Then open http://127.0.0.1:8888 (or your specific URL) in your browser.

3. Search and download your desired MTP model

On first launch you will need to create a password to secure your account and sign in again later. Then go to the Studio Chat tab and search for your MTP model (e.g. Qwen3.6 MTP) in the search bar and download your desired model and quant.

4. Run your MTP model

Inference and MTP settings should be set automatically when using Unsloth Studio, but you can still change them manually. You can also edit the context length, chat template and other settings.

For more information, you can view our Unsloth Studio inference guide. Below, the 2-bit Qwen3.6 MTP GGUF made 10+ tool calls, searched 10 sites and executed Python code:

🦙 Llama.cpp MTP Guide

1. Install the latest version of llama.cpp from GitHub here. You can also follow the build instructions below. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference. For Apple Mac / Metal devices, set -DGGML_CUDA=OFF and continue as usual; Metal support is on by default.
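A typical source build might look like the sketch below. It assumes git, cmake and a C++ toolchain are already installed (plus the CUDA toolkit for GPU builds). The function name build_llama_cpp and the GGML_CUDA shell variable are our own illustration, not part of llama.cpp.

```shell
# ON builds with CUDA; set GGML_CUDA=OFF for CPU-only or Apple Metal builds.
GGML_CUDA="${GGML_CUDA:-ON}"

build_llama_cpp() {
  # Clone the sources, configure, build the CLI and server, then copy the binaries up.
  git clone https://github.com/ggml-org/llama.cpp &&
  cmake llama.cpp -B llama.cpp/build -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA="$GGML_CUDA" &&
  cmake --build llama.cpp/build --config Release -j --target llama-cli llama-server &&
  cp llama.cpp/build/bin/llama-* llama.cpp
}
# Start the build with: build_llama_cpp
```

Nothing is cloned or compiled until build_llama_cpp is called, so the CUDA setting can be changed first.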

2. To load models directly with llama.cpp, use one of the commands below. The suffix after the colon (:Q4_K_XL) is the quantization type. You can also download via Hugging Face (step 3). This is similar to ollama run. Use export LLAMA_CACHE="folder" to force llama.cpp to save to a specific location. The model has a maximum context length of 256K.
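A minimal sketch of how the cache folder and the quantization suffix fit together is shown below. The cache path is a hypothetical example, the repo name and MTP flags are copied from this guide's Gemma 4 command, and the exact quant tag should be checked against the repo's file list before running.

```shell
# Hypothetical cache folder; any writable path works.
export LLAMA_CACHE="$PWD/llama-cache"

REPO="unsloth/gemma-4-31B-it-GGUF"   # repo name from the guide's earlier example
QUANT="Q4_K_XL"                      # quantization tag written after the colon

# Assemble the command and print it so it can be checked before running.
CMD="llama-server -hf ${REPO}:${QUANT} --spec-type draft-mtp --spec-draft-n-max 2"
echo "$CMD"
```

Running the printed command downloads the model into the LLAMA_CACHE folder on first use.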

Follow one of the commands for the specific models:


Gemma 4 MTP:

Don't forget to change the model name to your desired Gemma 4 model size, such as Gemma-4-26B-A4B, as the instructions below are for Gemma-4-12B. Note that we provide a GGUF with an mtp- prefix, so the -hf command below should automatically download and use MTP.

Thinking mode:

Please see Gemma 4's new Preserved Thinking.

Non-thinking mode:

Qwen3.6 MTP:

Don't forget to change the model name to your desired Qwen3.6 variant like Qwen3.6-35B-A3B or Qwen3.5 etc. as the instructions below are for Qwen3.6-27B:

Thinking mode (General tasks):

For precise coding tasks, change: temperature=0.6

Please see Qwen3.6's new Preserved Thinking.

Non-thinking mode (General tasks):

3. Download the model via the code below (after running pip install huggingface_hub hf_transfer). You can choose Q4_K_M or other quantized versions like UD-Q4_K_XL. We recommend using at least the 2-bit dynamic quant UD-Q2_K_XL to balance size and accuracy. If downloads get stuck, see: Hugging Face Hub, XET debugging

Gemma 4 MTP:
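As a sketch under stated assumptions, the download could be done from the shell with the hf command that recent huggingface_hub releases install. The repo name is taken from this guide's llama-server example, the local folder is hypothetical, and the *mtp-* file pattern is a guess based on the mtp- prefix mentioned earlier, so check it against the repo's file list. The HF_HUB_ENABLE_HF_TRANSFER variable enables hf_transfer on releases that support it and may be ignored by newer ones.

```shell
# Assumes: pip install huggingface_hub hf_transfer
export HF_HUB_ENABLE_HF_TRANSFER=1   # opt in to hf_transfer where supported

REPO="unsloth/gemma-4-31B-it-GGUF"   # repo name from the guide's example command
QUANT="UD-Q4_K_XL"                   # quant mentioned in the guide
LOCAL_DIR="gemma-4-31B-it-GGUF"      # hypothetical download folder

download_mtp_model() {
  # Fetch the chosen quant, then the mtp- prefixed files (pattern is an assumption).
  hf download "$REPO" --include "*${QUANT}*" --local-dir "$LOCAL_DIR" &&
  hf download "$REPO" --include "*mtp-*" --local-dir "$LOCAL_DIR"
}
# Start the download with: download_mtp_model
```

Each call passes a single --include pattern, which keeps the commands simple at the cost of a second invocation.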

Qwen3.6 MTP:

4. Then run the model in conversation mode:

Gemma 4 MTP:

Qwen3.6 MTP:
