Aquileo | yuxinlu1/gemma-4-12B-coder-fable5-composer2.5-v1-GGUF · MTP?!

MTP?!

#5
by victor11sk - opened

Can you add MTP support?

It already supports MTP. You can download the MTP draft head from my other, general-purpose model repo.

@victor11sk Did you get it working? How did you configure it?

How do I configure it?

@victor11sk @JaJones11 @bagaszai12 Here's the full setup 👇

1. Build requirement: you need a recent llama.cpp, build b9553 or newer (the Gemma-4 MTP arch landed in
PR #23398). Older builds fail with the error unknown model architecture: 'gemma4-assistant'.
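To make the build requirement easy to check in a script, here is a minimal sketch of a comparison helper. The function name build_ok and the variable MIN_BUILD are hypothetical; only the b9553 threshold comes from the step above.

```shell
#!/bin/sh
# Minimum llama.cpp build quoted in step 1 (b9553).
MIN_BUILD=9553

# build_ok BUILD
# Hypothetical helper: succeeds if BUILD (e.g. "b9553" or "9553")
# is at least MIN_BUILD. Returns 2 if BUILD is not a plain number.
build_ok() {
  n=${1#b}                     # drop an optional leading "b"
  case "$n" in
    ''|*[!0-9]*) return 2 ;;   # reject empty or non-numeric input
  esac
  [ "$n" -ge "$MIN_BUILD" ]
}
```

You would pass in whatever build number your binary reports, for example `build_ok b9553 && echo "new enough"`. Extracting that number automatically (say, from the output of `llama-server --version`) depends on the exact output format of your build, so treat any parsing step as an assumption to verify locally.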

2. Grab the MTP draft head — it lives in the MTP/ folder of my general-purpose repo. It's the original Gemma-4
assistant head, fully compatible with this coding model (same base & vocab):

hf download yuxinlu1/gemma-4-12B-it-Claude-4.6-4.8-Opus-GGUF --include "MTP/*"

3. Run with speculative decoding — note the --spec-type draft-mtp flag (this is not a generic -md draft):

llama-server -m gemma4-coding-Q4_K_M.gguf \
  --model-draft MTP/gemma-4-12B-it-MTP-Q8_0.gguf \
  --spec-type draft-mtp --spec-draft-n-max 4 \
  -ngl 99 -ngld 99 -fa on --jinja

For a quick speed A/B, use llama-cli --single-turn instead — llama-completion does not support
--model-draft.
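Once you have a tokens-per-second figure from a run without the draft and one with it, the A/B result is just their ratio. A small sketch for that arithmetic follows; the function name speedup is hypothetical, and the two numbers are whatever generation speeds you read off your own runs.

```shell
#!/bin/sh
# speedup BASE_TPS SPEC_TPS
# Hypothetical helper: prints SPEC_TPS / BASE_TPS with two decimals.
# BASE_TPS = tokens/s without the draft, SPEC_TPS = tokens/s with it.
# Fails (exit status 1) if BASE_TPS is not positive.
speedup() {
  awk -v base="$1" -v spec="$2" 'BEGIN {
    if (base <= 0) exit 1
    printf "%.2f\n", spec / base
  }'
}
```

For example, `speedup 40 50` prints 1.25. Use the same prompt, sampling settings, and context length for both runs, otherwise the ratio is not comparable.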

Heads-up on speed: this draft is the original Gemma-4 assistant head, not re-aligned to my fine-tune, so on my
RTX 5090 I measured a ~1.2–1.3× speedup (greedy / thinking). Speculative decoding is lossless, since the main
model verifies every drafted token, so output quality is identical and only throughput changes. Re-training the
draft for a higher accept rate is on my list but not done yet.
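To see why the accept rate matters, here is a rough sketch of the standard speculative-decoding estimate: if each drafted token is accepted independently with probability a and the draft proposes n tokens per step, the expected number of tokens produced per verification step is (1 - a^(n+1)) / (1 - a). The independence assumption is a simplification, the function name expected_tokens is hypothetical, and mapping n to --spec-draft-n-max is my assumption. This is not something llama.cpp reports in this form.

```shell
#!/bin/sh
# expected_tokens ACCEPT_RATE DRAFT_LEN
# Hypothetical helper: expected tokens per verification step, assuming
# each drafted token is accepted independently with probability ACCEPT_RATE.
# Fails (exit status 1) if ACCEPT_RATE is outside [0, 1].
expected_tokens() {
  awk -v a="$1" -v n="$2" 'BEGIN {
    if (a < 0 || a > 1) exit 1
    if (a == 1) { printf "%.4f\n", n + 1; exit }   # every draft token accepted
    printf "%.4f\n", (1 - a ^ (n + 1)) / (1 - a)
  }'
}
```

With a draft length of 4, `expected_tokens 0.5 4` prints 1.9375. Actual wall-clock speedup is lower than this figure because running the draft itself costs time.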

yuxinlu1 pinned discussion
