Gemma-4-26B-A4B-IT — Claude Opus 4.6/4.7 Reasoning Fine-tune · GGUF (Unsloth)
GGUF (llama.cpp) quantizations of a fine-tune of google/gemma-4-26B-A4B-it (via the Unsloth-fixed checkpoint unsloth/gemma-4-26b-a4b-it), trained on angrygiraffe/claude-opus-4.6-4.7-reasoning-8.7k — a ~8.7k-example reasoning trace dataset distilled from Claude Opus 4.6 / 4.7.
These files are designed for CPU / GPU inference with llama.cpp and downstream runtimes (Ollama, LM Studio, GPT4All, KoboldCpp, text‑generation‑webui, llama-cpp-python, etc.) and include a separate multimodal projector (mmproj) so the vision tower can be loaded for image inputs.
Quantized with Unsloth from the bf16 fine-tune. See the original (unquantized) HF Transformers weights for full multimodal (audio + video) support.
Available Quants
All quants are derived from the same bf16 fine-tune. Pair any text quant with the mmproj file to enable image input.
| File | Bits | Size | Recommended for | Notes |
|---|---|---|---|---|
| `gemma-4-26b-a4b-it.BF16-mmproj.gguf` | bf16 | 1.19 GB | Multimodal projector | Required `--mmproj` companion for image input. Not a text model on its own. |
| `gemma-4-26b-a4b-it.Q2_K_L.gguf` | ~2.6 bpw | 10.76 GB | Lowest-VRAM / CPU-only | Largest quality loss; usable for chat on modest hardware. |
| `gemma-4-26b-a4b-it.Q3_K_M.gguf` | ~3.4 bpw | 13.29 GB | 16 GB VRAM | Decent quality / size trade-off. |
| `gemma-4-26b-a4b-it.Q4_K_M.gguf` | ~4.5 bpw | 16.80 GB | Recommended default | Best balance of quality and footprint for most users. |
| `gemma-4-26b-a4b-it.Q5_K_M.gguf` | ~5.5 bpw | 19.13 GB | 24 GB VRAM | Very close to bf16 quality. |
| `gemma-4-26b-a4b-it.Q6_K.gguf` | ~6.6 bpw | 22.64 GB | High-fidelity | Near-lossless vs. bf16 for almost all use cases. |
| `gemma-4-26b-a4b-it.Q8_0.gguf` | 8.0 bpw | 26.86 GB | Reference / eval | Effectively lossless; largest text quant. |
Note: this is a sparse MoE (~26 B total / ~4 B active per token). Memory footprint is dominated by the stored experts (all 128), so file sizes scale like a 26 B dense model — but inference compute scales like a 4 B model since only the top-8 experts run per token.
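The arithmetic behind the note can be sketched in a few lines. The figures are the approximate ones quoted above (~26 B total, ~4 B active, 128 experts, top-8 routing); the function name is illustrative only.

```python
def moe_summary(total_params_b=26.0, active_params_b=4.0,
                num_experts=128, experts_per_token=8):
    """Return the fractions of parameters and experts used per token."""
    return {
        # Share of all weights that take part in computing one token.
        "active_param_fraction": active_params_b / total_params_b,
        # Share of the stored experts that the router selects per token.
        "active_expert_fraction": experts_per_token / num_experts,
    }


summary = moe_summary()
print(f"active parameters per token: {summary['active_param_fraction']:.1%}")
print(f"active experts per token:    {summary['active_expert_fraction']:.1%}")
```

All 128 experts still have to be stored, which is why the file sizes track the total parameter count rather than the active one. The parameter fraction is higher than the expert fraction, plausibly because non-expert weights (attention, embeddings) run for every token.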
Sizing guide
| Hardware | Suggested quant |
|---|---|
| 8 GB GPU / 16 GB RAM CPU | Q2_K_L (offload some layers) |
| 12 GB GPU | Q3_K_M (offload some layers; the file is 13.29 GB) |
| 16 GB GPU | Q4_K_M (offload some layers; the file is 16.80 GB) |
| 24 GB GPU | Q5_K_M or Q6_K |
| 32 GB+ GPU / dual-GPU | Q6_K or Q8_0 |
| Apple Silicon (≥32 GB unified) | Q4_K_M – Q6_K |
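As a rough cross-check of the sizing guide, the sketch below picks the largest text quant whose file fits entirely in a given memory budget. The file sizes come from the Available Quants table; the `overhead_gb` reserve for KV cache and runtime buffers is an assumed placeholder, not a measured value. Because it assumes no CPU offload, it is stricter than the table above.

```python
# File sizes in GB, copied from the Available Quants table.
QUANT_SIZES_GB = {
    "Q2_K_L": 10.76,
    "Q3_K_M": 13.29,
    "Q4_K_M": 16.80,
    "Q5_K_M": 19.13,
    "Q6_K": 22.64,
    "Q8_0": 26.86,
}


def largest_quant_that_fits(memory_gb, overhead_gb=2.0):
    """Return the biggest quant that fits fully in memory, or None.

    overhead_gb is a hypothetical reserve for KV cache and buffers;
    tune it for your context length and runtime.
    """
    budget = memory_gb - overhead_gb
    fitting = [(size, name) for name, size in QUANT_SIZES_GB.items() if size <= budget]
    return max(fitting)[1] if fitting else None


for memory in (8, 16, 24, 32):
    print(memory, "GB ->", largest_quant_that_fits(memory))
```

A result of `None` means no quant fits without offloading some layers to system RAM.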
Model Summary
| Property | Description |
|---|---|
| Base model | unsloth/gemma-4-26b-a4b-it (google/gemma-4-26B-A4B-it) |
| Architecture | Gemma 4 (Mixture-of-Experts, multimodal) |
| Total parameters | ~26 B |
| Active parameters / token | ~4 B (MoE: 128 experts, top-8 routing) |
| Modalities (in GGUF) | Text + Image (via mmproj). Audio/video not supported in current llama.cpp. |
| Max context length | 262,144 tokens (262K) — limited in practice by your KV cache budget |
| Vocab size | 262,144 |
| Source dtype | bfloat16 |
| Chat template | Gemma-4 conversational template with the `<\|channel>thought` reasoning channel |
| Quantization | Unsloth-derived K-quants from llama.cpp |
| Fine-tuning dataset | angrygiraffe/claude-opus-4.6-4.7-reasoning-8.7k |
| License | GPL-3.0 (this fine-tune); base model under Gemma Terms of Use |
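The "KV cache budget" caveat on context length can be made concrete with the standard full-attention estimate: keys and values for every layer, KV head, and cached position. The layer count, KV head count, and head dimension below are hypothetical placeholders, not the real Gemma 4 configuration. The formula also ignores any cache-saving features a runtime may apply, so treat it as an upper-bound sketch.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_elem=2):
    """Upper-bound KV cache size: keys + values for every layer and position."""
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem


# Placeholder shape for illustration only, NOT the real Gemma 4 config.
HYPOTHETICAL_SHAPE = dict(n_layers=32, n_kv_heads=8, head_dim=128)

for n_ctx in (8192, 262144):
    gib = kv_cache_bytes(n_ctx=n_ctx, **HYPOTHETICAL_SHAPE) / 2**30
    print(f"n_ctx={n_ctx:>6}: {gib:.1f} GiB of f16 KV cache (hypothetical shape)")
```

With this placeholder shape the cache grows from 1.0 GiB at 8,192 tokens to 32.0 GiB at 262,144 tokens, which illustrates why the usable context is usually far below the maximum.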
How to Run
Requires a build of llama.cpp (or a downstream runtime) with Gemma 4 support. Older builds will fail to load the GGUF.
llama.cpp — text only
# Build (one-time)
git clone https://github.com/ggml-org/llama.cpp && cd llama.cpp
cmake -B build -DGGML_CUDA=ON # or -DGGML_METAL=ON on macOS
cmake --build build -j
# Interactive chat
./build/bin/llama-cli \
-m gemma-4-26b-a4b-it.Q4_K_M.gguf \
-c 8192 \
-ngl 99 \
--temp 1.0 --top-p 0.95 --top-k 64 \
-cnv
llama.cpp — multimodal (image input)
Pair any text quant with the BF16-mmproj.gguf companion:
./build/bin/llama-mtmd-cli \
-m gemma-4-26b-a4b-it.Q4_K_M.gguf \
--mmproj gemma-4-26b-a4b-it.BF16-mmproj.gguf \
-c 8192 -ngl 99 \
--image path/to/picture.png \
-p "Describe the picture and reason about anything unusual."
OpenAI-compatible server
./build/bin/llama-server \
-m gemma-4-26b-a4b-it.Q4_K_M.gguf \
--mmproj gemma-4-26b-a4b-it.BF16-mmproj.gguf \
-c 8192 -ngl 99 \
--host 0.0.0.0 --port 8080 \
--jinja
`--jinja` makes the server use the model's embedded Gemma-4 chat template (with the `<|channel>thought` reasoning channel and native tool-calling).
Then point any OpenAI client at http://localhost:8080/v1.
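A minimal client sketch using only the Python standard library, assuming the server above is listening on `http://localhost:8080/v1`. The helper names are illustrative. Passing `top_k` in the request body and sending images as base64 `image_url` data URIs are assumptions about what your llama-server build accepts, so check them against its documentation.

```python
import base64
import json
import urllib.request

BASE_URL = "http://localhost:8080/v1"  # matches --host/--port above


def build_chat_payload(prompt, image_bytes=None, mime="image/png"):
    """Build a chat-completions request body, optionally with one image."""
    content = prompt
    if image_bytes is not None:
        # Assumed format: OpenAI-style image_url part carrying a data URI.
        encoded = base64.b64encode(image_bytes).decode("ascii")
        content = [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{encoded}"}},
        ]
    return {
        "messages": [{"role": "user", "content": content}],
        "temperature": 1.0,
        "top_p": 0.95,
        "top_k": 64,  # non-OpenAI field; assumed to be accepted by llama-server
    }


def chat(payload, base_url=BASE_URL):
    """POST the payload to the running server and return the reply text."""
    request = urllib.request.Request(
        base_url + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(chat(build_chat_payload("What is the capital of France?")))` sends a text-only request. For an image, pass the file contents, for example `build_chat_payload("Describe the picture.", open("path/to/picture.png", "rb").read())`.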
Ollama
# Create a Modelfile
cat > Modelfile <<'EOF'
FROM ./gemma-4-26b-a4b-it.Q4_K_M.gguf
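# Multimodal projector: ADAPTER is reserved for LoRA adapters, so attach the
# projector with a second FROM line (verify support in your Ollama version)
FROM ./gemma-4-26b-a4b-it.BF16-mmproj.gguf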
PARAMETER temperature 1.0
PARAMETER top_p 0.95
PARAMETER top_k 64
PARAMETER num_ctx 8192
EOF
ollama create gemma4-26b-a4b-reasoning -f Modelfile
ollama run gemma4-26b-a4b-reasoning
LM Studio / Jan / GPT4All / KoboldCpp / text-generation-webui
Drop the chosen *.gguf (and the BF16-mmproj.gguf if you want image support) into the app's model directory. Make sure the runtime ships a llama.cpp build with Gemma 4 support.
llama-cpp-python
from llama_cpp import Llama

# Load the text quant; n_gpu_layers=-1 offloads every layer to the GPU.
llm = Llama(
    model_path="gemma-4-26b-a4b-it.Q4_K_M.gguf",
    n_ctx=8192,
    n_gpu_layers=-1,
    # chat_format is left unset so llama-cpp-python uses the chat template
    # embedded in the GGUF metadata. Setting chat_format="gemma" would select
    # the library's built-in Gemma formatter instead of the embedded template.
)

resp = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a careful, step-by-step reasoner."},
        {"role": "user", "content": "If a train leaves at 9:15 and travels for 2h 47m, when does it arrive?"},
    ],
    temperature=1.0, top_p=0.95, top_k=64,
    max_tokens=1024,
)
print(resp["choices"][0]["message"]["content"])
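Long reasoning outputs are easier to follow when streamed. The sketch below assumes `create_chat_completion(..., stream=True)` yields OpenAI-style chunks whose text arrives in `choices[0]["delta"]["content"]`. The `collect_stream` helper is illustrative and not part of llama-cpp-python.

```python
def collect_stream(chunks, on_text=None):
    """Concatenate streamed text deltas, optionally echoing each piece."""
    parts = []
    for chunk in chunks:
        delta = chunk["choices"][0].get("delta", {})
        text = delta.get("content")
        if text:  # role-only and final chunks carry no text
            parts.append(text)
            if on_text is not None:
                on_text(text)
    return "".join(parts)


# Usage with a loaded model (not executed here):
# stream = llm.create_chat_completion(messages=[...], stream=True)
# answer = collect_stream(stream, on_text=lambda t: print(t, end="", flush=True))
```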
Recommended Sampling
From the source generation_config.json:
| Param | Value |
|---|---|
| `temperature` | 1.0 |
| `top_p` | 0.95 |
| `top_k` | 64 |
| `eos_token_id` | [1, 106, 50] |
| `pad_token_id` | 0 |
| `bos_token_id` | 2 |
For more stable, less varied reasoning output, drop temperature to ~0.3–0.6. This reduces randomness but does not make generation fully deterministic.
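To show what these three knobs do, here is a small pure-Python sketch that turns raw logits into the filtered distribution a sampler would draw from. It applies temperature first, then top-k, then top-p. That ordering is an assumption for illustration, since runtimes differ in the order they apply these steps, and the function is not taken from llama.cpp.

```python
import math


def filtered_distribution(logits, temperature=1.0, top_k=64, top_p=0.95):
    """Return {token_index: probability} after temperature, top-k and top-p."""
    # Temperature: divide logits before softmax (lower = sharper distribution).
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # subtract max for stability
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-k: keep only the k most probable tokens.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]

    # Top-p: keep the smallest prefix whose cumulative mass reaches top_p.
    kept, cumulative = [], 0.0
    for index in order:
        kept.append(index)
        cumulative += probs[index]
        if cumulative >= top_p:
            break

    # Renormalize so the surviving probabilities sum to 1.
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


logits = [2.0, 1.0, 0.0, -1.0]
print(sorted(filtered_distribution(logits, temperature=1.0, top_p=0.9)))
print(sorted(filtered_distribution(logits, temperature=0.5, top_p=0.9)))
```

On this toy input, lowering the temperature from 1.0 to 0.5 shrinks the surviving candidate set from three tokens to two, which is the narrowing effect the temperature advice above relies on.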
Chat Template & Reasoning Channel
The Gemma-4 chat template is embedded in every GGUF, and supports:
- Role turns: `<|turn>system|user|model<turn|>`
- Thinking channel: `<|channel>thought ... <channel|>`, enabled when the template receives `enable_thinking=true` (llama-server: send `"chat_template_kwargs": {"enable_thinking": true}` in the request, or use `/v1/chat/completions` with `"reasoning_effort": "high"` on a compatible client).
- Tool declarations: `<|tool>…<tool|>`
- Tool calls / responses: `<|tool_call>…<tool_call|>` / `<|tool_response>…<tool_response|>`
- Multimodal placeholders: `<|image|>`, `<|audio|>`, `<|video|>` (only `<|image|>` is wired up in llama.cpp via the `mmproj`).
When generation starts without enable_thinking, the template emits an empty <|channel>thought<channel|> block to suppress reasoning. Pass enable_thinking=true to unlock the model's full chain-of-thought.
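A small sketch of both sides of the thinking channel: adding the `chat_template_kwargs` field described above to a request body, and separating a thought block from the final answer in raw output. The helper names are illustrative. The parser assumes the markers appear literally in the returned text; depending on server settings, the reasoning may instead arrive in a separate response field.

```python
import re

# Matches one thought block as written in the template description above.
THOUGHT_RE = re.compile(r"<\|channel>thought(.*?)<channel\|>", re.DOTALL)


def with_thinking(payload, enabled=True):
    """Return a copy of a chat-completions payload with thinking toggled."""
    updated = dict(payload)
    updated["chat_template_kwargs"] = {"enable_thinking": enabled}
    return updated


def split_thinking(text):
    """Split raw model text into (thought, answer); thought is '' if absent."""
    match = THOUGHT_RE.search(text)
    if match is None:
        return "", text.strip()
    answer = (text[:match.start()] + text[match.end():]).strip()
    return match.group(1).strip(), answer


raw = "<|channel>thought\n9:15 plus 2h 47m<channel|>It arrives at 12:02."
thought, answer = split_thinking(raw)
print("thought:", thought)
print("answer: ", answer)
```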
Training
| Property | Details |
|---|---|
| Method | Supervised Fine-Tuning (SFT) on reasoning traces |
| Framework | Unsloth + 🤗 Transformers / TRL |
| Precision | bf16 |
| Dataset size | ~8,700 multi-turn reasoning examples |
| Dataset source | Reasoning rollouts distilled from Claude Opus 4.6 / 4.7 |
| Reasoning format | Preserves Gemma-4's native `<\|channel>thought ... <channel\|>` thinking channel |
The training corpus emphasizes:
- Long, structured chain-of-thought reasoning
- Math, code, logic and step-wise problem decomposition
- Self-verification and answer revision patterns
- Instruction following with explicit thinking → answer separation
Reasoning data is distilled from Anthropic's Claude models. Outputs may reflect stylistic patterns of Claude (e.g. hedged tone, explicit step labels, "Let me think…" preambles).
Intended Use
Primary use cases
- Local reasoning-heavy assistants (math, coding, agentic planning) on commodity GPUs / Apple Silicon
- Multimodal Q&A over images
- Long-context summarization, retrieval, and document analysis
- Tool-calling / function-calling agents (template-native)
- Edge / offline deployments where bf16 weights are too large
Out-of-scope / not recommended
- High-stakes decisions (medical, legal, financial advice) without human review
- Generation of disallowed content under the Gemma Prohibited Use Policy
- Safety-critical autonomous deployments without guardrails
Files
| File | Purpose |
|---|---|
| `gemma-4-26b-a4b-it.BF16-mmproj.gguf` | Vision tower / multimodal projector (bf16). Use with `--mmproj` for image inputs. |
| `gemma-4-26b-a4b-it.Q2_K_L.gguf` | Q2_K_L text quant (smallest). |
| `gemma-4-26b-a4b-it.Q3_K_M.gguf` | Q3_K_M text quant. |
| `gemma-4-26b-a4b-it.Q4_K_M.gguf` | Q4_K_M text quant (recommended default). |
| `gemma-4-26b-a4b-it.Q5_K_M.gguf` | Q5_K_M text quant. |
| `gemma-4-26b-a4b-it.Q6_K.gguf` | Q6_K text quant. |
| `gemma-4-26b-a4b-it.Q8_0.gguf` | Q8_0 text quant (near-lossless). |
| `export_metadata.json` | Export provenance. |
Limitations & Biases
- Quantization loss: Lower-bit quants (`Q2_K_L`, `Q3_K_M`) will degrade reasoning quality, especially on long chains of thought. Prefer `Q4_K_M` or higher for reasoning tasks.
- MoE quirks: K-quant kernels for MoE experts are still being optimized in llama.cpp; performance and quality may improve in newer builds.
- Multimodal scope in GGUF: Only image input is supported via `mmproj`. Audio and video inputs require the original Transformers checkpoint.
- Hallucinations: Like all LLMs, the model can produce confident but incorrect answers, especially outside its training distribution.
- Reasoning style transfer: Because SFT data is distilled from Claude, stylistic and refusal patterns may leak into outputs.
- Dataset size: ~8.7k examples is small; expect a targeted style/reasoning shift rather than broad capability uplift.
- Safety: No additional safety fine-tuning was performed. Base Gemma-4 safety guarantees apply; add your own guardrails for production.
License
- This fine-tune: GPL-3.0
- Base model: subject to the Gemma Terms of Use and Gemma Prohibited Use Policy. You must comply with both when using or redistributing this model.
- Training data: see the dataset card for terms.
Citation
@misc{gemma4_2025,
title = {Gemma 4},
author = {Google DeepMind},
year = {2025},
url = {https://ai.google.dev/gemma}
}
@misc{unsloth,
title = {Unsloth: 2x faster LLM fine-tuning with 70% less memory},
author = {Daniel Han and Michael Han and {Unsloth team}},
year = {2024-2026},
url = {https://github.com/unslothai/unsloth}
}
@misc{llama_cpp,
title = {llama.cpp},
author = {Georgi Gerganov and contributors},
year = {2023-2026},
url = {https://github.com/ggml-org/llama.cpp}
}
@misc{claude_reasoning_8k7,
title = {claude-opus-4.6-4.7-reasoning-8.7k},
author = {angrygiraffe},
year = {2026},
url = {https://huggingface.co/datasets/angrygiraffe/claude-opus-4.6-4.7-reasoning-8.7k}
}
Acknowledgements
- Google DeepMind — Gemma-4 base model
- Unsloth team — Quant-fixed checkpoint, training framework, and GGUF quantization
- Georgi Gerganov & llama.cpp contributors — GGUF format and inference runtime
- angrygiraffe — Reasoning distillation dataset
- Anthropic — Source model family (Claude Opus 4.6 / 4.7) for the distilled reasoning traces