Aquileo | fix: gpt-oss disallows quantized cache types due to its use of attention sinks by louis-jan · Pull Request #7541 · janhq/jan

louis-jan · 2026-02-19T14:01:39Z

Describe Your Changes

GPT-OSS explicitly disallows quantized cache types because it uses attention sinks. Loading this model with the MLX backend currently throws "quantized attention does not support zero sinks".
QuantizedKVCache should be enabled via settings, or at least checked for model support, rather than being enabled by default.
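
The gating described above could look roughly like the following. This is a minimal sketch, not the code in this PR: KVCacheSettings, SINK_MODEL_MARKERS, uses_attention_sinks, and resolve_kv_quantization are hypothetical names, and matching on the model name is only a stand-in for whatever support check the follow-up work ends up using (for example, inspecting the model configuration).

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical list of name fragments for models assumed to use attention sinks.
SINK_MODEL_MARKERS = ("gpt-oss", "gpt_oss")


@dataclass
class KVCacheSettings:
    """Hypothetical user-facing settings; quantization is opt-in."""
    enabled: bool = False
    kv_bits: int = 4
    kv_group_size: int = 64
    quantized_kv_start: int = 0


def uses_attention_sinks(model_id: str) -> bool:
    # Assumption: a name match stands in for a real capability check.
    name = model_id.lower()
    return any(marker in name for marker in SINK_MODEL_MARKERS)


def resolve_kv_quantization(model_id: str, settings: KVCacheSettings) -> Optional[dict]:
    """Return KV-quantization parameters, or None to leave the cache unquantized."""
    if not settings.enabled:
        return None  # never quantize by default
    if uses_attention_sinks(model_id):
        return None  # quantized attention does not support sinks
    return {
        "kvBits": settings.kv_bits,
        "kvGroupSize": settings.kv_group_size,
        "quantizedKVStart": settings.quantized_kv_start,
    }
```

With this shape, a caller would only pass the quantization parameters on when the function returns a dictionary, so a GPT-OSS model loads with an unquantized cache even if the user has switched the setting on.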

Fixes Issues

Closes #

Self Checklist

Added relevant comments, esp in complex areas
Updated docs (for bug fixes / features)
Created issues for follow-up changes or refactoring needed

Copilot

Pull request overview

This PR prevents MLX inference from enabling quantized KV cache settings by default (to avoid failures with GPT-OSS models that use attention sinks), and improves streaming usage reporting by populating prompt/total token counts from MLX generation info.

Changes:

Stop passing kvBits, kvGroupSize, and quantizedKVStart into GenerateParameters (disables default KV-cache quantization).
Populate UsageInfo.prompt_tokens and UsageInfo.total_tokens for streaming completions using info.promptTokenCount and info.generationTokenCount.
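
The usage change can be illustrated with a small sketch. The types below are hypothetical stand-ins modeled on the field names mentioned above, not the project's actual definitions. In particular, the completion_tokens field and the rule that the total is the sum of prompt and generated tokens are assumptions.

```python
from dataclasses import dataclass


@dataclass
class GenerationInfo:
    """Hypothetical stand-in for the info object reported at the end of a stream."""
    prompt_token_count: int
    generation_token_count: int


@dataclass
class UsageInfo:
    """Hypothetical usage record with OpenAI-style field names."""
    prompt_tokens: int
    completion_tokens: int
    total_tokens: int


def usage_from_info(info: GenerationInfo) -> UsageInfo:
    # Assumption: total tokens = prompt tokens + generated tokens.
    return UsageInfo(
        prompt_tokens=info.prompt_token_count,
        completion_tokens=info.generation_token_count,
        total_tokens=info.prompt_token_count + info.generation_token_count,
    )
```

For example, a stream that consumed 12 prompt tokens and produced 30 generated tokens would report a total of 42 under this assumption.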

louis-jan · 2026-02-19T14:11:09Z

There will be a follow-up PR to enable the quantized KV cache properly.

Commit e51506c: fix: gpt-oss disallows quantized cache types due to its use of attention sinks

Copilot AI review requested due to automatic review settings February 19, 2026 14:01

github-project-automation Bot added this to Jan Feb 19, 2026

github-actions Bot assigned louis-jan Feb 19, 2026

Copilot started reviewing on behalf of louis-jan February 19, 2026 14:02

Copilot AI reviewed Feb 19, 2026

urmauur approved these changes Feb 23, 2026

louis-jan merged commit d07dfaf into main Feb 23, 2026
7 checks passed

louis-jan deleted the fix/quantized_attention_does_not_support_zero_sinks branch February 23, 2026 09:33

github-project-automation Bot moved this to QA in Jan Feb 23, 2026

louis-jan merged 1 commit into main from fix/quantized_attention_does_not_support_zero_sinks

3 participants
