fix: gpt-oss disallows quantized cache types due to its use of attention sinks #7541

Merged
louis-jan merged 1 commit into main from fix/quantized_attention_does_not_support_zero_sinks on Feb 23, 2026

@louis-jan

Contributor

Describe Your Changes

  • GPT-OSS explicitly disallows quantized cache types because it uses attention sinks. Loading this model with the MLX backend currently throws "quantized attention does not support zero sinks".
  • QuantizedKVCache should be opt-in via settings, or at least gated on a check that the model supports it, rather than enabled by default.
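
The gating described in the second bullet could look roughly like the sketch below. It is only an illustration, not the code in this PR or the planned follow-up: `KVQuantSettings`, `resolve_kv_quantization`, and the `uses_attention_sinks` flag are hypothetical names, the numeric defaults are placeholders, and the fields merely mirror `kvBits`, `kvGroupSize`, and `quantizedKVStart`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class KVQuantSettings:
    """Hypothetical user settings; fields mirror kvBits, kvGroupSize, quantizedKVStart."""
    enabled: bool = False        # opt-in: quantization stays off unless the user asks for it
    kv_bits: int = 4             # placeholder value, not taken from the PR
    kv_group_size: int = 64      # placeholder value, not taken from the PR
    quantized_kv_start: int = 0  # placeholder value, not taken from the PR


def resolve_kv_quantization(
    settings: KVQuantSettings, uses_attention_sinks: bool
) -> Optional[KVQuantSettings]:
    """Return the settings to apply, or None to keep the unquantized KV cache."""
    if not settings.enabled:
        return None  # not requested by the user
    if uses_attention_sinks:
        return None  # assumed support check: sink-based models (e.g. GPT-OSS) are skipped
    return settings


# A GPT-OSS-style model falls back to the plain cache even when the user opts in.
print(resolve_kv_quantization(KVQuantSettings(enabled=True), uses_attention_sinks=True))
```

Returning `None` rather than raising keeps model loading working for sink-based models; how the backend would actually detect attention sinks is left open here.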

Fixes Issues

  • Closes #
  • Closes #

Self Checklist

  • Added relevant comments, esp in complex areas
  • Updated docs (for bug fixes / features)
  • Created issues for follow-up changes or refactoring needed

Copilot AI review requested due to automatic review settings February 19, 2026 14:01

Copilot AI left a comment

Pull request overview

This PR prevents MLX inference from enabling quantized KV cache settings by default (to avoid failures with GPT-OSS models that use attention sinks), and improves streaming usage reporting by populating prompt/total token counts from MLX generation info.

Changes:

  • Stop passing kvBits, kvGroupSize, and quantizedKVStart into GenerateParameters (disables default KV-cache quantization).
  • Populate UsageInfo.prompt_tokens and UsageInfo.total_tokens for streaming completions using info.promptTokenCount and info.generationTokenCount.
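
As a rough illustration of the second change, the sketch below derives usage counts from generation info. It assumes `total_tokens` is the sum of the prompt and generation counts, which the summary implies but does not state. `GenerationInfo`, `usage_from_info`, and the `completion_tokens` field are stand-ins for illustration, not the PR's actual Swift types.

```python
from dataclasses import dataclass


@dataclass
class GenerationInfo:
    """Stand-in for the MLX generation info; only the two counts named in the PR."""
    prompt_token_count: int      # mirrors info.promptTokenCount
    generation_token_count: int  # mirrors info.generationTokenCount


@dataclass
class UsageInfo:
    """Stand-in usage payload; completion_tokens is an assumed field."""
    prompt_tokens: int
    completion_tokens: int
    total_tokens: int


def usage_from_info(info: GenerationInfo) -> UsageInfo:
    """Build usage counts for a streamed completion from generation info."""
    return UsageInfo(
        prompt_tokens=info.prompt_token_count,
        completion_tokens=info.generation_token_count,
        # Assumption: total is prompt plus generated tokens.
        total_tokens=info.prompt_token_count + info.generation_token_count,
    )


usage = usage_from_info(GenerationInfo(prompt_token_count=12, generation_token_count=30))
print(usage.total_tokens)  # 42
```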

@louis-jan

Contributor Author

There will be a follow-up PR to enable quantized kvcache properly.

@louis-jan louis-jan merged commit d07dfaf into main Feb 23, 2026
7 checks passed
@louis-jan louis-jan deleted the fix/quantized_attention_does_not_support_zero_sinks branch February 23, 2026 09:33
@github-project-automation github-project-automation Bot moved this to QA in Jan Feb 23, 2026
Labels

None yet

Projects

Status: QA

3 participants