Skip to content

fix(evals): wait for a real painted frame before classifying a widget render#2563

Merged
chelojimenez merged 3 commits into
mainfrom
claude/widget-paint-fix
Jun 11, 2026
Merged

fix(evals): wait for a real painted frame before classifying a widget render#2563
chelojimenez merged 3 commits into
mainfrom
claude/widget-paint-fix

Conversation

@chelojimenez

@chelojimenez chelojimenez commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

Problem

In the eval Render Observations, a genuinely-rendered widget could be reported as "Blank screenshot — rendered but painted blank." The clearest repro is the Excalidraw quickstart: the observation snapshot was blank (captured at ~250ms), while the same create_view widget rendered the Start → End diagram perfectly in the live trace.

So the harness was misreporting a working render as a failure — a false negative — and the screenshot it stored was blank/half-drawn.

Root cause

The harness snapshotted + classified blankness the instant the bridge handshake completed, using isDocumentBlank (a DOM check). Two compounding issues:

  1. Handshaking ≠ painting. Data/CDN-driven widgets only draw after they receive their tool data and fetch their code (e.g. Excalidraw from esm.sh) — seconds later. So whether the snapshot caught content depended on CDN cache warmth (warm: "rendered"; cold: "blank").
  2. DOM-presence ≠ pixels. A <canvas> widget mounts its canvas element (a layout box, so "non-blank" to the DOM) before it draws into it. Even the DOM check passed while the canvas was still empty, so the stored screenshot was blank.

Fix

Add a pixel-level paint wait before snapshotting/classifying. Once the bridge is live, poll real screenshots and measure per-channel stdev with sharp until the frame is non-uniform (actual pixels drawn) or a new paintTimeoutMs budget elapses, then let it settle (networkidle + a short grace) and capture.

  • A truly-blank frame has per-channel stdev 0; any real content (even a small button) clears the threshold, so genuinely-blank widgets still classify as blank_screenshot.
  • Fast widgets return on the first frame (no added latency); only a genuinely-blank widget waits the full budget.
  • The stored screenshot is now the finished widget, so the eval UI shows the painted render instead of a blank/half-drawn one.

Verification

  • Live Excalidraw quickstart with default budgets: now status: "rendered" and the captured observation screenshot is the complete Start → End diagram (previously blank).
  • New deterministic regression test: a widget that handshakes immediately but paints ~600ms later is classified rendered, not blank_screenshot.
  • Full harness suite green (26/26), including the genuinely-blank fixture still resolving to blank_screenshot.

Follow-up to #2553.


Generated by Claude Code


Note

Medium Risk
Changes render observation timing and classification for all browser-backed widget evals; mis-tuned budgets could add latency or misclassify edge cases, but scope is limited to the harness with regression tests.

Overview
Fixes false blank_screenshot on slow/CDN-driven MCP App widgets by deferring snapshot and status until the page has actually painted, not just finished the bridge handshake.

After bridgeInitialized, waitForWidgetPaint (bounded by new paintTimeoutMs, default 8s) optionally waits for networkidle, then polls PNG screenshots and compares them byte-for-byte to a one-time empty-host baseline and to the previous sample until the frame diverges from baseline and stabilizes (or the budget expires). blank_screenshot now follows that pixel verdict instead of the in-page DOM pageResult.blank at handshake time. The observation stores encodeScreenshot output from the same frame used for classification.

Tests add a delayed-paint guest fixture and cover the edge case where settle wait must not prevent at least one paint poll.

Reviewed by Cursor Bugbot for commit b0362cc. Bugbot is set up for automated code reviews on this repo. Configure here.

…ting a widget render

The render observation snapshotted widgets the instant the bridge handshook
and classified blankness via isDocumentBlank — a DOM check. Two problems made
genuinely-rendered widgets (e.g. Excalidraw) report 'Blank screenshot':

1. Handshaking != painting. Data/CDN-driven widgets only draw after they
   receive tool data and fetch their code (esm.sh), seconds later, so the
   result depended on CDN cache warmth.
2. DOM-presence != pixels. A <canvas> widget mounts its canvas element (a
   layout box, so 'non-blank' to the DOM) before drawing into it, so even the
   DOM check passed while the canvas was still empty.

Add a pixel-level paint wait: once the bridge is live, poll real screenshots
(sharp per-channel stdev) until the frame is non-uniform or the new
paintTimeoutMs budget elapses, then let it settle (networkidle + grace) and
snapshot. The eval now records a 'rendered' status AND the fully-painted
screenshot instead of a blank/half-drawn one. Fast widgets return immediately;
only genuinely-blank ones wait the budget. Adds a deterministic late-paint
regression test.

https://claude.ai/code/session_01SdpskaihKUgYPjMagqmWSb
@chatgpt-codex-connector

Copy link
Copy Markdown

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.

@dosubot dosubot Bot added size:M This PR changes 30-99 lines, ignoring generated files. bug Something isn't working labels Jun 11, 2026
@chelojimenez

chelojimenez commented Jun 11, 2026

Copy link
Copy Markdown
Contributor Author

Snyk checks have passed. No issues have been found so far.

Status Scan Engine Critical High Medium Low Total (0)
Open Source Security 0 0 0 0 0 issues

💻 Catch issues earlier using the plugins for VS Code, JetBrains IDEs, Visual Studio, and Eclipse.

Comment thread mcpjam-inspector/server/utils/mcp-app-browser-harness.ts
@coderabbitai

coderabbitai Bot commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

Caution

Review failed

Pull request was closed or merged during review

Hidden review stack artifact

Walkthrough

The harness now captures a one-time empty-page PNG baseline and uses a byte-for-byte screenshot comparison helper to poll the widget frame until it diverges from the baseline and stabilizes within a configurable paintTimeoutMs. renderWidget uses waitForWidgetPaint() (instead of the in-page DOM blank flag) to decide blank_screenshot vs rendered. Defaults add paintTimeoutMs; dispose clears the baseline. Tests add a delayed-paint fixture and assert late paints are classified as rendered.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
mcpjam-inspector/server/utils/mcp-app-browser-harness.ts (1)

757-781: 💤 Low value

Post-paint settling may exceed the declared paint budget.

Once paint is detected, the method waits for networkidle (up to settleTimeoutMs) plus a 400ms grace period—potentially overshooting paintTimeoutMs significantly. If budget precision matters, consider capping the total elapsed time.

That said, since the goal is capturing the finished widget rather than a half-painted frame, this overshoot is arguably desirable behavior.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@mcpjam-inspector/server/utils/mcp-app-browser-harness.ts` around lines 757 -
781, The post-paint settling (in waitForWidgetPaint) can exceed the paint budget
because after detecting a non-blank frame the method unconditionally waits up to
settleTimeoutMs for networkidle plus 400ms; cap these waits against the original
paint budget so total elapsed time never exceeds paintTimeoutMs. Modify
waitForWidgetPaint to compute remaining = deadline - Date.now() when blank
becomes false, and pass a bounded timeout to page.waitForLoadState (use
Math.min(this.budgets.settleTimeoutMs, remaining)) and bound the subsequent
waitForTimeout to the leftover remaining time (e.g., Math.min(400, remaining -
loadWaitSpent) or skip if none left); ensure you handle zero/negative remaining
by skipping further waits and still return non-blank. Use the existing symbols
waitForWidgetPaint, this.budgets.paintTimeoutMs, this.budgets.settleTimeoutMs,
page.waitForLoadState and page.waitForTimeout to locate the changes.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@mcpjam-inspector/server/utils/mcp-app-browser-harness.ts`:
- Around line 757-781: The post-paint settling (in waitForWidgetPaint) can
exceed the paint budget because after detecting a non-blank frame the method
unconditionally waits up to settleTimeoutMs for networkidle plus 400ms; cap
these waits against the original paint budget so total elapsed time never
exceeds paintTimeoutMs. Modify waitForWidgetPaint to compute remaining =
deadline - Date.now() when blank becomes false, and pass a bounded timeout to
page.waitForLoadState (use Math.min(this.budgets.settleTimeoutMs, remaining))
and bound the subsequent waitForTimeout to the leftover remaining time (e.g.,
Math.min(400, remaining - loadWaitSpent) or skip if none left); ensure you
handle zero/negative remaining by skipping further waits and still return
non-blank. Use the existing symbols waitForWidgetPaint,
this.budgets.paintTimeoutMs, this.budgets.settleTimeoutMs, page.waitForLoadState
and page.waitForTimeout to locate the changes.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 71bacda7-b14f-432b-a287-cce342995cc4

📥 Commits

Reviewing files that changed from the base of the PR and between 9f4aadd and 56b2823.

📒 Files selected for processing (2)
  • mcpjam-inspector/server/utils/__tests__/mcp-app-browser-harness.test.ts
  • mcpjam-inspector/server/utils/mcp-app-browser-harness.ts

@github-actions

github-actions Bot commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

Internal preview

Preview URL will appear in Railway after the deploy finishes.
Deployed commit: fb79113
PR head commit: b0362cc
Backend target: staging fallback.
Access is employee-only in non-production environments.

…line diff)

The prior commit imported `sharp` to pixel-check blankness, which broke the
production server. The ESM server bundle inlines sharp's CJS, so at startup
sharp/lib/constructor.js hits 'Dynamic require of util is not supported' and the
server won't boot (CI: Build and Test + E2E Smoke both failed on this). sharp
was also only a devDependency, so it wouldn't be installed for npx users anyway.

Drop sharp and detect paint decode-free: capture an empty-host baseline once,
then poll whole-frame PNG screenshots until the frame both diverges from the
baseline (painted) and stops changing between samples (settled), bounded by
paintTimeoutMs. Chromium encodes identical pixels to identical bytes (verified),
so frame equality is a plain Buffer compare — no native-image dep in the bundle.

The check is content-agnostic, so it also generalizes past canvas apps: DOM
widgets, canvas/WebGL, progressive hydration, and slow CDN loads are handled the
same way with no per-widget delay tuning. Genuinely-blank widgets stay identical
to the baseline (blank_screenshot); perpetually-animating ones hit the budget
and we snapshot the latest painted frame (rendered).

https://claude.ai/code/session_01SdpskaihKUgYPjMagqmWSb
@dosubot dosubot Bot removed the size:M This PR changes 30-99 lines, ignoring generated files. label Jun 11, 2026
@dosubot dosubot Bot added the size:L This PR changes 100-499 lines, ignoring generated files. label Jun 11, 2026
Comment thread mcpjam-inspector/server/utils/mcp-app-browser-harness.ts Outdated

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
mcpjam-inspector/server/utils/mcp-app-browser-harness.ts (1)

774-774: ⚡ Quick win

Inconsistent fail-safe: return false rather than true.

Line 800 returns false (rendered) when frame or baseline is unavailable, following the stated philosophy (lines 798–799) to "fail toward 'rendered' rather than mislabel a real widget blank." Line 774's defensive !page check should do the same.

Proposed fix
-  if (!page) return true;
+  if (!page) return false;
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@mcpjam-inspector/server/utils/mcp-app-browser-harness.ts` at line 774, The
early guard currently returns true when page is missing ("if (!page) return
true;") which is inconsistent with the module's fail-safe policy; change that to
return false so it fails toward "rendered" like the other guard (see the similar
check around lines 798–799). Update the statement "if (!page) return true;" to
"if (!page) return false;" so the function (the block using the page variable)
treats a missing page the same as a missing frame/baseline and marks as
rendered.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@mcpjam-inspector/server/utils/mcp-app-browser-harness.ts`:
- Line 774: The early guard currently returns true when page is missing ("if
(!page) return true;") which is inconsistent with the module's fail-safe policy;
change that to return false so it fails toward "rendered" like the other guard
(see the similar check around lines 798–799). Update the statement "if (!page)
return true;" to "if (!page) return false;" so the function (the block using the
page variable) treats a missing page the same as a missing frame/baseline and
marks as rendered.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 033fd18d-b3c2-4549-beb1-2b76af8b1fac

📥 Commits

Reviewing files that changed from the base of the PR and between 56b2823 and 90fe421.

📒 Files selected for processing (1)
  • mcpjam-inspector/server/utils/mcp-app-browser-harness.ts

…ored frame

Addresses two Cursor Bugbot findings on the paint check:

1. Paint deadline before polling. The paint deadline was fixed before the
   upfront networkidle wait, so when settleTimeoutMs >= paintTimeoutMs that wait
   could consume the whole budget, the poll loop never ran, and a blank widget
   fell through to 'rendered'. Cap the networkidle wait to the paint budget and
   make the poll a do/while so it always samples at least one real frame.

2. Status vs. stored screenshot could disagree. The blank verdict was decided on
   waitForWidgetPaint's internal frame while the stored screenshot came from a
   separate captureScreenshot. Return the classified frame and store that exact
   image (encoded to fit the byte budget), so status and screenshot always match.
   Factor the byte-budget encoder into encodeScreenshot, shared by both paths.

Adds a regression test: a blank widget with paintTimeoutMs <= settleTimeoutMs
still classifies blank_screenshot.

https://claude.ai/code/session_01SdpskaihKUgYPjMagqmWSb

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit b0362cc. Configure here.

// Blank iff the settled frame is still indistinguishable from the empty
// baseline. With no baseline (capture failed), fail toward "rendered" rather
// than mislabel a real widget blank.
const blank = !frame || !baseline ? false : framesEqual(frame, baseline);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Paint poll aborts screenshot retries

Medium Severity

In waitForWidgetPaint, when a poll screenshot fails and prev is still null, the loop breaks immediately instead of retrying until the paint deadline. That forces blank to false and can mark a genuinely blank widget as rendered, which the surrounding comments say the do/while was meant to prevent.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit b0362cc. Configure here.

@chelojimenez chelojimenez merged commit a4889b1 into main Jun 11, 2026
11 of 12 checks passed
@chelojimenez chelojimenez deleted the claude/widget-paint-fix branch June 11, 2026 02:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working size:L This PR changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants