fix(evals): wait for a real painted frame before classifying a widget render#2563
Conversation
…ting a widget render The render observation snapshotted widgets the instant the bridge handshook and classified blankness via isDocumentBlank — a DOM check. Two problems made genuinely-rendered widgets (e.g. Excalidraw) report 'Blank screenshot': 1. Handshaking != painting. Data/CDN-driven widgets only draw after they receive tool data and fetch their code (esm.sh), seconds later, so the result depended on CDN cache warmth. 2. DOM-presence != pixels. A <canvas> widget mounts its canvas element (a layout box, so 'non-blank' to the DOM) before drawing into it, so even the DOM check passed while the canvas was still empty. Add a pixel-level paint wait: once the bridge is live, poll real screenshots (sharp per-channel stdev) until the frame is non-uniform or the new paintTimeoutMs budget elapses, then let it settle (networkidle + grace) and snapshot. The eval now records a 'rendered' status AND the fully-painted screenshot instead of a blank/half-drawn one. Fast widgets return immediately; only genuinely-blank ones wait the budget. Adds a deterministic late-paint regression test. https://claude.ai/code/session_01SdpskaihKUgYPjMagqmWSb
|
You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard. |
✅ Snyk checks have passed. No issues have been found so far.
💻 Catch issues earlier using the plugins for VS Code, JetBrains IDEs, Visual Studio, and Eclipse. |
|
Caution Review failedPull request was closed or merged during review Hidden review stack artifactWalkthroughThe harness now captures a one-time empty-page PNG baseline and uses a byte-for-byte screenshot comparison helper to poll the widget frame until it diverges from the baseline and stabilizes within a configurable paintTimeoutMs. renderWidget uses waitForWidgetPaint() (instead of the in-page DOM blank flag) to decide blank_screenshot vs rendered. Defaults add paintTimeoutMs; dispose clears the baseline. Tests add a delayed-paint fixture and assert late paints are classified as rendered. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
mcpjam-inspector/server/utils/mcp-app-browser-harness.ts (1)
757-781: 💤 Low valuePost-paint settling may exceed the declared paint budget.
Once paint is detected, the method waits for
networkidle(up tosettleTimeoutMs) plus a 400ms grace period—potentially overshootingpaintTimeoutMssignificantly. If budget precision matters, consider capping the total elapsed time.That said, since the goal is capturing the finished widget rather than a half-painted frame, this overshoot is arguably desirable behavior.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@mcpjam-inspector/server/utils/mcp-app-browser-harness.ts` around lines 757 - 781, The post-paint settling (in waitForWidgetPaint) can exceed the paint budget because after detecting a non-blank frame the method unconditionally waits up to settleTimeoutMs for networkidle plus 400ms; cap these waits against the original paint budget so total elapsed time never exceeds paintTimeoutMs. Modify waitForWidgetPaint to compute remaining = deadline - Date.now() when blank becomes false, and pass a bounded timeout to page.waitForLoadState (use Math.min(this.budgets.settleTimeoutMs, remaining)) and bound the subsequent waitForTimeout to the leftover remaining time (e.g., Math.min(400, remaining - loadWaitSpent) or skip if none left); ensure you handle zero/negative remaining by skipping further waits and still return non-blank. Use the existing symbols waitForWidgetPaint, this.budgets.paintTimeoutMs, this.budgets.settleTimeoutMs, page.waitForLoadState and page.waitForTimeout to locate the changes.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@mcpjam-inspector/server/utils/mcp-app-browser-harness.ts`:
- Around line 757-781: The post-paint settling (in waitForWidgetPaint) can
exceed the paint budget because after detecting a non-blank frame the method
unconditionally waits up to settleTimeoutMs for networkidle plus 400ms; cap
these waits against the original paint budget so total elapsed time never
exceeds paintTimeoutMs. Modify waitForWidgetPaint to compute remaining =
deadline - Date.now() when blank becomes false, and pass a bounded timeout to
page.waitForLoadState (use Math.min(this.budgets.settleTimeoutMs, remaining))
and bound the subsequent waitForTimeout to the leftover remaining time (e.g.,
Math.min(400, remaining - loadWaitSpent) or skip if none left); ensure you
handle zero/negative remaining by skipping further waits and still return
non-blank. Use the existing symbols waitForWidgetPaint,
this.budgets.paintTimeoutMs, this.budgets.settleTimeoutMs, page.waitForLoadState
and page.waitForTimeout to locate the changes.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 71bacda7-b14f-432b-a287-cce342995cc4
📒 Files selected for processing (2)
mcpjam-inspector/server/utils/__tests__/mcp-app-browser-harness.test.tsmcpjam-inspector/server/utils/mcp-app-browser-harness.ts
…line diff) The prior commit imported `sharp` to pixel-check blankness, which broke the production server. The ESM server bundle inlines sharp's CJS, so at startup sharp/lib/constructor.js hits 'Dynamic require of util is not supported' and the server won't boot (CI: Build and Test + E2E Smoke both failed on this). sharp was also only a devDependency, so it wouldn't be installed for npx users anyway. Drop sharp and detect paint decode-free: capture an empty-host baseline once, then poll whole-frame PNG screenshots until the frame both diverges from the baseline (painted) and stops changing between samples (settled), bounded by paintTimeoutMs. Chromium encodes identical pixels to identical bytes (verified), so frame equality is a plain Buffer compare — no native-image dep in the bundle. The check is content-agnostic, so it also generalizes past canvas apps: DOM widgets, canvas/WebGL, progressive hydration, and slow CDN loads are handled the same way with no per-widget delay tuning. Genuinely-blank widgets stay identical to the baseline (blank_screenshot); perpetually-animating ones hit the budget and we snapshot the latest painted frame (rendered). https://claude.ai/code/session_01SdpskaihKUgYPjMagqmWSb
There was a problem hiding this comment.
🧹 Nitpick comments (1)
mcpjam-inspector/server/utils/mcp-app-browser-harness.ts (1)
774-774: ⚡ Quick winInconsistent fail-safe: return
falserather thantrue.Line 800 returns
false(rendered) when frame or baseline is unavailable, following the stated philosophy (lines 798–799) to "fail toward 'rendered' rather than mislabel a real widget blank." Line 774's defensive!pagecheck should do the same.Proposed fix
- if (!page) return true; + if (!page) return false;🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@mcpjam-inspector/server/utils/mcp-app-browser-harness.ts` at line 774, The early guard currently returns true when page is missing ("if (!page) return true;") which is inconsistent with the module's fail-safe policy; change that to return false so it fails toward "rendered" like the other guard (see the similar check around lines 798–799). Update the statement "if (!page) return true;" to "if (!page) return false;" so the function (the block using the page variable) treats a missing page the same as a missing frame/baseline and marks as rendered.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@mcpjam-inspector/server/utils/mcp-app-browser-harness.ts`:
- Line 774: The early guard currently returns true when page is missing ("if
(!page) return true;") which is inconsistent with the module's fail-safe policy;
change that to return false so it fails toward "rendered" like the other guard
(see the similar check around lines 798–799). Update the statement "if (!page)
return true;" to "if (!page) return false;" so the function (the block using the
page variable) treats a missing page the same as a missing frame/baseline and
marks as rendered.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 033fd18d-b3c2-4549-beb1-2b76af8b1fac
📒 Files selected for processing (1)
mcpjam-inspector/server/utils/mcp-app-browser-harness.ts
…ored frame Addresses two Cursor Bugbot findings on the paint check: 1. Paint deadline before polling. The paint deadline was fixed before the upfront networkidle wait, so when settleTimeoutMs >= paintTimeoutMs that wait could consume the whole budget, the poll loop never ran, and a blank widget fell through to 'rendered'. Cap the networkidle wait to the paint budget and make the poll a do/while so it always samples at least one real frame. 2. Status vs. stored screenshot could disagree. The blank verdict was decided on waitForWidgetPaint's internal frame while the stored screenshot came from a separate captureScreenshot. Return the classified frame and store that exact image (encoded to fit the byte budget), so status and screenshot always match. Factor the byte-budget encoder into encodeScreenshot, shared by both paths. Adds a regression test: a blank widget with paintTimeoutMs <= settleTimeoutMs still classifies blank_screenshot. https://claude.ai/code/session_01SdpskaihKUgYPjMagqmWSb
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit b0362cc. Configure here.
| // Blank iff the settled frame is still indistinguishable from the empty | ||
| // baseline. With no baseline (capture failed), fail toward "rendered" rather | ||
| // than mislabel a real widget blank. | ||
| const blank = !frame || !baseline ? false : framesEqual(frame, baseline); |
There was a problem hiding this comment.
Paint poll aborts screenshot retries
Medium Severity
In waitForWidgetPaint, when a poll screenshot fails and prev is still null, the loop breaks immediately instead of retrying until the paint deadline. That forces blank to false and can mark a genuinely blank widget as rendered, which the surrounding comments say the do/while was meant to prevent.
Reviewed by Cursor Bugbot for commit b0362cc. Configure here.


Problem
In the eval Render Observations, a genuinely-rendered widget could be reported as "Blank screenshot — rendered but painted blank." The clearest repro is the Excalidraw quickstart: the observation snapshot was blank (captured at ~250ms), while the same
create_viewwidget rendered the Start → End diagram perfectly in the live trace.So the harness was misreporting a working render as a failure — a false negative — and the screenshot it stored was blank/half-drawn.
Root cause
The harness snapshotted + classified blankness the instant the bridge handshake completed, using
isDocumentBlank(a DOM check). Two compounding issues:<canvas>widget mounts its canvas element (a layout box, so "non-blank" to the DOM) before it draws into it. Even the DOM check passed while the canvas was still empty, so the stored screenshot was blank.Fix
Add a pixel-level paint wait before snapshotting/classifying. Once the bridge is live, poll real screenshots and measure per-channel stdev with
sharpuntil the frame is non-uniform (actual pixels drawn) or a newpaintTimeoutMsbudget elapses, then let it settle (networkidle+ a short grace) and capture.0; any real content (even a small button) clears the threshold, so genuinely-blank widgets still classify asblank_screenshot.Verification
status: "rendered"and the captured observation screenshot is the complete Start → End diagram (previously blank).rendered, notblank_screenshot.blank_screenshot.Follow-up to #2553.
Generated by Claude Code
Note
Medium Risk
Changes render observation timing and classification for all browser-backed widget evals; mis-tuned budgets could add latency or misclassify edge cases, but scope is limited to the harness with regression tests.
Overview
Fixes false
blank_screenshoton slow/CDN-driven MCP App widgets by deferring snapshot and status until the page has actually painted, not just finished the bridge handshake.After
bridgeInitialized,waitForWidgetPaint(bounded by newpaintTimeoutMs, default 8s) optionally waits fornetworkidle, then polls PNG screenshots and compares them byte-for-byte to a one-time empty-host baseline and to the previous sample until the frame diverges from baseline and stabilizes (or the budget expires).blank_screenshotnow follows that pixel verdict instead of the in-page DOMpageResult.blankat handshake time. The observation storesencodeScreenshotoutput from the same frame used for classification.Tests add a delayed-paint guest fixture and cover the edge case where settle wait must not prevent at least one paint poll.
Reviewed by Cursor Bugbot for commit b0362cc. Bugbot is set up for automated code reviews on this repo. Configure here.