Aquileo | fix(evals): wait for a real painted frame before classifying a widget render by chelojimenez · Pull Request #2563 · MCPJam/inspector

chelojimenez · 2026-06-11T01:29:33Z

Problem

In the eval Render Observations, a genuinely-rendered widget could be reported as "Blank screenshot — rendered but painted blank." The clearest repro is the Excalidraw quickstart: the observation snapshot was blank (captured at ~250ms), while the same create_view widget rendered the Start → End diagram perfectly in the live trace.

So the harness was misreporting a working render as a failure — a false negative — and the screenshot it stored was blank/half-drawn.

Root cause

The harness snapshotted + classified blankness the instant the bridge handshake completed, using isDocumentBlank (a DOM check). Two compounding issues:

Handshaking ≠ painting. Data/CDN-driven widgets only draw after they receive their tool data and fetch their code (e.g. Excalidraw from esm.sh) — seconds later. So whether the snapshot caught content depended on CDN cache warmth (warm: "rendered"; cold: "blank").
DOM-presence ≠ pixels. A <canvas> widget mounts its canvas element (a layout box, so "non-blank" to the DOM) before it draws into it. Even the DOM check passed while the canvas was still empty, so the stored screenshot was blank.

Fix

Add a pixel-level paint wait before snapshotting/classifying. Once the bridge is live, poll real screenshots and measure per-channel stdev with sharp until the frame is non-uniform (actual pixels drawn) or a new paintTimeoutMs budget elapses, then let it settle (networkidle + a short grace) and capture.

A truly-blank frame has per-channel stdev 0; any real content (even a small button) clears the threshold, so genuinely-blank widgets still classify as blank_screenshot.
Fast widgets return on the first frame (no added latency); only a genuinely-blank widget waits the full budget.
The stored screenshot is now the finished widget, so the eval UI shows the painted render instead of a blank/half-drawn one.

Verification

Live Excalidraw quickstart with default budgets: now status: "rendered" and the captured observation screenshot is the complete Start → End diagram (previously blank).
New deterministic regression test: a widget that handshakes immediately but paints ~600ms later is classified rendered, not blank_screenshot.
Full harness suite green (26/26), including the genuinely-blank fixture still resolving to blank_screenshot.

Follow-up to #2553.

Generated by Claude Code

Note

Medium Risk
Changes render observation timing and classification for all browser-backed widget evals; mis-tuned budgets could add latency or misclassify edge cases, but scope is limited to the harness with regression tests.

Overview
Fixes false blank_screenshot on slow/CDN-driven MCP App widgets by deferring snapshot and status until the page has actually painted, not just finished the bridge handshake.

After bridgeInitialized, waitForWidgetPaint (bounded by new paintTimeoutMs, default 8s) optionally waits for networkidle, then polls PNG screenshots and compares them byte-for-byte to a one-time empty-host baseline and to the previous sample until the frame diverges from baseline and stabilizes (or the budget expires). blank_screenshot now follows that pixel verdict instead of the in-page DOM pageResult.blank at handshake time. The observation stores encodeScreenshot output from the same frame used for classification.

Tests add a delayed-paint guest fixture and cover the edge case where settle wait must not prevent at least one paint poll.

^{Reviewed by Cursor Bugbot for commit b0362cc. Bugbot is set up for automated code reviews on this repo. Configure here.}

…ting a widget render The render observation snapshotted widgets the instant the bridge handshook and classified blankness via isDocumentBlank — a DOM check. Two problems made genuinely-rendered widgets (e.g. Excalidraw) report 'Blank screenshot': 1. Handshaking != painting. Data/CDN-driven widgets only draw after they receive tool data and fetch their code (esm.sh), seconds later, so the result depended on CDN cache warmth. 2. DOM-presence != pixels. A <canvas> widget mounts its canvas element (a layout box, so 'non-blank' to the DOM) before drawing into it, so even the DOM check passed while the canvas was still empty. Add a pixel-level paint wait: once the bridge is live, poll real screenshots (sharp per-channel stdev) until the frame is non-uniform or the new paintTimeoutMs budget elapses, then let it settle (networkidle + grace) and snapshot. The eval now records a 'rendered' status AND the fully-painted screenshot instead of a blank/half-drawn one. Fast widgets return immediately; only genuinely-blank ones wait the budget. Adds a deterministic late-paint regression test. https://claude.ai/code/session_01SdpskaihKUgYPjMagqmWSb

chatgpt-codex-connector · 2026-06-11T01:29:37Z

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.

chelojimenez · 2026-06-11T01:29:47Z

✅ Snyk checks have passed. No issues have been found so far.

Status	Scan Engine	Critical	High	Medium	Low	Total (0)
✅	Open Source Security	0	0	0	0	0 issues

💻 Catch issues earlier using the plugins for VS Code, JetBrains IDEs, Visual Studio, and Eclipse.

coderabbitai · 2026-06-11T01:33:03Z

Caution

Review failed

Pull request was closed or merged during review

Hidden review stack artifact

Walkthrough

The harness now captures a one-time empty-page PNG baseline and uses a byte-for-byte screenshot comparison helper to poll the widget frame until it diverges from the baseline and stabilizes within a configurable paintTimeoutMs. renderWidget uses waitForWidgetPaint() (instead of the in-page DOM blank flag) to decide blank_screenshot vs rendered. Defaults add paintTimeoutMs; dispose clears the baseline. Tests add a delayed-paint fixture and assert late paints are classified as rendered.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

🧹 Nitpick comments (1)

mcpjam-inspector/server/utils/mcp-app-browser-harness.ts (1)
757-781: 💤 Low value

Post-paint settling may exceed the declared paint budget.

Once paint is detected, the method waits for networkidle (up to settleTimeoutMs) plus a 400ms grace period—potentially overshooting paintTimeoutMs significantly. If budget precision matters, consider capping the total elapsed time.

That said, since the goal is capturing the finished widget rather than a half-painted frame, this overshoot is arguably desirable behavior.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@mcpjam-inspector/server/utils/mcp-app-browser-harness.ts` around lines 757 -
781, The post-paint settling (in waitForWidgetPaint) can exceed the paint budget
because after detecting a non-blank frame the method unconditionally waits up to
settleTimeoutMs for networkidle plus 400ms; cap these waits against the original
paint budget so total elapsed time never exceeds paintTimeoutMs. Modify
waitForWidgetPaint to compute remaining = deadline - Date.now() when blank
becomes false, and pass a bounded timeout to page.waitForLoadState (use
Math.min(this.budgets.settleTimeoutMs, remaining)) and bound the subsequent
waitForTimeout to the leftover remaining time (e.g., Math.min(400, remaining -
loadWaitSpent) or skip if none left); ensure you handle zero/negative remaining
by skipping further waits and still return non-blank. Use the existing symbols
waitForWidgetPaint, this.budgets.paintTimeoutMs, this.budgets.settleTimeoutMs,
page.waitForLoadState and page.waitForTimeout to locate the changes.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@mcpjam-inspector/server/utils/mcp-app-browser-harness.ts`:
- Around line 757-781: The post-paint settling (in waitForWidgetPaint) can
exceed the paint budget because after detecting a non-blank frame the method
unconditionally waits up to settleTimeoutMs for networkidle plus 400ms; cap
these waits against the original paint budget so total elapsed time never
exceeds paintTimeoutMs. Modify waitForWidgetPaint to compute remaining =
deadline - Date.now() when blank becomes false, and pass a bounded timeout to
page.waitForLoadState (use Math.min(this.budgets.settleTimeoutMs, remaining))
and bound the subsequent waitForTimeout to the leftover remaining time (e.g.,
Math.min(400, remaining - loadWaitSpent) or skip if none left); ensure you
handle zero/negative remaining by skipping further waits and still return
non-blank. Use the existing symbols waitForWidgetPaint,
this.budgets.paintTimeoutMs, this.budgets.settleTimeoutMs, page.waitForLoadState
and page.waitForTimeout to locate the changes.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 71bacda7-b14f-432b-a287-cce342995cc4

📥 Commits

Reviewing files that changed from the base of the PR and between 9f4aadd and 56b2823.

📒 Files selected for processing (2)

mcpjam-inspector/server/utils/__tests__/mcp-app-browser-harness.test.ts
mcpjam-inspector/server/utils/mcp-app-browser-harness.ts

github-actions · 2026-06-11T01:39:41Z

Internal preview

Preview URL will appear in Railway after the deploy finishes.
Deployed commit: fb79113
PR head commit: b0362cc
Backend target: staging fallback.
Access is employee-only in non-production environments.

…line diff) The prior commit imported `sharp` to pixel-check blankness, which broke the production server. The ESM server bundle inlines sharp's CJS, so at startup sharp/lib/constructor.js hits 'Dynamic require of util is not supported' and the server won't boot (CI: Build and Test + E2E Smoke both failed on this). sharp was also only a devDependency, so it wouldn't be installed for npx users anyway. Drop sharp and detect paint decode-free: capture an empty-host baseline once, then poll whole-frame PNG screenshots until the frame both diverges from the baseline (painted) and stops changing between samples (settled), bounded by paintTimeoutMs. Chromium encodes identical pixels to identical bytes (verified), so frame equality is a plain Buffer compare — no native-image dep in the bundle. The check is content-agnostic, so it also generalizes past canvas apps: DOM widgets, canvas/WebGL, progressive hydration, and slow CDN loads are handled the same way with no per-widget delay tuning. Genuinely-blank widgets stay identical to the baseline (blank_screenshot); perpetually-animating ones hit the budget and we snapshot the latest painted frame (rendered). https://claude.ai/code/session_01SdpskaihKUgYPjMagqmWSb

coderabbitai

🧹 Nitpick comments (1)

mcpjam-inspector/server/utils/mcp-app-browser-harness.ts (1)
774-774: ⚡ Quick win

Inconsistent fail-safe: return false rather than true.

Line 800 returns false (rendered) when frame or baseline is unavailable, following the stated philosophy (lines 798–799) to "fail toward 'rendered' rather than mislabel a real widget blank." Line 774's defensive !page check should do the same.
Proposed fix
-  if (!page) return true;
+  if (!page) return false;
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@mcpjam-inspector/server/utils/mcp-app-browser-harness.ts` at line 774, The
early guard currently returns true when page is missing ("if (!page) return
true;") which is inconsistent with the module's fail-safe policy; change that to
return false so it fails toward "rendered" like the other guard (see the similar
check around lines 798–799). Update the statement "if (!page) return true;" to
"if (!page) return false;" so the function (the block using the page variable)
treats a missing page the same as a missing frame/baseline and marks as
rendered.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@mcpjam-inspector/server/utils/mcp-app-browser-harness.ts`:
- Line 774: The early guard currently returns true when page is missing ("if
(!page) return true;") which is inconsistent with the module's fail-safe policy;
change that to return false so it fails toward "rendered" like the other guard
(see the similar check around lines 798–799). Update the statement "if (!page)
return true;" to "if (!page) return false;" so the function (the block using the
page variable) treats a missing page the same as a missing frame/baseline and
marks as rendered.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 033fd18d-b3c2-4549-beb1-2b76af8b1fac

📥 Commits

Reviewing files that changed from the base of the PR and between 56b2823 and 90fe421.

📒 Files selected for processing (1)

mcpjam-inspector/server/utils/mcp-app-browser-harness.ts

…ored frame Addresses two Cursor Bugbot findings on the paint check: 1. Paint deadline before polling. The paint deadline was fixed before the upfront networkidle wait, so when settleTimeoutMs >= paintTimeoutMs that wait could consume the whole budget, the poll loop never ran, and a blank widget fell through to 'rendered'. Cap the networkidle wait to the paint budget and make the poll a do/while so it always samples at least one real frame. 2. Status vs. stored screenshot could disagree. The blank verdict was decided on waitForWidgetPaint's internal frame while the stored screenshot came from a separate captureScreenshot. Return the classified frame and store that exact image (encoded to fit the byte budget), so status and screenshot always match. Factor the byte-budget encoder into encodeScreenshot, shared by both paths. Adds a regression test: a blank widget with paintTimeoutMs <= settleTimeoutMs still classifies blank_screenshot. https://claude.ai/code/session_01SdpskaihKUgYPjMagqmWSb

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

^{❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.}

^{Reviewed by Cursor Bugbot for commit b0362cc. Configure here.}

cursor · 2026-06-11T02:57:06Z

+    // Blank iff the settled frame is still indistinguishable from the empty
+    // baseline. With no baseline (capture failed), fail toward "rendered" rather
+    // than mislabel a real widget blank.
+    const blank = !frame || !baseline ? false : framesEqual(frame, baseline);


Paint poll aborts screenshot retries

Medium Severity

In waitForWidgetPaint, when a poll screenshot fails and prev is still null, the loop breaks immediately instead of retrying until the paint deadline. That forces blank to false and can mark a genuinely blank widget as rendered, which the surrounding comments say the do/while was meant to prevent.

^{Reviewed by Cursor Bugbot for commit b0362cc. Configure here.}

chelojimenez temporarily deployed to preview-pr-2563 June 11, 2026 01:29 — with GitHub Actions Inactive

dosubot Bot added size:M This PR changes 30-99 lines, ignoring generated files. bug Something isn't working labels Jun 11, 2026

cursor Bot reviewed Jun 11, 2026

View reviewed changes

Comment thread mcpjam-inspector/server/utils/mcp-app-browser-harness.ts

coderabbitai Bot reviewed Jun 11, 2026

View reviewed changes

dosubot Bot removed the size:M This PR changes 30-99 lines, ignoring generated files. label Jun 11, 2026

chelojimenez deployed to preview-pr-2563 June 11, 2026 02:21 — with GitHub Actions View deployment

dosubot Bot added the size:L This PR changes 100-499 lines, ignoring generated files. label Jun 11, 2026

cursor Bot reviewed Jun 11, 2026

View reviewed changes

Comment thread mcpjam-inspector/server/utils/mcp-app-browser-harness.ts Outdated

coderabbitai Bot reviewed Jun 11, 2026

View reviewed changes

chelojimenez had a problem deploying to preview-pr-2563 June 11, 2026 02:53 — with GitHub Actions Error

cursor Bot reviewed Jun 11, 2026

View reviewed changes

chelojimenez merged commit a4889b1 into main Jun 11, 2026
11 of 12 checks passed

chelojimenez deleted the claude/widget-paint-fix branch June 11, 2026 02:57

chelojimenez mentioned this pull request Jun 11, 2026

fix(evals): keep polling after a transient poll-screenshot failure #2579

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(evals): wait for a real painted frame before classifying a widget render#2563

fix(evals): wait for a real painted frame before classifying a widget render#2563
chelojimenez merged 3 commits into
mainfrom
claude/widget-paint-fix

chelojimenez commented Jun 11, 2026 •
edited by cursor Bot

Loading

Uh oh!

chatgpt-codex-connector Bot commented Jun 11, 2026

Uh oh!

chelojimenez commented Jun 11, 2026 •
edited

Loading

Uh oh!

Uh oh!

coderabbitai Bot commented Jun 11, 2026 •
edited

Loading

Review failed

Uh oh!

coderabbitai Bot left a comment

Uh oh!

github-actions Bot commented Jun 11, 2026 •
edited

Loading

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Uh oh!

cursor Bot left a comment

Uh oh!

cursor Bot Jun 11, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

chelojimenez commented Jun 11, 2026 • edited by cursor Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Root cause

Fix

Verification

Uh oh!

chatgpt-codex-connector Bot commented Jun 11, 2026

Uh oh!

chelojimenez commented Jun 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

✅ Snyk checks have passed. No issues have been found so far.

Uh oh!

Uh oh!

coderabbitai Bot commented Jun 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review failed

Hidden review stack artifact

Walkthrough

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

github-actions Bot commented Jun 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Internal preview

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

cursor Bot left a comment

Choose a reason for hiding this comment

Uh oh!

cursor Bot Jun 11, 2026

Choose a reason for hiding this comment

Paint poll aborts screenshot retries

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

chelojimenez commented Jun 11, 2026 •
edited by cursor Bot

Loading

chelojimenez commented Jun 11, 2026 •
edited

Loading

coderabbitai Bot commented Jun 11, 2026 •
edited

Loading

github-actions Bot commented Jun 11, 2026 •
edited

Loading