
feat(v1,sdk): rerun eval suites against their saved server selection #2609

Merged
chelojimenez merged 9 commits into main from claude/sharp-gauss-aotkgt
Jun 11, 2026

Conversation

@chelojimenez

@chelojimenez chelojimenez commented Jun 11, 2026

Contributor

Summary

Follow-up to #2603, closing out the Codex P2 finding there (confirmed: "implement the API fix"). Pairs with MCPJam/mcpjam-backend#514, which must deploy first.

The v1 rerun path connected its MCP manager from the caller's serverIds, while startTestSuiteRun snapshots the run's server set from the suite itself (attachments / suite host config / environment bindings). Any caller-side guess that missed the suite's frozen selection — including run_eval_suite's previous all-enabled-HTTP default — produced a 202 followed by execution-time failures on never-connected servers.

POST /projects/{p}/eval-runs

  • serverIds is now optional on reruns: when omitted with a suiteId, the route derives the suite's saved selection via the new testSuites:getSuiteRunServerSelection query (backend PR) and connects exactly the set the run snapshot references. Explicit serverIds keeps overriding.
  • New-suite creation (suiteName + tests) still requires serverIds — there is no saved selection to derive.
  • A suite with no saved selection rejects with VALIDATION_ERROR + details.reason: "NO_SAVED_SERVER_SELECTION"; a suite the caller can't see maps to 404; a backend without the query yet degrades to an explicit-serverIds instruction instead of a 500 (deploy-order tolerance).
  • The 202 response now lists the resolved servers ({id, name?}) — additive — so callers that omitted serverIds can see what the run targets.
  • OpenAPI (EvalRunCreateRequest anyOf/required, EvalRunCreated.servers) and public-api.mdx updated.
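
The request rules above can be read as one small decision function. The following is a minimal sketch, not the route's actual code: `resolveRunServers`, `SavedSelectionLookup`, and the result shapes are hypothetical names, and the backend query is replaced by an injected lookup so the block runs on its own. It assumes the lookup returns `null` for a suite the caller can't see and an empty list for a suite with no saved selection, and that VALIDATION_ERROR answers with a 400.

```typescript
// Hypothetical shapes; only the field names serverIds, suiteId, suiteName,
// and tests come from the PR description.
interface ServerRef {
  id: string;
  name?: string;
}

interface EvalRunRequest {
  suiteId?: string;
  suiteName?: string;
  tests?: unknown[];
  serverIds?: string[];
}

type Resolution =
  | { ok: true; servers: ServerRef[] }
  | { ok: false; status: 400 | 404; code: string; reason?: string };

// Stand-in for the backend query (assumption): null means the suite is not
// visible to the caller, [] means the suite has no saved selection.
type SavedSelectionLookup = (suiteId: string) => ServerRef[] | null;

function resolveRunServers(
  req: EvalRunRequest,
  lookup: SavedSelectionLookup,
): Resolution {
  // Explicit serverIds keep overriding, for reruns and new suites alike.
  if (req.serverIds !== undefined && req.serverIds.length > 0) {
    return { ok: true, servers: req.serverIds.map((id) => ({ id })) };
  }
  // New-suite creation has no saved selection to derive from.
  if (req.suiteId === undefined) {
    return { ok: false, status: 400, code: "VALIDATION_ERROR" };
  }
  const saved = lookup(req.suiteId);
  if (saved === null) {
    return { ok: false, status: 404, code: "NOT_FOUND" };
  }
  if (saved.length === 0) {
    return {
      ok: false,
      status: 400,
      code: "VALIDATION_ERROR",
      reason: "NO_SAVED_SERVER_SELECTION",
    };
  }
  // Connect exactly the set the suite saved.
  return { ok: true, servers: saved };
}
```

Returning a tagged result rather than throwing keeps the sketch easy to exercise; a real route would more likely throw its own error type at each rejected branch.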

@mcpjam/sdk/platform run_eval_suite

Drops the client-side all-enabled-HTTP default: when servers is omitted, the request omits serverIds and the platform resolves the suite's saved selection; the result's servers comes from the API response. Explicit servers overrides keep their client-side resolution and validation (id/unique-name, stdio/URL-less rejection, disabled-selectable semantics). PlatformEvalRunCreated gains the optional servers field. Changeset: @mcpjam/sdk minor.
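
The client-side change described above amounts to leaving the `serverIds` key out of the request unless the caller overrides it. A rough sketch follows, assuming the selectors have already been resolved to IDs; `RunEvalSuiteInput`, `EvalRunCreateBody`, and `buildEvalRunBody` are hypothetical names, not the SDK's.

```typescript
// Hypothetical input and wire shapes, for illustration only.
interface RunEvalSuiteInput {
  suiteId: string;
  // Already-resolved server IDs, present only when the caller overrides.
  servers?: string[];
}

interface EvalRunCreateBody {
  suiteId: string;
  serverIds?: string[];
}

function buildEvalRunBody(input: RunEvalSuiteInput): EvalRunCreateBody {
  const body: EvalRunCreateBody = { suiteId: input.suiteId };
  // Attach serverIds only for an explicit override. Omitting the key
  // entirely is what lets the platform resolve the saved selection.
  if (input.servers !== undefined) {
    body.serverIds = [...input.servers];
  }
  return body;
}

console.log(JSON.stringify(buildEvalRunBody({ suiteId: "suite_1" })));
// prints {"suiteId":"suite_1"}
console.log(
  JSON.stringify(buildEvalRunBody({ suiteId: "suite_1", servers: ["srv_a"] })),
);
// prints {"suiteId":"suite_1","serverIds":["srv_a"]}
```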

Testing

  • routes/v1/__tests__/write-routes.test.ts: derivation happy path (query called, manager gets derived ids+names, 202 lists servers), NO_SAVED_SERVER_SELECTION, not-visible → 404, older-backend degradation, new-suite-requires-serverIds. 73 v1 route tests green.
  • sdk/tests/platform/: omitted-servers sends no serverIds and performs no project-server listing; explicit-override tests unchanged. Full SDK suite 87 files / 1446 passed; worker tests + typecheck and both runtime-safety guards green.
  • Server tsc error count identical to main (285 pre-existing, none in routes/v1).

Deploy order

  1. MCPJam/mcpjam-backend#514 (the Convex query)
  2. This PR (route + SDK + docs)

Until step 1 deploys, requests that omit serverIds answer with the actionable VALIDATION_ERROR rather than failing opaquely.

https://claude.ai/code/session_01UQSNMVcYZrJd9vabfXk1nY


Generated by Claude Code


Note

Medium Risk
Changes eval-run connection semantics and API contract (optional serverIds on reruns); depends on backend deploy order but includes graceful degradation for skew.

Overview
Eval suite reruns no longer require callers to guess which MCP servers to connect. POST /eval-runs treats serverIds as optional when suiteId is present: the route loads the suite’s saved selection via testSuites:getSuiteRunServerSelection, wires the manager to that set, and returns the resolved servers on the 202. New suites (suiteName + tests) still require serverIds. Missing selection yields NO_SAVED_SERVER_SELECTION; older backends without the Convex query get a VALIDATION_ERROR telling callers to pass serverIds.

run_eval_suite in @mcpjam/sdk drops the client-side “all enabled HTTP servers” default: omitted servers means the request omits serverIds, and the result’s servers come from the API. Explicit servers overrides keep client resolution/validation. OpenAPI, public API docs, MCP README, and tests are updated accordingly.

Reviewed by Cursor Bugbot for commit b54d367. Bugbot is set up for automated code reviews on this repo. Configure here.

claude added 7 commits June 11, 2026 08:08
Extends the @mcpjam/sdk/platform catalog (PR #2589) over the public
API's eval and chatbox surface, and registers every operation as a
tool on the remote MCP worker.

PlatformApiClient gains listChatboxes/getChatbox, createEvalRun,
getEvalRun, listEvalRunIterations, getEvalIterationTrace, and
listEvalSuiteRuns, with matching wire DTOs.

Nine new curated operations: list_eval_suites, list_eval_suite_runs,
run_eval_suite (async rerun: suite and servers resolve by name or ID,
servers default to the project's enabled HTTP servers), get_eval_run,
list_eval_run_iterations, get_eval_iteration_trace, list_chatboxes,
get_chatbox, and list_chat_sessions. Operations now carry a readOnly
flag; the worker maps it to MCP tool annotations (run_eval_suite is
the only non-read tool: readOnlyHint false, destructiveHint false).
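
The flag-to-annotation mapping described above could look roughly like this. `readOnlyHint` and `destructiveHint` are MCP tool annotation fields; the `Operation` shape and `toToolAnnotations` are hypothetical, and the sketch assumes every non-read operation is non-destructive, which the commit states only for run_eval_suite.

```typescript
// Hypothetical catalog entry; only the readOnly flag comes from the commit.
interface Operation {
  name: string;
  readOnly: boolean;
}

interface ToolAnnotations {
  readOnlyHint: boolean;
  destructiveHint?: boolean;
}

function toToolAnnotations(op: Operation): ToolAnnotations {
  if (op.readOnly) {
    // destructiveHint only carries meaning for tools that are not read-only.
    return { readOnlyHint: true };
  }
  // Assumption: a non-read operation starts work (a run) without
  // destroying existing data.
  return { readOnlyHint: false, destructiveHint: false };
}
```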

Worker tool registration is now catalog-driven (platformTools.ts);
showServers.ts keeps only the MCP Apps widget variant. README Status,
Auth, and Architecture sections refreshed (they still described the
pre-#2589 Convex path and a single tool).

https://claude.ai/code/session_01UQSNMVcYZrJd9vabfXk1nY
…ed-runnable

Explicitly selected stdio or URL-less servers now fail at resolution
with an error naming the selector, instead of shipping a serverId the
hosted runner can never connect to. The default branch makes the
stdio exclusion explicit alongside the URL check. Disabled servers
remain explicitly selectable by design: naming one is a deliberate
choice and the default selection already excludes them.
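
As a sketch of the resolution rules in this commit (not the SDK's resolver): explicit selectors match by ID or unique name and fail when the server is stdio or URL-less, disabled servers pass when named, and the default branch keeps only enabled, hosted-runnable servers. `ProjectServer`, `isHostedRunnable`, and `resolveServers` are hypothetical names, and the `transport` field is an assumed way to tell stdio servers apart.

```typescript
// Hypothetical server record; the field names are assumptions.
interface ProjectServer {
  id: string;
  name: string;
  transport: "http" | "stdio";
  url?: string;
  enabled: boolean;
}

// A hosted runner can only reach servers that are not stdio and have a URL.
function isHostedRunnable(server: ProjectServer): boolean {
  return server.transport !== "stdio" && !!server.url;
}

function resolveServers(
  all: ProjectServer[],
  selectors?: string[],
): ProjectServer[] {
  if (selectors === undefined) {
    // Default branch: the enabled toggle applies here and only here.
    return all.filter((s) => s.enabled && isHostedRunnable(s));
  }
  return selectors.map((selector) => {
    // IDs take precedence; names must be unique to count as a match.
    const byId = all.filter((s) => s.id === selector);
    const matches =
      byId.length > 0 ? byId : all.filter((s) => s.name === selector);
    const server = matches.length === 1 ? matches[0] : undefined;
    if (server === undefined) {
      throw new Error(
        `Selector "${selector}" matched ${matches.length} servers`,
      );
    }
    if (!isHostedRunnable(server)) {
      // Fail at resolution, naming the selector, instead of shipping an
      // ID the hosted runner can never connect to.
      throw new Error(
        `Server "${selector}" is stdio or has no URL and cannot run hosted`,
      );
    }
    // Disabled servers pass: naming one is a deliberate choice.
    return server;
  });
}
```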

Also clarifies the readOnly-flag phrasing in the changeset.

https://claude.ai/code/session_01UQSNMVcYZrJd9vabfXk1nY
get_eval_run, list_eval_run_iterations, and get_eval_iteration_trace
no longer default the project to the most recently updated one. A run
is an existing resource in one specific project; guessing the project
made a run created anywhere else read as NOT_FOUND when polled with
only its runId. run_eval_suite and list_eval_suite_runs return the
resolved project, so callers address the polls exactly. Listing
operations keep their default-project ergonomics, where a wrong
default produces an actionable, enumerated error instead of a
phantom-missing resource.

https://claude.ai/code/session_01UQSNMVcYZrJd9vabfXk1nY
A whitespace-only selector passed min(1), then resolveProject trimmed
it to empty and fell back to the most recently updated project. For
list_chat_sessions that silently narrowed the unfiltered listing to
one project; for the eval polling reads it would have reintroduced
the default-project guess their required project exists to prevent.
Selector inputs now trim before validating, so blank values fail
loudly; list_chat_sessions also trims in execute() so raw SDK callers
bypassing the schema get the unfiltered listing rather than a silent
filter. Pagination cursors stay untrimmed: they're opaque.
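
The ordering bug described here is easy to reproduce in isolation. In this minimal sketch, both function names are hypothetical and the length check stands in for the schema's min(1):

```typescript
// Old order: the length check sees three spaces and passes, then the trim
// empties the value, which downstream code treats as "no selector given".
function validateThenTrim(raw: string): string {
  if (raw.length < 1) throw new Error("selector must not be empty");
  return raw.trim();
}

// New order: trim first, so a blank selector fails loudly.
function trimThenValidate(raw: string): string {
  const trimmed = raw.trim();
  if (trimmed.length < 1) throw new Error("selector must not be blank");
  return trimmed;
}

console.log(JSON.stringify(validateThenTrim("   "))); // prints ""
console.log(JSON.stringify(trimThenValidate("  my-project "))); // prints "my-project"
```

Opaque values such as pagination cursors should not go through either function, since trimming could change their meaning.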

https://claude.ai/code/session_01UQSNMVcYZrJd9vabfXk1nY
Explicitly naming a disabled server runs it; the default selection
excludes disabled servers. This mirrors the platform's own contract:
eval-run authorization (convex webAuthorize, behind
/web/authorize-batch) is project-membership-based and never consults
the enabled toggle, which only shapes the default connection set.
Stated in the tool's input schema, where agents read it before
spending credits, plus the resolver comment and worker README.

https://claude.ai/code/session_01UQSNMVcYZrJd9vabfXk1nY
POST /eval-runs no longer requires serverIds on a rerun: when omitted
with a suiteId, the route derives the suite's saved server selection
via the new backend query testSuites:getSuiteRunServerSelection and
connects exactly the set the run snapshot references. Previously the
manager connected whatever the caller guessed, while startTestSuiteRun
froze the suite's own server set — a guess that missed it produced a
202 followed by execution-time failures. New-suite creation still
requires serverIds (there is no saved selection to derive). A suite
with no saved selection rejects with VALIDATION_ERROR
(NO_SAVED_SERVER_SELECTION), and older backends without the query
degrade to an explicit-serverIds instruction instead of a 500. The
202 now lists the resolved servers.

The SDK's run_eval_suite drops its client-side all-enabled-HTTP
default in favor of the platform-derived selection; explicit servers
overrides keep their resolution and validation. OpenAPI and the
public-api docs updated to match.

Deploy order: mcpjam-backend (getSuiteRunServerSelection) first.

https://claude.ai/code/session_01UQSNMVcYZrJd9vabfXk1nY

@dosubot dosubot Bot added the size:L This PR changes 100-499 lines, ignoring generated files. label Jun 11, 2026
@chelojimenez

chelojimenez commented Jun 11, 2026

Contributor Author

Snyk checks have passed. No issues have been found so far.

Status Scan Engine Critical High Medium Low Total (0)
Open Source Security 0 0 0 0 0 issues


@dosubot dosubot Bot added the enhancement New feature or request label Jun 11, 2026
@github-actions

github-actions Bot commented Jun 11, 2026

Contributor

Internal preview

Preview URL: https://mcp-inspector-pr-2609.up.railway.app
Deployed commit: 5c5bf37
PR head commit: b54d367
Backend target: staging fallback.
Health: ✅ Convex reachable
Access is employee-only in non-production environments.

@github-actions

github-actions Bot commented Jun 11, 2026

Contributor

MCP worker preview

Preview worker mcpjam-mcp-pr-2609 deleted — the preview URL no longer resolves.
Merged changes are live on mcpjam-mcp-staging via deploy-mcp-staging.yml.

@coderabbitai

coderabbitai Bot commented Jun 11, 2026

Contributor

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 9043f314-f73a-46b0-934f-a677ac1e34cc

📥 Commits

Reviewing files that changed from the base of the PR and between b08e053 and b54d367.

📒 Files selected for processing (2)
  • mcpjam-inspector/server/routes/v1/__tests__/write-routes.test.ts
  • mcpjam-inspector/server/routes/v1/evals.ts

Walkthrough

This PR shifts eval run server selection from the SDK's client-side defaulting to the Platform API's saved suite selection. When a rerun request omits servers, the platform resolves the suite's saved configuration and returns the server set. The SDK now sends serverIds only for explicit overrides; a new backend Convex helper validates the saved selection and propagates it through the response.



@coderabbitai coderabbitai Bot left a comment

Contributor


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@mcpjam-inspector/server/routes/v1/evals.ts`:
- Around line 228-268: The query result `selection` from
convex.query("testSuites:getSuiteRunServerSelection") can be null and should
yield a 404, not a 400; after the try/catch and before you derive
serverIds/serverNames, add a null-check: if selection === null then throw new
WebRouteError(404, ErrorCode.NOT_FOUND, "Eval suite not found") (matching other
read-not-found behavior in this file). This ensures the nullable `selection` is
treated as not found rather than falling through to the
NO_SAVED_SERVER_SELECTION validation error.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 40530527-157d-4850-bb3f-78ceb6578930

📥 Commits

Reviewing files that changed from the base of the PR and between a8ae274 and b08e053.

📒 Files selected for processing (9)
  • .changeset/run-eval-suite-saved-selection.md
  • docs/reference/openapi.json
  • docs/reference/public-api.mdx
  • mcp/README.md
  • mcpjam-inspector/server/routes/v1/__tests__/write-routes.test.ts
  • mcpjam-inspector/server/routes/v1/evals.ts
  • sdk/src/platform/operations.ts
  • sdk/src/platform/types.ts
  • sdk/tests/platform/operations.test.ts

Comment thread mcpjam-inspector/server/routes/v1/evals.ts
A null result from getSuiteRunServerSelection means the suite itself
was not found; surface it like every other Convex read-not-found in
this file instead of misreporting it as NO_SAVED_SERVER_SELECTION.

https://claude.ai/code/session_01UQSNMVcYZrJd9vabfXk1nY
@mintlify

mintlify Bot commented Jun 11, 2026

Contributor

Preview deployment for your docs. Learn more about Mintlify Previews.

Project Status Preview Updated (UTC)
mcpjam 🟢 Ready View Preview Jun 11, 2026, 6:53 PM


@chelojimenez chelojimenez merged commit 9180118 into main Jun 11, 2026
16 checks passed
@chelojimenez chelojimenez deleted the claude/sharp-gauss-aotkgt branch June 11, 2026 19:07

Labels

enhancement New feature or request size:L This PR changes 100-499 lines, ignoring generated files.


2 participants