Aquileo | feat(relay): WS-only inbound on the gateway adapter (Phase 3) by benbarclay · Pull Request #48294 · NousResearch/hermes-agent

benbarclay · 2026-06-18T07:23:10Z

Summary

The gateway half of relay Phase 3 option B2: inbound (messages + /stop interrupts) now arrives over the gateway's outbound /relay WebSocket, not a signed HTTP POST to an inbound endpoint. The gateway needs no inbound HTTP port — which is what makes hosted gateways (no public IP) able to receive inbound at all.

Pairs with gateway-gateway (relay bus + WsGatewayDelivery) and the NAS GATEWAY_RELAY_URL stamp. The cross-repo E2E (connector repo) proves the full multi-instance path against this production adapter code.

What changed

gateway/relay/adapter.py — connect() wires set_interrupt_inbound_handler(self.on_interrupt) so connector→gateway interrupt_inbound frames bridge into the existing per-session interrupt path (the inbound message handler was already wired). Removed _maybe_start_inbound_receiver() + the _inbound_runner lifecycle — there is no HTTP receiver anymore.
gateway/relay/inbound_receiver.py — deleted (the signed-HTTP InboundDeliveryReceiver).
gateway/relay/__init__.py — removed relay_inbound_config() (dead with the receiver gone). The delivery key is still set in-process by self-provision for forward-compat but is no longer consumed for inbound.
docs/relay-connector-contract.md — §3 rewritten (inbound is the WS back-channel, routed cross-instance via the connector's relay bus); §5 interrupt + §6 auth table updated; the old signed-HTTP-POST + per-tenant-delivery-key-signing path documented as superseded. gatewayEndpoint noted as passthrough-plane only.

The WS transport already parsed inbound / interrupt_inbound frames and dispatched the message handler — this PR is largely removing the parallel HTTP path and wiring the one remaining handler, hence the net deletion.
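
A minimal sketch of the wiring described above, not the adapter's actual code. `set_interrupt_inbound_handler` and `on_interrupt` are named in this PR; the `SketchWsTransport` and `SketchAdapter` classes, the `set_inbound_handler` name, and the frame fields (`type`, `text`, `session_id`) are assumptions made for illustration.

```python
import json
from typing import Callable, Dict, List

Handler = Callable[[dict], None]


class SketchWsTransport:
    """Hypothetical stand-in for the relay WS transport (not the real class)."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Handler] = {}

    def set_inbound_handler(self, handler: Handler) -> None:
        # Assumed name for the message handler that was already wired.
        self._handlers["inbound"] = handler

    def set_interrupt_inbound_handler(self, handler: Handler) -> None:
        # Name taken from the PR description.
        self._handlers["interrupt_inbound"] = handler

    def feed(self, raw: str) -> bool:
        """Parse one text frame and dispatch it by its (assumed) `type` field."""
        frame = json.loads(raw)
        handler = self._handlers.get(frame.get("type"))
        if handler is None:
            return False  # unknown frame type or handler not wired
        handler(frame)
        return True


class SketchAdapter:
    """Hypothetical adapter: connect() wires both inbound handlers."""

    def __init__(self, transport: SketchWsTransport) -> None:
        self.transport = transport
        self.messages: List[str] = []
        self.interrupted: List[str] = []

    def connect(self) -> None:
        self.transport.set_inbound_handler(self.on_message)
        self.transport.set_interrupt_inbound_handler(self.on_interrupt)

    def on_message(self, frame: dict) -> None:
        self.messages.append(frame["text"])

    def on_interrupt(self, frame: dict) -> None:
        # The real adapter bridges into the per-session interrupt path;
        # this sketch only records which session was targeted.
        self.interrupted.append(frame["session_id"])
```

With both handlers wired in `connect()`, every inbound frame type arrives on the one outbound socket, so no HTTP listener is needed on the gateway side.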

Test plan

stub_connector grows set_interrupt_inbound_handler + push_interrupt; a new test_relay_interrupt case proves connect() wires both inbound handlers and that an interrupt_inbound frame over the WS cancels exactly the right session (siblings untouched).
Removed the HTTP-receiver test; updated the crypto-shedding scan and the self-provision delivery-key assertion.
88 relay tests pass. Contract-doc conformance test green.
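
The shape of the new interrupt test could look roughly like the following. This is a sketch under stated assumptions rather than the repository's `stub_connector` or `test_relay_interrupt`: `set_interrupt_inbound_handler` and `push_interrupt` come from the test plan, while `StubConnector`, `SessionRegistry`, the frame fields, and the use of `asyncio` tasks as stand-ins for sessions are hypothetical.

```python
import asyncio
from typing import Callable, Dict, Optional


class StubConnector:
    """Hypothetical stub: lets a test push an interrupt frame to the gateway."""

    def __init__(self) -> None:
        self._on_interrupt: Optional[Callable[[dict], None]] = None

    def set_interrupt_inbound_handler(self, handler: Callable[[dict], None]) -> None:
        self._on_interrupt = handler

    def push_interrupt(self, session_id: str) -> None:
        assert self._on_interrupt is not None, "interrupt handler was never wired"
        self._on_interrupt({"type": "interrupt_inbound", "session_id": session_id})


class SessionRegistry:
    """Hypothetical per-session interrupt path: one asyncio task per session."""

    def __init__(self) -> None:
        self.tasks: Dict[str, asyncio.Task] = {}

    def start(self, session_id: str) -> None:
        self.tasks[session_id] = asyncio.create_task(asyncio.sleep(3600))

    def on_interrupt(self, frame: dict) -> None:
        task = self.tasks.get(frame["session_id"])
        if task is not None:
            task.cancel()  # only the addressed session is cancelled


async def run_case() -> Dict[str, bool]:
    connector = StubConnector()
    registry = SessionRegistry()
    connector.set_interrupt_inbound_handler(registry.on_interrupt)
    for session_id in ("a", "b"):
        registry.start(session_id)
    await asyncio.sleep(0)  # let both session tasks start
    connector.push_interrupt("a")
    await asyncio.wait({registry.tasks["a"]})  # cancellation counts as done
    result = {sid: task.cancelled() for sid, task in registry.tasks.items()}
    # Clean up the sibling so the event loop closes without pending tasks.
    registry.tasks["b"].cancel()
    await asyncio.gather(registry.tasks["b"], return_exceptions=True)
    return result
```

Running `asyncio.run(run_case())` reports session `a` as cancelled and its sibling `b` as untouched at the time of the check.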

Diffstat

9 files, −478 / +119 (the deletion is inbound_receiver.py + its test). Additive-safe within the EXPERIMENTAL relay surface.

EXPERIMENTAL: the relay contract may change until ≥2 Class-1 platforms validate it.

The connector now delivers inbound (messages + interrupts) over the gateway's OUTBOUND /relay WebSocket, not a signed HTTP POST to an inbound endpoint. The gateway needs no inbound HTTP port, which is what makes hosted gateways (no public IP) able to receive inbound at all.

- gateway/relay/adapter.py: connect() wires set_interrupt_inbound_handler(self.on_interrupt) so connector->gateway interrupt_inbound frames bridge into the existing per-session interrupt path (the inbound message handler was already wired). Removed _maybe_start_inbound_receiver() + the _inbound_runner lifecycle; there is no HTTP receiver anymore.
- gateway/relay/inbound_receiver.py: deleted (the signed-HTTP InboundDelivery receiver).
- gateway/relay/__init__.py: removed relay_inbound_config() (dead with the receiver gone). The delivery key is still set in-process by self-provision for forward-compat but is no longer consumed for inbound.
- docs/relay-connector-contract.md: §3 rewritten: inbound is the WS back-channel routed cross-instance via the connector's relay bus; §5 interrupt + §6 auth table updated; the old signed-HTTP-POST + per-tenant-delivery-key-signing path is documented as superseded. gatewayEndpoint noted as passthrough-plane only.

Tests: stub_connector grows set_interrupt_inbound_handler + push_interrupt; new test_relay_interrupt case proves connect() wires BOTH inbound handlers and an interrupt_inbound frame over the WS cancels the right session. Removed the HTTP-receiver test; updated the crypto-shedding scan + self-provision delivery-key assertion. 88 relay tests pass.

EXPERIMENTAL. Pairs with gateway-gateway (relay bus + WsGatewayDelivery) and the NAS GATEWAY_RELAY_URL stamp. The cross-repo E2E (connector repo) proves the full multi-instance path against this production adapter code.

github-actions · 2026-06-18T07:24:50Z

🔎 Lint report: `feat/relay-inbound-ws` vs `origin/main`

ruff

Total: 0 on HEAD, 0 on base (➖ 0)

🆕 New issues: none

✅ Fixed issues: none

Unchanged: 0 pre-existing issues carried over.

ty (type checker)

Total: 11005 on HEAD, 11011 on base (✅ -6)

🆕 New issues: none

✅ Fixed issues (6):

| Rule | Count |
| --- | --- |
| `unresolved-attribute` | 3 |
| `unresolved-import` | 3 |

First entries

gateway/relay/inbound_receiver.py:188: [unresolved-attribute] unresolved-attribute: Attribute `Application` is not defined on `None` in union `Unknown | None`
gateway/relay/inbound_receiver.py:189: [unresolved-attribute] unresolved-attribute: Attribute `Response` is not defined on `None` in union `Unknown | None`
gateway/relay/inbound_receiver.py:50: [unresolved-import] unresolved-import: Cannot resolve imported module `aiohttp`
gateway/relay/inbound_receiver.py:204: [unresolved-attribute] unresolved-attribute: Attribute `json_response` is not defined on `None` in union `Unknown | None`
gateway/relay/adapter.py:110: [unresolved-import] unresolved-import: Cannot resolve imported module `aiohttp`
tests/gateway/relay/test_inbound_receiver.py:15: [unresolved-import] unresolved-import: Cannot resolve imported module `pytest`

Unchanged: 5764 pre-existing issues carried over.

Diagnostics are surfaced as warnings — this check never fails the build.

…the engine (#6) * fix(dashboard): clean up upload temp file on client disconnect + pin python-multipart (NS-501) Follow-up to #47663 (streaming multipart upload), fixing two issues that landed with it. 1. Temp file leaked on client disconnect. The streaming upload endpoint's except chain caught only HTTPException / PermissionError / OSError — all Exception subclasses. asyncio.CancelledError, raised when a browser aborts a large upload mid-stream (the exact NS-501 scenario), is a BaseException, so it bypassed every except clause and reached a finally that only closed the file handle and never unlinked the temp file. Every aborted large upload orphaned a partial `.{name}.*.upload` file (up to ~100 MB) in the target directory. Cleanup now lives in finally, keyed on a `renamed` success flag, so the temp file is removed on every non-success exit including BaseException paths. Added test_stream_upload_cleans_temp_on_cancellation, which fails on the pre-fix code (leaks the temp file) and passes with the fix. 2. python-multipart pinned to ==0.0.27 instead of ==0.0.20. The package was already resolved at 0.0.27 transitively (via daytona) before #47663; the explicit ==0.0.20 pin in the [web] extra and the tool.dashboard lazy-install set downgraded it. Bumped both to ==0.0.27 and regenerated with `uv lock`, keeping the lockfile coherent. The base dependency stays >=0.0.9,<1. * fix(openviking): adapt memory provider for current api (cherry picked from commit cbb87389f33583518975fbf72671de3fd224bb28) * fix(install): resolve PowerShell host instead of bare `powershell` for uv The Windows installer's Install-Uv spawned the astral uv installer with a hardcoded bare `powershell -ExecutionPolicy ByPass -c "irm .../uv | iex"`. That name resolves only to Windows PowerShell, and only when its System32 directory is on PATH. 
Run under PowerShell 7+ (`pwsh`) — or any session where `powershell` isn't on PATH — the spawn dies with "The term 'powershell' is not recognized", and uv installation aborts (the installer then appears stuck). Add Get-PowerShellHostExe, which prefers the absolute path of the host we're already running in (PATH-independent), then falls back to powershell/pwsh via Get-Command, then to the bare name. Install-Uv now invokes that resolved exe. * test(install): lock uv installer to a resolved PowerShell host Source-level guard (install.ps1 only runs on Windows, so there's no Linux CI runner to execute it): the astral uv install line must be invoked via the call operator on a resolved host variable, the bare-`powershell` literal that produced the field-reported "The term 'powershell' is not recognized" must be gone, and the resolver must be PATH-independent (Get-Process -Id $PID) and pwsh-aware. * fix(install): relax native stderr handling in install.ps1 (#48352) * fix(#37878): scrub operator environment before launching cua-driver MCP - Use _sanitize_subprocess_env() to filter Hermes-managed credentials from the cua-driver subprocess environment (issue #37878) - Prevents credential exfiltration to the third-party cua-driver binary - Aligns with existing pattern used by browser-tool and other tools - Add regression test to verify environment sanitization The cua-driver is a lower-trust MCP subprocess per SECURITY.md §2.3. Its inherited environment is now scrubbed by default, removing provider API keys, gateway tokens, and platform credentials that should not leak to third-party binaries. Fixes #37878 * fix(#37878): Address review feedback — fix trailing whitespace and add ANTHROPIC_API_KEY test Review feedback from egilewski: 1. Remove trailing whitespace from test docstring and mock patches (lines 1430, 1469, 1476, 1482) 2. 
Expand test coverage: also verify ANTHROPIC_API_KEY is stripped (not just OPENAI_API_KEY) Changes: - Remove trailing whitespace from test file - Add ANTHROPIC_API_KEY to test environment - Add assertion verifying ANTHROPIC_API_KEY is stripped from cua-driver subprocess env - Syntax verified: python3 -m py_compile tests/tools/test_computer_use.py ✓ * chore(release): map iamlukethedev to AUTHOR_MAP * fix(skills): rmtree scope guard + default pre_update_backup to true (#48200) Defense-in-depth fix for the silent wipe of ~/.hermes/ documented in #48200. A `hermes update --yes` run silently destroyed a user's .env, MEMORY.md, kanban.db, custom skills, and scripts. Two changes: 1. `_rmtree_writable` in tools/skills_sync.py now refuses to rmtree anything outside SKILLS_DIR (the HERMES_HOME/skills/ root). All five call sites pass paths under SKILLS_DIR, so the guard is a no-op for current code and a loud, recoverable failure for any future regression (bad path join, malicious bundled manifest, stale path in scope after an exception). 2. The default `updates.pre_update_backup` flips from false to true in hermes_cli/config.py. A few minutes of zip per update is negligible compared to silent total data loss. Still overridable; --no-backup still works for one-off opt-out. Five new tests in TestRmtreeWritableScopeGuard (root path, hermes home, sibling dir, skills root itself, subdir) plus a flipped `test_default_enabled_creates_backup` in test_backup.py. 178/178 tests pass in the two affected files. Public method signatures unchanged, no test-stub blast radius. Closes #48200 * fix(skills): refuse SKILLS_DIR root in rmtree guard, not just outside-tree The salvaged guard allowed _rmtree_writable(SKILLS_DIR) itself. No call site ever passes the root — every site passes a skill subdir or its .bak sibling — so allowing the root only preserves the #48200 footgun (a dest that collapses to the root wipes every installed skill). 
Require a strict-child relationship and update the test that documented the nonexistent 'full reset' capability. * fix(tui): don't make Enter swallow trailing-space-only slash completions (#48425) * fix(tui): don't make Enter swallow trailing-space-only slash completions Submitting a slash command in the TUI took three Enter presses: one to complete the name (/ex → /exit), a second that only appended the trailing space the gateway adds to keep the classic-CLI prompt_toolkit dropdown open (/exit → "/exit "), and a third to actually submit. The composer's submit handler accepted the highlighted completion whenever applying it changed the input at all, so the whitespace-only delta ate an extra keypress. Treat a completion whose only change is trailing whitespace on an already-complete token as "already complete" and fall through to submit. Partial-name and argument completions (a real token change) still accept on Enter as before. The replace/accept logic is extracted into pure helpers (applyCompletion, completionToApplyOnSubmit) in domain/slash.ts. * test(tui): cover Enter/completion trailing-space behavior and isolate poller queue - completionApply.test.ts asserts completionToApplyOnSubmit accepts real token completions (partial command name, argument) but returns null for a trailing-space-only delta on an already-complete command, so Enter submits instead of needing extra presses. - test_notification_poller_delivers_completion / _skips_consumed previously shared the process-global process_registry.completion_queue. Their events carry no session_key, so a leaked/concurrent poller could dequeue and dispatch them to a fixture agent without run_conversation, flaking CI ("AttributeError: '_FakeAgent' object has no attribute 'run_conversation'"). Isolate the queue per test (fresh queue.Queue via monkeypatch), matching the sibling poller tests that already do this. 
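
The Enter-key rule from the fix(tui) commit above can be sketched as two pure functions. The real helpers (applyCompletion, completionToApplyOnSubmit) are TypeScript in domain/slash.ts; the Python below is an illustrative translation, and the signatures (an input string, a replacement start offset, and the completion text) are assumptions rather than the actual ones.

```python
from typing import Optional


def apply_completion(text: str, start: int, replacement: str) -> str:
    """Replace everything from `start` onward with the completion text."""
    return text[:start] + replacement


def completion_to_apply_on_submit(
    text: str, start: int, replacement: str
) -> Optional[str]:
    """Return the new input if Enter should accept the completion, else None.

    A completion whose only effect is trailing whitespace on an
    already-complete token is treated as "already complete", so the caller
    falls through to submit instead of consuming the keypress.
    """
    applied = apply_completion(text, start, replacement)
    if applied.rstrip() == text.rstrip():
        return None  # no real token change: submit
    return applied  # partial name or argument completed: accept


# Partial command name: Enter accepts the completion.
print(completion_to_apply_on_submit("/ex", 0, "/exit"))
# Already-complete command plus trailing space only: Enter submits.
print(completion_to_apply_on_submit("/exit", 0, "/exit "))
```

Comparing the right-stripped strings keeps the check independent of how many spaces the completion source appends.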
* feat(mcp-catalog): add official Unreal Engine 5.8 MCP server Epic's experimental Unreal MCP plugin embeds an MCP server inside the Unreal Editor process, served over local HTTP (127.0.0.1:8000/mcp by default). HTTP transport, no auth, no install block — the user enables the plugin in-editor and Hermes connects to the URL. Also drops test_optional_mcps_manifests_ship_in_both_wheel_and_sdist: it asserted wheel/sdist packaging targets for pip/Homebrew/Nix installs, which Hermes does not support — installs run from the repo checkout, where the catalog is discovered by directory iteration with no packaging step. * feat(dashboard): surface full per-MCP catalog detail; fix pip-install doc (#48520) The dashboard MCP catalog only showed name/description/transport and a non-clickable source. Users couldn't see what an entry connects to or runs before installing — the exact detail the docs trust model tells them to vet. - /api/mcp/catalog now returns transport target (url, or command+args), auth_type, git install source/ref + bootstrap commands, default-enabled tool hint, and post-install guidance per entry. - McpPage renders the endpoint URL (http) or command+args (stdio), the git install source/ref, a collapsible bootstrap-commands list, setup notes, and the source as a clickable link when it's a URL. - Docs: drop the 'uv pip install -e .[mcp]' quick-start step (Hermes does not support pip installs; MCP ships with the standard install) and note the dashboard now surfaces this detail. - Strengthen the catalog endpoint test to assert the new inspection fields. * fix(install): fail fast when uv venv genuinely fails under relaxed EAP PR #48372 relaxes EAP=Stop around the uv venv call so PowerShell 5.1 doesn't mistake uv's 'Using CPython ...' stderr for a terminating NativeCommandError. 
But relaxing EAP also means a *genuine* uv venv failure (exit != 0) no longer aborts on its own — Install-Venv would continue and print 'Virtual environment ready', and in stage mode Invoke-Stage would report ok=true, even though no venv was created. Capture $LASTEXITCODE immediately after the relaxed call and throw on non-zero (Pop-Location first, matching the function's other exit paths), so the venv stage fails fast instead of falsely succeeding. This is the explicit guard originally proposed in #48463 (devorun), composed on top of #48372's reusable helper + regression test. Adds a regression test asserting the uv venv exit-code capture + throw. * feat(memory): batch operations for single-turn memory updates (#48507) The memory tool was strictly one-op-per-call. With the store running near its char limit by design, a new add that would overflow gets rejected with 'consolidate now, then retry' -- but the model could not consolidate and add in one call. It had to remove/replace across several turns, then retry the add, each turn re-sending the whole conversation context. Expensive thrash. Add an 'operations' array: a list of add/replace/remove ops applied atomically against the FINAL char budget. The model frees space and adds new entries in ONE call, even when an add alone would overflow. All-or-nothing: any bad op aborts the whole batch, nothing written. Root-cause note: the two agent-level memory interception sites (agent_runtime_helpers.py, tool_executor.py) silently dropped any param not in their explicit kwarg list, so 'operations' never reached the handler and batch calls failed with 'Unknown action None'. Both now pass it through and bridge each add/replace op to external memory providers. Also: success response is now terminal (done=true + 'do not repeat' note, no full-entries echo that invited re-edits); schema rewritten to lead with the batch mechanism and an explicit one-shot stop rule (2138 -> 1476 chars). 
Live-verified: near-full consolidate-and-add went 7 calls -> 1 call, stable across 3 reps. 103 memory/approval tests + 398 background-review/ run_agent tests green; 6 new batch tests added. * fix(desktop): never persist or restore a named custom provider as bare "custom" (#48547) * Port from cline/cline#11514: encourage parallel tool calls Add a universal system-prompt guidance block telling the model to batch independent tool calls (reads, searches, web fetches, read-only commands) into a single assistant turn instead of one call per turn. The runtime already executes independent batches concurrently (read-only tools always; non-overlapping path-scoped file ops); the open-source system prompt had nothing steering the model to PRODUCE the batch. Fewer round-trips means less resent context, which compounds over a long conversation. - prompt_builder.py: new PARALLEL_TOOL_CALL_GUIDANCE block (short, static, cache-amortised) modeled on TASK_COMPLETION_GUIDANCE. - system_prompt.py: inject right after the task-completion block, gated by agent.valid_tool_names + the new toggle. - agent_init.py: read agent.parallel_tool_call_guidance (default True). - config.py: add the default under the agent section. - test_prompt_builder.py: behavior-contract tests (batching steer, dependent carve-out, length bound) — invariants, not wording snapshots. Adapted from Cline's TypeScript tool-surface guidance to hermes-agent's Python prompt-assembly architecture and config-over-env conventions. * fix(desktop): never persist or restore a named custom provider as bare "custom" Custom providers vanish from the Desktop/TUI model picker with "No LLM provider configured" — repeatedly fixed (#44062, #44109, #45578) and repeatedly regressed (#44022, #47714) because every fix only recovered the entry identity from a persisted base_url. 
When a session is persisted/restored with the resolved provider "custom" and NO base_url, bare "custom" leaked through verbatim; resolve_runtime_provider("custom") routes to the OpenRouter default URL with no api_key, so the next turn/resume dies. Bare "custom" is the resolved billing class shared by every named providers:/ custom_providers: entry — it is not a routable identity. Centralize the "never let bare custom escape" invariant in one helper, runtime_provider.canonical_custom_identity(), and apply it at all four leak sites in tui_gateway/server.py: - _ensure_session_db_row — the ORIGIN: first DB write seeds the bad row - _runtime_model_config — live persist - _stored_session_runtime_overrides — resume restore (heals old rows; drops unrecoverable bare custom so resume falls back to config default) - _make_agent — rebuild / per-turn The helper recovers custom:<name> from the endpoint URL when present, else from config.model.provider (the durable identity left when no base_url survived). Regression tests in test_custom_provider_session_persistence.py lock the no-base_url vector at every site so it cannot regress again. * fix(prompt): dedupe parallel-tool-call steer; correct its rationale The universal PARALLEL_TOOL_CALL_GUIDANCE block already lives on main, but it shipped with two rough edges this change cleans up: - It duplicated the batching steer for Google models. The GOOGLE_MODEL_OPERATIONAL_GUIDANCE block still carried its own "Parallel tool calls" bullet, so Gemini/Gemma received the instruction twice in one prompt. Drop the redundant bullet — the universal block is now the single source. - Its comment claimed "nothing in the open-source system prompt encouraged batching," which was wrong: the steer existed for Google models only. Reword to say the gap was that every *other* model got nothing. - Tighten the test that asserts the steer (precedence-correct), and add an invariant guarding against re-introducing the Google duplicate. 
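
The batch memory operations described earlier (an 'operations' array applied all-or-nothing against the final char budget) might be modeled as below. This is a sketch under assumptions, not the memory tool's implementation: the op field names (`action`, `content`, `old_text`), the substring matching rule, and the `BatchError` type are hypothetical.

```python
from typing import Dict, List


class BatchError(ValueError):
    """Raised when any op is invalid or the final state exceeds the budget."""


def apply_batch(
    entries: List[str], operations: List[Dict[str, str]], char_limit: int
) -> List[str]:
    """Apply add/replace/remove ops to a copy; return it only if all succeed."""
    working = list(entries)  # never mutate the caller's store
    for index, op in enumerate(operations):
        action = op.get("action")
        try:
            if action == "add":
                working.append(op["content"])
            elif action in ("replace", "remove"):
                matches = [i for i, e in enumerate(working) if op["old_text"] in e]
                if len(matches) != 1:
                    raise BatchError(f"operation {index}: expected exactly one match")
                if action == "replace":
                    working[matches[0]] = op["content"]
                else:
                    del working[matches[0]]
            else:
                raise BatchError(f"operation {index}: unknown action {action!r}")
        except KeyError as missing:
            raise BatchError(f"operation {index}: missing field {missing}") from None
    # The budget is checked once, against the FINAL state, so a remove
    # earlier in the batch can make room for an add later in the batch.
    total = sum(len(entry) for entry in working)
    if total > char_limit:
        raise BatchError(f"final size {total} exceeds limit {char_limit}")
    return working
```

Because the limit is enforced only after every op has been applied to the working copy, a consolidate-and-add fits in one call, and a failing op leaves the original entries untouched.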
* fix(desktop,tui): surface self-improvement review summary + honor memory_notifications The "💾 Self-improvement review" summary (skill/memory updated) was invisible on two surfaces: - Desktop Electron app had no review.summary event handler — skill/memory writes happened silently. Now appends a persistent system message to the transcript (matching the Ink TUI's persistent-line semantics, not a transient toast that can be missed). - tui_gateway (backs both 'hermes --tui' and the desktop) never read display.memory_notifications, so it always behaved as 'on' and ignored a user who set 'off'/'verbose'. Added _load_memory_notifications() (mirrors the messaging gateway's bool->str normalization, defaults to 'on') and wired it to agent.memory_notifications, matching gateway/run.py and the CLI. Delivery chain now reaches all surfaces: background_review.py -> background_review_callback -> review.summary event -> desktop transcript / Ink TUI line / gateway message / CLI print. * fix(dashboard): use DS Button prefix/size API instead of inline icons @nous-research/ui@0.18.2 Button is grid-based: size=xs is an aspect-square icon-only box, and icons belong in prefix/suffix. The dashboard used shadcn-style size=xs + inline <Icon/> text children, which forced text buttons into broken tall squares (Configure, Run setup, Select, Save keys) and split icon/label across grid columns elsewhere (Schedule it, Prune/Delete actions). Move leading icons to prefix and size text buttons as sm/default. For the post-setup spinner, drive the spin from a button-level [&_svg]:animate-spin selector since the prefix slot clones the icon and overwrites its className. 
- ToolsetConfigDrawer: Select, Save keys, Run setup - SkillsPage: New skill, Configure - AutomationBlueprints: Schedule it - SessionsPage: Prune old sessions, Delete empty, Delete selected * feat(billing): /billing terminal billing — interactive TUI + CLI client (#45449) * feat(billing): nous_billing http client + BillingState core (phase 2b) Phase 2b terminal-billing client foundation: - hermes_cli/nous_billing.py: typed client for the 4 /api/billing/* endpoints (state/charge/poll/auto-top-up). Raises typed errors (BillingScopeRequired, BillingRateLimited, BillingAuthError) mapped from the live-verified contract; fail-open is the caller's job. Idempotency-Key enforced client-side. - agent/billing_view.py: surface-agnostic BillingState core + Decimal money parsing (server emits decimal strings, not 2dp), fail-open builder, idempotency-key gen, custom-amount validation. - 51 unit tests (decimal parse/format, payload tiering, error->exception matrix, fail-open, amount validation). Plan: docs/plans/2026-06-13-001-phase-2b-terminal-billing-tui-plan.md * feat(billing): billing:manage scope + lazy step-up re-auth (phase 2b) - NOUS_BILLING_MANAGE_SCOPE constant. - nous_token_has_billing_scope(): split-based scope check (no false-positive substring match). - step_up_nous_billing_scope(): re-runs the device flow requesting billing:manage, reusing the held credential's portal/inference URLs + client_id (so a preview stays a preview), persists like _login_nous but WITHOUT the model picker. Returns True iff the minted token carries the scope (False when NAS silently downscopes a non-admin / unticked grant). Lazy step-up (plan D-A): normal login path unchanged; 403 insufficient_scope from a billing call triggers this. 7 unit tests. * feat(billing): billing JSON-RPC methods for the TUI (phase 2b) billing.state / charge / charge_status / auto_reload / step_up in tui_gateway/server.py. 
Return STRUCTURED success envelopes (result.ok + result.error=<code>) rather than JSON-RPC-level errors, so the Ink rpc() promise always resolves and the TUI branches on the typed billing error code (insufficient_scope, rate_limited, no_payment_method, …) to render the right affordance. Money serialized as decimal STRINGS + display strings. charge mints + echoes an idempotency_key for retry reuse. 16 unit tests. * feat(billing): /billing CLI handler + command registry (phase 2b) - CommandDef("billing", subcommands=buy|auto-reload|limit), added to _SLACK_VIA_HERMES_ONLY so it routes via /hermes on Slack (keeps the 50-cap parity test green, same as /credits). - cli.py::_show_billing + screen helpers: all 5 screens (overview, buy→confirm→ poll, auto-reload, monthly-limit read-only). Reuses _prompt_text_input_modal / _prompt_text_input (D-C). Non-interactive (_app is None) renders text + portal deep-link, never prompts (R7). Decimal money end-to-end. 2s/5-min cancellable poll loop; 429/503 = retry not failure; settled = ledger truth. Lazy step-up on 403 insufficient_scope. no_payment_method treated as mainline funnel-to-portal. - 6 CLI tests; 156 command tests (incl. Slack/Telegram parity) green. * feat(billing): /billing Ink TUI screens + tests (phase 2b) - ui-tui/src/app/slash/commands/billing.ts: /billing TUI command covering all 5 screens — overview (text), buy <amt> → ConfirmReq → charge → non-blocking 2s/ 5-min poll loop → settled/failed/timeout branches, auto-reload <below> <to> → ConfirmReq → PATCH, limit (read-only). Reuses the existing ConfirmReq overlay (D-C) — no bespoke component. Typed-error envelope branching: insufficient_scope arms the lazy step-up confirm; no_payment_method/rate_limited/cap funnel to portal. Client-side amount validation mirrors the server (bounds + 2dp). - gatewayTypes.ts: Billing* response interfaces. - registry.ts: register billingCommands. 
- billingCommand.test.ts: 12 vitest cases (overview/gating/buy-confirm-poll- settled/no_payment_method/step-up/limit/auto-reload/validation). TUI build green; 12/12 vitest pass; slash tests pass once @hermes/ink is built. * docs(billing): scrub private cross-repo references NAS is a private repo — remove all references to it from the public PR: - drop the cross-repo planning doc (planning scaffolding, not a deliverable; the PR description documents the design) - replace 'NAS' / 'PR #412 preview' mentions in code + test comments with generic 'the server' / 'a preview deployment' * docs(billing): scrub final NAS reference in step-up docstring * docs(billing): drop dangling plan-doc refs The phase-2b plan doc was removed in the cross-repo scrub (300afcc0b) but two module docstrings still pointed at it. Drop the dead refs. * feat(billing): interactive /billing overlay + step-up UX, portal-URL & token fixes Adds the interactive /billing TUI overlay and hardens the terminal-billing client across CLI and TUI. - TUI: full /billing overlay state machine (overview to buy to confirm, auto-reload, read-only monthly limit) reusing the existing confirm overlay. - Step-up: surface the verification link in-transcript and open the browser via the TUI's own opener (the device flow runs in the headless gateway, so a printed URL was being dropped); run the step-up handler off the main loop and emit the link as an out-of-band event so the gateway stays responsive. - Step-up copy is scope-accurate ("Billing permission granted") and re-checks /state so it never claims "enabled" when the org kill-switch is still off. - Portal deep-links resolve to absolute URLs against the active portal base (the server emits them relative) - fixes a bare "/billing?topup=open" link. - Billing calls refresh an expired access token via the stored refresh token instead of reporting a false "not logged in". 
- Optimistic funnel: advise "set up a saved card on the portal" up front when no card is on file (advisory, not a hard gate). - Token resolution is cached briefly so the 2s charge poll loop stops re-locking + re-reading the auth store on every tick; 401 re-resolves fresh. - Remove the temporary demo-mode shims. Validation: 87 Python billing tests, 88 TS tests (billing command + gateway event handler), tsc clean, ink + ui-tui builds green. * docs(billing): add /billing TUI screenshots for PR * fix(cli): guard _last_invalidate on bare instances; update stale prompt-fallback test The UI-invalidate throttle read self._last_invalidate unconditionally, which raised AttributeError on HermesCLI instances built without __init__ (the thread-safety test's object.__new__ shell). Guard the read with getattr. The off-main-thread branch of _prompt_text_input was changed (#23185) to cancel cleanly to None instead of falling back to a bare input() that would hang on the slash-worker thread; the test still asserted the old direct-input fallback. Update it to assert the current intended behavior: returns None, calls neither run_in_terminal nor input(), and does not hang. * fix(agent): flush un-persisted messages before session rotation (#47202) compress_context() rotates the session (end_session -> create_session) mid-turn when auto-compress triggers, but never called _flush_messages_to_session_db() first. Messages generated during the current turn that hadn't been persisted to state.db were silently lost. The same bug existed in cli.py:new_session() (/new command). Both paths now flush un-persisted messages before ending the old session. * fix(gateway): preserve original transcript when /compress rotation is skipped The manual /compress handler called rewrite_transcript() unconditionally on the session id returned by _compress_context(). When rotation does not occur (e.g. 
_session_db unavailable, or the DB split raised), session_id is unchanged and rewrite_transcript() DELETEs the original messages and replaces them with only the compressed summary — permanent data loss (#44794, #39704). Guard the rewrite on actual rotation: only overwrite when _compress_context produced a new session id. Otherwise leave the original transcript intact and log a warning. * fix(thread): allow scrolling long user messages in chat history (#48619) * fix(picker): remove max_models=50 cap in interactive model pickers The interactive model pickers (Desktop REST API, TUI model.options, CLI /model) were hard-capped at max_models=50, which truncated large provider catalogs like Kilo Gateway (336 models) to just 50 entries. This made most models undiscoverable via the picker search box. Changes: - Change build_models_payload() default from max_models=50 to None (unlimited) - Change list_authenticated_providers() default from max_models=8 to None - Change list_picker_providers() default from max_models=8 to None - Fix all [:max_models] slicing to handle None as 'no limit' - Remove max_models=50 from 5 interactive picker callers: * web_server.py: get_model_options (Desktop /api/model/options) * web_server.py: get_recommended_default_model * model_switch.py: prewarm_picker_cache_async * tui_gateway/server.py: model.options JSON-RPC * cli.py: HermesCLI model picker - Telegram/Discord inline keyboard picker (gateway/slash_commands.py) still passes max_models=50 explicitly — unchanged behavior. The total_models field was already in the response payload and is now meaningful since models.length == total_models for interactive pickers. Fixes #48279 * fix(picker): keep max_models=0 distinct from unlimited; lock cap semantics Follow-up to the cap-removal salvage. The contributor guarded the new unlimited default with `[:max_models] if max_models else ...`, which conflates max_models=0 (used by slug-only callers that want an empty model list) with None (unlimited). 
Tighten to `is not None` at all five slicing sites in list_authenticated_providers / list_picker_providers, and add a regression test asserting the three-way contract: None=full, 0=empty, N=first N. * fix(gateway): resume follows the compression tip so post-compression replies render Auto-compression ends the live session and forks a continuation child (linked via parent_session_id). A long-lived parent keeps its own flushed message rows, so resolve_resume_session_id()'s empty-head walk never redirected it — resuming the parent id reloaded the pre-compression transcript and dropped every turn generated after compression, including the assistant's response. On the desktop this is the recurring "I sent a message, came back, and the reply isn't there" report on large sessions: the chat's routed id is the pre-rotation id, and both the gateway session.resume RPC and the REST /messages read anchored on it. Fix the resolver at the chokepoint: resolve_resume_session_id() now follows the compression-continuation chain forward via get_compression_tip() before its existing empty-head descendant walk. get_compression_tip() only follows children whose parent ended with end_reason='compression' (created after the parent was ended), so delegation/branch children never hijack a resume. This fixes every resume caller at once (REST /messages, CLI --resume, gateway /resume). session.resume in tui_gateway was the one resume path that never called the resolver — it used the raw target id directly. Route it through resolve_resume_session_id() too (non-lazy only; lazy watch windows must stay on their exact child branch). Resolving up front also re-anchors the live-session fast path so a still-live rotated session is reused by its new key instead of rebuilding a duplicate agent on the stale parent. Tests: - resolve_resume_session_id follows the tip even when the parent retains messages, and is not confused by a delegation child. 
- session.resume binds the agent to the continuation tip and returns the post-compression reply. * test: narrow db._conn before raw SQL so ty stops flagging None-union access The new compression-tip tests poke started_at/ended_at directly via db._conn to force deterministic lineage ordering. _conn is typed Optional[Connection], so ty flagged .execute/.commit as unresolved on None. Bind a local and assert it's non-None first to narrow the union. * feat(kanban): auto-subscribe calling session on kanban_create When a worker calls kanban_create from inside a session that has a persistent delivery channel, the originating session is now subscribed to the new task's completion/block events automatically. The agent that dispatched the task gets notified instead of having to poll. - Gateway sessions (telegram/discord/slack): HERMES_SESSION_PLATFORM + HERMES_SESSION_CHAT_ID ContextVars, set by the messaging gateway. - TUI / desktop sessions: HERMES_SESSION_KEY in the subprocess env. The TUI notification poller keys on platform='tui' + chat_id=<key>. - CLI / cron / test: no persistent channel, no subscription. Gated by kanban.auto_subscribe_on_create in config.yaml (default True). Disable to mirror pre-feature behaviour — users who want explicit kanban_notify-subscribe calls per task can set it to false. This config gate addresses the design concern that got PR #19718 reverted upstream (unconditional implicit auto-subscribe on tool-driven kanban_create was too aggressive for orchestrator users). HERMES_SESSION_ID is intentionally not a fallback channel — it is set by ACP/agent subprocess telemetry for every invocation, not just TUI, so treating it as a notification target would auto-subscribe every CLI session and re-introduce the over-eager behaviour. The kanban_create response now includes a 'subscribed' bool so orchestrators can react if subscription failed (e.g. by falling back to explicit kanban_notify-subscribe or to polling). 
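The channel-selection rules above can be sketched roughly as follows. This is an illustration under assumptions, not the project's code: `resolve_notify_channel` is a hypothetical name, the session context is modelled as a plain mapping rather than ContextVars plus environment, and checking the gateway pair before the TUI key (and falling through on a partial gateway context) is an assumed ordering.

```python
from typing import Mapping, Optional, Tuple


def resolve_notify_channel(
    ctx: Mapping[str, str], auto_subscribe: bool = True
) -> Optional[Tuple[str, str]]:
    """Return (platform, chat_id) to subscribe, or None for no subscription."""
    if not auto_subscribe:
        # Mirrors kanban.auto_subscribe_on_create = false in config.yaml.
        return None
    platform = ctx.get("HERMES_SESSION_PLATFORM")
    chat_id = ctx.get("HERMES_SESSION_CHAT_ID")
    if platform and chat_id:
        # Gateway session (telegram/discord/slack): both values are required.
        return (platform, chat_id)
    session_key = ctx.get("HERMES_SESSION_KEY")
    if session_key:
        # TUI / desktop session: the poller keys on platform='tui'.
        return ("tui", session_key)
    # HERMES_SESSION_ID is deliberately ignored: it is set for every
    # invocation, so it does not identify a persistent delivery channel.
    return None
```

A caller would report the result in the tool response, for example `subscribed = resolve_notify_channel(ctx) is not None`.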
Includes 6 tests covering the gateway / TUI / CLI / partial-context / gated / add_notify_sub-failure paths. All 90 tests in test_kanban_tools.py pass; 509 broader kanban tests pass. * fix(desktop): make session delete idempotent and id-resolving (#48641) DELETE /api/sessions/{id} was the only session endpoint that didn't resolve the id (detail, messages, rename, export all call resolve_session_id) and 404'd when the row was already gone. The desktop optimistically removes the sidebar row, then RESTORES it and shows the error on any failure — so deleting a session that had just been reaped (empty-session hygiene) or removed by a concurrent client resurrected a ghost row and surfaced "session not found". /goal + auto-compression churn leaves transient empty rows that race the sidebar snapshot, which is the exact "I deleted the empty one and got 'session not found'" report. Resolve exact ids / unique prefixes, and treat an already-absent session as an idempotent success — DELETE's contract is "ensure it's gone". This mirrors the bulk-delete endpoint, which already treats ghost ids as success. Tests: deleting an absent id is idempotent (200, not 404); delete resolves a unique prefix; a real session still deletes. * feat(prompt): configurable per-platform system-prompt hint overrides Add platform_hints config so an admin can append to or replace Hermes' built-in platform hint for a single messaging platform (WhatsApp, Slack, Telegram, ...) without affecting other platforms. Enables enterprise managed profiles to steer platform-aware skills (e.g. invoke a custom table-formatting skill on WhatsApp where Markdown tables don't render) while leaving Telegram/Slack/CLI behavior unchanged. 
- hermes_cli/config.py: document platform_hints in DEFAULT_CONFIG - agent/agent_init.py: load platform_hints -> agent._platform_hint_overrides - agent/system_prompt.py: _resolve_platform_hint() applies append/replace (replace wins; bare string = append shorthand); defensive on bad config - tests: 16 cases covering append/replace/shorthand/isolation/malformed Override only affects the platform-hint segment of the system prompt; SOUL/context/memory tiers and general instructions are unchanged. * docs(prompt): document platform_hints config override Adds a 'Customizing platform hints' section to the Prompt Assembly developer guide covering the append/replace/shorthand shapes, the defensive fallback, and the cache-stable lifecycle (stable tier, resolved at build time). * fix(npm): lock react-simple-icons to 13.11.1 suppress annoying message about engines that's completely benign but people seem to complain * fix(desktop): show Hindsight memory provider (#37546) * fix(desktop): show Hindsight memory provider * feat(desktop): configure Hindsight memory provider * fix(desktop): limit Hindsight modes to supported setup * refactor(desktop): generic memory-provider config surface Replace the bespoke Hindsight settings surface with a declarative, schema-driven path so adding a memory provider is pure declaration — no per-provider page, conditional, or endpoint. - memory_providers.py: declarative registry. Each provider lists its fields {key, label, kind, default, options, secret-vs-plain}. Hindsight's mode is a select(cloud, local_external), so rejecting local_embedded falls out of generic enum validation instead of a hand-written check. - One generic endpoint pair GET/PUT /api/memory/providers/{name}/config. GET returns declared fields + current values (secrets only as is_set, never read back); PUT validates selects against their options, writes plain fields to the provider config file, secrets to the env store, and flips memory.provider. 
- ProviderConfigPanel renders straight from the schema, replacing hindsight-settings.tsx and the memory.provider === 'hindsight' conditional in config-settings.tsx — same pattern as toolset-config-panel.tsx off env_vars. Scoped to memory providers; storage layout is unchanged so the runtime Hindsight plugin reads the same config.json / HINDSIGHT_API_KEY / provider keys as before. Tests cover the registry, endpoint behavior (defaults, write+secret, select rejection, unknown provider, secret-never-returned), and the generic panel. * feat(relay): WS-only inbound on the gateway adapter (Phase 3) (#48294) The connector now delivers inbound (messages + interrupts) over the gateway's OUTBOUND /relay WebSocket, not a signed HTTP POST to an inbound endpoint. The gateway needs no inbound HTTP port — which is what makes hosted gateways (no public IP) able to receive inbound at all. - gateway/relay/adapter.py: connect() wires set_interrupt_inbound_handler( self.on_interrupt) so connector->gateway interrupt_inbound frames bridge into the existing per-session interrupt path (the inbound message handler was already wired). Removed _maybe_start_inbound_receiver() + the _inbound_runner lifecycle — there is no HTTP receiver anymore. - gateway/relay/inbound_receiver.py: deleted (the signed-HTTP InboundDelivery receiver). - gateway/relay/__init__.py: removed relay_inbound_config() (dead with the receiver gone). The delivery key is still set in-process by self-provision for forward-compat but is no longer consumed for inbound. - docs/relay-connector-contract.md: §3 rewritten — inbound is the WS back-channel routed cross-instance via the connector's relay bus; §5 interrupt + §6 auth table updated; the old signed-HTTP-POST + per-tenant-delivery-key-signing path is documented as superseded. gatewayEndpoint noted as passthrough-plane only. 
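The handler wiring described in this commit can be illustrated with a small frame router. It is a sketch under stated assumptions: only `set_interrupt_inbound_handler` and the frame type names `inbound` / `interrupt_inbound` come from the commit message; the `FrameRouter` class, the `set_inbound_handler` name, and the JSON frame shape with a `type` field are hypothetical.

```python
import json
from typing import Callable, Dict

Handler = Callable[[dict], None]


class FrameRouter:
    """Routes text frames read from one outbound WebSocket to per-type handlers."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Handler] = {}

    def set_inbound_handler(self, handler: Handler) -> None:
        # Hypothetical name for the message handler that was already wired.
        self._handlers["inbound"] = handler

    def set_interrupt_inbound_handler(self, handler: Handler) -> None:
        # The handler this commit wires up in connect().
        self._handlers["interrupt_inbound"] = handler

    def dispatch(self, raw: str) -> bool:
        """Decode one frame; return True if a registered handler consumed it."""
        frame = json.loads(raw)
        if not isinstance(frame, dict):
            return False
        handler = self._handlers.get(frame.get("type", ""))
        if handler is None:
            return False
        handler(frame)
        return True
```

Because both frame types arrive on the socket the gateway itself dialed, nothing in this path needs a listening HTTP port on the gateway side.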
Tests: stub_connector grows set_interrupt_inbound_handler + push_interrupt; new test_relay_interrupt case proves connect() wires BOTH inbound handlers and an interrupt_inbound frame over the WS cancels the right session. Removed the HTTP-receiver test; updated the crypto-shedding scan + self-provision delivery-key assertion. 88 relay tests pass. EXPERIMENTAL. Pairs with gateway-gateway (relay bus + WsGatewayDelivery) and the NAS GATEWAY_RELAY_URL stamp. The cross-repo E2E (connector repo) proves the full multi-instance path against this production adapter code. * fix(docker): support WebUI installs from read-only sources (#48541) * fix(relay): trigger self-provision on relay-config + NAS token, not is_managed() (#48724) self_provision_if_managed() gated on is_managed(), but is_managed() means "NixOS/package-manager-managed" (it keys on HERMES_MANAGED or a ~/.hermes/.managed marker) — NOT "NAS-hosted". A NAS-provisioned Fly agent sets NEITHER, so the gate was always False and relay self-provision SILENTLY no-oped on exactly the hosted agents it was built for. Caught live: a staging agent with GATEWAY_RELAY_URL correctly stamped logged "No messaging platforms enabled" and never dialed the connector; HERMES_MANAGED was unset on the machine. The unit tests had mocked is_managed()->True, so they passed while the real trigger never fired (mocked- trigger blind spot). Fix: drop the is_managed() gate and rename self_provision_if_managed -> self_provision_relay. The real trigger is now "relay_url() set + no pinned secret + a resolvable NAS token", which is both NAS-independent and self-guarding: - NAS-hosted agent: GATEWAY_RELAY_URL + no pinned secret + bootstrapped NAS token -> self-provisions. - Self-hosted + `hermes gateway enroll`: pinned GATEWAY_RELAY_SECRET -> skipped (existing secret-present guard). - Self-hosted, unenrolled, no NAS identity: resolve_nous_access_token() fails -> graceful no-op (existing fail-soft path). Security: unchanged trust model. 
The connector still derives tenant from the validated NAS token; this only broadens WHEN the provision attempt fires, and every broadened case is still guarded by token-resolution + pinned-secret-skip. Tests: replaced the (wrong) "skips when not managed" test with a regression test proving a NAS host where is_managed()==False STILL provisions; renamed all call sites; added a "no NAS token -> non-fatal skip" test for the self-hosted branch. 88 relay tests pass. Relay-adapter lane. EXPERIMENTAL. * fix(agent): handle missing trigram tokenizer without disabling FTS5 _is_fts5_unavailable_error only matched 'no such module: fts5', but SQLite builds that ship FTS5 without the optional trigram tokenizer raise 'no such tokenizer: trigram' instead. This caused SessionDB init to crash on those builds. Additionally, the trigram failure path called _warn_fts5_unavailable which set _fts_enabled = False, globally disabling full-text search even though the base FTS5 table was created successfully. Fix: - Extend _is_fts5_unavailable_error to also match 'no such tokenizer' - Add _is_tokenizer_unavailable_error to distinguish tokenizer-specific failures from whole-module absence - Only call _warn_fts5_unavailable for module-level failures; skip it for tokenizer-specific failures so base FTS5 remains usable Fixes #47002 * fix(agent): address review feedback on trigram tokenizer fallback - Scope 'no such tokenizer' matcher to trigram specifically (#779) - Decouple base FTS and trigram backfill in v11 migration (#1195) - CJK search falls back to LIKE when trigram unavailable (#3384/#3430) - Add _trigram_available tracking across init, migration, and startup - Add regression tests for migration backfill and CJK LIKE fallback - Add _is_trigram_unavailable_error and _warn_trigram_unavailable helpers * fix(agent): rebuild base fts without trigram * fix(cli): surface a visible warning when the session store is unavailable When SessionDB init fails, the CLI/Desktop previously continued live 
with only a buried log line. The chat looks healthy, but the transcript is never written to state.db — so resume later shows a truncated or empty session and the user only discovers the loss after the fact (#41386). Emit a prominent stderr banner at startup when the store is unavailable, making it explicit that the conversation will not be saved and cannot be resumed, with a pointer to fix the store. Also set _session_db_unavailable so downstream code can detect the degraded state. * chore(release): map chanyoung.kim@nota.ai -> channkim for #47049 salvage * feat(cli): lock hermes worktrees so concurrent processes can't clobber them git worktree lock at creation and unlock before removal. A locked worktree refuses 'git worktree remove' (and prune), so a second hermes process or a stray cleanup can't silently delete an in-use isolated worktree. Fail-soft on both paths — a lock/unlock error never blocks the session or cleanup. Salvaged from #47029 (Issue #46303). Unlock moved to the actual-removal path so a preserved (unpushed-commits) worktree stays locked while in use. * chore: add JoaoMarcos44 to AUTHOR_MAP * fix(dashboard): resolve chat TUI argv off event loop (#48561) * fix(dashboard): resolve chat TUI argv off event loop Dashboard chat now resolves its TUI launch command off the FastAPI/WebSocket event loop. The resolver can run `npm install` / `npm run build` through `_make_tui_argv()`, and doing that synchronously in `/api/pty` can block proxy keepalives and other dashboard WebSocket work long enough for reverse-proxy deployments to drop the chat connection. This keeps the current TUI build policy intact: normal production launches still run the correctness-first `npm run build` path, while `HERMES_TUI_DIR` remains the prebuilt/no-build path for distros and containers. 
The change only moves the potentially slow resolver work to a worker thread for the dashboard chat path, serialized by an `asyncio.Lock` so concurrent chat tabs preserve one-build-at-a-time behavior. `SystemExit` (node/npm missing) and the profile `HTTPException` path still propagate cleanly through `asyncio.to_thread()`. Salvaged from #26124 — rebased onto current main. The async wrapper now threads the `profile` parameter that `_resolve_chat_argv` gained on main since the PR was opened, so cross-profile chat is preserved. Co-authored-by: kshitijk4poor <82637225+kshitijk4poor@users.noreply.github.com> * chore: add 0xdany to AUTHOR_MAP * fix(dashboard): bind chat-argv lock to app.state; cover error propagation Self-review hardening on top of the salvaged fix: - Move `_chat_argv_lock` from a module-level `asyncio.Lock()` onto `app.state` (initialised in `_lifespan`, lazy fallback via `_get_chat_argv_lock`), mirroring `event_lock`. A module-level `asyncio.Lock()` binds to whatever event loop is active at import time, which is the exact pattern `_get_event_state`'s docstring warns against (breaks across TestClient instances / uvicorn reloads). This keeps the lock on the running loop. - Add two tests exercising the real `_resolve_chat_argv_async` → `asyncio.to_thread` → lock → re-raise chain: `SystemExit` (node/npm missing) and `HTTPException` (invalid profile) both propagate out of the worker thread and are caught by `pty_ws`'s existing handlers. The prior tests mocked `asyncio.to_thread` away and never covered this path. * test(dashboard): dedupe pty error-propagation tests; assert close code simplify-code cleanup pass on the salvage stack: - Extract the shared scaffolding of the two pty_ws error-propagation tests into `_assert_pty_propagates`, keeping the two tests as distinct contracts for the `except SystemExit` and `except HTTPException` arms. 
- Assert the stable WebSocket close code (1011) instead of relying solely on the user-facing "Chat unavailable" notice wording — a behavior contract per the AGENTS.md "behavior contracts over snapshots" rule, robust to notice rewording. The detail substring ("unknown profile") is still checked for the HTTPException case since proving the detail survives the thread hop is the point of that test. No production-code change; the helper exercises the same real _resolve_chat_argv_async -> asyncio.to_thread -> lock -> re-raise chain. --------- Co-authored-by: draihan <draihan@student.ubc.ca> * feat(model-picker): add Refresh Models control to bust stale model cache (#48691) The desktop model picker had no way to force a fresh model fetch: model.options went through the 1h-cached provider_models_cache.json, and there was no flag to bust it. When a provider's cached list expired and its next live fetch failed, the picker fell back to the curated static list — silently dropping live-only models (e.g. OpenCode Zen's free tier like deepseek-v4-flash-free) the user had been using. - Thread refresh through model.options (RPC + REST /api/model/options) -> build_models_payload -> list_authenticated_providers, which calls clear_provider_models_cache() up front when set so every row re-fetches live. - Add a 'Refresh Models' control to the desktop picker (5-locale i18n, spinning sync icon). Normal opens leave refresh=false to stay snappy on the cache. Verified: stale cache hides deepseek-v4-flash-free -> refresh busts it -> live re-fetch surfaces it. refresh=false never touches the cache. 
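The refresh behaviour described above (clear the cache up front when `refresh` is set, otherwise serve from cache) can be sketched with an in-memory stand-in. The real cache is the file-backed provider_models_cache.json; the `ModelCache` class, its method names, and the injected `fetch` callable are assumptions made for this illustration.

```python
import time
from typing import Callable, Dict, List, Tuple

Fetcher = Callable[[str], List[str]]


class ModelCache:
    """TTL cache of provider model lists with an explicit refresh switch."""

    def __init__(self, fetch: Fetcher, ttl_seconds: float = 3600.0) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._entries: Dict[str, Tuple[float, List[str]]] = {}

    def clear(self) -> None:
        self._entries.clear()

    def models(self, provider: str, refresh: bool = False) -> List[str]:
        if refresh:
            # Bust the whole cache first so every provider re-fetches live.
            self.clear()
        now = time.monotonic()
        entry = self._entries.get(provider)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]  # cache hit: no network call
        fresh = self._fetch(provider)
        self._entries[provider] = (now, fresh)
        return fresh
```

Normal picker opens would call `models(provider)` and stay on the cache; only the explicit control would pass `refresh=True`.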
* fix(agent): summarize structured provider error messages * Update Stripe Projects skill docs (#48673) Committed-By-Agent: codex Committed-By-Agent: codex Committed-By-Agent: codex Committed-By-Agent: codex Co-authored-by: codex <noreply@openai.com> * feat(image-gen): add image-to-image / editing to image_generate (#48705) * feat(image-gen): add image-to-image / editing to image_generate Brings image generation to parity with video generation: the unified image_generate tool now edits/transforms a source image (image-to-image) when given image_url / reference_image_urls, routing to each backend's edit endpoint, exactly as video_generate routes to image-to-video. - ImageGenProvider ABC: generate() gains keyword-only image_url + reference_image_urls; new capabilities() declares modalities + max_reference_images (defaults to text-only, backward compatible). success_response gains a modality field; adds normalize_reference_images. - image_generate tool: schema exposes image_url + reference_image_urls; dynamic schema reflects the active model's actual edit capability so the agent knows when image_url is honored. Handler + plugin dispatch forward the new inputs; legacy/text-only providers get a clear modality_unsupported error instead of silently dropping the source image. - In-tree FAL: 7 models gain edit endpoints (flux-2-klein, flux-2-pro, nano-banana-pro, gpt-image-1.5, gpt-image-2, ideogram/v3, qwen-image) with per-model edit_supports whitelists + reference caps; routes to the /edit endpoint and skips the upscaler for edits. - Plugins: openai (images.edit, 16 refs), xai (/v1/images/edits via grok-imagine-image-quality, JSON body per xAI docs), krea (image_style_references, 10 refs). openai-codex stays text-only and rejects edits with an actionable error. - Tests: 15 new (payload, routing, dispatch forwarding, dynamic schema, capabilities); updated 2 change-detector/lambda tests for the new schema. 
- Docs: image-generation feature page, image-gen provider plugin guide, tools reference. * fix(image-gen): preserve legacy passthrough in fal/krea plugin tests Two existing plugin tests asserted pre-image-to-image behavior: - fal: forward image_url/reference_image_urls only when supplied, so a text-to-image delegation stays byte-identical (no None kwargs). - krea: keep dict-shaped image_style_references refs verbatim (the unified string refs go through normalize_reference_images; legacy non-string ref objects pass through unchanged) — fixes KeyError when callers pass the richer Krea ref-object shape. * fix(image-gen): clearer not-capable message for text-to-image-only models When a text-to-image-only model (incl. gpt-image-2 on the Codex OAuth path, which can't do editing through the Responses image_generation tool) gets a source image, say 'this model is not capable of image-to-image / editing — provide a text-only prompt' rather than sending the user shopping for other backends. Applies to the openai-codex guard, the in-tree FAL no-edit-endpoint error, and the dynamic tool-schema text-only line. * fix(terminal): block gateway lifecycle commands from inside the gateway process systemctl --user restart hermes-gateway run via the terminal tool is a child of the gateway itself. When systemd delivers SIGTERM the gateway kills this subprocess before it can complete, so the service may never restart — reproducing issue #37453. The hermes gateway restart/stop guard (hermes_cli/gateway.py) and the cron-path guard (hermes_cli/cron.py) already block equivalent commands in their respective paths but the terminal tool had no such defense. Add a hard-block before command execution in terminal_tool: when _HERMES_GATEWAY=1 and the command matches _contains_gateway_lifecycle_command, return an error immediately. 
force=True cannot bypass it — unlike the normal dangerous-command approval flow, here even a user-approved restart would fail because the SIGTERM propagates to child processes. Also extend _GATEWAY_LIFECYCLE_PATTERNS to match systemctl with flags (e.g. systemctl --user restart) — the previous regex required the action word immediately after systemctl with no flags in between. Adds 9 regression tests: 6 blocked variants (parametrized), force bypass attempt, safe systemctl passthrough, and guard-inactive-outside-gateway. * fix(relay): make hosted gateways actually connect AND complete the inbound/outbound round-trip (#48828) * fix(relay): enable RELAY platform + normalize dial URL so hosted gateways actually connect Three bugs blocked a self-provisioned hosted gateway from ever establishing its inbound relay WS (found while standing up the live staging end-to-end). Each masked the next; all three are needed for inbound to work. 1. RELAY platform never enabled in config.platforms (gateway/config.py). register_relay_adapter() puts the adapter in the platform_registry, but start_gateway()'s connect loop iterates self.config.platforms — which never contained Platform.RELAY. So the adapter was "registered" but never connected (logs showed "relay adapter registered" then "No messaging platforms enabled"). Fix: _apply_env_overrides now enables Platform.RELAY (mirroring relay_url into extra for the connected-checker) when GATEWAY_RELAY_URL (env) or gateway.relay_url (yaml) is set. Absent -> no RELAY entry (direct/ single-tenant gateways unaffected). 2. URL scheme not converted for the WS dial (gateway/relay/ws_transport.py). The relay URL is configured once as the http(s):// base (used as-is for the provision POST), but websockets.connect rejects http(s):// with "scheme isn't ws or wss". Fix: _ws_dial_url converts https->wss / http->ws. 3. /relay path not appended (same helper). 
The connector mounts its WebSocketServer at path "/relay" and returns HTTP 400 on an upgrade to any other path. GATEWAY_RELAY_URL is the base (no /relay), so the dial hit "/" -> 400. Fix: _ws_dial_url ensures the path ends in /relay. Idempotent — a URL already carrying ws(s):// and/or /relay is unchanged, so provision's _provision_url (which derives /relay/provision from either form) still works. Why the cross-repo E2E missed #2/#3: the stub connector binds ws://host:port and its websockets.serve accepts ANY path, so neither the scheme nor the /relay path was exercised. Real connector needs both. Verified live on staging hermes-agent-stg-automated-perception-5054: after the fixes the gateway logs "Connecting to relay..." -> "✓ relay connected" -> "Gateway running with 1 platform(s)" against wss://gateway-gateway.staging-nousresearch.com/relay, stable. Tests: added _ws_dial_url scheme+path+idempotency cases (test_ws_transport.py) and RELAY-platform-enablement cases for env + yaml + absent (test_config.py). Full gateway/relay + config suites green (191 passed). Relay-adapter lane. EXPERIMENTAL. * fix(relay): re-attach guild_id to outbound so connector egress resolves the tenant The final bug in the hosted-relay round-trip. Inbound worked end to end (Discord -> connector -> bus -> agent WS -> agent runs -> reply), but the reply's egress was declined by the connector: "discord egress declined: target not routed to an onboarded tenant". Cause: the connector's routedEgressGuard resolves the owning tenant from the OUTBOUND action's metadata.guild_id (Discord's routing discriminator). The gateway's generic delivery path builds outbound metadata via run.py _thread_metadata_for_source, which only carries thread_id (and returns None entirely for a non-threaded message) — so guild_id never reached the connector, tenant resolution failed, and the shared bot refused to post. 
Fix (relay-adapter-local, no perturbation of the generic delivery path or other platforms): RelayAdapter learns chat_id -> guild_id from each inbound event (_capture_scope) and re-attaches it to the outbound action's metadata in send() (_with_scope) when not already present. No-op for chats we never saw inbound (e.g. DMs) and never overwrites an explicit guild_id. Verified live on staging hermes-agent-stg-automated-perception-5054: an @mention in #general now produces a visible bot reply — full multi-tenant relay round-trip (real Discord -> shared connector bot -> tenant routing -> agent WS -> reply egress -> Discord). Tests: _capture_scope/_with_scope reattach, no-scope no-op, explicit-guild_id preserved (test_relay_adapter.py). Full relay + config suites green (160 passed). Relay-adapter lane. EXPERIMENTAL. * fix(tui): restart dashboard chat on idle exit hotkeys * port(opentui): billing overlay, review.summary, Enter/slash fix into the engine Ports the four TUI features deferred in the last upstream sync into the OpenTUI (Solid) engine, on top of a fresh merge of upstream/main. No gateway/Python changes — engine-only, mirroring the Ink behavior. 1. Terminal billing /billing — new full-screen overlay (5-screen state machine: overview/buy/confirm/auto-reload/limit) as net-new Solid components (boundary RPC types + openExternalUrl + logic/billing RPC layer + Solid view + store openBilling/closeBilling/patchBilling + slash /billing + App Match). Follows the engine's overlay conventions (picker/sessionPicker/ agentsDashboard): state/ctx read reactively via getters (state()/ctx()), and theme read via useTheme() directly in each component (no theme prop-drilling). 2. review.summary — wire the self-improvement memory digest into the store reducer (schema already defined); honors display.memory_notifications. 3. 
Enter/slash-completion fix: completionMenu.acceptChangesToken gates the composer's Enter-accept so an already-complete command submits instead of being swallowed into "<cmd> " (Ink completionToApplyOnSubmit parity).
4. custom-provider session-info: verified already round-tripped; strengthened the test to assert a custom:<name> slug is kept verbatim.

Adversarial review (Claude Code, 2 rounds, opentui+tmux skills + a live billing smoke harness driving the real overlay) found + fixed two billing bugs:
- AutoReloadScreen mounted EMPTY (the engine <input> `value` is init-only): added value={value()} so an existing config prefills; + a captureFrame test.
- ConfirmScreen double-charge race on rapid Enter/Y: added a `paying` guard.

Gate green: 852 engine tests, prettier+tsc+eslint clean, build.mjs builds dist.

---------

Co-authored-by: kshitijk4poor <82637225+kshitijk4poor@users.noreply.github.com>
Co-authored-by: qin-ctx <qinhaojie.exe@bytedance.com>
Co-authored-by: xxxigm <tuancanhnguyen706@gmail.com>
Co-authored-by: Tranquil-Flow <66773372+Tranquil-Flow@users.noreply.github.com>
Co-authored-by: Luke The Dev <iamlukethedev@users.noreply.github.com>
Co-authored-by: Teknium <127238744+teknium1@users.noreply.github.com>
Co-authored-by: Kewe63 <Kewe63@users.noreply.github.com>
Co-authored-by: xxxigm <54813621+xxxigm@users.noreply.github.com>
Co-authored-by: Brooklyn Nicholson <brooklyn.bb.nicholson@gmail.com>
Co-authored-by: emozilla <emozilla@nousresearch.com>
Co-authored-by: kyssta-exe <kyssta-exe@users.noreply.github.com>
Co-authored-by: alelpoan <155192176+alelpoan@users.noreply.github.com>
Co-authored-by: islam666 <islam666@users.noreply.github.com>
Co-authored-by: flooryyyy <67979730+flooryyyy@users.noreply.github.com>
Co-authored-by: Victor Kyriazakos <victor@rocketfueldev.com>
Co-authored-by: ethernet <arilotter@gmail.com>
Co-authored-by: Ben <62250174+benfrank241@users.noreply.github.com>
Co-authored-by: Ben Barclay <ben@nousresearch.com>
Co-authored-by: Evo <r2668940489@gmail.com>
Co-authored-by: liuhao1024 <sunsky.lau@gmail.com>
Co-authored-by: channkim <chanyoung.kim@nota.ai>
Co-authored-by: JoaoMarcos44 <87440198+JoaoMarcos44@users.noreply.github.com>
Co-authored-by: draihan <draihan@student.ubc.ca>
Co-authored-by: Gille <4317663+helix4u@users.noreply.github.com>
Co-authored-by: colinwren-stripe <92538686+colinwren-stripe@users.noreply.github.com>
Co-authored-by: codex <noreply@openai.com>
Co-authored-by: AhmetArif0 <147827411+AhmetArif0@users.noreply.github.com>
Co-authored-by: Shannon Sands <shannon.sands.1979@gmail.com>
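The dial-URL normalization described in the hosted-gateway relay fix above (https to wss, http to ws, path ending in /relay, idempotent) can be sketched as a pure function. This is an illustrative reconstruction from the commit message, not the body of `_ws_dial_url`; the function name `ws_dial_url` and the example hosts are assumptions.

```python
from urllib.parse import urlsplit, urlunsplit

# HTTP schemes used for the provision POST map to their WebSocket equivalents.
_SCHEME_MAP = {"https": "wss", "http": "ws"}


def ws_dial_url(base_url: str) -> str:
    """Turn a configured relay base URL into the URL to dial over WebSocket."""
    parts = urlsplit(base_url)
    # ws:// and wss:// are not in the map, so they pass through unchanged.
    scheme = _SCHEME_MAP.get(parts.scheme, parts.scheme)
    path = parts.path.rstrip("/")
    if not path.endswith("/relay"):
        path += "/relay"
    return urlunsplit((scheme, parts.netloc, path, parts.query, parts.fragment))
```

Feeding the output back in returns the same string, which is the idempotency property the commit relies on so that a URL already in ws(s)://…/relay form is left alone.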

alt-glitch added the labels type/feature (New feature or request), comp/gateway (Gateway runner, session dispatch, delivery), and P2 (Medium: degraded but workaround exists) on Jun 18, 2026

benbarclay merged commit d2c53ff into main Jun 18, 2026
42 of 43 checks passed

benbarclay deleted the feat/relay-inbound-ws branch June 18, 2026 23:33

Haderach-Ram mentioned this pull request Jun 19, 2026

Ecosystem Digest — 2026-06-19 Haderach-Ram/openclaw-radar#55

Open

github-actions Bot mentioned this pull request Jun 20, 2026

chore: bump NousResearch/hermes-agent version from v2026.6.5 to v2026.6.19 Docker-Hub-sirmark/docker-hermes-agent#10

Merged

github-actions Bot commented Jun 18, 2026

🔎 Lint report: feat/relay-inbound-ws vs origin/main

ruff

ty (type checker)
