fix(discord): recover from runtime gateway task exits #44383

Merged: austinpickett merged 3 commits into main from salvage/39416-discord-task-exit-recovery on Jun 11, 2026

@austinpickett

Collaborator

Salvage of #39416 (AMEOBIUS). Cherry-picked the 2-file fix from a branch that was 1081 commits behind with 28 unrelated commits. Original author was silent for 5+ days after a split-out request.

What this fixes

A post-ready discord.py WebSocket crash (e.g. ClientOSError, ConnectionResetError) left the gateway split-brained: producer tasks stayed alive while Discord stopped responding. The adapter had no task-exit callback, so no reconnect was ever triggered.

Changes

  • plugins/platforms/discord/adapter.py — adds _handle_bot_task_done callback wired into Bot.start()'s wrapping task via add_done_callback. On any post-ready exit, calls _set_fatal_error(retryable=True) + _notify_fatal_error() so the existing GatewayRunner._handle_adapter_fatal_error reconnect watcher replaces the dead adapter. Also adds _wait_for_ready_or_bot_exit() so startup failures (SOCKS/proxy errors, invalid tokens) surface immediately instead of burning the full ready timeout.
  • tests/plugins/test_discord_runtime_failure.py — 2 tests: (1) runtime task exit triggers retryable fatal notification, (2) ready event resolves normally when startup succeeds.
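
The callback wiring described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the adapter's actual code: `SketchAdapter`, its `start()` method, and the shape of the stored error are hypothetical; only the hook names `_set_fatal_error`, `_notify_fatal_error`, and `_handle_bot_task_done` come from the PR description.

```python
import asyncio
from typing import Optional, Tuple


class SketchAdapter:
    """Hypothetical stand-in for the Discord adapter; not the real class."""

    def __init__(self) -> None:
        self.fatal_error: Optional[Tuple[str, bool]] = None
        self.notified = asyncio.Event()
        self._bot_task: Optional[asyncio.Task] = None
        self._notify_task: Optional[asyncio.Task] = None

    def _set_fatal_error(self, message: str, retryable: bool) -> None:
        self.fatal_error = (message, retryable)

    async def _notify_fatal_error(self) -> None:
        # The real hook would hand off to the gateway supervisor.
        self.notified.set()

    def _handle_bot_task_done(self, task: asyncio.Task) -> None:
        # Done callbacks are plain functions, so the async notification
        # is scheduled as its own task instead of being awaited here.
        if task.cancelled():
            reason = "bot task cancelled"
        else:
            exc = task.exception()
            reason = repr(exc) if exc else "bot task exited"
        self._set_fatal_error(reason, retryable=True)
        loop = asyncio.get_running_loop()
        # Keep a reference so the notification task is not garbage collected.
        self._notify_task = loop.create_task(self._notify_fatal_error())

    def start(self, bot_coro) -> None:
        # bot_coro stands in for the coroutine returned by Bot.start().
        self._bot_task = asyncio.create_task(bot_coro)
        self._bot_task.add_done_callback(self._handle_bot_task_done)


async def demo() -> Optional[Tuple[str, bool]]:
    async def crashing_bot() -> None:
        await asyncio.sleep(0)
        raise ConnectionResetError("gateway socket dropped")

    adapter = SketchAdapter()
    adapter.start(crashing_bot())
    await asyncio.wait_for(adapter.notified.wait(), timeout=1)
    return adapter.fatal_error


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Reading `task.exception()` inside the callback also marks the exception as retrieved, so asyncio does not log it as never retrieved when the task is collected.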

Verification

Both tests pass locally (2 passed in 0.29s). Hooks (_set_fatal_error, _notify_fatal_error, _handle_adapter_fatal_error) confirmed present on main.

Closes #39416.

@github-actions

github-actions Bot commented Jun 11, 2026

Contributor

🔎 Lint report: salvage/39416-discord-task-exit-recovery vs origin/main

ruff

Total: 0 on HEAD, 0 on base (➖ 0)

🆕 New issues: none

✅ Fixed issues: none

Unchanged: 0 pre-existing issues carried over.

ty (type checker)

Total: 10763 on HEAD, 10758 on base (🆕 +5)

🆕 New issues (4):

| Rule | Count |
| --- | --- |
| unresolved-attribute | 2 |
| invalid-assignment | 1 |
| unresolved-import | 1 |

First entries:
tests/plugins/test_discord_runtime_failure.py:47: [invalid-assignment] invalid-assignment: Object of type `AsyncMock` is not assignable to attribute `_notify_fatal_error` of type `def _notify_fatal_error(self) -> CoroutineType[Any, Any, None]`
tests/plugins/test_discord_runtime_failure.py:38: [unresolved-attribute] unresolved-attribute: Object of type `bound method DiscordAdapter._notify_fatal_error() -> CoroutineType[Any, Any, None]` has no attribute `assert_awaited_once`
tests/plugins/test_discord_runtime_failure.py:4: [unresolved-import] unresolved-import: Cannot resolve imported module `pytest`
tests/plugins/test_discord_runtime_failure.py:59: [unresolved-attribute] unresolved-attribute: Object of type `bound method DiscordAdapter._notify_fatal_error() -> CoroutineType[Any, Any, None]` has no attribute `assert_not_awaited`

✅ Fixed issues: none

Unchanged: 5632 pre-existing issues carried over.

Diagnostics are surfaced as warnings — this check never fails the build.
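
For context on the `invalid-assignment` and `unresolved-attribute` entries: they appear to come from assigning an `AsyncMock` directly onto a method attribute and then calling mock assertions through the method's declared type. One common alternative is to keep a separate handle on the mock via `patch.object`. The sketch below uses a hypothetical `SketchAdapter` and `report_crash()`; whether this form silences the ty diagnostics has not been verified here.

```python
import asyncio
from unittest.mock import AsyncMock, patch


class SketchAdapter:
    """Hypothetical adapter with an async notification hook."""

    async def _notify_fatal_error(self) -> None:
        raise RuntimeError("expected to be replaced by a mock in tests")

    async def report_crash(self) -> None:
        await self._notify_fatal_error()


def run_patched_check() -> int:
    adapter = SketchAdapter()
    # patch.object swaps the attribute for the duration of the block and
    # yields the mock itself, so assertions go through the mock handle
    # rather than through the adapter's method attribute.
    with patch.object(
        adapter, "_notify_fatal_error", new_callable=AsyncMock
    ) as notify:
        asyncio.run(adapter.report_crash())
        notify.assert_awaited_once()
        return notify.await_count


if __name__ == "__main__":
    print(run_patched_check())
```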

Salvaged from #39416 (AMEOBIUS) — cherry-picked only the task-exit
recovery; the original PR was 1081 commits behind with 28 unrelated
commits.

A post-ready discord.py WebSocket crash left the gateway split-brained:
producers stayed active while Discord stopped responding. After this fix
the adapter calls _set_fatal_error(retryable=True) + _notify_fatal_error()
so the existing GatewayRunner reconnect watcher replaces the dead adapter.

Also adds _wait_for_ready_or_bot_exit() so startup failures (SOCKS/proxy
errors, invalid tokens) surface fast instead of burning the full ready
timeout. Because connect() no longer waits via asyncio.wait_for on that
path, test_connect_releases_token_lock_on_timeout is updated to trigger
the timeout through the new helper (same lock-release contract).

3 tests pass (2 new runtime-failure tests + the updated timeout test);
test_discord_connect.py and test_discord_slash_commands.py green.

Co-Authored-By: ameobius <ameobius@local.host>
@austinpickett force-pushed the salvage/39416-discord-task-exit-recovery branch from 545e5a8 to c9f1a32 on June 11, 2026 17:13

@tonydwb tonydwb left a comment

Code Review Summary

Verdict: Approved

Summary

Fixes the Discord adapter so it recovers from runtime gateway task exits. discord.py handles normal gateway interruptions by reconnecting internally, but when its top-level Bot.start() task actually exits after the adapter has been marked running, the gateway is left in a split-brain state. This PR surfaces those exits to the gateway supervisor for retry.

Changes

  • Adds _wait_for_ready_or_bot_exit() that races the ready event against the bot task, surfacing early startup failures.
  • Adds _handle_bot_task_done() callback to detect post-startup task exits.
  • Adds _disconnecting flag to distinguish operator shutdown from crash.

Review

  • Good resilience fix with proper error propagation to the gateway supervisor.
  • Extensive comments explaining the discord.py reconnect behavior.
  • No security concerns.

Reviewed by Hermes Agent (batch cron)
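
The ready-versus-exit race summarized in this review could look roughly like the sketch below. It is an illustration only: the function name `wait_for_ready_or_task_exit`, its parameters, and the error messages are assumptions, and the real `_wait_for_ready_or_bot_exit()` may differ in signature and behavior.

```python
import asyncio


async def wait_for_ready_or_task_exit(
    ready_event: asyncio.Event, bot_task: asyncio.Task, timeout: float
) -> None:
    """Return once ready_event is set; raise if bot_task ends first or time runs out."""
    ready_task = asyncio.ensure_future(ready_event.wait())
    try:
        done, _pending = await asyncio.wait(
            {ready_task, bot_task},
            timeout=timeout,
            return_when=asyncio.FIRST_COMPLETED,
        )
        if ready_task in done:
            return
        if bot_task in done:
            # result() re-raises whatever exception ended the bot task.
            bot_task.result()
            raise RuntimeError("bot task exited before becoming ready")
        raise asyncio.TimeoutError("bot did not become ready in time")
    finally:
        if not ready_task.done():
            ready_task.cancel()
            # Await the cancelled waiter so it is not left pending.
            await asyncio.gather(ready_task, return_exceptions=True)


async def demo_crash() -> str:
    async def failing_start() -> None:
        raise ConnectionRefusedError("proxy unreachable")

    bot_task = asyncio.create_task(failing_start())
    try:
        await wait_for_ready_or_task_exit(asyncio.Event(), bot_task, timeout=5)
    except ConnectionRefusedError as exc:
        return f"surfaced early: {exc}"
    return "ready"


async def demo_ready() -> str:
    ready = asyncio.Event()

    async def healthy_start() -> None:
        ready.set()
        await asyncio.sleep(3600)  # stands in for a long-lived gateway loop

    bot_task = asyncio.create_task(healthy_start())
    try:
        await wait_for_ready_or_task_exit(ready, bot_task, timeout=5)
        return "ready"
    finally:
        bot_task.cancel()
        await asyncio.gather(bot_task, return_exceptions=True)


if __name__ == "__main__":
    print(asyncio.run(demo_crash()))
    print(asyncio.run(demo_ready()))
```

In the crash case the startup error is raised as soon as the bot task finishes, without waiting out the 5 second timeout used in the demo.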

@alt-glitch added labels on Jun 11, 2026: type/bug (Something isn't working), comp/gateway (Gateway runner, session dispatch, delivery), platform/discord (Discord bot adapter), P2 (Medium: degraded but workaround exists)
@liuhao1024

Contributor

Verification — Discord runtime task exit recovery

Reviewed the split-brain prevention for the Discord adapter. The design is well structured:

  1. _wait_for_ready_or_bot_exit() — Races ready_event.wait() against bot_task with asyncio.wait(FIRST_COMPLETED). If the bot task finishes first, its exception is re-raised. The finally block properly cancels and awaits the pending ready_task to avoid task leaks.

  2. _handle_bot_task_done() — Three early-exit guards before declaring fatal error:

    • _disconnecting flag → intentional shutdown
    • Stale task check (task is not self._bot_task) → reconnect already replaced the task
    • not self._running → startup failures handled by connect()

    Only after all three pass does it call _set_fatal_error(retryable=True) and notify the supervisor.

  3. _disconnecting flag — Set in disconnect() before closing the client. This cleanly prevents the done-callback from treating operator-initiated shutdowns as crashes.

  4. Tests — Runtime exit test confirms fatal error is set with retryable=True and supervisor is notified. Intentional disconnect test confirms no false alarm. Both pass CI.

No issues found. This addresses a real operational gap where a dead Discord websocket could leave the gateway process alive but unresponsive to Discord messages.
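
The guard ordering listed above can be modeled as a small decision function. This is a hedged sketch: `GuardedAdapter`, `classify_exit()`, and the returned labels are hypothetical names used for illustration, while `_disconnecting`, `_bot_task`, and `_running` are the attribute names mentioned in the review.

```python
import asyncio
from typing import List, Optional


class GuardedAdapter:
    """Hypothetical model of the three guards; not the project's class."""

    def __init__(self) -> None:
        self._disconnecting = False
        self._running = True
        self._bot_task: Optional[asyncio.Task] = None

    def classify_exit(self, task: asyncio.Task) -> str:
        if self._disconnecting:
            return "shutdown"  # operator-initiated disconnect, not a crash
        if task is not self._bot_task:
            return "stale"  # a reconnect already replaced this task
        if not self._running:
            return "startup"  # connect() reports startup failures itself
        return "fatal"  # post-ready exit: mark retryable and notify


async def demo() -> List[str]:
    async def noop() -> None:
        return None

    current = asyncio.create_task(noop())
    stale = asyncio.create_task(noop())
    await asyncio.gather(current, stale)

    adapter = GuardedAdapter()
    adapter._bot_task = current
    results = [adapter.classify_exit(current), adapter.classify_exit(stale)]

    adapter._running = False
    results.append(adapter.classify_exit(current))

    adapter._running = True
    adapter._disconnecting = True
    results.append(adapter.classify_exit(current))
    return results


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Checking `_disconnecting` first means an intentional shutdown is never reported as a crash, even if the task being closed is still the current one.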

Resolve disconnect() conflict by keeping both the _disconnecting guard
from this PR and the _cancel_bot_task() zombie-client cleanup from main.

Co-authored-by: Cursor <cursoragent@cursor.com>

connect() no longer uses asyncio.wait_for for the ready handshake, so
test_connect_timeout_cancels_bot_task was hanging for 30s in CI.

Co-authored-by: Cursor <cursoragent@cursor.com>
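
A rough sketch of why patching the helper avoids the 30s hang: once `connect()` delegates the wait to a helper, a test can replace that helper with a mock that raises the timeout immediately. Everything below is hypothetical (`SketchAdapter`, `_run_bot`, the helper's `timeout` parameter, and the `connect()` return value); it shows the pattern only, not the project's test.

```python
import asyncio
from typing import Optional
from unittest.mock import AsyncMock, patch


class SketchAdapter:
    """Hypothetical adapter; only the timeout path of connect() is modeled."""

    def __init__(self) -> None:
        self._bot_task: Optional[asyncio.Task] = None

    async def _run_bot(self) -> None:
        await asyncio.sleep(3600)  # stands in for a long-lived Bot.start()

    async def _wait_for_ready_or_bot_exit(self, timeout: float) -> None:
        # Unpatched, this really waits, which is too slow for a unit test.
        await asyncio.sleep(timeout)
        raise asyncio.TimeoutError

    async def connect(self) -> bool:
        self._bot_task = asyncio.create_task(self._run_bot())
        try:
            await self._wait_for_ready_or_bot_exit(timeout=30)
        except asyncio.TimeoutError:
            # On timeout, cancel the bot task and wait for it to finish.
            self._bot_task.cancel()
            await asyncio.gather(self._bot_task, return_exceptions=True)
            return False
        return True


async def check_timeout_cancels_bot_task() -> bool:
    adapter = SketchAdapter()
    # The mock raises TimeoutError as soon as it is awaited.
    with patch.object(
        adapter,
        "_wait_for_ready_or_bot_exit",
        AsyncMock(side_effect=asyncio.TimeoutError),
    ):
        connected = await adapter.connect()
    assert adapter._bot_task is not None
    return (not connected) and adapter._bot_task.cancelled()


if __name__ == "__main__":
    print(asyncio.run(check_timeout_cancels_bot_task()))
```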
@austinpickett austinpickett merged commit c3464ec into main Jun 11, 2026
28 checks passed
@austinpickett austinpickett deleted the salvage/39416-discord-task-exit-recovery branch June 11, 2026 19:39
alt-glitch pushed a commit that referenced this pull request Jun 14, 2026
* fix(discord): recover from runtime gateway task exits

Salvaged from #39416 (AMEOBIUS) — cherry-picked only the task-exit
recovery; the original PR was 1081 commits behind with 28 unrelated
commits.

A post-ready discord.py WebSocket crash left the gateway split-brained:
producers stayed active while Discord stopped responding. After this fix
the adapter calls _set_fatal_error(retryable=True) + _notify_fatal_error()
so the existing GatewayRunner reconnect watcher replaces the dead adapter.

Also adds _wait_for_ready_or_bot_exit() so startup failures (SOCKS/proxy
errors, invalid tokens) surface fast instead of burning the full ready
timeout. Because connect() no longer waits via asyncio.wait_for on that
path, test_connect_releases_token_lock_on_timeout is updated to trigger
the timeout through the new helper (same lock-release contract).

3 tests pass (2 new runtime-failure tests + the updated timeout test);
test_discord_connect.py and test_discord_slash_commands.py green.

Co-Authored-By: ameobius <ameobius@local.host>

* fix(test): patch _wait_for_ready_or_bot_exit in timeout cancel test

connect() no longer uses asyncio.wait_for for the ready handshake, so
test_connect_timeout_cancels_bot_task was hanging for 30s in CI.

Co-authored-by: Cursor <cursoragent@cursor.com>

---------

Co-authored-by: ameobius <ameobius@local.host>
Co-authored-by: Cursor <cursoragent@cursor.com>
AIalliAI pushed a commit to AIalliAI/Hermes that referenced this pull request Jun 14, 2026
T02200059 pushed a commit to T02200059/hermes-agent that referenced this pull request Jun 18, 2026