fix(discord): recover from runtime gateway task exits#44383
Conversation
🔎 Lint report:
|
| Rule | Count |
|---|---|
unresolved-attribute |
2 |
invalid-assignment |
1 |
unresolved-import |
1 |
First entries
tests/plugins/test_discord_runtime_failure.py:47: [invalid-assignment] invalid-assignment: Object of type `AsyncMock` is not assignable to attribute `_notify_fatal_error` of type `def _notify_fatal_error(self) -> CoroutineType[Any, Any, None]`
tests/plugins/test_discord_runtime_failure.py:38: [unresolved-attribute] unresolved-attribute: Object of type `bound method DiscordAdapter._notify_fatal_error() -> CoroutineType[Any, Any, None]` has no attribute `assert_awaited_once`
tests/plugins/test_discord_runtime_failure.py:4: [unresolved-import] unresolved-import: Cannot resolve imported module `pytest`
tests/plugins/test_discord_runtime_failure.py:59: [unresolved-attribute] unresolved-attribute: Object of type `bound method DiscordAdapter._notify_fatal_error() -> CoroutineType[Any, Any, None]` has no attribute `assert_not_awaited`
✅ Fixed issues: none
Unchanged: 5632 pre-existing issues carried over.
Diagnostics are surfaced as warnings — this check never fails the build.
Salvaged from #39416 (AMEOBIUS) — cherry-picked only the task-exit recovery; the original PR was 1081 commits behind with 28 unrelated commits. A post-ready discord.py WebSocket crash left the gateway split-brained: producers stayed active while Discord stopped responding. After this fix the adapter calls _set_fatal_error(retryable=True) + _notify_fatal_error() so the existing GatewayRunner reconnect watcher replaces the dead adapter. Also adds _wait_for_ready_or_bot_exit() so startup failures (SOCKS/proxy errors, invalid tokens) surface fast instead of burning the full ready timeout. Because connect() no longer waits via asyncio.wait_for on that path, test_connect_releases_token_lock_on_timeout is updated to trigger the timeout through the new helper (same lock-release contract). 3 tests pass (2 new runtime-failure tests + the updated timeout test); test_discord_connect.py and test_discord_slash_commands.py green. Co-Authored-By: ameobius <ameobius@local.host>
545e5a8 to
c9f1a32
Compare
tonydwb
left a comment
There was a problem hiding this comment.
Code Review Summary
Verdict: Approved
Summary
Fixes Discord adapter to recover from runtime gateway task exits. Discord.py reconnects normal gateway interruptions internally, but when its top-level Bot.start() task actually exits after the adapter has been marked running, a split-brain state occurs. This PR surfaces those exits to the gateway supervisor for retry.
Changes
- Adds
_wait_for_ready_or_bot_exit()that races the ready event against the bot task, surfacing early startup failures. - Adds
_handle_bot_task_done()callback to detect post-startup task exits. - Adds
_disconnectingflag to distinguish operator shutdown from crash.
Review
- Good resilience fix with proper error propagation to the gateway supervisor.
- Extensive comments explaining the Discord.py reconnect behavior.
- No security concerns.
Reviewed by Hermes Agent (batch cron)
|
Verification — Discord runtime task exit recovery Reviewed the split-brain prevention for Discord adapter. The design is well-structured:
No issues found. This addresses a real operational gap where a dead Discord websocket could leave the gateway process alive but unresponsive to Discord messages. |
Resolve disconnect() conflict by keeping both the _disconnecting guard from this PR and the _cancel_bot_task() zombie-client cleanup from main. Co-authored-by: Cursor <cursoragent@cursor.com>
This comment was marked as spam.
This comment was marked as spam.
connect() no longer uses asyncio.wait_for for the ready handshake, so test_connect_timeout_cancels_bot_task was hanging for 30s in CI. Co-authored-by: Cursor <cursoragent@cursor.com>
* fix(discord): recover from runtime gateway task exits Salvaged from #39416 (AMEOBIUS) — cherry-picked only the task-exit recovery; the original PR was 1081 commits behind with 28 unrelated commits. A post-ready discord.py WebSocket crash left the gateway split-brained: producers stayed active while Discord stopped responding. After this fix the adapter calls _set_fatal_error(retryable=True) + _notify_fatal_error() so the existing GatewayRunner reconnect watcher replaces the dead adapter. Also adds _wait_for_ready_or_bot_exit() so startup failures (SOCKS/proxy errors, invalid tokens) surface fast instead of burning the full ready timeout. Because connect() no longer waits via asyncio.wait_for on that path, test_connect_releases_token_lock_on_timeout is updated to trigger the timeout through the new helper (same lock-release contract). 3 tests pass (2 new runtime-failure tests + the updated timeout test); test_discord_connect.py and test_discord_slash_commands.py green. Co-Authored-By: ameobius <ameobius@local.host> * fix(test): patch _wait_for_ready_or_bot_exit in timeout cancel test connect() no longer uses asyncio.wait_for for the ready handshake, so test_connect_timeout_cancels_bot_task was hanging for 30s in CI. Co-authored-by: Cursor <cursoragent@cursor.com> --------- Co-authored-by: ameobius <ameobius@local.host> Co-authored-by: Cursor <cursoragent@cursor.com>
…4383) * fix(discord): recover from runtime gateway task exits Salvaged from NousResearch#39416 (AMEOBIUS) — cherry-picked only the task-exit recovery; the original PR was 1081 commits behind with 28 unrelated commits. A post-ready discord.py WebSocket crash left the gateway split-brained: producers stayed active while Discord stopped responding. After this fix the adapter calls _set_fatal_error(retryable=True) + _notify_fatal_error() so the existing GatewayRunner reconnect watcher replaces the dead adapter. Also adds _wait_for_ready_or_bot_exit() so startup failures (SOCKS/proxy errors, invalid tokens) surface fast instead of burning the full ready timeout. Because connect() no longer waits via asyncio.wait_for on that path, test_connect_releases_token_lock_on_timeout is updated to trigger the timeout through the new helper (same lock-release contract). 3 tests pass (2 new runtime-failure tests + the updated timeout test); test_discord_connect.py and test_discord_slash_commands.py green. Co-Authored-By: ameobius <ameobius@local.host> * fix(test): patch _wait_for_ready_or_bot_exit in timeout cancel test connect() no longer uses asyncio.wait_for for the ready handshake, so test_connect_timeout_cancels_bot_task was hanging for 30s in CI. Co-authored-by: Cursor <cursoragent@cursor.com> --------- Co-authored-by: ameobius <ameobius@local.host> Co-authored-by: Cursor <cursoragent@cursor.com>
…4383) * fix(discord): recover from runtime gateway task exits Salvaged from NousResearch#39416 (AMEOBIUS) — cherry-picked only the task-exit recovery; the original PR was 1081 commits behind with 28 unrelated commits. A post-ready discord.py WebSocket crash left the gateway split-brained: producers stayed active while Discord stopped responding. After this fix the adapter calls _set_fatal_error(retryable=True) + _notify_fatal_error() so the existing GatewayRunner reconnect watcher replaces the dead adapter. Also adds _wait_for_ready_or_bot_exit() so startup failures (SOCKS/proxy errors, invalid tokens) surface fast instead of burning the full ready timeout. Because connect() no longer waits via asyncio.wait_for on that path, test_connect_releases_token_lock_on_timeout is updated to trigger the timeout through the new helper (same lock-release contract). 3 tests pass (2 new runtime-failure tests + the updated timeout test); test_discord_connect.py and test_discord_slash_commands.py green. Co-Authored-By: ameobius <ameobius@local.host> * fix(test): patch _wait_for_ready_or_bot_exit in timeout cancel test connect() no longer uses asyncio.wait_for for the ready handshake, so test_connect_timeout_cancels_bot_task was hanging for 30s in CI. Co-authored-by: Cursor <cursoragent@cursor.com> --------- Co-authored-by: ameobius <ameobius@local.host> Co-authored-by: Cursor <cursoragent@cursor.com>
…4383) * fix(discord): recover from runtime gateway task exits Salvaged from NousResearch#39416 (AMEOBIUS) — cherry-picked only the task-exit recovery; the original PR was 1081 commits behind with 28 unrelated commits. A post-ready discord.py WebSocket crash left the gateway split-brained: producers stayed active while Discord stopped responding. After this fix the adapter calls _set_fatal_error(retryable=True) + _notify_fatal_error() so the existing GatewayRunner reconnect watcher replaces the dead adapter. Also adds _wait_for_ready_or_bot_exit() so startup failures (SOCKS/proxy errors, invalid tokens) surface fast instead of burning the full ready timeout. Because connect() no longer waits via asyncio.wait_for on that path, test_connect_releases_token_lock_on_timeout is updated to trigger the timeout through the new helper (same lock-release contract). 3 tests pass (2 new runtime-failure tests + the updated timeout test); test_discord_connect.py and test_discord_slash_commands.py green. Co-Authored-By: ameobius <ameobius@local.host> * fix(test): patch _wait_for_ready_or_bot_exit in timeout cancel test connect() no longer uses asyncio.wait_for for the ready handshake, so test_connect_timeout_cancels_bot_task was hanging for 30s in CI. Co-authored-by: Cursor <cursoragent@cursor.com> --------- Co-authored-by: ameobius <ameobius@local.host> Co-authored-by: Cursor <cursoragent@cursor.com>
Salvage of #39416 (AMEOBIUS). Cherry-picked the 2-file fix from a branch that was 1081 commits behind with 28 unrelated commits. Original author was silent for 5+ days after a split-out request.
What this fixes
A post-ready discord.py WebSocket crash (e.g.
ClientOSError,ConnectionResetError) left the gateway split-brained: producer tasks stayed alive while Discord stopped responding. The adapter had no task-exit callback, so no reconnect was ever triggered.Changes
plugins/platforms/discord/adapter.py— adds_handle_bot_task_donecallback wired intoBot.start()'s wrapping task viaadd_done_callback. On any post-ready exit, calls_set_fatal_error(retryable=True)+_notify_fatal_error()so the existingGatewayRunner._handle_adapter_fatal_errorreconnect watcher replaces the dead adapter. Also adds_wait_for_ready_or_bot_exit()so startup failures (SOCKS/proxy errors, invalid tokens) surface immediately instead of burning the full ready timeout.tests/plugins/test_discord_runtime_failure.py— 2 tests: (1) runtime task exit triggers retryable fatal notification, (2) ready event resolves normally when startup succeeds.Verification
Both tests pass locally (2 passed in 0.29s). Hooks (
_set_fatal_error,_notify_fatal_error,_handle_adapter_fatal_error) confirmed present onmain.Closes #39416.