Skip to content

[Core] Improve Dashborad SubprocessModuleHandle.destroy_module() Resource Cleanup and Process Termination#60172

Merged
edoakes merged 5 commits into
ray-project:masterfrom
daiping8:spawn
Jan 21, 2026
Merged

[Core] Improve Dashborad SubprocessModuleHandle.destroy_module() Resource Cleanup and Process Termination#60172
edoakes merged 5 commits into
ray-project:masterfrom
daiping8:spawn

Conversation

@daiping8

@daiping8 daiping8 commented Jan 15, 2026

Copy link
Copy Markdown
Contributor

Description

Summary

This PR improves the destroy_module() method in SubprocessModuleHandle to fix race conditions, implement graceful process termination, and ensure complete resource cleanup. The changes prevent resource leaks and improve the reliability of module restart and shutdown.

Key Changes

  1. Cancel health check task first to prevent race conditions
  2. Ordered resource cleanup with clear dependencies:
    • Cancel health check task first
    • Close parent connection
    • Terminate process gracefully, then forcefully if needed
    • Close HTTP client session
  3. Graceful process termination:
    • First attempts graceful termination with terminate() and 5-second timeout
    • Falls back to force kill (kill()) only if necessary
    • All join() calls have timeouts to prevent infinite blocking
  4. Error handling: Try-except blocks ensure cleanup continues even if one step fails

Modified Files

  1. python/ray/dashboard/subprocesses/handle.py

    • Refactored destroy_module() with ordered resource cleanup
    • Implemented graceful process termination with timeout protection
    • Added smart health check task cancellation logic
      • Smart Detection Logic:
        current_task = asyncio.current_task(loop=self.loop)
        if current_task is None or self.health_check_task is not current_task:
            self.health_check_task.cancel()  # Only cancel if not current task
        This ensures:
        • External calls: Immediately cancel health check task to prevent interference
        • Internal calls: Don't cancel current task, allowing cleanup and restart to complete
    • Added comprehensive error handling and logging
  2. python/ray/dashboard/subprocesses/tests/test_e2e.py

    • Added test_destroy_module_cleans_up_resources() to verify complete resource cleanup
    • Added mock classes (_DummyConn, _DummyProcess, _DummySession) for isolated testing
    • Added cleanup logic in start_http_server_app() to prevent resource leaks between tests

Related issues

Closes #60214

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request refactors the dashboard subprocess handling to improve graceful shutdown and resource cleanup. The changes introduce more robust process termination, explicit resource management for HTTP sessions, and attempts to handle signals for graceful exit. While the direction is excellent, I've identified a critical issue in the new signal handling logic that would cause a RuntimeError, and a high-severity race condition in the resource cleanup order. I've provided detailed comments and suggestions to address these issues, which should make the shutdown process more reliable.

Comment thread python/ray/dashboard/subprocesses/module.py Outdated
Comment thread python/ray/dashboard/subprocesses/handle.py Outdated
Change-Id: Ia045cdf34ce8c473da05e114b76e07106c7bc4e9
Signed-off-by: daiping8 <dai.ping88@zte.com.cn>
@daiping8 daiping8 changed the title refactor(dashboard) [Core] Improve Dashborad SubprocessModuleHandle.destroy_module() Resource Cleanup and Process Termination Jan 16, 2026
@daiping8 daiping8 marked this pull request as ready for review January 17, 2026 06:32
@daiping8 daiping8 requested a review from a team as a code owner January 17, 2026 06:32
cursor[bot]

This comment was marked as outdated.

@ray-gardener ray-gardener Bot added core Issues that should be addressed in Ray Core community-contribution Contributed by the community labels Jan 17, 2026
Change-Id: I564b0b41292666b9da8c39e7fe494fd40ffe3d1a
Signed-off-by: daiping8 <dai.ping88@zte.com.cn>
@edoakes edoakes added the go add ONLY when ready to merge, run all tests label Jan 20, 2026
@edoakes

edoakes commented Jan 20, 2026

Copy link
Copy Markdown
Collaborator

@sampan-s-nayak PTAL

@sampan-s-nayak sampan-s-nayak self-assigned this Jan 21, 2026

@sampan-s-nayak sampan-s-nayak left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this (and for the detailed pr description). LGTM

Comment thread python/ray/dashboard/subprocesses/handle.py Outdated
sampan-s-nayak and others added 2 commits January 21, 2026 10:25
Introduced a new constant for the graceful shutdown timeout of the subprocess module, allowing for better control over process termination. Updated the subprocess handling to utilize this new timeout value.

Change-Id: I4e49b28766c7c63956e6b9db1cf6b33cebc34210
Signed-off-by: daiping8 <dai.ping88@zte.com.cn>
cursor[bot]

This comment was marked as outdated.

Change-Id: I0ec1002cc87837e91006cdacb31a724b02eaa5c0
Signed-off-by: daiping8 <dai.ping88@zte.com.cn>

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Comment thread python/ray/dashboard/subprocesses/tests/test_e2e.py
@edoakes edoakes merged commit eaeb585 into ray-project:master Jan 21, 2026
6 checks passed
ZacAttack added a commit to ZacAttack/ray that referenced this pull request Jan 22, 2026
…e() Resource Cleanup and Process Termination (ray-project#60172)"

This reverts commit eaeb585.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

community-contribution Contributed by the community core Issues that should be addressed in Ray Core go add ONLY when ready to merge, run all tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Core] Improve Dashborad SubprocessModuleHandle.destroy_module() Resource Cleanup and Process Termination

3 participants