Skip to content

[core] Fix raylet shutdown race(s)#57198

Merged
edoakes merged 14 commits into
ray-project:masterfrom
dayshah:fix-raylet-shutdown
Oct 16, 2025
Merged

[core] Fix raylet shutdown race(s)#57198
edoakes merged 14 commits into
ray-project:masterfrom
dayshah:fix-raylet-shutdown

Conversation

@dayshah

@dayshah dayshah commented Oct 5, 2025

Copy link
Copy Markdown
Contributor

Problem

Currently there's two different shutdown flags on the raylet. There's shutted_down in main.cc which tracks whether shutdown_raylet_gracefully_internal has been executed yet, and then there's is_shutting_down_ which tracks whether shutdown_raylet_gracefully_ has been called from NodeManager.

  1. shutted_down isn't atomically checked + changed inside shutdown_raylet_gracefully_internal. So it's possible to have the internal shutdown path happen twice.
  2. When the raylet gets sigtermed it calls shutdown_raylet_gracefully_internal which only sets shutted_down in main.cc and not is_shutting_down_ in node manager.cc. So we could end up in a case where we send an UnregisterSelf to the GCS and get the publish back that we're dead before. This will result in a RAY_LOG(FATAL) where the raylet will crash itself. See [core] fix test state api and dashboard flakiness #56966 for more context.
  3. shutdown_raylet_gracefully_ can also be called from the local resource manager or runtime env agent and we won't set a flag directly in that case.
  4. NodeInfoAccessor::UnregisterSelf wasn't thread safe.

Solution

The solution is to just have one shutdown_raylet_gracefully_ and one shutting_down flag that's set inside it. It's ok to not post shutdown_raylet_gracefully_ because I've made NodeInfoAccessor::UnregisterSelf thread safe by just killing the the local_node_id_ + local_node_info_ state because it was unnecessary and you can just pass node id into UnregisterSelf. The unregister callback is where the real shutdown work happens and that's posted onto the main io service so it's ok to not post the graceful shutdown callback.

shutting_down is now checked in the pubsub NodeRemoved callback so the raylet won't crash and ray_log fatal itself when it gets its own node death publish and shutdown has already started.

Also a bit of miscellaneous cpp and gcs client node accessor cleanup while I was there...

Signed-off-by: dayshah <dhyey2019@gmail.com>
Signed-off-by: dayshah <dhyey2019@gmail.com>
@dayshah dayshah requested a review from a team as a code owner October 5, 2025 06:16
@dayshah dayshah added the go add ONLY when ready to merge, run all tests label Oct 5, 2025

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request effectively addresses race conditions during raylet shutdown by introducing a unified atomic shutting_down flag. This change simplifies the shutdown logic and prevents potential crashes. The introduction of this flag and its atomic usage in shutdown_raylet_gracefully and HandleShutdownRaylet is a solid improvement. Additionally, the pull request includes numerous valuable cleanups, such as using modern C++ features, optimizing for performance by avoiding copies and reserving vector capacity, and improving code style, which all contribute to better code quality and maintainability. I have one suggestion to improve the linkage of a helper function.

@ray-gardener ray-gardener Bot added the core Issues that should be addressed in Ray Core label Oct 5, 2025
Signed-off-by: dayshah <dhyey2019@gmail.com>
@edoakes

edoakes commented Oct 5, 2025

Copy link
Copy Markdown
Collaborator

@codope can you help review? You touched this recently.

Signed-off-by: dayshah <dhyey2019@gmail.com>
Signed-off-by: dayshah <dhyey2019@gmail.com>
@codope codope self-assigned this Oct 6, 2025

@codope codope left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just wondering if there's a way to reduce the flag count to reduce confusion and more subtle bugs in future. We can replace the two flags by something like std::once_flag shutdown_once (keep shutting_down) process-wide and use std::call_once(shutdown_once, [&]{ … do internal graceful shutdown … }); to protect shutdown_raylet_gracefully_internal. Another option is to use enum like RayletState and a single std::atomic<RayletState> (drop all boolean flags).

Wdyt?

Comment thread src/ray/raylet/node_manager.cc Outdated
Comment thread src/ray/raylet/main.cc Outdated
Comment thread src/ray/raylet/tests/node_manager_test.cc Outdated
Comment thread src/ray/raylet/tests/node_manager_test.cc Outdated
Signed-off-by: dayshah <dhyey2019@gmail.com>
Signed-off-by: dayshah <dhyey2019@gmail.com>
@dayshah dayshah requested a review from codope October 13, 2025 00:24
Signed-off-by: dayshah <dhyey2019@gmail.com>
Signed-off-by: dayshah <dhyey2019@gmail.com>
Signed-off-by: dayshah <dhyey2019@gmail.com>
cursor[bot]

This comment was marked as outdated.

Signed-off-by: dayshah <dhyey2019@gmail.com>
Signed-off-by: dayshah <dhyey2019@gmail.com>
@dayshah

dayshah commented Oct 13, 2025

Copy link
Copy Markdown
Contributor Author

Just wondering if there's a way to reduce the flag count to reduce confusion and more subtle bugs in future. We can replace the two flags by something like std::once_flag shutdown_once (keep shutting_down) process-wide and use std::call_once(shutdown_once, [&]{ … do internal graceful shutdown … }); to protect shutdown_raylet_gracefully_internal. Another option is to use enum like RayletState and a single std::atomic<RayletState> (drop all boolean flags).

Wdyt?

Ya we need at least 2 since there's 2 phases to shutdown and we want to shortcut to phase 2 in the sigterm case. But ya I like the enum idea, makes it more understandable. Moved to that.

cursor[bot]

This comment was marked as outdated.

@codope codope left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for addressing the comments!

Signed-off-by: dayshah <dhyey2019@gmail.com>
Signed-off-by: dayshah <dhyey2019@gmail.com>
@dayshah

dayshah commented Oct 15, 2025

Copy link
Copy Markdown
Contributor Author

@codope updated after talking to @edoakes offline
There's no real point of having a shutdown internal vs. shutdown because the actual shutdown stuff happens in the unregister callback and the unregister callback will always be posted to the main io service. So moved to just one shutdown function and one flag

@edoakes up for re-review

@edoakes edoakes left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

excellent

@edoakes edoakes merged commit f0d03aa into ray-project:master Oct 16, 2025
6 checks passed
@dayshah dayshah deleted the fix-raylet-shutdown branch October 16, 2025 15:25
xinyuangui2 pushed a commit to xinyuangui2/ray that referenced this pull request Oct 22, 2025
Currently there's two different shutdown flags on the raylet. There's
`shutted_down` in main.cc which tracks whether
`shutdown_raylet_gracefully_internal` has been executed yet, and then
there's `is_shutting_down_` which tracks whether
`shutdown_raylet_gracefully_` has been called from NodeManager.

1. `shutted_down` isn't atomically checked + changed inside
`shutdown_raylet_gracefully_internal`. So it's possible to have the
internal shutdown path happen twice.
2. When the raylet gets sigtermed it calls
`shutdown_raylet_gracefully_internal` which only sets `shutted_down` in
main.cc and not `is_shutting_down_` in node manager.cc. So we could end
up in a case where we send an UnregisterSelf to the GCS and get the
publish back that we're dead before. This will result in a
RAY_LOG(FATAL) where the raylet will crash itself. See
ray-project#56966 for more context.
3. `shutdown_raylet_gracefully_` can also be called from the local
resource manager or runtime env agent and we won't set a flag directly
in that case.
4. `NodeInfoAccessor::UnregisterSelf` wasn't thread safe.

## Solution
The solution is to just have one `shutdown_raylet_gracefully_` and one
`shutting_down` flag that's set inside it. It's ok to not post
`shutdown_raylet_gracefully_` because I've made
`NodeInfoAccessor::UnregisterSelf` thread safe by just killing the the
`local_node_id_` + `local_node_info_` state because it was unnecessary
and you can just pass node id into `UnregisterSelf`. The unregister
callback is where the real shutdown work happens and that's posted onto
the main io service so it's ok to not post the graceful shutdown
callback.

`shutting_down` is now checked in the pubsub NodeRemoved callback so the
raylet won't crash and ray_log fatal itself when it gets its own node
death publish and shutdown has already started.

Also a bit of miscellaneous cpp and gcs client node accessor cleanup
while I was there...

---------

Signed-off-by: dayshah <dhyey2019@gmail.com>
Signed-off-by: xgui <xgui@anyscale.com>
elliot-barn pushed a commit that referenced this pull request Oct 23, 2025
Currently there's two different shutdown flags on the raylet. There's
`shutted_down` in main.cc which tracks whether
`shutdown_raylet_gracefully_internal` has been executed yet, and then
there's `is_shutting_down_` which tracks whether
`shutdown_raylet_gracefully_` has been called from NodeManager.

1. `shutted_down` isn't atomically checked + changed inside
`shutdown_raylet_gracefully_internal`. So it's possible to have the
internal shutdown path happen twice.
2. When the raylet gets sigtermed it calls
`shutdown_raylet_gracefully_internal` which only sets `shutted_down` in
main.cc and not `is_shutting_down_` in node manager.cc. So we could end
up in a case where we send an UnregisterSelf to the GCS and get the
publish back that we're dead before. This will result in a
RAY_LOG(FATAL) where the raylet will crash itself. See
#56966 for more context.
3. `shutdown_raylet_gracefully_` can also be called from the local
resource manager or runtime env agent and we won't set a flag directly
in that case.
4. `NodeInfoAccessor::UnregisterSelf` wasn't thread safe.

## Solution
The solution is to just have one `shutdown_raylet_gracefully_` and one
`shutting_down` flag that's set inside it. It's ok to not post
`shutdown_raylet_gracefully_` because I've made
`NodeInfoAccessor::UnregisterSelf` thread safe by just killing the the
`local_node_id_` + `local_node_info_` state because it was unnecessary
and you can just pass node id into `UnregisterSelf`. The unregister
callback is where the real shutdown work happens and that's posted onto
the main io service so it's ok to not post the graceful shutdown
callback.

`shutting_down` is now checked in the pubsub NodeRemoved callback so the
raylet won't crash and ray_log fatal itself when it gets its own node
death publish and shutdown has already started.

Also a bit of miscellaneous cpp and gcs client node accessor cleanup
while I was there...

---------

Signed-off-by: dayshah <dhyey2019@gmail.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request Nov 17, 2025
Currently there's two different shutdown flags on the raylet. There's
`shutted_down` in main.cc which tracks whether
`shutdown_raylet_gracefully_internal` has been executed yet, and then
there's `is_shutting_down_` which tracks whether
`shutdown_raylet_gracefully_` has been called from NodeManager.

1. `shutted_down` isn't atomically checked + changed inside
`shutdown_raylet_gracefully_internal`. So it's possible to have the
internal shutdown path happen twice.
2. When the raylet gets sigtermed it calls
`shutdown_raylet_gracefully_internal` which only sets `shutted_down` in
main.cc and not `is_shutting_down_` in node manager.cc. So we could end
up in a case where we send an UnregisterSelf to the GCS and get the
publish back that we're dead before. This will result in a
RAY_LOG(FATAL) where the raylet will crash itself. See
ray-project#56966 for more context.
3. `shutdown_raylet_gracefully_` can also be called from the local
resource manager or runtime env agent and we won't set a flag directly
in that case.
4. `NodeInfoAccessor::UnregisterSelf` wasn't thread safe.

## Solution
The solution is to just have one `shutdown_raylet_gracefully_` and one
`shutting_down` flag that's set inside it. It's ok to not post
`shutdown_raylet_gracefully_` because I've made
`NodeInfoAccessor::UnregisterSelf` thread safe by just killing the the
`local_node_id_` + `local_node_info_` state because it was unnecessary
and you can just pass node id into `UnregisterSelf`. The unregister
callback is where the real shutdown work happens and that's posted onto
the main io service so it's ok to not post the graceful shutdown
callback.

`shutting_down` is now checked in the pubsub NodeRemoved callback so the
raylet won't crash and ray_log fatal itself when it gets its own node
death publish and shutdown has already started.

Also a bit of miscellaneous cpp and gcs client node accessor cleanup
while I was there...

---------

Signed-off-by: dayshah <dhyey2019@gmail.com>
Future-Outlier pushed a commit to Future-Outlier/ray that referenced this pull request Dec 7, 2025
Currently there's two different shutdown flags on the raylet. There's
`shutted_down` in main.cc which tracks whether
`shutdown_raylet_gracefully_internal` has been executed yet, and then
there's `is_shutting_down_` which tracks whether
`shutdown_raylet_gracefully_` has been called from NodeManager.

1. `shutted_down` isn't atomically checked + changed inside
`shutdown_raylet_gracefully_internal`. So it's possible to have the
internal shutdown path happen twice.
2. When the raylet gets sigtermed it calls
`shutdown_raylet_gracefully_internal` which only sets `shutted_down` in
main.cc and not `is_shutting_down_` in node manager.cc. So we could end
up in a case where we send an UnregisterSelf to the GCS and get the
publish back that we're dead before. This will result in a
RAY_LOG(FATAL) where the raylet will crash itself. See
ray-project#56966 for more context.
3. `shutdown_raylet_gracefully_` can also be called from the local
resource manager or runtime env agent and we won't set a flag directly
in that case.
4. `NodeInfoAccessor::UnregisterSelf` wasn't thread safe.

## Solution
The solution is to just have one `shutdown_raylet_gracefully_` and one
`shutting_down` flag that's set inside it. It's ok to not post
`shutdown_raylet_gracefully_` because I've made
`NodeInfoAccessor::UnregisterSelf` thread safe by just killing the the
`local_node_id_` + `local_node_info_` state because it was unnecessary
and you can just pass node id into `UnregisterSelf`. The unregister
callback is where the real shutdown work happens and that's posted onto
the main io service so it's ok to not post the graceful shutdown
callback.

`shutting_down` is now checked in the pubsub NodeRemoved callback so the
raylet won't crash and ray_log fatal itself when it gets its own node
death publish and shutdown has already started.

Also a bit of miscellaneous cpp and gcs client node accessor cleanup
while I was there...

---------

Signed-off-by: dayshah <dhyey2019@gmail.com>
Signed-off-by: Future-Outlier <eric901201@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

core Issues that should be addressed in Ray Core go add ONLY when ready to merge, run all tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants