[core] Fix raylet shutdown race(s)#57198
Conversation
Signed-off-by: dayshah <dhyey2019@gmail.com>
There was a problem hiding this comment.
Code Review
This pull request effectively addresses race conditions during raylet shutdown by introducing a unified atomic shutting_down flag. This change simplifies the shutdown logic and prevents potential crashes. The introduction of this flag and its atomic usage in shutdown_raylet_gracefully and HandleShutdownRaylet is a solid improvement. Additionally, the pull request includes numerous valuable cleanups, such as using modern C++ features, optimizing for performance by avoiding copies and reserving vector capacity, and improving code style, which all contribute to better code quality and maintainability. I have one suggestion to improve the linkage of a helper function.
Signed-off-by: dayshah <dhyey2019@gmail.com>
|
@codope can you help review? You touched this recently. |
Signed-off-by: dayshah <dhyey2019@gmail.com>
codope
left a comment
There was a problem hiding this comment.
Just wondering if there's a way to reduce the flag count to reduce confusion and more subtle bugs in future. We can replace the two flags by something like std::once_flag shutdown_once (keep shutting_down) process-wide and use std::call_once(shutdown_once, [&]{ … do internal graceful shutdown … }); to protect shutdown_raylet_gracefully_internal. Another option is to use enum like RayletState and a single std::atomic<RayletState> (drop all boolean flags).
Wdyt?
Signed-off-by: dayshah <dhyey2019@gmail.com>
Signed-off-by: dayshah <dhyey2019@gmail.com>
Signed-off-by: dayshah <dhyey2019@gmail.com>
Ya we need at least 2 since there's 2 phases to shutdown and we want to shortcut to phase 2 in the sigterm case. But ya I like the enum idea, makes it more understandable. Moved to that. |
codope
left a comment
There was a problem hiding this comment.
LGTM, thanks for addressing the comments!
Signed-off-by: dayshah <dhyey2019@gmail.com>
Signed-off-by: dayshah <dhyey2019@gmail.com>
|
@codope updated after talking to @edoakes offline @edoakes up for re-review |
Currently there's two different shutdown flags on the raylet. There's `shutted_down` in main.cc which tracks whether `shutdown_raylet_gracefully_internal` has been executed yet, and then there's `is_shutting_down_` which tracks whether `shutdown_raylet_gracefully_` has been called from NodeManager. 1. `shutted_down` isn't atomically checked + changed inside `shutdown_raylet_gracefully_internal`. So it's possible to have the internal shutdown path happen twice. 2. When the raylet gets sigtermed it calls `shutdown_raylet_gracefully_internal` which only sets `shutted_down` in main.cc and not `is_shutting_down_` in node manager.cc. So we could end up in a case where we send an UnregisterSelf to the GCS and get the publish back that we're dead before. This will result in a RAY_LOG(FATAL) where the raylet will crash itself. See ray-project#56966 for more context. 3. `shutdown_raylet_gracefully_` can also be called from the local resource manager or runtime env agent and we won't set a flag directly in that case. 4. `NodeInfoAccessor::UnregisterSelf` wasn't thread safe. ## Solution The solution is to just have one `shutdown_raylet_gracefully_` and one `shutting_down` flag that's set inside it. It's ok to not post `shutdown_raylet_gracefully_` because I've made `NodeInfoAccessor::UnregisterSelf` thread safe by just killing the the `local_node_id_` + `local_node_info_` state because it was unnecessary and you can just pass node id into `UnregisterSelf`. The unregister callback is where the real shutdown work happens and that's posted onto the main io service so it's ok to not post the graceful shutdown callback. `shutting_down` is now checked in the pubsub NodeRemoved callback so the raylet won't crash and ray_log fatal itself when it gets its own node death publish and shutdown has already started. Also a bit of miscellaneous cpp and gcs client node accessor cleanup while I was there... --------- Signed-off-by: dayshah <dhyey2019@gmail.com> Signed-off-by: xgui <xgui@anyscale.com>
Currently there's two different shutdown flags on the raylet. There's `shutted_down` in main.cc which tracks whether `shutdown_raylet_gracefully_internal` has been executed yet, and then there's `is_shutting_down_` which tracks whether `shutdown_raylet_gracefully_` has been called from NodeManager. 1. `shutted_down` isn't atomically checked + changed inside `shutdown_raylet_gracefully_internal`. So it's possible to have the internal shutdown path happen twice. 2. When the raylet gets sigtermed it calls `shutdown_raylet_gracefully_internal` which only sets `shutted_down` in main.cc and not `is_shutting_down_` in node manager.cc. So we could end up in a case where we send an UnregisterSelf to the GCS and get the publish back that we're dead before. This will result in a RAY_LOG(FATAL) where the raylet will crash itself. See #56966 for more context. 3. `shutdown_raylet_gracefully_` can also be called from the local resource manager or runtime env agent and we won't set a flag directly in that case. 4. `NodeInfoAccessor::UnregisterSelf` wasn't thread safe. ## Solution The solution is to just have one `shutdown_raylet_gracefully_` and one `shutting_down` flag that's set inside it. It's ok to not post `shutdown_raylet_gracefully_` because I've made `NodeInfoAccessor::UnregisterSelf` thread safe by just killing the the `local_node_id_` + `local_node_info_` state because it was unnecessary and you can just pass node id into `UnregisterSelf`. The unregister callback is where the real shutdown work happens and that's posted onto the main io service so it's ok to not post the graceful shutdown callback. `shutting_down` is now checked in the pubsub NodeRemoved callback so the raylet won't crash and ray_log fatal itself when it gets its own node death publish and shutdown has already started. Also a bit of miscellaneous cpp and gcs client node accessor cleanup while I was there... --------- Signed-off-by: dayshah <dhyey2019@gmail.com> Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
Currently there's two different shutdown flags on the raylet. There's `shutted_down` in main.cc which tracks whether `shutdown_raylet_gracefully_internal` has been executed yet, and then there's `is_shutting_down_` which tracks whether `shutdown_raylet_gracefully_` has been called from NodeManager. 1. `shutted_down` isn't atomically checked + changed inside `shutdown_raylet_gracefully_internal`. So it's possible to have the internal shutdown path happen twice. 2. When the raylet gets sigtermed it calls `shutdown_raylet_gracefully_internal` which only sets `shutted_down` in main.cc and not `is_shutting_down_` in node manager.cc. So we could end up in a case where we send an UnregisterSelf to the GCS and get the publish back that we're dead before. This will result in a RAY_LOG(FATAL) where the raylet will crash itself. See ray-project#56966 for more context. 3. `shutdown_raylet_gracefully_` can also be called from the local resource manager or runtime env agent and we won't set a flag directly in that case. 4. `NodeInfoAccessor::UnregisterSelf` wasn't thread safe. ## Solution The solution is to just have one `shutdown_raylet_gracefully_` and one `shutting_down` flag that's set inside it. It's ok to not post `shutdown_raylet_gracefully_` because I've made `NodeInfoAccessor::UnregisterSelf` thread safe by just killing the the `local_node_id_` + `local_node_info_` state because it was unnecessary and you can just pass node id into `UnregisterSelf`. The unregister callback is where the real shutdown work happens and that's posted onto the main io service so it's ok to not post the graceful shutdown callback. `shutting_down` is now checked in the pubsub NodeRemoved callback so the raylet won't crash and ray_log fatal itself when it gets its own node death publish and shutdown has already started. Also a bit of miscellaneous cpp and gcs client node accessor cleanup while I was there... --------- Signed-off-by: dayshah <dhyey2019@gmail.com>
Currently there's two different shutdown flags on the raylet. There's `shutted_down` in main.cc which tracks whether `shutdown_raylet_gracefully_internal` has been executed yet, and then there's `is_shutting_down_` which tracks whether `shutdown_raylet_gracefully_` has been called from NodeManager. 1. `shutted_down` isn't atomically checked + changed inside `shutdown_raylet_gracefully_internal`. So it's possible to have the internal shutdown path happen twice. 2. When the raylet gets sigtermed it calls `shutdown_raylet_gracefully_internal` which only sets `shutted_down` in main.cc and not `is_shutting_down_` in node manager.cc. So we could end up in a case where we send an UnregisterSelf to the GCS and get the publish back that we're dead before. This will result in a RAY_LOG(FATAL) where the raylet will crash itself. See ray-project#56966 for more context. 3. `shutdown_raylet_gracefully_` can also be called from the local resource manager or runtime env agent and we won't set a flag directly in that case. 4. `NodeInfoAccessor::UnregisterSelf` wasn't thread safe. ## Solution The solution is to just have one `shutdown_raylet_gracefully_` and one `shutting_down` flag that's set inside it. It's ok to not post `shutdown_raylet_gracefully_` because I've made `NodeInfoAccessor::UnregisterSelf` thread safe by just killing the the `local_node_id_` + `local_node_info_` state because it was unnecessary and you can just pass node id into `UnregisterSelf`. The unregister callback is where the real shutdown work happens and that's posted onto the main io service so it's ok to not post the graceful shutdown callback. `shutting_down` is now checked in the pubsub NodeRemoved callback so the raylet won't crash and ray_log fatal itself when it gets its own node death publish and shutdown has already started. Also a bit of miscellaneous cpp and gcs client node accessor cleanup while I was there... --------- Signed-off-by: dayshah <dhyey2019@gmail.com> Signed-off-by: Future-Outlier <eric901201@gmail.com>
Problem
Currently there's two different shutdown flags on the raylet. There's
shutted_downin main.cc which tracks whethershutdown_raylet_gracefully_internalhas been executed yet, and then there'sis_shutting_down_which tracks whethershutdown_raylet_gracefully_has been called from NodeManager.shutted_downisn't atomically checked + changed insideshutdown_raylet_gracefully_internal. So it's possible to have the internal shutdown path happen twice.shutdown_raylet_gracefully_internalwhich only setsshutted_downin main.cc and notis_shutting_down_in node manager.cc. So we could end up in a case where we send an UnregisterSelf to the GCS and get the publish back that we're dead before. This will result in a RAY_LOG(FATAL) where the raylet will crash itself. See [core] fix test state api and dashboard flakiness #56966 for more context.shutdown_raylet_gracefully_can also be called from the local resource manager or runtime env agent and we won't set a flag directly in that case.NodeInfoAccessor::UnregisterSelfwasn't thread safe.Solution
The solution is to just have one
shutdown_raylet_gracefully_and oneshutting_downflag that's set inside it. It's ok to not postshutdown_raylet_gracefully_because I've madeNodeInfoAccessor::UnregisterSelfthread safe by just killing the thelocal_node_id_+local_node_info_state because it was unnecessary and you can just pass node id intoUnregisterSelf. The unregister callback is where the real shutdown work happens and that's posted onto the main io service so it's ok to not post the graceful shutdown callback.shutting_downis now checked in the pubsub NodeRemoved callback so the raylet won't crash and ray_log fatal itself when it gets its own node death publish and shutdown has already started.Also a bit of miscellaneous cpp and gcs client node accessor cleanup while I was there...