[CORE] Filters bad subscriber messages from taking down GCS publisher#60252
Conversation
Signed-off-by: davik <davik@anyscale.com>
There was a problem hiding this comment.
Code Review
This pull request effectively addresses a critical issue where an invalid channel in a subscription request could take down the GCS publisher. By changing RegisterSubscription to return a Status, the error can be gracefully handled by the GCS and propagated to the subscriber. The subscriber is correctly updated to treat this as a fatal error, which is appropriate for an internal system inconsistency. The changes are well-contained, and the addition of new unit tests ensures the new error-handling logic is robust. Overall, this is a solid improvement for cluster stability.
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Kunchen (David) Dai <54918178+Kunchd@users.noreply.github.com>
…er pubsub Signed-off-by: davik <davik@anyscale.com>
Signed-off-by: davik <davik@anyscale.com>
Signed-off-by: davik <davik@anyscale.com>
| return status; | ||
| } | ||
|
|
||
| RAY_CHECK(sub_message.has_worker_object_eviction_message() || |
There was a problem hiding this comment.
mmmm... My gut is telling me that it'd be sure nice if we had a utility function or something for this. Idea being, if we add more sub_message types in the future it would 'hopefully?' be obvious that we should update the list of message types? idk. Admittedly this is no worse then what we had. So maybe this is ok. What do you think?
There was a problem hiding this comment.
We chatted offline, and concluded this should go into a separate undertaking that defines a distinct proto API per pubsub system.
ZacAttack
left a comment
There was a problem hiding this comment.
Overall this looks good to me. Left one optional comment.
|
@dayshah Could you merge this for me when you get the chance? Thanks |
dayshah
left a comment
There was a problem hiding this comment.
wait what other ways are there for a bad subscriber message to come in other than version mismatch?
Also the status invalid will get returned in the reply, but how does the receiver handle that? I feel like this could be a real invariant in the system and the cluster should die if this happens
Signed-off-by: davik <davik@anyscale.com>
ZacAttack
left a comment
There was a problem hiding this comment.
Ok, aside from some nits, looks good.
Signed-off-by: davik <davik@anyscale.com>
Signed-off-by: davik <davik@anyscale.com>
…#60252) ## Description Issue #58964 reported that it's possible that the GCS publisher may receive an invalid channel causing the cluster to be taken down. From investigation, we suspect this mostly likely sprung up due to a misconfiguration in the local environment of the subscriber (e.g. ray version mismatch). This PR Modify publisher reject and reply to messages with bad arguments. The subscriber will fail upon receiving a bad argument reply as pubsub should be internal and bad arguments indicate potential bugs in the system. However, we do not take down the cluster in this case in case of a local environment misconfiguration. ## Related issues #58964 ## Additional information --------- Signed-off-by: davik <davik@anyscale.com> Signed-off-by: Kunchen (David) Dai <54918178+Kunchd@users.noreply.github.com> Co-authored-by: davik <davik@anyscale.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
…#60252) ## Description Issue #58964 reported that it's possible that the GCS publisher may receive an invalid channel causing the cluster to be taken down. From investigation, we suspect this mostly likely sprung up due to a misconfiguration in the local environment of the subscriber (e.g. ray version mismatch). This PR Modify publisher reject and reply to messages with bad arguments. The subscriber will fail upon receiving a bad argument reply as pubsub should be internal and bad arguments indicate potential bugs in the system. However, we do not take down the cluster in this case in case of a local environment misconfiguration. ## Related issues #58964 ## Additional information --------- Signed-off-by: davik <davik@anyscale.com> Signed-off-by: Kunchen (David) Dai <54918178+Kunchd@users.noreply.github.com> Co-authored-by: davik <davik@anyscale.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
…ray-project#60252) ## Description Issue ray-project#58964 reported that it's possible that the GCS publisher may receive an invalid channel causing the cluster to be taken down. From investigation, we suspect this mostly likely sprung up due to a misconfiguration in the local environment of the subscriber (e.g. ray version mismatch). This PR Modify publisher reject and reply to messages with bad arguments. The subscriber will fail upon receiving a bad argument reply as pubsub should be internal and bad arguments indicate potential bugs in the system. However, we do not take down the cluster in this case in case of a local environment misconfiguration. ## Related issues ray-project#58964 ## Additional information --------- Signed-off-by: davik <davik@anyscale.com> Signed-off-by: Kunchen (David) Dai <54918178+Kunchd@users.noreply.github.com> Co-authored-by: davik <davik@anyscale.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Adel Nour <ans9868@nyu.edu>
Description
Issue #58964 reported that it's possible that the GCS publisher may receive an invalid channel causing the cluster to be taken down. From investigation, we suspect this mostly likely sprung up due to a misconfiguration in the local environment of the subscriber (e.g. ray version mismatch).
This PR Modify publisher reject and reply to messages with bad arguments. The subscriber will fail upon receiving a bad argument reply as pubsub should be internal and bad arguments indicate potential bugs in the system. However, we do not take down the cluster in this case in case of a local environment misconfiguration.
Related issues
#58964
Additional information