Skip to content

[CORE] Filters bad subscriber messages from taking down GCS publisher#60252

Merged
edoakes merged 10 commits into
ray-project:masterfrom
Kunchd:publisher_filters_torpedos
Jan 30, 2026
Merged

[CORE] Filters bad subscriber messages from taking down GCS publisher#60252
edoakes merged 10 commits into
ray-project:masterfrom
Kunchd:publisher_filters_torpedos

Conversation

@Kunchd

@Kunchd Kunchd commented Jan 17, 2026

Copy link
Copy Markdown
Contributor

Description

Issue #58964 reported that it's possible that the GCS publisher may receive an invalid channel causing the cluster to be taken down. From investigation, we suspect this mostly likely sprung up due to a misconfiguration in the local environment of the subscriber (e.g. ray version mismatch).

This PR Modify publisher reject and reply to messages with bad arguments. The subscriber will fail upon receiving a bad argument reply as pubsub should be internal and bad arguments indicate potential bugs in the system. However, we do not take down the cluster in this case in case of a local environment misconfiguration.

Related issues

#58964

Additional information

@Kunchd Kunchd requested a review from a team as a code owner January 17, 2026 00:25

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request effectively addresses a critical issue where an invalid channel in a subscription request could take down the GCS publisher. By changing RegisterSubscription to return a Status, the error can be gracefully handled by the GCS and propagated to the subscriber. The subscriber is correctly updated to treat this as a fatal error, which is appropriate for an internal system inconsistency. The changes are well-contained, and the addition of new unit tests ensures the new error-handling logic is robust. Overall, this is a solid improvement for cluster stability.

Comment thread src/ray/pubsub/publisher.cc Outdated
@Kunchd Kunchd added the go add ONLY when ready to merge, run all tests label Jan 17, 2026
@ray-gardener ray-gardener Bot added the core Issues that should be addressed in Ray Core label Jan 17, 2026
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Kunchen (David) Dai <54918178+Kunchd@users.noreply.github.com>
@Kunchd Kunchd changed the title [CORE] Filters torpedos from taking down GCS publisher [CORE] Filters bad subscriber messages from taking down GCS publisher Jan 20, 2026
Comment thread src/ray/core_worker/core_worker.cc Outdated
Comment thread src/ray/core_worker/core_worker.cc Outdated
Comment thread src/ray/pubsub/publisher.cc Outdated
Signed-off-by: davik <davik@anyscale.com>
@Kunchd Kunchd requested a review from ZacAttack January 26, 2026 18:19
Comment thread src/ray/core_worker/core_worker.cc Outdated
Comment thread src/ray/core_worker/core_worker.cc Outdated
Comment thread src/ray/gcs/tests/BUILD.bazel
Comment thread src/ray/core_worker/core_worker.h
Signed-off-by: davik <davik@anyscale.com>
@Kunchd Kunchd requested a review from ZacAttack January 26, 2026 21:08
Comment thread src/ray/core_worker/core_worker.cc Outdated
return status;
}

RAY_CHECK(sub_message.has_worker_object_eviction_message() ||

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mmmm... My gut is telling me that it'd be sure nice if we had a utility function or something for this. Idea being, if we add more sub_message types in the future it would 'hopefully?' be obvious that we should update the list of message types? idk. Admittedly this is no worse then what we had. So maybe this is ok. What do you think?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We chatted offline, and concluded this should go into a separate undertaking that defines a distinct proto API per pubsub system.

#60510

@ZacAttack ZacAttack left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall this looks good to me. Left one optional comment.

@Kunchd

Kunchd commented Jan 27, 2026

Copy link
Copy Markdown
Contributor Author

@dayshah Could you merge this for me when you get the chance? Thanks

@dayshah dayshah left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wait what other ways are there for a bad subscriber message to come in other than version mismatch?

Also the status invalid will get returned in the reply, but how does the receiver handle that? I feel like this could be a real invariant in the system and the cluster should die if this happens

Comment thread src/ray/pubsub/publisher_interface.h Outdated
Comment thread src/ray/pubsub/publisher.cc Outdated
Comment thread src/ray/pubsub/subscriber.cc Outdated
Comment thread src/ray/pubsub/tests/publisher_test.cc Outdated
Comment thread src/ray/pubsub/tests/publisher_test.cc Outdated
Comment thread src/ray/pubsub/tests/publisher_test.cc Outdated
Comment thread src/ray/pubsub/tests/pubsub_integration_test.cc Outdated
Comment thread src/ray/pubsub/tests/subscriber_test.cc Outdated
Comment thread src/ray/gcs/pubsub_handler.cc Outdated
Comment thread src/ray/gcs/pubsub_handler.cc Outdated

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Comment thread src/ray/core_worker/core_worker.cc
Comment thread src/ray/pubsub/subscriber.cc Outdated
Comment thread src/ray/pubsub/tests/pubsub_integration_test.cc Outdated
Comment thread src/ray/gcs/pubsub_handler.cc Outdated

@ZacAttack ZacAttack left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, aside from some nits, looks good.

@edoakes edoakes enabled auto-merge (squash) January 29, 2026 23:15
Signed-off-by: davik <davik@anyscale.com>
@github-actions github-actions Bot disabled auto-merge January 29, 2026 23:20
@edoakes edoakes enabled auto-merge (squash) January 29, 2026 23:22
Signed-off-by: davik <davik@anyscale.com>
@github-actions github-actions Bot disabled auto-merge January 30, 2026 07:20
@edoakes edoakes merged commit a95469d into ray-project:master Jan 30, 2026
6 checks passed
elliot-barn pushed a commit that referenced this pull request Feb 9, 2026
…#60252)

## Description
Issue #58964 reported that it's
possible that the GCS publisher may receive an invalid channel causing
the cluster to be taken down. From investigation, we suspect this mostly
likely sprung up due to a misconfiguration in the local environment of
the subscriber (e.g. ray version mismatch).


This PR Modify publisher reject and reply to messages with bad
arguments. The subscriber will fail upon receiving a bad argument reply
as pubsub should be internal and bad arguments indicate potential bugs
in the system. However, we do not take down the cluster in this case in
case of a local environment misconfiguration.

## Related issues
#58964

## Additional information

---------

Signed-off-by: davik <davik@anyscale.com>
Signed-off-by: Kunchen (David) Dai <54918178+Kunchd@users.noreply.github.com>
Co-authored-by: davik <davik@anyscale.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
elliot-barn pushed a commit that referenced this pull request Feb 9, 2026
…#60252)

## Description
Issue #58964 reported that it's
possible that the GCS publisher may receive an invalid channel causing
the cluster to be taken down. From investigation, we suspect this mostly
likely sprung up due to a misconfiguration in the local environment of
the subscriber (e.g. ray version mismatch).


This PR Modify publisher reject and reply to messages with bad
arguments. The subscriber will fail upon receiving a bad argument reply
as pubsub should be internal and bad arguments indicate potential bugs
in the system. However, we do not take down the cluster in this case in
case of a local environment misconfiguration.

## Related issues
#58964

## Additional information

---------

Signed-off-by: davik <davik@anyscale.com>
Signed-off-by: Kunchen (David) Dai <54918178+Kunchd@users.noreply.github.com>
Co-authored-by: davik <davik@anyscale.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
ans9868 pushed a commit to ans9868/ray that referenced this pull request Feb 18, 2026
…ray-project#60252)

## Description
Issue ray-project#58964 reported that it's
possible that the GCS publisher may receive an invalid channel causing
the cluster to be taken down. From investigation, we suspect this mostly
likely sprung up due to a misconfiguration in the local environment of
the subscriber (e.g. ray version mismatch).

This PR Modify publisher reject and reply to messages with bad
arguments. The subscriber will fail upon receiving a bad argument reply
as pubsub should be internal and bad arguments indicate potential bugs
in the system. However, we do not take down the cluster in this case in
case of a local environment misconfiguration.

## Related issues
ray-project#58964

## Additional information

---------

Signed-off-by: davik <davik@anyscale.com>
Signed-off-by: Kunchen (David) Dai <54918178+Kunchd@users.noreply.github.com>
Co-authored-by: davik <davik@anyscale.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Adel Nour <ans9868@nyu.edu>
@Kunchd Kunchd deleted the publisher_filters_torpedos branch June 5, 2026 18:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

core Issues that should be addressed in Ray Core go add ONLY when ready to merge, run all tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants