Skip to content

[core] Fix "RayEventRecorder::StartExportingEvents() should be called only once."#57917

Merged
can-anyscale merged 2 commits into
masterfrom
can-1ev07
Oct 20, 2025
Merged

[core] Fix "RayEventRecorder::StartExportingEvents() should be called only once."#57917
can-anyscale merged 2 commits into
masterfrom
can-1ev07

Conversation

@can-anyscale

@can-anyscale can-anyscale commented Oct 20, 2025

Copy link
Copy Markdown
Contributor

This PR fixes the Ray check failure RayEventRecorder::StartExportingEvents() should be called only once..
The failure can occur in the following scenario:

  • The metric_agent_client successfully establishes a connection with the dashboard agent. In this case, RayEventRecorder::StartExportingEvents is correctly invoked to start sending events.
  • At the same time, the metric_agent_client exceeds its maximum number of connection retries. In this case, RayEventRecorder::StartExportingEvents is invoked again incorrectly, causing duplicate attempts to start exporting events.

This PR introduces two fixes:

  • In metric_agent_client, the connection success and retry logic are now synchronized (previously they ran asynchronously, allowing both paths to trigger).
  • Do not call StartExportingEvents if the connection cannot be established.

Test:

  • CI

Signed-off-by: Cuong Nguyen <can@anyscale.com>
@can-anyscale can-anyscale requested a review from a team as a code owner October 20, 2025 17:20
RAY_LOG(DEBUG) << "Initiate the metrics client of address:"
<< BuildAddress(address, port);
grpc_client_ =
std::make_unique<GrpcClient<ReporterService>>(address, port, client_call_manager);
retry_timer_ = std::make_unique<boost::asio::steady_timer>(io_service);

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

node: we don't perform retry async anymore; instead, we'll retry during the callback of the connection healthcheck

Comment thread src/ray/rpc/metrics_agent_client.h Outdated
@@ -89,7 +89,7 @@ class MetricsAgentClientImpl : public MetricsAgentClient {
VOID_RPC_CLIENT_METHOD(ReporterService,
HealthCheck,
grpc_client_,
/*method_timeout_ms*/ -1,
/*method_timeout_ms*/ kMetricAgentInitRetryDelayMs * 1000,

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

set a finite amount of timed out so we don't hang indefinitely in retrying

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pass in directly when you call the PRC, I think it's more direct

@can-anyscale can-anyscale added the go add ONLY when ready to merge, run all tests label Oct 20, 2025
cursor[bot]

This comment was marked as outdated.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request addresses a race condition that could lead to a RAY_CHECK failure in RayEventRecorder::StartExportingEvents(). The fix involves two main changes: making StartExportingEvents aware of the connection status, and synchronizing the connection logic in MetricsAgentClientImpl. The changes are logical and move towards resolving the issue. However, I've identified a remaining race condition in MetricsAgentClientImpl on the exporter_initialized_ flag, which could still lead to the same RAY_CHECK failure under concurrent execution. My review includes a detailed comment and suggestions to make the implementation fully thread-safe.

Comment on lines +48 to +55
if (status.ok()) {
if (exporter_initialized_) {
return;
}
init_exporter_fn(status);
exporter_initialized_ = true;
RAY_LOG(INFO) << "Exporter initialized.";
return;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

This block has a race condition on exporter_initialized_. Since the io_service can use multiple threads, two concurrent HealthCheck callbacks could both see exporter_initialized_ as false, leading to init_exporter_fn being called twice. This would re-introduce the RAY_CHECK failure this PR aims to fix.

To solve this, you should protect this critical section. One way is to use a mutex. You would add absl::Mutex exporter_mutex_; to MetricsAgentClientImpl (in the .h file, and include absl/synchronization/mutex.h) and then use it here:

if (status.ok()) {
  absl::MutexLock lock(&exporter_mutex_);
  if (exporter_initialized_) {
    return;
  }
  init_exporter_fn(status);
  exporter_initialized_ = true;
  RAY_LOG(INFO) << "Exporter initialized.";
  return;
}

An alternative is to make exporter_initialized_ an std::atomic<bool> (which would require including <atomic>).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, io_context is single threaded

Comment on lines +37 to +45
void RayEventRecorder::StartExportingEvents(
const Status &aggregator_agent_connection_status) {
absl::MutexLock lock(&mutex_);
if (!aggregator_agent_connection_status.ok()) {
RAY_LOG(ERROR) << "Failed to establish connection to the event aggregator agent. "
<< "Events will not be exported. Error: "
<< aggregator_agent_connection_status.ToString();
return;
}

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a surprising pattern -- why not just avoid calling StartExportingEvents if the health check fails?

@can-anyscale can-anyscale Oct 20, 2025

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think at some point in the past you suggested (or I might have misunderstood) that we should propagate the error status to these sub-components (e.g., otelrecorder, eventrecorder, etc.) so they can handle errors themselve, which is the current pattern here.

But yes, it’s probably better not to call these sub-components at all.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, let's just NOT call export then

@can-anyscale can-anyscale force-pushed the can-1ev07 branch 3 times, most recently from c47bf08 to 8b65f96 Compare October 20, 2025 18:15
Comment thread src/ray/gcs/gcs_server.cc
@@ -296,7 +296,9 @@ void GcsServer::DoStart(const GcsInitData &gcs_init_data) {
// Init metrics and event exporter.
metrics_agent_client_->WaitForServerReady([this](const Status &server_status) {
stats::InitOpenTelemetryExporter(config_.metrics_agent_port, server_status);

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'll refactor this call to use the same pattern in another follow up (to make this PR minimal)

@can-anyscale can-anyscale requested review from edoakes and jjyao October 20, 2025 18:16
@can-anyscale

Copy link
Copy Markdown
Contributor Author

@edoakes , @jjyao for re-review, thankks

cursor[bot]

This comment was marked as outdated.

@ray-gardener ray-gardener Bot added the core Issues that should be addressed in Ray Core label Oct 20, 2025
@edoakes

edoakes commented Oct 20, 2025

Copy link
Copy Markdown
Collaborator

build failure @can-anyscale

… called only once."

Signed-off-by: Cuong Nguyen <can@anyscale.com>
@can-anyscale

Copy link
Copy Markdown
Contributor Author

@edoakes - my bad, fixed now, tested that it builds locally

init_exporter_fn, retry_count + 1, max_retry, retry_interval_ms);
},
"MetricsAgentClient.WaitForServerReadyWithRetry",
retry_interval_ms * 1000);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: Incorrect Retry Timing in Metrics Exporter

The io_service_.post() call uses retry_interval_ms * 1000 for its delay parameter. Since retry_interval_ms is in milliseconds, this multiplication likely causes incorrect retry timing, making retries either too fast or too slow depending on the expected unit. This impacts the metrics exporter's initialization.

Fix in Cursor Fix in Web

@can-anyscale can-anyscale enabled auto-merge (squash) October 20, 2025 20:13
@can-anyscale can-anyscale merged commit 299eb1b into master Oct 20, 2025
7 checks passed
@can-anyscale can-anyscale deleted the can-1ev07 branch October 20, 2025 21:27
kamil-kaczmarek pushed a commit that referenced this pull request Oct 20, 2025
… only once." (#57917)

This PR fixes the Ray check failure
RayEventRecorder::StartExportingEvents() should be called only once..
The failure can occur in the following scenario:
- The metric_agent_client successfully establishes a connection with the
dashboard agent. In this case, RayEventRecorder::StartExportingEvents is
correctly invoked to start sending events.
- At the same time, the metric_agent_client exceeds its maximum number
of connection retries. In this case,
RayEventRecorder::StartExportingEvents is invoked again incorrectly,
causing duplicate attempts to start exporting events.

This PR introduces two fixes:
- In metric_agent_client, the connection success and retry logic are now
synchronized (previously they ran asynchronously, allowing both paths to
trigger).
- Do not call StartExportingEvents if the connection cannot be
established.

Test:
- CI

---------

Signed-off-by: Cuong Nguyen <can@anyscale.com>
Signed-off-by: Kamil Kaczmarek <kamil@anyscale.com>
xinyuangui2 pushed a commit to xinyuangui2/ray that referenced this pull request Oct 22, 2025
… only once." (ray-project#57917)

This PR fixes the Ray check failure
RayEventRecorder::StartExportingEvents() should be called only once..
The failure can occur in the following scenario:
- The metric_agent_client successfully establishes a connection with the
dashboard agent. In this case, RayEventRecorder::StartExportingEvents is
correctly invoked to start sending events.
- At the same time, the metric_agent_client exceeds its maximum number
of connection retries. In this case,
RayEventRecorder::StartExportingEvents is invoked again incorrectly,
causing duplicate attempts to start exporting events.

This PR introduces two fixes:
- In metric_agent_client, the connection success and retry logic are now
synchronized (previously they ran asynchronously, allowing both paths to
trigger).
- Do not call StartExportingEvents if the connection cannot be
established.

Test:
- CI

---------

Signed-off-by: Cuong Nguyen <can@anyscale.com>
Signed-off-by: xgui <xgui@anyscale.com>
elliot-barn pushed a commit that referenced this pull request Oct 23, 2025
… only once." (#57917)

This PR fixes the Ray check failure
RayEventRecorder::StartExportingEvents() should be called only once..
The failure can occur in the following scenario:
- The metric_agent_client successfully establishes a connection with the
dashboard agent. In this case, RayEventRecorder::StartExportingEvents is
correctly invoked to start sending events.
- At the same time, the metric_agent_client exceeds its maximum number
of connection retries. In this case,
RayEventRecorder::StartExportingEvents is invoked again incorrectly,
causing duplicate attempts to start exporting events.

This PR introduces two fixes:
- In metric_agent_client, the connection success and retry logic are now
synchronized (previously they ran asynchronously, allowing both paths to
trigger).
- Do not call StartExportingEvents if the connection cannot be
established.

Test:
- CI

---------

Signed-off-by: Cuong Nguyen <can@anyscale.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request Nov 17, 2025
… only once." (ray-project#57917)

This PR fixes the Ray check failure
RayEventRecorder::StartExportingEvents() should be called only once..
The failure can occur in the following scenario:
- The metric_agent_client successfully establishes a connection with the
dashboard agent. In this case, RayEventRecorder::StartExportingEvents is
correctly invoked to start sending events.
- At the same time, the metric_agent_client exceeds its maximum number
of connection retries. In this case,
RayEventRecorder::StartExportingEvents is invoked again incorrectly,
causing duplicate attempts to start exporting events.

This PR introduces two fixes:
- In metric_agent_client, the connection success and retry logic are now
synchronized (previously they ran asynchronously, allowing both paths to
trigger).
- Do not call StartExportingEvents if the connection cannot be
established.

Test:
- CI

---------

Signed-off-by: Cuong Nguyen <can@anyscale.com>
Future-Outlier pushed a commit to Future-Outlier/ray that referenced this pull request Dec 7, 2025
… only once." (ray-project#57917)

This PR fixes the Ray check failure
RayEventRecorder::StartExportingEvents() should be called only once..
The failure can occur in the following scenario:
- The metric_agent_client successfully establishes a connection with the
dashboard agent. In this case, RayEventRecorder::StartExportingEvents is
correctly invoked to start sending events.
- At the same time, the metric_agent_client exceeds its maximum number
of connection retries. In this case,
RayEventRecorder::StartExportingEvents is invoked again incorrectly,
causing duplicate attempts to start exporting events.

This PR introduces two fixes:
- In metric_agent_client, the connection success and retry logic are now
synchronized (previously they ran asynchronously, allowing both paths to
trigger).
- Do not call StartExportingEvents if the connection cannot be
established.

Test:
- CI

---------

Signed-off-by: Cuong Nguyen <can@anyscale.com>
Signed-off-by: Future-Outlier <eric901201@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

core Issues that should be addressed in Ray Core go add ONLY when ready to merge, run all tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants