[core] Fix "RayEventRecorder::StartExportingEvents() should be called only once."#57917
Conversation
Signed-off-by: Cuong Nguyen <can@anyscale.com>
| RAY_LOG(DEBUG) << "Initiate the metrics client of address:" | ||
| << BuildAddress(address, port); | ||
| grpc_client_ = | ||
| std::make_unique<GrpcClient<ReporterService>>(address, port, client_call_manager); | ||
| retry_timer_ = std::make_unique<boost::asio::steady_timer>(io_service); |
There was a problem hiding this comment.
node: we don't perform retry async anymore; instead, we'll retry during the callback of the connection healthcheck
| @@ -89,7 +89,7 @@ class MetricsAgentClientImpl : public MetricsAgentClient { | |||
| VOID_RPC_CLIENT_METHOD(ReporterService, | |||
| HealthCheck, | |||
| grpc_client_, | |||
| /*method_timeout_ms*/ -1, | |||
| /*method_timeout_ms*/ kMetricAgentInitRetryDelayMs * 1000, | |||
There was a problem hiding this comment.
set a finite amount of timed out so we don't hang indefinitely in retrying
There was a problem hiding this comment.
Pass in directly when you call the PRC, I think it's more direct
There was a problem hiding this comment.
Code Review
This pull request addresses a race condition that could lead to a RAY_CHECK failure in RayEventRecorder::StartExportingEvents(). The fix involves two main changes: making StartExportingEvents aware of the connection status, and synchronizing the connection logic in MetricsAgentClientImpl. The changes are logical and move towards resolving the issue. However, I've identified a remaining race condition in MetricsAgentClientImpl on the exporter_initialized_ flag, which could still lead to the same RAY_CHECK failure under concurrent execution. My review includes a detailed comment and suggestions to make the implementation fully thread-safe.
| if (status.ok()) { | ||
| if (exporter_initialized_) { | ||
| return; | ||
| } | ||
| init_exporter_fn(status); | ||
| exporter_initialized_ = true; | ||
| RAY_LOG(INFO) << "Exporter initialized."; | ||
| return; |
There was a problem hiding this comment.
This block has a race condition on exporter_initialized_. Since the io_service can use multiple threads, two concurrent HealthCheck callbacks could both see exporter_initialized_ as false, leading to init_exporter_fn being called twice. This would re-introduce the RAY_CHECK failure this PR aims to fix.
To solve this, you should protect this critical section. One way is to use a mutex. You would add absl::Mutex exporter_mutex_; to MetricsAgentClientImpl (in the .h file, and include absl/synchronization/mutex.h) and then use it here:
if (status.ok()) {
absl::MutexLock lock(&exporter_mutex_);
if (exporter_initialized_) {
return;
}
init_exporter_fn(status);
exporter_initialized_ = true;
RAY_LOG(INFO) << "Exporter initialized.";
return;
}An alternative is to make exporter_initialized_ an std::atomic<bool> (which would require including <atomic>).
There was a problem hiding this comment.
thanks, io_context is single threaded
| void RayEventRecorder::StartExportingEvents( | ||
| const Status &aggregator_agent_connection_status) { | ||
| absl::MutexLock lock(&mutex_); | ||
| if (!aggregator_agent_connection_status.ok()) { | ||
| RAY_LOG(ERROR) << "Failed to establish connection to the event aggregator agent. " | ||
| << "Events will not be exported. Error: " | ||
| << aggregator_agent_connection_status.ToString(); | ||
| return; | ||
| } |
There was a problem hiding this comment.
this is a surprising pattern -- why not just avoid calling StartExportingEvents if the health check fails?
There was a problem hiding this comment.
I think at some point in the past you suggested (or I might have misunderstood) that we should propagate the error status to these sub-components (e.g., otelrecorder, eventrecorder, etc.) so they can handle errors themselve, which is the current pattern here.
But yes, it’s probably better not to call these sub-components at all.
There was a problem hiding this comment.
ok, let's just NOT call export then
c47bf08 to
8b65f96
Compare
| @@ -296,7 +296,9 @@ void GcsServer::DoStart(const GcsInitData &gcs_init_data) { | |||
| // Init metrics and event exporter. | |||
| metrics_agent_client_->WaitForServerReady([this](const Status &server_status) { | |||
| stats::InitOpenTelemetryExporter(config_.metrics_agent_port, server_status); | |||
There was a problem hiding this comment.
i'll refactor this call to use the same pattern in another follow up (to make this PR minimal)
8b65f96 to
88c170d
Compare
|
build failure @can-anyscale |
88c170d to
a7c6fc2
Compare
… called only once." Signed-off-by: Cuong Nguyen <can@anyscale.com>
a7c6fc2 to
2be8a2b
Compare
|
@edoakes - my bad, fixed now, tested that it builds locally |
| init_exporter_fn, retry_count + 1, max_retry, retry_interval_ms); | ||
| }, | ||
| "MetricsAgentClient.WaitForServerReadyWithRetry", | ||
| retry_interval_ms * 1000); |
There was a problem hiding this comment.
Bug: Incorrect Retry Timing in Metrics Exporter
The io_service_.post() call uses retry_interval_ms * 1000 for its delay parameter. Since retry_interval_ms is in milliseconds, this multiplication likely causes incorrect retry timing, making retries either too fast or too slow depending on the expected unit. This impacts the metrics exporter's initialization.
… only once." (#57917) This PR fixes the Ray check failure RayEventRecorder::StartExportingEvents() should be called only once.. The failure can occur in the following scenario: - The metric_agent_client successfully establishes a connection with the dashboard agent. In this case, RayEventRecorder::StartExportingEvents is correctly invoked to start sending events. - At the same time, the metric_agent_client exceeds its maximum number of connection retries. In this case, RayEventRecorder::StartExportingEvents is invoked again incorrectly, causing duplicate attempts to start exporting events. This PR introduces two fixes: - In metric_agent_client, the connection success and retry logic are now synchronized (previously they ran asynchronously, allowing both paths to trigger). - Do not call StartExportingEvents if the connection cannot be established. Test: - CI --------- Signed-off-by: Cuong Nguyen <can@anyscale.com> Signed-off-by: Kamil Kaczmarek <kamil@anyscale.com>
… only once." (ray-project#57917) This PR fixes the Ray check failure RayEventRecorder::StartExportingEvents() should be called only once.. The failure can occur in the following scenario: - The metric_agent_client successfully establishes a connection with the dashboard agent. In this case, RayEventRecorder::StartExportingEvents is correctly invoked to start sending events. - At the same time, the metric_agent_client exceeds its maximum number of connection retries. In this case, RayEventRecorder::StartExportingEvents is invoked again incorrectly, causing duplicate attempts to start exporting events. This PR introduces two fixes: - In metric_agent_client, the connection success and retry logic are now synchronized (previously they ran asynchronously, allowing both paths to trigger). - Do not call StartExportingEvents if the connection cannot be established. Test: - CI --------- Signed-off-by: Cuong Nguyen <can@anyscale.com> Signed-off-by: xgui <xgui@anyscale.com>
… only once." (#57917) This PR fixes the Ray check failure RayEventRecorder::StartExportingEvents() should be called only once.. The failure can occur in the following scenario: - The metric_agent_client successfully establishes a connection with the dashboard agent. In this case, RayEventRecorder::StartExportingEvents is correctly invoked to start sending events. - At the same time, the metric_agent_client exceeds its maximum number of connection retries. In this case, RayEventRecorder::StartExportingEvents is invoked again incorrectly, causing duplicate attempts to start exporting events. This PR introduces two fixes: - In metric_agent_client, the connection success and retry logic are now synchronized (previously they ran asynchronously, allowing both paths to trigger). - Do not call StartExportingEvents if the connection cannot be established. Test: - CI --------- Signed-off-by: Cuong Nguyen <can@anyscale.com> Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
… only once." (ray-project#57917) This PR fixes the Ray check failure RayEventRecorder::StartExportingEvents() should be called only once.. The failure can occur in the following scenario: - The metric_agent_client successfully establishes a connection with the dashboard agent. In this case, RayEventRecorder::StartExportingEvents is correctly invoked to start sending events. - At the same time, the metric_agent_client exceeds its maximum number of connection retries. In this case, RayEventRecorder::StartExportingEvents is invoked again incorrectly, causing duplicate attempts to start exporting events. This PR introduces two fixes: - In metric_agent_client, the connection success and retry logic are now synchronized (previously they ran asynchronously, allowing both paths to trigger). - Do not call StartExportingEvents if the connection cannot be established. Test: - CI --------- Signed-off-by: Cuong Nguyen <can@anyscale.com>
… only once." (ray-project#57917) This PR fixes the Ray check failure RayEventRecorder::StartExportingEvents() should be called only once.. The failure can occur in the following scenario: - The metric_agent_client successfully establishes a connection with the dashboard agent. In this case, RayEventRecorder::StartExportingEvents is correctly invoked to start sending events. - At the same time, the metric_agent_client exceeds its maximum number of connection retries. In this case, RayEventRecorder::StartExportingEvents is invoked again incorrectly, causing duplicate attempts to start exporting events. This PR introduces two fixes: - In metric_agent_client, the connection success and retry logic are now synchronized (previously they ran asynchronously, allowing both paths to trigger). - Do not call StartExportingEvents if the connection cannot be established. Test: - CI --------- Signed-off-by: Cuong Nguyen <can@anyscale.com> Signed-off-by: Future-Outlier <eric901201@gmail.com>
This PR fixes the Ray check failure RayEventRecorder::StartExportingEvents() should be called only once..
The failure can occur in the following scenario:
This PR introduces two fixes:
Test: