Aquileo | [serve] Fix request hanging on upstream `ActorDiedError` in chained `DeploymentResponse` by jeffreywang-anyscale · Pull Request #62147 · ray-project/ray

jeffreywang-anyscale · 2026-03-27T23:03:49Z

Description

When an upstream deployment dies (e.g., os._exit()) in a model composition chain, the downstream router's retry loop would spin indefinitely on the permanently-failed input, causing the request to hang until the caller's timeout instead of failing fast.
The fix propagates the upstream ActorDiedError immediately when the error is from an upstream dependency (actor ID mismatch) and the request was already accepted (callback registered), breaking the infinite retry loop.
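
The control flow described above can be sketched in plain Python. This is a minimal illustration, not the router's actual code: `FakeActorDiedError`, `UpstreamCrashError`, `Replica`, `is_upstream_failure`, and `route_with_retries` are hypothetical names. The only ideas taken from the description are the actor ID comparison and the "request already accepted" check.

```python
from dataclasses import dataclass


class FakeActorDiedError(Exception):
    """Hypothetical stand-in for an actor-death error; carries the dead actor's ID."""

    def __init__(self, actor_id: str):
        super().__init__(f"actor {actor_id} died")
        self.actor_id = actor_id


class UpstreamCrashError(Exception):
    """Hypothetical error surfaced to the caller when an input can never be computed."""


@dataclass
class Replica:
    actor_id: str


def is_upstream_failure(err: FakeActorDiedError, replica: Replica, accepted: bool) -> bool:
    # Upstream failure: the replica accepted the request, but the dead actor
    # named in the error is some other actor (an upstream dependency).
    return accepted and err.actor_id != replica.actor_id


def route_with_retries(replicas, send_once, max_attempts=5):
    """Retry when the chosen replica itself died; fail fast on upstream death."""
    for attempt in range(max_attempts):
        replica = replicas[attempt % len(replicas)]
        state = {"accepted": False}

        def mark_accepted():
            state["accepted"] = True

        try:
            return send_once(replica, mark_accepted)
        except FakeActorDiedError as err:
            if is_upstream_failure(err, replica, state["accepted"]):
                # Retrying cannot help: the failed input stays failed.
                raise UpstreamCrashError(str(err)) from err
            # Assumption for this sketch: the chosen replica died, so simply
            # try the next one.
            continue
    raise RuntimeError("no replica could serve the request")
```

Without the `is_upstream_failure` branch, every attempt would hit the same dead input and the loop would only end when attempts (or the caller's timeout) ran out.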

Related issues

Fixes #61594 (comment).

Additional information

Optional: Add implementation details, API changes, usage examples, screenshots, etc.

…esponse Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>

gemini-code-assist

Code Review

This pull request improves the handling of ActorDiedError in Ray Serve's router, specifically for chained deployment responses. By distinguishing between failures in the local replica and those in upstream dependencies, the router can now propagate upstream errors immediately, preventing unnecessary retries or hangs. The changes include updates to _handle_actor_died_error and _route_and_send_request_once, along with a new test case to verify fast-failing behavior. A review comment suggested improving the robustness of the new test by replacing an infinite loop with a bounded one and ignoring transient exceptions to avoid flakiness.

gemini-code-assist · 2026-03-27T23:05:53Z

+    with pytest.raises(RayActorError):
+        while True:
+            handle.remote(0).result(timeout_s=10)


The while True loop combined with a strict pytest.raises(RayActorError) could make this test flaky. If a transient error occurs (e.g., DeploymentUnavailableError while the replica is restarting), the test will fail unexpectedly.

To improve robustness, I recommend using a bounded loop and ignoring other exceptions. This ensures the test specifically checks for the RayActorError from the upstream crash, making it more reliable.

Suggested change

-    with pytest.raises(RayActorError):
-        while True:
-            handle.remote(0).result(timeout_s=10)
+    with pytest.raises(RayActorError):
+        # Loop a number of times to ensure we hit the crashing replica.
+        for _ in range(20):
+            try:
+                handle.remote(0).result(timeout_s=10)
+            except RayActorError:
+                raise  # Re-raise to be caught by pytest.raises
+            except Exception:
+                pass  # Ignore other exceptions and retry
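
The bounded-loop pattern in this suggestion can be tried without a Ray cluster. The sketch below is only an illustration under assumptions: `TargetError` and `TransientError` stand in for `RayActorError` and any unrelated temporary failure, and `call_until_raises` is a hypothetical helper, not part of pytest or Ray.

```python
class TargetError(Exception):
    """Stand-in for the error the test expects (RayActorError in the suggestion)."""


class TransientError(Exception):
    """Stand-in for an unrelated, temporary failure."""


def call_until_raises(fn, expected, attempts=20):
    """Call fn up to `attempts` times; propagate `expected`, swallow anything else.

    Returns the number of calls made if `expected` was never raised.
    """
    for _ in range(attempts):
        try:
            fn()
        except expected:
            raise  # The error under test: let the caller see it.
        except Exception:
            pass  # Unrelated failure: ignore and try again.
    return attempts
```

Inside a test, the call would sit in a `with pytest.raises(TargetError):` block. If the loop runs out of attempts without the expected error, `pytest.raises` fails the test instead of letting it spin forever.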

Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>

jeffreywang-anyscale · 2026-03-27T23:20:13Z

-    RAY_SERVE_FORCE_LOCAL_TESTING_MODE,
-    reason="local_testing_mode has different failure handling.",
-)
-def test_chained_deployment_response_multiple_crashes_then_success(serve_instance):


Note for reviewer: This test is now covered by test_chained_deployment_response_upstream_crash_downstream_stays_in_rotation and the new test test_chained_deployment_response_upstream_crash_fails_fast.

Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, have a team admin enable autofix in the Cursor dashboard.

Signed-off-by: abrar <abrar@anyscale.com>

…oute

Signed-off-by: abrar <abrar@anyscale.com>

abrarsheikh · 2026-03-30T00:58:36Z

+    # Ray's ActorDiedError text embedded (actor id / replica name, etc.).
+    with pytest.raises(RayServeException) as exc_info:
+        while True:
+            handle.remote(0).result(timeout_s=10)


sample logs

(ServeReplica:default:Adder pid=3359418) INFO 2026-03-30 00:54:50,498 default_Adder j4y62r6x 261f1a95-8fb2-45e1-af00-e00c4ae9a1dd -- CALL __call__ OK 0.6ms
(raylet) A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. Lease ID: 0200000090fe562c8e59b7fd696797d6412fccaa51316e0668e4a44ac64bb1c2 Worker ID: 71a47fa8332ebbb8c577310c3b4bec58ae10a19eebee78e22feafafe Node ID: 669d9f1276344e8196459302630a51de5f922c49f98932e354e2a0a2 Worker IP address: 172.31.7.228 Worker port: 34729 Worker PID: 3359414 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. Some common causes include: (1) the process was killed by the OOM killer due to high memory usage, (2) ray stop --force was called, or (3) the worker crashed unexpectedly due to SIGSEGV or another unexpected error.
INFO 2026-03-30 00:54:50,529 serve 3358518 -- Deleting app ['default']
(ProxyActor pid=3359413) INFO 2026-03-30 00:54:50,598 proxy 172.31.7.228 -- Got updated endpoints: {}.
(ServeReplica:default:Ingress pid=3359416) WARNING 2026-03-30 00:54:50,521 default_Ingress 9fjej5dq -- Replica(id='ankrf4s5', deployment='Adder', app='default') will not be considered for future requests because it has died.
(ServeReplica:default:Ingress pid=3359416) ERROR 2026-03-30 00:54:50,524 default_Ingress 9fjej5dq 6ec2c7c8-ae53-4f48-b2d7-29e94d7f7aec -- Request failed.
(ServeReplica:default:Ingress pid=3359416) Traceback (most recent call last):
(ServeReplica:default:Ingress pid=3359416)   File "/home/ubuntu/ray/python/ray/serve/_private/replica.py", line 1310, in _handle_errors_and_metrics
(ServeReplica:default:Ingress pid=3359416)     yield _status_code_callback
(ServeReplica:default:Ingress pid=3359416)   File "/home/ubuntu/ray/python/ray/serve/_private/replica.py", line 2084, in _wrap_request
(ServeReplica:default:Ingress pid=3359416)     yield status_code_callback
(ServeReplica:default:Ingress pid=3359416)   File "/home/ubuntu/ray/python/ray/serve/_private/replica.py", line 1640, in handle_request_with_rejection
(ServeReplica:default:Ingress pid=3359416)     raise self._maybe_wrap_grpc_exception(e, request_metadata) from e
(ServeReplica:default:Ingress pid=3359416)   File "/home/ubuntu/ray/python/ray/serve/_private/replica.py", line 1635, in handle_request_with_rejection
(ServeReplica:default:Ingress pid=3359416)     yield await self._user_callable_wrapper.call_user_method(
(ServeReplica:default:Ingress pid=3359416)   File "/home/ubuntu/ray/python/ray/serve/_private/replica.py", line 3657, in call_user_method
(ServeReplica:default:Ingress pid=3359416)     result, _ = await self._call_func_or_gen(
(ServeReplica:default:Ingress pid=3359416)   File "/home/ubuntu/ray/python/ray/serve/_private/replica.py", line 3148, in _call_func_or_gen
(ServeReplica:default:Ingress pid=3359416)     result = await result
(ServeReplica:default:Ingress pid=3359416)   File "/home/ubuntu/ray/python/ray/serve/tests/test_handle_2.py", line 429, in __call__
(ServeReplica:default:Ingress pid=3359416)     return await multiplier_response
(ServeReplica:default:Ingress pid=3359416)   File "/home/ubuntu/ray/python/ray/serve/handle.py", line 488, in __await__
(ServeReplica:default:Ingress pid=3359416)     replica_result = yield from self._fetch_future_result_async().__await__()
(ServeReplica:default:Ingress pid=3359416)   File "/home/ubuntu/ray/python/ray/serve/handle.py", line 352, in _fetch_future_result_async
(ServeReplica:default:Ingress pid=3359416)     self._replica_result = await asyncio.wrap_future(
(ServeReplica:default:Ingress pid=3359416)   File "/home/ubuntu/ray/python/ray/serve/_private/router.py", line 1126, in assign_request
(ServeReplica:default:Ingress pid=3359416)     replica_result = await self.route_and_send_request(
(ServeReplica:default:Ingress pid=3359416)   File "/home/ubuntu/ray/python/ray/serve/_private/router.py", line 1058, in route_and_send_request
(ServeReplica:default:Ingress pid=3359416)     result = await self._route_and_send_request_once(
(ServeReplica:default:Ingress pid=3359416)   File "/home/ubuntu/ray/python/ray/serve/_private/router.py", line 1020, in _route_and_send_request_once
(ServeReplica:default:Ingress pid=3359416)     raise self._make_upstream_crash_error(e)
(ServeReplica:default:Ingress pid=3359416) ray.serve.exceptions.RayServeException: Request to deployment 'Multiplier' failed because an upstream actor died before finishing a dependent task. Ray reported:
(ServeReplica:default:Ingress pid=3359416) ray::ServeReplica:default:Multiplier.handle_request_with_rejection() (pid=3359421, ip=172.31.7.228, actor_id=d8ca27b9cd768bde0edf6f0801000000, repr=<ray.serve._private.replica.ServeReplica:default:Multiplier object at 0x7f2195015300>)
(ServeReplica:default:Ingress pid=3359416) At least one of the input arguments for this task could not be computed:
(ServeReplica:default:Ingress pid=3359416) ray.exceptions.ActorDiedError: The actor died unexpectedly before finishing this task.
(ServeReplica:default:Ingress pid=3359416) class_name: ServeReplica:default:Adder
(ServeReplica:default:Ingress pid=3359416) actor_id: 5faf01896ee11cdb7b2100b801000000
(ServeReplica:default:Ingress pid=3359416) pid: 3359414
(ServeReplica:default:Ingress pid=3359416) name: SERVE_REPLICA::default#Adder#ankrf4s5
(ServeReplica:default:Ingress pid=3359416) namespace: serve
(ServeReplica:default:Ingress pid=3359416) ip: 172.31.7.228
(ServeReplica:default:Ingress pid=3359416) The actor is dead because its worker process has died. Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. Some common causes include: (1) the process was killed by the OOM killer due to high memory usage, (2) ray stop --force was called, or (3) the worker crashed unexpectedly due to SIGSEGV or another unexpected error.
(ServeReplica:default:Ingress pid=3359416) INFO 2026-03-30 00:54:50,525 default_Ingress 9fjej5dq 6ec2c7c8-ae53-4f48-b2d7-29e94d7f7aec -- CALL __call__ ERROR 24.1ms
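
The traceback shows the router raising a `RayServeException` built by `_make_upstream_crash_error(e)`, with Ray's original error text embedded in the message. A rough sketch of that wrapping pattern follows. The classes and function bodies here are hypothetical stand-ins written for illustration (only the message wording is copied from the log); they are not the code in `router.py`.

```python
class ServeError(Exception):
    """Hypothetical stand-in for a public, user-facing exception type."""


class InternalActorDied(Exception):
    """Hypothetical stand-in for the lower-level actor-death error."""


def make_upstream_crash_error(deployment_name: str, cause: Exception) -> ServeError:
    # Embed the original error text so callers still see which actor died,
    # without having to import or catch a non-public exception type.
    return ServeError(
        f"Request to deployment '{deployment_name}' failed because an upstream "
        f"actor died before finishing a dependent task. Ray reported:\n{cause}"
    )


def send(deployment_name: str):
    try:
        # Simulate the dependency failure reported by the lower layer.
        raise InternalActorDied(
            "The actor died unexpectedly before finishing this task."
        )
    except InternalActorDied as e:
        # Raising inside the except block keeps the original error attached
        # to the new one as its __context__.
        raise make_upstream_crash_error(deployment_name, e)
```

A caller then needs only one `except ServeError` clause, and the original failure remains available through the message text and the exception context.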

…DeploymentResponse` (ray-project#62147)

## Description
- When an upstream deployment dies (e.g., `os._exit()`) in a model composition chain, the downstream router's retry loop would spin indefinitely on the permanently-failed input, causing the request to hang until the caller's timeout instead of failing fast.
- The fix propagates the upstream `ActorDiedError` immediately when the error is from an upstream dependency (actor ID mismatch) and the request was already accepted (callback registered), breaking the infinite retry loop.

## Related issues
Fixes ray-project#61594 (comment).

## Additional information
> Optional: Add implementation details, API changes, usage examples, screenshots, etc.

---------

Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
Signed-off-by: abrar <abrar@anyscale.com>
Co-authored-by: abrar <abrar@anyscale.com>
Signed-off-by: Frank Mancina <fmancina@haproxy.com>

Fix request hanging on upstream ActorDiedError in chained DeploymentR…

31b6781

…esponse Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>

jeffreywang-anyscale requested a review from abrarsheikh March 27, 2026 23:03

jeffreywang-anyscale requested a review from a team as a code owner March 27, 2026 23:03

jeffreywang-anyscale added the "go" label (add ONLY when ready to merge, run all tests) Mar 27, 2026

gemini-code-assist Bot reviewed Mar 27, 2026

Drop test_chained_deployment_response_multiple_crashes_then_success

497a2d0

Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>

jeffreywang-anyscale commented Mar 27, 2026

ray-gardener Bot added the "serve" label (Ray Serve Related Issue) Mar 28, 2026

abrarsheikh reviewed Mar 28, 2026

Comment thread python/ray/serve/_private/router.py Outdated

Don't expose non-public exceptions

0fd7aef

Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>

jeffreywang-anyscale mentioned this pull request Mar 28, 2026

[Serve] Ray serve replica deadlock when actor dies earlier in a chain of model composition #61594

Closed

abrarsheikh reviewed Mar 28, 2026

Comment thread python/ray/serve/_private/router.py Outdated

Use RayServeException

d262dea

Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>

cursor Bot reviewed Mar 29, 2026

Comment thread python/ray/serve/_private/router.py Outdated

abrarsheikh added 3 commits March 30, 2026 00:52

improve

a95da81

Signed-off-by: abrar <abrar@anyscale.com>

Merge branch 'master' of github.com:ray-project/ray into dead-actor-r…

d16867a

…oute

update test

93d3117

Signed-off-by: abrar <abrar@anyscale.com>

abrarsheikh reviewed Mar 30, 2026

abrarsheikh approved these changes Mar 30, 2026

akyang-anyscale approved these changes Mar 30, 2026

abrarsheikh merged commit ef6ba17 into master Mar 30, 2026
6 checks passed

abrarsheikh deleted the dead-actor-route branch March 30, 2026 21:08

abrarsheikh merged 7 commits into master from dead-actor-route

3 participants
