[serve] Fix request hanging on upstream ActorDiedError in chained DeploymentResponse#62147
…esponse Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
Code Review
This pull request improves the handling of ActorDiedError in Ray Serve's router, specifically for chained deployment responses. By distinguishing between failures in the local replica and those in upstream dependencies, the router can now propagate upstream errors immediately, preventing unnecessary retries or hangs. The changes include updates to _handle_actor_died_error and _route_and_send_request_once, along with a new test case to verify fast-failing behavior. A review comment suggested improving the robustness of the new test by replacing an infinite loop with a bounded one and ignoring transient exceptions to avoid flakiness.
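The local-versus-upstream distinction described above can be sketched in plain Python. This is a minimal illustration, not the code in `_handle_actor_died_error`: the names `Action`, `ActorDeath`, and `classify_actor_death` are hypothetical, and the fallback branch (retry in every other case) is an assumption. Only the two conditions, an actor ID mismatch and an already-accepted request, come from the PR description.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    RETRY_ON_ANOTHER_REPLICA = "retry"
    PROPAGATE_TO_CALLER = "propagate"


@dataclass(frozen=True)
class ActorDeath:
    """Hypothetical stand-in for the actor ID carried by an ActorDiedError."""

    dead_actor_id: str


def classify_actor_death(
    error: ActorDeath, target_replica_actor_id: str, request_accepted: bool
) -> Action:
    # Actor ID mismatch: the actor that died is not the replica we routed to,
    # so the failure comes from an upstream dependency of the request.
    died_upstream = error.dead_actor_id != target_replica_actor_id
    if died_upstream and request_accepted:
        return Action.PROPAGATE_TO_CALLER
    # Assumption: every other case keeps the retry path.
    return Action.RETRY_ON_ANOTHER_REPLICA


# Actor IDs taken from the sample logs in this thread.
ADDER_ACTOR_ID = "5faf01896ee11cdb7b2100b801000000"  # upstream actor that died
MULTIPLIER_ACTOR_ID = "d8ca27b9cd768bde0edf6f0801000000"  # replica that got the request

upstream_case = classify_actor_death(
    ActorDeath(ADDER_ACTOR_ID), MULTIPLIER_ACTOR_ID, request_accepted=True
)
local_case = classify_actor_death(
    ActorDeath(MULTIPLIER_ACTOR_ID), MULTIPLIER_ACTOR_ID, request_accepted=True
)
```

Here `upstream_case` is the propagate action and `local_case` is the retry action, which mirrors the Adder-died, Multiplier-alive situation in the logs.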
    with pytest.raises(RayActorError):
        while True:
            handle.remote(0).result(timeout_s=10)
The while True loop combined with a strict pytest.raises(RayActorError) could make this test flaky. If a transient error occurs (e.g., DeploymentUnavailableError while the replica is restarting), the test will fail unexpectedly.
To improve robustness, I recommend using a bounded loop and ignoring other exceptions. This ensures the test specifically checks for the RayActorError from the upstream crash, making it more reliable.
Suggested change:

    - with pytest.raises(RayActorError):
    -     while True:
    -         handle.remote(0).result(timeout_s=10)
    + with pytest.raises(RayActorError):
    +     # Loop a number of times to ensure we hit the crashing replica.
    +     for _ in range(20):
    +         try:
    +             handle.remote(0).result(timeout_s=10)
    +         except RayActorError:
    +             raise  # Re-raise to be caught by pytest.raises
    +         except Exception:
    +             pass  # Ignore other exceptions and retry
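The bounded-loop pattern in this suggestion can be tried without a Ray cluster. The sketch below uses stand-in exception classes and a fake call in place of `handle.remote(0).result(...)`; `UpstreamActorDied`, `TransientUnavailable`, `call_until_target_error`, and `fake_call` are all hypothetical names introduced for illustration.

```python
class UpstreamActorDied(Exception):
    """Stand-in for RayActorError in this sketch."""


class TransientUnavailable(Exception):
    """Stand-in for a transient error raised while a replica restarts."""


def call_until_target_error(call, target_exc, max_attempts=20):
    """Invoke `call` up to `max_attempts` times.

    Returns the first `target_exc` instance raised, ignoring successful
    calls and any other exception. Returns None if it never appears.
    """
    for _ in range(max_attempts):
        try:
            call()
        except target_exc as exc:
            return exc
        except Exception:
            continue  # Transient failure: try again.
    return None


# Scripted outcomes: one transient error, one success, then the crash.
outcomes = iter([TransientUnavailable(), None, UpstreamActorDied("Adder died")])


def fake_call():
    outcome = next(outcomes)
    if outcome is not None:
        raise outcome
    return 0


err = call_until_target_error(fake_call, UpstreamActorDied)
```

The `except target_exc` clause has to come before `except Exception`, otherwise the broad handler would swallow the error the test is waiting for. A `None` return plays the role of `pytest.raises` failing because the loop ended without the expected error.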
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
        RAY_SERVE_FORCE_LOCAL_TESTING_MODE,
        reason="local_testing_mode has different failure handling.",
    )
    def test_chained_deployment_response_multiple_crashes_then_success(serve_instance):
Note for reviewer: This test is now covered by test_chained_deployment_response_upstream_crash_downstream_stays_in_rotation and the new test test_chained_deployment_response_upstream_crash_fails_fast.
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Signed-off-by: abrar <abrar@anyscale.com>
    # Ray's ActorDiedError text embedded (actor id / replica name, etc.).
    with pytest.raises(RayServeException) as exc_info:
        while True:
            handle.remote(0).result(timeout_s=10)
Sample logs:
(ServeReplica:default:Adder pid=3359418) INFO 2026-03-30 00:54:50,498 default_Adder j4y62r6x 261f1a95-8fb2-45e1-af00-e00c4ae9a1dd -- CALL __call__ OK 0.6ms
(raylet) A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. Lease ID: 0200000090fe562c8e59b7fd696797d6412fccaa51316e0668e4a44ac64bb1c2 Worker ID: 71a47fa8332ebbb8c577310c3b4bec58ae10a19eebee78e22feafafe Node ID: 669d9f1276344e8196459302630a51de5f922c49f98932e354e2a0a2 Worker IP address: 172.31.7.228 Worker port: 34729 Worker PID: 3359414 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. Some common causes include: (1) the process was killed by the OOM killer due to high memory usage, (2) ray stop --force was called, or (3) the worker crashed unexpectedly due to SIGSEGV or another unexpected error.
INFO 2026-03-30 00:54:50,529 serve 3358518 -- Deleting app ['default']
(ProxyActor pid=3359413) INFO 2026-03-30 00:54:50,598 proxy 172.31.7.228 -- Got updated endpoints: {}.
(ServeReplica:default:Ingress pid=3359416) WARNING 2026-03-30 00:54:50,521 default_Ingress 9fjej5dq -- Replica(id='ankrf4s5', deployment='Adder', app='default') will not be considered for future requests because it has died.
(ServeReplica:default:Ingress pid=3359416) ERROR 2026-03-30 00:54:50,524 default_Ingress 9fjej5dq 6ec2c7c8-ae53-4f48-b2d7-29e94d7f7aec -- Request failed.
(ServeReplica:default:Ingress pid=3359416) Traceback (most recent call last):
(ServeReplica:default:Ingress pid=3359416) File "/home/ubuntu/ray/python/ray/serve/_private/replica.py", line 1310, in _handle_errors_and_metrics
(ServeReplica:default:Ingress pid=3359416) yield _status_code_callback
(ServeReplica:default:Ingress pid=3359416) File "/home/ubuntu/ray/python/ray/serve/_private/replica.py", line 2084, in _wrap_request
(ServeReplica:default:Ingress pid=3359416) yield status_code_callback
(ServeReplica:default:Ingress pid=3359416) File "/home/ubuntu/ray/python/ray/serve/_private/replica.py", line 1640, in handle_request_with_rejection
(ServeReplica:default:Ingress pid=3359416) raise self._maybe_wrap_grpc_exception(e, request_metadata) from e
(ServeReplica:default:Ingress pid=3359416) File "/home/ubuntu/ray/python/ray/serve/_private/replica.py", line 1635, in handle_request_with_rejection
(ServeReplica:default:Ingress pid=3359416) yield await self._user_callable_wrapper.call_user_method(
(ServeReplica:default:Ingress pid=3359416) File "/home/ubuntu/ray/python/ray/serve/_private/replica.py", line 3657, in call_user_method
(ServeReplica:default:Ingress pid=3359416) result, _ = await self._call_func_or_gen(
(ServeReplica:default:Ingress pid=3359416) File "/home/ubuntu/ray/python/ray/serve/_private/replica.py", line 3148, in _call_func_or_gen
(ServeReplica:default:Ingress pid=3359416) result = await result
(ServeReplica:default:Ingress pid=3359416) File "/home/ubuntu/ray/python/ray/serve/tests/test_handle_2.py", line 429, in __call__
(ServeReplica:default:Ingress pid=3359416) return await multiplier_response
(ServeReplica:default:Ingress pid=3359416) File "/home/ubuntu/ray/python/ray/serve/handle.py", line 488, in __await__
(ServeReplica:default:Ingress pid=3359416) replica_result = yield from self._fetch_future_result_async().__await__()
(ServeReplica:default:Ingress pid=3359416) File "/home/ubuntu/ray/python/ray/serve/handle.py", line 352, in _fetch_future_result_async
(ServeReplica:default:Ingress pid=3359416) self._replica_result = await asyncio.wrap_future(
(ServeReplica:default:Ingress pid=3359416) File "/home/ubuntu/ray/python/ray/serve/_private/router.py", line 1126, in assign_request
(ServeReplica:default:Ingress pid=3359416) replica_result = await self.route_and_send_request(
(ServeReplica:default:Ingress pid=3359416) File "/home/ubuntu/ray/python/ray/serve/_private/router.py", line 1058, in route_and_send_request
(ServeReplica:default:Ingress pid=3359416) result = await self._route_and_send_request_once(
(ServeReplica:default:Ingress pid=3359416) File "/home/ubuntu/ray/python/ray/serve/_private/router.py", line 1020, in _route_and_send_request_once
(ServeReplica:default:Ingress pid=3359416) raise self._make_upstream_crash_error(e)
(ServeReplica:default:Ingress pid=3359416) ray.serve.exceptions.RayServeException: Request to deployment 'Multiplier' failed because an upstream actor died before finishing a dependent task. Ray reported:
(ServeReplica:default:Ingress pid=3359416) ray::ServeReplica:default:Multiplier.handle_request_with_rejection() (pid=3359421, ip=172.31.7.228, actor_id=d8ca27b9cd768bde0edf6f0801000000, repr=<ray.serve._private.replica.ServeReplica:default:Multiplier object at 0x7f2195015300>)
(ServeReplica:default:Ingress pid=3359416) At least one of the input arguments for this task could not be computed:
(ServeReplica:default:Ingress pid=3359416) ray.exceptions.ActorDiedError: The actor died unexpectedly before finishing this task.
(ServeReplica:default:Ingress pid=3359416) class_name: ServeReplica:default:Adder
(ServeReplica:default:Ingress pid=3359416) actor_id: 5faf01896ee11cdb7b2100b801000000
(ServeReplica:default:Ingress pid=3359416) pid: 3359414
(ServeReplica:default:Ingress pid=3359416) name: SERVE_REPLICA::default#Adder#ankrf4s5
(ServeReplica:default:Ingress pid=3359416) namespace: serve
(ServeReplica:default:Ingress pid=3359416) ip: 172.31.7.228
(ServeReplica:default:Ingress pid=3359416) The actor is dead because its worker process has died. Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. Some common causes include: (1) the process was killed by the OOM killer due to high memory usage, (2) ray stop --force was called, or (3) the worker crashed unexpectedly due to SIGSEGV or another unexpected error.
(ServeReplica:default:Ingress pid=3359416) INFO 2026-03-30 00:54:50,525 default_Ingress 9fjej5dq 6ec2c7c8-ae53-4f48-b2d7-29e94d7f7aec -- CALL __call__ ERROR 24.1ms

…DeploymentResponse` (ray-project#62147)

## Description
- When an upstream deployment dies (e.g., `os._exit()`) in a model composition chain, the downstream router's retry loop would spin indefinitely on the permanently-failed input, causing the request to hang until the caller's timeout instead of failing fast.
- The fix propagates the upstream `ActorDiedError` immediately when the error is from an upstream dependency (actor ID mismatch) and the request was already accepted (callback registered), breaking the infinite retry loop.

## Related issues
Fixes ray-project#61594 (comment).

## Additional information
> Optional: Add implementation details, API changes, usage examples, screenshots, etc.

---------

Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
Signed-off-by: abrar <abrar@anyscale.com>
Co-authored-by: abrar <abrar@anyscale.com>
Signed-off-by: Frank Mancina <fmancina@haproxy.com>

Description
- When an upstream deployment dies (e.g., `os._exit()`) in a model composition chain, the downstream router's retry loop would spin indefinitely on the permanently-failed input, causing the request to hang until the caller's timeout instead of failing fast.
- The fix propagates the upstream `ActorDiedError` immediately when the error is from an upstream dependency (actor ID mismatch) and the request was already accepted (callback registered), breaking the infinite retry loop.

Related issues
Fixes #61594 (comment).

Additional information
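As a rough illustration of the hang versus fail-fast difference described in this PR, the following toy loop counts how many send attempts occur before the caller sees an outcome. It is a simulation under stated assumptions, not Ray Serve code: `InputPermanentlyFailed`, `make_sender`, `route`, and the `attempt_budget` cap (standing in for the caller's timeout, since the retry loop described here has no bound of its own) are all hypothetical.

```python
class InputPermanentlyFailed(Exception):
    """Stand-in for an ActorDiedError caused by a dead upstream dependency."""


def make_sender():
    """Return a send function that always fails, plus its call counter."""
    calls = {"n": 0}

    def send_once():
        calls["n"] += 1
        # The upstream result can never be computed, so every attempt fails.
        raise InputPermanentlyFailed("upstream replica died")

    return send_once, calls


def route(send_once, propagate_upstream_errors, attempt_budget=50):
    for _ in range(attempt_budget):
        try:
            return send_once()
        except InputPermanentlyFailed:
            if propagate_upstream_errors:
                raise  # Fail fast: surface the error to the caller.
            # Otherwise retry, although the input will never recover.
    raise TimeoutError("attempt budget exhausted (stands in for caller timeout)")


# Behaviour before the fix: retries until the budget runs out.
old_send, old_calls = make_sender()
try:
    route(old_send, propagate_upstream_errors=False)
except TimeoutError:
    old_outcome = "timed out"

# Behaviour after the fix: the first failure is propagated.
new_send, new_calls = make_sender()
try:
    route(new_send, propagate_upstream_errors=True)
except InputPermanentlyFailed:
    new_outcome = "failed fast"
```

In this toy model the retrying path makes 50 attempts and ends in a timeout, while the propagating path makes one attempt and surfaces the upstream error directly.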