[serve] Fix request hanging on upstream ActorDiedError in chained DeploymentResponse#62147

Merged: abrarsheikh merged 7 commits into master from dead-actor-route on Mar 30, 2026

@jeffreywang-anyscale

Description

  • When an upstream deployment dies (e.g., os._exit()) in a model composition chain, the downstream router's retry loop would spin indefinitely on the permanently-failed input, causing the request to hang until the caller's timeout instead of failing fast.
  • The fix propagates the upstream ActorDiedError immediately when the error is from an upstream dependency (actor ID mismatch) and the request was already accepted (callback registered), breaking the infinite retry loop.
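The two conditions in the second bullet can be sketched as a small predicate. This is an illustration only, not the router's actual code: `RoutingAttempt`, `should_fail_fast`, and the stand-in `ActorDiedError` class below are hypothetical names, and the sketch assumes the error exposes the ID of the actor that died.

```python
from dataclasses import dataclass


class ActorDiedError(Exception):
    """Stand-in for Ray's ActorDiedError (assumption: it carries the dead actor's ID)."""

    def __init__(self, actor_id: str):
        super().__init__(f"actor {actor_id} died")
        self.actor_id = actor_id


@dataclass
class RoutingAttempt:
    """Hypothetical record of one attempt to send a request to a replica."""

    replica_actor_id: str   # actor the router sent the request to
    request_accepted: bool  # True once the replica accepted it (callback registered)


def should_fail_fast(attempt: RoutingAttempt, err: ActorDiedError) -> bool:
    """Return True when retrying cannot help and the error should be propagated."""
    # A mismatch means the dead actor is not the replica we routed to,
    # so the failure came from an upstream dependency of the request.
    is_upstream = err.actor_id != attempt.replica_actor_id
    return is_upstream and attempt.request_accepted
```

Under these assumptions, an error naming the routed-to replica itself still goes through the normal retry path, while an accepted request whose input is permanently failed is surfaced to the caller right away.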

Related issues

Fixes #61594 (comment).

…esponse

Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
@jeffreywang-anyscale jeffreywang-anyscale requested a review from a team as a code owner March 27, 2026 23:03
@jeffreywang-anyscale jeffreywang-anyscale added the "go" label (add ONLY when ready to merge, run all tests) Mar 27, 2026

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request improves the handling of ActorDiedError in Ray Serve's router, specifically for chained deployment responses. By distinguishing between failures in the local replica and those in upstream dependencies, the router can now propagate upstream errors immediately, preventing unnecessary retries or hangs. The changes include updates to _handle_actor_died_error and _route_and_send_request_once, along with a new test case to verify fast-failing behavior. A review comment suggested improving the robustness of the new test by replacing an infinite loop with a bounded one and ignoring transient exceptions to avoid flakiness.

Comment thread python/ray/serve/tests/test_handle_2.py Outdated
Comment on lines +443 to +445
with pytest.raises(RayActorError):
    while True:
        handle.remote(0).result(timeout_s=10)

medium

The while True loop combined with a strict pytest.raises(RayActorError) could make this test flaky. If a transient error occurs (e.g., DeploymentUnavailableError while the replica is restarting), the test will fail unexpectedly.

To improve robustness, I recommend using a bounded loop and ignoring other exceptions. This ensures the test specifically checks for the RayActorError from the upstream crash, making it more reliable.

Suggested change
- with pytest.raises(RayActorError):
-     while True:
-         handle.remote(0).result(timeout_s=10)
+ with pytest.raises(RayActorError):
+     # Loop a number of times to ensure we hit the crashing replica.
+     for _ in range(20):
+         try:
+             handle.remote(0).result(timeout_s=10)
+         except RayActorError:
+             raise  # Re-raise to be caught by pytest.raises
+         except Exception:
+             pass  # Ignore other exceptions and retry
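
The bounded-retry pattern suggested above can be tried without a Ray cluster. The following is a minimal sketch using stand-in exception classes; `call_until_expected`, `make_flaky`, `UpstreamCrash`, and `Transient` are hypothetical names introduced only for illustration and are not part of Ray or the test file.

```python
class UpstreamCrash(Exception):
    """Stand-in for the error the test expects (RayActorError in the review)."""


class Transient(Exception):
    """Stand-in for a transient error, such as a replica that is restarting."""


def call_until_expected(call, expected, max_attempts=20):
    """Call `call` at most `max_attempts` times.

    Returns the first exception of type `expected`, or None if it never
    occurs. Any other exception is ignored and the call is retried.
    """
    for _ in range(max_attempts):
        try:
            call()
        except expected as e:
            return e
        except Exception:
            continue  # Ignore unrelated failures and try again
    return None


def make_flaky():
    """Build a callable that fails transiently, succeeds once, then crashes."""
    outcomes = iter([Transient("restarting"), None, UpstreamCrash("upstream died")])

    def call():
        outcome = next(outcomes)
        if outcome is not None:
            raise outcome
        return "ok"

    return call


err = call_until_expected(make_flaky(), UpstreamCrash)
never = call_until_expected(lambda: "ok", UpstreamCrash, max_attempts=3)
```

Here `err` holds the `UpstreamCrash` instance raised on the third call, and `never` is `None` because the expected error was not raised within the attempt budget. The bound is what keeps a test from hanging if the expected error never appears.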

Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
RAY_SERVE_FORCE_LOCAL_TESTING_MODE,
reason="local_testing_mode has different failure handling.",
)
def test_chained_deployment_response_multiple_crashes_then_success(serve_instance):

Note for reviewer: This test is now covered by test_chained_deployment_response_upstream_crash_downstream_stays_in_rotation and the new test test_chained_deployment_response_upstream_crash_fails_fast.

@ray-gardener ray-gardener Bot added the "serve" label (Ray Serve Related Issue) Mar 28, 2026
Comment thread python/ray/serve/_private/router.py Outdated
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
Comment thread python/ray/serve/_private/router.py Outdated
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Comment thread python/ray/serve/_private/router.py Outdated
Signed-off-by: abrar <abrar@anyscale.com>
Signed-off-by: abrar <abrar@anyscale.com>
# Ray's ActorDiedError text embedded (actor id / replica name, etc.).
with pytest.raises(RayServeException) as exc_info:
    while True:
        handle.remote(0).result(timeout_s=10)

sample logs

(ServeReplica:default:Adder pid=3359418) INFO 2026-03-30 00:54:50,498 default_Adder j4y62r6x 261f1a95-8fb2-45e1-af00-e00c4ae9a1dd -- CALL __call__ OK 0.6ms
(raylet) A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. Lease ID: 0200000090fe562c8e59b7fd696797d6412fccaa51316e0668e4a44ac64bb1c2 Worker ID: 71a47fa8332ebbb8c577310c3b4bec58ae10a19eebee78e22feafafe Node ID: 669d9f1276344e8196459302630a51de5f922c49f98932e354e2a0a2 Worker IP address: 172.31.7.228 Worker port: 34729 Worker PID: 3359414 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. Some common causes include: (1) the process was killed by the OOM killer due to high memory usage, (2) ray stop --force was called, or (3) the worker crashed unexpectedly due to SIGSEGV or another unexpected error.

INFO 2026-03-30 00:54:50,529 serve 3358518 -- Deleting app ['default']
(ProxyActor pid=3359413) INFO 2026-03-30 00:54:50,598 proxy 172.31.7.228 -- Got updated endpoints: {}.
(ServeReplica:default:Ingress pid=3359416) WARNING 2026-03-30 00:54:50,521 default_Ingress 9fjej5dq -- Replica(id='ankrf4s5', deployment='Adder', app='default') will not be considered for future requests because it has died.
(ServeReplica:default:Ingress pid=3359416) ERROR 2026-03-30 00:54:50,524 default_Ingress 9fjej5dq 6ec2c7c8-ae53-4f48-b2d7-29e94d7f7aec -- Request failed.
(ServeReplica:default:Ingress pid=3359416) Traceback (most recent call last):
(ServeReplica:default:Ingress pid=3359416)   File "/home/ubuntu/ray/python/ray/serve/_private/replica.py", line 1310, in _handle_errors_and_metrics
(ServeReplica:default:Ingress pid=3359416)     yield _status_code_callback
(ServeReplica:default:Ingress pid=3359416)   File "/home/ubuntu/ray/python/ray/serve/_private/replica.py", line 2084, in _wrap_request
(ServeReplica:default:Ingress pid=3359416)     yield status_code_callback
(ServeReplica:default:Ingress pid=3359416)   File "/home/ubuntu/ray/python/ray/serve/_private/replica.py", line 1640, in handle_request_with_rejection
(ServeReplica:default:Ingress pid=3359416)     raise self._maybe_wrap_grpc_exception(e, request_metadata) from e
(ServeReplica:default:Ingress pid=3359416)   File "/home/ubuntu/ray/python/ray/serve/_private/replica.py", line 1635, in handle_request_with_rejection
(ServeReplica:default:Ingress pid=3359416)     yield await self._user_callable_wrapper.call_user_method(
(ServeReplica:default:Ingress pid=3359416)   File "/home/ubuntu/ray/python/ray/serve/_private/replica.py", line 3657, in call_user_method
(ServeReplica:default:Ingress pid=3359416)     result, _ = await self._call_func_or_gen(
(ServeReplica:default:Ingress pid=3359416)   File "/home/ubuntu/ray/python/ray/serve/_private/replica.py", line 3148, in _call_func_or_gen
(ServeReplica:default:Ingress pid=3359416)     result = await result
(ServeReplica:default:Ingress pid=3359416)   File "/home/ubuntu/ray/python/ray/serve/tests/test_handle_2.py", line 429, in __call__
(ServeReplica:default:Ingress pid=3359416)     return await multiplier_response
(ServeReplica:default:Ingress pid=3359416)   File "/home/ubuntu/ray/python/ray/serve/handle.py", line 488, in __await__
(ServeReplica:default:Ingress pid=3359416)     replica_result = yield from self._fetch_future_result_async().__await__()
(ServeReplica:default:Ingress pid=3359416)   File "/home/ubuntu/ray/python/ray/serve/handle.py", line 352, in _fetch_future_result_async
(ServeReplica:default:Ingress pid=3359416)     self._replica_result = await asyncio.wrap_future(
(ServeReplica:default:Ingress pid=3359416)   File "/home/ubuntu/ray/python/ray/serve/_private/router.py", line 1126, in assign_request
(ServeReplica:default:Ingress pid=3359416)     replica_result = await self.route_and_send_request(
(ServeReplica:default:Ingress pid=3359416)   File "/home/ubuntu/ray/python/ray/serve/_private/router.py", line 1058, in route_and_send_request
(ServeReplica:default:Ingress pid=3359416)     result = await self._route_and_send_request_once(
(ServeReplica:default:Ingress pid=3359416)   File "/home/ubuntu/ray/python/ray/serve/_private/router.py", line 1020, in _route_and_send_request_once
(ServeReplica:default:Ingress pid=3359416)     raise self._make_upstream_crash_error(e)
(ServeReplica:default:Ingress pid=3359416) ray.serve.exceptions.RayServeException: Request to deployment 'Multiplier' failed because an upstream actor died before finishing a dependent task. Ray reported:
(ServeReplica:default:Ingress pid=3359416) ray::ServeReplica:default:Multiplier.handle_request_with_rejection() (pid=3359421, ip=172.31.7.228, actor_id=d8ca27b9cd768bde0edf6f0801000000, repr=<ray.serve._private.replica.ServeReplica:default:Multiplier object at 0x7f2195015300>)
(ServeReplica:default:Ingress pid=3359416)   At least one of the input arguments for this task could not be computed:
(ServeReplica:default:Ingress pid=3359416) ray.exceptions.ActorDiedError: The actor died unexpectedly before finishing this task.
(ServeReplica:default:Ingress pid=3359416) 	class_name: ServeReplica:default:Adder
(ServeReplica:default:Ingress pid=3359416) 	actor_id: 5faf01896ee11cdb7b2100b801000000
(ServeReplica:default:Ingress pid=3359416) 	pid: 3359414
(ServeReplica:default:Ingress pid=3359416) 	name: SERVE_REPLICA::default#Adder#ankrf4s5
(ServeReplica:default:Ingress pid=3359416) 	namespace: serve
(ServeReplica:default:Ingress pid=3359416) 	ip: 172.31.7.228
(ServeReplica:default:Ingress pid=3359416) The actor is dead because its worker process has died. Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. Some common causes include: (1) the process was killed by the OOM killer due to high memory usage, (2) ray stop --force was called, or (3) the worker crashed unexpectedly due to SIGSEGV or another unexpected error.
(ServeReplica:default:Ingress pid=3359416) INFO 2026-03-30 00:54:50,525 default_Ingress 9fjej5dq 6ec2c7c8-ae53-4f48-b2d7-29e94d7f7aec -- CALL __call__ ERROR 24.1ms
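
The traceback above ends in `_make_upstream_crash_error(e)`, whose body is not shown in this conversation. As a rough sketch of how such a wrapper could produce the logged message, the function below builds an exception that names the downstream deployment and embeds Ray's original error text. The signature, the `deployment_name` parameter, and the stand-in `RayServeException` class are assumptions made for illustration.

```python
class RayServeException(Exception):
    """Stand-in for ray.serve.exceptions.RayServeException."""


def make_upstream_crash_error(deployment_name: str, cause: Exception) -> RayServeException:
    """Wrap an upstream actor failure in a Serve-level error (hypothetical helper)."""
    message = (
        f"Request to deployment '{deployment_name}' failed because an upstream "
        f"actor died before finishing a dependent task. Ray reported:\n{cause}"
    )
    error = RayServeException(message)
    # Chain the original exception so its traceback remains available.
    error.__cause__ = cause
    return error


wrapped = make_upstream_crash_error(
    "Multiplier", RuntimeError("The actor died unexpectedly before finishing this task.")
)
```

Embedding the original text in the message is what lets a test assert on details such as the actor ID or replica name, as the comment in the test snippet above indicates.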

@abrarsheikh abrarsheikh merged commit ef6ba17 into master Mar 30, 2026
6 checks passed
@abrarsheikh abrarsheikh deleted the dead-actor-route branch March 30, 2026 21:08
mancfactor pushed a commit to mancfactor/ray that referenced this pull request Apr 2, 2026
…DeploymentResponse` (ray-project#62147)

Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
Signed-off-by: abrar <abrar@anyscale.com>
Co-authored-by: abrar <abrar@anyscale.com>
Signed-off-by: Frank Mancina <fmancina@haproxy.com>
Lucas61000 pushed a commit to Lucas61000/ray that referenced this pull request May 15, 2026
…DeploymentResponse` (ray-project#62147)

Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
Signed-off-by: abrar <abrar@anyscale.com>
Co-authored-by: abrar <abrar@anyscale.com>
Labels

go (add ONLY when ready to merge, run all tests), serve (Ray Serve Related Issue)

Development

Successfully merging this pull request may close these issues.

[Serve] Ray serve replica deadlock when actor dies earlier in a chain of model composition

3 participants