
[LLM] Fix NIXL port conflict in prefill-decode disaggregation test #60057

Merged: aslonnie merged 1 commit into ray-project:master from eicherseiji:fix-nixl-port-conflict on Jan 13, 2026

@eicherseiji

Contributor

Summary

When running prefill-decode disaggregation with NixlConnector and data parallelism, both prefill and decode deployments were using the same port base for their ZMQ side channel. This caused "Address already in use" errors when both deployments had workers on the same node:

zmq.error.ZMQError: Address already in use (addr='tcp://10.0.75.118:40009')
Exception in thread nixl_handshake_listener
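The failure mode itself can be reproduced without vLLM, NIXL, or ZMQ. The following is a minimal sketch using only the standard library: two listeners on one host try to bind the same TCP port, and the second bind is rejected. The helper names (`try_bind`, `demo_conflict`) are hypothetical, and the stdlib raises `OSError` where pyzmq raises `ZMQError`, so this illustrates the underlying condition rather than the connector's actual code path.

```python
import socket


def try_bind(host: str, port: int) -> socket.socket:
    """Bind and listen on (host, port); raises OSError if the port is taken."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
    except OSError:
        sock.close()
        raise
    sock.listen(1)
    return sock


def demo_conflict() -> bool:
    """Return True if a second listener on the same port is rejected."""
    # Port 0 asks the OS for any free port, so the demo does not depend
    # on a specific port number being available.
    first = try_bind("127.0.0.1", 0)
    port = first.getsockname()[1]
    try:
        second = try_bind("127.0.0.1", port)
    except OSError:
        # Same condition as "Address already in use" in the trace above.
        return True
    else:
        second.close()
        return False
    finally:
        first.close()
```

Calling `demo_conflict()` stands in for a prefill worker and a decode worker landing on the same node with the same derived port.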

Changes

Fix by setting different NIXL_SIDE_CHANNEL_PORT_BASE values for prefill (40000) and decode (41000) configs to ensure port isolation.
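One way to reason about whether two bases are far enough apart is to treat each deployment as owning a contiguous range of ports starting at its base. This is a sketch under an assumption: the PR does not state how many ports each deployment consumes above its base, so `width` below is a hypothetical parameter, and `port_range` and `ranges_overlap` are illustrative helpers, not part of Ray or vLLM.

```python
def port_range(base: int, width: int) -> range:
    """Ports a deployment is assumed to use: base, base + 1, ..., base + width - 1."""
    return range(base, base + width)


def ranges_overlap(a: range, b: range) -> bool:
    """True if two contiguous (step 1) port ranges share at least one port."""
    return a.start < b.stop and b.start < a.stop


# Before the fix: both deployments start from the same base, so they collide.
before = ranges_overlap(port_range(40000, 16), port_range(40000, 16))

# After the fix: bases 40000 and 41000 stay disjoint for any width up to 1000.
after = ranges_overlap(port_range(40000, 1000), port_range(41000, 1000))
```

Under this model, the 1000-port gap is safe as long as neither deployment needs more than 1000 ports above its base on a single node.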

Test plan

  • Run test_llm_serve_prefill_decode_with_data_parallelism - should complete without timeout
  • The test previously hung forever waiting for "READY message from DP Coordinator"

Signed-off-by: Seiji Eicher <seiji@anyscale.com>

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request correctly fixes a port conflict in the prefill-decode disaggregation test by assigning distinct NIXL_SIDE_CHANNEL_PORT_BASE values for the prefill and decode deployments. This resolves the "Address already in use" error. My review includes a suggestion to replace the hardcoded port numbers with named constants to improve code readability and maintainability.

    },
    experimental_configs={
        "dp_size_per_node": 4,
        "NIXL_SIDE_CHANNEL_PORT_BASE": 40000,  # Prefill port range

Contributor

medium

To improve readability and maintainability, it's best to avoid using 'magic numbers' like 40000 and 41000 for port configurations. Consider defining these as named constants at the start of the test function or at the module level. This makes their purpose clearer and simplifies future modifications.

For example:

PREFILL_NIXL_PORT_BASE = 40000
DECODE_NIXL_PORT_BASE = 41000

Then you can use these constants in the configurations for both prefill and decode.
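A sketch of how the suggestion could look, using plain dictionaries so it runs on its own. The `make_experimental_configs` helper is hypothetical, and giving decode the same `dp_size_per_node` as prefill is an assumption; only the two keys and the two port values come from this PR.

```python
# Named constants replace the bare 40000 / 41000 literals.
PREFILL_NIXL_PORT_BASE = 40000
DECODE_NIXL_PORT_BASE = 41000


def make_experimental_configs(port_base: int, dp_size_per_node: int = 4) -> dict:
    """Build the experimental_configs mapping for one deployment (hypothetical helper)."""
    return {
        "dp_size_per_node": dp_size_per_node,
        "NIXL_SIDE_CHANNEL_PORT_BASE": port_base,
    }


prefill_experimental_configs = make_experimental_configs(PREFILL_NIXL_PORT_BASE)
decode_experimental_configs = make_experimental_configs(DECODE_NIXL_PORT_BASE)
```

With the values named once at module level, a later change to either base touches a single line, and the two configs cannot silently drift back to the same value.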

@nrghosh nrghosh left a comment

Contributor

Related: #57771 and #55802.

Does this need fixing in the main prefill-decode path as well, or is it just a test issue?

@eicherseiji added the "go" label (add ONLY when ready to merge, run all tests) on Jan 12, 2026
@ray-gardener (bot) added the "serve" (Ray Serve Related Issue) and "llm" labels on Jan 13, 2026
@eicherseiji

Contributor Author

@nrghosh this builder could definitely use some TLC but probably not for this PR

@aslonnie aslonnie merged commit f467a01 into ray-project:master Jan 13, 2026
6 checks passed
rushikeshadhav pushed a commit to rushikeshadhav/ray that referenced this pull request Jan 14, 2026
…ay-project#60057)
