[LLM] Fix NIXL port conflict in prefill-decode disaggregation test #60057
When running prefill-decode disaggregation with NixlConnector and data parallelism, both prefill and decode deployments were using the same port base for their ZMQ side channel. This caused "Address already in use" errors when both deployments had workers on the same node. Fix by setting different NIXL_SIDE_CHANNEL_PORT_BASE values for prefill (40000) and decode (41000) configs to ensure port isolation. Signed-off-by: Seiji Eicher <seiji@anyscale.com>
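The overlap can be illustrated with a small sketch. It assumes each data-parallel rank binds one port at `port_base + rank`; that layout is an assumption made for illustration, not something the PR states, and the helper names `side_channel_ports` and `find_conflicts` are hypothetical.

```python
from __future__ import annotations


def side_channel_ports(port_base: int, dp_size: int) -> set[int]:
    """Ports one deployment would claim, assuming one port per DP rank
    at port_base + rank (an illustrative layout, not confirmed by the PR)."""
    return {port_base + rank for rank in range(dp_size)}


def find_conflicts(prefill_base: int, decode_base: int, dp_size: int) -> set[int]:
    """Ports that both deployments would try to bind on a shared node."""
    prefill_ports = side_channel_ports(prefill_base, dp_size)
    decode_ports = side_channel_ports(decode_base, dp_size)
    return prefill_ports & decode_ports


# Same base for both deployments: every port collides.
print(sorted(find_conflicts(40000, 40000, dp_size=4)))  # [40000, 40001, 40002, 40003]

# Distinct bases, as set in this PR: the ranges are disjoint.
print(sorted(find_conflicts(40000, 41000, dp_size=4)))  # []
```

Under this assumed layout, the 1000-port gap leaves room for far more ranks per node than the test uses.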
Code Review
This pull request correctly fixes a port conflict in the prefill-decode disaggregation test by assigning distinct NIXL_SIDE_CHANNEL_PORT_BASE values for the prefill and decode deployments. This resolves the "Address already in use" error. My review includes a suggestion to replace the hardcoded port numbers with named constants to improve code readability and maintainability.
    },
    experimental_configs={
        "dp_size_per_node": 4,
        "NIXL_SIDE_CHANNEL_PORT_BASE": 40000,  # Prefill port range
To improve readability and maintainability, it's best to avoid using 'magic numbers' like 40000 and 41000 for port configurations. Consider defining these as named constants at the start of the test function or at the module level. This makes their purpose clearer and simplifies future modifications.
For example:
    PREFILL_NIXL_PORT_BASE = 40000
    DECODE_NIXL_PORT_BASE = 41000

Then you can use these constants in the configurations for both prefill and decode.
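A minimal sketch of how the suggestion might look in the test, using plain dictionaries. The helper `make_experimental_configs` is hypothetical, and reusing `dp_size_per_node=4` for the decode config is an assumption; only the prefill value appears in the diff above.

```python
# Module-level constants, following the reviewer's suggestion.
PREFILL_NIXL_PORT_BASE = 40000
DECODE_NIXL_PORT_BASE = 41000


def make_experimental_configs(port_base: int, dp_size_per_node: int = 4) -> dict:
    """Build an experimental_configs dict for one deployment.

    Hypothetical helper: the keys mirror the diff in this PR, but the
    function itself is not part of the test file.
    """
    return {
        "dp_size_per_node": dp_size_per_node,
        "NIXL_SIDE_CHANNEL_PORT_BASE": port_base,
    }


prefill_experimental_configs = make_experimental_configs(PREFILL_NIXL_PORT_BASE)
decode_experimental_configs = make_experimental_configs(DECODE_NIXL_PORT_BASE)

# Cheap guard so a future edit cannot silently reintroduce the conflict.
assert (
    prefill_experimental_configs["NIXL_SIDE_CHANNEL_PORT_BASE"]
    != decode_experimental_configs["NIXL_SIDE_CHANNEL_PORT_BASE"]
)
```

Each dict would then be passed as the `experimental_configs=` argument of the corresponding prefill or decode config.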
@nrghosh this builder could definitely use some TLC but probably not for this PR
…ay-project#60057)

## Summary

When running prefill-decode disaggregation with NixlConnector and data parallelism, both prefill and decode deployments were using the same port base for their ZMQ side channel. This caused "Address already in use" errors when both deployments had workers on the same node:

```
zmq.error.ZMQError: Address already in use (addr='tcp://10.0.75.118:40009')
Exception in thread nixl_handshake_listener
```

## Changes

Fix by setting different `NIXL_SIDE_CHANNEL_PORT_BASE` values for prefill (40000) and decode (41000) configs to ensure port isolation.

## Test plan

- Run `test_llm_serve_prefill_decode_with_data_parallelism` - should complete without timeout
- The test previously hung forever waiting for "READY message from DP Coordinator"

Signed-off-by: Seiji Eicher <seiji@anyscale.com>
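The underlying failure mode can be reproduced without ZMQ or NIXL. The sketch below uses plain TCP sockets from the Python standard library as a stand-in for the side-channel listeners; the function name `second_bind_fails` is hypothetical.

```python
import errno
import socket


def second_bind_fails() -> bool:
    """Return True if binding a second socket to an occupied port
    raises OSError with errno EADDRINUSE."""
    first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        # Port 0 asks the OS for any free port, so the demo never
        # depends on a specific port such as 40009 being available.
        first.bind(("127.0.0.1", 0))
        first.listen(1)
        port = first.getsockname()[1]
        try:
            # Second listener on the same address and port, as when two
            # deployments share a port base on one node.
            second.bind(("127.0.0.1", port))
        except OSError as exc:
            return exc.errno == errno.EADDRINUSE
        return False
    finally:
        first.close()
        second.close()


print(second_bind_fails())
```

ZMQ surfaces the same operating-system error as `zmq.error.ZMQError`, which is the exception shown in the log above.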