Fix test_api.py::test_max_constructor_retry_count failing for windows #57541

Merged
zcin merged 55 commits into ray-project:master from landscapepainter:fix-windows-flakiness-test
Oct 10, 2025

Conversation

@landscapepainter landscapepainter commented Oct 8, 2025

Contributor

test_api.py::test_max_constructor_retry_count was failing on Windows.

I tried extending the timeout on wait_on_condition at the end of the test to 20s - 40s and added a debug statement to check how far the counter increments. The final value varies from run to run, but I only observed 9-12; it never reached 13.
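
For reference, the kind of debug check described here can be sketched roughly as below. This is only a minimal, self-contained illustration and not the code in the test: `wait_and_report` and the stand-in counter are hypothetical names, and no Ray helper is used. It polls a counter until a target is reached or a timeout expires, and reports the highest value it saw so that a failed wait still shows how far the counter got.

```python
import time
from typing import Callable, Tuple


def wait_and_report(
    get_count: Callable[[], int],
    target: int,
    timeout_s: float,
    interval_s: float = 0.05,
) -> Tuple[bool, int]:
    """Poll get_count() until it reaches target or timeout_s elapses.

    Returns (reached, highest_seen). Hypothetical helper, not a Ray API.
    """
    highest = get_count()
    deadline = time.monotonic() + timeout_s
    while highest < target and time.monotonic() < deadline:
        time.sleep(interval_s)
        highest = max(highest, get_count())
    return highest >= target, highest


if __name__ == "__main__":
    # Stand-in for a counter that stalls below the target.
    reached, highest = wait_and_report(lambda: 11, target=13, timeout_s=0.2)
    print(f"reached={reached}, highest_seen={highest}")
```

Reporting the highest observed value, instead of only a pass/fail result, is what makes it possible to tell a counter that is slow from one that is stuck.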

After some digging, it seems that the Ray actor worker process is created by forking on Linux, whereas Windows uses CreateProcessA, which builds the process from scratch each time it runs, unlike forking. IIUC, this difference causes the count to grow more slowly on Windows. The Windows call to CreateProcessA is available [here](https://github.com/ray-project/ray/blob/1296dc4699a3c1681fe3de6dd9f63af51d287582/src/ray/util/process.cc#L171), and the forking for Linux is available here.
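
To get a rough feel for the fork vs. from-scratch difference outside of Ray, a small timing sketch like the one below can help. It is an illustration under assumptions, not a measurement of Ray's worker startup: the function names are made up for this example, it starts trivial Python children rather than Ray workers, and the numbers will vary by machine. Launching a fresh interpreter through `subprocess` stands in for from-scratch process creation (on Windows, `subprocess` goes through the Win32 CreateProcess API), while `os.fork` is only available on POSIX systems.

```python
import os
import subprocess
import sys
import time


def time_fresh_processes(n: int) -> float:
    """Seconds to start and reap n brand-new Python interpreters, one at a time."""
    start = time.perf_counter()
    for _ in range(n):
        subprocess.run([sys.executable, "-c", "pass"], check=True)
    return time.perf_counter() - start


def time_forked_processes(n: int) -> float:
    """Seconds to fork and reap n children. Only call where os.fork exists."""
    start = time.perf_counter()
    for _ in range(n):
        pid = os.fork()
        if pid == 0:
            # Child: exit immediately without running cleanup handlers.
            os._exit(0)
        os.waitpid(pid, 0)
    return time.perf_counter() - start


if __name__ == "__main__":
    n = 5
    print(f"fresh interpreter x{n}: {time_fresh_processes(n):.3f}s")
    if hasattr(os, "fork"):
        print(f"fork x{n}: {time_forked_processes(n):.3f}s")
    else:
        print("os.fork is unavailable on this platform (e.g. Windows)")
```

If process creation costs more per attempt, fewer constructor attempts fit into a fixed timeout, which would be consistent with the counter stalling at 9-12 on Windows.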

Hence, the solution is to lighten the test's resource requirements by launching 3 replicas instead of 4 and attempting fewer retries, so that the test passes on both Linux and Windows.

Signed-off-by: doyoung <doyoung@anyscale.com>
landscapepainter commented Oct 9, 2025

Contributor Author

@landscapepainter landscapepainter marked this pull request as ready for review October 9, 2025 15:08

@landscapepainter landscapepainter changed the title from "Fix windows flakiness test" to "Fix test_api.py::test_max_constructor_retry_count failing for windows" Oct 9, 2025
@zcin zcin merged commit 410f434 into ray-project:master Oct 10, 2025
5 checks passed
ArturNiederfahrenhorst pushed a commit to ArturNiederfahrenhorst/ray that referenced this pull request Oct 13, 2025
…ray-project#57541)

`test_api.py::test_max_constructor_retry_count` was failing on Windows.

I tried extending the timeout on wait_on_condition at the end of the
test to 20s - 40s and added a debug statement to check how far the
counter increments. The final value varies from run to run, but I only
observed 9-12; it never reached 13.

After some digging, it seems that the Ray actor worker process is
created by forking on Linux, whereas Windows uses `CreateProcessA`,
which builds the process from scratch each time it runs, unlike
forking. IIUC, this difference causes the count to grow more slowly on
Windows. The Windows call to `CreateProcessA` is available
[here](https://github.com/ray-project/ray/blob/1296dc4699a3c1681fe3de6dd9f63af51d287582/src/ray/util/process.cc#L171),
and the forking for Linux is available here.

Hence, the solution is to lighten the test's resource requirements by
launching 3 replicas instead of 4 and attempting fewer retries, so that
the test passes on both Linux and Windows.

---------

Signed-off-by: doyoung <doyoung@anyscale.com>
harshit-anyscale pushed a commit that referenced this pull request Oct 15, 2025
xinyuangui2 pushed a commit to xinyuangui2/ray that referenced this pull request Oct 22, 2025
Signed-off-by: xgui <xgui@anyscale.com>
elliot-barn pushed a commit that referenced this pull request Oct 23, 2025
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
landscapepainter added a commit to landscapepainter/ray that referenced this pull request Nov 17, 2025
Future-Outlier pushed a commit to Future-Outlier/ray that referenced this pull request Dec 7, 2025
Signed-off-by: Future-Outlier <eric901201@gmail.com>
Labels

core: Issues that should be addressed in Ray Core
go: add ONLY when ready to merge, run all tests

3 participants