[tune] Avoid file deletion race by using unique tmp file names #60556
Signed-off-by: Timothy Seah <tseah@anyscale.com>
Code Review
This pull request effectively addresses a race condition that could occur during checkpoint saving by ensuring unique temporary file names are used. The fix is consistently applied across BasicVariantGenerator, SearchGenerator, and Searcher, replacing hardcoded temporary file names with dynamically generated ones based on the session string. This is a good improvement for the robustness of Tune. I have one suggestion to further improve maintainability by centralizing the temporary filename generation logic.
    file_name=self.CKPT_FILE_TMPL.format(session_str),
    tmp_file_name=".tmp_generator",
    file_name=file_name,
    tmp_file_name=f".tmp-{file_name}",
It's not clear to me if this solves the problem.
- The previous file names (e.g. `.tmp_generator`, `.tmp_search_generator_ckpt`, etc.) are different from each other, so there should not be any contention there.
- The repro is only running a single job, so the uniqueness of `session_str` in `file_name` doesn't make a difference.
Good catch. Regarding point 2, I added a uuid to the tmp file name to ensure uniqueness. If two processes writing to the same file isn't the issue (iiuc `tune_controller` should only be calling one `search_alg.save_to_dir` at a time), perhaps OS flushing is the issue, e.g.:
- Write to .tmp_generator
- Call os.replace to rename .tmp_generator and return before the renaming is complete
- Write to .tmp_generator again
- Finish renaming
- Call os.replace to rename .tmp_generator, which is now gone.
I wasn't able to reproduce the issue but this should be a safe change.
# Summary
Tune occasionally ran into errors like this:
```
Traceback (most recent call last):
  File "/tmp/ray/session_2025-07-07_16-48-33_740950_2862/runtime_resources/working_dir_files/gs_anyscale-public-cloud_org_7c1Kalm9WcX2bNIjW53GUT_cld_6zjdsbn3kbphvk3luhyrd3eg6e_runtime_env_packages_pkg_bc103b49178145e7933f685e54dedda4/run_simple_tune_job.py", line 23, in <module>
    analysis = tune.run(
               ^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/tune/tune.py", line 994, in run
    runner.step()
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/tune/execution/tune_controller.py", line 701, in step
    raise e
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/tune/execution/tune_controller.py", line 698, in step
    self.checkpoint()
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/tune/execution/tune_controller.py", line 352, in checkpoint
    self._checkpoint_manager.sync_up_experiment_state(
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/tune/execution/experiment_state.py", line 167, in sync_up_experiment_state
    save_fn()
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/tune/execution/tune_controller.py", line 348, in save_to_dir
    self._search_alg.save_to_dir(driver_staging_path, session_str=self._session_str)
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/tune/search/basic_variant.py", line 405, in save_to_dir
    _atomic_save(
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/tune/utils/util.py", line 415, in _atomic_save
    os.replace(tmp_search_ckpt_path, os.path.join(checkpoint_dir, file_name))
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/ray/session_2025-07-07_16-48-33_740950_2862/artifacts/2025-07-07_16-49-27/training_function_2025-07-07_16-49-27/driver_artifacts/.tmp_generator' -> '/tmp/ray/session_2025-07-07_16-48-33_740950_2862/artifacts/2025-07-07_16-49-27/training_function_2025-07-07_16-49-27/driver_artifacts/basic-variant-state-2025-07-07_16-49-27.json'
```
I was unable to reproduce this issue. However, I was able to get Claude
Code to generate a
[script](https://gist.github.com/TimothySeah/b033410c2b0b60fb4c4e5849c96cc497)
that shows that the following sequence of events is a possible cause:
* process A: writes to .tmp_generator
* process B: writes to .tmp_generator (same name!)
* process A: renames .tmp_generator to final.json
* process B: FileNotFoundError when trying to rename
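
The interleaving above can be replayed deterministically in a single process. The following is a minimal sketch, not the linked script: `write_tmp` and `demonstrate_shared_tmp_race` are hypothetical names, and the two "processes" are simulated by ordering the calls by hand.

```python
import os
import tempfile


def write_tmp(directory, tmp_name, payload):
    """Stage a payload under a tmp name and return the tmp path (hypothetical helper)."""
    tmp_path = os.path.join(directory, tmp_name)
    with open(tmp_path, "w") as f:
        f.write(payload)
    return tmp_path


def demonstrate_shared_tmp_race():
    """Replay the A/B interleaving; return True if B's rename fails."""
    with tempfile.TemporaryDirectory() as d:
        final_path = os.path.join(d, "final.json")
        # Both writers stage under the same fixed tmp name.
        tmp_a = write_tmp(d, ".tmp_generator", '{"writer": "A"}')
        tmp_b = write_tmp(d, ".tmp_generator", '{"writer": "B"}')
        # Writer A renames first, so the shared tmp file no longer exists.
        os.replace(tmp_a, final_path)
        try:
            # Writer B now tries to rename a file that is already gone.
            os.replace(tmp_b, final_path)
        except FileNotFoundError:
            return True
        return False


race_reproduced = demonstrate_shared_tmp_race()
```

Note that in this ordering `final.json` ends up holding writer B's payload, because B overwrote the shared tmp file before A renamed it.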
This PR attempts to solve the problem by giving the tmp file a unique
name instead of the fixed `.tmp_generator`, which is the same approach already used here:
https://github.com/ray-project/ray/blob/b2be3a336d1f292de1f785a6dae29afc6b4611d7/python/ray/tune/callback.py#L467.
This should cause no further conflicts since the `CKPT_FILE_TMPL` values of the
different components are all different.
Before this change, tmp files looked like `.tmp_generator`. Now they
look like `.uuid-tmp-filename.json`.
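
As a rough illustration of the naming scheme, here is a sketch of a write-then-rename save where each writer stages under its own uuid-prefixed tmp name. The `stage` and `commit` helpers and the example file name are assumptions made for this sketch; they are not Ray's `_atomic_save`.

```python
import json
import os
import tempfile
import uuid


def stage(state, checkpoint_dir, file_name):
    """Write state to a uniquely named tmp file and return its path (hypothetical helper)."""
    tmp_name = f".{uuid.uuid4().hex}-tmp-{file_name}"
    tmp_path = os.path.join(checkpoint_dir, tmp_name)
    with open(tmp_path, "w") as f:
        json.dump(state, f)
    return tmp_path


def commit(tmp_path, checkpoint_dir, file_name):
    """Atomically move the staged file onto the final checkpoint path."""
    final_path = os.path.join(checkpoint_dir, file_name)
    os.replace(tmp_path, final_path)
    return final_path


with tempfile.TemporaryDirectory() as d:
    name = "basic-variant-state-example.json"  # illustrative file name
    # Same interleaving as the failure case: both writers stage first...
    tmp_a = stage({"writer": "A"}, d, name)
    tmp_b = stage({"writer": "B"}, d, name)
    # ...then both rename. Neither rename can remove the other's tmp file.
    commit(tmp_a, d, name)
    final_path = commit(tmp_b, d, name)
    with open(final_path) as f:
        final_state = json.load(f)
    leftover = [p for p in os.listdir(d) if p != name]
```

With unique tmp names the second rename simply overwrites the final file, so the last writer wins and no tmp files are left behind.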
# Testing
Unit tests.
---------
Signed-off-by: Timothy Seah <tseah@anyscale.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
Signed-off-by: Adel Nour <ans9868@nyu.edu>