
[tune] Avoid file deletion race by using unique tmp file names #60556

Merged
matthewdeng merged 6 commits into ray-project:master from TimothySeah:tseah/fix-tune-file-race
Jan 30, 2026

@TimothySeah TimothySeah commented Jan 28, 2026


Summary

Tune occasionally failed with errors like this:

Traceback (most recent call last):
  File "/tmp/ray/session_2025-07-07_16-48-33_740950_2862/runtime_resources/working_dir_files/gs_anyscale-public-cloud_org_7c1Kalm9WcX2bNIjW53GUT_cld_6zjdsbn3kbphvk3luhyrd3eg6e_runtime_env_packages_pkg_bc103b49178145e7933f685e54dedda4/run_simple_tune_job.py", line 23, in <module>
    analysis = tune.run(
               ^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/tune/tune.py", line 994, in run
    runner.step()
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/tune/execution/tune_controller.py", line 701, in step
    raise e
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/tune/execution/tune_controller.py", line 698, in step
    self.checkpoint()
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/tune/execution/tune_controller.py", line 352, in checkpoint
    self._checkpoint_manager.sync_up_experiment_state(
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/tune/execution/experiment_state.py", line 167, in sync_up_experiment_state
    save_fn()
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/tune/execution/tune_controller.py", line 348, in save_to_dir
    self._search_alg.save_to_dir(driver_staging_path, session_str=self._session_str)
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/tune/search/basic_variant.py", line 405, in save_to_dir
    _atomic_save(
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/tune/utils/util.py", line 415, in _atomic_save
    os.replace(tmp_search_ckpt_path, os.path.join(checkpoint_dir, file_name))
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/ray/session_2025-07-07_16-48-33_740950_2862/artifacts/2025-07-07_16-49-27/training_function_2025-07-07_16-49-27/driver_artifacts/.tmp_generator' -> '/tmp/ray/session_2025-07-07_16-48-33_740950_2862/artifacts/2025-07-07_16-49-27/training_function_2025-07-07_16-49-27/driver_artifacts/basic-variant-state-2025-07-07_16-49-27.json'

I was unable to reproduce this issue. However, I had Claude Code generate a script showing that the following sequence of events is one possible cause:

  • process A: writes to .tmp_generator
  • process B: writes to .tmp_generator (same name!)
  • process A: renames .tmp_generator to final.json
  • process B: FileNotFoundError when trying to rename
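
The sequence above can be replayed deterministically in a single process. This is a minimal sketch, not the reproduction script mentioned in the summary: `write_tmp` and `publish` are hypothetical helpers standing in for the write and `os.replace` halves of `_atomic_save`, and the two "processes" are simulated by interleaving the calls by hand.

```python
import os
import tempfile


def write_tmp(directory: str, tmp_name: str, payload: str) -> str:
    """Write payload to a temporary file and return its path (hypothetical helper)."""
    tmp_path = os.path.join(directory, tmp_name)
    with open(tmp_path, "w") as f:
        f.write(payload)
    return tmp_path


def publish(tmp_path: str, final_path: str) -> None:
    """Move the temporary file into place, like the os.replace call in the traceback."""
    os.replace(tmp_path, final_path)


def simulate_shared_tmp_race() -> bool:
    """Interleave two savers that share one tmp name; return True if B's rename fails."""
    with tempfile.TemporaryDirectory() as d:
        final_path = os.path.join(d, "final.json")
        tmp_a = write_tmp(d, ".tmp_generator", '{"writer": "A"}')
        tmp_b = write_tmp(d, ".tmp_generator", '{"writer": "B"}')  # same path as tmp_a
        publish(tmp_a, final_path)  # A's rename consumes the shared tmp file
        try:
            publish(tmp_b, final_path)  # B's tmp file no longer exists
        except FileNotFoundError:
            return True
        return False


race_hit = simulate_shared_tmp_race()
```

Whether two savers ever actually overlap like this inside Tune is exactly what the review discussion questions; the sketch only shows that a shared tmp name makes the error possible.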

This PR attempts to solve the problem by giving the tmp file a unique name instead of the hardcoded .tmp_generator, the same approach already used in python/ray/tune/callback.py:

tmp_file_name = f".tmp-{file_name}"

This should cause no further conflicts, since each component has a different CKPT_FILE_TMPL.

Before this change, tmp files looked like .tmp_generator. Now they look like .uuid-tmp-filename.json.
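
A minimal sketch of the naming scheme described above, assuming JSON state and a hypothetical helper `atomic_save_unique`. The real `_atomic_save` in `ray/tune/utils/util.py` has its own signature and serialization, which this does not try to reproduce.

```python
import json
import os
import tempfile
import uuid


def atomic_save_unique(state: dict, checkpoint_dir: str, file_name: str) -> str:
    """Write state to a uniquely named tmp file, then rename it into place.

    Hypothetical helper for illustration only, not Ray's _atomic_save.
    """
    # A fresh uuid per call means no two saves ever share a tmp path.
    tmp_name = f".{uuid.uuid4().hex}-tmp-{file_name}"
    tmp_path = os.path.join(checkpoint_dir, tmp_name)
    final_path = os.path.join(checkpoint_dir, file_name)
    with open(tmp_path, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, final_path)
    return final_path


with tempfile.TemporaryDirectory() as d:
    atomic_save_unique({"step": 1}, d, "basic-variant-state.json")
    path = atomic_save_unique({"step": 2}, d, "basic-variant-state.json")
    with open(path) as f:
        saved = json.load(f)
    leftovers = sorted(os.listdir(d))
```

After both saves, only the final file remains in the directory, because each tmp file is consumed by its own rename.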

Testing

Unit tests.

Signed-off-by: Timothy Seah <tseah@anyscale.com>
@TimothySeah TimothySeah requested a review from a team as a code owner January 28, 2026 01:19

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request effectively addresses a race condition that could occur during checkpoint saving by ensuring unique temporary file names are used. The fix is consistently applied across BasicVariantGenerator, SearchGenerator, and Searcher, replacing hardcoded temporary file names with dynamically generated ones based on the session string. This is a good improvement for the robustness of Tune. I have one suggestion to further improve maintainability by centralizing the temporary filename generation logic.

Comment thread python/ray/tune/search/searcher.py Outdated
@ray-gardener ray-gardener Bot added the tune Tune-related issues label Jan 28, 2026
Comment thread python/ray/tune/search/basic_variant.py Outdated
- file_name=self.CKPT_FILE_TMPL.format(session_str),
- tmp_file_name=".tmp_generator",
+ file_name=file_name,
+ tmp_file_name=f".tmp-{file_name}",

It's not clear to me if this solves the problem.

  1. The previous file names (e.g. .tmp_generator, .tmp_search_generator_ckpt, etc.) are different from each other so there should not be any contention there.
  2. The repro is only running a single job so the uniqueness of session_str in file_name doesn't make a difference.

Contributor Author


Good catch. Regarding point 2, I added a uuid to the tmp file name to ensure uniqueness. If two processes writing to the same file isn't the issue (iiuc, tune_controller should only call one search_alg.save_to_dir at a time), perhaps OS flushing is, e.g.:

  1. Write to .tmp_generator
  2. Call os.replace to rename .tmp_generator and return before the renaming is complete
  3. Write to .tmp_generator again
  4. Finish renaming
  5. Call os.replace to rename .tmp_generator, which is now gone.

I wasn't able to reproduce the issue but this should be a safe change.
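
To illustrate why a uuid was added in response to point 2: a tmp name derived only from the session string is identical for every save within one job, while a per-call uuid is not. The template string and the `session_tmp_name` and `uuid_tmp_name` helpers below are hypothetical, modeled on the file names in the traceback rather than copied from Ray.

```python
import uuid

# Hypothetical template, modeled on the final file name seen in the traceback.
CKPT_FILE_TMPL = "basic-variant-state-{}.json"


def session_tmp_name(session_str: str) -> str:
    """Tmp name derived only from the session string."""
    return f".tmp-{CKPT_FILE_TMPL.format(session_str)}"


def uuid_tmp_name(session_str: str) -> str:
    """Tmp name that also carries a per-call uuid."""
    return f".{uuid.uuid4().hex}-tmp-{CKPT_FILE_TMPL.format(session_str)}"


session = "2025-07-07_16-49-27"
# Two saves in the same job produce the same session-based tmp name...
same_within_job = session_tmp_name(session) == session_tmp_name(session)
# ...but different uuid-based tmp names.
unique_per_call = uuid_tmp_name(session) != uuid_tmp_name(session)
```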

Signed-off-by: Timothy Seah <tseah@anyscale.com>
Signed-off-by: Timothy Seah <tseah@anyscale.com>

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Comment thread python/ray/tune/search/searcher.py
Comment thread python/ray/tune/search/searcher.py
Comment thread python/ray/tune/utils/util.py Outdated
Signed-off-by: Timothy Seah <tseah@anyscale.com>
…ing.json

Signed-off-by: Timothy Seah <tseah@anyscale.com>
@TimothySeah TimothySeah added the go add ONLY when ready to merge, run all tests label Jan 30, 2026
Signed-off-by: Timothy Seah <tseah@anyscale.com>
@matthewdeng matthewdeng enabled auto-merge (squash) January 30, 2026 00:33
@matthewdeng matthewdeng merged commit 48c15d1 into ray-project:master Jan 30, 2026
7 checks passed
elliot-barn pushed a commit that referenced this pull request Feb 9, 2026
# Summary

Tune occasionally ran into issues like this

```
Traceback (most recent call last):
  File "/tmp/ray/session_2025-07-07_16-48-33_740950_2862/runtime_resources/working_dir_files/gs_anyscale-public-cloud_org_7c1Kalm9WcX2bNIjW53GUT_cld_6zjdsbn3kbphvk3luhyrd3eg6e_runtime_env_packages_pkg_bc103b49178145e7933f685e54dedda4/run_simple_tune_job.py", line 23, in <module>
    analysis = tune.run(
               ^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/tune/tune.py", line 994, in run
    runner.step()
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/tune/execution/tune_controller.py", line 701, in step
    raise e
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/tune/execution/tune_controller.py", line 698, in step
    self.checkpoint()
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/tune/execution/tune_controller.py", line 352, in checkpoint
    self._checkpoint_manager.sync_up_experiment_state(
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/tune/execution/experiment_state.py", line 167, in sync_up_experiment_state
    save_fn()
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/tune/execution/tune_controller.py", line 348, in save_to_dir
    self._search_alg.save_to_dir(driver_staging_path, session_str=self._session_str)
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/tune/search/basic_variant.py", line 405, in save_to_dir
    _atomic_save(
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/tune/utils/util.py", line 415, in _atomic_save
    os.replace(tmp_search_ckpt_path, os.path.join(checkpoint_dir, file_name))
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/ray/session_2025-07-07_16-48-33_740950_2862/artifacts/2025-07-07_16-49-27/training_function_2025-07-07_16-49-27/driver_artifacts/.tmp_generator' -> '/tmp/ray/session_2025-07-07_16-48-33_740950_2862/artifacts/2025-07-07_16-49-27/training_function_2025-07-07_16-49-27/driver_artifacts/basic-variant-state-2025-07-07_16-49-27.json'
```

I was unable to reproduce this issue. However, I was able to get Claude
Code to generate a
[script](https://gist.github.com/TimothySeah/b033410c2b0b60fb4c4e5849c96cc497)
that shows that the following sequence of events is a possible cause:
* process A: writes to .tmp_generator
* process B: writes to .tmp_generator (same name!)
* process A: renames .tmp_generator to final.json
* process B: FileNotFoundError when trying to rename

This PR attempts to solve the problem by making `.tmp_generator` a
unique name instead, which is the same approach we are doing here:
https://github.com/ray-project/ray/blob/b2be3a336d1f292de1f785a6dae29afc6b4611d7/python/ray/tune/callback.py#L467.
This should have no further conflicts since the `CKPT_FILE_TMPL` of the
different components are all different.

Before this change, tmp files looked like `.tmp_generator`. Now they
look like `.uuid-tmp-filename.json`.

# Testing

Unit tests.

---------

Signed-off-by: Timothy Seah <tseah@anyscale.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
ans9868 pushed a commit to ans9868/ray that referenced this pull request Feb 18, 2026
…roject#60556)

Signed-off-by: Timothy Seah <tseah@anyscale.com>
Signed-off-by: Adel Nour <ans9868@nyu.edu>