[tune] FileNotFoundError when deleting checkpoint
System information
- OS Platform and Distribution: Linux Ubuntu 16.04.6 LTS
- Ray installed from: binary
- Ray version: 0.7.3
- Python version: 3.7.3
Describe the problem
We use Ray Tune to tune the hyperparameters of our model with PBT (Population Based Training), running multiple workers in parallel. Occasionally, some workers crash with the following error message:
Traceback (most recent call last):
  File "/home/user/.cache/pypoetry/virtualenvs/project-py3.7/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 537, in _process_trial
    trial, force=result.get(SHOULD_CHECKPOINT, False))
  File "/home/user/.cache/pypoetry/virtualenvs/project-py3.7/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 567, in _checkpoint_trial_if_needed
    self.trial_executor.save(trial, storage=Checkpoint.DISK)
  File "/home/user/.cache/pypoetry/virtualenvs/project-py3.7/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 513, in save
    self._checkpoint_and_erase(trial)
  File "/home/user/.cache/pypoetry/virtualenvs/project-py3.7/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 538, in _checkpoint_and_erase
    ray.get(trial.runner.delete_checkpoint.remote(trial.history[-1]))
  File "/home/user/.cache/pypoetry/virtualenvs/project-py3.7/lib/python3.7/site-packages/ray/worker.py", line 2247, in get
    raise value
ray.exceptions.RayTaskError: ray_SupervisedTrainable:delete_checkpoint() (pid=56457, host=ip-172-31-34-33)
  File "/home/user/.cache/pypoetry/virtualenvs/project-py3.7/lib/python3.7/site-packages/ray/tune/trainable.py", line 246, in delete_checkpoint
    shutil.rmtree(checkpoint_dir)
  File "/home/user/.pyenv/versions/3.7.3/lib/python3.7/shutil.py", line 482, in rmtree
    onerror(os.lstat, path, sys.exc_info())
  File "/home/user/.pyenv/versions/3.7.3/lib/python3.7/shutil.py", line 480, in rmtree
    orig_st = os.lstat(path)
FileNotFoundError: [Errno 2] No such file or directory: '/home/user/logs/model-20190824_174949/model_0_2019-08-24_17-57-57x48g5wke/checkpoint_5/checkpoint-ec1b867c-8dee-4bad-afac-705ba26bc1fa.pth.tar'
Apparently, Ray tries to delete a checkpoint folder that does not exist or has already been deleted. A quick fix seems to be to override delete_checkpoint of the Trainable class so that it checks whether the checkpoint directory exists before deleting it:
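import os
import shutil

def delete_checkpoint(self, checkpoint_dir):
    # Skip deletion if the path is already gone.
    if os.path.exists(checkpoint_dir):
        if os.path.isfile(checkpoint_dir):
            # A checkpoint file was passed: remove its containing directory.
            shutil.rmtree(os.path.dirname(checkpoint_dir))
        else:
            shutil.rmtree(checkpoint_dir)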
However, this does not fix the underlying issue, and it is not clear to me where that issue stems from. The checkpoint directory is not on a remote drive.
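An existence check followed by a delete still leaves a small window in which the path can disappear between the two calls. A minimal sketch of a variant that simply tolerates a missing path is shown here. The helper name remove_checkpoint_dir is hypothetical and not part of Ray, and like the workaround above it only hides the symptom rather than explaining why the checkpoint is deleted twice.

```python
import os
import shutil


def remove_checkpoint_dir(checkpoint_path):
    """Delete the directory holding a checkpoint, tolerating a missing path.

    Hypothetical helper, not part of Ray. Returns True if something was
    deleted and False if the path was already gone.
    """
    # As in the workaround: a file path means "remove its parent directory".
    target = checkpoint_path
    if os.path.isfile(checkpoint_path):
        target = os.path.dirname(checkpoint_path)
    try:
        shutil.rmtree(target)
    except FileNotFoundError:
        # The path was already removed (by an earlier call or another process).
        return False
    return True
```

Assuming the same Trainable subclass as above, the overridden delete_checkpoint could call this helper with its checkpoint_dir argument and log when it returns False, which may help track down where the duplicate deletion comes from.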
Issue Analytics
- Created 4 years ago
- Comments:6 (4 by maintainers)
Top GitHub Comments
Got it; so there’s an error with keeping the tail of checkpoints and PBT. We’ll look into this.
@FelixOpolka sorry for the delay, this has been fixed. Feel free to give it a try on the latest wheels when available.