[tune] FileNotFoundError when deleting checkpoint

System information

  • OS Platform and Distribution: Linux Ubuntu 16.04.6 LTS
  • Ray installed from: binary
  • Ray version: 0.7.3
  • Python version: 3.7.3

Describe the problem

We use Ray Tune to tune the hyperparameters of our model with PBT (Population Based Training), running multiple workers in parallel. Occasionally, some workers crash with the following error message:

Traceback (most recent call last):
  File "/home/user/.cache/pypoetry/virtualenvs/project-py3.7/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 537, in _process_trial
    trial, force=result.get(SHOULD_CHECKPOINT, False))
  File "/home/user/.cache/pypoetry/virtualenvs/project-py3.7/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 567, in _checkpoint_trial_if_needed
    self.trial_executor.save(trial, storage=Checkpoint.DISK)
  File "/home/user/.cache/pypoetry/virtualenvs/project-py3.7/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 513, in save
    self._checkpoint_and_erase(trial)
  File "/home/user/.cache/pypoetry/virtualenvs/project-py3.7/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 538, in _checkpoint_and_erase
    ray.get(trial.runner.delete_checkpoint.remote(trial.history[-1]))
  File "/home/user/.cache/pypoetry/virtualenvs/project-py3.7/lib/python3.7/site-packages/ray/worker.py", line 2247, in get
    raise value
ray.exceptions.RayTaskError: ray_SupervisedTrainable:delete_checkpoint() (pid=56457, host=ip-172-31-34-33)
  File "/home/user/.cache/pypoetry/virtualenvs/project-py3.7/lib/python3.7/site-packages/ray/tune/trainable.py", line 246, in delete_checkpoint
    shutil.rmtree(checkpoint_dir)
  File "/home/user/.pyenv/versions/3.7.3/lib/python3.7/shutil.py", line 482, in rmtree
    onerror(os.lstat, path, sys.exc_info())
  File "/home/user/.pyenv/versions/3.7.3/lib/python3.7/shutil.py", line 480, in rmtree
    orig_st = os.lstat(path)
FileNotFoundError: [Errno 2] No such file or directory: '/home/user/logs/model-20190824_174949/model_0_2019-08-24_17-57-57x48g5wke/checkpoint_5/checkpoint-ec1b867c-8dee-4bad-afac-705ba26bc1fa.pth.tar'

Apparently, Ray tries to delete a checkpoint that does not exist or has already been deleted. A quick fix seems to be to override delete_checkpoint of the Trainable class so that it checks whether the checkpoint path exists before deleting it:

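    # Requires `import os` and `import shutil` at module level.
    def delete_checkpoint(self, checkpoint_dir):
        # Skip deletion when the checkpoint is already gone.
        if os.path.exists(checkpoint_dir):
            if os.path.isfile(checkpoint_dir):
                # A file path was passed: remove its containing directory.
                shutil.rmtree(os.path.dirname(checkpoint_dir))
            else:
                shutil.rmtree(checkpoint_dir)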

However, this does not fix the underlying issue, and it is not clear to me where that issue stems from. The checkpoint directory is not on a remote drive.
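
An exists-then-delete check can still fail if the checkpoint disappears between the check and the call to shutil.rmtree. A minimal sketch of a more tolerant variant is shown here, assuming it is acceptable to treat an already-missing checkpoint as a no-op. The helper name delete_checkpoint_quietly is hypothetical and not part of Ray; it only mirrors the file-or-directory handling seen in the traceback, and it hides the symptom rather than fixing the cause.

```python
import os
import shutil


def delete_checkpoint_quietly(checkpoint_path):
    """Remove a checkpoint, tolerating one that is already gone.

    Hypothetical helper, not part of Ray. Returns True if a directory
    was removed and False if the target no longer existed.
    """
    # The traceback shows a file path being passed, so when the path is
    # an existing file, remove the directory that contains it.
    if os.path.isfile(checkpoint_path):
        target = os.path.dirname(checkpoint_path)
    else:
        target = checkpoint_path
    try:
        shutil.rmtree(target)
    except FileNotFoundError:
        # Already deleted (possibly by another actor); nothing to do.
        return False
    return True
```

Inside a Trainable subclass, an overridden delete_checkpoint could delegate to this helper. Logging the False case would help show how often the double deletion actually happens.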

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 6 (4 by maintainers)

Top GitHub Comments

richardliaw commented, Sep 6, 2019 (1 reaction)

Got it; so there’s an error with keeping the tail of checkpoints and PBT. We’ll look into this.

ujvl commented, Nov 18, 2019 (0 reactions)

@FelixOpolka sorry for the delay, this has been fixed. Feel free to give it a try on the latest wheels when available.
