FileNotFoundError when running distributed Trainer
See original GitHub issue

🐛 Bug
Information
I’m running the language modeling example in distributed mode and am getting FileNotFoundError in the Trainer.
The error:
Traceback (most recent call last):
File "run_language_modeling.py", line 292, in <module>
main()
File "run_language_modeling.py", line 257, in main
trainer.train(model_path=model_path)
File "/home/shaoyent/anaconda3/envs/bert/lib/python3.7/site-packages/transformers/trainer.py", line 451, in train
torch.save(optimizer.state_dict(), os.path.join(output_dir, "optimizer.pt"))
File "/home/shaoyent/anaconda3/envs/bert/lib/python3.7/site-packages/torch/serialization.py", line 327, in save
with _open_file_like(f, 'wb') as opened_file:
File "/home/shaoyent/anaconda3/envs/bert/lib/python3.7/site-packages/torch/serialization.py", line 212, in _open_file_like
return _open_file(name_or_buffer, mode)
File "/home/shaoyent/anaconda3/envs/bert/lib/python3.7/site-packages/torch/serialization.py", line 193, in __init__
super(_open_file, self).__init__(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: '/home/shaoyent/transformers/examples/language-modeling/output/checkpoint-500/optimizer.pt'
The error seems to be caused by the Trainer saving checkpoints from each local master while the output directory is created only by the world master, which produces a race condition.
I think the suggested fix is to save checkpoints only from the world master, since the weights should be synchronized across processes after backprop anyway.
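Conceptually, the fix would look something like the sketch below. This is a minimal illustration, not the actual Trainer code: the helper name save_optimizer_checkpoint is made up, and it assumes torch.distributed has already been initialized by the launcher and that optimizer and output_dir are defined as in the example script.

import os
import torch
import torch.distributed as dist

def save_optimizer_checkpoint(optimizer, output_dir):
    # Only global rank 0 (the world master) creates the directory and writes
    # the file; other ranks skip the write entirely, so there is no race on a
    # directory created by a different process.
    if not dist.is_initialized() or dist.get_rank() == 0:
        os.makedirs(output_dir, exist_ok=True)
        torch.save(optimizer.state_dict(), os.path.join(output_dir, "optimizer.pt"))
    if dist.is_initialized():
        # Keep all ranks in step so nobody tries to read the checkpoint early.
        dist.barrier()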
Issue Analytics
- State:
- Created 3 years ago
- Comments: 7 (7 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
I see. Thinking about it more, yes, I would like it if we only save from world_master (even if it adds one step for some users). Thanks for your insight.
Ah, I see, thanks for the clarification. I'm assuming a shared file system, so everything saves and loads from the same files. I guess that makes sense for other cluster configurations, but care would have to be taken to avoid write/mkdir conflicts from multiple processes.
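For completeness, if checkpoints do need to be written by every local master (for example, when each node only sees its own disk), one way to take that care is sketched below. The helper name save_on_each_node, the checkpoint.pt filename, and the local_rank convention are assumptions for illustration, not the Trainer's actual behavior.

import os
import torch
import torch.distributed as dist

def save_on_each_node(state_dict, output_dir, local_rank):
    # Hypothetical helper: every node's local master (local_rank 0, or -1 when
    # not distributed) writes the checkpoint, e.g. to node-local storage.
    if local_rank in (-1, 0):
        # exist_ok=True tolerates the directory already having been created by
        # another process on a shared filesystem, avoiding the mkdir race.
        os.makedirs(output_dir, exist_ok=True)
        torch.save(state_dict, os.path.join(output_dir, "checkpoint.pt"))
    if dist.is_initialized():
        # Synchronize so no rank moves on before the files exist on disk.
        dist.barrier()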