FileNotFoundError when running distributed Trainer

🐛 Bug

Information

I’m running the language modeling example in distributed mode and am getting FileNotFoundError in the Trainer.

The error:

Traceback (most recent call last):
  File "run_language_modeling.py", line 292, in <module>
    main()
  File "run_language_modeling.py", line 257, in main
    trainer.train(model_path=model_path)
  File "/home/shaoyent/anaconda3/envs/bert/lib/python3.7/site-packages/transformers/trainer.py", line 451, in train
    torch.save(optimizer.state_dict(), os.path.join(output_dir, "optimizer.pt"))
  File "/home/shaoyent/anaconda3/envs/bert/lib/python3.7/site-packages/torch/serialization.py", line 327, in save
    with _open_file_like(f, 'wb') as opened_file:
  File "/home/shaoyent/anaconda3/envs/bert/lib/python3.7/site-packages/torch/serialization.py", line 212, in _open_file_like
    return _open_file(name_or_buffer, mode)
  File "/home/shaoyent/anaconda3/envs/bert/lib/python3.7/site-packages/torch/serialization.py", line 193, in __init__
    super(_open_file, self).__init__(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: '/home/shaoyent/transformers/examples/language-modeling/output/checkpoint-500/optimizer.pt'

The error seems to be caused by the Trainer saving checkpoints from each local master while the checkpoint directory is created only by the world master, which causes a race condition.

I think the right fix is to save checkpoints only from the world master, since the weights should be synchronized across processes after backprop.
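
The idea of saving only from the world master can be sketched roughly as follows. This is a minimal illustration, not the Trainer's actual code: `is_world_master` and `save_checkpoint` are hypothetical helpers, the sketch assumes the launcher exports the global rank in the `RANK` environment variable, and `pickle` is only a stand-in so the example runs without PyTorch (in a real script you would pass `torch.save` as `save_fn`).

```python
import os
import pickle


def is_world_master() -> bool:
    """True only for global rank 0 (assumes the launcher sets RANK)."""
    return int(os.environ.get("RANK", "0")) == 0


def save_checkpoint(state, output_dir, step, save_fn=None):
    """Write a checkpoint from the world master only.

    Every other rank returns None without touching the file system,
    so no process writes into a directory it did not create.
    """
    if not is_world_master():
        return None

    ckpt_dir = os.path.join(output_dir, f"checkpoint-{step}")
    # The same process creates the directory and writes into it.
    os.makedirs(ckpt_dir, exist_ok=True)
    path = os.path.join(ckpt_dir, "optimizer.pt")

    if save_fn is None:
        # Stand-in serializer for this sketch.
        with open(path, "wb") as f:
            pickle.dump(state, f)
    else:
        # e.g. save_fn=torch.save in a real training script
        save_fn(state, path)
    return path
```

On a multi-node job without a shared file system, this means only the first node ends up with the checkpoint, which is the extra step for some users mentioned in the comments.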

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 7 (7 by maintainers)

Top GitHub Comments

1 reaction
julien-c commented, May 14, 2020

I see. Thinking about it more, yes, I think I'd like it if we only save from world_master (even if it adds one step for some users). Thanks for your insight.

0 reactions
shaoyent commented, May 14, 2020

Ah, I see, thanks for the clarification. I’m assuming a shared file system so it’s all saving and loading from the same file. I guess for other cluster configurations that makes sense, but care would have to be taken to avoid write/mkdir conflicts from multiple processes.
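
If several processes do end up writing to the same location, one way to avoid the mkdir and write conflicts mentioned above is to make directory creation idempotent and to publish each file with an atomic rename. The sketch below uses only the Python standard library; `atomic_write` is a hypothetical helper, and the atomicity of `os.replace` is assumed to hold on the file system in use (it does on local POSIX file systems, while network file systems may behave differently).

```python
import os
import tempfile


def atomic_write(path: str, data: bytes) -> None:
    """Write data to path so that readers never see a partial file."""
    directory = os.path.dirname(path) or "."
    # exist_ok=True makes concurrent creation by several processes harmless.
    os.makedirs(directory, exist_ok=True)

    # Each writer gets its own temporary file in the target directory.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        # Rename over the final name; the last writer wins.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

This only protects against partial or missing files. It does not decide which process's copy should win, so saving from a single rank is still the simpler choice when all ranks hold identical weights.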
