FileNotFoundError when running distributed Trainer
See original GitHub issue

🐛 Bug
Information
I’m running the language modeling example in distributed mode and am getting FileNotFoundError in the Trainer.
The error:
Traceback (most recent call last):
File "run_language_modeling.py", line 292, in <module>
main()
File "run_language_modeling.py", line 257, in main
trainer.train(model_path=model_path)
File "/home/shaoyent/anaconda3/envs/bert/lib/python3.7/site-packages/transformers/trainer.py", line 451, in train
torch.save(optimizer.state_dict(), os.path.join(output_dir, "optimizer.pt"))
File "/home/shaoyent/anaconda3/envs/bert/lib/python3.7/site-packages/torch/serialization.py", line 327, in save
with _open_file_like(f, 'wb') as opened_file:
File "/home/shaoyent/anaconda3/envs/bert/lib/python3.7/site-packages/torch/serialization.py", line 212, in _open_file_like
return _open_file(name_or_buffer, mode)
File "/home/shaoyent/anaconda3/envs/bert/lib/python3.7/site-packages/torch/serialization.py", line 193, in __init__
super(_open_file, self).__init__(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: '/home/shaoyent/transformers/examples/language-modeling/output/checkpoint-500/optimizer.pt'
The error seems to be caused by the Trainer saving checkpoints from each local master while the output directory is created only by the world master, which produces a race condition.
I think the suggested fix is to save checkpoints only from the world master, since the weights should be synchronized across processes after backprop anyway.
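Conceptually, the fix would look something like the sketch below. This is a minimal illustration, not the actual Trainer code: the helper name save_optimizer_checkpoint is made up, and it assumes torch.distributed has already been initialized by the launcher and that optimizer and output_dir are defined as in the example script.

import os
import torch
import torch.distributed as dist

def save_optimizer_checkpoint(optimizer, output_dir):
    # Only global rank 0 (the world master) creates the directory and writes
    # the file; other ranks skip the write entirely, so there is no race on a
    # directory created by a different process.
    if not dist.is_initialized() or dist.get_rank() == 0:
        os.makedirs(output_dir, exist_ok=True)
        torch.save(optimizer.state_dict(), os.path.join(output_dir, "optimizer.pt"))
    if dist.is_initialized():
        # Keep all ranks in step so nobody tries to read the checkpoint early.
        dist.barrier()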
Issue Analytics
- State:
- Created 3 years ago
- Comments: 7 (7 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
I see. Thinking about it more, yes, I would like it if we only save from world_master (even if it adds one step for some users). Thanks for your insight.
Ah, I see, thanks for the clarification. I'm assuming a shared file system, so everything saves and loads from the same files. I guess that makes sense for other cluster configurations, but care would have to be taken to avoid write/mkdir conflicts from multiple processes.
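For completeness, if checkpoints do need to be written by every local master (for example, when each node only sees its own disk), one way to take that care is sketched below. The helper name save_on_each_node, the checkpoint.pt filename, and the local_rank convention are assumptions for illustration, not the Trainer's actual behavior.

import os
import torch
import torch.distributed as dist

def save_on_each_node(state_dict, output_dir, local_rank):
    # Hypothetical helper: every node's local master (local_rank 0, or -1 when
    # not distributed) writes the checkpoint, e.g. to node-local storage.
    if local_rank in (-1, 0):
        # exist_ok=True tolerates the directory already having been created by
        # another process on a shared filesystem, avoiding the mkdir race.
        os.makedirs(output_dir, exist_ok=True)
        torch.save(state_dict, os.path.join(output_dir, "checkpoint.pt"))
    if dist.is_initialized():
        # Synchronize so no rank moves on before the files exist on disk.
        dist.barrier()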