Distributed program hangs in SLURM
🐛 Bug description
Hi @vfdev-5,
We got an urgent issue from MONAI and Clara users: a distributed program hangs on the NVIDIA NSL-B platform, which is based on SLURM. You can reproduce the issue with this simple example: https://github.com/Project-MONAI/tutorials/blob/master/acceleration/distributed_training/unet_training_workflows.py

It hangs when creating the ignite Accuracy metric, which seems related to this line: https://github.com/pytorch/ignite/blob/v0.4.4.post1/ignite/distributed/comp_models/native.py#L107

After removing the Accuracy metric from the example, it hangs when training starts and has not timed out yet. Please note that this example runs successfully with ignite 0.4.2. We also tried a pure PyTorch dist example in the same hardware and software environment, and it runs successfully: https://github.com/Project-MONAI/tutorials/blob/master/acceleration/distributed_training/unet_training_ddp.py
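For quick reference, here is a condensed sketch of the failing pattern, assuming the structure of the linked MONAI workflow script (the function name and backend here are illustrative, not the exact script):

```python
import ignite.distributed as idist
from ignite.metrics import Accuracy

def training(local_rank):
    # With ignite 0.4.4 launched via srun, constructing the metric reportedly
    # hangs here: Metric.__init__ triggers a sync inside
    # ignite/distributed/comp_models/native.py.
    metric = Accuracy()
    # ... build model, optimizer, loaders, and run the trainer ...

if __name__ == "__main__":
    # idist.Parallel picks up the SLURM environment and initializes the
    # process group before calling `training` on each process.
    with idist.Parallel(backend="nccl") as parallel:
        parallel.run(training)
```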
Could you please help analyze the cause and give some advice? This is currently blocking our cooperation with another team.
Thanks in advance.
Environment
- PyTorch Version (e.g., 1.4): 1.8.1
- Ignite Version (e.g., 0.3.0): 0.4.4
- OS (e.g., Linux): Ubuntu 18.04
- How you installed Ignite (conda, pip, source): pip
Hi @vfdev-5 and @sdesrozis,
Thanks for the prompt response! I removed the first idist.sync and changed the second one to idist.barrier, as sketched below. Things are running fine now.
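For reference, a sketch of the change described above (the actual call sites are in the MONAI workflow script, so the surrounding context is assumed):

```python
import ignite.distributed as idist

# Before: two explicit re-syncs of ignite's computation model; under SLURM
# the first one reportedly hung.
# idist.sync()   # first call: removed
# idist.sync()   # second call: replaced, see below

# After: a plain collective barrier aligns all processes without re-deriving
# the distributed configuration.
idist.barrier()
```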
@hw-ju Thanks for the report.
It seems that your environment contains some variables usually set by the PyTorch launcher. In the current ignite distributed module, the SLURM and PyTorch launchers are mutually exclusive, since srun serves the same purpose as the PyTorch launcher. That explains the raised error.
However, the conflicting variables are being set somewhere, and it is not easy to track down. We have to find which script sets them.
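To help track this down, a diagnostic like the following could be dropped into the job script before ignite initializes (the exact set of variables ignite checks is an assumption here; adjust to your version):

```python
import os

# Variables typically set by the PyTorch launcher (torch.distributed.launch):
torch_launcher_vars = ["RANK", "LOCAL_RANK", "WORLD_SIZE", "MASTER_ADDR", "MASTER_PORT"]
# Variables set by SLURM / srun:
slurm_vars = ["SLURM_JOB_ID", "SLURM_PROCID", "SLURM_LOCALID", "SLURM_NTASKS"]

# If both families are present, ignite's native comp model treats the
# configuration as ambiguous; printing them per process shows which script
# or wrapper is responsible for the leftover variables.
for name in torch_launcher_vars + slurm_vars:
    print(f"{name}={os.environ.get(name)}")
```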