
distributed program hangs in SLURM

See original GitHub issue

🐛 Bug description

Hi @vfdev-5,

We got an urgent issue from MONAI and Clara users: a distributed program hangs on the NVIDIA NSL-B platform, which is based on SLURM. You can reproduce the issue with this simple example: https://github.com/Project-MONAI/tutorials/blob/master/acceleration/distributed_training/unet_training_workflows.py

It hangs when creating the ignite Accuracy metric, which seems related to this line: https://github.com/pytorch/ignite/blob/v0.4.4.post1/ignite/distributed/comp_models/native.py#L107

After removing the Accuracy metric from the example, it hangs when training starts and has not timed out yet. Please note that this example runs successfully with ignite 0.4.2. We also tried a pure PyTorch dist example in the same hardware and software environment, and it runs successfully: https://github.com/Project-MONAI/tutorials/blob/master/acceleration/distributed_training/unet_training_ddp.py
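For reference, a condensed sketch of the failing pattern (hypothetical, not the exact tutorial script) looks like this:

```python
# Condensed, hypothetical sketch of the failing pattern: launched under SLURM
# with srun, the hang is reported at the point where the metric is created.
import ignite.distributed as idist
from ignite.metrics import Accuracy


def training(local_rank):
    # With ignite 0.4.4, creating the metric touches the distributed
    # comp model (native.py#L107 above); this is where the hang appears.
    metric = Accuracy()
    print(f"rank {idist.get_rank()}: metric created: {metric}")


if __name__ == "__main__":
    with idist.Parallel(backend="nccl") as parallel:
        parallel.run(training)
```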

Could you please help analyze the cause and give us some advice? It is currently blocking our cooperation with another team.

Thanks in advance.

Environment

  • PyTorch Version: 1.8.1
  • Ignite Version: 0.4.4
  • OS: Ubuntu 18.04
  • How you installed Ignite: pip

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 60 (21 by maintainers)

Top GitHub Comments

3 reactions
YuanTingHsieh commented, Jun 10, 2021

Hi @vfdev-5 and @sdesrozis,

Thanks for the prompt response! I removed the first idist.sync and changed the second one to idist.barrier. Things are running fine now.
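In code terms, the change was roughly the following (a sketch assuming a script with two explicit idist.sync calls, as described above):

```python
import ignite.distributed as idist

# Before (hypothetical excerpt): two explicit synchronizations
# idist.sync()        # first sync  -- removed entirely
# ...
# idist.sync()        # second sync -- replaced below

# After: a single explicit rendezvous point across all SLURM processes
idist.barrier()
```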

2 reactions
sdesrozis commented, Sep 16, 2021

@hw-ju Thanks for the report.

It seems that your environment contains some variables that are usually set by the PyTorch launcher. In the current ignite distributed module, SLURM and the PyTorch launcher are mutually exclusive, since srun plays the same role as the PyTorch launcher. That explains the raised error.

However, the conflicting variables are being set somewhere and it's not easy to track down where. In any case, we have to find which script does the setting.
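A quick way to see which launcher set what is to dump both families of variables from every rank (a diagnostic sketch; the names below are the standard ones, and your submission scripts may export others):

```python
# Diagnostic sketch: print the PyTorch-launcher and SLURM variables seen by
# each process, to spot which script exported the conflicting ones.
import os

PYTORCH_LAUNCHER_VARS = ("MASTER_ADDR", "MASTER_PORT", "RANK", "WORLD_SIZE", "LOCAL_RANK")
SLURM_VARS = ("SLURM_JOB_ID", "SLURM_PROCID", "SLURM_LOCALID", "SLURM_NTASKS", "SLURM_JOB_NODELIST")

for name in PYTORCH_LAUNCHER_VARS + SLURM_VARS:
    value = os.environ.get(name)
    if value is not None:
        print(f"{name}={value}")
```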

Read more comments on GitHub >

Top Results From Across the Web

Training stuck running on the SLURM cluster with multiple ...
Bug I try to train a model across multiple nodes on a slurm cluster ... not set world_size and rank in torch.distributed.init_process_group, ...
Read more >
pytorch distributed hang on dist_process_init slurm - You.com
Issue 1: It will hang unless you pass in nprocs=world_size to mp.spawn (). In other words, it's waiting for the "whole world" to...
Read more >
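For reference, the pattern that snippet describes looks roughly like this (an assumed sketch, not code from the original issue): spawn exactly world_size processes and pass world_size and rank to init_process_group, since the call blocks until every rank has joined.

```python
# Assumed sketch of the spawn pattern the snippet refers to.
import os
import torch.distributed as dist
import torch.multiprocessing as mp


def worker(rank, world_size):
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    # init_process_group blocks until all world_size ranks have joined,
    # so a wrong nprocs (a missing rank) shows up as an indefinite hang.
    dist.init_process_group("gloo", rank=rank, world_size=world_size)
    dist.barrier()
    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = 2
    mp.spawn(worker, args=(world_size,), nprocs=world_size)
```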
Frequently Asked Questions - Slurm Workload Manager
Why does squeue (and "scontrol show jobid") sometimes not display a job's estimated start time? How can I run an Ansys program with...
Read more >
Re: [slurm-users] 'srun hostname' hangs on the command line
... slurm-users@lists.schedmd.com Subject: [slurm-users] 'srun hostname' hangs on the command line. Hi All, Verbose mode doesn't show much.
Read more >
Slurm Hangs Indefinitely - Google Groups
However, all Slurm commands seem to hang indefinitely: [michaelk@fe2 test]$ sinfo. Nothing happens... I also tried creating and submitting a test script.
Read more >
