Distributed program hangs in SLURM
🐛 Bug description
Hi @vfdev-5,
We got an urgent issue from MONAI and Clara users: a distributed program hangs on the NVIDIA NSL-B platform, which is based on SLURM. You can reproduce the issue with this simple example: https://github.com/Project-MONAI/tutorials/blob/master/acceleration/distributed_training/unet_training_workflows.py

It hangs when creating the ignite Accuracy metric, which seems related to this line: https://github.com/pytorch/ignite/blob/v0.4.4.post1/ignite/distributed/comp_models/native.py#L107

After removing the Accuracy metric from the example, it hangs when training starts and has not timed out yet. Please note that this example runs successfully with ignite 0.4.2. We also tried a pure PyTorch dist example in the same hardware and software environment, and it runs successfully: https://github.com/Project-MONAI/tutorials/blob/master/acceleration/distributed_training/unet_training_ddp.py
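For quick reference, here is a condensed sketch of the failing pattern, assuming the structure of the linked MONAI workflow script (the function name and backend here are illustrative, not the exact script):

```python
import ignite.distributed as idist
from ignite.metrics import Accuracy

def training(local_rank):
    # With ignite 0.4.4 launched via srun, constructing the metric reportedly
    # hangs here: Metric.__init__ triggers a sync inside
    # ignite/distributed/comp_models/native.py.
    metric = Accuracy()
    # ... build model, optimizer, loaders, and run the trainer ...

if __name__ == "__main__":
    # idist.Parallel picks up the SLURM environment and initializes the
    # process group before calling `training` on each process.
    with idist.Parallel(backend="nccl") as parallel:
        parallel.run(training)
```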
Could you please help analyze the cause and give some advice? This is currently blocking our cooperation with another team.
Thanks in advance.
Environment
- PyTorch Version (e.g., 1.4): 1.8.1
- Ignite Version (e.g., 0.3.0): 0.4.4
- OS (e.g., Linux): Ubuntu 18.04
- How you installed Ignite (conda, pip, source): pip
Hi @vfdev-5 and @sdesrozis,
Thanks for the prompt response! I removed the first idist.sync and changed the second one to idist.barrier, as sketched below. Things are running fine now.
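For reference, a sketch of the change described above (the actual call sites are in the MONAI workflow script, so the surrounding context is assumed):

```python
import ignite.distributed as idist

# Before: two explicit re-syncs of ignite's computation model; under SLURM
# the first one reportedly hung.
# idist.sync()   # first call: removed
# idist.sync()   # second call: replaced, see below

# After: a plain collective barrier aligns all processes without re-deriving
# the distributed configuration.
idist.barrier()
```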
@hw-ju Thanks for the report.
It seems that your environment contains some variables usually set by the PyTorch launcher. In the current ignite distributed module, the SLURM and PyTorch launchers are mutually exclusive, since srun serves the same purpose as the PyTorch launcher. That explains the raised error.
However, the conflicting variables are being set somewhere, and it is not easy to track down. We have to find which script sets them.
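To help track this down, a diagnostic like the following could be dropped into the job script before ignite initializes (the exact set of variables ignite checks is an assumption here; adjust to your version):

```python
import os

# Variables typically set by the PyTorch launcher (torch.distributed.launch):
torch_launcher_vars = ["RANK", "LOCAL_RANK", "WORLD_SIZE", "MASTER_ADDR", "MASTER_PORT"]
# Variables set by SLURM / srun:
slurm_vars = ["SLURM_JOB_ID", "SLURM_PROCID", "SLURM_LOCALID", "SLURM_NTASKS"]

# If both families are present, ignite's native comp model treats the
# configuration as ambiguous; printing them per process shows which script
# or wrapper is responsible for the leftover variables.
for name in torch_launcher_vars + slurm_vars:
    print(f"{name}={os.environ.get(name)}")
```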