Unable to create DiskSaver when program launched with torch.distributed.launcher
🐛 Bug description
As mentioned in this issue in MONAI, I tried to run this tutorial code with torch.distributed.launcher. However, the program froze while instantiating the CheckpointSaver. The reason is that ignite's DiskSaver cannot be created when the program is launched with torch.distributed.launcher (I am using SLURM). I also noticed that this might be caused by the call to get_rank() in the one_rank_only decorator, which is used in the definition of DiskSaver: https://github.com/pytorch/ignite/blob/d16d15efbbbfc476702e91f3ab2bc757b839be26/ignite/distributed/utils.py#L595
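To illustrate the kind of deadlock I suspect (this is just my reading of the code, so treat it as an assumption): if get_rank() inside one_rank_only lazily triggers a collective operation, then calling it only on rank 0 is equivalent to the classic pattern below, where a collective is reached by a single rank and therefore never completes.

```python
import torch.distributed as dist

def guarded_collective(rank_zero_only_work):
    # Toy illustration, not ignite code: a collective op (barrier,
    # all_gather, ...) only completes when every rank calls it.
    if dist.get_rank() == 0:
        dist.barrier()          # only rank 0 reaches this -> hangs forever
        rank_zero_only_work()   # never executed
    # Moving the collective outside the guard (all ranks call it) would
    # let the program proceed.
```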
I also did a simple experiment to verify this. I launched the following script with srun python -m torch.distributed.launcher --nproc_per_node=4 --nnodes=1 script.py and found that the program froze when creating the DiskSaver:
```python
import torch.distributed as dist
from ignite.handlers import DiskSaver
from argparse import ArgumentParser


def create_disk_saver(args):
    dist.init_process_group(backend='nccl', init_method='env://')
    if dist.get_rank() == 0:
        print('building DiskSaver')
        # The program hangs here: rank 0 never gets past this line.
        disk_saver = DiskSaver(dirname='./runs/')
        print('DiskSaver built')
    dist.destroy_process_group()


def main():
    parser = ArgumentParser()
    parser.add_argument('--local_rank', type=int)
    args = parser.parse_args()
    create_disk_saver(args)


if __name__ == '__main__':
    main()
```
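For completeness, here is a workaround sketch based on my assumption about the cause (not a confirmed fix): construct the DiskSaver on every rank and let one_rank_only restrict the actual writes to rank 0, and/or call idist.sync() on all ranks right after init_process_group so that any lazy synchronization happens while every rank is still participating.

```python
import torch.distributed as dist
import ignite.distributed as idist
from ignite.handlers import DiskSaver


def create_disk_saver_collectively():
    dist.init_process_group(backend='nccl', init_method='env://')

    # Assumption: force ignite's lazy distributed-config detection to run
    # on every rank at the same time, before any rank-0-only code.
    idist.sync()

    # Constructed on every rank, with no `if rank == 0` guard; the
    # one_rank_only decorator is expected to keep disk writes on rank 0.
    disk_saver = DiskSaver(dirname='./runs/')

    dist.destroy_process_group()
    return disk_saver
```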
I would much appreciate it if you could fix this. I prefer launching the program with torch.distributed.launcher over the ignite.distributed.Parallel context manager, as it has fewer issues with the SLURM environment.
Environment
- PyTorch Version (e.g., 1.4): 1.8
- Ignite Version (e.g., 0.3.0): 0.4.4
- OS (e.g., Linux): Linux
- How you installed Ignite (conda, pip, source): pip
- Python version: 3.8
- Any other relevant information:
@sandylaker I tried a few runs on the cluster of my company.

1 - using srun and torch.distributed.launch without ignite.distributed.Parallel: script, README

NOTE: I removed some mandatory options like -J, -p, --mem, etc. related to my cluster's own configuration.

2 - using srun without torch.distributed.launch, with ignite.distributed.Parallel: script, README

3 - using srun and torch.distributed.launch with ignite.distributed.Parallel: script, README
One script, both usages:
- On a computing node, use torch.distributed.launch.
- On the frontend, use srun (or sbatch).

HTH
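To make the "one script, both usages" point concrete, here is a minimal sketch (my own illustration, not the exact scripts linked above) of what such a script can look like with the ignite 0.4 idist.Parallel API:

```python
import ignite.distributed as idist
from ignite.handlers import DiskSaver


def training(local_rank, config):
    # idist reports the same rank/world size whether the processes were
    # spawned by torch.distributed.launch or directly by srun.
    print(f"rank {idist.get_rank()} / world size {idist.get_world_size()}")

    # Created collectively on every rank; only rank 0 writes to disk.
    disk_saver = DiskSaver(dirname=config["output_dir"])


if __name__ == "__main__":
    config = {"output_dir": "./runs/"}
    # With backend="nccl", Parallel picks up the distributed configuration
    # set by the launcher (torch.distributed.launch env vars or SLURM ones).
    with idist.Parallel(backend="nccl") as parallel:
        parallel.run(training, config)
```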
@vfdev-5 Yes, that's what I mentioned when looking at the code a few days ago. However, you explained it better 🙂
The parallel/sequential sections remain a tricky (and classical) issue in parallel computing. Having to manage the two behaviours (a collective call similar to a reduction, or a per-processor guard) makes the code more complicated. One idea would be to define handlers only collectively, e.g. something like the sketch below; we would avoid the if clauses and it would be simpler.
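For example (a hypothetical sketch, only meant to illustrate that collective-only style), the handler would be created and attached on every rank, and the rank guard would live inside the handler itself:

```python
from ignite.engine import Engine, Events
from ignite.handlers import Checkpoint, DiskSaver


def build_trainer(model):
    def train_step(engine, batch):
        # Placeholder training step, just for illustration.
        return 0.0

    trainer = Engine(train_step)

    # Collective style: every rank builds and attaches the checkpoint
    # handler, with no `if rank == 0` in user code; the handler (via
    # one_rank_only) decides that only rank 0 actually touches the disk.
    checkpoint = Checkpoint({"model": model}, DiskSaver(dirname="./runs/"), n_saved=2)
    trainer.add_event_handler(Events.EPOCH_COMPLETED, checkpoint)
    return trainer
```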
That said, I am not sure whether the bug label should be added to this issue.
One last thing: I didn't understand how idist.sync() would help; it doesn't remove the collective code section, does it?