Can't get `Parallel.run` with NCCL in SLURM environment to work

🐛 Bug description

Dispatching a distributed multi-node/multi-GPU script via SLURM sbatch raises a RuntimeError.

To reproduce:

Slurm invocation:

 OMP_NUM_THREADS=1 sbatch -N1 -n2 -p gpu --gres=gpu:v100-32gb:02 --wrap "python -u test_dist.py run --nnodes=1 --nproc_per_node=2"

test_dist.py

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import os
import subprocess  # used below to resolve the SLURM node list

import fire

import torch
import ignite
import ignite.distributed as idist

def run_diagnostic(local_rank):
    prefix = f"{local_rank}) "
    print(f"{prefix}Rank={idist.get_rank()}")
    print(f"{prefix}torch version: {torch.version.__version__}")
    print(f"{prefix}torch git version: {torch.version.git_version}")
    
    if torch.cuda.is_available():
        print(f"{prefix}torch version cuda: {torch.version.cuda}")
        print(f"{prefix}number of cuda devices: {torch.cuda.device_count()}")

        for i in range(torch.cuda.device_count()):
            print(f"{prefix}\t- device {i}: {torch.cuda.get_device_properties(i)}")
    else:
        print("{prefix}no cuda available")


    if "SLURM_JOBID" in os.environ:
        for k in ["SLURM_PROCID", "SLURM_LOCALID", "SLURM_NTASKS", "SLURM_JOB_NODELIST"]:
            print(f"{k}: {os.environ[k]}")
        
        if local_rank == 0:
            hostnames = subprocess.check_output(["scontrol", "show", "hostnames", os.environ["SLURM_JOB_NODELIST"]])
            print(f"hostnames: {hostnames}")


def run(**spawn_kwargs):
    with idist.Parallel(backend='nccl', **spawn_kwargs) as parallel:
        parallel.run(run_diagnostic)

if __name__ == '__main__':
    fire.Fire({'run': run})

Error message (and logged output):

2021-04-25 19:06:35,565 ignite.distributed.launcher.Parallel INFO: Initialized distributed launcher with backend: 'nccl'
2021-04-25 19:06:35,566 ignite.distributed.launcher.Parallel INFO: - Parameters to spawn processes: 
	nproc_per_node: 2
	nnodes: 1
	node_rank: 0
2021-04-25 19:06:35,566 ignite.distributed.launcher.Parallel INFO: Spawn function '<function run_diagnostic at 0x1555554741e0>' in 2 processes
Traceback (most recent call last):
  File "test_dist.py", line 45, in <module>
    fire.Fire({'run': run})
  File "path_to_python_env/lib/python3.7/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "path_to_python_env/lib/python3.7/site-packages/fire/core.py", line 471, in _Fire
    target=component.__name__)
  File "path_to_python_env/lib/python3.7/site-packages/fire/core.py", line 681, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "test_dist.py", line 42, in run
    parallel.run(run_diagnostic)
  File "path_to_python_env/lib/python3.7/site-packages/ignite/distributed/launcher.py", line 309, in run
    idist.spawn(self.backend, func, args=args, kwargs_dict=kwargs, **self._spawn_params)
  File "path_to_python_env/lib/python3.7/site-packages/ignite/distributed/utils.py", line 324, in spawn
    fn, args=args, kwargs_dict=kwargs_dict, nproc_per_node=nproc_per_node, backend=backend, **kwargs
  File "path_to_python_env/lib/python3.7/site-packages/ignite/distributed/comp_models/native.py", line 380, in spawn
    **spawn_kwargs,
  File "path_to_python_env/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "path_to_python_env/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "path_to_python_env/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "path_to_python_env/lib/python3.7/site-packages/ignite/distributed/comp_models/native.py", line 323, in _dist_worker_task_fn
    backend, init_method=init_method, world_size=arg_world_size, rank=arg_rank, **kw
  File "path_to_python_env/lib/python3.7/site-packages/ignite/distributed/comp_models/native.py", line 72, in create_from_backend
    backend=backend, init_method=init_method, world_size=world_size, rank=rank, **kwargs
  File "path_to_python_env/lib/python3.7/site-packages/ignite/distributed/comp_models/native.py", line 93, in __init__
    backend, timeout=timeout, init_method=init_method, world_size=world_size, rank=rank, **kwargs
  File "path_to_python_env/lib/python3.7/site-packages/ignite/distributed/comp_models/native.py", line 124, in _create_from_backend
    dist.init_process_group(backend, init_method=init_method, **init_pg_kwargs)
  File "path_to_python_env/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 500, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "path_to_python_env/lib/python3.7/site-packages/torch/distributed/rendezvous.py", line 190, in _env_rendezvous_handler
    store = TCPStore(master_addr, master_port, world_size, start_daemon, timeout)
RuntimeError: Address already in use

Expected behavior

I would like the code to run run_diagnostic for each local_rank in the allocation. Right now it does not reach that point: parallel configuration seems to complete on the master process, but it fails in the auxiliary processes.

Environment

  • PyTorch Version (e.g., 1.4): 1.8.1+cu102
  • Ignite Version: was using 0.4.4, also tried on 0.5.0.dev20210423
  • OS: Linux (CentOS)
  • How you installed Ignite (conda, pip, source): pip
  • Python version: 3.7.3
  • Any other relevant information:
  1. I’ve checked that this isn’t a problem of zombie processes lingering on the compute nodes from previous failed runs.
  2. There is a dist.barrier() after init_process_group() in ignite/distributed/comp_models/native.py (Parallel._create_from_backend). Clearly init_process_group() completes for at least one process, but when syncing via barrier(), the other auxiliary processes fail.
  3. Not sure if this is handled elsewhere (pretty sure not) or even relevant, but in ignite.distributed.comp_models.native.Parallel.setup_env_vars, upon detecting a SLURM environment, the code calls self._setup_env_in_slurm and returns without setting self._local_rank, self._master_addr, and self._master_port (as it would otherwise); see the sketch after this list.
  4. Thanks!
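
For what it's worth, here is a minimal, hypothetical sketch (not the actual ignite code path) of how the RuntimeError above can arise: with the TCP rendezvous used by init_process_group, exactly one process may host the TCPStore server on MASTER_ADDR:MASTER_PORT, so if two processes both end up acting as the store host (e.g. because rank or master port are resolved inconsistently between the SLURM environment and the spawn path), the second bind fails with "Address already in use".

from datetime import timedelta

from torch.distributed import TCPStore

# First process that believes it is the rendezvous master binds the port.
store_a = TCPStore("127.0.0.1", 29500, 2, True, timedelta(seconds=30))

# A second process that also believes it must host the store fails here with
# "RuntimeError: Address already in use", as in the traceback above.
store_b = TCPStore("127.0.0.1", 29500, 2, True, timedelta(seconds=30))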

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 23 (12 by maintainers)

Top GitHub Comments

2 reactions
djberenberg commented, Apr 28, 2021

@sdesrozis re: docs - Personally I would have benefited greatly from a clear explanation of the interaction between SLURM and ignite and how to dispatch jobs on such a system. I'm eagerly awaiting the upcoming blog post you mentioned a few days ago. Thanks again for working on this so much over the last few days; it really has helped a lot. The why-ignite package was also very helpful!

I also think the usage you suggested is good because it has the least deviation from any other slurm script, meaning there is a low barrier to entry, and it doesn't require as acute an understanding of the initialization process of distributed jobs as torch.distributed.[launch|spawn] does.
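
For illustration, a minimal sketch of what that style of launch might look like, assuming ignite's SLURM auto-detection: one srun task per GPU, and Parallel created without any spawn parameters so that it reads the SLURM_* variables itself (the script name and srun line below are hypothetical; the supported recipe is what the docs and blog post describe).

# Hypothetical sketch, launched e.g. with
#   srun --ntasks-per-node=2 --gres=gpu:2 python -u test_dist_slurm.py
# so that each srun task is already one worker process.
import ignite.distributed as idist


def training(local_rank):
    # Rank, world size and device are derived from SLURM_PROCID, SLURM_NTASKS, SLURM_LOCALID.
    print(f"rank {idist.get_rank()} / {idist.get_world_size()} on {idist.device()}")


if __name__ == "__main__":
    # No nnodes / nproc_per_node here; SLURM already created the processes.
    with idist.Parallel(backend="nccl") as parallel:
        parallel.run(training)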

1 reaction
sdesrozis commented, Apr 28, 2021

@fco-dv we definitely need the blog post about idist 😉
