Can't get `Parallel.run` with NCCL in SLURM environment to work

🐛 Bug description

Dispatching a distributed multi-node/multi-GPU script via SLURM sbatch raises a RuntimeError.

To reproduce:

Slurm invocation:

 OMP_NUM_THREADS=1 sbatch -N1 -n2 -p gpu --gres=gpu:v100-32gb:02 --wrap "python -u test_dist.py run --nnodes=1 --nproc_per_node=2"

test_dist.py

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import os
import subprocess  # used below to resolve the SLURM node list

import fire

import torch
import ignite
import ignite.distributed as idist

def run_diagnostic(local_rank):
    prefix = f"{local_rank}) "
    print(f"{prefix}Rank={idist.get_rank()}")
    print(f"{prefix}torch version: {torch.version.__version__}")
    print(f"{prefix}torch git version: {torch.version.git_version}")
    
    if torch.cuda.is_available():
        print(f"{prefix}torch version cuda: {torch.version.cuda}")
        print(f"{prefix}number of cuda devices: {torch.cuda.device_count()}")

        for i in range(torch.cuda.device_count()):
            print(f"{prefix}\t- device {i}: {torch.cuda.get_device_properties(i)}")
    else:
        print("{prefix}no cuda available")


    if "SLURM_JOBID" in os.environ:
        for k in ["SLURM_PROCID", "SLURM_LOCALID", "SLURM_NTASKS", "SLURM_JOB_NODELIST"]:
            print(f"{k}: {os.environ[k]}")
        
        if local_rank == 0:
            hostnames = subprocess.check_output(["scontrol", "show", "hostnames", os.environ["SLURM_JOB_NODELIST"]])
            print(f"hostnames: {hostnames}")


def run(**spawn_kwargs):
    with idist.Parallel(backend='nccl', **spawn_kwargs) as parallel:
        parallel.run(run_diagnostic)

if __name__ == '__main__':
    fire.Fire({'run': run})

Error message (and logged output):

2021-04-25 19:06:35,565 ignite.distributed.launcher.Parallel INFO: Initialized distributed launcher with backend: 'nccl'
2021-04-25 19:06:35,566 ignite.distributed.launcher.Parallel INFO: - Parameters to spawn processes: 
	nproc_per_node: 2
	nnodes: 1
	node_rank: 0
2021-04-25 19:06:35,566 ignite.distributed.launcher.Parallel INFO: Spawn function '<function run_diagnostic at 0x1555554741e0>' in 2 processes
Traceback (most recent call last):
  File "test_dist.py", line 45, in <module>
    fire.Fire({'run': run})
  File "path_to_python_env/lib/python3.7/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "path_to_python_env/lib/python3.7/site-packages/fire/core.py", line 471, in _Fire
    target=component.__name__)
  File "path_to_python_env/lib/python3.7/site-packages/fire/core.py", line 681, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "test_dist.py", line 42, in run
    parallel.run(run_diagnostic)
  File "path_to_python_env/lib/python3.7/site-packages/ignite/distributed/launcher.py", line 309, in run
    idist.spawn(self.backend, func, args=args, kwargs_dict=kwargs, **self._spawn_params)
  File "path_to_python_env/lib/python3.7/site-packages/ignite/distributed/utils.py", line 324, in spawn
    fn, args=args, kwargs_dict=kwargs_dict, nproc_per_node=nproc_per_node, backend=backend, **kwargs
  File "path_to_python_env/lib/python3.7/site-packages/ignite/distributed/comp_models/native.py", line 380, in spawn
    **spawn_kwargs,
  File "path_to_python_env/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "path_to_python_env/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "path_to_python_env/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "path_to_python_env/lib/python3.7/site-packages/ignite/distributed/comp_models/native.py", line 323, in _dist_worker_task_fn
    backend, init_method=init_method, world_size=arg_world_size, rank=arg_rank, **kw
  File "path_to_python_env/lib/python3.7/site-packages/ignite/distributed/comp_models/native.py", line 72, in create_from_backend
    backend=backend, init_method=init_method, world_size=world_size, rank=rank, **kwargs
  File "path_to_python_env/lib/python3.7/site-packages/ignite/distributed/comp_models/native.py", line 93, in __init__
    backend, timeout=timeout, init_method=init_method, world_size=world_size, rank=rank, **kwargs
  File "path_to_python_env/lib/python3.7/site-packages/ignite/distributed/comp_models/native.py", line 124, in _create_from_backend
    dist.init_process_group(backend, init_method=init_method, **init_pg_kwargs)
  File "path_to_python_env/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 500, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "path_to_python_env/lib/python3.7/site-packages/torch/distributed/rendezvous.py", line 190, in _env_rendezvous_handler
    store = TCPStore(master_addr, master_port, world_size, start_daemon, timeout)
RuntimeError: Address already in use

Expected behavior

I would like the code to run run_diagnostic for each local_rank in the allocation. Right now it does not reach that point: parallel configuration seems to complete on the master process, but it fails in the auxiliary processes.

Environment

  • PyTorch Version (e.g., 1.4): 1.8.1+cu102
  • Ignite Version: was using 0.4.4, also tried on 0.5.0.dev20210423
  • OS: Linux (CentOS)
  • How you installed Ignite (conda, pip, source): pip
  • Python version: 3.7.3
  • Any other relevant information:
  1. I’ve checked that this isn’t a problem of zombie processes lingering on the compute nodes from previous failed runs.
  2. There is a dist.barrier() after init_process_group() in ignite/distributed/comp_models/native.py (Parallel._create_from_backend). Clearly init_process_group() completes for at least one process, but when syncing via barrier(), the other auxiliary processes fail.
  3. Not sure if this is handled elsewhere (pretty sure not) or even relevant, but in ignite.distributed.comp_models.native.Parallel.setup_env_vars, upon detecting a SLURM environment, the code calls self._setup_env_in_slurm and returns without setting self._local_rank, self._master_addr, and self._master_port (as it would otherwise); see the sketch after this list.
  4. Thanks!
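
For what it's worth, here is a minimal, hypothetical sketch (not the actual ignite code path) of how the RuntimeError above can arise: with the TCP rendezvous used by init_process_group, exactly one process may host the TCPStore server on MASTER_ADDR:MASTER_PORT, so if two processes both end up acting as the store host (e.g. because rank or master port are resolved inconsistently between the SLURM environment and the spawn path), the second bind fails with "Address already in use".

from datetime import timedelta

from torch.distributed import TCPStore

# First process that believes it is the rendezvous master binds the port.
store_a = TCPStore("127.0.0.1", 29500, 2, True, timedelta(seconds=30))

# A second process that also believes it must host the store fails here with
# "RuntimeError: Address already in use", as in the traceback above.
store_b = TCPStore("127.0.0.1", 29500, 2, True, timedelta(seconds=30))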

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 23 (12 by maintainers)

Top GitHub Comments

2 reactions
djberenberg commented, Apr 28, 2021

@sdesrozis re: docs - Personally I would have benefited greatly from a clear explanation of the interaction between SLURM and ignite and how to dispatch jobs on such a system. I'm eagerly awaiting the upcoming blog post you mentioned a few days ago. Thanks again for working on this so much over the last few days; it really has helped a lot. The why-ignite package was also very helpful!

I also think the usage you suggested is good because it has the least deviation from any other slurm script, meaning there is a low barrier to entry, and it doesn't require as acute an understanding of the initialization process of distributed jobs as torch.distributed.[launch|spawn] does.
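
For illustration, a minimal sketch of what that style of launch might look like, assuming ignite's SLURM auto-detection: one srun task per GPU, and Parallel created without any spawn parameters so that it reads the SLURM_* variables itself (the script name and srun line below are hypothetical; the supported recipe is what the docs and blog post describe).

# Hypothetical sketch, launched e.g. with
#   srun --ntasks-per-node=2 --gres=gpu:2 python -u test_dist_slurm.py
# so that each srun task is already one worker process.
import ignite.distributed as idist


def training(local_rank):
    # Rank, world size and device are derived from SLURM_PROCID, SLURM_NTASKS, SLURM_LOCALID.
    print(f"rank {idist.get_rank()} / {idist.get_world_size()} on {idist.device()}")


if __name__ == "__main__":
    # No nnodes / nproc_per_node here; SLURM already created the processes.
    with idist.Parallel(backend="nccl") as parallel:
        parallel.run(training)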

1 reaction
sdesrozis commented, Apr 28, 2021

@fco-dv we definitely need the blog post about idist 😉
