Can't get `Parallel.run` with NCCL in SLURM environment to work
🐛 Bug description
Dispatching a distributed multi-node/multi-GPU script via SLURM sbatch raises a RuntimeError.
To reproduce:
SLURM invocation:
OMP_NUM_THREADS=1 sbatch -N1 -n2 -p gpu --gres=gpu:v100-32gb:02 --wrap "python -u test_dist.py run --nnodes=1 --nproc_per_node=2"
test_dist.py
Python script
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import os
import subprocess  # used below for the scontrol call

import fire
import torch

import ignite
import ignite.distributed as idist


def run_diagnostic(local_rank):
    prefix = f"{local_rank}) "
    print(f"{prefix}Rank={idist.get_rank()}")
    print(f"{prefix}torch version: {torch.version.__version__}")
    print(f"{prefix}torch git version: {torch.version.git_version}")

    if torch.cuda.is_available():
        print(f"{prefix}torch version cuda: {torch.version.cuda}")
        print(f"{prefix}number of cuda devices: {torch.cuda.device_count()}")
        for i in range(torch.cuda.device_count()):
            print(f"{prefix}\t- device {i}: {torch.cuda.get_device_properties(i)}")
    else:
        print(f"{prefix}no cuda available")

    if "SLURM_JOBID" in os.environ:
        for k in ["SLURM_PROCID", "SLURM_LOCALID", "SLURM_NTASKS", "SLURM_JOB_NODELIST"]:
            print(f"{k}: {os.environ[k]}")
        if local_rank == 0:
            hostnames = subprocess.check_output(["scontrol", "show", "hostnames", os.environ["SLURM_JOB_NODELIST"]])
            print(f"hostnames: {hostnames}")


def run(**spawn_kwargs):
    with idist.Parallel(backend='nccl', **spawn_kwargs) as parallel:
        parallel.run(run_diagnostic)


if __name__ == '__main__':
    fire.Fire({'run': run})
Error message (and logged output):
2021-04-25 19:06:35,565 ignite.distributed.launcher.Parallel INFO: Initialized distributed launcher with backend: 'nccl'
2021-04-25 19:06:35,566 ignite.distributed.launcher.Parallel INFO: - Parameters to spawn processes:
nproc_per_node: 2
nnodes: 1
node_rank: 0
2021-04-25 19:06:35,566 ignite.distributed.launcher.Parallel INFO: Spawn function '<function run_diagnostic at 0x1555554741e0>' in 2 processes
Traceback (most recent call last):
File "test_dist.py", line 45, in <module>
fire.Fire({'run': run})
File "path_to_python_env/lib/python3.7/site-packages/fire/core.py", line 141, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "path_to_python_env/lib/python3.7/site-packages/fire/core.py", line 471, in _Fire
target=component.__name__)
File "path_to_python_env/lib/python3.7/site-packages/fire/core.py", line 681, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "test_dist.py", line 42, in run
parallel.run(run_diagnostic)
File "path_to_python_env/lib/python3.7/site-packages/ignite/distributed/launcher.py", line 309, in run
idist.spawn(self.backend, func, args=args, kwargs_dict=kwargs, **self._spawn_params)
File "path_to_python_env/lib/python3.7/site-packages/ignite/distributed/utils.py", line 324, in spawn
fn, args=args, kwargs_dict=kwargs_dict, nproc_per_node=nproc_per_node, backend=backend, **kwargs
File "path_to_python_env/lib/python3.7/site-packages/ignite/distributed/comp_models/native.py", line 380, in spawn
**spawn_kwargs,
File "path_to_python_env/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
while not context.join():
File "path_to_python_env/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 150, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:
-- Process 1 terminated with the following error:
Traceback (most recent call last):
File "path_to_python_env/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
fn(i, *args)
File "path_to_python_env/lib/python3.7/site-packages/ignite/distributed/comp_models/native.py", line 323, in _dist_worker_task_fn
backend, init_method=init_method, world_size=arg_world_size, rank=arg_rank, **kw
File "path_to_python_env/lib/python3.7/site-packages/ignite/distributed/comp_models/native.py", line 72, in create_from_backend
backend=backend, init_method=init_method, world_size=world_size, rank=rank, **kwargs
File "path_to_python_env/lib/python3.7/site-packages/ignite/distributed/comp_models/native.py", line 93, in __init__
backend, timeout=timeout, init_method=init_method, world_size=world_size, rank=rank, **kwargs
File "path_to_python_env/lib/python3.7/site-packages/ignite/distributed/comp_models/native.py", line 124, in _create_from_backend
dist.init_process_group(backend, init_method=init_method, **init_pg_kwargs)
File "path_to_python_env/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 500, in init_process_group
store, rank, world_size = next(rendezvous_iterator)
File "path_to_python_env/lib/python3.7/site-packages/torch/distributed/rendezvous.py", line 190, in _env_rendezvous_handler
store = TCPStore(master_addr, master_port, world_size, start_daemon, timeout)
RuntimeError: Address already in use
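For what it's worth, the same error message can be produced locally when two spawned workers both believe they are rank 0 of the same process group and therefore both try to create the master TCPStore on the same port. The following is only a hedged sketch of that failure mode (it uses the gloo backend and a hard-coded port so it runs on a single machine without GPUs); it is not a claim about what ignite's SLURM code path actually does here:

# repro_sketch.py - hypothetical, minimal reproduction of the error message only
import os

import torch.distributed as dist
import torch.multiprocessing as mp


def worker(_index):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    # Both children claim rank 0, so both try to start the master TCPStore on
    # MASTER_PORT; the second one fails with "RuntimeError: Address already in use".
    dist.init_process_group("gloo", rank=0, world_size=2)


if __name__ == "__main__":
    mp.spawn(worker, nprocs=2)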
Expected behavior
I would like the code to run run_diagnostic for each local_rank in the allocation. Right now it does not reach that point: parallel configuration seems to complete on the master process but fails in the auxiliary processes.
Environment
- PyTorch Version (e.g., 1.4): 1.8.1+cu102
- Ignite Version: was using 0.4.4, also tried 0.5.0.dev20210423
- OS: Linux (CentOS)
- How you installed Ignite (conda, pip, source): pip
- Python version: 3.7.3
- Any other relevant information:
  - I’ve checked that this isn’t a problem of zombie processes lingering on the compute nodes from previous failed runs.
  - There is a dist.barrier() after init_process_group() in ignite/distributed/comp_models/native.py (Parallel._create_from_backend). Clearly init_process_group() completes for at least one process, but when syncing via barrier(), the other auxiliary processes fail.
  - Not sure if this is handled elsewhere (pretty sure not) or even relevant, but in ignite.distributed.comp_models.native.Parallel.setup_env_vars, upon discovering that the environment is a SLURM environment, the system calls self._setup_env_in_slurm and returns without setting self._local_rank, self._master_addr, and self._master_port (as it would otherwise). A small diagnostic sketch for dumping these variables is included after this list.
  - Thanks!
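As an extra data point, it may help to dump the rendezvous-related environment variables from inside the same allocation. The SLURM variable names below are the ones already printed by test_dist.py; MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE are the standard names read by torch.distributed's env:// rendezvous, and LOCAL_RANK is what launch-style tooling typically sets for device selection. This is only a diagnostic sketch, not something ignite requires:

# env_check.py - print the variables relevant to rendezvous under SLURM
import os

for k in ("SLURM_JOBID", "SLURM_PROCID", "SLURM_LOCALID", "SLURM_NTASKS",
          "MASTER_ADDR", "MASTER_PORT", "RANK", "LOCAL_RANK", "WORLD_SIZE"):
    print(f"{k}: {os.environ.get(k, '<unset>')}")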
Comments
@sdesrozis re: docs - Personally, I would have benefited greatly from a clear explanation of the interaction between SLURM and Ignite and how to dispatch jobs on such a system. I’m awaiting with excitement the upcoming blog post you mentioned a few days ago. Thanks again for working on this so much over the last few days; it really has helped a lot. The why-ignite package was also very helpful!

I also think the usage you suggested is good because it has the least deviation from any other SLURM script, meaning there is a low barrier to entry and it doesn’t require as acute an understanding of the initialization process of distributed jobs as torch.distributed.[launch|spawn] does.

@fco-dv we definitely need the blog post about idist 😉