
How should dist_url be set?

See original GitHub issue

❓ How to use Detectron2

I have read the official PyTorch documentation on distributed training, but I am still struggling to train any of the provided models on more than one GPU, and I suspect the cause is a misunderstanding of how to set `dist_url` in tools/train_net.py. I am currently trying to run the Colab balloon example in a distributed setting with 2 GPUs. However, I hit an error after running the following command:

python train_balloon.py --num-gpus 2 --config-file configs/COCO-InstanceSegmentation/ballon_mask_rcnn_R_50_FPN_3x.yaml --dist-url "auto" --machine-rank 0 --num-machines 1

The error is similar to #91, in which multiple GPUs appear to be trying to use the same --dist-url. My understanding was that after supplying --dist-url, the code would set up the first GPU with rank 0 and configure the remaining GPUs automatically, but this error makes me think we are supposed to iterate over every available GPU and set each one's --dist-url, as in this external example. I have already consulted the documentation and attempted to read the source code, so any example of how to do this properly would be greatly appreciated. The error I get is below:

```
Command Line Args: Namespace(config_file='configs/COCO-InstanceSegmentation/ballon_mask_rcnn_R_50_FPN_3x.yaml', dist_url='tcp://127.0.0.1:52111', eval_only=False, machine_rank=0, num_gpus=2, num_machines=1, opts=[], resume=False)
[the "Command Line Args" line above is printed four times]
Process group URL: tcp://127.0.0.1:52111
[the "Process group URL" line above is printed three times]
Traceback (most recent call last):
  File "train_balloon.py", line 211, in <module>
    args=(args,),
  File "/home/user/detectron2/detectron2/engine/launch.py", line 49, in launch
    daemon=False,
  File "/home/user/.conda/envs/pytorch/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 171, in spawn
    while not spawn_context.join():
  File "/home/user/.conda/envs/pytorch/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 118, in join
    raise Exception(msg)
Exception:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home/user/.conda/envs/pytorch/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/home/user/detectron2/detectron2/engine/launch.py", line 67, in _distributed_worker
    raise e
  File "/home/user/detectron2/detectron2/engine/launch.py", line 62, in _distributed_worker
    backend="NCCL", init_method=dist_url, world_size=world_size, rank=global_rank
  File "/home/user/.conda/envs/pytorch/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 400, in init_process_group
    store, rank, world_size = next(rendezvous(url))
  File "/home/user/.conda/envs/pytorch/lib/python3.6/site-packages/torch/distributed/rendezvous.py", line 95, in _tcp_rendezvous_handler
    store = TCPStore(result.hostname, result.port, world_size, start_daemon)
RuntimeError: Address already in use

[the spawn traceback and the "Process 0 terminated ... Address already in use" block repeat for the other duplicated launches]
```
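For context, here is a minimal sketch of the launch path that produces the log above, assuming detectron2's `engine.launch` API of this era; the `main()` body is elided, so treat it as an illustration rather than the full tools/train_net.py:

```python
# Minimal sketch of a train_net.py-style entry point (assumption: detectron2's
# engine API as of early 2020; main() is elided).
from detectron2.engine import default_argument_parser, launch


def main(args):
    # cfg setup, trainer construction, etc. (elided)
    pass


if __name__ == "__main__":
    args = default_argument_parser().parse_args()
    # launch() spawns one worker process per GPU and has them rendezvous at
    # dist_url. With --dist-url "auto" (single machine only), it resolves the
    # URL to tcp://127.0.0.1:<free port> by itself. A fixed URL such as
    # tcp://127.0.0.1:52111 can be claimed by only one top-level launch; any
    # second independent invocation of the script then hits
    # "Address already in use".
    launch(
        main,
        args.num_gpus,
        num_machines=args.num_machines,
        machine_rank=args.machine_rank,
        dist_url=args.dist_url,
        args=(args,),
    )
```

The point is that a single invocation of the script is responsible for all GPUs on the machine; nothing requires per-GPU `--dist-url` values, and the repeated "Command Line Args" lines in the log show the same script being started several times.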

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Comments: 5

Top GitHub Comments

2 reactions
ppwwyyxx commented, Jan 9, 2020

I cannot help with Slurm questions, but I think you are probably running it 4 times because of `#SBATCH --ntasks=4`. You should only see `Command Line Args: Namespace(config_file='configs/COCO-InstanceSegmentation/ballon_mask_rcnn_R_50_FPN_3x.yaml', dist_url='tcp://127.0.0.1:52111', eval_only=False, machine_rank=0, num_gpus=2, num_machines=1, opts=[], resume=False)` once.
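One quick way to test that hypothesis is to log the Slurm task ID before anything else runs. This is a hypothetical diagnostic, not something from the thread or from detectron2:

```python
# Hypothetical diagnostic: place at the very top of train_balloon.py.
# If Slurm starts the script four times (e.g. --ntasks=4), four lines with
# different SLURM_PROCID values appear before any detectron2 output.
import os
import socket

print(f"host={socket.gethostname()} SLURM_PROCID={os.environ.get('SLURM_PROCID')}")
```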

0 reactions
BisratM commented, Jan 9, 2020

I see. This is my current Slurm configuration (pytorch just happens to be what I named my conda environment). It may be that the srun command is somehow causing the program to be executed in parallel.

```bash
#!/bin/bash
#SBATCH --job-name=norank        # create a short name for your job
#SBATCH --nodes=1                # node count
#SBATCH --ntasks=4               # total number of tasks across all nodes
#SBATCH --cpus-per-task=7        # cpu-cores per task (>1 if multi-threaded tasks)
#SBATCH --mem-per-cpu=4G         # memory per cpu-core (4G per cpu-core is default)
#SBATCH --gres=gpu:2             # number of gpus per node
#SBATCH --time=00:30:00          # total run time limit (HH:MM:SS)
#SBATCH --mail-type=all          # send email on job start, end and fail

module purge
module load anaconda3
module load rh/devtoolset
module load cudnn/cuda-10.1/7.5.0

conda activate pytorch
pip install -e .

srun python train_balloon.py --num-gpus 2 --config-file configs/COCO-InstanceSegmentation/ballon_mask_rcnn_R_50_FPN_3x.yaml --num-machines 1
```
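If the duplicate launches are indeed the problem, the usual adjustment is to request a single task (`#SBATCH --ntasks=1`) while keeping `--num-gpus 2`, so that `srun` starts the script once and detectron2's `launch()` spawns one worker per GPU itself. That reading follows from the comment above rather than from a fix confirmed in the thread.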


