How should dist_url be set?
I have read the official PyTorch documentation on distributed training, but I am still struggling to train any of the provided models on more than one GPU, and I think the cause is a misunderstanding on my part about how to properly set dist_url in tools/train_net.py. I am currently trying to apply the Colab balloon example in a distributed setting with 2 GPUs. However, after running the following command:

`python train_balloon.py --num-gpus 2 --config-file configs/COCO-InstanceSegmentation/ballon_mask_rcnn_R_50_FPN_3x.yaml --dist-url "auto" --machine-rank 0 --num-machines 1`

I get an error similar to #91, in which multiple GPUs appear to be trying to use the same --dist-url. My understanding was that, after supplying a single --dist-url, the code would set up the first GPU with rank 0 and then set up the other GPUs as well, but this error makes me think that we are instead supposed to iterate through every available GPU and set up a --dist-url for each of them, as in this external example. I have already consulted the documentation and attempted to read the source code, so any examples of how to do this properly would be greatly appreciated. The error I get is below:
```
Command Line Args: Namespace(config_file='configs/COCO-InstanceSegmentation/ballon_mask_rcnn_R_50_FPN_3x.yaml', dist_url='tcp://127.0.0.1:52111', eval_only=False, machine_rank=0, num_gpus=2, num_machines=1, opts=[], resume=False)
Command Line Args: Namespace(config_file='configs/COCO-InstanceSegmentation/ballon_mask_rcnn_R_50_FPN_3x.yaml', dist_url='tcp://127.0.0.1:52111', eval_only=False, machine_rank=0, num_gpus=2, num_machines=1, opts=[], resume=False)
Command Line Args: Namespace(config_file='configs/COCO-InstanceSegmentation/ballon_mask_rcnn_R_50_FPN_3x.yaml', dist_url='tcp://127.0.0.1:52111', eval_only=False, machine_rank=0, num_gpus=2, num_machines=1, opts=[], resume=False)
Command Line Args: Namespace(config_file='configs/COCO-InstanceSegmentation/ballon_mask_rcnn_R_50_FPN_3x.yaml', dist_url='tcp://127.0.0.1:52111', eval_only=False, machine_rank=0, num_gpus=2, num_machines=1, opts=[], resume=False)
Process group URL: tcp://127.0.0.1:52111
Process group URL: tcp://127.0.0.1:52111
Process group URL: tcp://127.0.0.1:52111
Traceback (most recent call last):
  File "train_balloon.py", line 211, in <module>
    args=(args,),
  File "/home/user/detectron2/detectron2/engine/launch.py", line 49, in launch
    daemon=False,
  File "/home/user/.conda/envs/pytorch/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 171, in spawn
    while not spawn_context.join():
  File "/home/user/.conda/envs/pytorch/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 118, in join
    raise Exception(msg)
Exception:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home/user/.conda/envs/pytorch/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/home/user/detectron2/detectron2/engine/launch.py", line 67, in _distributed_worker
    raise e
  File "/home/user/detectron2/detectron2/engine/launch.py", line 62, in _distributed_worker
    backend="NCCL", init_method=dist_url, world_size=world_size, rank=global_rank
  File "/home/user/.conda/envs/pytorch/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 400, in init_process_group
    store, rank, world_size = next(rendezvous(url))
  File "/home/user/.conda/envs/pytorch/lib/python3.6/site-packages/torch/distributed/rendezvous.py", line 95, in _tcp_rendezvous_handler
    store = TCPStore(result.hostname, result.port, world_size, start_daemon)
RuntimeError: Address already in use
```
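For what it's worth, my current understanding from the docs and the source is that the training script is invoked once per machine, not once per GPU, and that every invocation receives the same --dist-url, with launch() then spawning one worker process per GPU internally. Roughly like the sketch below; the two-machine hostname and port are placeholders of mine, not values taken from the documentation:

```
# Single machine, 2 GPUs: one invocation; "auto" should pick a free local port.
python train_balloon.py --num-gpus 2 --num-machines 1 --machine-rank 0 \
    --dist-url auto \
    --config-file configs/COCO-InstanceSegmentation/ballon_mask_rcnn_R_50_FPN_3x.yaml

# Two machines, 2 GPUs each: one invocation per machine, the SAME --dist-url on both,
# pointing at the machine that runs with --machine-rank 0.
# On machine 0:
python train_balloon.py --num-gpus 2 --num-machines 2 --machine-rank 0 \
    --dist-url tcp://<machine0-ip>:29500 \
    --config-file configs/COCO-InstanceSegmentation/ballon_mask_rcnn_R_50_FPN_3x.yaml
# On machine 1:
python train_balloon.py --num-gpus 2 --num-machines 2 --machine-rank 1 \
    --dist-url tcp://<machine0-ip>:29500 \
    --config-file configs/COCO-InstanceSegmentation/ballon_mask_rcnn_R_50_FPN_3x.yaml
```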
Top GitHub Comments
I cannot help with Slurm questions, but I think you probably run it 4 times with `#SBATCH --ntasks=4`. You should only see `Command Line Args: Namespace(config_file='configs/COCO-InstanceSegmentation/ballon_mask_rcnn_R_50_FPN_3x.yaml', dist_url='tcp://127.0.0.1:52111', eval_only=False, machine_rank=0, num_gpus=2, num_machines=1, opts=[], resume=False)` once.

I see. This is my current Slurm configuration (pytorch just happens to be what I named my conda environment). It may be possible that the srun command is somehow causing the program to be executed in parallel:
```
#!/bin/bash
#SBATCH --job-name=norank        # create a short name for your job
#SBATCH --nodes=1                # node count
#SBATCH --ntasks=4               # total number of tasks across all nodes
#SBATCH --cpus-per-task=7        # cpu-cores per task (>1 if multi-threaded tasks)
#SBATCH --mem-per-cpu=4G         # memory per cpu-core (4G per cpu-core is default)
#SBATCH --gres=gpu:2             # number of gpus per node
#SBATCH --time=00:30:00          # total run time limit (HH:MM:SS)
#SBATCH --mail-type=all          # send email on job start, end and fail

module purge
module load anaconda3
module load rh/devtoolset
module load cudnn/cuda-10.1/7.5.0

conda activate pytorch
pip install -e .

srun python train_balloon.py --num-gpus 2 --config-file configs/COCO-InstanceSegmentation/ballon_mask_rcnn_R_50_FPN_3x.yaml --num-machines 1
```
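If that diagnosis is correct, the fix would be to let srun start the training script once and leave the per-GPU process spawning to detectron2's launch(). Below is a sketch of the adjusted submission, keeping everything else from the script above unchanged; the only substantive edit is `--ntasks=1` (whether `--cpus-per-task` should then be raised for the single task is not something I have verified):

```
#!/bin/bash
#SBATCH --job-name=norank        # create a short name for your job
#SBATCH --nodes=1                # node count
#SBATCH --ntasks=1               # was 4, which made srun run the script 4 times
#SBATCH --cpus-per-task=7        # cpu-cores per task (>1 if multi-threaded tasks)
#SBATCH --mem-per-cpu=4G         # memory per cpu-core (4G per cpu-core is default)
#SBATCH --gres=gpu:2             # number of gpus per node
#SBATCH --time=00:30:00          # total run time limit (HH:MM:SS)
#SBATCH --mail-type=all          # send email on job start, end and fail

module purge
module load anaconda3
module load rh/devtoolset
module load cudnn/cuda-10.1/7.5.0

conda activate pytorch
pip install -e .

# With a single srun task, "Command Line Args: ..." should be printed once,
# and launch() spawns one worker process per GPU (2 here) on its own.
srun python train_balloon.py --num-gpus 2 --config-file configs/COCO-InstanceSegmentation/ballon_mask_rcnn_R_50_FPN_3x.yaml --num-machines 1
```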