
How should dist_url be set?

See original GitHub issue

❓ How to use Detectron2

I have read the official PyTorch documentation on distributed training, but I am still struggling to train any of the provided models on more than one GPU, and I suspect the cause is a misunderstanding of how to set `dist_url` in tools/train_net.py. I am currently trying to run the Colab balloon example in a distributed setting with 2 GPUs. However, I hit an error after running the following command:

python train_balloon.py --num-gpus 2 --config-file configs/COCO-InstanceSegmentation/ballon_mask_rcnn_R_50_FPN_3x.yaml --dist-url "auto" --machine-rank 0 --num-machines 1

The error is similar to #91, in which multiple GPUs appear to be trying to use the same --dist-url. My understanding was that after supplying --dist-url, the code would set up the first GPU with rank 0 and configure the remaining GPUs automatically, but this error makes me think we are supposed to iterate over every available GPU and set each one's --dist-url, as in this external example. I have already consulted the documentation and attempted to read the source code, so any example of how to do this properly would be greatly appreciated. The error I get is below:

```
Command Line Args: Namespace(config_file='configs/COCO-InstanceSegmentation/ballon_mask_rcnn_R_50_FPN_3x.yaml', dist_url='tcp://127.0.0.1:52111', eval_only=False, machine_rank=0, num_gpus=2, num_machines=1, opts=[], resume=False)
[the "Command Line Args" line above is printed four times]
Process group URL: tcp://127.0.0.1:52111
[the "Process group URL" line above is printed three times]
Traceback (most recent call last):
  File "train_balloon.py", line 211, in <module>
    args=(args,),
  File "/home/user/detectron2/detectron2/engine/launch.py", line 49, in launch
    daemon=False,
  File "/home/user/.conda/envs/pytorch/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 171, in spawn
    while not spawn_context.join():
  File "/home/user/.conda/envs/pytorch/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 118, in join
    raise Exception(msg)
Exception:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home/user/.conda/envs/pytorch/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/home/user/detectron2/detectron2/engine/launch.py", line 67, in _distributed_worker
    raise e
  File "/home/user/detectron2/detectron2/engine/launch.py", line 62, in _distributed_worker
    backend="NCCL", init_method=dist_url, world_size=world_size, rank=global_rank
  File "/home/user/.conda/envs/pytorch/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 400, in init_process_group
    store, rank, world_size = next(rendezvous(url))
  File "/home/user/.conda/envs/pytorch/lib/python3.6/site-packages/torch/distributed/rendezvous.py", line 95, in _tcp_rendezvous_handler
    store = TCPStore(result.hostname, result.port, world_size, start_daemon)
RuntimeError: Address already in use

[the spawn traceback and the "Process 0 terminated ... Address already in use" block repeat for the other duplicated launches]
```
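For context, here is a minimal sketch of the launch path that produces the log above, assuming detectron2's `engine.launch` API of this era; the `main()` body is elided, so treat it as an illustration rather than the full tools/train_net.py:

```python
# Minimal sketch of a train_net.py-style entry point (assumption: detectron2's
# engine API as of early 2020; main() is elided).
from detectron2.engine import default_argument_parser, launch


def main(args):
    # cfg setup, trainer construction, etc. (elided)
    pass


if __name__ == "__main__":
    args = default_argument_parser().parse_args()
    # launch() spawns one worker process per GPU and has them rendezvous at
    # dist_url. With --dist-url "auto" (single machine only), it resolves the
    # URL to tcp://127.0.0.1:<free port> by itself. A fixed URL such as
    # tcp://127.0.0.1:52111 can be claimed by only one top-level launch; any
    # second independent invocation of the script then hits
    # "Address already in use".
    launch(
        main,
        args.num_gpus,
        num_machines=args.num_machines,
        machine_rank=args.machine_rank,
        dist_url=args.dist_url,
        args=(args,),
    )
```

The point is that a single invocation of the script is responsible for all GPUs on the machine; nothing requires per-GPU `--dist-url` values, and the repeated "Command Line Args" lines in the log show the same script being started several times.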

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Comments: 5

Top GitHub Comments

2 reactions
ppwwyyxx commented, Jan 9, 2020

I cannot help with Slurm questions, but I think you are probably running it 4 times because of `#SBATCH --ntasks=4`. You should only see `Command Line Args: Namespace(config_file='configs/COCO-InstanceSegmentation/ballon_mask_rcnn_R_50_FPN_3x.yaml', dist_url='tcp://127.0.0.1:52111', eval_only=False, machine_rank=0, num_gpus=2, num_machines=1, opts=[], resume=False)` once.
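One quick way to test that hypothesis is to log the Slurm task ID before anything else runs. This is a hypothetical diagnostic, not something from the thread or from detectron2:

```python
# Hypothetical diagnostic: place at the very top of train_balloon.py.
# If Slurm starts the script four times (e.g. --ntasks=4), four lines with
# different SLURM_PROCID values appear before any detectron2 output.
import os
import socket

print(f"host={socket.gethostname()} SLURM_PROCID={os.environ.get('SLURM_PROCID')}")
```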

0 reactions
BisratM commented, Jan 9, 2020

I see. This is my current Slurm configuration (pytorch just happens to be what I named my conda environment). It may be that the srun command is somehow causing the program to be executed in parallel.

```bash
#!/bin/bash
#SBATCH --job-name=norank        # create a short name for your job
#SBATCH --nodes=1                # node count
#SBATCH --ntasks=4               # total number of tasks across all nodes
#SBATCH --cpus-per-task=7        # cpu-cores per task (>1 if multi-threaded tasks)
#SBATCH --mem-per-cpu=4G         # memory per cpu-core (4G per cpu-core is default)
#SBATCH --gres=gpu:2             # number of gpus per node
#SBATCH --time=00:30:00          # total run time limit (HH:MM:SS)
#SBATCH --mail-type=all          # send email on job start, end and fail

module purge
module load anaconda3
module load rh/devtoolset
module load cudnn/cuda-10.1/7.5.0

conda activate pytorch
pip install -e .

srun python train_balloon.py --num-gpus 2 --config-file configs/COCO-InstanceSegmentation/ballon_mask_rcnn_R_50_FPN_3x.yaml --num-machines 1
```
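If the duplicate launches are indeed the problem, the usual adjustment is to request a single task (`#SBATCH --ntasks=1`) while keeping `--num-gpus 2`, so that `srun` starts the script once and detectron2's `launch()` spawns one worker per GPU itself. That reading follows from the comment above rather than from a fix confirmed in the thread.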


