
Run more than one DDP model on one machine

See original GitHub issue

Hi, thank you so much for your input so far. I have now successfully converted my models to use DDP instead of DP, and time per epoch has gone down from 1400s to 1140s when using 8 GPUs. One problem I have been unable to solve so far is running more than one DDP model on the same machine:

Traceback (most recent call last):
  File "run/run_training_DPP.py", line 68, in <module>
    unpack_data=unpack, deterministic=deterministic, fp16=args.fp16)
  File "/home/fabian/PhD/meddec/meddec/model_training/distributed_training/nnUNetTrainerDPP.py", line 26, in __init__
    dist.init_process_group(backend='nccl', init_method='env://')
  File "/home/fabian/dl_venv_python3/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 354, in init_process_group
    store, rank, world_size = next(rendezvous(url))
  File "/home/fabian/dl_venv_python3/lib/python3.6/site-packages/torch/distributed/rendezvous.py", line 143, in _env_rendezvous_handler
    store = TCPStore(master_addr, master_port, world_size, start_daemon)
RuntimeError: Address already in use

(which is kind of to be expected given how DDP works). Do you know of a workaround for this? We recently got a DGX-2 with 16 GPUs, and I would like to run two different experiments in parallel, each using 8 GPUs. Best, Fabian
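
(For context, a minimal sketch, not from the issue, of what the workarounds below boil down to: the env:// rendezvous creates a TCPStore bound to MASTER_PORT, and torch.distributed.launch defaults that port to 29500, so two jobs on one machine collide unless each job gets its own port.)

import os
import torch.distributed as dist

# RANK and WORLD_SIZE are assumed to be set by the launcher
# (torch.distributed.launch); only the rendezvous address/port are set here.
os.environ.setdefault('MASTER_ADDR', 'localhost')
os.environ.setdefault('MASTER_PORT', '29500')  # e.g. export MASTER_PORT=29501 for the second job
dist.init_process_group(backend='nccl', init_method='env://')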

Issue Analytics

  • State: closed
  • Created: 5 years ago
  • Comments: 27 (10 by maintainers)

Top GitHub Comments

3 reactions
FabianIsensee commented, Feb 21, 2019

Hah

python -m torch.distributed.launch --nproc_per_node=1 --master_port=2345 train1.py

python1 -m torch.distributed.launch --nproc_per_node=1 --master_port=1234 train2.py

This works! (python1 is an alias for CUDA_VISIBLE_DEVICES=1 python.)

Now I just have to test this on our servers and also figure out a way to set a unique port for each job, but thanks for your help! =)
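
(One way to automate the unique-port part, sketched under the assumption that a small Python wrapper around the launcher is acceptable; find_free_port is a made-up helper: ask the OS for an unused port and pass it through --master_port.)

import socket
import subprocess
import sys

def find_free_port():
    # Bind to port 0: the OS picks an unused TCP port for us.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(('', 0))
        return s.getsockname()[1]

# Note: there is a small race between closing the socket and the launcher
# re-binding the port, but in practice this is usually good enough.
port = find_free_port()
subprocess.check_call([
    sys.executable, '-m', 'torch.distributed.launch',
    '--nproc_per_node=1', '--master_port=%d' % port,
    'train1.py',
])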

2 reactions
mcarilli commented, Feb 24, 2019

By different experiments, do you mean you are manually launching two separate jobs with two separate invocations of python -m torch.distributed.launch --nproc_per_node=8 train.py ...?

If so, I think you can make this work by additionally supplying unique MASTER_ADDR and MASTER_PORT environment variables to each job. If you supply the appropriate args, torch.distributed.launch will do this for you: https://pytorch.org/docs/stable/distributed.html#launch-utility

python -m torch.distributed.launch --nproc_per_node=NUM_GPUS_YOU_HAVE \
       --master_addr="192.168.1.1" \
       --master_port=1234 \
       YOUR_TRAINING_SCRIPT.py (--arg1 --arg2 --arg3 and all other arguments of your training script)

^I took their multi-node launch example and removed the multi-node-specific args, since you’re not doing a multi-node launch; you’re (I think) launching two jobs separately on a single node.

Another thing you will need to do is make sure CUDA_VISIBLE_DEVICES is different for each launch, so that the first launch uses GPUs 0-7 and the second uses GPUs 8-15. What I would do is open two different terminals. In the first terminal, run

export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
python -m torch.distributed.launch --nproc_per_node=8 --master_addr="IP for job 1" --master_port=<port for job 1> train.py ...

In the second terminal, run

export CUDA_VISIBLE_DEVICES=8,9,10,11,12,13,14,15
python -m torch.distributed.launch --nproc_per_node=8 --master_addr="IP for job 2" --master_port=<port for job 2> train.py ...
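
(For completeness, a sketch of the train.py side, assumed rather than taken from the thread: torch.distributed.launch passes a --local_rank argument to each process, which indexes into that job's CUDA_VISIBLE_DEVICES.)

import argparse
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel

parser = argparse.ArgumentParser()
parser.add_argument('--local_rank', type=int, default=0)  # injected by torch.distributed.launch
args = parser.parse_args()

torch.cuda.set_device(args.local_rank)  # 0-7 within this job's CUDA_VISIBLE_DEVICES
dist.init_process_group(backend='nccl', init_method='env://')  # MASTER_ADDR/PORT come from the launcher

model = torch.nn.Linear(10, 10).cuda()
model = DistributedDataParallel(model, device_ids=[args.local_rank])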