
Run more than one DDP model on one machine

See original GitHub issue

Hi, thank you so much for your input so far. I have now successfully converted my models to use DDP instead of DP, and time per epoch has gone down from 1400s to 1140s when using 8 GPUs. One problem I have been unable to solve so far is running more than one DDP model on the same machine:

Traceback (most recent call last):
  File "run/run_training_DPP.py", line 68, in <module>
    unpack_data=unpack, deterministic=deterministic, fp16=args.fp16)
  File "/home/fabian/PhD/meddec/meddec/model_training/distributed_training/nnUNetTrainerDPP.py", line 26, in __init__
    dist.init_process_group(backend='nccl', init_method='env://')
  File "/home/fabian/dl_venv_python3/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 354, in init_process_group
    store, rank, world_size = next(rendezvous(url))
  File "/home/fabian/dl_venv_python3/lib/python3.6/site-packages/torch/distributed/rendezvous.py", line 143, in _env_rendezvous_handler
    store = TCPStore(master_addr, master_port, world_size, start_daemon)
RuntimeError: Address already in use

(which is kind of to be expected given how DDP works). Do you know of a workaround for this? We recently got a DGX-2 with 16 GPUs, and I would like to run two different experiments in parallel, each using 8 GPUs. Best, Fabian
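
(For context, a minimal sketch, not from the issue, of what the workarounds below boil down to: the env:// rendezvous creates a TCPStore bound to MASTER_PORT, and torch.distributed.launch defaults that port to 29500, so two jobs on one machine collide unless each job gets its own port.)

import os
import torch.distributed as dist

# RANK and WORLD_SIZE are assumed to be set by the launcher
# (torch.distributed.launch); only the rendezvous address/port are set here.
os.environ.setdefault('MASTER_ADDR', 'localhost')
os.environ.setdefault('MASTER_PORT', '29500')  # e.g. export MASTER_PORT=29501 for the second job
dist.init_process_group(backend='nccl', init_method='env://')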

Issue Analytics

  • State: closed
  • Created: 5 years ago
  • Comments: 27 (10 by maintainers)

Top GitHub Comments

3 reactions
FabianIsensee commented, Feb 21, 2019

Hah

python -m torch.distributed.launch --nproc_per_node=1 --master_port=2345 train1.py

python1 -m torch.distributed.launch --nproc_per_node=1 --master_port=1234 train2.py

This works! (python1 is an alias for CUDA_VISIBLE_DEVICES=1 python.)

Now I just have to test this on our servers and also figure out a way to set a unique port for each job, but thanks for your help! =)
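
(One way to automate the unique-port part, sketched under the assumption that a small Python wrapper around the launcher is acceptable; find_free_port is a made-up helper: ask the OS for an unused port and pass it through --master_port.)

import socket
import subprocess
import sys

def find_free_port():
    # Bind to port 0: the OS picks an unused TCP port for us.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(('', 0))
        return s.getsockname()[1]

# Note: there is a small race between closing the socket and the launcher
# re-binding the port, but in practice this is usually good enough.
port = find_free_port()
subprocess.check_call([
    sys.executable, '-m', 'torch.distributed.launch',
    '--nproc_per_node=1', '--master_port=%d' % port,
    'train1.py',
])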

2 reactions
mcarilli commented, Feb 24, 2019

By different experiments, do you mean you are manually launching two separate jobs with two separate invocations of python -m torch.distributed.launch --nproc_per_node=8 train.py ...?

If so, I think you can make this work by additionally supplying unique MASTER_ADDR and MASTER_PORT environment variables to each job. If you supply the appropriate args, torch.distributed.launch will do this for you: https://pytorch.org/docs/stable/distributed.html#launch-utility

python -m torch.distributed.launch --nproc_per_node=NUM_GPUS_YOU_HAVE \
       --master_addr="192.168.1.1" \
       --master_port=1234 \
       YOUR_TRAINING_SCRIPT.py (--arg1 --arg2 --arg3 and all other arguments of your training script)

^I took their multi-node launch example and removed the multi-node-specific args, since you’re not doing a multi-node launch; you’re (I think) launching two jobs separately on a single node.

Another thing you will need to do is make sure CUDA_VISIBLE_DEVICES is different for each launch, so that the first launch uses GPUs 0-7 and the second uses GPUs 8-15. What I would do is open two different terminals. In the first terminal, run

export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
python -m torch.distributed.launch --nproc_per_node=8 --master_addr="IP for job 1" --master_port=<port for job 1> train.py ...

In the second terminal, run

export CUDA_VISIBLE_DEVICES=8,9,10,11,12,13,14,15
python -m torch.distributed.launch --nproc_per_node=8 --master_addr="IP for job 2" --master_port=<port for job 2> train.py ...
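
(For completeness, a sketch of the train.py side, assumed rather than taken from the thread: torch.distributed.launch passes a --local_rank argument to each process, which indexes into that job's CUDA_VISIBLE_DEVICES.)

import argparse
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel

parser = argparse.ArgumentParser()
parser.add_argument('--local_rank', type=int, default=0)  # injected by torch.distributed.launch
args = parser.parse_args()

torch.cuda.set_device(args.local_rank)  # 0-7 within this job's CUDA_VISIBLE_DEVICES
dist.init_process_group(backend='nccl', init_method='env://')  # MASTER_ADDR/PORT come from the launcher

model = torch.nn.Linear(10, 10).cuda()
model = DistributedDataParallel(model, device_ids=[args.local_rank])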