run more than one DDP model on one machine
Hi, thank you so much for your input so far. I have now successfully converted my models to use DDP instead of DP, and time per epoch has gone down from 1400s to 1140s when using 8 GPUs. One problem I have not been able to solve so far is running more than one DDP model on the same machine:
Traceback (most recent call last):
  File "run/run_training_DPP.py", line 68, in <module>
    unpack_data=unpack, deterministic=deterministic, fp16=args.fp16)
  File "/home/fabian/PhD/meddec/meddec/model_training/distributed_training/nnUNetTrainerDPP.py", line 26, in __init__
    dist.init_process_group(backend='nccl', init_method='env://')
  File "/home/fabian/dl_venv_python3/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 354, in init_process_group
    store, rank, world_size = next(rendezvous(url))
  File "/home/fabian/dl_venv_python3/lib/python3.6/site-packages/torch/distributed/rendezvous.py", line 143, in _env_rendezvous_handler
    store = TCPStore(master_addr, master_port, world_size, start_daemon)
RuntimeError: Address already in use
(which is kind of to be expected given how DDP works). Do you know of a workaround for this? We have recently gotten a DGX-2 with 16 GPUs and I would like to run two different experiments in parallel, each using 8 GPUs. Best, Fabian
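For context, with init_method='env://' the rank-0 process of each job creates a TCPStore bound to MASTER_ADDR:MASTER_PORT, so two jobs that inherit the same port will collide exactly as in the traceback above. A minimal sketch of what each job effectively does (the port value shown is the launcher's usual default; the explicit env defaults are just for illustration):

# sketch: env:// rendezvous reads its connection info from the environment
import os
import torch.distributed as dist

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")  # default used by torch.distributed.launch
os.environ.setdefault("RANK", "0")             # set per process by the launcher
os.environ.setdefault("WORLD_SIZE", "1")

# the second job to reach this line with the same MASTER_PORT fails with
# "Address already in use"
dist.init_process_group(backend="nccl", init_method="env://")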
Hah
python -m torch.distributed.launch --nproc_per_node=1 --master_port=2345 train1.py
python1 -m torch.distributed.launch --nproc_per_node=1 --master_port=1234 train2.py
this works! (python1 is an alias for CUDA_VISIBLE_DEVICES=1 python.) Now I just have to test this on our servers and also figure out a way to set a unique port for all jobs, but thanks for your help =)
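One way to avoid hand-picking a unique port for every job (not from this thread, just a sketch; the helper name is made up) is to let the OS choose a free port and pass it to --master_port. Note there is a small race window between printing the port and the launcher binding it:

# free_port.py -- print an unused TCP port on this machine
import socket

with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    s.bind(("", 0))               # port 0 asks the OS for any free port
    print(s.getsockname()[1])     # report the port the OS picked

Usage (hypothetical): python -m torch.distributed.launch --nproc_per_node=1 --master_port=$(python free_port.py) train1.py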
By different experiments do you mean you are manually launching two separate jobs with two separate invocations of
python -m torch.distributed.launch --nproc_per_node=8 train.py ...
?
If so, I think you can make this work by additionally supplying unique MASTER_ADDR and MASTER_PORT environment variables to each job. If you supply the appropriate args to torch.distributed.launch, it will do this for you: https://pytorch.org/docs/stable/distributed.html#launch-utility
I took their multinode launch example and removed the multinode-specific args, since you're not doing a multinode launch; you're (I think) launching two jobs separately on a single node.
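For example, a single-node launch with the port supplied explicitly might look like this (a sketch based on the launch-utility docs; the script name, address, and port are placeholders, not from the original comment):

python -m torch.distributed.launch --nproc_per_node=8 --master_addr=127.0.0.1 --master_port=29500 train.py --your_training_args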
Another thing you will need to do is make sure CUDA_VISIBLE_DEVICES is different for each launch, so that the first launch uses GPUs 0-7 and the second uses GPUs 8-15. What I would do is open two different terminals and run one launch command in the first terminal and the other in the second, as sketched below.
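For illustration only (the exact commands are not preserved in the archived comment; the ports, script names, and GPU indices here are assumptions), the two launches on a 16-GPU DGX-2 might look like:

# terminal 1: first experiment on GPUs 0-7, rendezvous on its own port
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python -m torch.distributed.launch --nproc_per_node=8 --master_port=29500 train_experiment1.py

# terminal 2: second experiment on GPUs 8-15, different port
CUDA_VISIBLE_DEVICES=8,9,10,11,12,13,14,15 python -m torch.distributed.launch --nproc_per_node=8 --master_port=29501 train_experiment2.py

Inside each job the visible devices are renumbered 0-7, and each rendezvous binds its own TCP port, so the "Address already in use" error no longer occurs.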