
Why use torch.multiprocessing.spawn for distributed training

See original GitHub issue

Hi there,

In the Swin UNETR scripts, e.g., https://github.com/Project-MONAI/research-contributions/blob/main/SwinUNETR/BRATS21/main.py, torch.multiprocessing.spawn is used for launching distributed training. Any reason why you didn’t use torch.distributed.launch? Did torch.multiprocessing.spawn give better performance than torch.distributed.launch for BraTS/BTCV-based Swin UNETR training?

Thanks!
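For readers unfamiliar with the spawn-based launch being asked about, the sketch below shows the general torch.multiprocessing.spawn + DDP pattern that scripts like main.py follow. It is a minimal illustration only: the worker body, model, tensor shapes, and port are placeholders, not code taken from the MONAI repository.

```python
# Minimal sketch of launching DDP workers with torch.multiprocessing.spawn.
# The model, data, and argument names are illustrative placeholders.
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP


def main_worker(rank, world_size):
    # One process per GPU; `rank` is passed in automatically by mp.spawn.
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    model = torch.nn.Linear(10, 1).cuda(rank)      # stand-in for SwinUNETR
    model = DDP(model, device_ids=[rank])

    x = torch.randn(2, 10, device=rank)
    model(x).sum().backward()                      # gradients all-reduced by DDP

    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    # Launched with plain `python script.py`; spawn starts one process per GPU.
    mp.spawn(main_worker, nprocs=world_size, args=(world_size,))
```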

Issue Analytics

  • State: open
  • Created: a year ago
  • Comments: 9 (5 by maintainers)

Top GitHub Comments

1 reaction
tangy5 commented, Dec 8, 2022

Thank you for the clarification. Here are the initial logs:

[screenshot: timing log, single GPU, batch_size=1]

[screenshot: timing log, 2 GPUs, batch_size=2]

Per-step time on multi-GPU keeps getting longer as the number of GPUs increases, and it would be even worse when running with batch_size=1 on multiple GPUs.

To be clear: yes, with a single GPU the batch size is 1 and with 2 GPUs the batch size is 2, so the per-step/iteration time is expected to be longer, but it should stay below 2x the single-GPU time. You can see that 2-GPU training is faster here, just not exactly 2x faster; it is about 1.7x.
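As a back-of-the-envelope check of that ~1.7x figure, the snippet below computes throughput (samples per second) from per-step times. The step times used are hypothetical stand-ins chosen for illustration, not the numbers from the screenshots above.

```python
# Hypothetical per-step times (seconds/iteration); not taken from the logs above.
t_1gpu_bs1 = 1.00   # single GPU, batch_size=1
t_2gpu_bs2 = 1.18   # 2 GPUs, global batch_size=2 (per-step time grows, but < 2x)

throughput_1gpu = 1 / t_1gpu_bs1   # samples/second
throughput_2gpu = 2 / t_2gpu_bs2   # samples/second

speedup = throughput_2gpu / throughput_1gpu
print(f"speedup: {speedup:.2f}x")  # ~1.7x, i.e. below the ideal 2x
```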

1 reaction
tangy5 commented, Sep 2, 2022

Hi @hw-ju, SwinUNETR has been tested for multi-GPU training with both the DDP launcher (torch.distributed.launch) and mp.spawn. Both work well, and there is no performance preference between the two multi-GPU launch methods. You can safely use DDP. Thank you!
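For comparison with the spawn sketch above, here is a minimal sketch of the launcher-driven alternative the question asks about: the same kind of DDP worker started by torchrun or torch.distributed.launch instead of mp.spawn. The script name, model, and shapes are placeholders, not MONAI code.

```python
# train_ddp.py -- minimal sketch of a launcher-driven DDP script (placeholder names).
# Run with either of:
#   torchrun --nproc_per_node=2 train_ddp.py
#   python -m torch.distributed.launch --use_env --nproc_per_node=2 train_ddp.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    # The launcher sets RANK, WORLD_SIZE, MASTER_ADDR/PORT, and LOCAL_RANK;
    # init_process_group picks them up via the default env:// rendezvous.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(10, 1).cuda(local_rank)   # stand-in for SwinUNETR
    model = DDP(model, device_ids=[local_rank])

    x = torch.randn(2, 10, device=local_rank)
    model(x).sum().backward()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```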

Read more comments on GitHub.

Top Results From Across the Web

  • Torch.distributed.launch vs torch.multiprocessing.spawn: "If you need multi-server distributed data parallel training, it might be more convenient to use torch.distributed.launch as it automatically ..."
  • Why using mp.spawn is slower than using torch.distributed ...: "mp.spawn is usually slower due to initialization overhead. In general distributed training is long running, so usually the initialization time ..."
  • Distributed Computing with PyTorch - Shiv Gehlot: "Hence, torch.multiprocessing.spawn can be used to spawn the training function fn() on each of the GPUs through args."
  • Writing Distributed Applications with PyTorch: "torch.distributed enables researchers and practitioners to easily parallelize their computations across processes and clusters of machines ..."
  • Distributed Training Made Easy with PyTorch-Ignite: "Then we will also cover several ways of spawning processes via the torch-native torch.multiprocessing.spawn and also via multiple distributed ..."
