Be able to Set Timeout When DDP Strategy is Used
🚀 Feature
Motivation
When I use PyTorch Lightning on our company's cluster to train a big model that needs many GPU cards, it usually takes more than 30 minutes before all of the requested cards are ready. Therefore, it would be great if I could set the timeout parameter of init_process_group rather than only use the default value.
I think this can be accomplished by adding a parameter to the __init__ function of DDPStrategy, which would then be passed to the init_dist_connection function called in DDPStrategy.setup_distributed. If you believe this is OK, I can create a PR.
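For illustration, a minimal sketch of how such an option might look and be used; the `timeout` argument on `DDPStrategy` is the addition proposed in this issue, not an existing parameter at the time of writing:

```python
from datetime import timedelta

import pytorch_lightning as pl
from pytorch_lightning.strategies import DDPStrategy

# Proposed: forward a custom `timeout` to torch.distributed.init_process_group
# instead of relying on its default (30 minutes), so jobs that wait longer for
# all GPU nodes to come up do not fail during rendezvous.
strategy = DDPStrategy(timeout=timedelta(hours=2))

trainer = pl.Trainer(
    accelerator="gpu",
    devices=8,
    num_nodes=4,
    strategy=strategy,
)
```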
cc @borda @justusschock @kaushikb11 @awaelchli @akihironitta @rohitgr7
Issue Analytics
- State:
- Created a year ago
- Comments: 8 (8 by maintainers)
There are already checks for this inside PyTorch: https://github.com/pytorch/pytorch/blob/a8b098859688a3f1993821eecc036be973a15605/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp#L555-L572
I suggest we don’t go into these internal details in PL
@carmocca OK. A new PR is here: https://github.com/Lightning-AI/lightning/pull/13383