
Be able to Set Timeout When DDP Strategy is Used

See original GitHub issue

🚀 Feature

Motivation

When I use PyTorch Lightning on my company's cluster to train a big model that needs many GPUs, I normally have to wait more than 30 minutes before all the GPUs I need are ready. It would therefore be great if I could set the timeout parameter of init_process_group rather than just using the default value.

I think this can be accomplished by adding a parameter to the __init__ function of DDPStrategy, which would then be passed to the init_dist_connection function called in DDPStrategy.setup_distributed. If you believe this is OK, I can create a PR.

cc @borda @justusschock @kaushikb11 @awaelchli @akihironitta @rohitgr7

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 8 (8 by maintainers)

Top GitHub Comments

2 reactions · carmocca commented, Jun 21, 2022

There are already checks for this inside PyTorch: https://github.com/pytorch/pytorch/blob/a8b098859688a3f1993821eecc036be973a15605/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp#L555-L572

I suggest we don't go into these internal details in PL.

1 reaction · lsy643 commented, Jun 23, 2022

