Be able to Set Timeout When DDP Strategy is Used
🚀 Feature
Motivation
When I use PyTorch Lightning on our company's cluster to train a big model that needs many GPU cards, it usually takes more than 30 minutes before all of the requested cards are ready. Therefore, it would be great if I could set the timeout parameter of init_process_group rather than only use the default value.
I think this can be accomplished by adding a parameter to the __init__ function of DDPStrategy, which would then be passed to the init_dist_connection function called in DDPStrategy.setup_distributed. If you believe this is OK, I can create a PR.
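For illustration, a minimal sketch of how such an option might look and be used; the `timeout` argument on `DDPStrategy` is the addition proposed in this issue, not an existing parameter at the time of writing:

```python
from datetime import timedelta

import pytorch_lightning as pl
from pytorch_lightning.strategies import DDPStrategy

# Proposed: forward a custom `timeout` to torch.distributed.init_process_group
# instead of relying on its default (30 minutes), so jobs that wait longer for
# all GPU nodes to come up do not fail during rendezvous.
strategy = DDPStrategy(timeout=timedelta(hours=2))

trainer = pl.Trainer(
    accelerator="gpu",
    devices=8,
    num_nodes=4,
    strategy=strategy,
)
```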
cc @borda @justusschock @kaushikb11 @awaelchli @akihironitta @rohitgr7
Issue Analytics
- State:
- Created a year ago
- Comments: 8 (8 by maintainers)
There are already checks for this inside PyTorch: https://github.com/pytorch/pytorch/blob/a8b098859688a3f1993821eecc036be973a15605/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp#L555-L572
I suggest we don’t go into these internal details in PL
@carmocca OK. A new PR is here: https://github.com/Lightning-AI/lightning/pull/13383