
WarmupDecayLR.params.total_num_steps - total or per gpu?

See original GitHub issue

We are having a bit of a hard time getting total_num_steps to pass to WarmupDecayLR at init time - it’s a bit too early for that logic, since these values are only known once DDP/DeepSpeed has started. We found a workaround, but it doesn’t take the number of GPUs into account.

I wanted to check that WarmupDecayLR.params.total_num_steps expects the total for the whole world and not per GPU.
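
For concreteness, here is a small sketch of the arithmetic at stake, with purely illustrative numbers (none of these values come from the original issue). If total_num_steps is a world total, the number of GPUs enters the effective batch size; if it were per GPU, it wouldn’t:

    import math

    # Illustrative values only - substitute your own training setup.
    num_train_samples = 100_000
    num_epochs = 3
    per_gpu_batch_size = 8
    gradient_accumulation_steps = 4
    world_size = 8  # total number of GPUs across all nodes

    # "Whole world" reading: all GPUs contribute to one optimizer step,
    # so world_size enters the effective batch size.
    effective_batch = per_gpu_batch_size * gradient_accumulation_steps * world_size
    total_num_steps = num_epochs * math.ceil(num_train_samples / effective_batch)

    # "Per GPU" reading: world_size would drop out of the formula.
    per_gpu_steps = num_epochs * math.ceil(
        num_train_samples / (per_gpu_batch_size * gradient_accumulation_steps)
    )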

Currently the doc says:

class WarmupDecayLR(WarmupLR):
[...]
            total_num_steps (int): total number of training steps

so it’d be awesome if the docs disambiguated which of the two is meant.

Thank you!

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 10 (10 by maintainers)

Top GitHub Comments

1 reaction
stas00 commented, Jan 7, 2021

It worked beautifully and greatly simplified our code - thank you, @jeffra

It’d be awesome to document it at https://www.deepspeed.ai/getting-started/#writing-deepspeed-models perhaps?

I’ve proposed a doc update to reflect this important function: https://github.com/microsoft/DeepSpeed/pull/644

1 reaction
stas00 commented, Jan 7, 2021

oh, you have recently added just that - fantastic - I will give it a try and report back. Thank you, @jeffra
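
For readers landing here later, a plausible reading of the resolution is that the scheduler gets constructed in client code, once the distributed environment is up and the step count can be computed, and is then handed to deepspeed.initialize rather than declared in the JSON config. A minimal sketch under that assumption, using a toy model and placeholder numbers (deepspeed.initialize does accept an lr_scheduler argument, and WarmupDecayLR lives in deepspeed.runtime.lr_schedules):

    import torch
    import deepspeed
    from deepspeed.runtime.lr_schedules import WarmupDecayLR

    # Toy model and optimizer - placeholders for a real training script.
    model = torch.nn.Linear(10, 10)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

    # Computed *after* the distributed setup is known, e.g.
    # epochs * steps_per_epoch for the whole world.
    total_num_steps = 1_000

    scheduler = WarmupDecayLR(
        optimizer,
        total_num_steps=total_num_steps,
        warmup_num_steps=100,
    )

    # Passing the scheduler object directly sidesteps the config-time
    # chicken-and-egg problem described in the question above.
    engine, optimizer, _, scheduler = deepspeed.initialize(
        model=model,
        optimizer=optimizer,
        lr_scheduler=scheduler,
        config={"train_batch_size": 8},
    )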

Read more comments on GitHub >

Top Results From Across the Web

  • DeepSpeed Integration - Hugging Face
    Gathering Parameters. Under ZeRO-3 on multiple GPUs no single GPU has all the parameters unless it’s the parameters for the currently executing layer...
  • Learning Rate Schedulers — DeepSpeed 0.8.0 documentation
    Sets the learning rate of each parameter group according to learning rate range test (LRRT) policy. The policy increases learning rate starting from...
  • DeepSpeed Configuration JSON
    The maximum number of parameters resident per GPU before releasing. Smaller values use less memory, but perform more communication. 1e9...
  • Training with multiple GPUs - NVIDIA Documentation Center
    Horovod synchronizes the gradients across all processes at each training step. ... to set up the training parameters properly with multi-GPU training.
  • Scalable, Low-cost Training of Massive Deep Learning Models
    For example, Nvidia’s NVLink connects 16 GPUs within a DGX-2 server at 2.4 Tbps all-to-all bandwidth but the DGX-2 servers themselves are connected...
