Delayed Parameter Update when step(wait=False)
Is your feature request related to a problem? Please describe.
Eh, this could be a question. I'm trying to use `TrainingAverager` with `step(wait=False)`. That requires `data_lock`, and `use_old_local_tensor=True` follows from it.
When `use_old_local_tensor=True`, is it correct to simply add the weight difference between the local model and the all-reduced model to the new model parameters? The gradients calculated from the old model parameters end up being added to the new model parameters, which doesn't seem quite right.
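To make the question concrete, here is a minimal sketch of the update rule I understand `use_old_local_tensor=True` to imply (this is not hivemind's actual code; `apply_delayed_average`, `old_local`, and `averaged` are hypothetical names):

```python
import torch

@torch.no_grad()
def apply_delayed_average(params, old_local, averaged):
    """Sketch: the all-reduce ran on a snapshot (`old_local`) of the parameters,
    but `params` kept training while the all-reduce was in flight. The averaging
    result is applied as a delta so those local steps are not thrown away."""
    for p, old, avg in zip(params, old_local, averaged):
        # p already contains gradient updates computed from the *old* weights,
        # which is the part that seems questionable to me.
        p.add_(avg - old)
```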
Describe the solution you’d like
https://arxiv.org/abs/2101.06840 proposes Delayed Parameter Update. Parameter update is delayed by one step. Apparently, it makes little difference in the training curve if DPU is applied after 40 iterations in BERT-large training.
I think that to implement DPU, you simply have to copy the averaged tensors back into the model at the beginning of `step()`.
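Roughly the flow I have in mind, as a hypothetical sketch (the class and method names below are made up and are not the TrainingAverager API):

```python
import torch

class DelayedParameterUpdateSketch:
    """One-step-delayed averaging: the result of the previous asynchronous
    all-reduce is copied into the model at the start of the next step."""

    def __init__(self, params, start_allreduce_async):
        self.params = list(params)
        # stand-in for a non-blocking averaging call (e.g. step(wait=False));
        # it should return a future whose .result() yields averaged tensors
        self.start_allreduce_async = start_allreduce_async
        self.pending = None

    @torch.no_grad()
    def step(self):
        # 1) apply the result of the *previous* round first (the one-step delay)
        if self.pending is not None:
            for p, avg in zip(self.params, self.pending.result()):
                p.copy_(avg)
        # 2) launch the next round in the background and return immediately
        self.pending = self.start_allreduce_async(self.params)
```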
Describe alternatives you’ve considered
I understand that if the weight difference is not added back, the local steps taken before the asynchronous all-reduce completes are wasted. Not only does that defeat the purpose of asynchronous all-reduce (if local updates are going to be thrown away until the async operation completes, why not just go synchronous?), but it also skips over input data, which could hurt training.
Top GitHub Comments
To whom it may concern: delayed parameter updates are enabled with `hivemind.Optimizer` via `delay_grad_averaging=True, delay_optimizer_step=True`.
Minimalistic example: benchmark_optimizer.py
More advanced usage examples (with full or partial DPU, at the user's discretion):
A more detailed API reference can be found here:
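For convenience, a minimal sketch of what that might look like (the run name, batch sizes, and model are placeholders, and I believe the delayed optimizer step needs `offload_optimizer=True`; check the API reference above for the exact requirements):

```python
import torch
import hivemind

model = torch.nn.Linear(10, 10)      # placeholder model
dht = hivemind.DHT(start=True)       # create or join the DHT

opt = hivemind.Optimizer(
    dht=dht,
    run_id="my_dpu_run",                              # placeholder experiment name
    params=model.parameters(),
    optimizer=lambda params: torch.optim.Adam(params, lr=1e-3),
    target_batch_size=4096,                           # placeholder global batch size
    batch_size_per_step=32,                           # placeholder local batch size
    offload_optimizer=True,                           # keep an offloaded copy for the background step
    delay_optimizer_step=True,                        # run the optimizer step in the background
    delay_grad_averaging=True,                        # overlap gradient averaging with compute (full DPU)
    verbose=True,
)

# then train as usual: loss.backward(); opt.step()
```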
Currently, DPU requires installing hivemind from the GitHub repo, i.e.
pip install https://github.com/learning-at-home/hivemind/archive/master.zip
It will be available from PyPI after v1.0.0 is released, which is to say “sometime very soon”
If you have any other questions, feel free to open another issue or join our discord channel (link above)
That might indeed be the case. In our DPU experiments, we enabled it early on, during the initial LR warmup, so the learning rate was still very small. That might have allowed DPU to phase in without a significant performance drawdown.
P.S. I finished the rest of my backlog yesterday and am now working on making DPU work in hivemind master. I'd still appreciate it if you have time to chat a little to better coordinate our efforts (we can meet on Discord or whichever other means of communication you prefer). Anyway, I'll post updates to this thread as soon as I make any meaningful progress (within <=96h).