Delayed Parameter Update when step(wait=False)
Is your feature request related to a problem? Please describe.
Eh, this could be a question. I'm trying to use `TrainingAverager` with `step(wait=False)`. That requires `data_lock`, and `use_old_local_tensor=True` follows from it.
When `use_old_local_tensor=True`, is it correct to simply add the weight difference between the local model and the all-reduced model to the new model parameters? The gradients calculated from the old model parameters end up being added to the new model parameters, which doesn't seem quite right.
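To make the question concrete, here is a minimal sketch of the update rule I understand `use_old_local_tensor=True` to imply (this is not hivemind's actual code; `apply_delayed_average`, `old_local`, and `averaged` are hypothetical names):

```python
import torch

@torch.no_grad()
def apply_delayed_average(params, old_local, averaged):
    """Sketch: the all-reduce ran on a snapshot (`old_local`) of the parameters,
    but `params` kept training while the all-reduce was in flight. The averaging
    result is applied as a delta so those local steps are not thrown away."""
    for p, old, avg in zip(params, old_local, averaged):
        # p already contains gradient updates computed from the *old* weights,
        # which is the part that seems questionable to me.
        p.add_(avg - old)
```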
Describe the solution you’d like
https://arxiv.org/abs/2101.06840 proposes Delayed Parameter Update. Parameter update is delayed by one step. Apparently, it makes little difference in the training curve if DPU is applied after 40 iterations in BERT-large training.
I think that to implement DPU, you simply have to copy the averaged tensors back into the model at the beginning of `step()`.
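Roughly the flow I have in mind, as a hypothetical sketch (the class and method names below are made up and are not the TrainingAverager API):

```python
import torch

class DelayedParameterUpdateSketch:
    """One-step-delayed averaging: the result of the previous asynchronous
    all-reduce is copied into the model at the start of the next step."""

    def __init__(self, params, start_allreduce_async):
        self.params = list(params)
        # stand-in for a non-blocking averaging call (e.g. step(wait=False));
        # it should return a future whose .result() yields averaged tensors
        self.start_allreduce_async = start_allreduce_async
        self.pending = None

    @torch.no_grad()
    def step(self):
        # 1) apply the result of the *previous* round first (the one-step delay)
        if self.pending is not None:
            for p, avg in zip(self.params, self.pending.result()):
                p.copy_(avg)
        # 2) launch the next round in the background and return immediately
        self.pending = self.start_allreduce_async(self.params)
```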
Describe alternatives you’ve considered
I understand that if the weight difference is not added back, the local steps taken before the asynchronous all-reduce completes are wasted. Not only does that defeat the purpose of asynchronous all-reduce (if local updates are going to be thrown away until the async operation completes, why not just go synchronous?), but it also skips over input data, which could hurt training.
Top GitHub Comments
To whom it may concern: delayed parameter updates are enabled with `hivemind.Optimizer` via `delay_grad_averaging=True, delay_optimizer_step=True`.
Minimalistic example: benchmark_optimizer.py
More advanced usage examples (with full or partial DPU, at the user's discretion):
A more detailed API reference can be found here:
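For convenience, a minimal sketch of what that might look like (the run name, batch sizes, and model are placeholders, and I believe the delayed optimizer step needs `offload_optimizer=True`; check the API reference above for the exact requirements):

```python
import torch
import hivemind

model = torch.nn.Linear(10, 10)      # placeholder model
dht = hivemind.DHT(start=True)       # create or join the DHT

opt = hivemind.Optimizer(
    dht=dht,
    run_id="my_dpu_run",                              # placeholder experiment name
    params=model.parameters(),
    optimizer=lambda params: torch.optim.Adam(params, lr=1e-3),
    target_batch_size=4096,                           # placeholder global batch size
    batch_size_per_step=32,                           # placeholder local batch size
    offload_optimizer=True,                           # keep an offloaded copy for the background step
    delay_optimizer_step=True,                        # run the optimizer step in the background
    delay_grad_averaging=True,                        # overlap gradient averaging with compute (full DPU)
    verbose=True,
)

# then train as usual: loss.backward(); opt.step()
```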
Currently, DPU requires installing hivemind from the GitHub repo, i.e.
pip install https://github.com/learning-at-home/hivemind/archive/master.zip
It will be available from PyPI after v1.0.0 is released, which is to say “sometime very soon”
If you have any other questions, feel free to open another issue or join our discord channel (link above)
That might indeed be the case. In our DPU experiments, we enabled it early on, during the initial LR warmup, so the learning rate was still very small. That might have allowed DPU to phase in without a significant performance drawdown.
P.S. I finished the rest of my backlog yesterday and am now working on making DPU work in hivemind master. I'd still appreciate it if you have time to chat a little to better coordinate our efforts (we can meet on Discord or whichever other means of communication you prefer). Anyway, I'll post updates to this thread as soon as I make any meaningful progress (within <=96h).