Trouble with the backward pass in ZeRO 3

I have a custom Megatron model and a corresponding custom DeepSpeed. I believe that I have incorporated your recent update correctly, but when I try to train a ZeRO 3 model I get the error: RuntimeError: The size of tensor a (171) must match the size of tensor b (169) at non-singleton dimension 0.

When I turn off CPU Adam, I instead get this error: RuntimeError: start (0) + length (174763) exceeds dimension size (174761)

I notice in both cases the shape of a tensor seems to be off by 2, but I have no idea what’s causing this. My code is overall extremely similar to yours, though as I note at https://github.com/microsoft/DeepSpeedExamples/issues/92 I cannot get your code to run either (though for different reasons).

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 18 (8 by maintainers)

Top GitHub Comments

1 reaction
StellaAthena commented, Mar 16, 2021

Look at that beautiful learning curve! The problem was on our end: we were handling non-pipeline models incorrectly. Once we got that fixed, ZeRO-3 ran straight away. The model is still not as efficient as I had hoped (6.1e12 FLOP/s per GPU), but this is with extremely unoptimized settings. Time to do benchmarking!

[Image: screenshot of the learning curve]

1 reaction
samyam commented, Mar 12, 2021

@StellaAthena the model size you can run depends on how much CPU memory you have with Offload. Generally, a 10B-parameter model will take about 200 GB of CPU memory with offload. If you can give some more details on the system you are running (exact number of GPUs on a node, number of nodes, exact amount of CPU memory per node), I can give you an estimate of the max model size you should be able to run with Z3 Offload.
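The rule of thumb above works out to roughly 20 bytes of CPU memory per parameter. A rough sketch of that arithmetic follows. It assumes the cost scales linearly with parameter count and that 1 GB means 1e9 bytes; neither assumption is stated in the thread, the function names are hypothetical, and real usage depends on the optimizer, precision, and offload settings.

```python
# Derived from the quoted rule of thumb: 10B parameters ~ 200 GB of CPU memory.
# Assumption: linear scaling and 1 GB = 1e9 bytes (not stated in the thread).
BYTES_PER_PARAM = 200e9 / 10e9  # about 20 bytes per parameter


def estimate_cpu_memory_gb(num_params: float) -> float:
    """Approximate CPU memory (GB) needed to offload a model of this size."""
    return num_params * BYTES_PER_PARAM / 1e9


def max_params_for_cpu_memory(cpu_memory_gb: float) -> float:
    """Approximate largest parameter count that fits in the given CPU memory."""
    return cpu_memory_gb * 1e9 / BYTES_PER_PARAM


if __name__ == "__main__":
    print(f"10B params  -> ~{estimate_cpu_memory_gb(10e9):.0f} GB of CPU memory")
    print(f"400 GB host -> ~{max_params_for_cpu_memory(400) / 1e9:.0f}B params")
```

Treat the result as a first guess only; the maintainer asks for the exact system details because the real limit depends on them.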

Regarding your port of Z3, I think the issue might be that you are initializing some of the embedding parameters outside the class where the parameters were created. To do it correctly, you need to first gather those parameters before initializing them, as shown here: https://github.com/microsoft/DeepSpeedExamples/blob/20ea07a2a069696abec212e25476a9bf76aced70/Megatron-LM-v1.1.5-ZeRO3/megatron/model/language_model.py#L133.

This was the very last step of our tutorial: https://www.deepspeed.ai/tutorials/zero/#training-trillion-scale-models-with-zero-3-offload so it's easy to miss.

From a cursory look at your code base, the places where you need to make this change are here: https://github.com/EleutherAI/gpt-neox/blob/630575ff1b84e491921da616ca5e3c34eb02d865/megatron/model/language_model.py#L162 https://github.com/EleutherAI/gpt-neox/blob/630575ff1b84e491921da616ca5e3c34eb02d865/megatron/model/language_model.py#L175

If there are other places where you access parameters outside of the module where they were created, then you need to do the gather there as well, unless the access happens in the forward pass. In that case it is handled by register_external_parameter.
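The gather-before-initialize pattern described above can be sketched as a small helper. This is an illustrative sketch, not code from either repository: `init_outside_owner` and `make_recording_gather` are hypothetical names, and the default path assumes DeepSpeed is installed and the model was built under ZeRO stage 3, where `deepspeed.zero.GatheredParameters` makes the full parameter available inside the `with` block.

```python
from contextlib import contextmanager


def init_outside_owner(param, init_fn, gather=None):
    """Run init_fn(param) while the full, unpartitioned parameter is available.

    `gather` is a callable that returns a context manager for `param`.
    If it is None, DeepSpeed's GatheredParameters is used (assumes DeepSpeed
    is installed and the parameter is partitioned by ZeRO stage 3).
    """
    if gather is None:
        import deepspeed  # imported lazily so this helper loads without it

        def gather(p):
            # modifier_rank=0: rank 0's in-place changes are broadcast to the
            # other ranks when the context exits.
            return deepspeed.zero.GatheredParameters(p, modifier_rank=0)

    with gather(param):
        init_fn(param)
    return param


def make_recording_gather(log):
    """Stand-in for DeepSpeed, for illustration: records enter and exit."""

    @contextmanager
    def _gather(param):
        log.append("gather")
        try:
            yield param
        finally:
            log.append("release")

    return _gather
```

In a model this would be called as something like `init_outside_owner(self.position_embeddings.weight, self.init_method)`, where both attribute names are placeholders for whatever the module actually defines.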

Please let us know if this fixes your issue.
