Trouble with the backward pass in ZeRO 3
I have a custom Megatron model and a corresponding custom DeepSpeed. I believe that I have incorporated your recent update correctly, but when I try to train a ZeRO 3 model I get the error: RuntimeError: The size of tensor a (171) must match the size of tensor b (169) at non-singleton dimension 0.
When I turn off CPU Adam, I instead get this error: RuntimeError: start (0) + length (174763) exceeds dimension size (174761)
I notice in both cases the shape of a tensor seems to be off by 2, but I have no idea what’s causing this. My code is overall extremely similar to yours, though as I note at https://github.com/microsoft/DeepSpeedExamples/issues/92 I cannot get your code to run either (though for different reasons).
Issue Analytics
- Created 3 years ago
- Comments:18 (8 by maintainers)
Top GitHub Comments
Look at that beautiful learning curve! The problem was on our end: we were handling non-pipeline models incorrectly. Once we got that fixed, ZeRO-3 ran straight away. The model is still not as efficient as I had hoped (6.1e12 FLOP/s per GPU), but this is with extremely unoptimized settings. Time to do benchmarking!
@StellaAthena the model size you can run depends on how much CPU memory you have with Offload. Generally, a 10B-parameter model will take about 200 GB of CPU memory with offload. If you can give some more details on the system you are running (exact number of GPUs on a node, number of nodes, exact amount of CPU memory per node), I can give you an estimate of the maximum model size you should be able to run with Z3 Offload.
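As a rough illustration of the rule of thumb above: 200 GB for 10B parameters works out to about 20 bytes of CPU memory per parameter. The sketch below extrapolates linearly from that single data point, which is an assumption and not something stated in this thread. The function names are hypothetical, GB is taken as 10^9 bytes, and real usage will depend on the configuration.

```python
# Rule of thumb from the comment above: ~200 GB of CPU memory for a 10B-parameter model.
# ASSUMPTION: memory scales linearly with parameter count; GB means 1e9 bytes.
BYTES_PER_PARAM = 200e9 / 10e9  # 20 bytes of CPU memory per parameter


def estimate_cpu_memory_gb(num_params):
    """Estimate the CPU memory (in GB) needed to train num_params with Z3 Offload."""
    return num_params * BYTES_PER_PARAM / 1e9


def max_params_for_cpu_memory(cpu_memory_gb):
    """Estimate the largest parameter count that fits in cpu_memory_gb of CPU memory."""
    return cpu_memory_gb * 1e9 / BYTES_PER_PARAM


# Example: a 10B-parameter model, and a hypothetical system with 1000 GB of CPU memory.
needed_gb = estimate_cpu_memory_gb(10e9)
largest_model = max_params_for_cpu_memory(1000)
```

This gives only a first-order estimate; the maintainer's request for exact system details suggests the real limit depends on more than total CPU memory.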
Regarding your port of Z3, I think the issue might be that you are initializing some of the embedding parameters outside the class where the parameters were created. To do it correctly, you need to first gather those parameters before initializing them, as shown here: https://github.com/microsoft/DeepSpeedExamples/blob/20ea07a2a069696abec212e25476a9bf76aced70/Megatron-LM-v1.1.5-ZeRO3/megatron/model/language_model.py#L133.
This was the very last step of our tutorial: https://www.deepspeed.ai/tutorials/zero/#training-trillion-scale-models-with-zero-3-offload so it's easy to miss.
From a cursory look at your code base, the places where you need to make these changes are here: https://github.com/EleutherAI/gpt-neox/blob/630575ff1b84e491921da616ca5e3c34eb02d865/megatron/model/language_model.py#L162 https://github.com/EleutherAI/gpt-neox/blob/630575ff1b84e491921da616ca5e3c34eb02d865/megatron/model/language_model.py#L175
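A minimal sketch of the gather-before-initialize pattern described above, using the deepspeed.zero.GatheredParameters context manager. The helper name init_partitioned_param is hypothetical, and the lazy import is only there so the helper can be defined on a machine without DeepSpeed installed; the linked Megatron example is the authoritative version.

```python
def init_partitioned_param(param, init_method, modifier_rank=0):
    """Gather a ZeRO-3 partitioned parameter, initialize it, then let it re-partition.

    param:         a parameter created under ZeRO-3 (for example an embedding weight)
    init_method:   a callable that initializes the tensor in place
    modifier_rank: the rank whose modifications are broadcast when the context exits
    """
    import deepspeed  # imported lazily; assumes DeepSpeed is installed at call time

    # Inside the context the full (unpartitioned) parameter is available.
    with deepspeed.zero.GatheredParameters(param, modifier_rank=modifier_rank):
        init_method(param)
    # On exit the parameter is partitioned across ranks again.


# Hypothetical usage inside a Megatron-style module's __init__:
#   init_partitioned_param(self.position_embeddings.weight, self.init_method)
```

Initializing without the gather would operate only on the local partition, which is one plausible way to end up with tensors whose sizes do not match what ZeRO 3 expects.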
If there are other places where you access parameters outside of the module where they were created, then you need to do the gather there as well, unless the access is in the forward pass. In that case it will be handled by register_external_parameter.
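For the forward-pass case, a hedged sketch of registering a parameter owned by one module with another module that uses it, via deepspeed.zero.register_external_parameter(module, parameter). The helper name use_external_weight is hypothetical, and the import is again lazy so the snippet can be defined without DeepSpeed installed.

```python
def use_external_weight(consumer_module, owner_param):
    """Tell ZeRO-3 that consumer_module reads owner_param in its forward pass.

    consumer_module: the module whose forward() accesses the parameter
    owner_param:     a parameter that was created in a different module
    """
    import deepspeed  # imported lazily; assumes DeepSpeed is installed at call time

    # After registration, ZeRO-3 gathers owner_param around consumer_module's forward pass.
    deepspeed.zero.register_external_parameter(consumer_module, owner_param)
    return owner_param


# Hypothetical usage in a model that ties output logits to the input embedding:
#   use_external_weight(self, self.embedding.word_embeddings.weight)
```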
Please let us know if this fixes your issue.