
Trouble with the backward pass in ZeRO 3


I have a custom Megatron model and a corresponding custom DeepSpeed. I believe that I have incorporated your recent update correctly, but when I try to train a ZeRO 3 model I get this error: RuntimeError: The size of tensor a (171) must match the size of tensor b (169) at non-singleton dimension 0.

When I turn off CPU Adam, I instead get this error: RuntimeError: start (0) + length (174763) exceeds dimension size (174761)

I notice that in both cases the shape of a tensor seems to be off by 2, but I have no idea what's causing this. My code is overall extremely similar to yours, though, as I note at https://github.com/microsoft/DeepSpeedExamples/issues/92, I cannot get your code to run either (for different reasons).


Top GitHub Comments

StellaAthena commented, Mar 16, 2021

Look at that beautiful learning curve! The problem was on our end: we were handling non-pipeline models incorrectly. Once we got that fixed, ZeRO-3 ran straight away. The model is still not as efficient as I had hoped (6.1e12 FLOPs/s/GPU), but this is with extremely unoptimized settings. Time to do benchmarking!

[Image: screenshot of the learning curve]

samyam commented, Mar 12, 2021

@StellaAthena the model size you can run depends on how much CPU memory you have with Offload. Generally, a 10B parameter model will take about 200 GB of CPU memory with offload. If you can give some more details on the system you are running (exact number of GPUs on a node, number of nodes, exact amount of CPU memory per node), I can give you an estimate of the max model size you should be able to run with Z3 Offload.
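The rule of thumb above (about 200 GB of CPU memory for a 10B parameter model) works out to roughly 20 bytes per parameter. A minimal sketch of that arithmetic follows. It assumes the cost scales linearly with parameter count and that "GB" means 10^9 bytes; neither is stated in the comment, and the function names are hypothetical, so treat the result as a rough estimate only.

```python
# Rule of thumb from the comment: ~200 GB of CPU memory for a 10B parameter model.
# ASSUMPTION: memory scales linearly with parameter count, and 1 GB = 1e9 bytes.
BYTES_PER_PARAM = 200e9 / 10e9  # about 20 bytes per parameter


def estimate_offload_cpu_memory_gb(num_params):
    """Rough CPU memory (GB) needed for a model with `num_params` parameters."""
    return num_params * BYTES_PER_PARAM / 1e9


def estimate_max_params(cpu_memory_gb):
    """Rough upper bound on parameter count for a given amount of CPU memory (GB)."""
    return cpu_memory_gb * 1e9 / BYTES_PER_PARAM


if __name__ == "__main__":
    print(estimate_offload_cpu_memory_gb(10e9))
    print(estimate_max_params(400))
```

The real limit also depends on the details samyam asks for (GPUs per node, number of nodes, CPU memory per node), which this sketch ignores.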

Regarding your port of Z3, I think the issue might be that you are initializing some of the embedding parameters outside the class where the parameters were created. To do it correctly, you need to first gather those parameters before initializing them, as shown here: https://github.com/microsoft/DeepSpeedExamples/blob/20ea07a2a069696abec212e25476a9bf76aced70/Megatron-LM-v1.1.5-ZeRO3/megatron/model/language_model.py#L133.

This was the very last step of our tutorial: https://www.deepspeed.ai/tutorials/zero/#training-trillion-scale-models-with-zero-3-offload so it's easy to miss.

From a cursory look at your code base, the places where you need to make these changes are here: https://github.com/EleutherAI/gpt-neox/blob/630575ff1b84e491921da616ca5e3c34eb02d865/megatron/model/language_model.py#L162 and https://github.com/EleutherAI/gpt-neox/blob/630575ff1b84e491921da616ca5e3c34eb02d865/megatron/model/language_model.py#L175

If there are other places where you access parameters outside of the module where they were created, then you need to do the Gather there as well, unless the access is in the forward pass. In that case it will be handled by the register_external_parameters.
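The gather-before-initialize pattern described above can be sketched roughly as follows. This is an illustration, not the code from either repository: the helper names gather_if_zero3 and init_outside_owner and the zero3_enabled flag are hypothetical. The only DeepSpeed call used is deepspeed.zero.GatheredParameters with its modifier_rank argument; for parameters touched in the forward pass, the comment's register_external_parameters appears to refer to deepspeed.zero.register_external_parameter, which is not shown here.

```python
import contextlib


def gather_if_zero3(params, zero3_enabled):
    """Return a context in which `params` are fully materialized.

    Hypothetical helper: `zero3_enabled` is an assumed flag, not a DeepSpeed option.
    """
    if not zero3_enabled:
        # Without ZeRO-3 the parameters are never partitioned, so nothing to gather.
        return contextlib.nullcontext()
    # Imported lazily so the sketch also runs where DeepSpeed is not installed.
    import deepspeed
    # modifier_rank=0: modifications made by rank 0 inside the context are
    # broadcast to the other ranks, and the parameter is re-partitioned on exit.
    return deepspeed.zero.GatheredParameters(params, modifier_rank=0)


def init_outside_owner(param, init_fn, zero3_enabled):
    """Initialize a parameter from outside the module that created it."""
    with gather_if_zero3(param, zero3_enabled):
        init_fn(param)
```

In a model constructor, a call might look like init_outside_owner(self.position_embeddings.weight, self.init_method, zero3_enabled=True), where the attribute names are placeholders for whatever the model actually defines.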

Please let us know if this fixes your issue.
