Stage 3 shape/dimension issues
Edit: Okay, I have zero clue why this is happening, but at this point I can only assume it has to do with using CUDA 11 and sparse attention. I've disabled literally everything, including FusedAdam and cpu_offload, and I even reinstalled DeepSpeed, but this issue persists.
Wouldn't be the first time I've had strange errors because of a somehow borked CUDA install, either…
On the other hand, the issue does at least seem closely related to the linked DeepSpeed issue, so I'll leave this up until I know more.
Original post:
Perhaps it has to do with this: https://github.com/microsoft/DeepSpeed/issues/828
When enabling DeepSpeed stage 3 and using their "FusedAdam" optimizer (the fast fused implementation) instead of passing in the normal Adam optimizer, I get the following stack trace:
Traceback (most recent call last):
  File "train_dalle.py", line 331, in <module>
    loss = distr_dalle(text, images, return_loss = True)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 914, in forward
    loss = self.module(*inputs, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/root/DALLE-pytorch/dalle_pytorch/dalle_pytorch.py", line 459, in forward
    image = self.vae.get_codebook_indices(image)
  File "/opt/conda/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/root/DALLE-pytorch/dalle_pytorch/vae.py", line 152, in get_codebook_indices
    _, _, [_, _, indices] = self.model.encode(img)
  File "/root/.local/lib/python3.8/site-packages/taming_transformers-0.0.1-py3.8.egg/taming/models/vqgan.py", line 54, in encode
    quant, emb_loss, info = self.quantize(h)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/root/.local/lib/python3.8/site-packages/taming_transformers-0.0.1-py3.8.egg/taming/modules/vqvae/quantize.py", line 42, in forward
    torch.sum(self.embedding.weight**2, dim=1) - 2 * \
IndexError: Dimension out of range (expected to be in range of [-1, 0], but got 1)
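A hedged reproduction of the failing line (my assumption, not confirmed anywhere in this thread): under ZeRO stage 3, parameters that have not been gathered on the current rank show up as empty tensors of shape torch.Size([0]), and reducing such a tensor over dim=1 raises exactly the IndexError seen in the traceback above.

```python
import torch

# Stand-in for a partitioned self.embedding.weight under ZeRO stage 3
# (assumption: the partitioned parameter appears as an empty 1-D tensor).
weight = torch.zeros(0)

try:
    # Mirrors quantize.py line 42: torch.sum(self.embedding.weight**2, dim=1)
    torch.sum(weight ** 2, dim=1)
except IndexError as err:
    # A 1-D tensor only has dims [-1, 0], so dim=1 is out of range.
    print(err)
```

If this is what is happening, the VQGAN's codebook weight was partitioned away before `quantize.forward` ran, which would fit the external-parameter discussion below.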
The input_mask that we currently support in our implementation must have the shape (batch_size, 1, 1, sequence_length). This means the mask can differ for different inputs in the batch.
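A minimal sketch of building a mask with that shape (the variable names and lengths here are illustrative, not from DALLE-pytorch): the two singleton dimensions let the per-sequence mask broadcast over attention heads and query positions.

```python
import torch

batch_size, heads, seq_len = 2, 4, 8

# Per-sequence valid lengths: 1 = attend, 0 = padding.
lengths = torch.tensor([8, 5])
mask = (torch.arange(seq_len)[None, :] < lengths[:, None]).float()  # (B, S)

# Reshape to (batch_size, 1, 1, sequence_length) as the comment requires.
input_mask = mask[:, None, None, :]

# Broadcasts against attention scores of shape (B, heads, S, S).
scores = torch.randn(batch_size, heads, seq_len, seq_len)
masked = scores.masked_fill(input_mask == 0, float('-inf'))
```

Because the batch dimension is kept, each input in the batch can carry its own padding pattern, which is the point of the shape requirement above.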
Issue Analytics
- State:
- Created 2 years ago
- Reactions: 1
- Comments: 5 (5 by maintainers)
Top GitHub Comments
May also be related to external parameters in DeepSpeed. I’ve opened an issue asking about these but haven’t gotten an answer yet. (EDIT: They answered.)
Since you’re getting an error, I’m assuming these are problems with external parameters finally showing up. 😃
DeepSpeed 0.3.15 now automatically detects external parameters, but #207 adds manual external-parameter registration anyway.
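For context, a plain-PyTorch sketch of the access pattern that makes a parameter "external" under ZeRO stage 3: a forward pass reuses a submodule's weight after that submodule's own forward has finished (weight tying is the classic case). The module and names below are hypothetical, not taken from DALLE-pytorch.

```python
import torch
import torch.nn as nn

class TiedLM(nn.Module):
    def __init__(self, vocab_size, dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.body = nn.Linear(dim, dim)
        # Before automatic detection, the manual registration added by #207
        # would use DeepSpeed's documented API, roughly:
        #   deepspeed.zero.register_external_parameter(self, self.embed.weight)

    def forward(self, tokens):
        h = self.body(self.embed(tokens))
        # Weight tying: this line touches self.embed.weight outside the
        # Embedding's own forward, which ZeRO-3 must be told about so the
        # partitioned weight gets re-gathered before use.
        return h @ self.embed.weight.t()

model = TiedLM(vocab_size=10, dim=4)
logits = model(torch.tensor([[1, 2, 3]]))  # shape (1, 3, 10)
```

Without registration (or the 0.3.15 auto-detection), the tied weight would be an empty partitioned tensor at that point, producing shape errors like the one in the traceback above.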