DeepSpeed sparse attention error
Hello!
I've been using DeepSpeed sparse attention without problems, but I've been getting the error below since the tag/0.0.59 release.
Traceback (most recent call last):
  File "train_DALLE.py", line 300, in <module>
    main()
  File "train_DALLE.py", line 219, in main
    loss = dalle(caption_tokens, images, mask=mask, return_loss=True)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/shared/workspace/torch_research/text-to-image/dalle-pytorch/models/model_arch.py", line 421, in forward
    out = self.transformer(tokens, mask = mask)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/dalle_pytorch/transformer.py", line 106, in forward
    return self.layers(x, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/dalle_pytorch/reversible.py", line 139, in forward
    x = x + f(x, **f_args)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/dalle_pytorch/transformer.py", line 34, in forward
    return self.fn(self.norm(x), **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/dalle_pytorch/attention.py", line 314, in forward
    if self.noncausal_attn_len:
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 779, in __getattr__
    type(self).__name__, name))
torch.nn.modules.module.ModuleAttributeError: 'SparseAttention' object has no attribute 'noncausal_attn_len'
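For context on why this particular exception appears: nn.Module overrides __getattr__ to look up only parameters, buffers, and submodules, so reading a plain attribute that was never assigned raises ModuleAttributeError. A minimal repro sketch (the class below is a stand-in for illustration, not the real dalle_pytorch SparseAttention):

    import torch
    import torch.nn as nn

    class SparseAttention(nn.Module):  # stand-in, not the dalle_pytorch class
        def __init__(self):
            super().__init__()
            # note: self.noncausal_attn_len is never assigned here

        def forward(self, x):
            # nn.Module.__getattr__ only searches parameters, buffers, and
            # submodules, so this lookup raises ModuleAttributeError
            # (plain AttributeError on newer torch versions)
            if self.noncausal_attn_len:
                pass
            return x

    SparseAttention()(torch.zeros(1))  # raises the same error as the traceback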
I guess the error occurs because self.noncausal_attn_len was removed from the Attention class in this commit, while SparseAttention.forward still reads it: https://github.com/lucidrains/DALLE-pytorch/commit/95ce537dcbe1afb06fed405008afc642233a5199
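A hedged sketch of the kind of fix that would avoid the crash: either restore the field on SparseAttention or guard the lookup with getattr. Whether the library adopted either approach is an assumption on my part; the class below is illustrative only.

    import torch.nn as nn

    class SparseAttention(nn.Module):  # illustrative stand-in
        def __init__(self, noncausal_attn_len = 0):
            super().__init__()
            # keep the field the forward path still reads, defaulting to
            # fully causal attention when no non-causal prefix is given
            self.noncausal_attn_len = noncausal_attn_len

        def forward(self, x):
            # alternatively, guard the access so a module instance that
            # never set the field still works:
            noncausal_attn_len = getattr(self, 'noncausal_attn_len', 0)
            if noncausal_attn_len:
                pass  # handle the non-causal (text) prefix here
            return x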
I hope DeepSpeed sparse attention will continue to be supported, because it halves the model size, gives roughly a 2x speedup, and the final performance seems better 😃
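For anyone trying to reproduce this setup, a minimal sketch of enabling sparse attention in dalle-pytorch: the DiscreteVAE/DALLE hyperparameters and the sparse_attn flag reflect the library around the time of this issue and are assumptions to check against your installed version (a GPU and a DeepSpeed build with sparse-attention support are also required).

    from dalle_pytorch import DiscreteVAE, DALLE

    vae = DiscreteVAE(
        image_size = 256,
        num_layers = 3,
        num_tokens = 8192,
        codebook_dim = 512,
        hidden_dim = 64
    )

    dalle = DALLE(
        dim = 512,
        vae = vae,
        num_text_tokens = 10000,
        text_seq_len = 256,
        depth = 16,
        heads = 8,
        sparse_attn = True  # swap dense attention for DeepSpeed SparseAttention
    )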
Issue Analytics
- Created: 3 years ago
- Reactions: 1
- Comments: 11 (11 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
@lucidrains Thank you for the quick response! Here are my brief experimental results.
VAE
I did not use resnet blocks for the discrete VAE because of this issue: https://github.com/lucidrains/DALLE-pytorch/issues/10#issuecomment-758038919. I tried not to spend too much time on the VAE and focused on DALLE instead, so I cannot say much about resnet blocks in the VAE.
VAE model
Result
(the VAE configuration and reconstruction images attached to the original comment are not preserved here)

DALLE on CUB200 (Small dataset)
Model and training: ReduceLROnPlateau (a generic setup is sketched after this outline)
Result
During training
Test with random text
(training curves and generated samples are not preserved here)

DALLE on COCO (Medium dataset)
(results are not preserved here)
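Since the training setup above mentions ReduceLROnPlateau, here is a minimal sketch of that schedule in PyTorch; the stand-in model, learning rate, and scheduler hyperparameters are illustrative assumptions, not the commenter's actual settings.

    import torch
    import torch.nn as nn
    from torch.optim.lr_scheduler import ReduceLROnPlateau

    model = nn.Linear(16, 16)  # stand-in for the DALLE model (assumption)
    optimizer = torch.optim.Adam(model.parameters(), lr = 3e-4)
    # halve the learning rate once the monitored loss stops improving
    scheduler = ReduceLROnPlateau(optimizer, mode = 'min', factor = 0.5, patience = 10)

    for epoch in range(100):
        optimizer.zero_grad()
        loss = model(torch.randn(8, 16)).pow(2).mean()  # dummy loss for illustration
        loss.backward()
        optimizer.step()
        scheduler.step(loss.item())  # scheduler reacts to this epoch's loss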
@lucidrains Awesome! I will try again with sparse attention and gradient clipping 😃
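For completeness, gradient clipping here typically means PyTorch's built-in utility; a minimal sketch with a stand-in model and an illustrative max_norm value (both assumptions):

    import torch
    import torch.nn as nn

    model = nn.Linear(10, 10)        # stand-in for the DALLE model
    optimizer = torch.optim.Adam(model.parameters(), lr = 3e-4)

    x = torch.randn(8, 10)
    loss = model(x).pow(2).mean()    # dummy loss for illustration
    loss.backward()

    # clip the global gradient norm before the optimizer step;
    # max_norm = 0.5 is an illustrative value, not the commenter's setting
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm = 0.5)
    optimizer.step()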