
Deepspeed sparse attention error

See original GitHub issue

Hello! I’ve been using DeepSpeed sparse attention without any problems, but I’ve been getting this error since the tag/0.0.59 release.

Traceback (most recent call last):
  File "train_DALLE.py", line 300, in <module>
    main()
  File "train_DALLE.py", line 219, in main
    loss = dalle(caption_tokens, images, mask=mask, return_loss=True)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/shared/workspace/torch_research/text-to-image/dalle-pytorch/models/model_arch.py", line 421, in forward
    out = self.transformer(tokens, mask = mask)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/dalle_pytorch/transformer.py", line 106, in forward
    return self.layers(x, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/dalle_pytorch/reversible.py", line 139, in forward
    x = x + f(x, **f_args)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/dalle_pytorch/transformer.py", line 34, in forward
    return self.fn(self.norm(x), **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/dalle_pytorch/attention.py", line 314, in forward
    if self.noncausal_attn_len:
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 779, in __getattr__
    type(self).__name__, name))
torch.nn.modules.module.ModuleAttributeError: 'SparseAttention' object has no attribute 'noncausal_attn_len'

I guess the error occurs because self.noncausal_attn_len was removed from the Attention class in this update: https://github.com/lucidrains/DALLE-pytorch/commit/95ce537dcbe1afb06fed405008afc642233a5199
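As a temporary stop-gap until the pip package catches up (just my guess at a workaround, not the upstream fix), restoring a falsy noncausal_attn_len on the sparse attention modules should let forward() skip the removed branch. Here, dalle is the model instance from train_DALLE.py:

from dalle_pytorch.attention import SparseAttention

# Workaround sketch: SparseAttention.forward() still reads this attribute,
# so give it back a falsy value and the noncausal branch is simply skipped.
for module in dalle.modules():
    if isinstance(module, SparseAttention) and not hasattr(module, 'noncausal_attn_len'):
        module.noncausal_attn_len = 0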

I hope DeepSpeed sparse attention continues to be supported, because it cuts memory usage in half, speeds up training by about 2x, and the final performance seems better 😃

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Reactions: 1
  • Comments: 11 (11 by maintainers)

Top GitHub Comments

7 reactions
kobiso commented, Feb 14, 2021

@lucidrains Thank you for the quick response! Here are my brief experimental results.

VAE

I did not use resnet blocks for the discrete VAE because of this issue: https://github.com/lucidrains/DALLE-pytorch/issues/10#issuecomment-758038919. I tried not to spend too much time on the VAE and focused on DALLE instead, so I cannot say much about resnet blocks in the VAE.

VAE model

from dalle_pytorch import DiscreteVAE

vae = DiscreteVAE(
    image_size=128,
    num_layers=3,
    num_tokens=2048,      # codebook size
    codebook_dim=256,
    hidden_dim=128,
    temperature=0.9,      # gumbel-softmax temperature
).cuda()
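For reference, here is a minimal sketch of a training step for this VAE (only the model config above is exactly what I used; the optimizer settings and image_loader below are placeholders):

import torch

opt = torch.optim.Adam(vae.parameters(), lr=1e-3)    # placeholder optimizer/LR

for images in image_loader:                          # batches of (B, 3, 128, 128) tensors
    images = images.cuda()
    loss = vae(images, return_loss=True)             # DiscreteVAE returns its training loss
    opt.zero_grad()
    loss.backward()
    opt.step()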

Result

  • Dataset: CUB200
  • First row: GT
  • Second row: soft generation w/ gumbel softmax
  • Third row: hard generation w/o gumbel softmax

[image]

DALLE on CUB200 (Small dataset)

  • Training DALLE on a small dataset seems to work fine.

Model and training

  • Tips
    • DeepSpeed sparse attention is awesome: it requires half the memory, trains about 2x faster, and gives better performance
    • Learning rate decay helps: personally, I’m using ReduceLROnPlateau
    • To speed up training, Automatic Mixed Precision (AMP) is also helpful (a rough training-loop sketch follows the model definition below)
        from dalle_pytorch import DALLE

        dalle = DALLE(
            dim=256,
            vae=vae,
            num_text_tokens=7800,  # this is up to the dataset
            text_seq_len=128,      # this is up to the dataset
            depth=32,
            heads=16,
            dim_head=64,
            reversible=False,
            attn_types=('full', 'sparse')
        ).cuda()
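Here is a rough sketch of the training step with AMP and ReduceLROnPlateau put together (the learning rate, scheduler settings, and train_loader are placeholders, not my exact values):

import torch
from torch.cuda.amp import autocast, GradScaler

optimizer = torch.optim.Adam(dalle.parameters(), lr=3e-4)   # placeholder LR
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.5, patience=2)
scaler = GradScaler()

for epoch in range(num_epochs):
    epoch_loss = 0.0
    for caption_tokens, images, mask in train_loader:
        optimizer.zero_grad()
        with autocast():                                     # mixed-precision forward pass
            loss = dalle(caption_tokens, images, mask=mask, return_loss=True)
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()
        epoch_loss += loss.item()
    scheduler.step(epoch_loss)                               # decay LR when the loss plateaus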

Result

  • During training

    • First row: GT
    • Second row: generated images from each text (these are from the training set, so they could be overfitted)

      [image] [image]
  • Test with random text

    • It seems to generate bird-like images, but I don’t think the model actually understands the meaning of each word.
input_text = ["this small white bird has light gray primaries , and a pointed bill",
              "this small white bird has light gray primaries",
              "this small white bird",
              "this bird has a grey crown with a white belly and small black beak",
              "this bird has a grey crown",
              "this bird"
             ]
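The prompts are then tokenized and passed to dalle-pytorch's sampling helper. In the sketch below, tokenize is a stand-in for the dataset-specific tokenizer that produced the num_text_tokens=7800 vocabulary, and filter_thres=0.9 is just an illustrative value:

text_tokens = tokenize(input_text, seq_len=128).cuda()    # (6, text_seq_len) token ids
mask = text_tokens != 0                                   # assuming 0 is the padding id
images = dalle.generate_images(text_tokens, mask=mask, filter_thres=0.9)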

[image]

DALLE on COCO (Medium dataset)

  • I’m trying to train a DALLE model on the COCO dataset, but I ran into a NaN problem.
  • It is hard to debug because the NaN occurs in the middle of training (one way to catch it is sketched below).

[image]

  • Generated images before the NaN occurs

[image] [image]
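A sketch of one way to catch the NaN (the loop and train_loader here are simplified placeholders): enable autograd anomaly detection and stop as soon as the loss goes non-finite.

import torch

torch.autograd.set_detect_anomaly(True)   # slow, so only enable while hunting the NaN

for step, (caption_tokens, images, mask) in enumerate(train_loader):
    loss = dalle(caption_tokens, images, mask=mask, return_loss=True)
    if not torch.isfinite(loss):
        print(f"non-finite loss at step {step}")
        break
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()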

2 reactions
kobiso commented, Feb 15, 2021

@lucidrains Awesome! I will try again with the sparse attention and gradient clipping 😃
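For anyone following along, a minimal sketch of the gradient clipping step (max_norm=0.5 is just an illustrative value, not something from this thread):

import torch

loss = dalle(caption_tokens, images, mask=mask, return_loss=True)
loss.backward()
torch.nn.utils.clip_grad_norm_(dalle.parameters(), max_norm=0.5)   # clip before the optimizer step
optimizer.step()
optimizer.zero_grad()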

Read more comments on GitHub >

Top Results From Across the Web

DeepSpeed Sparse Attention
SparseSelfAttention : This module uses MatMul and Softmax kernels and generates Context Layer output given Query, Keys and Values. It is a ...
Read more >
deepspeed - PyPI
Extremely long sequence length: Sparse attention of DeepSpeed powers an order-of-magnitude longer input sequence and obtains up to 6x faster execution ...
Read more >
What is Microsoft Researcher's DeepSpeed? - LinkedIn
DeepSpeed is an open source deep learning optimization library for PyTorch. ... DeepSpeed offers sparse attention kernels—an instrumental ...
Read more >
Fit More and Train Faster With ZeRO via DeepSpeed and ...
DeepSpeed attacks this problem by managing GPU memory by itself and ensuring ... These include DeepSpeed Sparse Attention and 1-bit Adam, ...
Read more >
DeepSpeed: Extreme-scale model training for everyone
Powering 10x longer sequences and 6x faster execution through DeepSpeed Sparse Attention: DeepSpeed offers sparse attention kernels—an ...
Read more >
