
Deepspeed sparse attention error

See original GitHub issue

Hello! I’ve been using DeepSpeed sparse attention without any problems, but I’ve been getting this error since the tag/0.0.59 release.

Traceback (most recent call last):
  File "train_DALLE.py", line 300, in <module>
    main()
  File "train_DALLE.py", line 219, in main
    loss = dalle(caption_tokens, images, mask=mask, return_loss=True)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/shared/workspace/torch_research/text-to-image/dalle-pytorch/models/model_arch.py", line 421, in forward
    out = self.transformer(tokens, mask = mask)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/dalle_pytorch/transformer.py", line 106, in forward
    return self.layers(x, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/dalle_pytorch/reversible.py", line 139, in forward
    x = x + f(x, **f_args)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/dalle_pytorch/transformer.py", line 34, in forward
    return self.fn(self.norm(x), **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/dalle_pytorch/attention.py", line 314, in forward
    if self.noncausal_attn_len:
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 779, in __getattr__
    type(self).__name__, name))
torch.nn.modules.module.ModuleAttributeError: 'SparseAttention' object has no attribute 'noncausal_attn_len'

I guess the error occurs because self.noncausal_attn_len was removed from the Attention class in this update: https://github.com/lucidrains/DALLE-pytorch/commit/95ce537dcbe1afb06fed405008afc642233a5199
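As a temporary stop-gap until the pip package catches up (just my guess at a workaround, not the upstream fix), restoring a falsy noncausal_attn_len on the sparse attention modules should let forward() skip the removed branch. Here, dalle is the model instance from train_DALLE.py:

from dalle_pytorch.attention import SparseAttention

# Workaround sketch: SparseAttention.forward() still reads this attribute,
# so give it back a falsy value and the noncausal branch is simply skipped.
for module in dalle.modules():
    if isinstance(module, SparseAttention) and not hasattr(module, 'noncausal_attn_len'):
        module.noncausal_attn_len = 0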

I hope DeepSpeed sparse attention continues to be supported, because it cuts memory usage in half, speeds up training by about 2x, and the final performance seems better 😃

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Reactions: 1
  • Comments: 11 (11 by maintainers)

Top GitHub Comments

7 reactions
kobiso commented, Feb 14, 2021

@lucidrains Thank you for the quick response! Here are my brief experimental results.

VAE

I did not use resnet blocks for the discrete VAE because of this issue: https://github.com/lucidrains/DALLE-pytorch/issues/10#issuecomment-758038919. I tried not to spend too much time on the VAE and focused on DALLE instead, so I cannot say much about resnet blocks in the VAE.

VAE model

from dalle_pytorch import DiscreteVAE

vae = DiscreteVAE(
    image_size=128,
    num_layers=3,
    num_tokens=2048,      # codebook size
    codebook_dim=256,
    hidden_dim=128,
    temperature=0.9,      # gumbel-softmax temperature
).cuda()
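For reference, here is a minimal sketch of a training step for this VAE (only the model config above is exactly what I used; the optimizer settings and image_loader below are placeholders):

import torch

opt = torch.optim.Adam(vae.parameters(), lr=1e-3)    # placeholder optimizer/LR

for images in image_loader:                          # batches of (B, 3, 128, 128) tensors
    images = images.cuda()
    loss = vae(images, return_loss=True)             # DiscreteVAE returns its training loss
    opt.zero_grad()
    loss.backward()
    opt.step()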

Result

  • Dataset: CUB200
  • First row: GT
  • Second row: soft generation w/ gumbel softmax
  • Third row: hard generation w/o gumbel softmax

[image]

DALLE on CUB200 (Small dataset)

  • Training DALLE on a small dataset seems to work fine.

Model and training

  • Tips
    • DeepSpeed sparse attention is awesome: it requires half the memory, trains about 2x faster, and gives better performance
    • Learning rate decay helps: personally, I’m using ReduceLROnPlateau
    • To speed up training, Automatic Mixed Precision (AMP) is also helpful (a rough training-loop sketch follows the model definition below)
        from dalle_pytorch import DALLE

        dalle = DALLE(
            dim=256,
            vae=vae,
            num_text_tokens=7800,  # this is up to the dataset
            text_seq_len=128,      # this is up to the dataset
            depth=32,
            heads=16,
            dim_head=64,
            reversible=False,
            attn_types=('full', 'sparse')
        ).cuda()
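Here is a rough sketch of the training step with AMP and ReduceLROnPlateau put together (the learning rate, scheduler settings, and train_loader are placeholders, not my exact values):

import torch
from torch.cuda.amp import autocast, GradScaler

optimizer = torch.optim.Adam(dalle.parameters(), lr=3e-4)   # placeholder LR
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.5, patience=2)
scaler = GradScaler()

for epoch in range(num_epochs):
    epoch_loss = 0.0
    for caption_tokens, images, mask in train_loader:
        optimizer.zero_grad()
        with autocast():                                     # mixed-precision forward pass
            loss = dalle(caption_tokens, images, mask=mask, return_loss=True)
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()
        epoch_loss += loss.item()
    scheduler.step(epoch_loss)                               # decay LR when the loss plateaus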

Result

  • During training

    • First row: GT
    • Second row: generated images from each text (these are from the training set, so they could be overfitted)

      [image] [image]
  • Test with random text

    • It seems to generate bird-like images, but I don’t think the model actually understands the meaning of each word.
input_text = ["this small white bird has light gray primaries , and a pointed bill",
              "this small white bird has light gray primaries",
              "this small white bird",
              "this bird has a grey crown with a white belly and small black beak",
              "this bird has a grey crown",
              "this bird"
             ]
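The prompts are then tokenized and passed to dalle-pytorch's sampling helper. In the sketch below, tokenize is a stand-in for the dataset-specific tokenizer that produced the num_text_tokens=7800 vocabulary, and filter_thres=0.9 is just an illustrative value:

text_tokens = tokenize(input_text, seq_len=128).cuda()    # (6, text_seq_len) token ids
mask = text_tokens != 0                                   # assuming 0 is the padding id
images = dalle.generate_images(text_tokens, mask=mask, filter_thres=0.9)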

[image]

DALLE on COCO (Medium dataset)

  • I’m trying to train a DALLE model on the COCO dataset, but I ran into a NaN problem.
  • It is hard to debug because the NaN occurs in the middle of training (one way to catch it is sketched below).

[image]

  • Generated images before the NaN occurs

[image] [image]
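A sketch of one way to catch the NaN (the loop and train_loader here are simplified placeholders): enable autograd anomaly detection and stop as soon as the loss goes non-finite.

import torch

torch.autograd.set_detect_anomaly(True)   # slow, so only enable while hunting the NaN

for step, (caption_tokens, images, mask) in enumerate(train_loader):
    loss = dalle(caption_tokens, images, mask=mask, return_loss=True)
    if not torch.isfinite(loss):
        print(f"non-finite loss at step {step}")
        break
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()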

2 reactions
kobiso commented, Feb 15, 2021

@lucidrains Awesome! I will try again with the sparse attention and gradient clipping 😃
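For anyone following along, a minimal sketch of the gradient clipping step (max_norm=0.5 is just an illustrative value, not something from this thread):

import torch

loss = dalle(caption_tokens, images, mask=mask, return_loss=True)
loss.backward()
torch.nn.utils.clip_grad_norm_(dalle.parameters(), max_norm=0.5)   # clip before the optimizer step
optimizer.step()
optimizer.zero_grad()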

Read more comments on GitHub >

Top Results From Across the Web

DeepSpeed Sparse Attention
SparseSelfAttention : This module uses MatMul and Softmax kernels and generates Context Layer output given Query, Keys and Values. It is a ...
Read more >
deepspeed - PyPI
Extremely long sequence length: Sparse attention of DeepSpeed powers an order-of-magnitude longer input sequence and obtains up to 6x faster execution ...
Read more >
What is Microsoft Researcher's DeepSpeed? - LinkedIn
DeepSpeed is an open source deep learning optimization library for PyTorch. ... DeepSpeed offers sparse attention kernels—an instrumental ...
Read more >
Fit More and Train Faster With ZeRO via DeepSpeed and ...
DeepSpeed attacks this problem by managing GPU memory by itself and ensuring ... These include DeepSpeed Sparse Attention and 1-bit Adam, ...
Read more >
DeepSpeed: Extreme-scale model training for everyone
Powering 10x longer sequences and 6x faster execution through DeepSpeed Sparse Attention: DeepSpeed offers sparse attention kernels—an ...
Read more >
