
TransformerLayer input_mask format

I am trying to use the DeepSpeedTransformerLayer and wondering what format the attention mask should be for left-to-right language model training. From https://github.com/microsoft/DeepSpeed/blob/44bd538b110ce0e8fc69626854631c3aee0dc094/tests/unit/test_cuda_forward.py#L181, it seems like (bs, 1, seq_len, seq_len) could be correct,

but with input_size = torch.Size([1, 501, 512]) and input_mask.shape = [1, 501, 501], it raises:

    input_mask = torch.cat((input_mask,
                            torch.ones((inp_size[0], input_mask.shape[1], input_mask.shape[2],
                                        (16 - (inp_size[1] % 16))),
                                       device=input_mask.device, dtype=input_mask.dtype) * -10000), 3)

    IndexError: Dimension out of range (expected to be in range of [-3, 2], but got 3)

There is no docstring, so I figured I’d ask. Thanks!
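
For context, the IndexError is consistent with the mask needing four dimensions: the line above pads the mask along dim 3 with -10000 entries so its last dimension matches the sequence length rounded up to a multiple of 16, and a 3-D mask simply has no dim 3 to concatenate on. A minimal sketch of the shape issue, using only the shapes from the report (the rest is illustrative, not DeepSpeed’s own code):

    import torch

    bs, seq_len = 1, 501

    # The 3-D mask from the report: (bs, seq_len, seq_len)
    mask_3d = torch.zeros(bs, seq_len, seq_len)

    # Concatenating along dim 3 on a 3-D tensor reproduces the error above.
    try:
        torch.cat((mask_3d, mask_3d), 3)
    except IndexError as e:
        print(e)  # Dimension out of range (expected to be in range of [-3, 2], but got 3)

    # A (bs, 1, seq_len, seq_len) mask does have a dim 3, so the padding in the
    # snippet above can round its last dimension up from 501 to 512.
    mask_4d = mask_3d.unsqueeze(1)
    pad = torch.full((bs, 1, seq_len, 16 - seq_len % 16), -10000.0)
    print(torch.cat((mask_4d, pad), 3).shape)  # torch.Size([1, 1, 501, 512])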

Issue Analytics

  • State: open
  • Created 3 years ago
  • Comments: 19 (10 by maintainers)

Top GitHub Comments

2 reactions
RezaYazdaniAminabadi commented, Apr 28, 2021

Hi @hwidong-na,

Yes, you are right, the unit test currently does not check this type of masking. I will soon add a case for that.

Thanks,
Reza

2 reactions
sshleifer commented, Apr 15, 2021

I got it working with (1, 1, seq_len, seq_len) (see the sketch after this comment)! It seems to be faster and to use less memory in early benchmarks. I am very grateful 😃.

This is obviously out of scope for this issue, but I was wondering whether it’s possible to skip the FFN layers at the end of the transformer block and/or set them to identity, and use my own custom FFN layers after your very fast attention?
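
For reference, here is a minimal sketch of a left-to-right (causal) mask in that (1, 1, seq_len, seq_len) layout. It uses the additive-mask convention suggested by the padding code in the question (0 where attention is allowed, -10000 where it is blocked); whether the DeepSpeed kernel expects exactly this convention and value is an assumption, so treat it as illustrative:

    import torch

    def causal_mask(seq_len, device=None, dtype=torch.float32):
        # Lower-triangular pattern: position i may attend only to positions j <= i.
        allowed = torch.tril(torch.ones(seq_len, seq_len, device=device, dtype=dtype)).bool()
        # Additive mask: 0 for allowed positions, -10000 for blocked ones.
        mask = torch.zeros(seq_len, seq_len, device=device, dtype=dtype)
        mask = mask.masked_fill(~allowed, -10000.0)
        return mask.view(1, 1, seq_len, seq_len)

    mask = causal_mask(501)
    print(mask.shape)  # torch.Size([1, 1, 501, 501])

If the kernel broadcasts the mask over the batch dimension, as the (1, 1, seq_len, seq_len) shape in the comment above suggests, a single mask like this can be shared across the whole batch.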

Top Results From Across the Web

Transformer — PyTorch 1.13 documentation
A transformer model. User is able to modify the attributes as needed. The architecture is based on the paper “Attention Is All You...

merlin.models.tf.MaskedLanguageModeling — Merlin Models ...
During training, the Transformer layer is allowed to use positions on the right (future ... Retrieves the input mask tensor(s) of a layer...

Masking in Transformers' self-attention mechanism - Medium
Masking is needed to prevent the attention mechanism of a transformer from “cheating” in the decoder when training (on a translating task for...

joeynmt.transformer_layers — Joey NMT 1.2 documentation
Source code for joeynmt.transformer_layers. # -*- coding: utf-8 -*- import math import torch import torch.nn ...

Working with Input Masks - Tassos Marinos
Format: 999999. The Input Mask field on the back-end of your form: convert forms input mask example field. On the front-end, the ...
