TransformerLayer input_mask format
See original GitHub issue

I am trying to use the DeepSpeedTransformerLayer and am wondering what format the attention mask should take for left-to-right language-model training. From https://github.com/microsoft/DeepSpeed/blob/44bd538b110ce0e8fc69626854631c3aee0dc094/tests/unit/test_cuda_forward.py#L181 it looks like (bs, 1, seq_len, seq_len) could be correct, but with input_size: torch.Size([1, 501, 512]) and input_mask.shape = [1, 501, 501], the line
```python
input_mask = torch.cat((input_mask,
                        torch.ones((inp_size[0], input_mask.shape[1], input_mask.shape[2],
                                    (16 - (inp_size[1] % 16))),
                                   device=input_mask.device,
                                   dtype=input_mask.dtype) * -10000), 3)
```

raises

```
IndexError: Dimension out of range (expected to be in range of [-3, 2], but got 3)
```
There is no docstring so I figured I’d ask. Thanks!
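For context on the error itself: the failing line pads the mask along dimension 3 (the key dimension) so the sequence length becomes a multiple of 16, and that dimension only exists for a 4-D mask. Below is a minimal sketch, not DeepSpeed code, that reproduces the failure with a 3-D mask and shows the 4-D shape the padding expects; the sizes mirror the report, and the exact error text may vary by PyTorch version.

```python
import torch

bs, seq_len, hidden = 1, 501, 512
inp_size = torch.Size([bs, seq_len, hidden])
pad = 16 - (inp_size[1] % 16)  # pad the 501-token sequence up to 512

# 3-D mask as in the report: (bs, seq_len, seq_len)
mask_3d = torch.zeros(bs, seq_len, seq_len)
try:
    torch.cat((mask_3d,
               torch.ones(bs, mask_3d.shape[1], mask_3d.shape[2], pad) * -10000), 3)
except (IndexError, RuntimeError) as e:
    # e.g. IndexError: Dimension out of range (expected to be in range of [-3, 2], but got 3)
    print(e)

# 4-D mask, (bs, 1, seq_len, seq_len): dim 3 exists, so the padding concatenation works
mask_4d = torch.zeros(bs, 1, seq_len, seq_len)
padded = torch.cat((mask_4d,
                    torch.ones(bs, mask_4d.shape[1], mask_4d.shape[2], pad) * -10000), 3)
print(padded.shape)  # torch.Size([1, 1, 501, 512])
```

This is consistent with the (bs, 1, seq_len, seq_len) shape the linked unit test suggests.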
Issue Analytics
- Created: 3 years ago
- Comments: 19 (10 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
Hi @hwidong-na
Yes, you are right; the unit test currently does not check this type of masking. I will soon add a case for that. Thanks, Reza
I got it working with (1, 1, seq_len, seq_len)! It seems to be faster and to use less memory in early benchmarks. I am very grateful 😃.

This is obviously out of scope for this issue, but I was wondering whether it's possible to skip the FFN layers at the end of the transformer block, and/or set them to identity, and use my own custom FFN layers after your very fast attention?
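For anyone who lands here later, here is a minimal sketch of an additive causal mask in the (1, 1, seq_len, seq_len) shape that worked above. The 0 / -10000 convention is an assumption based on the -10000 padding value in the kernel snippet earlier in this issue, not something confirmed by DeepSpeed documentation.

```python
import torch

def causal_attention_mask(seq_len, device=None, dtype=torch.float32):
    """Additive left-to-right mask of shape (1, 1, seq_len, seq_len).

    Entry (i, j) is 0 where token i may attend to token j (j <= i)
    and -10000 where it may not (j > i, i.e. the future).
    """
    # Strictly upper-triangular entries mark the "future" positions.
    future = torch.triu(torch.ones(seq_len, seq_len, device=device), diagonal=1).bool()
    mask = torch.zeros(seq_len, seq_len, device=device, dtype=dtype)
    mask = mask.masked_fill(future, -10000.0)
    # Add batch and head dimensions so it broadcasts to (bs, heads, seq_len, seq_len).
    return mask.unsqueeze(0).unsqueeze(0)

input_mask = causal_attention_mask(501)
print(input_mask.shape)  # torch.Size([1, 1, 501, 501])
```

With a batch size greater than 1 the same mask broadcasts across the batch; if per-sample padding also needs to be masked, the full (bs, 1, seq_len, seq_len) shape would be needed instead.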