Transformer-XL recurrence differs from how it is presented in the paper
The current Transformer-XL implementation uses an attention length equal to the input segment length plus the memory length, while in the paper the attention length is presented as independent of both the input length and the memory length. This behavior is unwanted because the model cannot benefit from the extended receptive field shown in Figure 2 of the paper (https://arxiv.org/pdf/1901.02860.pdf). A solution could be to add a parameter to the model from which it automatically generates an attention mask that bounds the attention length. A snippet of how it could be implemented:
# assumes the usual imports at module level:
import torch
from einops import rearrange

# inside the attention forward pass: `dots` holds the attention scores with
# shape (batch, heads, i, j), where the key axis j spans memory + current segment
if self.causal:
    i, j = dots.shape[-2:]
    # absolute positions; queries are offset by the memory length (j - i)
    q_pos = torch.arange(j - i, j, device = device)
    k_pos = torch.arange(j, device = device)
    distance = rearrange(k_pos, 'j -> () () () j') - rearrange(q_pos, 'i -> () () i ()')
    mask = distance > 0    # causal: a query cannot attend to future keys
    if self.att_len:
        # also mask keys more than att_len - 1 positions in the past, so each
        # query attends to at most att_len keys (itself included), even into memory
        mask = mask | (distance <= -self.att_len)
    dots.masked_fill_(mask, mask_value)
    del mask
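For illustration, here is a hypothetical standalone run of the same mask logic, collapsed to 2-D for readability; the segment length, memory length, and att_len values below are made up:

import torch
from einops import rearrange

seg_len, mem_len, att_len = 4, 4, 6
i, j = seg_len, seg_len + mem_len              # query / key lengths, as in `dots`

q_pos = torch.arange(j - i, j)                 # absolute positions of the queries
k_pos = torch.arange(j)                        # absolute positions of the keys
distance = rearrange(k_pos, 'j -> () j') - rearrange(q_pos, 'i -> i ()')

mask = (distance > 0) | (distance <= -att_len)
print(mask.int())
# each row (query) now attends to at most att_len keys, regardless of mem_len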
Issue Analytics
- Created 2 years ago
- Comments: 9 (5 by maintainers)
Top Results From Across the Web

Combining Transformers and RNNs into a State-of-the-art ...
The goal of the recurrence mechanism is to enable long-term dependencies by using information from previous segments. Similarly to the vanilla ...

Transformer-XL: Unleashing the Potential of Attention Models
Transformer-XL consists of two techniques: a segment-level recurrence mechanism and a relative positional encoding scheme.

Recurrent Memory Transformer
Transformer-XL (Dai et al., 2019) introduces a segment-level recurrence at the level of hidden representations. These representations of a ...

Day 9: Transformer-XL: Attentive Language Models ...
The big crux of this paper is the introduction of recurrence in the weights ... As discussed before, Transformer-XL achieves a 1,874 times ...

Attentive Language Models beyond a Fixed-Length Context
Figure 2: Illustration of the Transformer-XL model with a segment length 4 ... per-segment, which differs from the same-layer recurrence in conventional ...
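Several of the results above describe the segment-level recurrence mechanism. As a rough illustration, here is a minimal, self-contained sketch of the idea (hidden states from previous segments are cached and reused as extra keys/values); the class name, the mem_len parameter, and all shapes are invented for this example, and relative positional encodings are omitted:

import torch
from torch import nn

class SegmentRecurrentAttention(nn.Module):
    def __init__(self, dim, mem_len):
        super().__init__()
        self.mem_len = mem_len
        self.to_q = nn.Linear(dim, dim, bias = False)
        self.to_kv = nn.Linear(dim, 2 * dim, bias = False)

    def forward(self, x, mem = None):
        # keys/values cover [memory ; current segment], so each query can
        # attend to hidden states cached from previous segments
        context = x if mem is None else torch.cat((mem, x), dim = 1)
        q = self.to_q(x)
        k, v = self.to_kv(context).chunk(2, dim = -1)
        i, j = q.shape[1], k.shape[1]
        # causal mask that accounts for the memory offset (j - i)
        mask = torch.ones(i, j, dtype = torch.bool).triu(j - i + 1)
        scores = (q @ k.transpose(-1, -2)) / q.shape[-1] ** 0.5
        scores = scores.masked_fill(mask, float('-inf'))
        out = scores.softmax(dim = -1) @ v
        # new memory: the last mem_len hidden states, detached
        new_mem = context[:, -self.mem_len:].detach()
        return out, new_mem

# usage: feed segments sequentially, carrying the memory across calls
attn = SegmentRecurrentAttention(dim = 16, mem_len = 4)
mem = None
for segment in torch.randn(3, 1, 4, 16):    # three segments of length 4
    out, mem = attn(segment, mem)

The important design point is that the cached states are detached, so gradients never flow across segment boundaries, matching the stop-gradient used in the paper.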
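The second technique mentioned above, relative positional encoding, can be approximated by a learned bias indexed by clipped relative distance. This simplified version is closer in spirit to learned-bias schemes used elsewhere than to the paper's exact formulation (sinusoidal encodings with separate content and position terms); all names here are illustrative:

import torch
from torch import nn

class RelativePositionBias(nn.Module):
    def __init__(self, max_distance):
        super().__init__()
        self.max_distance = max_distance
        # one learned scalar bias per clipped relative distance
        self.bias = nn.Embedding(2 * max_distance + 1, 1)

    def forward(self, i, j):
        # relative distance of every key to every query, with queries
        # offset by the memory length (j - i), clipped to the valid range
        q_pos = torch.arange(j - i, j)
        k_pos = torch.arange(j)
        rel = k_pos[None, :] - q_pos[:, None]
        rel = rel.clamp(-self.max_distance, self.max_distance) + self.max_distance
        return self.bias(rel).squeeze(-1)    # (i, j), added to attention scores

# usage: scores = q @ k.transpose(-1, -2) + rel_bias(i, j)
rel_bias = RelativePositionBias(max_distance = 8)
print(rel_bias(4, 8).shape)    # torch.Size([4, 8])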
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
Looking great! Looking back, there were some problems in my snippet, thanks and good work!
@Atze00 how does this look? https://github.com/lucidrains/x-transformers/commit/55ca5d96c8b850b064177091f7a1dcfe784b24ce