Transformer-XL recurrence different from how it is presented in the paper

See original GitHub issue

The current Transformer-XL implementation uses an attention length equal to the input segment length plus the memory length, whereas in the paper the attention length is presented as independent of both the input length and the memory length. This behavior is unwanted, since you can't benefit from the extended receptive field shown in Figure 2 of the paper (https://arxiv.org/pdf/1901.02860.pdf). A solution could be to add a further parameter to the model from which an attention mask is generated automatically. A snippet of how it could be implemented:

# Assumes the surrounding module already imports torch and einops.rearrange.
if self.causal:
    i, j = dots.shape[-2:]                            # i = segment length, j = memory + segment length
    q_pos = torch.arange(j - i, j, device = device)   # absolute positions of the queries
    k_pos = torch.arange(j, device = device)          # absolute positions of the keys
    # distance[q, k] = how far key k lies in the past of query q (negative = future)
    distance = rearrange(q_pos, 'i -> () () i ()') - rearrange(k_pos, 'j -> () () () j')
    mask = distance < 0                               # causal mask: hide future positions
    if self.att_len:
        # also hide keys further back than the desired attention length
        mask = mask | (distance >= self.att_len)
    dots.masked_fill_(mask, mask_value)
    del mask
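
For illustration, here is a minimal, self-contained sketch of what the mask above produces on toy shapes. The segment length, memory length, and att_len values are made up for the example, and dots is just a zero tensor standing in for the attention scores:

import torch
from einops import rearrange

# Hypothetical sizes: segment length i = 4, memory length 2 (so j = 6), att_len = 3
i, j, att_len = 4, 6, 3
device = 'cpu'
mask_value = -torch.finfo(torch.float32).max

dots = torch.zeros(1, 1, i, j)                  # stand-in for attention scores (batch, heads, i, j)

q_pos = torch.arange(j - i, j, device=device)   # absolute positions of the queries
k_pos = torch.arange(j, device=device)          # absolute positions of the keys
distance = rearrange(q_pos, 'i -> () () i ()') - rearrange(k_pos, 'j -> () () () j')
mask = (distance < 0) | (distance >= att_len)   # future keys, or keys more than att_len back
dots.masked_fill_(mask, mask_value)

print(mask[0, 0].int())
# tensor([[0, 0, 0, 1, 1, 1],
#         [1, 0, 0, 0, 1, 1],
#         [1, 1, 0, 0, 0, 1],
#         [1, 1, 1, 0, 0, 0]], dtype=torch.int32)
# Each query attends to at most att_len = 3 keys, ending at its own position,
# so the receptive field no longer grows with the memory length.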

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 9 (5 by maintainers)

Top GitHub Comments

1 reaction
Atze00 commented, Nov 7, 2021

Looking great! Looking back, there were some problems in my snippet. Thanks, and good work!

Read more comments on GitHub >

Top Results From Across the Web

Combining Transformers and RNNs into a State-of-the-art ...
The goal of the recurrence mechanism is to enable long-term dependencies by using information from previous segments. Similarly to the vanilla ...
Read more >
Transformer-XL: Unleashing the Potential of Attention Models
Transformer-XL consists of two techniques: a segment-level recurrence mechanism and a relative positional encoding scheme.
Read more >
Recurrent Memory Transformer
Transformer-XL (Dai et al., 2019) introduces a segment-level recurrence at the level of hidden representations. These representations of a ...
Read more >
Day 9: Transformer-XL: Attentive Language Models ...
The big crux of this paper is the introduction of recurrence in the weights ... As discussed before, Transformer-XL achieves a 1,874 times ...
Read more >
Attentive Language Models beyond a Fixed-Length Context
Figure 2: Illustration of the Transformer-XL model with a segment length of 4. ... per-segment, which differs from the same-layer recurrence in conventional ...
Read more >
