Transformer-XL recurrence differs from how it is presented in the paper
The current Transformer-XL implementation uses an attention length equal to the input segment length plus the memory length, while in the paper the attention length is presented as independent of both the input length and the memory length. This behavior is unwanted because the model cannot benefit from the extended receptive field shown in Figure 2 of the paper (https://arxiv.org/pdf/1901.02860.pdf). A solution could be to add a parameter to the model from which it automatically generates an attention mask that bounds the attention length. A snippet of how it could be implemented:
# assumes the usual imports at module level:
import torch
from einops import rearrange

# inside the attention forward pass: `dots` holds the attention scores with
# shape (batch, heads, i, j), where the key axis j spans memory + current segment
if self.causal:
    i, j = dots.shape[-2:]
    # absolute positions; queries are offset by the memory length (j - i)
    q_pos = torch.arange(j - i, j, device = device)
    k_pos = torch.arange(j, device = device)
    distance = rearrange(k_pos, 'j -> () () () j') - rearrange(q_pos, 'i -> () () i ()')
    mask = distance > 0    # causal: a query cannot attend to future keys
    if self.att_len:
        # also mask keys more than att_len - 1 positions in the past, so each
        # query attends to at most att_len keys (itself included), even into memory
        mask = mask | (distance <= -self.att_len)
    dots.masked_fill_(mask, mask_value)
    del mask
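For illustration, here is a hypothetical standalone run of the same mask logic, collapsed to 2-D for readability; the segment length, memory length, and att_len values below are made up:

import torch
from einops import rearrange

seg_len, mem_len, att_len = 4, 4, 6
i, j = seg_len, seg_len + mem_len              # query / key lengths, as in `dots`

q_pos = torch.arange(j - i, j)                 # absolute positions of the queries
k_pos = torch.arange(j)                        # absolute positions of the keys
distance = rearrange(k_pos, 'j -> () j') - rearrange(q_pos, 'i -> i ()')

mask = (distance > 0) | (distance <= -att_len)
print(mask.int())
# each row (query) now attends to at most att_len keys, regardless of mem_len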
Issue Analytics
- Created 2 years ago
- Comments: 9 (5 by maintainers)
Top Results From Across the Web

Combining Transformers and RNNs into a State-of-the-art ...
The goal of the recurrence mechanism is to enable long-term dependencies by using information from previous segments. Similarly to the vanilla ...

Transformer-XL: Unleashing the Potential of Attention Models
Transformer-XL consists of two techniques: a segment-level recurrence mechanism and a relative positional encoding scheme.

Recurrent Memory Transformer
Transformer-XL (Dai et al., 2019) introduces a segment-level recurrence at the level of hidden representations. These representations of a ...

Day 9: Transformer-XL: Attentive Language Models ...
The big crux of this paper is the introduction of recurrence in the weights ... As discussed before, Transformer-XL achieves a 1,874 times ...

Attentive Language Models beyond a Fixed-Length Context
Figure 2: Illustration of the Transformer-XL model with a segment length 4 ... per-segment, which differs from the same-layer recurrence in conventional ...
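Several of the results above describe the segment-level recurrence mechanism. As a rough illustration, here is a minimal, self-contained sketch of the idea (hidden states from previous segments are cached and reused as extra keys/values); the class name, the mem_len parameter, and all shapes are invented for this example, and relative positional encodings are omitted:

import torch
from torch import nn

class SegmentRecurrentAttention(nn.Module):
    def __init__(self, dim, mem_len):
        super().__init__()
        self.mem_len = mem_len
        self.to_q = nn.Linear(dim, dim, bias = False)
        self.to_kv = nn.Linear(dim, 2 * dim, bias = False)

    def forward(self, x, mem = None):
        # keys/values cover [memory ; current segment], so each query can
        # attend to hidden states cached from previous segments
        context = x if mem is None else torch.cat((mem, x), dim = 1)
        q = self.to_q(x)
        k, v = self.to_kv(context).chunk(2, dim = -1)
        i, j = q.shape[1], k.shape[1]
        # causal mask that accounts for the memory offset (j - i)
        mask = torch.ones(i, j, dtype = torch.bool).triu(j - i + 1)
        scores = (q @ k.transpose(-1, -2)) / q.shape[-1] ** 0.5
        scores = scores.masked_fill(mask, float('-inf'))
        out = scores.softmax(dim = -1) @ v
        # new memory: the last mem_len hidden states, detached
        new_mem = context[:, -self.mem_len:].detach()
        return out, new_mem

# usage: feed segments sequentially, carrying the memory across calls
attn = SegmentRecurrentAttention(dim = 16, mem_len = 4)
mem = None
for segment in torch.randn(3, 1, 4, 16):    # three segments of length 4
    out, mem = attn(segment, mem)

The important design point is that the cached states are detached, so gradients never flow across segment boundaries, matching the stop-gradient used in the paper.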
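The second technique mentioned above, relative positional encoding, can be approximated by a learned bias indexed by clipped relative distance. This simplified version is closer in spirit to learned-bias schemes used elsewhere than to the paper's exact formulation (sinusoidal encodings with separate content and position terms); all names here are illustrative:

import torch
from torch import nn

class RelativePositionBias(nn.Module):
    def __init__(self, max_distance):
        super().__init__()
        self.max_distance = max_distance
        # one learned scalar bias per clipped relative distance
        self.bias = nn.Embedding(2 * max_distance + 1, 1)

    def forward(self, i, j):
        # relative distance of every key to every query, with queries
        # offset by the memory length (j - i), clipped to the valid range
        q_pos = torch.arange(j - i, j)
        k_pos = torch.arange(j)
        rel = k_pos[None, :] - q_pos[:, None]
        rel = rel.clamp(-self.max_distance, self.max_distance) + self.max_distance
        return self.bias(rel).squeeze(-1)    # (i, j), added to attention scores

# usage: scores = q @ k.transpose(-1, -2) + rel_bias(i, j)
rel_bias = RelativePositionBias(max_distance = 8)
print(rel_bias(4, 8).shape)    # torch.Size([4, 8])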
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
Looking great! Looking back, there were some problems in my snippet, thanks and good work!
@Atze00 how does this look? https://github.com/lucidrains/x-transformers/commit/55ca5d96c8b850b064177091f7a1dcfe784b24ce