[Longformer] Output both local attentions and global attentions when `output_attentions=True` -> Good Second Issue
🚀 Feature request
Good Second Issue - A more advanced issue for contributors who want to dive more into Longformer's attention mechanism.
Longformer currently only outputs global attentions, which is suboptimal because users might be interested in the local attentions as well. I propose to change the `output_attentions` logic in Longformer as follows: `attentions` should correspond to the "local" attentions, and then we'll add a new output `global_attentions` that contains the global attention probabilities. This is consistent with the naming of `attention_mask` and `global_attention_mask`, and IMO it is the cleanest way to implement the feature.
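For concreteness, a minimal sketch of how the proposed outputs could be consumed from the user side; the `global_attentions` field is the proposal above, not the current API, and the checkpoint choice is only illustrative:

```python
import torch
from transformers import LongformerModel, LongformerTokenizer

tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
model = LongformerModel.from_pretrained("allenai/longformer-base-4096")

inputs = tokenizer("Hello world!", return_tensors="pt")
global_attention_mask = torch.zeros_like(inputs["input_ids"])
global_attention_mask[:, 0] = 1  # give the <s> token global attention

outputs = model(
    **inputs,
    global_attention_mask=global_attention_mask,
    output_attentions=True,
)
local_attentions = outputs.attentions          # proposed: local (windowed) attention probabilities
global_attentions = outputs.global_attentions  # proposed: new field with the global attention probabilities
```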
Implementing this feature would mean that Longformer will require its own ModelOutput classes:
- `BaseModelOutput` => `LongformerBaseModelOutput` or `BaseModelOutputWithGlobalAttention` (I prefer the first name though)
- `BaseModelOutputWithPooling` => …
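A rough sketch of what such a dedicated output class could look like, assuming the field names from the proposal above; the import path and exact fields are assumptions, not the final implementation:

```python
from dataclasses import dataclass
from typing import Optional, Tuple

import torch
from transformers.utils import ModelOutput  # located in transformers.file_utils in older releases


@dataclass
class LongformerBaseModelOutput(ModelOutput):
    """Base model output carrying both local and global attention probabilities (sketch)."""

    last_hidden_state: torch.FloatTensor = None
    hidden_states: Optional[Tuple[torch.FloatTensor]] = None
    attentions: Optional[Tuple[torch.FloatTensor]] = None         # local (windowed) attentions
    global_attentions: Optional[Tuple[torch.FloatTensor]] = None  # new field for global attentions
```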
Also some tests will have to be adapted.
This is a slightly more difficult issue, so I'm happy to help on it. One should understand the difference between local and global attention and how Longformer's attention differs from, e.g., Bert's attention in general.
For more detail check out discussion here: https://github.com/huggingface/transformers/issues/5646
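For a sense of why the two cannot be used interchangeably, here is an illustrative sketch; the windowed shape used below is an assumption, the point is only that the last dimension of Longformer's local attention probabilities is not the sequence length:

```python
import torch

batch, num_heads, seq_len, window = 1, 12, 64, 8

# Bert-like attention probabilities: one full (seq_len x seq_len) matrix per head.
bert_like_attn = torch.rand(batch, num_heads, seq_len, seq_len)

# Longformer-like local attention probabilities: windowed, so the last dimension
# is on the order of the attention window, not the sequence length (assumed shape).
longformer_local_attn = torch.rand(batch, num_heads, seq_len, window + 1)

# Code written for Bert-like attentions, e.g. total attention received by each token:
received = bert_like_attn.sum(dim=2)                # (batch, heads, seq_len) -> one score per token
received_bogus = longformer_local_attn.sum(dim=2)   # (batch, heads, window + 1) -> no longer per token
```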
I have made the pull request.
I checked that the Longformer tests passed with my changes, and I added one more test to check the output of attention probabilities.
Quite stupidly, I made the pull request to the master branch; I am sorry about this. I left it as is to avoid duplicating pull requests for now. You can reject it, and I will make a cleaner pull request from a separate branch.
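A test of that kind might look roughly like the sketch below; the tiny config values and the `global_attentions` field are illustrative assumptions, not the test that was actually submitted:

```python
import torch
from transformers import LongformerConfig, LongformerModel


def test_longformer_attention_outputs():
    # Tiny config so the test runs quickly; values are illustrative only.
    config = LongformerConfig(
        vocab_size=99,
        hidden_size=32,
        num_hidden_layers=2,
        num_attention_heads=2,
        intermediate_size=37,
        attention_window=8,
        max_position_embeddings=64,
    )
    model = LongformerModel(config).eval()

    input_ids = torch.randint(0, config.vocab_size, (1, 16))
    global_attention_mask = torch.zeros_like(input_ids)
    global_attention_mask[:, 0] = 1  # one global token

    with torch.no_grad():
        outputs = model(
            input_ids,
            global_attention_mask=global_attention_mask,
            output_attentions=True,
        )

    # One attention tuple entry per layer, for both local and global attentions.
    assert len(outputs.attentions) == config.num_hidden_layers
    assert len(outputs.global_attentions) == config.num_hidden_layers
```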
I am working on a pull request to address this. I don't see any major challenge so far, but this made me realize how different `attentions` in Bert-like models and in Longformers are. Why not replace `attentions` in the Longformer by `local_attentions`? This means that the interface of Longformers would become incompatible with every other Transformer, but maybe it should be? I don't think that there is a way to plug Longformer `attentions` into code that expects Bert-like `attentions` and get meaningful results, so users always have to write a special case for Longformers if they use them. As is, the risk is that they get bogus output and won't realize it until they carefully read the doc (which is not yet written).

What are your thoughts on this, @patrickvonplaten?