
Same training time for different values of sliding window in Longformer


System Info

  • Transformers: 4.20.1
  • Python: 3.8.12
  • Pretrained model & tokenizer from HF: “allenai/longformer-base-4096”

The training time does not change for any value of the sliding window. For example, a sliding window of 2, 512 (the default), or 1024 takes the same training time. This looks like a bug to me. I need a very small local window span (a sliding window of at most 64 across 4096 tokens), and the model is simply unusable in this scenario because of the excessive training time.

Who can help?

@ydshieh

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, …)
  • My own task or dataset (give details below)

Reproduction

A simple: model.config.attention_window = [SLIDE_WIN_ATTN]*12
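
For reference, here is a minimal, self-contained sketch of setting a small per-layer window and running a forward pass. The window value of 64, the sequence-classification head, and the pattern of passing the modified config at load time are illustrative assumptions, not details taken from the original report:

```python
import torch
from transformers import (
    LongformerConfig,
    LongformerForSequenceClassification,
    LongformerTokenizerFast,
)

MODEL_NAME = "allenai/longformer-base-4096"
SLIDE_WIN_ATTN = 64  # small local window under test (assumed value)

# Build a config carrying the desired per-layer sliding window.
config = LongformerConfig.from_pretrained(MODEL_NAME)
config.attention_window = [SLIDE_WIN_ATTN] * config.num_hidden_layers

# Instantiate the model with that config so the attention layers are built with this window.
model = LongformerForSequenceClassification.from_pretrained(MODEL_NAME, config=config)
tokenizer = LongformerTokenizerFast.from_pretrained(MODEL_NAME)

inputs = tokenizer(
    "some long document " * 200,
    return_tensors="pt",
    padding="max_length",
    truncation=True,
    max_length=4096,
)
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.logits.shape)
```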

Expected behavior

I would expect the training time to drop substantially for lower values of SLIDE_WIN_ATTN (say 64) compared with the default of 512. However, the training time is the same in both cases (around 24 hours per epoch). In fact, SLIDE_WIN_ATTN values from 2 to 1024 take roughly the same training time, which should not be the case.
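
For a rough sense of scale: the Longformer paper describes sliding-window self-attention as costing on the order of n × w (sequence length times window size) rather than n × n, so an idealised, attention-only estimate of the expected gain looks like the sketch below. The numbers are illustrative, not measurements from this issue:

```python
# Idealised cost model: sliding-window attention ~ n * w, full attention ~ n * n.
n = 4096          # sequence length from the report
w_default = 512   # default attention window
w_small = 64      # window the report wants to use

cost_default = n * w_default
cost_small = n * w_small
print(f"ideal attention-only speedup (512 -> 64): {cost_default / cost_small:.0f}x")  # 8x
print(f"vs. full n x n self-attention:            {n * n / cost_small:.0f}x")         # 64x
```

In practice the rest of the network and various overheads dilute this, which is the point made in the comments below.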

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments: 7

Top GitHub Comments

1 reaction
allohvk commented, Jul 20, 2022
  • Got it. I suppose this is a reasonable ramification of using a specialised attention model that handles long sequences. There is no visible benefit to having a sliding window size < 128. Possibly this can just be documented somewhere. I will close this as “not a bug” for now.
  • I may still have a problem with the model taking quadratic time for longer sequences even with the default sliding window. However, I will recheck whether it is a bug in my training code. If not, I will share reproducible code with which the problem can be replicated and open a new ticket for that.
0 reactions
ydshieh commented, Jul 19, 2022

Hi @allohvk, I know you are talking about training time. However, even with just the forward method of the model, we already see that the effect of window_size (used for local attention), i.e. linear time instead of quadratic time, appears only for a large enough window_size (and therefore with long enough sequences).

For a small window_size, overhead prevents this much-desired effect. From this observation, I am afraid the same holds for training too.

If you measure this line directly, https://github.com/huggingface/transformers/blob/8a61fe023430115bb61ec328a29d35571f4fc2c4/src/transformers/models/longformer/modeling_longformer.py#L820

(without any other parts, and therefore no other overhead), you will see the linear/quadratic running time.
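
To see that effect in isolation, a rough, self-contained sketch like the one below can help. It times a simplified, non-overlapping chunked query-key matmul (a stand-in inspired by the sliding-chunks idea, not the library's actual implementation) against a full n × n matmul, for a few window sizes:

```python
import time

import torch


def full_scores(q, k):
    # Full quadratic query-key matmul: (n, d) @ (d, n) -> (n, n).
    return q @ k.transpose(-1, -2)


def chunked_scores(q, k, window):
    # Simplified stand-in: each chunk of `window` queries only attends to the keys
    # in its own chunk, so the cost grows as n * window instead of n * n.
    # Assumes n is divisible by window.
    n, d = q.shape
    q_chunks = q.view(n // window, window, d)
    k_chunks = k.view(n // window, window, d)
    return torch.einsum("cxd,cyd->cxy", q_chunks, k_chunks)


def bench(fn, *args, repeats=10):
    fn(*args)  # warm-up
    start = time.perf_counter()
    for _ in range(repeats):
        fn(*args)
    return (time.perf_counter() - start) / repeats


n, d = 4096, 64
q, k = torch.randn(n, d), torch.randn(n, d)

print(f"full n x n matmul:      {bench(full_scores, q, k):.5f} s")
for window in (64, 512, 1024):
    print(f"chunked matmul, w={window:4d}: {bench(chunked_scores, q, k, window):.5f} s")
```

For small windows, the per-chunk bookkeeping and kernel-launch overhead can dominate the arithmetic, which is consistent with the observation above that the linear-time benefit only becomes visible for large enough windows and sequences.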

