Causal linear attention benchmark
First, thanks for this awesome repo!!
Based on the T5 model classes from Huggingface's transformers, I was trying to use performer attention in place of the original T5 attention. We finetuned t5-large on a summarization task and profiled both time and memory usage to compare performer attention with the original attention. I have only benchmarked with an input size of 1024.
The results clearly showed that performer attention uses far less memory than the original transformer. I know from the paper that the performer outperforms the original transformer when the input size is larger than 1024. However, finetuning and generation with the performer actually took longer, so I profiled the forward call of both the original T5 attention and the performer attention. The forward of the T5 performer took twice as long, and the main bottleneck was `causal_dot_product_kernel` from fast-transformers.
Is this normal performance for the performer's causal attention calculation, or will performer attention become faster at larger input sizes?
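For reference, here is a rough sketch of the kind of timing/memory comparison I mean (not my exact finetuning setup; it assumes performer-pytorch's `SelfAttention` as the performer layer and a plain `nn.MultiheadAttention` as the softmax baseline, with illustrative sizes):

```python
import time
import torch
from performer_pytorch import SelfAttention

def bench(forward, x, steps=20):
    # warm up, then time the forward pass and record peak GPU memory
    for _ in range(3):
        forward(x)
    torch.cuda.synchronize()
    torch.cuda.reset_peak_memory_stats()
    start = time.time()
    for _ in range(steps):
        forward(x)
    torch.cuda.synchronize()
    return (time.time() - start) / steps, torch.cuda.max_memory_allocated() / 2**20

dim, heads = 1024, 16
performer = SelfAttention(dim=dim, heads=heads, causal=True).cuda().eval()
softmax = torch.nn.MultiheadAttention(dim, heads, batch_first=True).cuda().eval()

with torch.no_grad():
    for seq_len in (1024, 2048, 4096):
        x = torch.randn(1, seq_len, dim, device='cuda')
        # boolean upper-triangular mask to make the softmax baseline causal too
        causal_mask = torch.triu(
            torch.ones(seq_len, seq_len, dtype=torch.bool, device='cuda'), diagonal=1)
        t_p, m_p = bench(lambda inp: performer(inp), x)
        t_s, m_s = bench(lambda inp: softmax(inp, inp, inp, attn_mask=causal_mask)[0], x)
        print(f"len={seq_len}: performer {t_p*1e3:.1f} ms / {m_p:.0f} MiB "
              f"| softmax {t_s*1e3:.1f} ms / {m_s:.0f} MiB")
```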
Top GitHub Comments
ok! i’ll work on the other issue (fast generation) - glad to hear the original issue is resolved!
Actually, installing from pip or building from source took a while, which was probably due to compiling the EPFL fast-transformers CUDA kernels (I have only a shallow knowledge of CUDA kernels and libraries 😅).
We have fixed our code to use `SelfAttention` instead of `FastAttention`, and we might have been setting the wrong parameters, since the performance and speed of the performer now look similar to what the paper reports. So I think you can close this issue for now, and thanks for the responsive feedback!
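For anyone hitting the same issue, this is roughly what the fixed usage looks like (a sketch based on performer-pytorch's documented `SelfAttention` interface; the dimensions are illustrative). If I understand it correctly, `FastAttention` expects already-projected per-head q/k/v tensors, whereas `SelfAttention` is a drop-in layer that owns the qkv projections itself:

```python
import torch
from performer_pytorch import SelfAttention

# drop-in multi-head performer attention block; it handles the qkv
# projections and the FAVOR+ random features internally
attn = SelfAttention(
    dim = 1024,     # model dimension (illustrative)
    heads = 16,     # number of attention heads
    causal = True   # causal variant; uses the fast-transformers CUDA kernel when installed
).cuda()

x = torch.randn(1, 1024, 1024).cuda()   # (batch, seq_len, dim)
out = attn(x)                           # (1, 1024, 1024)
```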