
Causal linear attention benchmark

See original GitHub issue

First, thanks for this awesome repo!!

Based on the T5 model classes from Hugging Face's transformers, I was trying to use Performer attention in place of the original T5 attention. We fine-tuned t5-large on a summarization task and profiled both time and memory usage to compare Performer attention with the original attention. I have only benchmarked with an input size of 1024.
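
For context, a swap like this usually means replacing the SelfAttention submodule inside each T5LayerSelfAttention with a Performer attention module. The sketch below is only illustrative (the adapter class and patching helper are my own names, not the issue author's code) and deliberately ignores padding masks, T5's relative position bias, and key/value caching, all of which a real drop-in replacement would have to handle:

    import torch.nn as nn
    from transformers import T5ForConditionalGeneration
    from performer_pytorch import SelfAttention  # lucidrains/performer-pytorch


    class PerformerAttentionAdapter(nn.Module):
        """Hypothetical adapter: makes performer_pytorch.SelfAttention mimic the
        tuple-returning interface that T5LayerSelfAttention expects."""

        def __init__(self, d_model, n_heads, causal):
            super().__init__()
            self.attn = SelfAttention(dim=d_model, heads=n_heads, causal=causal)

        def forward(self, hidden_states, mask=None, **kwargs):
            # NOTE: T5 passes an additive "extended" attention mask here; converting
            # it to the boolean mask performer-pytorch expects is omitted for brevity.
            out = self.attn(hidden_states)
            # T5Stack reads (hidden_states, present_key_value_state, position_bias)
            # from this tuple, so pad with None placeholders (no caching supported).
            return (out, None, None)


    def patch_t5_self_attention(model: T5ForConditionalGeneration):
        """Replace encoder/decoder self-attention with Performer attention.
        Decoder cross-attention is left untouched in this sketch."""
        cfg = model.config
        for block in model.encoder.block:
            block.layer[0].SelfAttention = PerformerAttentionAdapter(
                cfg.d_model, cfg.num_heads, causal=False)
        for block in model.decoder.block:
            block.layer[0].SelfAttention = PerformerAttentionAdapter(
                cfg.d_model, cfg.num_heads, causal=True)
        return model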

The results clearly showed that Performer attention uses far less memory than the original transformer. I know from the paper that the Performer outperforms the original transformer when the input size is larger than 1024. However, fine-tuning and generation with the Performer actually took longer, so I profiled the forward call of both the original T5 attention and the Performer attention. The forward pass of the T5 Performer took twice as long, and the main bottleneck was causal_dot_product_kernel from fast-transformers.
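
As a reference point for what that kernel computes: the causal numerator is a running prefix sum of key-value outer products, queried position by position. A naive pure-PyTorch equivalent (my sketch, not the fast-transformers kernel itself) makes the sequential dependence, and why a fused CUDA kernel is worthwhile, explicit:

    import torch


    def causal_linear_attention_naive(q, k, v, eps=1e-6):
        """Reference causal linear attention.
        q, k: (batch, heads, seq, d_k), already passed through the feature map
        v:    (batch, heads, seq, d_v)
        Computes out_i = q_i^T (sum_{j<=i} k_j v_j^T) / (q_i^T sum_{j<=i} k_j).
        The cumsum materialises a (batch, heads, seq, d_k, d_v) tensor, which is
        why fast-transformers fuses the whole thing into a CUDA kernel instead.
        """
        context = torch.einsum('bhnd,bhne->bhnde', k, v).cumsum(dim=2)
        num = torch.einsum('bhnde,bhnd->bhne', context, q)
        den = torch.einsum('bhnd,bhnd->bhn', k.cumsum(dim=2), q)
        return num / (den.unsqueeze(-1) + eps)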

Is this normal performance for the Performer's causal attention calculation, or will Performer attention become faster at larger input sizes?
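
One way to answer the second question empirically is to time the attention modules in isolation across sequence lengths and look for the crossover point: linear attention pays a roughly constant per-token cost, while softmax attention grows quadratically. A rough timing harness (illustrative only, assumes a CUDA device; the dimensions are just t5-large-like values) might look like:

    import time
    import torch
    from performer_pytorch import SelfAttention as PerformerSelfAttention


    def avg_forward_ms(fn, x, n_iters=20):
        """Average forward time in milliseconds, with CUDA synchronisation."""
        with torch.no_grad():
            for _ in range(3):                      # warm-up iterations
                fn(x)
            torch.cuda.synchronize()
            start = time.perf_counter()
            for _ in range(n_iters):
                fn(x)
            torch.cuda.synchronize()
        return (time.perf_counter() - start) / n_iters * 1e3


    dim, heads = 1024, 16                           # roughly t5-large sized
    performer = PerformerSelfAttention(dim=dim, heads=heads, causal=True).cuda()
    softmax = torch.nn.MultiheadAttention(dim, heads, batch_first=True).cuda()

    for seq_len in (512, 1024, 2048, 4096):
        x = torch.randn(1, seq_len, dim, device='cuda')
        causal_mask = torch.triu(
            torch.ones(seq_len, seq_len, dtype=torch.bool, device='cuda'),
            diagonal=1)
        t_perf = avg_forward_ms(performer, x)
        t_soft = avg_forward_ms(
            lambda inp: softmax(inp, inp, inp, attn_mask=causal_mask), x)
        print(f'L={seq_len}: performer {t_perf:.1f} ms, softmax {t_soft:.1f} ms')

On short sequences the fixed overhead of the random-feature projection and the causal prefix-sum kernel can easily dominate, so a crossover point well above 1024 tokens would not be surprising.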

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 13 (8 by maintainers)

Top GitHub Comments

1 reaction
lucidrains commented, Apr 20, 2021

ok! i’ll work on the other issue (fast generation) - glad to hear the original issue is resolved!

1 reaction
ice-americano commented, Apr 19, 2021

Actually, installing from pip or building from source took a while, and that was probably due to compiling the EPFL fast-transformers CUDA kernels (I only have shallow knowledge of CUDA kernels and libraries 😅).

We have changed our code to use SelfAttention instead of FastAttention; we may have been setting the wrong parameters before, because the performance and speed of the Performer now look similar to what the paper reports. So I think you can close this issue for now, and thanks for the responsive feedback!
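
For anyone hitting the same problem: FastAttention is the low-level module that expects per-head queries, keys and values you have already projected yourself, while SelfAttention bundles the q/k/v projections, head reshaping and the causal flag, so it is the easier one to drop into an existing layer. A small illustration (parameter values are arbitrary examples):

    import torch
    from performer_pytorch import FastAttention, SelfAttention

    # Low-level kernel wrapper: you supply already-projected per-head tensors.
    fast_attn = FastAttention(dim_heads=64, nb_features=256, causal=True)
    q = torch.randn(1, 8, 1024, 64)    # (batch, heads, seq, dim_head)
    k = torch.randn(1, 8, 1024, 64)
    v = torch.randn(1, 8, 1024, 64)
    out = fast_attn(q, k, v)           # -> (1, 8, 1024, 64)

    # Higher-level module: owns the q/k/v projections and head handling itself,
    # so hidden states go straight in and come straight out.
    self_attn = SelfAttention(dim=512, heads=8, causal=True)
    x = torch.randn(1, 1024, 512)      # (batch, seq, model_dim)
    out = self_attn(x)                 # -> (1, 1024, 512)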

Read more comments on GitHub >
