T5 GPU Runtime Degradation
Environment info
- `transformers` version: 4.2.1 vs 3.4.0
- Platform: Colab (K80 GPU)
- Python version: 3.6.9
- PyTorch version (GPU?): 1.7.0+cu101
- Tensorflow version (GPU?): N.A.
- Using GPU in script?: Yes
- Using distributed or parallel set-up in script?: No
Who can help
@patrickvonplaten, @patil-suraj
Information
Model I am using (Bert, XLNet …): T5
The problem arises when using:
- [ ] the official example scripts: (give details below)
- [x] my own modified scripts: (give details below)
The tasks I am working on are:
- [ ] an official GLUE/SQuAD task: (give the name)
- [x] my own task or dataset: (give details below)
Hello,
I’ve noticed that the running time of T5 on a GPU has increased between v3.4.0 and the current version (v4.2.1). When running inference on a single example on a K80 GPU (Google Colab), the average runtime of a generate() call for a single example (the one in the transformers documentation) with t5-base is 539 ± 13 ms in v3.4.0, compared to 627 ± 13 ms in v4.2.1. With t5-large, the runtimes are 1004 ± 22 ms (v3.4.0) versus 1242 ± 15 ms (v4.2.1).
I made two colab notebooks that compare the two versions: https://colab.research.google.com/drive/1Rm9RFdfLUFFHOvjAOg816-6oXw8zm_tE?usp=sharing#scrollTo=eeJ0sS_g7-X2 https://colab.research.google.com/drive/1U2QPA4MR48xPCpn4XiG5KBk3qZGYeoIJ?usp=sharing
I’m aware of at least one bug fix that was made to the attention mechanism of T5 in v4.0.0 (#8158), but I don’t think that change should have caused such a degradation. Any idea why it occurred?
Thanks!
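To reproduce the comparison outside Colab, each release can be pinned in a separate environment (a sketch; the pins mirror the versions listed above, and the exact CUDA build of torch may differ on your machine):

```shell
# Environment A: the older release being compared
pip install "transformers==3.4.0" "torch==1.7.0"

# Environment B: the newer release being compared
pip install "transformers==4.2.1" "torch==1.7.0"
```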
To reproduce
See Colab notebooks attached. See the following code snippet as well:
import time

import numpy as np
import torch
import transformers
from transformers import T5ForConditionalGeneration, T5TokenizerFast

device = torch.device('cuda:0') if torch.cuda.is_available() else torch.device('cpu')
print(f"Using device: {device}")

t5_tokenizer = T5TokenizerFast.from_pretrained('t5-base')
t5_model = T5ForConditionalGeneration.from_pretrained('t5-base')
t5_model = t5_model.to(device)

t5_input_ids = t5_tokenizer("summarize: studies have shown that owning a dog is good for you ", return_tensors="pt").input_ids  # Batch size 1
t5_input_ids = t5_input_ids.to(device)

N = 100
times = []
for _ in range(N):
    start = time.time()
    t5_outputs = t5_model.generate(t5_input_ids)
    end = time.time()
    times.append(end - start)

print(f"transformers version: {transformers.__version__}")
print(f"torch version: {torch.__version__}")
print(f"{1000*np.mean(times):.0f} ms \u00B1 {1000*np.std(times):.2f} ms per loop (mean \u00B1 std of {N} runs)")
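One caveat when timing GPU code with wall clocks: CUDA kernel launches are asynchronous, so a `sync` callback (e.g. `torch.cuda.synchronize`) should flush pending work before each clock read. The loop above can be factored into a small helper; `benchmark` is a hypothetical name, not part of any library:

```python
import time


def benchmark(fn, n=100, sync=lambda: None):
    """Time fn() n times and return (mean, std) in seconds.

    sync() is called before reading the clock so that any asynchronous
    work (e.g. pending CUDA kernels) is finished; pass
    torch.cuda.synchronize when benchmarking on a GPU.
    """
    times = []
    for _ in range(n):
        sync()                       # drain queued work before starting the clock
        start = time.perf_counter()  # monotonic, high-resolution timer
        fn()
        sync()                       # make sure fn()'s work has actually finished
        times.append(time.perf_counter() - start)
    mean = sum(times) / n
    std = (sum((t - mean) ** 2 for t in times) / n) ** 0.5
    return mean, std
```

Under this sketch the measurement would read `benchmark(lambda: t5_model.generate(t5_input_ids), sync=torch.cuda.synchronize)`. In practice `generate()` forces a device sync internally while decoding, so the numbers above should be close either way, but the explicit sync removes one source of doubt when comparing versions.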
Issue Analytics
- State:
- Created 3 years ago
- Comments: 6 (4 by maintainers)
Top GitHub Comments
Okay, I can reproduce the degradation! Will try to fix it today.
Thanks a lot for this issue @dsgissin! Will take a look this week!