T5 GPU Runtime Degradation
Environment info
- `transformers` version: 4.2.1 vs 3.4.0
- Platform: Colab (K80 GPU)
- Python version: 3.6.9
- PyTorch version (GPU?): 1.7.0+cu101
- Tensorflow version (GPU?): N.A.
- Using GPU in script?: Yes
- Using distributed or parallel set-up in script?: No
Who can help
@patrickvonplaten, @patil-suraj
Information
Model I am using (Bert, XLNet …): T5
The problem arises when using:
- [ ] the official example scripts: (give details below)
- [x] my own modified scripts: (give details below)
The tasks I am working on are:
- [ ] an official GLUE/SQuAD task: (give the name)
- [x] my own task or dataset: (give details below)
Hello,
I’ve noticed that the running time of T5 on a GPU has increased between v3.4.0 and the current version (v4.2.1). When running inference on a single example on a K80 GPU (Google Colab), the average runtime of a generate() call for a single example (the one in the transformers documentation) with t5-base is 539 ± 13 ms in v3.4.0, compared to 627 ± 13 ms in v4.2.1. With t5-large, the runtimes are 1004 ± 22 ms (v3.4.0) versus 1242 ± 15 ms (v4.2.1).
I made two colab notebooks that compare the two versions: https://colab.research.google.com/drive/1Rm9RFdfLUFFHOvjAOg816-6oXw8zm_tE?usp=sharing#scrollTo=eeJ0sS_g7-X2 https://colab.research.google.com/drive/1U2QPA4MR48xPCpn4XiG5KBk3qZGYeoIJ?usp=sharing
I’m aware of at least one bug fix that was made to the attention mechanism of T5 in v4.0.0 (#8158), but I don’t think that change should have caused such a degradation. Any idea why it occurred?
Thanks!
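To reproduce the comparison outside Colab, each release can be pinned in a separate environment (a sketch; the pins mirror the versions listed above, and the exact CUDA build of torch may differ on your machine):

```shell
# Environment A: the older release being compared
pip install "transformers==3.4.0" "torch==1.7.0"

# Environment B: the newer release being compared
pip install "transformers==4.2.1" "torch==1.7.0"
```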
To reproduce
See Colab notebooks attached. See the following code snippet as well:
import time

import numpy as np
import torch
import transformers
from transformers import T5ForConditionalGeneration, T5TokenizerFast

device = torch.device('cuda:0') if torch.cuda.is_available() else torch.device('cpu')
print(f"Using device: {device}")

t5_tokenizer = T5TokenizerFast.from_pretrained('t5-base')
t5_model = T5ForConditionalGeneration.from_pretrained('t5-base')
t5_model = t5_model.to(device)

t5_input_ids = t5_tokenizer("summarize: studies have shown that owning a dog is good for you ", return_tensors="pt").input_ids  # Batch size 1
t5_input_ids = t5_input_ids.to(device)

N = 100
times = []
for _ in range(N):
    start = time.time()
    t5_outputs = t5_model.generate(t5_input_ids)
    end = time.time()
    times.append(end - start)

print(f"transformers version: {transformers.__version__}")
print(f"torch version: {torch.__version__}")
print(f"{1000*np.mean(times):.0f} ms \u00B1 {1000*np.std(times):.2f} ms per loop (mean \u00B1 std of {N} runs)")
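One caveat when timing GPU code with wall clocks: CUDA kernel launches are asynchronous, so a `sync` callback (e.g. `torch.cuda.synchronize`) should flush pending work before each clock read. The loop above can be factored into a small helper; `benchmark` is a hypothetical name, not part of any library:

```python
import time


def benchmark(fn, n=100, sync=lambda: None):
    """Time fn() n times and return (mean, std) in seconds.

    sync() is called before reading the clock so that any asynchronous
    work (e.g. pending CUDA kernels) is finished; pass
    torch.cuda.synchronize when benchmarking on a GPU.
    """
    times = []
    for _ in range(n):
        sync()                       # drain queued work before starting the clock
        start = time.perf_counter()  # monotonic, high-resolution timer
        fn()
        sync()                       # make sure fn()'s work has actually finished
        times.append(time.perf_counter() - start)
    mean = sum(times) / n
    std = (sum((t - mean) ** 2 for t in times) / n) ** 0.5
    return mean, std
```

Under this sketch the measurement would read `benchmark(lambda: t5_model.generate(t5_input_ids), sync=torch.cuda.synchronize)`. In practice `generate()` forces a device sync internally while decoding, so the numbers above should be close either way, but the explicit sync removes one source of doubt when comparing versions.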
Issue Analytics
- State:
- Created 3 years ago
- Comments: 6 (4 by maintainers)
Top GitHub Comments
Okay, I can reproduce the degradation! Will try to fix it today.
Thanks a lot for this issue @dsgissin! Will take a look this week!