
T5 GPU Runtime Degradation


Environment info

  • transformers version: 4.2.1 vs. 3.4.0
  • Platform: Colab (K80 GPU)
  • Python version: 3.6.9
  • PyTorch version (GPU?): 1.7.0+cu101
  • Tensorflow version (GPU?): N.A.
  • Using GPU in script?: Yes
  • Using distributed or parallel set-up in script?: No

Who can help

@patrickvonplaten, @patil-suraj

Information

Model I am using (Bert, XLNet …): T5

The problem arises when using:

  • the official example scripts: (give details below)
  • my own modified scripts: (give details below)

The task I am working on is:

  • an official GLUE/SQuAD task: (give the name)
  • my own task or dataset: (give details below)

Hello,

I’ve noticed that the running time of T5 on a GPU has increased between v3.4.0 and the current version (v4.2.1). When running inference on a single example on a K80 GPU (Google Colab), the average runtime of a generate() call with t5-base (on the example from the transformers documentation) is 539 ± 13 ms in v3.4.0, compared to 627 ± 13 ms in v4.2.1. With t5-large, the runtimes are 1004 ± 22 ms and 1242 ± 15 ms, respectively.
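Both measurements call generate() with its default arguments. In case any generation defaults differ between the two versions, pinning the parameters explicitly would rule that out; a minimal sketch, assuming the model and inputs from the reproduction snippet below (the values are what I believe both versions default to: greedy search with max_length=20):

# Pin generation parameters so both versions run the same decoding setup.
# max_length=20, num_beams=1, do_sample=False are assumed defaults here.
t5_outputs = t5_model.generate(t5_input_ids, max_length=20, num_beams=1, do_sample=False)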

I made two Colab notebooks that compare the two versions:
https://colab.research.google.com/drive/1Rm9RFdfLUFFHOvjAOg816-6oXw8zm_tE?usp=sharing#scrollTo=eeJ0sS_g7-X2
https://colab.research.google.com/drive/1U2QPA4MR48xPCpn4XiG5KBk3qZGYeoIJ?usp=sharing

I’m aware of at least one bug fix that was made to the attention mechanism of T5 in v4.0.0 (#8158), but I don’t think that change should have caused a slowdown of this size. Any idea why this degradation occurred?
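One way to check whether that change is involved is to time a single forward pass directly, with no generation loop, and see if the per-step computation itself got slower under each version. A minimal sketch, reusing the model and inputs from the reproduction snippet below (and assuming the CUDA device from that snippet):

import time
import torch

# Single decoder step: feed only the decoder start token, no generation loop.
decoder_input_ids = torch.full(
    (1, 1), t5_model.config.decoder_start_token_id, dtype=torch.long, device=device
)
with torch.no_grad():
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(100):
        t5_model(input_ids=t5_input_ids, decoder_input_ids=decoder_input_ids)
    torch.cuda.synchronize()
print(f"{(time.time() - start) * 10:.1f} ms per forward pass (mean of 100 runs)")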

Thanks!

To reproduce

See the Colab notebooks linked above, as well as the following code snippet:

import time

import numpy as np
import torch
import transformers
from transformers import T5ForConditionalGeneration, T5TokenizerFast

device = torch.device('cuda:0') if torch.cuda.is_available() else torch.device('cpu')
print(f"Using device: {device}")

t5_tokenizer = T5TokenizerFast.from_pretrained('t5-base')
t5_model = T5ForConditionalGeneration.from_pretrained('t5-base')
t5_model = t5_model.to(device)

# Batch size 1; input example taken from the transformers documentation.
t5_input_ids = t5_tokenizer("summarize: studies have shown that owning a dog is good for you ", return_tensors="pt").input_ids
t5_input_ids = t5_input_ids.to(device)

# Time N generate() calls and report mean/std.
N = 100
times = []
for _ in range(N):
    start = time.time()
    t5_outputs = t5_model.generate(t5_input_ids)
    end = time.time()
    times.append(end - start)

print(f"transformers version: {transformers.__version__}")
print(f"torch version: {torch.__version__}")
print(f"{1000*np.mean(times):.0f} ms ± {1000*np.std(times):.2f} ms per loop (mean ± std of {N} runs)")

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 6 (4 by maintainers)

Top GitHub Comments

patrickvonplaten commented, Mar 3, 2021 (1 reaction)

Okay, I can reproduce the degradation! Will try to fix it today.

patrickvonplaten commented, Feb 15, 2021 (1 reaction)

Thanks a lot for this issue @dsgissin! Will take a look this week!
