
TF2 DeBERTaV2 runs super slow on TPUs

See original GitHub issue

System Info

Latest version of transformers, Colab TPU, TensorFlow 2

Who can help?

@kamalkraj @Rocketknight1 @BigBird01

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, …)
  • My own task or dataset (give details below)

Reproduction

It’s currently hard to share the code and access to the Google bucket, but I believe any TF2 DeBERTaV2 code running on TPUs will hit this issue.
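
Since no code is shared, here is a minimal sketch of what a reproduction might look like on a Colab TPU, with synthetic data and a randomly initialized model (everything below is an assumption for illustration, not the reporter’s actual setup):

```python
import numpy as np
import tensorflow as tf
from transformers import DebertaV2Config, TFDebertaV2ForSequenceClassification

# Standard Colab TPU initialization
resolver = tf.distribute.cluster_resolver.TPUClusterResolver()
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)

with strategy.scope():
    config = DebertaV2Config()  # default, randomly initialized model
    model = TFDebertaV2ForSequenceClassification(config)
    model.compile(optimizer="adam")  # no loss arg: transformers computes it internally

# Synthetic inputs: throughput (sentences/sec) is what matters here, not accuracy
features = {
    "input_ids": np.random.randint(0, config.vocab_size, (2048, 128), dtype=np.int32),
    "labels": np.random.randint(0, 2, (2048,), dtype=np.int32),
}
ds = tf.data.Dataset.from_tensor_slices(features).batch(64, drop_remainder=True)
model.fit(ds, epochs=1)  # compare the reported sentences/sec against a GPU run
```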

Expected behavior

I’ve been trying to train a DeBERTa v3 model on GPUs and TPUs. I got it working on multi-node and multi-GPU setups using NVIDIA’s Deep Learning Examples (https://github.com/NVIDIA/DeepLearningExamples/blob/master/TensorFlow2/LanguageModeling/): I basically used the training setup and loop from the BERT code, the dataset utilities from the ELECTRA code, and the model from Hugging Face transformers, with some changes in order to share embeddings.

On 6x A40 45 GB GPUs I get around 1,370 sentences per second during training (which is lower than what NVIDIA gets for ELECTRA, but it’s fine).

OK, now the problem… on TPUs I get 20 sentences per second.

I traced the issue back to the tf.gather call here: https://github.com/huggingface/transformers/blob/main/src/transformers/models/deberta_v2/modeling_tf_deberta_v2.py#L525
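
For illustration, the pattern at issue and a common TPU-side rewrite look roughly like this. This is a simplified sketch (function names and shapes are assumptions), not the library’s actual code: the idea is to replace an integer-index gather, which lowers to GatherV2, with a one-hot matmul that XLA can run on the TPU’s matrix unit.

```python
import tensorflow as tf

# Simplified version of the pattern at issue: looking up relative-position
# embeddings by integer index, which lowers to GatherV2 on TPU.
def rel_embed_gather(embeddings, rel_pos):
    # embeddings: [num_buckets, hidden]; rel_pos: [query_len, key_len] int32
    return tf.gather(embeddings, rel_pos)  # -> [query_len, key_len, hidden]

# A common TPU-friendly rewrite: express the same lookup as a one-hot matmul.
def rel_embed_one_hot(embeddings, rel_pos):
    num_buckets = tf.shape(embeddings)[0]
    one_hot = tf.one_hot(rel_pos, depth=num_buckets, dtype=embeddings.dtype)
    return tf.einsum("qkn,nh->qkh", one_hot, embeddings)  # same values as above
```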

I ran TPU profiling, and the output shows that GatherV2 takes most of the execution time. [Profiler screenshots from the original issue: the overall profile output, the GatherV2 breakdown, and zoomed-in views of the fast ops.]
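
For anyone trying to reproduce the profile, the standard TF profiler API can capture an equivalent trace (a sketch; the logdir below is a placeholder, and the reporter’s exact profiling setup isn’t shown in the issue):

```python
import tensorflow as tf

# Point the logdir at a location the TPU host can write to (e.g. a GCS bucket)
tf.profiler.experimental.start("gs://your-bucket/tpu-profile")
# ... run a handful of training steps here, e.g. model.fit(ds, steps_per_epoch=10) ...
tf.profiler.experimental.stop()
# Then inspect the trace in TensorBoard's Profile tab to see per-op time (e.g. GatherV2)
```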

Also, I’m not sure whether this is TPU-specific, since on GPUs training is ~30% slower compared to regular ELECTRA.

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 34 (31 by maintainers)

Top GitHub Comments

1 reaction
sanchit-gandhi commented, Aug 8, 2022

> @sanchit-gandhi do you know a good point of contact for TPU problems?

Only for JAX on TPU, I’ll ask around and see if there is anyone who can help with TF!

1 reaction
sanchit-gandhi commented, Aug 2, 2022

For JAX BLOOM we couldn’t even compile the 176B-parameter model with the naive implementation of concatenate_to_cache, let alone benchmark which operations consumed the bulk of the execution time! We swapped it for this more efficient implementation (with one-hot encodings etc.): https://github.com/huggingface/bloom-jax-inference/blob/2a04aa519d262729d54adef3d19d63879f81ea89/bloom_inference/modeling_bloom/modeling_bloom.py#L119 Coincidentally, we’ve just run the JAX profiler for this implementation and are going through the trace with some of the Google JAX folks later today. Will report back on how performance fares!
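
For context, a rough sketch of the one-hot cache-update idea mentioned above (the real implementation lives in the linked modeling_bloom.py; the function name and shapes here are illustrative only):

```python
import jax.numpy as jnp

# Instead of scattering the new key/value into the cache at a dynamic index,
# blend it in with a one-hot mask; XLA compiles this to dense arithmetic
# rather than a scatter, which tends to be much friendlier on TPU.
def update_cache(cache, new_entry, cur_index):
    # cache: [max_len, heads, head_dim]; new_entry: [heads, head_dim]
    max_len = cache.shape[0]
    one_hot = (jnp.arange(max_len) == cur_index).astype(cache.dtype)
    one_hot = one_hot[:, None, None]  # broadcast over heads and head_dim
    return cache * (1 - one_hot) + new_entry[None, :, :] * one_hot
```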
