Slow evaluation using Trainer with TPUs in Colab
See original GitHub issue

Environment info
- transformers version: 4.3.3
- Platform: Linux-4.19.112+-x86_64-with-Ubuntu-18.04-bionic
- Python version: 3.7.10
- PyTorch version (GPU?): 1.9.0a0+7a178a8 (False)
- Tensorflow version (GPU?): 2.4.1 (False)
- Using GPU in script?: No (using a TPU)
- Using distributed or parallel set-up in script?: No
Model I am using (Bert, XLNet …): BERT
I’m seeing very slow evaluation times when using the Trainer API together with XLA in Google Colab. While the training epochs run at a good speed, evaluation after each epoch takes a very long time. I’ve tried restricting the dataset size and the tokenization max length, with no success. I’m also not sure how to check whether XLA is actually being used during evaluation.
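One way to check whether XLA is in play is to ask torch_xla for the current device. This is a minimal sketch, assuming torch_xla is installed as in a Colab TPU runtime; when it is not, the import simply fails:

```python
# Hedged sketch: confirm whether an XLA (TPU) device is available.
# Assumes torch_xla is installed, as in a Colab TPU runtime.
try:
    import torch_xla.core.xla_model as xm
    device = str(xm.xla_device())  # e.g. "xla:0"
except ImportError:
    device = None  # torch_xla not installed: not running on XLA

print(device)
```

Tensors the Trainer places on this device (rather than on `cpu` or `cuda`) indicate that XLA is being used.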
The task I am working on is NLI, using multi-nli from datasets.
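One common cause of slow TPU evaluation is variable tensor shapes: XLA recompiles the graph for every new shape, so padding every example to one fixed length usually helps. A hedged sketch, assuming the standard transformers tokenizer call signature and the `premise`/`hypothesis` columns of multi-nli (the tokenizer itself is passed in, so nothing here is specific to one checkpoint):

```python
# Hedged sketch: tokenize NLI pairs to a single fixed length so that
# every batch has the same shape on the TPU (one XLA compilation
# instead of one per distinct sequence length).
def tokenize_fn(batch, tokenizer, max_length=128):
    return tokenizer(
        batch["premise"],
        batch["hypothesis"],
        padding="max_length",  # pad everything to max_length
        truncation=True,
        max_length=max_length,
    )
```

This would typically be applied via `dataset.map(...)` before handing the dataset to the Trainer.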
To reproduce
Execute this notebook
https://colab.research.google.com/drive/1dVEfoxGvMAKd0GLnrUJSHZycGtyKt9mr?usp=sharing
Expected behavior
Evaluation speed should be approximately the same as training.
Issue Analytics
- State:
- Created 3 years ago
- Comments: 5 (2 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
Ok, I followed this notebook (T5 on TPU) and I managed to solve that error by using start_method="fork" on xmp.spawn. Thanks for your help @sgugger! The notebook with the full code is here.
I don’t know of any easier way than launching the training function (in PyTorch). If you come across an easy example, please let me know and we will try to make the Trainer as easy to use.