
Slow evaluation using Trainer with TPUs in Colab

See original GitHub issue

Environment info

  • transformers version: 4.3.3
  • Platform: Linux-4.19.112+-x86_64-with-Ubuntu-18.04-bionic
  • Python version: 3.7.10
  • PyTorch version (GPU?): 1.9.0a0+7a178a8 (False)
  • Tensorflow version (GPU?): 2.4.1 (False)
  • Using GPU in script?: TPU
  • Using distributed or parallel set-up in script?: NO

@sgugger @patrickvonplaten

Model I am using (Bert, XLNet …): BERT

I’m seeing very slow eval times when using the Trainer API with XLA in Google Colab. While the training epochs run at a good speed, evaluation after each epoch takes a very long time. I’ve tried restricting the dataset size and the tokenization max length, with no success.
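(Editor's note: one common cause of slow TPU evaluation, per the Cloud TPU performance guide linked below, is that XLA compiles one graph per tensor shape, so variable-length eval batches trigger repeated recompilation. Padding every example to a single fixed length keeps shapes static. A minimal pure-Python sketch of that idea — `pad_to_fixed_length` is an illustrative helper, not the actual tokenizer API:)

```python
def pad_to_fixed_length(token_ids, max_length, pad_id=0):
    """Truncate/pad a token-id list so every example has the same shape.

    Static shapes matter on TPU: XLA compiles one graph per input shape,
    so variable-length batches can force recompilation on every eval step.
    """
    ids = token_ids[:max_length]
    return ids + [pad_id] * (max_length - len(ids))

print(pad_to_fixed_length([101, 2023, 102], 6))  # [101, 2023, 102, 0, 0, 0]
```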

I’m not sure how to check whether it’s using XLA during evaluation.
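(Editor's note: one way to check is to ask torch_xla which device it sees. This is a sketch assuming the `torch_xla` package that ships with the Colab TPU runtime; it degrades gracefully when the package is absent:)

```python
def xla_device_name():
    """Return the XLA device string if torch_xla is available, else None."""
    try:
        # torch_xla is only present on TPU runtimes (e.g. Colab TPU)
        import torch_xla.core.xla_model as xm
    except ImportError:
        return None
    return str(xm.xla_device())

print(xla_device_name())  # e.g. 'xla:0' on a TPU runtime, None otherwise
```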

The task I am working on is NLI, using multi-nli from the datasets library.

To reproduce

Execute this notebook

https://colab.research.google.com/drive/1dVEfoxGvMAKd0GLnrUJSHZycGtyKt9mr?usp=sharing

Expected behavior

Evaluation speed should be approximately the same as training.

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 5 (2 by maintainers)

Top GitHub Comments

finiteautomata commented, Apr 19, 2021 (1 reaction)

Ok, I followed this notebook (T5 on TPU) and I managed to solve that error by using start_method="fork" on xmp.spawn. Thanks for your help @sgugger!

def train_nli(index):
    # All the training code runs inside the spawned process
    ...

xmp.spawn(train_nli, args=(), start_method="fork")

The notebook with the full code is here

sgugger commented, Feb 26, 2021 (1 reaction)

I don’t know of any easier way than launching the training function (in PyTorch). If you come across an easy example, please let me know and we will try to make the Trainer as easy to use.


Top Results From Across the Web

  • T5 evaluation via Trainer `predict_with_generate` extremely slow on TPU?
    Here is a Colab notebook demonstrating the issue ...
  • Very slow training on colab with TPU · Issue #2148 - GitHub
    Am running into this issue only when I run the code inline. Instead of that, if I have the code in a separate...
  • Using TPUs in Google Colab (properly) - matthewmcateer.me
    A simple way to use TPUs with minimal hardware optimization. ... "Why is my model not training faster than it would with just...
  • How to Colab with TPU - Towards Data Science
    Google Colab provides experimental support for TPUs for free! In this article, we'll be discussing how to train a model using TPU on...
  • Cloud TPU performance guide
    To ensure the TPU is not idle, it is important to make sure there is a steady stream of data being loaded onto...
