
Training in Google Colab on TPU using TFTrainer fails with a TPUStrategy metric error

See original GitHub issue

Environment info

  • transformers version: 4.6.1
  • Platform: Linux-5.4.104+-x86_64-with-Ubuntu-18.04-bionic
  • Python version: 3.7.10
  • PyTorch version (GPU?): 1.8.1+cu101 (False)
  • Tensorflow version (GPU?): 2.5.0 (False)
  • Using GPU in script?: Using TPU
  • Using distributed or parallel set-up in script?: I assume Yes, under the hood

Who can help

Information

Model I am using (Albert):

The problem arises when using:

  • my own modified scripts

The task I am working on is:

  • my own task or dataset

To reproduce

I’m trying to train a classification model on TPU using TFTrainer, and it fails with the following error:

Trying to run metric.update_state in replica context when the metric was not created in TPUStrategy scope. Make sure the keras Metric is created in TPUstrategy scope.

I tried training without evaluation; it finishes without an error, but the model is not actually trained and the results are poor. I also tried training with evaluation but without compute_metrics, and the same error is thrown.
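For reference, the error describes a TensorFlow constraint rather than anything specific to this script: a Keras metric that is updated inside a replica context must have been created inside the same TPUStrategy scope. Below is a minimal sketch of that constraint (it assumes a Colab TPU runtime; the metric and tensors are illustrative, and TFTrainer creates its own metrics internally, which appears to be why this is hard to work around from user code).

import tensorflow as tf

resolver = tf.distribute.cluster_resolver.TPUClusterResolver()  # Colab TPU runtime
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)

with strategy.scope():
    # Created inside the strategy scope, so it may be updated in a replica context.
    accuracy = tf.keras.metrics.SparseCategoricalAccuracy()

@tf.function
def update_metric(labels, logits):
    def step_fn(y_true, y_pred):
        accuracy.update_state(y_true, y_pred)
    strategy.run(step_fn, args=(labels, logits))

update_metric(tf.constant([0, 1]), tf.constant([[2.0, 0.1], [0.3, 1.5]]))
print(accuracy.result().numpy())

# Creating the same metric *outside* strategy.scope() and then updating it via
# strategy.run raises exactly the error quoted above.

The reproduction script: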

from transformers import TFTrainer, TFTrainingArguments
from transformers import TFAutoModelForSequenceClassification
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(pred):
  labels = pred.label_ids
  preds = pred.predictions.argmax(-1)
  precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='weighted')
  acc = accuracy_score(labels, preds)
  return {
      'accuracy': acc,
      'precision': precision,
      'recall': recall,
      'f1': f1
  }

training_args = TFTrainingArguments(
    tpu_num_cores=8,                 # number of TPU cores (a Colab TPU provides 8)
    output_dir=output_dir,           # output directory for checkpoints
    num_train_epochs=3,              # total number of training epochs
    per_device_train_batch_size=3,   # batch size per device during training
    per_device_eval_batch_size=3,    # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for the learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir=logging_dir,         # directory for storing logs
    logging_steps=10,                # log every 10 steps
    evaluation_strategy="steps",     # evaluate during training, every eval_steps
    eval_steps=500,                  # run evaluation every 500 steps
    save_steps=3000,                 # save a checkpoint every 3000 steps
    load_best_model_at_end=True,     # reload the best checkpoint when training finishes
    metric_for_best_model="f1",      # "best" is judged by the f1 value from compute_metrics
    learning_rate=1e-5
)

with training_args.strategy.scope():
    model = TFAutoModelForSequenceClassification.from_pretrained(
        modelName,
        num_labels=len(label_dict),
        output_attentions=False,
        output_hidden_states=False,
    )

trainer = TFTrainer(
    model=model,                         # the instantiated 🤗 Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    compute_metrics=compute_metrics,     # metrics computed during evaluation
    train_dataset=train_dataset,         # training dataset
    eval_dataset=val_dataset,            # evaluation dataset
)

trainer.train()
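The script above assumes train_dataset and val_dataset already exist. TFTrainer expects tf.data.Dataset objects yielding (features, label) pairs; a rough sketch of how they might be built (train_texts and train_labels are placeholder names, modelName is the checkpoint used above):

import tensorflow as tf
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(modelName)
train_encodings = tokenizer(train_texts, truncation=True, padding=True)

train_dataset = tf.data.Dataset.from_tensor_slices((
    dict(train_encodings),   # input_ids, attention_mask, (token_type_ids)
    train_labels,            # integer class ids consistent with label_dict
))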

Expected behavior

I would expect training to complete successfully on TPU.

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments:7 (3 by maintainers)

Top GitHub Comments

1 reaction
Rocketknight1 commented on Aug 30, 2021

Not easily, unfortunately. This is a known issue at our end and we’re hoping to implement a fix, but in the meantime you can try exporting your trained model to a GPU instance and running predict() there.
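A rough sketch of that workaround (the directory name and example sentence are placeholders; save_pretrained/from_pretrained are the standard way to move a model between runtimes):

import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

# On the TPU instance, after trainer.train():
#   model.save_pretrained("albert-finetuned")
#   tokenizer.save_pretrained("albert-finetuned")

# On a GPU or CPU instance, reload and predict:
model = TFAutoModelForSequenceClassification.from_pretrained("albert-finetuned")
tokenizer = AutoTokenizer.from_pretrained("albert-finetuned")

batch = tokenizer(["example sentence to classify"], return_tensors="tf",
                  padding=True, truncation=True)
logits = model(batch).logits
predicted_class = tf.math.argmax(logits, axis=-1).numpy()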

1 reaction
Rocketknight1 commented on Jun 17, 2021

Hi! We’re trying to move away from using TFTrainer for TensorFlow and instead train models with the native Keras API. We have a full example using the Keras approach here: https://github.com/huggingface/transformers/tree/master/examples/tensorflow/text-classification

Training on TPU with this example works correctly, but there are some issues with Keras predictions on TPU that we’re actively working on. If you encounter these (the output object contains None fields that should contain values), you can try moving any predict calls out of the strategy.scope(), or saving the model and doing the predictions on a GPU or CPU instance instead.
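A condensed sketch of the native Keras route that example takes (the checkpoint, label count, hyperparameters, and dataset names here are illustrative, not taken from the linked script): build and compile the model inside a TPUStrategy scope, then call fit() on batched tf.data.Datasets.

import tensorflow as tf
from transformers import TFAutoModelForSequenceClassification

resolver = tf.distribute.cluster_resolver.TPUClusterResolver()
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)

with strategy.scope():
    model = TFAutoModelForSequenceClassification.from_pretrained(
        "albert-base-v2", num_labels=2)  # placeholder checkpoint and label count
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=[tf.keras.metrics.SparseCategoricalAccuracy()],
    )

# train_tf_dataset / val_tf_dataset: batched tf.data.Datasets of (features, labels)
model.fit(train_tf_dataset, validation_data=val_tf_dataset, epochs=3)

Because compile() creates the metrics inside the strategy scope along with the model, the update_state error from the original report should not arise in this setup.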


Top Results From Across the Web

  • Error Training Keras Model on Google Colab using TPU runtime
    Dataset.from_generator is expected to not work with TPUs, as it uses py_function underneath, which is incompatible with the Cloud TPU 2VM setup.
  • Tutorials for using Colab TPUs with Huggingface Transformers?
    Looking for an easy-to-follow tutorial for using Huggingface Transformer models (e.g. BERT) in PyTorch on Google Colab with TPUs.
  • Issue with TPUs on Google Colab when training BERT
    I'm trying to run BERT in Google Colab using TPU, however I'm getting an error message which can be seen here. Tensorflow version...
  • Fine-tune a German GPT-2 Model with Tensorflow ... - Data Dive
    Using TPUs on Google Colab we reduce training time to a reasonable amount; our data set has text comments and their corresponding rating...
  • How to Colab with TPU - Towards Data Science
    Training a Huggingface BERT on Google Colab TPU. Following are some use cases where we might want to use a TPU as...
