
Training in Google Colab on TPU using TFTrainer fails with a TPUStrategy metric error

See original GitHub issue

Environment info

  • transformers version: 4.6.1
  • Platform: Linux-5.4.104+-x86_64-with-Ubuntu-18.04-bionic
  • Python version: 3.7.10
  • PyTorch version (GPU?): 1.8.1+cu101 (False)
  • Tensorflow version (GPU?): 2.5.0 (False)
  • Using GPU in script?: Using TPU
  • Using distributed or parallel set-up in script?: I assume Yes, under the hood

Who can help

Information

Model I am using (Albert):

The problem arises when using:

  • my own modified scripts

The task I am working on is:

  • my own task or dataset

To reproduce

I’m trying to train a classification model on TPU using TFTrainer, and it fails with the following error:

Trying to run metric.update_state in replica context when the metric was not created in TPUStrategy scope. Make sure the keras Metric is created in TPUstrategy scope.

I tried training without evaluation; it finishes without an error, but the model is not actually trained and the results are poor. I also tried training with evaluation but without compute_metrics, and the same error is thrown.
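For reference, the error describes a TensorFlow constraint rather than anything specific to this script: a Keras metric that is updated inside a replica context must have been created inside the same TPUStrategy scope. Below is a minimal sketch of that constraint (it assumes a Colab TPU runtime; the metric and tensors are illustrative, and TFTrainer creates its own metrics internally, which appears to be why this is hard to work around from user code).

import tensorflow as tf

resolver = tf.distribute.cluster_resolver.TPUClusterResolver()  # Colab TPU runtime
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)

with strategy.scope():
    # Created inside the strategy scope, so it may be updated in a replica context.
    accuracy = tf.keras.metrics.SparseCategoricalAccuracy()

@tf.function
def update_metric(labels, logits):
    def step_fn(y_true, y_pred):
        accuracy.update_state(y_true, y_pred)
    strategy.run(step_fn, args=(labels, logits))

update_metric(tf.constant([0, 1]), tf.constant([[2.0, 0.1], [0.3, 1.5]]))
print(accuracy.result().numpy())

# Creating the same metric *outside* strategy.scope() and then updating it via
# strategy.run raises exactly the error quoted above.

The reproduction script: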

from transformers import TFTrainer, TFTrainingArguments
from transformers import TFAutoModelForSequenceClassification
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(pred):
  labels = pred.label_ids
  preds = pred.predictions.argmax(-1)
  precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='weighted')
  acc = accuracy_score(labels, preds)
  return {
      'accuracy': acc,
      'precision': precision,
      'recall': recall,
      'f1': f1
  }

training_args = TFTrainingArguments(
    tpu_num_cores=8,                 # number of TPU cores (a Colab TPU provides 8)
    output_dir=output_dir,           # output directory for checkpoints
    num_train_epochs=3,              # total number of training epochs
    per_device_train_batch_size=3,   # batch size per device during training
    per_device_eval_batch_size=3,    # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for the learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir=logging_dir,         # directory for storing logs
    logging_steps=10,                # log every 10 steps
    evaluation_strategy="steps",     # evaluate during training, every eval_steps
    eval_steps=500,                  # run evaluation every 500 steps
    save_steps=3000,                 # save a checkpoint every 3000 steps
    load_best_model_at_end=True,     # reload the best checkpoint when training finishes
    metric_for_best_model="f1",      # "best" is judged by the f1 value from compute_metrics
    learning_rate=1e-5
)

with training_args.strategy.scope():
    model = TFAutoModelForSequenceClassification.from_pretrained(
        modelName,
        num_labels=len(label_dict),
        output_attentions=False,
        output_hidden_states=False,
    )

trainer = TFTrainer(
    model=model,                         # the instantiated 🤗 Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    compute_metrics=compute_metrics,     # metrics computed during evaluation
    train_dataset=train_dataset,         # training dataset
    eval_dataset=val_dataset,            # evaluation dataset
)

trainer.train()
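The script above assumes train_dataset and val_dataset already exist. TFTrainer expects tf.data.Dataset objects yielding (features, label) pairs; a rough sketch of how they might be built (train_texts and train_labels are placeholder names, modelName is the checkpoint used above):

import tensorflow as tf
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(modelName)
train_encodings = tokenizer(train_texts, truncation=True, padding=True)

train_dataset = tf.data.Dataset.from_tensor_slices((
    dict(train_encodings),   # input_ids, attention_mask, (token_type_ids)
    train_labels,            # integer class ids consistent with label_dict
))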

Expected behavior

I would expect training to complete successfully on TPU.

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments:7 (3 by maintainers)

Top GitHub Comments

1 reaction
Rocketknight1 commented on Aug 30, 2021

Not easily, unfortunately. This is a known issue at our end and we’re hoping to implement a fix, but in the meantime you can try exporting your trained model to a GPU instance and running predict() there.
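A rough sketch of that workaround (the directory name and example sentence are placeholders; save_pretrained/from_pretrained are the standard way to move a model between runtimes):

import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

# On the TPU instance, after trainer.train():
#   model.save_pretrained("albert-finetuned")
#   tokenizer.save_pretrained("albert-finetuned")

# On a GPU or CPU instance, reload and predict:
model = TFAutoModelForSequenceClassification.from_pretrained("albert-finetuned")
tokenizer = AutoTokenizer.from_pretrained("albert-finetuned")

batch = tokenizer(["example sentence to classify"], return_tensors="tf",
                  padding=True, truncation=True)
logits = model(batch).logits
predicted_class = tf.math.argmax(logits, axis=-1).numpy()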

1 reaction
Rocketknight1 commented on Jun 17, 2021

Hi! We’re trying to move away from using TFTrainer for TensorFlow and instead train models with the native Keras API. We have a full example using the Keras approach here: https://github.com/huggingface/transformers/tree/master/examples/tensorflow/text-classification

Training on TPU with this example works correctly, but there are some issues with Keras predictions on TPU that we’re actively working on. If you encounter these (the output object contains None fields that should contain values), you can try moving any predict calls out of the strategy.scope(), or saving the model and doing the predictions on a GPU or CPU instance instead.
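A condensed sketch of the native Keras route that example takes (the checkpoint, label count, hyperparameters, and dataset names here are illustrative, not taken from the linked script): build and compile the model inside a TPUStrategy scope, then call fit() on batched tf.data.Datasets.

import tensorflow as tf
from transformers import TFAutoModelForSequenceClassification

resolver = tf.distribute.cluster_resolver.TPUClusterResolver()
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)

with strategy.scope():
    model = TFAutoModelForSequenceClassification.from_pretrained(
        "albert-base-v2", num_labels=2)  # placeholder checkpoint and label count
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=[tf.keras.metrics.SparseCategoricalAccuracy()],
    )

# train_tf_dataset / val_tf_dataset: batched tf.data.Datasets of (features, labels)
model.fit(train_tf_dataset, validation_data=val_tf_dataset, epochs=3)

Because compile() creates the metrics inside the strategy scope along with the model, the update_state error from the original report should not arise in this setup.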


Top Results From Across the Web

  • Error Training Keras Model on Google Colab using TPU runtime
    Dataset.from_generator is expected to not work with TPUs, as it uses py_function underneath, which is incompatible with the Cloud TPU 2VM setup.
  • Tutorials for using Colab TPUs with Huggingface Transformers?
    Looking for an easy-to-follow tutorial for using Huggingface Transformer models (e.g. BERT) in PyTorch on Google Colab with TPUs.
  • Issue with TPUs on Google Colab when training BERT
    I'm trying to run BERT in Google Colab using TPU, however I'm getting an error message which can be seen here. Tensorflow version...
  • Fine-tune a German GPT-2 Model with Tensorflow ... - Data Dive
    Using TPUs on Google Colab we reduce training time to a reasonable amount; our data set has text comments and their corresponding rating...
  • How to Colab with TPU - Towards Data Science
    Training a Huggingface BERT on Google Colab TPU. Following are some use cases where we might want to use a TPU as...
