# Training on Google Colab TPU with `TFTrainer` fails with a metric scope error
## Environment info

- `transformers` version: 4.6.1
- Platform: Linux-5.4.104+-x86_64-with-Ubuntu-18.04-bionic
- Python version: 3.7.10
- PyTorch version (GPU?): 1.8.1+cu101 (False)
- Tensorflow version (GPU?): 2.5.0 (False)
- Using GPU in script?: Using TPU
- Using distributed or parallel set-up in script?: I assume Yes, under the hood
## Who can help
- trainer: @sgugger @Rocketknight1
## Information
Model I am using: Albert
The problem arises when using:
- my own modified scripts
The task I am working on is:
- my own task or dataset
## To reproduce
I'm trying to train a classification model on a TPU using `TFTrainer`, and it fails with the following error:

```
Trying to run metric.update_state in replica context when the metric was not created in TPUStrategy scope. Make sure the keras Metric is created in TPUstrategy scope.
```
I tried training without evaluation: it finishes without an error, but the model is not actually trained and the results are poor. I also tried training with evaluation but without `compute_metrics`, and the same error is thrown.
```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from transformers import TFTrainer, TFTrainingArguments
from transformers import TFAutoModelForSequenceClassification

# modelName, label_dict, output_dir, logging_dir, train_dataset and
# val_dataset are defined earlier in the notebook.

def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='weighted')
    acc = accuracy_score(labels, preds)
    return {
        'accuracy': acc,
        'precision': precision,
        'recall': recall,
        'f1': f1
    }

training_args = TFTrainingArguments(
    tpu_num_cores=8,
    output_dir=output_dir,           # output directory
    num_train_epochs=3,              # total number of training epochs
    per_device_train_batch_size=3,   # batch size per device during training
    per_device_eval_batch_size=3,    # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for the learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir=logging_dir,         # directory for storing logs
    logging_steps=10,
    evaluation_strategy="steps",
    eval_steps=500,
    save_steps=3000,
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    learning_rate=1e-5
)

with training_args.strategy.scope():
    model = TFAutoModelForSequenceClassification.from_pretrained(
        modelName,
        num_labels=len(label_dict),
        output_attentions=False,
        output_hidden_states=False)

trainer = TFTrainer(
    model=model,                     # the instantiated 🤗 Transformers model to be trained
    args=training_args,              # training arguments, defined above
    compute_metrics=compute_metrics,
    train_dataset=train_dataset,     # training dataset
    eval_dataset=val_dataset,        # evaluation dataset
)

trainer.train()
```
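For context on the error: TensorFlow requires that Keras metrics updated under a distribution strategy be created inside that strategy's scope, which is what the message says. A minimal standalone illustration of the constraint (generic TF code of my own, not `TFTrainer` internals):

```python
import tensorflow as tf

# Connect to the Colab TPU and build a TPUStrategy.
resolver = tf.distribute.cluster_resolver.TPUClusterResolver()
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)

# Created OUTSIDE the scope: calling update_state on this metric from a
# replica context raises the "not created in TPUStrategy scope" error.
bad_metric = tf.keras.metrics.SparseCategoricalAccuracy()

# Created INSIDE the scope: safe to update from replica context.
with strategy.scope():
    good_metric = tf.keras.metrics.SparseCategoricalAccuracy()
```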
## Expected behavior
I would expect training to complete successfully on the TPU.
## Comments
Not easily, unfortunately. This is a known issue at our end and we're hoping to implement a fix, but in the meantime you can try exporting your trained model to a GPU instance and running `predict()` there.

Hi! We're trying to move away from using `TFTrainer` for TensorFlow and instead train models with the native Keras API. We have a full example using the Keras approach here: https://github.com/huggingface/transformers/tree/master/examples/tensorflow/text-classification
Training on TPU with this example works correctly, but there are some issues with Keras predictions on TPU that we're actively working on. If you encounter these (the output object contains `None` fields that should contain values), you can try moving any `predict` calls out of the `strategy.scope()`, or saving the model and doing the predictions on a GPU or CPU instance instead.
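A sketch of the save-then-predict workaround described in both comments (the path is a placeholder; `val_dataset` is the batched `tf.data.Dataset` from the original script):

```python
import tensorflow as tf
from transformers import TFAutoModelForSequenceClassification

# On the TPU instance, after training:
model.save_pretrained("./trained_model")  # placeholder path

# Later, on a CPU or GPU instance:
model = TFAutoModelForSequenceClassification.from_pretrained("./trained_model")
for batch, labels in val_dataset:         # assumes (inputs, labels) batches
    logits = model(batch).logits          # forward pass outside any TPU scope
    preds = tf.math.argmax(logits, axis=-1)
```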