
Incorrect loss values calculated for TPU training.

🐛 Bug

Currently, using Trainer on TPU produces incorrect training and eval_during_training loss values. As a result, the loss values logged to Wandb are also incorrect.

Information

The problem seems to be that, in a PyTorch/XLA training setup with multiprocessing, each process trains and evaluates on a disjoint (I believe) subset of the training and validation sets respectively. This produces as many train_loss and eval_during_training_loss values as there are processes, and these values differ from one another. None of them is the correct loss, because the correct loss should be calculated on the entire dataset and not on a smaller subset of it.
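To illustrate the effect described above, here is a minimal sketch in plain Python that does not use PyTorch/XLA at all. It assumes a strided split of the evaluation set across 8 processes (the real sampler may split differently), and the per-example losses are synthetic numbers made up for the illustration. The helper names `shard_indices` and `mean` are hypothetical.

```python
import random

WORLD_SIZE = 8        # matches --num_cores 8 in this report
NUM_EXAMPLES = 5180   # "Num examples" from the console log


def shard_indices(num_examples, world_size, rank):
    """Strided split (an assumption): rank gets rank, rank + world_size, ..."""
    return list(range(rank, num_examples, world_size))


def mean(values):
    return sum(values) / len(values)


# Synthetic per-example losses, made up for illustration only.
rng = random.Random(0)
per_example_loss = [rng.uniform(1.0, 3.5) for _ in range(NUM_EXAMPLES)]

# Each process only sees its own shard, so it reports its own mean loss.
shards = [shard_indices(NUM_EXAMPLES, WORLD_SIZE, r) for r in range(WORLD_SIZE)]
shard_means = [mean([per_example_loss[i] for i in shard]) for shard in shards]

# The value that should be reported is the mean over the whole dataset.
global_mean = mean(per_example_loss)

for rank, value in enumerate(shard_means):
    print(f"process {rank}: eval_loss over its shard = {value:.4f}")
print(f"loss over the entire dataset = {global_mean:.4f}")
```

Each process prints a slightly different value, and none of them is guaranteed to match the loss over the entire dataset, which is the same pattern as in the console log of this report.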

The solution would be to aggregate these loss values with XLA operations into single train_loss and eval_loss values.
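The aggregation step could look roughly like the sketch below. It is plain Python and only shows the arithmetic: a mean of the per-process losses, weighted by how many examples each process saw. The function name `aggregate_loss` is hypothetical. On a real TPU run the per-process values would first have to be gathered across processes (torch_xla's `xm.mesh_reduce` is one candidate for that), which this sketch deliberately leaves out. The input values are the eight eval_loss values from the console log in this report, and equal shard sizes are assumed.

```python
from typing import Optional, Sequence


def aggregate_loss(
    per_process_losses: Sequence[float],
    per_process_counts: Optional[Sequence[int]] = None,
) -> float:
    """Combine per-process mean losses into one value.

    If example counts are given, the result is the count-weighted mean,
    which equals the mean loss over all examples. Without counts, all
    processes are assumed to have seen the same number of examples.
    """
    if not per_process_losses:
        raise ValueError("need at least one loss value")
    if per_process_counts is None:
        per_process_counts = [1] * len(per_process_losses)
    if len(per_process_counts) != len(per_process_losses):
        raise ValueError("losses and counts must have the same length")
    total = sum(per_process_counts)
    if total <= 0:
        raise ValueError("total example count must be positive")
    weighted = sum(l * c for l, c in zip(per_process_losses, per_process_counts))
    return weighted / total


# The eight eval_loss values reported at step 1500 in the console log.
eval_losses = [
    2.407614219335862,
    1.757087172181518,
    2.2870501747101915,
    2.3224751780062545,
    2.339173612035351,
    2.3176549371377924,
    2.449997420664187,
    2.18177890336072,
]

# Assumes every process evaluated the same number of examples.
single_eval_loss = aggregate_loss(eval_losses)
print({"eval_loss": single_eval_loss, "step": 1500})
```

Weighting by example count matters when the shards are not the same size, for example when the last batch on some processes is smaller.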

The different eval_loss values are evident in this console log:

06/11/2020 05:33:59 - INFO - transformers.trainer -   ***** Running Evaluation *****
06/11/2020 05:33:59 - INFO - transformers.trainer -     Num examples = 5180
06/11/2020 05:33:59 - INFO - transformers.trainer -     Batch size = 8                                                                                                     
Evaluation: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 81/81 [00:41<00:00,  1.96it/s]
{"eval_loss": 2.407614219335862, "epoch": 0.06633645851760127, "step": 1500}                                                                                               
06/11/2020 05:34:40 - INFO - transformers.trainer -   Saving model checkpoint to /home/saurabh/data/<retracted>/checkpoint-1500█████| 81/81 [00:41<00:00,  2.06it/s$
Evaluation: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 81/81 [00:42<00:00,  1.89it/s$
{"eval_loss": 1.757087172181518, "epoch": 0.06633645851760127, "step": 1500}                          
06/11/2020 05:34:41 - INFO - transformers.trainer -   Saving model checkpoint to /home/saurabh/data/<retracted>/checkpoint-1500                                     
Evaluation: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 81/81 [00:43<00:00,  1.87it/s$
{"eval_loss": 2.2870501747101915, "epoch": 0.06633645851760127, "step": 1500}                                                                                              
06/11/2020 05:34:42 - INFO - transformers.trainer -   Saving model checkpoint to /home/saurabh/data/<retracted>/checkpoint-1500
Evaluation: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 81/81 [00:43<00:00,  1.87it/s$
{"eval_loss": 2.3224751780062545, "epoch": 0.06633645851760127, "step": 1500}               
06/11/2020 05:34:42 - INFO - transformers.trainer -   Saving model checkpoint to /home/saurabh/data/<retracted>/checkpoint-1500
Evaluation: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 81/81 [00:43<00:00,  1.84it/s]
{"eval_loss": 2.339173612035351, "epoch": 0.06633645851760127, "step": 1500}
06/11/2020 05:34:42 - INFO - transformers.trainer -   Saving model checkpoint to /home/saurabh/data/<retracted>/checkpoint-1500
Evaluation: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 81/81 [00:43<00:00,  1.85it/s]
{"eval_loss": 2.3176549371377924, "epoch": 0.06633645851760127, "step": 1500}
06/11/2020 05:34:42 - INFO - transformers.trainer -   Saving model checkpoint to /home/saurabh/data/<retracted>/checkpoint-1500
Evaluation: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 81/81 [00:43<00:00,  1.84it/s]
{"eval_loss": 2.449997420664187, "epoch": 0.06633645851760127, "step": 1500}
06/11/2020 05:34:42 - INFO - transformers.trainer -   Saving model checkpoint to /home/saurabh/data/<retracted>/checkpoint-1500
Evaluation: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 81/81 [00:44<00:00,  1.84it/s]
{"eval_loss": 2.18177890336072, "epoch": 0.06633645851760127, "step": 1500}

Model I am using (Bert, XLNet …): Every model with PyTorch Trainer

Language I am using the model on (English, Chinese …): Doesn’t matter

The problem arises when using:

the official example scripts: (give details below)
my own modified scripts: (give details below)

The task I am working on is:

an official GLUE/SQUaD task: (give the name)
my own task or dataset: (give details below)

To reproduce

Steps to reproduce the behavior: Run any PyTorch/TPU training, for example a language modelling task

Set up a PyTorch/XLA training environment

export TRAIN_FILE=/path/to/dataset/wiki.train.raw
export TEST_FILE=/path/to/dataset/wiki.test.raw
export WANDB_WATCH=false  # Fixes bug https://github.com/huggingface/transformers/issues/4814
python xla_spawn.py --num_cores 8 language_modeling/run_language_modeling.py \
    --output_dir=output \
    --model_type=roberta \
    --model_name_or_path=roberta-base \
    --do_train \
    --train_data_file=$TRAIN_FILE \
    --do_eval \
    --eval_data_file=$TEST_FILE \
    --mlm \
    --evaluate_during_training \
    --per_device_train_batch_size=4 \
    --per_device_eval_batch_size=4

Expected behavior

A single train_loss value and a single eval_loss value per logging_step, both in the console output and in Wandb.

Environment info

transformers version: 2.11.0 (master)
Platform: Linux-5.3.0-1026-gcp-x86_64-with-Ubuntu-18.04-bionic
Python version: 3.6.9
PyTorch version (GPU?): 1.6.0a0+6bdfd6a (False)
Tensorflow version (GPU?): 2.2.0 (False)
Using GPU in script?: no
Using distributed or parallel set-up in script?: yes, 8 way TPU/XLA multiprocessing

Issue Analytics

State:
Created 3 years ago
Comments: 7 (6 by maintainers)

Top GitHub Comments

borisdayma commented, Jun 16, 2020

Ok, maybe we should wrap the entire logging (wandb + tensorboard + console) with “is_world_master” instead of doing it only for wandb.

@julien-c what do you think? If that’s the way to go I can submit a quick PR.
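A rough sketch of the idea in the comment above, in plain Python: every logging sink is guarded by a single "is world master" check, so only one process emits output. The function `log_metrics`, the sink callables, and the way the master flag is passed in are all hypothetical and are not the Trainer's actual code. The eval_loss value is a placeholder.

```python
import json
from typing import Callable, Dict, List

Logs = Dict[str, float]


def log_metrics(
    logs: Logs,
    is_world_master: bool,
    sinks: List[Callable[[Logs], None]],
) -> bool:
    """Send logs to every sink, but only on the master process.

    Returns True if the logs were emitted, False if they were skipped.
    """
    if not is_world_master:
        return False
    for sink in sinks:
        sink(logs)
    return True


# Stand-in for console output; wandb and tensorboard would be further sinks.
console_lines: List[str] = []


def console_sink(logs: Logs) -> None:
    console_lines.append(json.dumps(logs))


# Simulate 8 processes all reaching the same logging step.
for rank in range(8):
    log_metrics(
        {"eval_loss": 2.25, "step": 1500},  # placeholder value
        is_world_master=(rank == 0),
        sinks=[console_sink],
    )

print(console_lines)
```

Note that this only removes the duplicate log lines. The value logged by the master process would still be computed on its own subset unless the losses are also aggregated across processes.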

stale[bot] commented, Aug 15, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
