Incorrect loss values calculated for TPU training.
🐛 Bug
Currently, using the Trainer on TPU calculates incorrect training and eval_during_training loss values. This leads to the loss values logged to Wandb also being incorrect.
Information
The problem seems to be that with a PyTorch/XLA training setup with multiprocessing, each process trains and evaluates on a disjoint (I believe) subset of the training and validation sets respectively. This produces as many train_loss and eval_during_training_loss values as there are processes, and these values all differ from one another. None of them is the correct loss value, because the loss should be computed over the entire dataset rather than over smaller subsets of it.
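For context, here is a rough sketch of why each process sees a different shard. This is not the actual Trainer internals (the helper name build_eval_dataloader is just for illustration), but it shows the usual pattern of building the eval dataloader with a DistributedSampler keyed on the XLA process ordinal, so each of the 8 processes gets a disjoint 1/8th of the eval set:

import torch_xla.core.xla_model as xm
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

def build_eval_dataloader(eval_dataset, batch_size):
    # Each TPU process has its own rank, and therefore a disjoint shard of
    # the dataset, so each process computes a loss over different examples.
    sampler = DistributedSampler(
        eval_dataset,
        num_replicas=xm.xrt_world_size(),  # 8 processes on a v3-8
        rank=xm.get_ordinal(),
    )
    return DataLoader(eval_dataset, sampler=sampler, batch_size=batch_size)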
The solution would be to aggregate these per-process loss values with XLA operations into a single train_loss and a single eval_loss value.
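As a minimal sketch of what that aggregation could look like (assuming torch_xla's xm.mesh_reduce; the helper name reduce_eval_loss is just for illustration):

import numpy as np
import torch_xla.core.xla_model as xm

def reduce_eval_loss(local_eval_loss):
    # mesh_reduce gathers the scalar from every TPU process and applies the
    # reduction, so every process ends up with the same mean eval loss.
    return xm.mesh_reduce("eval_loss_reduce", local_eval_loss, np.mean)

# eval_loss = reduce_eval_loss(eval_loss)  # then log it once, from a single process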
The differing eval_loss values are evident in this console log:
06/11/2020 05:33:59 - INFO - transformers.trainer - ***** Running Evaluation *****
06/11/2020 05:33:59 - INFO - transformers.trainer - Num examples = 5180
06/11/2020 05:33:59 - INFO - transformers.trainer - Batch size = 8
Evaluation: 100%|████████████████████████████████████████| 81/81 [00:41<00:00, 1.96it/s]
{"eval_loss": 2.407614219335862, "epoch": 0.06633645851760127, "step": 1500}
06/11/2020 05:34:40 - INFO - transformers.trainer - Saving model checkpoint to /home/saurabh/data/<retracted>/checkpoint-1500
Evaluation: 100%|████████████████████████████████████████| 81/81 [00:42<00:00, 1.89it/s]
{"eval_loss": 1.757087172181518, "epoch": 0.06633645851760127, "step": 1500}
06/11/2020 05:34:41 - INFO - transformers.trainer - Saving model checkpoint to /home/saurabh/data/<retracted>/checkpoint-1500
Evaluation: 100%|████████████████████████████████████████| 81/81 [00:43<00:00, 1.87it/s]
{"eval_loss": 2.2870501747101915, "epoch": 0.06633645851760127, "step": 1500}
06/11/2020 05:34:42 - INFO - transformers.trainer - Saving model checkpoint to /home/saurabh/data/<retracted>/checkpoint-1500
Evaluation: 100%|████████████████████████████████████████| 81/81 [00:43<00:00, 1.87it/s]
{"eval_loss": 2.3224751780062545, "epoch": 0.06633645851760127, "step": 1500}
06/11/2020 05:34:42 - INFO - transformers.trainer - Saving model checkpoint to /home/saurabh/data/<retracted>/checkpoint-1500
Evaluation: 100%|████████████████████████████████████████| 81/81 [00:43<00:00, 1.84it/s]
{"eval_loss": 2.339173612035351, "epoch": 0.06633645851760127, "step": 1500}
06/11/2020 05:34:42 - INFO - transformers.trainer - Saving model checkpoint to /home/saurabh/data/<retracted>/checkpoint-1500
Evaluation: 100%|████████████████████████████████████████| 81/81 [00:43<00:00, 1.85it/s]
{"eval_loss": 2.3176549371377924, "epoch": 0.06633645851760127, "step": 1500}
06/11/2020 05:34:42 - INFO - transformers.trainer - Saving model checkpoint to /home/saurabh/data/<retracted>/checkpoint-1500
Evaluation: 100%|████████████████████████████████████████| 81/81 [00:43<00:00, 1.84it/s]
{"eval_loss": 2.449997420664187, "epoch": 0.06633645851760127, "step": 1500}
06/11/2020 05:34:42 - INFO - transformers.trainer - Saving model checkpoint to /home/saurabh/data/<retracted>/checkpoint-1500
Evaluation: 100%|████████████████████████████████████████| 81/81 [00:44<00:00, 1.84it/s]
{"eval_loss": 2.18177890336072, "epoch": 0.06633645851760127, "step": 1500}
Model I am using (Bert, XLNet …): Every model with the PyTorch Trainer
Language I am using the model on (English, Chinese …): Doesn't matter
The problem arises when using:
- the official example scripts: (give details below)
- my own modified scripts: (give details below)
The task I am working on is:
- an official GLUE/SQuAD task: (give the name)
- my own task or dataset: (give details below)
To reproduce
Steps to reproduce the behavior: run any PyTorch/TPU training, for example a language modelling task.
- Set up a PyTorch/XLA training environment
export TRAIN_FILE=/path/to/dataset/wiki.train.raw
export TEST_FILE=/path/to/dataset/wiki.test.raw
export WANDB_WATCH=false # Fixes bug https://github.com/huggingface/transformers/issues/4814
python xla_spawn.py --num_cores 8 language_modeling/run_language_modeling.py \
--output_dir=output \
--model_type=roberta \
--model_name_or_path=roberta-base \
--do_train \
--train_data_file=$TRAIN_FILE \
--do_eval \
--eval_data_file=$TEST_FILE \
--mlm \
--evaluate_during_training \
--per_device_train_batch_size=4 \
--per_device_eval_batch_size=4
Expected behavior
A single train_loss and a single eval_loss value per logging_step, both in the console output and on Wandb.
Environment info
- transformers version: 2.11.0 (master)
- Platform: Linux-5.3.0-1026-gcp-x86_64-with-Ubuntu-18.04-bionic
- Python version: 3.6.9
- PyTorch version (GPU?): 1.6.0a0+6bdfd6a (False)
- Tensorflow version (GPU?): 2.2.0 (False)
- Using GPU in script?: no
- Using distributed or parallel set-up in script?: yes, 8 way TPU/XLA multiprocessing
Issue Analytics
- Created: 3 years ago
- Comments: 7 (6 by maintainers)
Top GitHub Comments
Ok, maybe we should wrap the entire logging (wandb + tensorboard + console) with "is_world_master" instead of doing it only for wandb.
@julien-c what do you think? If that's the way to go I can submit a quick PR.
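A minimal sketch of what that gating could look like (log_metrics is a hypothetical helper, not actual Trainer code; inside the Trainer the equivalent check would be self.is_world_master()):

import json
import torch_xla.core.xla_model as xm

def log_metrics(logs, step):
    # Only the master ordinal writes metrics, so a single eval_loss line is
    # emitted per logging step instead of one per TPU process.
    if xm.is_master_ordinal():
        print(json.dumps({**logs, "step": step}))
        # wandb.log(logs, step=step)   # likewise gate wandb here, if enabled
        # tb_writer.add_scalar(...)    # and tensorboard, if enabled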
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.