
train/eval step results log not shown in terminal for tf_trainer.py

See original GitHub issue

Environment info

  • transformers version: 3.1.0
  • Platform: Linux-5.4.0-42-generic-x86_64-with-Ubuntu-18.04-bionic
  • Python version: 3.6.9
  • PyTorch version (GPU?): 1.6.0 (False)
  • Tensorflow version (GPU?): 2.2.0 (False)
  • Using GPU in script?: No
  • Using distributed or parallel set-up in script?: No

Who can help

Trainer: @sgugger tensorflow: @jplu @LysandreJik

Information

With the current code, which does not set logger.setLevel(logging.INFO) in trainer_tf.py, the output looks like this:

09/12/2020 03:42:41 - INFO - absl -   Load dataset info from /home/imo/tensorflow_datasets/glue/sst2/1.0.0
09/12/2020 03:42:41 - INFO - absl -   Reusing dataset glue (/home/imo/tensorflow_datasets/glue/sst2/1.0.0)
09/12/2020 03:42:41 - INFO - absl -   Constructing tf.data.Dataset for split validation, from /home/imo/tensorflow_datasets/glue/sst2/1.0.0
2020-09-12 03:42:57.010229: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:184] Filling up shuffle buffer (this may take a while): 41707 of 67349
2020-09-12 03:43:03.412045: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:233] Shuffle buffer filled.
2020-09-12 03:43:56.636791: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:184] Filling up shuffle buffer (this may take a while): 36279 of 67349
2020-09-12 03:44:04.474751: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:233] Shuffle buffer filled.
09/12/2020 03:44:51 - INFO - __main__ -   *** Evaluate ***
09/12/2020 03:45:02 - INFO - __main__ -   ***** Eval results *****
09/12/2020 03:45:02 - INFO - __main__ -     eval_loss = 0.712074209790711
09/12/2020 03:45:02 - INFO - __main__ -     eval_acc = 0.48977272727272725

You can see that the train/eval step logs are not shown.

If I manually set logger.setLevel(logging.INFO) in trainer_tf.py, the output becomes:

09/12/2020 06:04:39 - INFO - absl -   Load dataset info from /home/imo/tensorflow_datasets/glue/sst2/1.0.0
09/12/2020 06:04:39 - INFO - absl -   Reusing dataset glue (/home/imo/tensorflow_datasets/glue/sst2/1.0.0)
09/12/2020 06:04:39 - INFO - absl -   Constructing tf.data.Dataset for split validation, from /home/imo/tensorflow_datasets/glue/sst2/1.0.0
You are instantiating a Trainer but W&B is not installed. To use wandb logging, run `pip install wandb; wandb login` see https://docs.wandb.com/huggingface.
To use comet_ml logging, run `pip/conda install comet_ml` see https://www.comet.ml/docs/python-sdk/huggingface/
***** Running training *****
  Num examples = 67349
  Num Epochs = 1
  Instantaneous batch size per device = 4
  Total train batch size (w. parallel, distributed & accumulation) = 4
  Gradient Accumulation steps = 1
  Steps per epoch = 4
  Total optimization steps = 4
2020-09-12 06:04:49.637373: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:184] Filling up shuffle buffer (this may take a while): 39626 of 67349
2020-09-12 06:04:56.805687: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:233] Shuffle buffer filled.
{'loss': 0.6994307, 'learning_rate': 3.7499998e-05, 'epoch': 0.5, 'step': 1}
{'loss': 0.6897122, 'learning_rate': 2.5e-05, 'epoch': 0.75, 'step': 2}
Saving checkpoint for step 2 at ./sst-2/checkpoint/ckpt-1
{'loss': 0.683386, 'learning_rate': 1.25e-05, 'epoch': 1.0, 'step': 3}
{'loss': 0.68290234, 'learning_rate': 0.0, 'epoch': 1.25, 'step': 4}
Saving checkpoint for step 4 at ./sst-2/checkpoint/ckpt-2
Training took: 0:00:43.099437
Saving model in ./sst-2/
09/12/2020 06:05:26 - INFO - __main__ -   *** Evaluate ***
***** Running Evaluation *****
  Num examples = 872
  Batch size = 8
{'eval_loss': 0.6990196158032899, 'eval_acc': 0.49204545454545456, 'epoch': 1.25, 'step': 4}
09/12/2020 06:05:35 - INFO - __main__ -   ***** Eval results *****
09/12/2020 06:05:35 - INFO - __main__ -     eval_loss = 0.6990196158032899
09/12/2020 06:05:35 - INFO - __main__ -     eval_acc = 0.49204545454545456

Now we see more information, such as the per-step training logs:

{'loss': 0.6994307, 'learning_rate': 3.7499998e-05, 'epoch': 0.5, 'step': 1}

More importantly, we also see this message:

You are instantiating a Trainer but W&B is not installed. To use wandb logging, run `pip install wandb; wandb login` see https://docs.wandb.com/huggingface.
To use comet_ml logging, run `pip/conda install comet_ml` see https://www.comet.ml/docs/python-sdk/huggingface/

This message is not shown unless the logging level is set to INFO.
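As a user-side workaround in the meantime, the level can also be raised from the launching script without touching the library. This is only a minimal sketch using the standard logging module, and it assumes the logger in trainer_tf.py is the module-level logger named transformers.trainer_tf:

import logging

# The example scripts already call logging.basicConfig(); if yours does not,
# a handler is needed so that INFO records actually reach the terminal.
logging.basicConfig(
    format="%(asctime)s - %(levelname)s - %(name)s -   %(message)s",
    level=logging.INFO,
)

# Raise the level of the TF trainer's module logger (name assumed here)
# so its train/eval step logs are no longer filtered out.
logging.getLogger("transformers.trainer_tf").setLevel(logging.INFO)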

Related

In PR #6097, @LysandreJik changed logger.info(output) to print(output) in trainer.py in order to show the logs on the screen. Maybe we should do the same thing for tf_trainer.py. If not, we could set the logging level to INFO in tf_trainer.py; however, that would diverge from trainer.py, where the logging level is not set (at least not inside the trainer script itself).
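For illustration, here is a minimal sketch of what the print-based option could look like in TFTrainer's log method. The method body and attribute names below are assumed for the sketch (based on the keys visible in the printed logs above), not quoted from the repository:

def log(self, logs: dict) -> None:
    # Attach the current step, matching the {'loss': ..., 'step': ...} lines above.
    output = {**logs, "step": self.global_step}
    # Mirror PR #6097: print instead of logger.info so the step results always
    # reach the terminal, regardless of the configured logging level.
    print(output)  # was: logger.info(output)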

To reproduce

python3 run_tf_glue.py \
--task_name sst-2 \
--model_name_or_path distilbert-base-uncased \
--output_dir ./sst-2/ \
--max_seq_length  16 \
--num_train_epochs 2 \
--per_device_train_batch_size 4 \
--gradient_accumulation_steps 1 \
--max_steps 4 \
--logging_steps 1 \
--save_steps 2 \
--seed 1 \
--do_train \
--do_eval \
--do_predict \
--overwrite_output_dir

Expected behavior

I expect the train/eval step logs will be shown on the screen.

Remark

I can make a PR once a decision is made by the team.

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 11 (5 by maintainers)

Top GitHub Comments

1 reaction
ydshieh commented, Sep 14, 2020

@jplu As you might know, I opened this issue, but I don’t necessarily have the whole context, so I leave it to you to decide the desired behavior for tf_trainer.

0 reactions
stale[bot] commented, Nov 29, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
