[trainer] port metrics logging and saving methods to all example scripts

In an effort to make the examples easier to read, in https://github.com/huggingface/transformers/pull/10266 we added new trainer methods:

  • trainer.log_metrics - to perform consistent formatting for logged metrics
  • trainer.save_metrics - to save the metrics into a corresponding json file.

and deployed them in run_seq2seq.py.
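
As a rough sketch of the calling pattern, each script ends up handing its metrics dict to the two methods in turn. The `RecordingTrainer` class and the `report_metrics` helper below are hypothetical stand-ins so the snippet runs without transformers installed; only the method names `log_metrics` and `save_metrics` come from this issue, and the `(split, metrics)` argument order is an assumption that should be checked against the diff in the PR.

```python
from typing import Any, Dict, List, Tuple


class RecordingTrainer:
    """Hypothetical stand-in for the real Trainer; it only records calls."""

    def __init__(self) -> None:
        self.calls: List[Tuple[str, str, Dict[str, Any]]] = []

    def log_metrics(self, split: str, metrics: Dict[str, Any]) -> None:
        # The real method formats and logs the metrics; here we just record.
        self.calls.append(("log", split, dict(metrics)))

    def save_metrics(self, split: str, metrics: Dict[str, Any]) -> None:
        # The real method writes a json file; here we just record.
        self.calls.append(("save", split, dict(metrics)))


def report_metrics(trainer: Any, split: str, metrics: Dict[str, Any]) -> None:
    """Hypothetical helper: replaces explicit printing with the two methods."""
    trainer.log_metrics(split, metrics)
    trainer.save_metrics(split, metrics)


trainer = RecordingTrainer()
report_metrics(trainer, "train", {"epoch": 1.0, "train_samples": 100})
report_metrics(trainer, "eval", {"eval_loss": 3.7533})
```

In a real script the metrics dict would come from the training or evaluation call rather than being written by hand.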

The next task is to do the same for all the other examples/*/run_*.py scripts.

Steps:

  1. Study the diff for run_seq2seq.py. https://github.com/huggingface/transformers/pull/10266/files#diff-82bfb61a8b91894c2c2101734a6ab7b415be4ace5cd1e01b4c37663020d924ae
  2. Pick a script, e.g. examples/multiple-choice/run_swag.py
  3. Apply the same changes as in step 1: remove the explicit metrics printing lines and replace them with the 2 new methods.
  4. Test the modified script (the README.md in that folder usually has instructions for running it) and check that your change works: the train/eval/test metrics are printed in the new format and the (train|eval|test|all)_results.json files are generated. A very small data sample is enough, 5 records will do, which you get by adding: --max_train_samples 5 --max_val_samples 5 --max_test_samples 5

Repeat for the other scripts.
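
The file check in step 4 can be scripted. The following is a minimal sketch, assuming the result files land directly in the script's output directory; the `missing_result_files` helper is hypothetical and not part of transformers, and the file names are taken from the `(train|eval|test|all)_results.json` pattern above.

```python
from pathlib import Path
from typing import Iterable, List


def missing_result_files(
    output_dir: str,
    splits: Iterable[str] = ("train", "eval", "test", "all"),
) -> List[str]:
    """Return the expected <split>_results.json names absent from output_dir."""
    root = Path(output_dir)
    return [
        f"{split}_results.json"
        for split in splits
        if not (root / f"{split}_results.json").is_file()
    ]
```

An empty list means every expected file was generated. If a script was run with only some of training, evaluation and prediction enabled, pass just those splits.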

Thank you very much!

The metrics log should look similar to this, although the scoring metrics will differ from script to script:



02/16/2021 17:06:39 - INFO - __main__ -   ***** train metrics *****
02/16/2021 17:06:39 - INFO - __main__ -     epoch                      =    1.0
02/16/2021 17:06:39 - INFO - __main__ -     init_mem_cpu_alloc_delta   =    2MB
02/16/2021 17:06:39 - INFO - __main__ -     init_mem_cpu_peaked_delta  =    0MB
02/16/2021 17:06:39 - INFO - __main__ -     init_mem_gpu_alloc_delta   =  230MB
02/16/2021 17:06:39 - INFO - __main__ -     init_mem_gpu_peaked_delta  =    0MB
02/16/2021 17:06:39 - INFO - __main__ -     total_flos                 = 2128GF
02/16/2021 17:06:39 - INFO - __main__ -     train_mem_cpu_alloc_delta  =   55MB
02/16/2021 17:06:39 - INFO - __main__ -     train_mem_cpu_peaked_delta =    0MB
02/16/2021 17:06:39 - INFO - __main__ -     train_mem_gpu_alloc_delta  =  692MB
02/16/2021 17:06:39 - INFO - __main__ -     train_mem_gpu_peaked_delta =  661MB
02/16/2021 17:06:39 - INFO - __main__ -     train_runtime              = 2.3114
02/16/2021 17:06:39 - INFO - __main__ -     train_samples              =    100
02/16/2021 17:06:39 - INFO - __main__ -     train_samples_per_second   =  3.028

02/16/2021 17:06:43 - INFO - __main__ -   ***** val metrics *****
02/16/2021 17:13:05 - INFO - __main__ -     epoch                     =     1.0
02/16/2021 17:13:05 - INFO - __main__ -     eval_bleu                 = 24.6502
02/16/2021 17:13:05 - INFO - __main__ -     eval_gen_len              =    32.9
02/16/2021 17:13:05 - INFO - __main__ -     eval_loss                 =  3.7533
02/16/2021 17:13:05 - INFO - __main__ -     eval_mem_cpu_alloc_delta  =     0MB
02/16/2021 17:13:05 - INFO - __main__ -     eval_mem_cpu_peaked_delta =     0MB
02/16/2021 17:13:05 - INFO - __main__ -     eval_mem_gpu_alloc_delta  =     0MB
02/16/2021 17:13:05 - INFO - __main__ -     eval_mem_gpu_peaked_delta =   510MB
02/16/2021 17:13:05 - INFO - __main__ -     eval_runtime              =  3.9266
02/16/2021 17:13:05 - INFO - __main__ -     eval_samples              =     100
02/16/2021 17:13:05 - INFO - __main__ -     eval_samples_per_second   =  25.467

02/16/2021 17:06:48 - INFO - __main__ -     ***** test metrics *****
02/16/2021 17:06:48 - INFO - __main__ -     test_bleu                 = 27.146
02/16/2021 17:06:48 - INFO - __main__ -     test_gen_len              =  41.37
02/16/2021 17:06:48 - INFO - __main__ -     test_loss                 = 3.6682
02/16/2021 17:06:48 - INFO - __main__ -     test_mem_cpu_alloc_delta  =    0MB
02/16/2021 17:06:48 - INFO - __main__ -     test_mem_cpu_peaked_delta =    0MB
02/16/2021 17:06:48 - INFO - __main__ -     test_mem_gpu_alloc_delta  =    0MB
02/16/2021 17:06:48 - INFO - __main__ -     test_mem_gpu_peaked_delta =  645MB
02/16/2021 17:06:48 - INFO - __main__ -     test_runtime              = 5.1136
02/16/2021 17:06:48 - INFO - __main__ -     test_samples              =    100
02/16/2021 17:06:48 - INFO - __main__ -     test_samples_per_second   = 19.556
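
To illustrate what "consistent formatting" means in the log above, here is a small stand-in that lays out a metrics dict the same way: keys sorted and left-aligned, values right-aligned. It is a sketch under assumptions, not the library's implementation; `format_metrics` is a hypothetical name, and the real method also adds units such as MB and GF, which this version does not attempt.

```python
from typing import Any, Dict, List


def format_metrics(split: str, metrics: Dict[str, Any]) -> List[str]:
    """Hypothetical formatter: aligned "key = value" lines under a header."""
    lines = [f"***** {split} metrics *****"]
    if not metrics:
        return lines
    # Sort by key so the output order is stable from run to run.
    items = sorted((key, str(value)) for key, value in metrics.items())
    key_width = max(len(key) for key, _ in items)
    val_width = max(len(value) for _, value in items)
    for key, value in items:
        lines.append(f"  {key:<{key_width}} = {value:>{val_width}}")
    return lines


for line in format_metrics(
    "test", {"test_loss": 3.6682, "test_bleu": 27.146, "test_samples": 100}
):
    print(line)
```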

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 13 (13 by maintainers)

Top GitHub Comments

bhadreshpsavani commented, Feb 26, 2021

Sure @stas00, I will be happy to work on it!

stas00 commented, Feb 26, 2021

Oh, and since you are doing amazingly useful work syncing all the examples to look and feel similar, there is one very crucial thing to sync: templates/adding_a_new_example_script/, on which all new examples will be based, so we had better have a good template to start with. I forgot to mention that earlier. Thank you!
