[Feature Request] EarlyStopping logging on rank 0 only

🚀 Feature

Toggle switch to turn off EarlyStopping logging for processes other than rank 0

Motivation

EarlyStopping logging can be a bit spammy when viewing aggregate logs across all processes. For example, with my custom CloudWatch logger:

xnpww4j62d-algo-1-vr8o9 | 14:17:49 [INFO] Epoch 9: [ Training | 100%  iter# 49/49    19.28 batches/s ] train/loss_step=0.764418, train/loss_epoch=0.773, train/acc=0.68356
xnpww4j62d-algo-1-vr8o9 | 14:17:55 [INFO] Epoch 9: [ Validation | 100%  iter# 10/10     2.34 batches/s ] val/loss_step=1.253475, val/loss_epoch=1.278802, val/acc=0.6107
xnpww4j62d-algo-1-vr8o9 | 14:17:55 [INFO] [rank: 0] Metric val/acc improved by 0.195 >= min_delta = 0.0. New best score: 0.611
xnpww4j62d-algo-1-vr8o9 | 14:17:55 [INFO] [rank: 2] Metric val/acc improved by 0.195 >= min_delta = 0.0. New best score: 0.611
xnpww4j62d-algo-1-vr8o9 | 14:17:55 [INFO] [rank: 1] Metric val/acc improved by 0.195 >= min_delta = 0.0. New best score: 0.611
xnpww4j62d-algo-1-vr8o9 | 14:17:55 [INFO] [rank: 3] Metric val/acc improved by 0.195 >= min_delta = 0.0. New best score: 0.611
xnpww4j62d-algo-1-vr8o9 | 14:17:55 [INFO] [rank: 4] Metric val/acc improved by 0.195 >= min_delta = 0.0. New best score: 0.611
xnpww4j62d-algo-1-vr8o9 | 14:17:55 [INFO] [rank: 5] Metric val/acc improved by 0.195 >= min_delta = 0.0. New best score: 0.611
xnpww4j62d-algo-1-vr8o9 | 14:17:55 [INFO] [rank: 6] Metric val/acc improved by 0.195 >= min_delta = 0.0. New best score: 0.611
xnpww4j62d-algo-1-vr8o9 | 14:17:55 [INFO] [rank: 7] Metric val/acc improved by 0.195 >= min_delta = 0.0. New best score: 0.611
xnpww4j62d-algo-1-vr8o9 | 14:18:20 [INFO] Epoch 14: [ Training | 100%  iter# 49/49    18.94 batches/s ] train/loss_step=0.611876, train/loss_epoch=0.55, train/acc=0.80096
xnpww4j62d-algo-1-vr8o9 | 14:18:26 [INFO] Epoch 14: [ Validation | 100%  iter# 10/10     2.29 batches/s ] val/loss_step=0.748429, val/loss_epoch=0.828285, val/acc=0.726

Pitch

It would be nice to be able to turn off printing of this message on processes other than rank 0. I understand that the per-rank messages are useful to monitor in some cases, so this toggle could be set to False by default.
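
A minimal sketch of the toggle semantics pitched above, written against the standard library only. It is not Lightning's implementation: `format_status`, `log_status`, and the logger name are hypothetical, and the `[rank: N]` prefix simply mirrors the log excerpt in the Motivation section.

```python
import logging
from typing import Optional

# Hypothetical logger name, used only for this sketch.
log = logging.getLogger("early_stopping_sketch")


def format_status(
    message: str,
    global_rank: int,
    world_size: int,
    log_rank_zero_only: bool = False,
) -> Optional[str]:
    """Return the line this process should log, or None to stay silent."""
    if log_rank_zero_only and global_rank != 0:
        # Toggle on: every rank except 0 is silenced.
        return None
    if world_size > 1:
        # Same prefix style as the log excerpt above.
        return f"[rank: {global_rank}] {message}"
    return message


def log_status(
    message: str,
    global_rank: int,
    world_size: int,
    log_rank_zero_only: bool = False,
) -> None:
    """Log the status line unless the toggle silences this rank."""
    line = format_status(message, global_rank, world_size, log_rank_zero_only)
    if line is not None:
        log.info(line)
```

With the default `log_rank_zero_only=False`, every rank still logs, which matches the suggestion that the toggle be off by default.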

Alternatives

Custom EarlyStopping callback?
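
Another possible workaround that avoids a custom callback is to filter the messages on the logging side. The sketch below uses only Python's `logging` module; `RankZeroOnlyFilter` is a hypothetical name, and the logger name passed to `getLogger` is an assumption about where the callback logs, so verify it against your installed version.

```python
import logging
import re

# Matches the "[rank: N]" prefix seen in the log excerpt above.
_RANK_PREFIX = re.compile(r"^\[rank: (\d+)\]")


class RankZeroOnlyFilter(logging.Filter):
    """Drop records whose message starts with "[rank: N]" for N != 0."""

    def filter(self, record: logging.LogRecord) -> bool:
        match = _RANK_PREFIX.match(record.getMessage())
        # Keep records without a rank prefix, and records from rank 0.
        return match is None or int(match.group(1)) == 0


# Assumption: the callback logs through a module-level logger with this name.
logging.getLogger("pytorch_lightning.callbacks.early_stopping").addFilter(
    RankZeroOnlyFilter()
)
```

A filter attached to a logger only sees records created through that exact logger, so if the name differs the filter can be attached to the relevant handler instead.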

cc @borda @carmocca @awaelchli @rohitgr7

Issue Analytics

  • State: closed
  • Created a year ago
  • Reactions: 2
  • Comments: 5 (4 by maintainers)

Top GitHub Comments

4 reactions
carmocca commented, May 26, 2022

I think we can add this flag. It’s useful for metrics logged with sync_dist=True.

The relevant piece of code is here:

https://github.com/PyTorchLightning/pytorch-lightning/blob/dd475183227644a8d22dca3deb18c99fb0a9b2c4/pytorch_lightning/callbacks/early_stopping.py#L256-L261
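
A hedged usage sketch: the PyTorch Lightning 1.8.5.post0 documentation lists a `log_rank_zero_only` parameter on `EarlyStopping`, which appears to be the flag discussed in this comment. Assuming a Lightning version that includes it, the callback could be configured as below. `build_early_stopping` is a hypothetical helper; `monitor` and `min_delta` are taken from the log excerpt, and `mode="max"` is an assumption that higher accuracy is better.

```python
def build_early_stopping():
    """Build an EarlyStopping callback that logs its status on rank 0 only."""
    # Imported lazily so this sketch can be loaded without Lightning installed.
    from pytorch_lightning.callbacks import EarlyStopping

    return EarlyStopping(
        monitor="val/acc",  # metric name from the log excerpt
        mode="max",  # assumption: higher accuracy is better
        min_delta=0.0,  # value shown in the logged messages
        log_rank_zero_only=True,  # parameter listed in the 1.8.5.post0 docs
    )
```

The returned callback would then be passed to the trainer, for example `Trainer(callbacks=[build_early_stopping()])`.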

2 reactions
ekagra-ranjan commented, May 30, 2022

Hi @carmocca! I would like to take this up.


Top Results From Across the Web

[Feature Request] EarlyStopping logging on rank 0 only #13162
EarlyStopping — PyTorch Lightning 1.8.5.post0 documentation
log_rank_zero_only (bool) – When set True, logs the status of the early stopping callback only for rank 0 process.