EarlyStopping not working / wrong keys in log
🐛 Bug
I’m trying to implement EarlyStopping when validation loss stops decreasing. I add the callback as follows:
def validation_step(self, batch, batch_idx):
    x, y = batch
    y_hat = self(x)
    loss = F.l1_loss(y_hat, y)
    result = pl.EvalResult(checkpoint_on=loss)
    result.log("val_loss", loss, sync_dist=True)
    return result
early_stop_callback = EarlyStopping(
    monitor="val_loss",
    min_delta=0.1,
    patience=1,
    verbose=True,
    mode="min")

trainer = pl.Trainer(
    gpus=-1,
    max_epochs=50,
    distributed_backend="ddp",
    early_stop_callback=early_stop_callback,
    logger=wandb_logger)
This does not work - it returns False from the _validate_condition_metric function.
When I checked what’s in the log dictionary, the values looked like
{'val_early_stop_on': None, 'val_checkpoint_on': tensor(0.5601, device='cuda:0')}
- which is slightly confusing. Where does “val_checkpoint_on” come from, and why is it not called “val_loss”? It feels like it might be connected to the result = pl.EvalResult(checkpoint_on=loss) line.
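For reference, this is roughly how I checked which keys the callbacks actually see - a quick debugging sketch, assuming the PL 0.9.x Callback API and trainer.callback_metrics (the LogKeysDebug name is just mine):

from pytorch_lightning.callbacks import Callback

class LogKeysDebug(Callback):
    # Print the metric keys available to callbacks after each validation run,
    # to confirm whether "val_loss" ever reaches callback_metrics.
    def on_validation_end(self, trainer, pl_module):
        print(sorted(trainer.callback_metrics.keys()))

# add LogKeysDebug() to the Trainer's callbacks list to use it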
I was reading the documentation, but frankly I found
checkpoint_on (Union[Tensor, bool, None]) – Metric to checkpoint on.
not very intuitive. What does it mean for a metric to be checkpointed on? And is that really why the keys in the log get renamed in this strange way?
Code sample
https://github.com/matsuokalab/cosmoflow/blob/ac75fe317f8daf3444c96b837bb109064aa81dab/main.py
Expected behavior
Expecting EarlyStopping to work and the log to contain a val_loss key.
Environment
* CUDA:
- GPU:
- Tesla V100-SXM2-16GB
- Tesla V100-SXM2-16GB
- Tesla V100-SXM2-16GB
- Tesla V100-SXM2-16GB
- available: True
- version: 10.2
* Packages:
- numpy: 1.19.1
- pyTorch_debug: False
- pyTorch_version: 1.6.0
- pytorch-lightning: 0.9.0
- tensorboard: 2.2.0
- tqdm: 4.46.1
* System:
- OS: Linux
- architecture:
- 64bit
- ELF
- processor: x86_64
- python: 3.8.2
- version: #1 SMP Fri Apr 20 16:44:24 UTC 2018
Aha, I saw early_stop_on=loss, but thought it was some sort of automated mechanism that works without a callback. It is still a bit counter-intuitive that monitor is not working, but at least the stopping works now, thanks! I guess I can close this, if it is considered “not a bug, but a feature” 😃
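For anyone landing here later, a minimal sketch of the validation_step that made early stopping trigger for me, assuming the PL 0.9.x EvalResult behaviour where early_stop_on is the reserved slot read by the EarlyStopping callback and checkpoint_on the one read by ModelCheckpoint:

def validation_step(self, batch, batch_idx):
    x, y = batch
    y_hat = self(x)
    loss = F.l1_loss(y_hat, y)
    # With EvalResult, the callbacks monitor these reserved slots rather than
    # the logged "val_loss" key, which is why monitor="val_loss" alone had no effect.
    result = pl.EvalResult(early_stop_on=loss, checkpoint_on=loss)
    result.log("val_loss", loss, sync_dist=True)
    return result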