
Metrics are logged on each batch, but not on each accum. step

See original GitHub issue

🐛 Bug

I’m not sure that this is a bug, but it’s really unexpected behavior. Metrics are logged on each batch, as we can see here:

https://github.com/PyTorchLightning/pytorch-lightning/blob/c32e3f3ea57dd4439255b809ed5519608a585d73/pytorch_lightning/trainer/training_loop.py#L435-L439

When using the TensorBoard logger, we see the following picture: [Screenshot from 2020-03-17 18-30-48]

These vertical lines are sets of points logged for each individual step. Is there any way to log aggregated metrics for each step, so that only one log point appears per step?

I can set row_log_interval equal to the number of accumulation steps, and then only one logging point will appear, but in that case I won’t aggregate the metrics over the accumulation steps; I’ll only take the last one.
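
For reference, here is a minimal sketch of that workaround, assuming the 0.7.x Trainer arguments accumulate_grad_batches and row_log_interval; it keeps only the last batch’s metrics in each accumulation window rather than an aggregate:

    from pytorch_lightning import Trainer

    ACCUM = 2  # accumulate gradients over 2 batches

    trainer = Trainer(
        accumulate_grad_batches=ACCUM,
        # one logged point per optimizer step, but only the last batch's
        # metrics in each accumulation window are kept -- nothing is averaged
        row_log_interval=ACCUM,
    )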

Environment

  • PyTorch version: 1.4.0
  • OS: Ubuntu 16.04.6 LTS
  • Python version: 3.7

Versions of relevant libraries:

  • [pip] numpy==1.18.1
  • [pip] pytorch-lightning==0.7.1
  • [pip] torch==1.4.0
  • [pip] torchvision==0.5.0

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Comments: 11 (11 by maintainers)

Top GitHub Comments

5 reactions
alexeykarnachev commented, Mar 27, 2020

@jeremyjordan Let me first formalize the problem.

Here are the log_metrics methods for a few pytorch-lightning loggers:

TensorBoardLogger:

    @rank_zero_only
    def log_metrics(self, metrics: Dict[str, float], step: Optional[int] = None) -> None:
        for k, v in metrics.items():
            if isinstance(v, torch.Tensor):
                v = v.item()
            self.experiment.add_scalar(k, v, step)

NeptuneLogger:

    @rank_zero_only
    def log_metrics(
            self,
            metrics: Dict[str, Union[torch.Tensor, float]],
            step: Optional[int] = None
    ) -> None:
        for key, val in metrics.items():
            self.log_metric(key, val, step=step)

MLFlowLogger:

    @rank_zero_only
    def log_metrics(self, metrics: Dict[str, float], step: Optional[int] = None) -> None:
        timestamp_ms = int(time() * 1000)
        for k, v in metrics.items():
            if isinstance(v, str):
                log.warning(f'Discarding metric with string value {k}={v}.')
                continue
            self.experiment.log_metric(self.run_id, k, v, timestamp_ms, step)

CometLogger:

    @rank_zero_only
    def log_metrics(
            self,
            metrics: Dict[str, Union[torch.Tensor, float]],
            step: Optional[int] = None
    ) -> None:
        # Comet.ml expects metrics to be a dictionary of detached tensors on CPU
        for key, val in metrics.items():
            if is_tensor(val):
                metrics[key] = val.cpu().detach()

        self.experiment.log_metrics(metrics, step=step)

As we can see, each logger’s method receives the step argument. This works great when accumulate_grad_batches equals 1. But if we set accumulate_grad_batches to, say, 2, then each step will be logged twice: there will be 2 sets of metrics for step 0, 2 sets of metrics for step 1, and so on.

It happens because the train loop calls the log_metrics routine on each batch: https://github.com/PyTorchLightning/pytorch-lightning/blob/12b39a74b4d5892c03115770b1d0b796ff941b57/pytorch_lightning/trainer/training_loop.py#L445-L449

And when we have accumulate_grad_batches != 1, batch and step are not the same thing.
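
To make the mismatch concrete, here is a small illustration (not library code) of how batch indices map to the step that ends up being passed to log_metrics when accumulate_grad_batches=2:

    # Illustration only: with accumulation over 2 batches, two consecutive
    # batches share the same optimizer step, so the same step is logged twice.
    accumulate_grad_batches = 2

    for batch_idx in range(6):
        step = batch_idx // accumulate_grad_batches
        print(f"batch {batch_idx} -> logged step {step}")

    # batch 0 -> logged step 0
    # batch 1 -> logged step 0   <- step 0 gets two sets of metrics
    # batch 2 -> logged step 1
    # batch 3 -> logged step 1
    # ...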

So, I see here the following solutions:

  1. Perform a step check in each logger’s log_metrics method. If the step is equal to the previous one, don’t log the metrics; instead, aggregate them with the previous ones (sum or weighted sum). It’s a bad solution: it requires fixing all existing loggers, and it would also oblige future logger developers to implement such a check and aggregation themselves.
  2. The same as 1, but with all the logic (the step check and metric aggregation) moved to a superclass. All loggers would inherit from this superclass, and the aggregation would be performed automatically. It’s also a bad solution: it requires a lot of refactoring, and I feel there are a lot of corner cases here.
  3. Don’t touch the loggers; instead, move the train loop’s log_metrics call to the place where the actual optimizer step is performed. This refactoring assumes that the metrics of all accumulated batches are stored somewhere in the train loop, aggregated, and passed to log_metrics after the optimizer step is performed. It’s a slightly better solution than the previous two, but it also involves a fair amount of refactoring and structural changes.
  4. Don’t touch anything. Each user who faces this issue (like me) implements a custom logger for their own project, along the lines of option 1 (see the sketch after this list). Obviously this is the cleanest solution in the sense that it doesn’t touch any PL code, but the global problem remains unsolved: the loggers will keep logging metrics accumulate_grad_batches times for each optimizer step.
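
To illustrate option 4, here is a rough sketch of such a user-side logger. AggregatingTensorBoardLogger is a hypothetical name of mine, not something that exists in PL (import paths as in 0.7.x); it buffers metrics that share a step and writes their mean once the step changes:

    from collections import defaultdict
    from typing import Dict, Optional

    from pytorch_lightning.loggers import TensorBoardLogger


    class AggregatingTensorBoardLogger(TensorBoardLogger):
        """Buffers metrics sharing a step and logs their mean once the step changes."""

        def __init__(self, *args, **kwargs):
            super().__init__(*args, **kwargs)
            self._pending_step: Optional[int] = None
            self._pending: Dict[str, list] = defaultdict(list)

        def log_metrics(self, metrics: Dict[str, float], step: Optional[int] = None) -> None:
            # A new step arrived: flush whatever was accumulated for the previous one.
            if self._pending and step != self._pending_step:
                self._flush()
            self._pending_step = step
            for k, v in metrics.items():
                self._pending[k].append(float(v))

        def _flush(self) -> None:
            aggregated = {k: sum(vs) / len(vs) for k, vs in self._pending.items()}
            super().log_metrics(aggregated, step=self._pending_step)
            self._pending.clear()

        def finalize(self, status: str) -> None:
            if self._pending:
                self._flush()  # don't lose the last accumulated step
            super().finalize(status)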

What do you think? Maybe you have other options? Or maybe I’ve totally missed something and there is already a trick implemented in PL that solves this issue?

2 reactions
jeremyjordan commented, Mar 27, 2020

Yeah, we should handle that case better. Would you be interested in drafting a PR to address this?

Read more comments on GitHub >
