
mlflow training loss not reported until end of run

See original GitHub issue

I think I’m logging correctly; this is my training_step:

    # body of training_step (PyTorch Lightning 0.9.x Result API)
    result = pl.TrainResult(minimize=loss)  # first arg is the loss to minimize
    result.log('loss/train', loss)
    return result

and this is my validation_step:

    # body of validation_step (positional arg is early_stop_on in 0.9.x)
    result = pl.EvalResult(loss)
    result.log('loss/validation', loss)
    return result

The validation loss is updated in mlflow each epoch, but the training loss isn’t displayed until training has finished; then it’s available for every step. This may be an mlflow rather than a pytorch-lightning issue - somewhere along the line the metric seems to be buffered?
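For context, the logger was presumably attached along these lines - a minimal sketch, assuming the stock MLFlowLogger from pytorch_lightning.loggers (the issue doesn’t show the actual Trainer setup, and the experiment name and tracking URI here are placeholders):

    from pytorch_lightning import Trainer
    from pytorch_lightning.loggers import MLFlowLogger

    # placeholder experiment name and local tracking URI
    mlf_logger = MLFlowLogger(
        experiment_name='my-experiment',
        tracking_uri='file:./mlruns',
    )

    # `model` is the LightningModule containing the steps above
    trainer = Trainer(logger=mlf_logger, max_epochs=10)
    trainer.fit(model)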


Versions:

  • pytorch-lightning==0.9.0
  • mlflow==1.11.0

Edit: logging a TrainResult metric with on_epoch=True results in the metric appearing in mlflow during training; it’s only the default per-step train logging that gets delayed, i.e.

    result.log('accuracy/train', acc, on_epoch=True)

is fine.
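A minimal sketch of that workaround applied to the training loss itself, using the same 0.9.x Result API as the snippets above (on_epoch=True adds epoch-level aggregation on top of the default step-level logging):

    # workaround: also log the training loss at epoch granularity
    result = pl.TrainResult(minimize=loss)
    result.log('loss/train', loss, on_epoch=True)
    return result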

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 8 (8 by maintainers)

Top GitHub Comments

1 reaction
david-waterworth commented, Sep 9, 2020

Thanks for the assistance - no, nothing unresolved.

1 reaction
patrickorlando commented, Sep 9, 2020

So I think this is because of the default behaviour of TrainResult and the way row_log_interval works, and it only appears when the number of batches per epoch is less than row_log_interval.

By default TrainResult logs on step and not on epoch. https://github.com/PyTorchLightning/pytorch-lightning/blob/aaf26d70c4658e961192ba4c408558f1cf39bb18/pytorch_lightning/core/step_result.py#L510-L517

When logging only per step, the logger connector only logs when batch_idx is a multiple of row_log_interval. However, if an epoch doesn’t have more than row_log_interval batches, the metrics are not logged. https://github.com/PyTorchLightning/pytorch-lightning/blob/aaf26d70c4658e961192ba4c408558f1cf39bb18/pytorch_lightning/trainer/logger_connector.py#L229-L237
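Paraphrasing the linked check (not a verbatim quote of the source), the gate is roughly:

    # step metrics only reach the logger on every row_log_interval-th batch,
    # or when training is about to stop
    should_log_metrics = (
        (batch_idx + 1) % trainer.row_log_interval == 0
        or trainer.should_stop
    )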

@david-waterworth Do you have fewer than 50 batches per epoch in your model? Can you try setting row_log_interval to less than the number of train batches, to confirm whether the issue is caused by this?
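A minimal sketch of that suggestion, assuming the 0.9.x Trainer argument row_log_interval (the value 10 is only illustrative - choose anything below your batches-per-epoch count):

    from pytorch_lightning import Trainer

    # flush step metrics to the logger every 10 batches instead of the default 50
    trainer = Trainer(logger=mlf_logger, row_log_interval=10)
    trainer.fit(model)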
