
Synchronize model checkpointing and servable model saving

See original GitHub issue

Currently, TrainableModel.save_servable() is called by the user at the end of the training loop. This is problematic because we may end up saving an overfitted state of the model, even if we are monitoring an evaluation metric with pl.callbacks.ModelCheckpoint. So we need to come up with a way to synchronize the two.

Possible solution

  • We may need to subclass ModelCheckpoint inside quaterion for synchronization (see the sketch after this list).
  • We may accept additional keyword arguments in Quaterion.fit to automatically save a servable checkpoint to a specified directory at a specified interval.
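
A minimal sketch of the first option, assuming PyTorch Lightning's documented on_validation_end hook and best_model_path attribute, and assuming TrainableModel.save_servable accepts a target directory; the class name ServableModelCheckpoint and the servable_dir argument are hypothetical:

import os

from pytorch_lightning.callbacks import ModelCheckpoint


class ServableModelCheckpoint(ModelCheckpoint):
    def __init__(self, servable_dir, **kwargs):
        super().__init__(**kwargs)
        self.servable_dir = servable_dir
        self._exported_for = None  # best checkpoint path we already exported

    def on_validation_end(self, trainer, pl_module):
        # Run the regular checkpointing logic first.
        super().on_validation_end(trainer, pl_module)
        # If a new best checkpoint was just written, the current weights are
        # exactly the ones it contains, so export them as a servable model.
        if self.best_model_path and self.best_model_path != self._exported_for:
            pl_module.save_servable(os.path.join(self.servable_dir, "best"))
            self._exported_for = self.best_model_path

The user (or Quaterion.fit itself, under the second option) would then only need to add this callback to the Trainer's callbacks list.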

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 13 (13 by maintainers)

Top GitHub Comments

1 reaction
monatis commented, Feb 16, 2022

TrainableModel.load_from_checkpoint

We don’t need it. It’s enough to do the following inside TrainableModel.save_servable:

def save_servable(...):
    # ...
    self.load_state_dict(torch.load(checkpoint_path)["state_dict"])
    # we restored the state from the given checkpoint, now save it as servable
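
A hypothetical call site for that variant, assuming save_servable gains a checkpoint_path keyword as in the snippet above and that its first argument is the target path; the Trainer's checkpoint_callback property is PyTorch Lightning's handle to the configured ModelCheckpoint:

model.save_servable(
    "servable_model",  # target directory (placeholder)
    checkpoint_path=trainer.checkpoint_callback.best_model_path,  # best, not last, weights
)
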
0 reactions
generall commented, May 20, 2022

After giving it some thought, I decided that this functionality is not something our framework should be responsible for. Checkpointing is an optional parameter of PyTorch Lightning, and adding logic on top of it that has nothing to do with similarity learning might complicate the automation in unpredictable ways.
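
For completeness, a minimal user-side sketch of the pattern this decision implies: keep ModelCheckpoint in your own Trainer setup, restore the best checkpoint after training, and only then export the servable model. The Quaterion.fit argument order, the monitored metric name, and save_servable(path) are assumptions; model, the dataloaders, and the output directory are placeholders.

import torch
import pytorch_lightning as pl
from quaterion import Quaterion


def fit_and_export(model, train_dataloader, val_dataloader, servable_dir="servable_model"):
    # Track the best weights according to a monitored metric (name is an assumption).
    checkpoint_cb = pl.callbacks.ModelCheckpoint(monitor="validation_loss", mode="min")
    trainer = pl.Trainer(max_epochs=10, callbacks=[checkpoint_cb])

    Quaterion.fit(model, trainer, train_dataloader, val_dataloader)

    # Restore the best checkpointed weights so the exported model is not the
    # possibly overfitted last-epoch state, then save it as servable.
    state_dict = torch.load(checkpoint_cb.best_model_path)["state_dict"]
    model.load_state_dict(state_dict)
    model.save_servable(servable_dir)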

Read more comments on GitHub.

Top Results From Across the Web

  • Save and load models | TensorFlow Core: Model progress can be saved during and after training. This means a model can resume where it left off and avoid long training...
  • Checkpointing Deep Learning Models in Keras: Steps for saving and loading model and weights using checkpoint · Create the model · Specify the path where we want to save...
  • Infrastructure Design for Real-time Machine Learning Inference: Serve and reload the real-time inference model in a way that synchronizes the served model with online feature stores while minimizing (and ...
  • How to deploy TensorFlow models to production using TF ...: The TensorFlow Saver provides functionalities to save/restore the model's checkpoint files to/from disk. In fact, SavedModel wraps the ...
  • Large scale distributed neural network: ... provides no benefit for synchronous or asynchronous stochastic gradient descent ... ensemble of models into a single still-servable model using a two-phase ...

github_iconTop Related Medium Post

No results found

github_iconTop Related StackOverflow Question

No results found

github_iconTroubleshoot Live Code

Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start Free

github_iconTop Related Reddit Thread

No results found

github_iconTop Related Hackernoon Post

No results found

github_iconTop Related Tweet

No results found

github_iconTop Related Dev.to Post

No results found

github_iconTop Related Hashnode Post

No results found