
Synchronize model checkpointing and servable model saving

See original GitHub issue

Currently, TrainableModel.save_servable() is called by the user at the end of the training loop. This is problematic because we may end up saving an overfitted state of the model, even if we are monitoring an evaluation metric with pl.callbacks.ModelCheckpoint. So we need to come up with a way to synchronize the two.

Possible solution

  • We may need to subclass ModelCheckpoint inside quaterion for synchronization (see the sketch after this list).
  • We may accept additional keyword arguments in Quaterion.fit to automatically save a servable checkpoint to a specified directory at a specified interval.
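
A minimal sketch of the first option, assuming PyTorch Lightning's documented on_validation_end hook and best_model_path attribute, and assuming TrainableModel.save_servable accepts a target directory; the class name ServableModelCheckpoint and the servable_dir argument are hypothetical:

import os

from pytorch_lightning.callbacks import ModelCheckpoint


class ServableModelCheckpoint(ModelCheckpoint):
    def __init__(self, servable_dir, **kwargs):
        super().__init__(**kwargs)
        self.servable_dir = servable_dir
        self._exported_for = None  # best checkpoint path we already exported

    def on_validation_end(self, trainer, pl_module):
        # Run the regular checkpointing logic first.
        super().on_validation_end(trainer, pl_module)
        # If a new best checkpoint was just written, the current weights are
        # exactly the ones it contains, so export them as a servable model.
        if self.best_model_path and self.best_model_path != self._exported_for:
            pl_module.save_servable(os.path.join(self.servable_dir, "best"))
            self._exported_for = self.best_model_path

The user (or Quaterion.fit itself, under the second option) would then only need to add this callback to the Trainer's callbacks list.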

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 13 (13 by maintainers)

Top GitHub Comments

1 reaction
monatis commented, Feb 16, 2022

TrainableModel.load_from_checkpoint

We don’t need it. It’s enough to do the following inside TrainableModel.save_servable:

def save_servable(...):
    # ...
    self.load_state_dict(torch.load(checkpoint_path)["state_dict"])
    # we restored the state from the given checkpoint, now save it as servable
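
A hypothetical call site for that variant, assuming save_servable gains a checkpoint_path keyword as in the snippet above and that its first argument is the target path; the Trainer's checkpoint_callback property is PyTorch Lightning's handle to the configured ModelCheckpoint:

model.save_servable(
    "servable_model",  # target directory (placeholder)
    checkpoint_path=trainer.checkpoint_callback.best_model_path,  # best, not last, weights
)
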
0 reactions
generall commented, May 20, 2022

After giving it some thought, I decided that this functionality is not something our framework should be responsible for. Checkpointing is an optional parameter of PyTorch Lightning, and adding logic on top of it that has nothing to do with similarity learning might complicate the automation in unpredictable ways.
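
For completeness, a minimal user-side sketch of the pattern this decision implies: keep ModelCheckpoint in your own Trainer setup, restore the best checkpoint after training, and only then export the servable model. The Quaterion.fit argument order, the monitored metric name, and save_servable(path) are assumptions; model, the dataloaders, and the output directory are placeholders.

import torch
import pytorch_lightning as pl
from quaterion import Quaterion


def fit_and_export(model, train_dataloader, val_dataloader, servable_dir="servable_model"):
    # Track the best weights according to a monitored metric (name is an assumption).
    checkpoint_cb = pl.callbacks.ModelCheckpoint(monitor="validation_loss", mode="min")
    trainer = pl.Trainer(max_epochs=10, callbacks=[checkpoint_cb])

    Quaterion.fit(model, trainer, train_dataloader, val_dataloader)

    # Restore the best checkpointed weights so the exported model is not the
    # possibly overfitted last-epoch state, then save it as servable.
    state_dict = torch.load(checkpoint_cb.best_model_path)["state_dict"]
    model.load_state_dict(state_dict)
    model.save_servable(servable_dir)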

Read more comments on GitHub.

Top Results From Across the Web

  • Save and load models | TensorFlow Core: Model progress can be saved during and after training. This means a model can resume where it left off and avoid long training...
  • Checkpointing Deep Learning Models in Keras: Steps for saving and loading model and weights using checkpoint · Create the model · Specify the path where we want to save...
  • Infrastructure Design for Real-time Machine Learning Inference: Serve and reload the real-time inference model in a way that synchronizes the served model with online feature stores while minimizing (and ...
  • How to deploy TensorFlow models to production using TF ...: The TensorFlow Saver provides functionalities to save/restore the model's checkpoint files to/from disk. In fact, SavedModel wraps the ...
  • Large scale distributed neural network: ... provides no benefit for synchronous or asynchronous stochastic gradient descent ... ensemble of models into a single still-servable model using a two-phase ...

github_iconTop Related Medium Post

No results found

github_iconTop Related StackOverflow Question

No results found

github_iconTroubleshoot Live Code

Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start Free

github_iconTop Related Reddit Thread

No results found

github_iconTop Related Hackernoon Post

No results found

github_iconTop Related Tweet

No results found

github_iconTop Related Dev.to Post

No results found

github_iconTop Related Hashnode Post

No results found