Synchronize model checkpointing and servable model saving
Currently, TrainableModel.save_servable() is called by the user at the end of the training loop. This is problematic because we may end up saving an overfitted state of the model even if we are monitoring an evaluation metric with pl.callbacks.ModelCheckpoint. So we need to come up with a way to synchronize the two.
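To make the problem concrete, here is a minimal sketch of the usage pattern described above. It assumes a concrete TrainableModel subclass, data loaders, a logged validation metric name, and the keyword arguments of Quaterion.fit; these names are placeholders rather than anything taken from the original issue.

```python
# Sketch of the current pattern: ModelCheckpoint tracks the best epoch on disk,
# but save_servable() exports whatever state the model ends training with.
import pytorch_lightning as pl
from quaterion import Quaterion

checkpoint_callback = pl.callbacks.ModelCheckpoint(
    monitor="validation_loss",  # placeholder: whichever metric the model logs
    mode="min",
    save_top_k=1,
)
trainer = pl.Trainer(max_epochs=50, callbacks=[checkpoint_callback])

model = MyTrainableModel()  # placeholder: a concrete TrainableModel subclass
Quaterion.fit(
    trainable_model=model,
    trainer=trainer,
    train_dataloader=train_dataloader,  # placeholder DataLoader
    val_dataloader=val_dataloader,      # placeholder DataLoader
)

# The servable model is exported from the *final* in-memory state of `model`,
# not from the epoch that ModelCheckpoint considered best.
model.save_servable("servable_model")
```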
Possible solution
- We may need to subclass ModelCheckpoint inside quaterion for synchronization (a sketch of this idea follows the list).
- We may accept additional keyword arguments in Quaterion.fit to automatically save servable checkpoints to a specified directory at a specified interval.
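As an illustration of the first idea (not Quaterion's actual implementation), a plain Lightning callback can watch the same metric as ModelCheckpoint and re-export the servable model whenever it improves. The class name, metric handling, and save path below are assumptions; the sketch only relies on the standard pl.Callback.on_validation_end hook and trainer.callback_metrics.

```python
# Minimal sketch: save a servable model each time the monitored metric improves.
import math

import pytorch_lightning as pl


class ServableCheckpoint(pl.Callback):
    """Keep a servable copy of the model in sync with the monitored metric."""

    def __init__(self, monitor: str, dirpath: str, mode: str = "min"):
        self.monitor = monitor
        self.dirpath = dirpath
        self.mode = mode
        self.best = math.inf if mode == "min" else -math.inf

    def on_validation_end(self, trainer, pl_module):
        if trainer.sanity_checking:
            return  # skip the pre-training sanity check pass
        current = trainer.callback_metrics.get(self.monitor)
        if current is None:
            return  # monitored metric was not logged this epoch
        current = float(current)
        improved = current < self.best if self.mode == "min" else current > self.best
        if improved:
            self.best = current
            # pl_module is the TrainableModel, so it can export the servable model
            pl_module.save_servable(self.dirpath)
```

Such a callback could either be passed by the user via pl.Trainer(callbacks=[...]) or, following the second idea, be constructed automatically inside Quaterion.fit from additional keyword arguments.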
We don’t need it. It’s fine to handle this in user code around TrainableModel.save_servable. After giving it some thought, I decided that this functionality is not something our framework should be responsible for. Checkpointing is an optional feature of PyTorch Lightning, and additional logic on top of it, which has nothing to do with similarity learning, might complicate the automation in an unpredictable way.
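For users who want the two to stay in sync, a small amount of glue code after training is enough. The following is an illustrative sketch with assumed names, continuing the earlier placeholder example (checkpoint_callback, model); it relies on ModelCheckpoint.best_model_path and the "state_dict" key that Lightning checkpoints use.

```python
# Restore the best checkpoint tracked by ModelCheckpoint before exporting
# the servable model, so the served weights match the best validation score.
import torch

best_path = checkpoint_callback.best_model_path  # filled in during fit
if best_path:
    # On newer PyTorch you may need weights_only=False for Lightning checkpoints.
    state = torch.load(best_path, map_location="cpu")
    model.load_state_dict(state["state_dict"])

model.save_servable("servable_model")
```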