ModelCheckpoint does not work with multi_gpu_model
See original GitHub issue

`keras.utils.multi_gpu_model(model, 5)` will not work well with the ModelCheckpoint callback. It throws a "cannot serialize IO object" error. I guess I understand why this might be happening, since multiple copies of the same model span my GPUs, but I am not sure how to fix it.
Any workarounds? It works awesome otherwise.
EDIT: Closing this issue. Saving weights works just fine.
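For reference, the weights-only workaround mentioned in the edit looks roughly like this in Keras (a sketch, not the poster's exact code; the model definition is elided and the file path is illustrative). It works because the multi-GPU replicas share their weights with the single-GPU template model, and weights are plain arrays that serialize fine even when the wrapper model does not:

```python
from keras.models import Model
from keras.utils import multi_gpu_model

base_model = Model(...)                        # your single-GPU model
gpu_model = multi_gpu_model(base_model, gpus=5)
gpu_model.compile(...)
gpu_model.fit(...)

# gpu_model.save(...) raises "cannot serialize IO object", but saving
# the template model's weights works:
base_model.save_weights('weights.h5')
base_model.load_weights('weights.h5')
```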
Issue Analytics
- State:
- Created 6 years ago
- Reactions: 1
- Comments: 5
I solved this problem with the following approach. We need the multi-GPU model for our other callbacks for performance reasons, but we also need the template model for ModelCheckpoint and some other callbacks. For that reason, we made a tiny adapter called AltModelCheckpoint that wraps ModelCheckpoint, with the model to checkpoint specified explicitly.
Installation is easy:

```
pip install alt-model-checkpoint
```

```python
from alt_model_checkpoint import AltModelCheckpoint
from keras.models import Model
from keras.utils import multi_gpu_model

base_model = Model(...)
gpu_model = multi_gpu_model(base_model)
gpu_model.compile(...)
gpu_model.fit(..., callbacks=[AltModelCheckpoint('save/path/for/model.hdf5', base_model)])
```
Enjoy…! 😃
I solved the problem the following way: I changed a few lines in the Keras source code (specifically in topology.py/network.py and callbacks.py). I only modified the code below.
Reminder: you need to replace `saving.save_weights_to_hdf5_group` with `save_weights_to_hdf5_group(f, layers)` if you use an older version of Keras.
network.py:
callbacks.py:

```python
class ModelCheckpoint(Callback):
    """Save the model after every epoch."""
```
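The patched-Keras and AltModelCheckpoint approaches share one idea: checkpoint the single-GPU template model rather than the multi-GPU wrapper that was trained. A minimal, framework-free sketch of that pattern (all class names here are illustrative stand-ins, not the real Keras API):

```python
class BaseModel:
    """Stands in for the single-GPU template model."""
    def __init__(self):
        self.saved_to = None

    def save(self, filepath):
        # A real Keras model would write an HDF5 file here.
        self.saved_to = filepath


class MultiGpuWrapper:
    """Stands in for the multi_gpu_model wrapper, which fails to serialize."""
    def __init__(self, base):
        self.base = base

    def save(self, filepath):
        raise TypeError("cannot serialize IO object")  # mimics the reported error


class AltCheckpoint:
    """Minimal adapter: saves an explicitly specified model, not the trained one."""
    def __init__(self, filepath, model_to_save):
        self.filepath = filepath
        self.model_to_save = model_to_save

    def on_epoch_end(self, epoch):
        self.model_to_save.save(self.filepath)


base = BaseModel()
gpu_model = MultiGpuWrapper(base)

# The key move: hand the checkpoint the base model, not the wrapper.
checkpoint = AltCheckpoint("model.hdf5", base)
checkpoint.on_epoch_end(0)
print(base.saved_to)  # -> model.hdf5
```

Because the replicas share weights with the template, saving the template at the end of each epoch captures everything the training updated.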