
Model Checkpoint Does not work with multi-gpu-model

See original GitHub issue

keras.utils.multi_gpu_model(model, 5) does not work with the ModelCheckpoint callback: it throws a "cannot serialize IO object" error. I think I understand why this is happening, since multiple copies of the same model are spread across my GPUs, but I am not sure how to fix it.

Any workarounds? It works awesome otherwise.

EDIT: Closing this issue. Saving weights works just fine.
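
For reference, one way to apply that weights-only workaround (a minimal sketch, not taken from the thread; the model, data, and file names are placeholders) is to checkpoint the template model's weights from a callback, since the template and the multi_gpu_model wrapper share the same weight tensors:

import numpy as np
from keras.callbacks import LambdaCallback
from keras.layers import Dense, Input
from keras.models import Model
from keras.utils import multi_gpu_model

# Placeholder single-GPU template model and dummy data.
inputs = Input(shape=(100,))
outputs = Dense(10, activation='softmax')(inputs)
base_model = Model(inputs, outputs)
x_train = np.random.rand(32, 100)
y_train = np.random.rand(32, 10)

gpu_model = multi_gpu_model(base_model, gpus=5)
gpu_model.compile(optimizer='adam', loss='categorical_crossentropy')

# Save the template's weights each epoch instead of serializing the
# parallel wrapper, which is what triggers the error above.
save_base = LambdaCallback(
    on_epoch_end=lambda epoch, logs: base_model.save_weights(
        'weights.%02d.hdf5' % (epoch + 1)))

gpu_model.fit(x_train, y_train, epochs=3, callbacks=[save_base])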

Issue Analytics

  • State: closed
  • Created: 6 years ago
  • Reactions: 1
  • Comments: 5

Top GitHub Comments

3 reactions
bordeprashant commented, Apr 25, 2019

I solved this problem with the following approach. We need to use the multi-GPU model in our other callbacks for performance reasons, but we also need the template model for ModelCheckpoint and some other callbacks. For that reason, we made a tiny adapter called AltModelCheckpoint that wraps ModelCheckpoint, with the checkpointed model specified explicitly.

Installation is easy: pip install alt-model-checkpoint

from alt_model_checkpoint import AltModelCheckpoint
from keras.models import Model
from keras.utils import multi_gpu_model

base_model = Model(...)
gpu_model = multi_gpu_model(base_model)
gpu_model.compile(...)
gpu_model.fit(..., callbacks=[AltModelCheckpoint('save/path/for/model.hdf5', base_model)])

Enjoy…! 😃
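
As a side note (not from the comment above): because AltModelCheckpoint writes the template model rather than the parallel wrapper, the resulting file can be restored on a single GPU without any multi-GPU plumbing. A rough sketch, assuming its default full-model save mode:

from keras.models import load_model

# Load the template model that AltModelCheckpoint saved; it can be used
# directly for inference, or re-wrapped with multi_gpu_model to resume training.
restored = load_model('save/path/for/model.hdf5')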

1 reaction
oeminaga commented, Jan 20, 2019

I solved the problem the following way: I changed a few lines in the Keras source code (specifically in topology.py/network.py and callbacks.py). The modified code is below.

Reminder: you need to replace `saving.save_weights_to_hdf5_group` with `save_weights_to_hdf5_group(f, layers)` if you use an older version of Keras.
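
In other words (my illustration, not part of the original comment), the only line that changes between Keras versions is the helper call inside save_weights:

# Newer Keras (the helper lives in keras/engine/saving.py):
saving.save_weights_to_hdf5_group(f, layers)

# Older Keras (the helper is defined in topology.py itself):
save_weights_to_hdf5_group(f, layers)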

network.py:

def save_weights(self, filepath, overwrite=True, multiple_gpu=False, name_of_model=""):
    """Dumps all layer weights to a HDF5 file.

    `name_of_model` is usually `model_1`; you can check the name of the
    template model by calling `summary()` on the wrapper returned by
    `multi_gpu_model`.

    The weight file has:
        - `layer_names` (attribute), a list of strings
            (ordered names of model layers).
        - For every layer, a `group` named `layer.name`
            - For every such layer group, a group attribute `weight_names`,
                a list of strings
                (ordered names of weights tensor of the layer).
            - For every weight in the layer, a dataset
                storing the weight value, named after the weight tensor.

    # Arguments
        filepath: String, path to the file to save the weights to.
        overwrite: Whether to silently overwrite any existing file at the
            target location, or provide the user with a manual prompt.
        multiple_gpu: Whether the weights are being saved from a
            `multi_gpu_model` wrapper.
        name_of_model: Name of the layer inside the wrapper that holds the
            template model (e.g. `model_1`); only used if `multiple_gpu` is True.

    # Raises
        ImportError: If h5py is not available.
    """
    if h5py is None:
        raise ImportError('`save_weights` requires h5py.')
    # If file exists and should not be overwritten:
    if not overwrite and os.path.isfile(filepath):
        proceed = ask_to_proceed_with_overwrite(filepath)
        if not proceed:
            return
    with h5py.File(filepath, 'w') as f:
        if multiple_gpu and name_of_model:
            # Save the weights of the template model that is wrapped inside
            # the multi-GPU model, rather than the wrapper's own layers.
            layers = self.get_layer(name_of_model).layers
            saving.save_weights_to_hdf5_group(f, layers)
        else:
            saving.save_weights_to_hdf5_group(f, self.layers)
        f.flush()

callbacks.py:

class ModelCheckpoint(Callback):
    """Save the model after every epoch.

    `filepath` can contain named formatting options,
    which will be filled with the values of `epoch` and
    keys in `logs` (passed in `on_epoch_end`).

    For example: if `filepath` is `weights.{epoch:02d}-{val_loss:.2f}.hdf5`,
    then the model checkpoints will be saved with the epoch number and
    the validation loss in the filename.

    # Arguments
        filepath: string, path to save the model file.
        monitor: quantity to monitor.
        verbose: verbosity mode, 0 or 1.
        save_best_only: if `save_best_only=True`,
            the latest best model according to
            the quantity monitored will not be overwritten.
        mode: one of {auto, min, max}.
            If `save_best_only=True`, the decision
            to overwrite the current save file is made
            based on either the maximization or the
            minimization of the monitored quantity. For `val_acc`,
            this should be `max`, for `val_loss` this should
            be `min`, etc. In `auto` mode, the direction is
            automatically inferred from the name of the monitored quantity.
        save_weights_only: if True, then only the model's weights will be
            saved (`model.save_weights(filepath)`), else the full model
            is saved (`model.save(filepath)`).
        period: Interval (number of epochs) between checkpoints.
        multiple_gpu: if True, weights are saved through the patched
            `save_weights` above, using the template model's layers.
        name_of_model: name of the layer holding the template model inside
            the `multi_gpu_model` wrapper (e.g. `model_1`).
    """

    def __init__(self, filepath, monitor='val_loss', verbose=0,
                 save_best_only=False, save_weights_only=False,
                 mode='auto', period=1,
                 multiple_gpu=False, name_of_model=None):
        super(ModelCheckpoint, self).__init__()
        self.monitor = monitor
        self.verbose = verbose
        self.filepath = filepath
        self.save_best_only = save_best_only
        self.save_weights_only = save_weights_only
        self.period = period
        self.epochs_since_last_save = 0
        # New attributes: whether we are checkpointing a multi_gpu_model
        # wrapper, and which of its layers holds the template model.
        self.multi_gpu_mode = multiple_gpu
        self.name_of_model = name_of_model

        if mode not in ['auto', 'min', 'max']:
            warnings.warn('ModelCheckpoint mode %s is unknown, '
                          'fallback to auto mode.' % (mode),
                          RuntimeWarning)
            mode = 'auto'

        if mode == 'min':
            self.monitor_op = np.less
            self.best = np.Inf
        elif mode == 'max':
            self.monitor_op = np.greater
            self.best = -np.Inf
        else:
            if 'acc' in self.monitor or self.monitor.startswith('fmeasure'):
                self.monitor_op = np.greater
                self.best = -np.Inf
            else:
                self.monitor_op = np.less
                self.best = np.Inf

    def on_epoch_end(self, epoch, logs=None):
        logs = logs or {}
        self.epochs_since_last_save += 1
        if self.epochs_since_last_save >= self.period:
            self.epochs_since_last_save = 0
            filepath = self.filepath.format(epoch=epoch + 1, **logs)
            if self.save_best_only:
                current = logs.get(self.monitor)
                if current is None:
                    warnings.warn('Can save best model only with %s available, '
                                  'skipping.' % (self.monitor), RuntimeWarning)
                else:
                    if self.monitor_op(current, self.best):
                        if self.verbose > 0:
                            print('\nEpoch %05d: %s improved from %0.5f to %0.5f,'
                                  ' saving model to %s'
                                  % (epoch + 1, self.monitor, self.best,
                                     current, filepath))
                        self.best = current
                        if self.save_weights_only:
                            # Route through the patched save_weights so that only
                            # the template model's weights are written.
                            self.model.save_weights(filepath, overwrite=True, multiple_gpu=self.multi_gpu_mode, name_of_model=self.name_of_model)
                        else:
                            self.model.save(filepath, overwrite=True)
                    else:
                        if self.verbose > 0:
                            print('\nEpoch %05d: %s did not improve from %0.5f' %
                                  (epoch + 1, self.monitor, self.best))
            else:
                if self.verbose > 0:
                    print('\nEpoch %05d: saving model to %s' % (epoch + 1, filepath))
                if self.save_weights_only:
                    # Same change here so the non-best-only path also works
                    # with multi-GPU models.
                    self.model.save_weights(filepath, overwrite=True, multiple_gpu=self.multi_gpu_mode, name_of_model=self.name_of_model)
                else:
                    self.model.save(filepath, overwrite=True)
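
Putting it together (again a sketch with placeholder names: the monitored metric, file path, and the 'model_1' layer name are assumptions), the patched callback would be wired up like this. Note that save_weights_only=True is what routes saving through the patched path; a full model.save would still hit the original serialization error.

from keras.callbacks import ModelCheckpoint

checkpoint = ModelCheckpoint(
    'weights.{epoch:02d}-{val_loss:.2f}.hdf5',
    monitor='val_loss',
    save_best_only=True,
    save_weights_only=True,   # required: uses the patched save_weights
    multiple_gpu=True,
    name_of_model='model_1')

# gpu_model and the dummy data are the placeholders from the earlier sketches.
gpu_model.fit(x_train, y_train,
              validation_split=0.2,
              epochs=5,
              callbacks=[checkpoint])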
