ModelCheckpoint not saving best version due to issue with opening h5py file

With the latest versions of Keras and TensorFlow both installed, I have found that training a model with a ModelCheckpoint callback that saves the best model so far fails while the model is being serialised.

Here’s a script which, when run, shows the issue.

The output during imports and initialisation of the Tensorflow backend is:

2018-10-02 12:34:47.868073: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties: 
name: GeForce GTX 1080 major: 6 minor: 1 memoryClockRate(GHz): 1.7335
pciBusID: 0000:03:00.0
totalMemory: 7.93GiB freeMemory: 7.09GiB
2018-10-02 12:34:47.868102: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2018-10-02 12:34:48.075527: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-10-02 12:34:48.075556: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0 
2018-10-02 12:34:48.075562: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N 
2018-10-02 12:34:48.075728: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6837 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080, pci bus id: 0000:03:00.0, compute capability: 6.1)
Using TensorFlow backend.
2018-10-02 12:34:51.635814: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2018-10-02 12:34:51.635853: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-10-02 12:34:51.635859: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0 
2018-10-02 12:34:51.635863: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N 
2018-10-02 12:34:51.636042: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6837 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080, pci bus id: 0000:03:00.0, compute capability: 6.1)

The full error traceback is:

Traceback (most recent call last):
  File "selfcontained.py", line 107, in <module>
    print("75th percentile of test predictions is: {:.2e}".format(main(**CNN_params)))
  File "selfcontained.py", line 92, in main
    raise e
  File "selfcontained.py", line 76, in main
    shuffle=True, verbose=0, callbacks=[early_stopping_cb, model_saver_cb, test_csv_cb])
  File "/home/persephone/anaconda3/lib/python3.6/site-packages/keras/engine/training.py", line 1039, in fit
    validation_steps=validation_steps)
  File "/home/persephone/anaconda3/lib/python3.6/site-packages/keras/engine/training_arrays.py", line 217, in fit_loop
    callbacks.on_epoch_end(epoch, epoch_logs)
  File "/home/persephone/anaconda3/lib/python3.6/site-packages/keras/callbacks.py", line 79, in on_epoch_end
    callback.on_epoch_end(epoch, logs)
  File "/home/persephone/anaconda3/lib/python3.6/site-packages/keras/callbacks.py", line 446, in on_epoch_end
    self.model.save(filepath, overwrite=True)
  File "/home/persephone/anaconda3/lib/python3.6/site-packages/keras/engine/network.py", line 1090, in save
    save_model(self, filepath, overwrite, include_optimizer)
  File "/home/persephone/anaconda3/lib/python3.6/site-packages/keras/engine/saving.py", line 382, in save_model
    _serialize_model(model, f, include_optimizer)
  File "/home/persephone/anaconda3/lib/python3.6/site-packages/keras/engine/saving.py", line 78, in _serialize_model
    f['keras_version'] = str(keras_version).encode('utf8')
  File "/home/persephone/anaconda3/lib/python3.6/site-packages/keras/utils/io_utils.py", line 214, in __setitem__
    'Group with name "{}" exists.'.format(attr))
KeyError: 'Cannot set attribute. Group with name "keras_version" exists.'

The problem seems to be that the mode flag is not propagated through the h5dict class in keras/utils/io_utils.py when the file is opened. The h5py file is therefore opened with the default flags, which prevent an existing file from being overwritten.

The solution seems simple (unless I am missing a key aspect of file management when it comes to serialisation): line 186 in keras/utils/io_utils.py needs to be changed from

185        elif isinstance(path, str):
>>> 186            self.data = h5py.File(path,)
187            self._is_file = True

to

185        elif isinstance(path, str):
>>> 186            self.data = h5py.File(path,mode)
187            self._is_file = True

Doing this propagates the mode parameter in the init call to the underlying h5py.File object.
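
The failure mode and the effect of the change can be illustrated without Keras or h5py. The sketch below is only a simulation: `FakeH5File`, `H5DictSketch`, `save_checkpoint`, `second_save_fails` and the `propagate_mode` switch are hypothetical names invented for this example, and it assumes that the default open mode keeps existing contents (append-like, as in the h5py 2.x releases of the time) and that the save routine asks for overwrite with mode 'w'.

```python
# In-memory stand-in for files on disk: maps a path to the keys stored in it.
_DISK = {}


class FakeH5File:
    """Hypothetical stand-in for h5py.File; only modes 'a' and 'w' are modelled."""

    def __init__(self, path, mode="a"):
        if mode == "w":
            _DISK[path] = {}            # 'w' truncates any existing file
        elif mode == "a":
            _DISK.setdefault(path, {})  # 'a' keeps existing contents
        else:
            raise ValueError("unsupported mode in this sketch: {!r}".format(mode))
        self.store = _DISK[path]


class H5DictSketch:
    """Simplified wrapper in the spirit of h5dict (not the Keras class itself)."""

    def __init__(self, path, mode="a", propagate_mode=True):
        # propagate_mode=False mimics the reported line 186: the mode is dropped.
        self.data = FakeH5File(path, mode) if propagate_mode else FakeH5File(path)

    def __setitem__(self, attr, val):
        if attr in self.data.store:
            raise KeyError('Cannot set attribute. '
                           'Group with name "{}" exists.'.format(attr))
        self.data.store[attr] = val


def save_checkpoint(path, propagate_mode):
    """Mimic one checkpoint save that requests overwrite via mode='w'."""
    f = H5DictSketch(path, mode="w", propagate_mode=propagate_mode)
    f["keras_version"] = b"placeholder"


def second_save_fails(propagate_mode):
    """Save twice to the same path; report whether the second save raises."""
    path = "best_{}.h5".format(propagate_mode)
    _DISK.pop(path, None)                       # start from a clean slate
    save_checkpoint(path, propagate_mode)       # first epoch: file is created
    try:
        save_checkpoint(path, propagate_mode)   # later epoch: same path again
    except KeyError:
        return True
    return False


print(second_save_fails(propagate_mode=False))  # mode dropped, as reported
print(second_save_fails(propagate_mode=True))   # mode passed through
```

In this simulation the second save raises only when the mode is dropped, which matches the traceback above: the first checkpoint is written successfully, and the error appears when a later epoch tries to overwrite it.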

As I’m not sure what the best way to submit a code patch is, I thought it would be best to create an issue outlining the problem and a potential solution.

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Reactions: 5
  • Comments: 15 (6 by maintainers)

Top GitHub Comments

2 reactions
gabrieldemarmiesse commented, Oct 13, 2018

Thanks @Microno95 for the feedback!

2 reactions
Microno95 commented, Oct 12, 2018

I can confirm that the bug has been fixed in Keras 2.2.4. I tested the script that I posted initially, and it no longer produces an error.
