ModelCheckpoint not saving best version due to issue with opening h5py file
Having checked that everything is as it should be (the latest versions of both Keras and TensorFlow are installed), I have found that running a model with a ModelCheckpoint callback that saves the best model so far causes an issue with serialisation of the model.
Here’s a script which, when run, shows the issue.
The output during imports and initialisation of the Tensorflow backend is:
2018-10-02 12:34:47.868073: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: GeForce GTX 1080 major: 6 minor: 1 memoryClockRate(GHz): 1.7335
pciBusID: 0000:03:00.0
totalMemory: 7.93GiB freeMemory: 7.09GiB
2018-10-02 12:34:47.868102: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2018-10-02 12:34:48.075527: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-10-02 12:34:48.075556: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2018-10-02 12:34:48.075562: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2018-10-02 12:34:48.075728: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6837 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080, pci bus id: 0000:03:00.0, compute capability: 6.1)
Using TensorFlow backend.
2018-10-02 12:34:51.635814: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2018-10-02 12:34:51.635853: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-10-02 12:34:51.635859: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2018-10-02 12:34:51.635863: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2018-10-02 12:34:51.636042: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6837 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080, pci bus id: 0000:03:00.0, compute capability: 6.1)
The full error traceback is:
Traceback (most recent call last):
File "selfcontained.py", line 107, in <module>
print("75th percentile of test predictions is: {:.2e}".format(main(**CNN_params)))
File "selfcontained.py", line 92, in main
raise e
File "selfcontained.py", line 76, in main
shuffle=True, verbose=0, callbacks=[early_stopping_cb, model_saver_cb, test_csv_cb])
File "/home/persephone/anaconda3/lib/python3.6/site-packages/keras/engine/training.py", line 1039, in fit
validation_steps=validation_steps)
File "/home/persephone/anaconda3/lib/python3.6/site-packages/keras/engine/training_arrays.py", line 217, in fit_loop
callbacks.on_epoch_end(epoch, epoch_logs)
File "/home/persephone/anaconda3/lib/python3.6/site-packages/keras/callbacks.py", line 79, in on_epoch_end
callback.on_epoch_end(epoch, logs)
File "/home/persephone/anaconda3/lib/python3.6/site-packages/keras/callbacks.py", line 446, in on_epoch_end
self.model.save(filepath, overwrite=True)
File "/home/persephone/anaconda3/lib/python3.6/site-packages/keras/engine/network.py", line 1090, in save
save_model(self, filepath, overwrite, include_optimizer)
File "/home/persephone/anaconda3/lib/python3.6/site-packages/keras/engine/saving.py", line 382, in save_model
_serialize_model(model, f, include_optimizer)
File "/home/persephone/anaconda3/lib/python3.6/site-packages/keras/engine/saving.py", line 78, in _serialize_model
f['keras_version'] = str(keras_version).encode('utf8')
File "/home/persephone/anaconda3/lib/python3.6/site-packages/keras/utils/io_utils.py", line 214, in __setitem__
'Group with name "{}" exists.'.format(attr))
KeyError: 'Cannot set attribute. Group with name "keras_version" exists.'
The problem seems to arise from the fact that the mode flag for opening an h5py file is not propagated through the h5dict class in keras/utils/io_utils.py: the underlying h5py file is opened with its default mode, so the existing file is not truncated, and attributes written during a previous save (such as keras_version) are still present when the next save attempts to set them again.
The solution is simple (unless I am missing a key aspect of file management when it comes to serialisation): line 186 in keras/utils/io_utils.py needs to be changed from
185 elif isinstance(path, str):
>>> 186 self.data = h5py.File(path,)
187 self._is_file = True
to
185 elif isinstance(path, str):
>>> 186 self.data = h5py.File(path, mode)
187 self._is_file = True
Doing this propagates the mode parameter from the h5dict init call through to the underlying h5py.File object.
As I’m not sure what the best way to submit a code patch is, I thought it would be best to create an issue outlining the problem and a potential solution.
Issue created 5 years ago · 5 reactions · 15 comments (6 by maintainers)
Thanks @Microno95 for the feedback!
I can confirm that the bug has been fixed in Keras 2.2.4. I tested the script that I posted initially, and it no longer produces an error.