model.save() throws an OSError: Unable to create file (error message = 'resource temporarily unavailable')
I built a Keras model, and because my dataset is sometimes too large to fit into memory, a MemoryError was thrown. After some searching, I figured out that I need to implement a generator class inherited from keras.utils.Sequence so that I can use model.fit_generator and model.predict_generator.
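For readers unfamiliar with the pattern, here is a minimal sketch of such a generator class. It is not the code from this issue: the class name BatchSequence and its arguments are hypothetical, and the fallback to a plain object base class exists only so the sketch runs where Keras is not installed. Real Keras training expects each batch as NumPy arrays; plain lists are used here to keep the example dependency-free.

```python
import math

try:
    from keras.utils import Sequence
except Exception:
    # Assumption: fall back to a plain base class so the sketch runs without Keras.
    Sequence = object


class BatchSequence(Sequence):
    """Hypothetical Sequence that serves an indexable dataset one batch at a time."""

    def __init__(self, features, targets, batch_size):
        super().__init__()
        self.features = features
        self.targets = targets
        self.batch_size = batch_size

    def __len__(self):
        # Number of batches per epoch; the last batch may be smaller.
        return math.ceil(len(self.features) / self.batch_size)

    def __getitem__(self, index):
        # Return only the slice that belongs to batch `index`.
        start = index * self.batch_size
        stop = start + self.batch_size
        return self.features[start:stop], self.targets[start:stop]


seq = BatchSequence(list(range(10)), list(range(10)), batch_size=4)
print(len(seq))
```

An instance like this is what would be passed as generator= or validation_data= to fit_generator; in practice __getitem__ would load each batch from disk instead of slicing an in-memory list.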
I have several callbacks in my fit_generator call, including ModelCheckpoint to save my model as a .hdf5 file, and I use use_multiprocessing=True, workers=16 in fit_generator.
Here is a snippet of this function:
    self.model.fit_generator(generator=training_generator,
                             validation_data=validation_generator,
                             epochs=epoch,
                             use_multiprocessing=True,
                             workers=8,
                             callbacks=monitor,
                             verbose=2)
The error message is attached below:
Epoch 00005: val_loss did not improve
Epoch 6/15
- 135s - loss: 52.3622 - val_loss: 74.5698
Epoch 00006: val_loss improved from 74.99819 to 74.56982, saving model to models/002008.SZ/model.hdf5
Epoch 7/15
- 135s - loss: 52.3163 - val_loss: 74.2776
Epoch 00007: val_loss improved from 74.56982 to 74.27758, saving model to models/002008.SZ/model.hdf5
Traceback (most recent call last):
File "/root/.pycharm_helpers/pydev/pydevd.py", line 1664, in <module>
main()
File "/root/.pycharm_helpers/pydev/pydevd.py", line 1658, in main
globals = debugger.run(setup['file'], None, None, is_module)
File "/root/.pycharm_helpers/pydev/pydevd.py", line 1068, in run
pydev_imports.execfile(file, globals, locals) # execute the script
File "/root/.pycharm_helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
exec(compile(contents+"\n", file, 'exec'), glob, loc)
File "/data/CNN_dlw_v0/CNN_dlw/main.py", line 100, in <module>
main()
File "/data/CNN_dlw_v0/CNN_dlw/main.py", line 72, in main
model.train(epoch=15)
File "/data/CNN_dlw_v0/CNN_dlw/Model.py", line 383, in train
verbose=2)
File "/data/SkyCompute/lib/python3.6/site-packages/keras/legacy/interfaces.py", line 91, in wrapper
return func(*args, **kwargs)
File "/data/SkyCompute/lib/python3.6/site-packages/keras/engine/training.py", line 2280, in fit_generator
callbacks.on_epoch_end(epoch, epoch_logs)
File "/data/SkyCompute/lib/python3.6/site-packages/keras/callbacks.py", line 77, in on_epoch_end
callback.on_epoch_end(epoch, logs)
File "/data/SkyCompute/lib/python3.6/site-packages/keras/callbacks.py", line 447, in on_epoch_end
self.model.save(filepath, overwrite=True)
File "/data/SkyCompute/lib/python3.6/site-packages/keras/engine/topology.py", line 2576, in save
save_model(self, filepath, overwrite, include_optimizer)
File "/data/SkyCompute/lib/python3.6/site-packages/keras/models.py", line 106, in save_model
with h5py.File(filepath, mode='w') as f:
File "/data/SkyCompute/lib/python3.6/site-packages/h5py/_hl/files.py", line 271, in __init__
fid = make_fid(name, mode, userblock_size, fapl, swmr=swmr)
File "/data/SkyCompute/lib/python3.6/site-packages/h5py/_hl/files.py", line 107, in make_fid
fid = h5f.create(name, h5f.ACC_TRUNC, fapl=fapl, fcpl=fcpl)
File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
File "h5py/h5f.pyx", line 98, in h5py.h5f.create
OSError: Unable to create file (Unable to lock file, errno = 11, error message = 'resource temporarily unavailable')
^C
Process finished with exit code 1
Note that the model is sometimes saved successfully, so the failure seems to be probabilistic: the more save operations have run, the more likely this OSError becomes.
I think it could be caused by setting use_multiprocessing=True and workers=16, so that one process tries to save the file while another process is also trying to access the same file. I'm not quite sure exactly what happens here. I think Keras should have some internal control to prevent this.
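The errno in the traceback (11, 'resource temporarily unavailable') is what a non-blocking file lock reports when another open handle already holds the lock. The sketch below reproduces that error with the standard library only, assuming a POSIX system; it illustrates the locking behaviour in general and is not a claim about which process holds the lock in this issue. The function name try_lock_twice is made up for the example.

```python
import errno
import fcntl
import os
import tempfile


def try_lock_twice(path):
    """Open `path` twice and request an exclusive, non-blocking lock on each handle.

    Returns the errno raised by the second request, or None if it succeeded.
    """
    with open(path, "w") as first, open(path, "a") as second:
        # The first handle acquires the lock.
        fcntl.flock(first, fcntl.LOCK_EX | fcntl.LOCK_NB)
        try:
            # The second handle cannot get it and fails immediately instead of waiting.
            fcntl.flock(second, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except OSError as exc:
            return exc.errno
        return None


fd, demo_path = tempfile.mkstemp(suffix=".hdf5")
os.close(fd)
result = try_lock_twice(demo_path)
os.remove(demo_path)
print(result == errno.EAGAIN)
```

On Linux errno.EAGAIN is 11, which matches the number in the traceback.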
Edit: When I set use_multiprocessing=False, this OSError stopped appearing. However, without multiprocessing the training procedure is much slower now. One not-so-elegant solution I can think of is to drop ModelCheckpoint and only save the model after training, so I can still use multiprocessing during training. But that way I will not be able to save the model with the smallest val_loss. :<
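A middle ground between these two options would be to keep checkpointing but tolerate a transient lock failure. The helper below is a hypothetical sketch, not part of Keras: save_with_retry and its parameters are invented for illustration, and it has not been confirmed to resolve this issue. It writes to a temporary path that keeps the original extension, then moves the result over the target file, retrying when an OSError is raised.

```python
import os
import time


def save_with_retry(save_fn, filepath, retries=5, delay=1.0):
    """Call save_fn(tmp_path), then atomically move the result to filepath.

    Retries on OSError and returns the number of attempts that were needed.
    """
    root, ext = os.path.splitext(filepath)
    # Per-process temporary name, so concurrent writers never share a file.
    tmp_path = "{}.tmp-{}{}".format(root, os.getpid(), ext)
    for attempt in range(1, retries + 1):
        try:
            save_fn(tmp_path)
            os.replace(tmp_path, filepath)  # atomic on the same filesystem
            return attempt
        except OSError:
            if attempt == retries:
                raise
            time.sleep(delay)
```

Assuming a custom callback, it could be called from on_epoch_end as save_with_retry(self.model.save, filepath) whenever val_loss improves.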
Edit2: I checked, and my Keras version is 2.1.4.
Issue Analytics
- Created 5 years ago
- Reactions:29
- Comments:75 (6 by maintainers)
Hi,
I have just solved this issue by uninstalling h5py 2.8.0 and reinstalling h5py 2.7.1. It now works with keras 2.2.4 and tensorflow-gpu 1.12.0. Please cross-check it for other keras and tensorflow versions.
In my case, I encountered this issue when moving my Python scripts to another machine. Because I blindly reinstalled all my Python libraries without paying attention to the individual versions, this error appeared when checkpointing my models:
OSError: Unable to create file (unable to lock file, errno = 11, error message = 'Resource temporarily unavailable')
As it wasn't crashing on the previous machine, I tried all the tips in this thread without any success, until I found that the only difference from my previous setup was the version of the h5py library.
TIP: if it continues to crash after downgrading to h5py 2.7.1, delete the existing *.hdf5 files that you want to overwrite and it should work.
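One plausible explanation for the version dependence, not confirmed in this thread, is that the two h5py releases were built against different HDF5 versions, and that HDF5 1.10 locks files on open while the 1.8 series does not. If that is the cause, an alternative to downgrading is the HDF5_USE_FILE_LOCKING environment variable, which recent HDF5 1.10 releases honour. The sketch below sets it and prints the installed versions; the try/except is only there so it runs without h5py.

```python
import os

# Must be set before h5py (and with it the HDF5 C library) is first imported.
os.environ["HDF5_USE_FILE_LOCKING"] = "FALSE"

try:
    import h5py
    # (h5py version, version of the HDF5 library it was built against)
    versions = (h5py.version.version, h5py.version.hdf5_version)
except ImportError:
    versions = None

print("h5py / HDF5 versions:", versions)
```

Disabling locking also removes the protection against two processes writing the same file, so it is only reasonable when a single process does the checkpointing.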
We're experiencing the same issue using the “model_checkpoint” callback; we also have use_multiprocessing=True.
It happens after the second epoch.