model.save() raised an OSError: Unable to create file (error message = 'resource temporarily unavailable')

I built a Keras model, but my dataset is sometimes too large to fit into memory, which raised a MemoryError. After some searching, I found that I needed to implement a generator class that inherits from keras.utils.Sequence so that I could use model.fit_generator and model.predict_generator.
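For readers unfamiliar with the pattern, here is a minimal sketch of the batching logic such a generator needs. It is not the author's class: BatchSequence and its arguments are hypothetical names, and it is written as a plain Python class so that it runs without Keras installed. In real code the class would inherit from keras.utils.Sequence, which expects the same __len__ and __getitem__ methods.

```python
import math


class BatchSequence:
    """Index-based batch provider mirroring the keras.utils.Sequence protocol.

    Hypothetical illustration: in real code, inherit from keras.utils.Sequence.
    """

    def __init__(self, features, targets, batch_size):
        if len(features) != len(targets):
            raise ValueError("features and targets must have the same length")
        self.features = features
        self.targets = targets
        self.batch_size = batch_size

    def __len__(self):
        # Number of batches per epoch; the last batch may be smaller.
        return math.ceil(len(self.features) / self.batch_size)

    def __getitem__(self, index):
        # Return only the slice for batch `index`. A real implementation could
        # load this slice from disk here instead of keeping all data in memory.
        start = index * self.batch_size
        stop = start + self.batch_size
        return self.features[start:stop], self.targets[start:stop]
```

Because each batch is addressed by index, worker processes can fetch different batches independently, which is what makes this pattern usable with use_multiprocessing=True.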

My fit_generator call has several callbacks, including ModelCheckpoint to save the model as a .hdf5 file, and I pass use_multiprocessing=True, workers=16 to fit_generator. Here is a snippet of the call:

        self.model.fit_generator(generator=training_generator,
                                 validation_data=validation_generator,
                                 epochs=epoch,
                                 use_multiprocessing=True,
                                 workers=8,
                                 callbacks=monitor,
                                 verbose=2)

The error message is attached below:


Epoch 00005: val_loss did not improve
Epoch 6/15
 - 135s - loss: 52.3622 - val_loss: 74.5698

Epoch 00006: val_loss improved from 74.99819 to 74.56982, saving model to models/002008.SZ/model.hdf5
Epoch 7/15
 - 135s - loss: 52.3163 - val_loss: 74.2776

Epoch 00007: val_loss improved from 74.56982 to 74.27758, saving model to models/002008.SZ/model.hdf5
Traceback (most recent call last):
  File "/root/.pycharm_helpers/pydev/pydevd.py", line 1664, in <module>
    main()
  File "/root/.pycharm_helpers/pydev/pydevd.py", line 1658, in main
    globals = debugger.run(setup['file'], None, None, is_module)
  File "/root/.pycharm_helpers/pydev/pydevd.py", line 1068, in run
    pydev_imports.execfile(file, globals, locals)  # execute the script
  File "/root/.pycharm_helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "/data/CNN_dlw_v0/CNN_dlw/main.py", line 100, in <module>
    main()
  File "/data/CNN_dlw_v0/CNN_dlw/main.py", line 72, in main
    model.train(epoch=15)
  File "/data/CNN_dlw_v0/CNN_dlw/Model.py", line 383, in train
    verbose=2)
  File "/data/SkyCompute/lib/python3.6/site-packages/keras/legacy/interfaces.py", line 91, in wrapper
    return func(*args, **kwargs)
  File "/data/SkyCompute/lib/python3.6/site-packages/keras/engine/training.py", line 2280, in fit_generator
    callbacks.on_epoch_end(epoch, epoch_logs)
  File "/data/SkyCompute/lib/python3.6/site-packages/keras/callbacks.py", line 77, in on_epoch_end
    callback.on_epoch_end(epoch, logs)
  File "/data/SkyCompute/lib/python3.6/site-packages/keras/callbacks.py", line 447, in on_epoch_end
    self.model.save(filepath, overwrite=True)
  File "/data/SkyCompute/lib/python3.6/site-packages/keras/engine/topology.py", line 2576, in save
    save_model(self, filepath, overwrite, include_optimizer)
  File "/data/SkyCompute/lib/python3.6/site-packages/keras/models.py", line 106, in save_model
    with h5py.File(filepath, mode='w') as f:
  File "/data/SkyCompute/lib/python3.6/site-packages/h5py/_hl/files.py", line 271, in __init__
    fid = make_fid(name, mode, userblock_size, fapl, swmr=swmr)
  File "/data/SkyCompute/lib/python3.6/site-packages/h5py/_hl/files.py", line 107, in make_fid
    fid = h5f.create(name, h5f.ACC_TRUNC, fapl=fapl, fcpl=fcpl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5f.pyx", line 98, in h5py.h5f.create
OSError: Unable to create file (Unable to lock file, errno = 11, error message = 'resource temporarily unavailable')
^C
Process finished with exit code 1

Note that the model is sometimes saved successfully. The failure seems probabilistic: the more saves have happened, the more likely the OSError becomes.

My guess is that, because I set use_multiprocessing=True and workers=16, one process tries to save the file while another process is accessing the same file. I'm not sure exactly what happens here. I think Keras should have some internal control to prevent this.
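The kind of contention guessed at here can be reproduced in isolation. The sketch below does not use h5py or Keras. It assumes the HDF5 library takes a non-blocking advisory lock when it opens the file, which is what the "Unable to lock file, errno = 11" text in the traceback suggests but which this thread does not confirm as the cause. The function name and the temporary file are hypothetical, and fcntl is available only on Unix-like systems.

```python
import errno
import fcntl
import os
import tempfile


def second_lock_errno(path):
    """Hold an exclusive lock on `path`, then try to lock it a second time.

    Returns the errno of the failed second attempt, or None if it succeeded.
    """
    first = open(path, "w")
    second = open(path, "r")
    try:
        # First handle takes the lock, as one writer would.
        fcntl.flock(first, fcntl.LOCK_EX | fcntl.LOCK_NB)
        try:
            # Second handle asks for the same lock without waiting.
            fcntl.flock(second, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except OSError as exc:
            return exc.errno
        return None
    finally:
        second.close()
        first.close()


fd, demo_path = tempfile.mkstemp(suffix=".hdf5")
os.close(fd)
try:
    code = second_lock_errno(demo_path)
    if code is not None:
        # EAGAIN is the "resource temporarily unavailable" error.
        print(code == errno.EAGAIN, os.strerror(code))
finally:
    os.remove(demo_path)
```

On Linux, EAGAIN has the value 11, matching the errno in the traceback above.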

Edit: When I set use_multiprocessing=False, the OSError stopped appearing, but without multiprocessing the training procedure is much slower. One inelegant workaround I can think of is to drop ModelCheckpoint and save the model only after training, which lets me keep multiprocessing, but then I cannot save the model with the smallest val_loss. :<
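Another possible workaround, not proposed anywhere in this thread, is to retry the save when the lock is temporarily unavailable. The sketch below is only an assumption-laden illustration: save_with_retry is a hypothetical helper, it matches the error by errno or by the message text seen in the traceback, and it will not help if another process holds the lock permanently.

```python
import errno
import time


def save_with_retry(save_fn, filepath, attempts=5, delay=1.0):
    """Call save_fn(filepath), retrying while the file lock is unavailable.

    Hypothetical helper; save_fn is any callable that writes to filepath.
    """
    for attempt in range(1, attempts + 1):
        try:
            return save_fn(filepath)
        except OSError as exc:
            # Retry only the 'resource temporarily unavailable' case. The
            # exception may not carry an errno attribute, so the message text
            # from the traceback is checked as well (an assumption).
            busy = exc.errno == errno.EAGAIN or "errno = 11" in str(exc)
            if not busy or attempt == attempts:
                raise
            time.sleep(delay)
```

Inside a custom checkpoint callback this could be called as save_with_retry(self.model.save, filepath), which would keep the best-val_loss checkpointing while tolerating an occasional failed lock.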

Edit 2: I checked, and my Keras version is 2.1.4.

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Reactions: 29
  • Comments: 75 (6 by maintainers)

Top GitHub Comments

27 reactions
GLambard commented, Nov 29, 2018

Hi,

I have just solved this issue by uninstalling h5py 2.8.0 and reinstalling h5py 2.7.1. It now works with keras 2.2.4 and tensorflow-gpu 1.12.0. Please cross-check it with other keras and tensorflow versions.

In my case, I encountered this issue when moving my Python scripts to another machine. I had reinstalled all my Python libraries blindly, without paying attention to individual versions, and this error appeared when checkpointing my models: OSError: Unable to create file (unable to lock file, errno = 11, error message = 'Resource temporarily unavailable')

As it wasn’t crashing on the previous machine, I tried all the tips in this thread without any success, until I found that the only difference from my previous setup was the version of the h5py library.

TIP: if it continues to crash after downgrading to h5py 2.7.1, delete the existing *.hdf5 files that you want to overwrite, and it should work.
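The tip about deleting old checkpoint files can be scripted. This is a minimal sketch using only the standard library. The function name is hypothetical, the *.hdf5 pattern comes from the comment above, and the directory to clean is whatever your ModelCheckpoint path points at.

```python
from pathlib import Path


def remove_stale_checkpoints(directory, pattern="*.hdf5"):
    """Delete checkpoint files matching `pattern` directly inside `directory`.

    Returns the names of the files that were removed.
    """
    removed = []
    for path in sorted(Path(directory).glob(pattern)):
        # Skip directories that happen to match the pattern.
        if path.is_file():
            path.unlink()
            removed.append(path.name)
    return removed
```

Run it once before training starts, not while a training job is still writing to the same directory.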

18 reactions
remomomo commented, Sep 13, 2018

We’re experiencing the same issue using the “model_checkpoint” callback, and we also have use_multiprocessing=True.

It happens after the second epoch.
