
fit_generator seems to deadlock

See original GitHub issue
  • [x] Check that you are up-to-date with the master branch of Keras. You can update with: pip install git+git://github.com/keras-team/keras.git --upgrade --no-deps

  • [x] If running on TensorFlow, check that you are up-to-date with the latest version. The installation instructions can be found here.

  • [x] If running on Theano, check that you are up-to-date with the master branch of Theano. You can update with: pip install git+git://github.com/Theano/Theano.git --upgrade --no-deps

  • [x] Provide a link to a GitHub Gist of a Python script that can reproduce your issue (or just copy the script here if it is short).


I’m using fit_generator with multiprocessing. The first epoch runs normally, but the second epoch does not finish even after a long time, and the CPU usage of the worker processes is very low (about 0%). After I kill the process, the traceback shows that the main process is waiting for a lock.

Environment: Ubuntu 16.04, Python 3.6.5, Keras 2.1.6, TensorFlow 1.8.0 with GPU.

Some Log:

2018-06-02 10:08:08.448699: Total params: 21,519,259
2018-06-02 10:08:08.448710: Trainable params: 1,514,965
2018-06-02 10:08:08.448718: Non-trainable params: 20,004,294
2018-06-02 10:08:10.959486: Epoch 1/50
2018-06-02 10:28:38.247401:  - 1227s - loss: 0.2332 - rmse: 0.2332 - val_loss: 0.2218 - val_rmse: 0.2218
2018-06-02 10:28:38.248712: 
Epoch 00001: val_rmse improved from inf to 0.22183, saving model to ../log/2018-06-02 10:03:57/keras_nn_0.hdf5
2018-06-02 10:28:40.006534: Epoch 2/50
2018-06-02 11:23:44.784851: Traceback (most recent call last):

Main Process Traceback:

2018-06-02 11:23:44.785217: KeyboardInterrupt
2018-06-02 11:23:44.803122: Traceback (most recent call last):
2018-06-02 11:23:44.803174:   File "/home/**********/image_model.py", line 498, in <module>
2018-06-02 11:23:44.803383: validation_data=CustomSequence(X_valid, y_valid, batch_size),
2018-06-02 11:23:44.803390:   File "/home/****/miniconda3/lib/python3.6/site-packages/keras/legacy/interfaces.py", line 91, in wrapper
2018-06-02 11:23:44.803792: return func(*args, **kwargs)
2018-06-02 11:23:44.803800:   File "/home/****/miniconda3/lib/python3.6/site-packages/keras/engine/training.py", line 2194, in fit_generator
2018-06-02 11:23:44.805133: generator_output = next(output_generator)
2018-06-02 11:23:44.805141:   File "/home/****/miniconda3/lib/python3.6/site-packages/keras/utils/data_utils.py", line 578, in get
2018-06-02 11:23:44.805721: inputs = self.queue.get(block=True).get()
2018-06-02 11:23:44.805728:   File "/home/****/miniconda3/lib/python3.6/multiprocessing/pool.py", line 638, in get
2018-06-02 11:23:44.805851: self.wait(timeout)
2018-06-02 11:23:44.805857:   File "/home/****/miniconda3/lib/python3.6/multiprocessing/pool.py", line 635, in wait
2018-06-02 11:23:44.805962: self._event.wait(timeout)
2018-06-02 11:23:44.805969:   File "/home/****/miniconda3/lib/python3.6/threading.py", line 551, in wait
2018-06-02 11:23:44.806527: signaled = self._cond.wait(timeout)
2018-06-02 11:23:44.806534:   File "/home/****/miniconda3/lib/python3.6/threading.py", line 295, in wait 
2018-06-02 11:23:44.806606: waiter.acquire()
2018-06-02 11:23:44.806620: KeyboardInterrupt
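
For context, the frame the main process is stuck in (inputs = self.queue.get(block=True).get() in keras/utils/data_utils.py) is where Keras's enqueuer pulls an AsyncResult off its internal queue and waits for a pool worker to deliver the batch. The following is only a minimal sketch of that pattern, not Keras's actual implementation; load_batch is a hypothetical stand-in for Sequence.__getitem__:

import multiprocessing as mp
import queue

def load_batch(idx):
  # Hypothetical stand-in for Sequence.__getitem__(idx).
  return idx * 2

if __name__ == '__main__':
  pool = mp.Pool(4)
  pending = queue.Queue()
  for i in range(10):
    # Each batch is scheduled in a worker; the AsyncResult goes onto a queue.
    pending.put(pool.apply_async(load_batch, (i,)))

  while not pending.empty():
    result = pending.get(block=True)  # like self.queue.get(block=True)
    batch = result.get()              # blocks forever if the worker never produces the batch
    print(batch)

  pool.close()
  pool.join()

If a forked worker never gets to produce its batch (for example because it inherited a lock that was already held at fork time), that final .get() never returns, which matches the hang reported here.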

Related Code:

# Note: tokenizer, title_maxlen, desp_maxlen, get_image_data, categorical, continous,
# batch_size, model, the train/valid data and the callbacks are defined elsewhere in the script.
import numpy as np
from keras.preprocessing import sequence
from keras.utils import Sequence

class CustomSequence(Sequence):
  def __init__(self, df, y_set=None, batch_size=batch_size, isTrainDataOrTest=True):
    self.X, self.y = df, y_set
    self.batch_size = batch_size
    self.isTrainDataOrTest = isTrainDataOrTest
    self.counter = 0
    self.epoch_counter = 0

  def __len__(self):
    # Number of batches per epoch.
    return int(np.ceil(len(self.X) / self.batch_size))

  def __getitem__(self, idx):
    # Slice one batch out of the dataframe, clamping the last batch to the data size.
    last_index = min((idx + 1) * self.batch_size, self.X.shape[0])
    batch_x = self.X[idx * self.batch_size:last_index]

    # Tokenize and pad the text columns.
    title = tokenizer.texts_to_sequences(batch_x['title'])
    title = sequence.pad_sequences(title, maxlen=title_maxlen)
    desp = tokenizer.texts_to_sequences(batch_x['description'])
    desp = sequence.pad_sequences(desp, maxlen=desp_maxlen)

    # Load the image for each row of the batch.
    img_data = np.zeros(shape=(batch_x.shape[0], 128, 128, 3))
    for index, im_id in enumerate(batch_x['image']):
      img_data[index] = get_image_data(im_id, self.isTrainDataOrTest)

    self.counter += 1
    if self.isTrainDataOrTest:
      batch_y = self.y[idx * self.batch_size:last_index]
      return [batch_x[categorical], batch_x[continous], title, desp, img_data], np.array(batch_y)
    else:
      return [batch_x[categorical], batch_x[continous], title, desp, img_data]

  def on_epoch_end(self):
    print('\nEpoch end: ' + str(self.epoch_counter) + ' counter: ' + str(self.counter))
    self.epoch_counter += 1


model.fit_generator(generator=CustomSequence(X_train, y_train, batch_size),
                    epochs=50,
                    verbose=2,
                    use_multiprocessing=True,
                    workers=4,
                    validation_data=CustomSequence(X_valid, y_valid, batch_size),
                    callbacks=[check_point, early_stop, rlrop])
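
For anyone hitting the same hang, a commonly reported workaround (a sketch, not a fix confirmed in this thread) is to keep the Sequence but run the workers as threads instead of forked processes; the trade-off is that a CPU-heavy __getitem__ will be throttled by the GIL:

model.fit_generator(generator=CustomSequence(X_train, y_train, batch_size),
                    epochs=50,
                    verbose=2,
                    use_multiprocessing=False,  # thread workers: no fork, no inherited locks
                    workers=4,
                    max_queue_size=10,          # the default queue size, shown here for clarity
                    validation_data=CustomSequence(X_valid, y_valid, batch_size),
                    callbacks=[check_point, early_stop, rlrop])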

Issue Analytics

  • State:closed
  • Created 5 years ago
  • Reactions:5
  • Comments:24 (4 by maintainers)

Top GitHub Comments

5 reactions
jgabrito commented, Oct 5, 2019

I was facing the same issue here. What did the trick for me was switching the multiprocessing library's process start method from the default ‘fork’ to ‘spawn’ or ‘forkserver’, as pointed out by @saulvargas. At the end of each epoch, the multiprocessing library complains about some leaked semaphores, which it releases, but the show goes on.

The main script should look like:

if __name__ == '__main__':
  import multiprocessing as mp
  mp.set_start_method('spawn')
  # import everything else and run your code
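
One practical consequence of ‘spawn’ (and ‘forkserver’): worker processes re-import the __main__ module instead of inheriting the parent's memory, so the code that starts training must sit under the if __name__ == '__main__': guard, and the Sequence together with its constructor arguments must be picklable so it can be handed to the freshly started workers.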

I am using Keras 2.3.30 with Tensorflow 1.13.1 (same results with CPU and GPU versions alike) under Linux (Arch). My input pipeline uses only numpy, scipy.ndimage and pydicom. Everything is installed in a python virtual env.

The PR cited above, which has since been merged into the Keras master branch, did not help much. It simply uses a 30-second timeout to detect when the worker process pool hangs for some batch, prints a warning message, and falls back to sequential code for that batch.

That fix was designed for the case where the culprit for the hang is the data-generating code running in the worker processes for some batch, but that is not my case. When the hang does happen, it happens at the very beginning of an epoch, or when validation starts, i.e. when the worker process pool is spawned, and the workers never reach the actual data-generating code. They hang even before the initialization code in keras/utils/data_utils.py gets called. So with the fix in the PR, you end up with a sequential input pipeline that sleeps for 30 seconds before producing each batch.
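
To make the described behaviour concrete, here is a generic sketch of a timeout-and-fall-back pattern of the kind the comment describes; it is illustrative only and not the actual code from the PR, and load_batch is a hypothetical stand-in for the per-batch generator code:

import multiprocessing as mp
import warnings

def load_batch(idx):
  # Hypothetical stand-in for the per-batch data-generating code.
  return idx * 2

def get_batch(pool, idx, timeout=30):
  try:
    # Ask a worker for the batch, but give up after `timeout` seconds.
    return pool.apply_async(load_batch, (idx,)).get(timeout=timeout)
  except mp.TimeoutError:
    warnings.warn('Worker pool appears stuck; producing this batch sequentially.')
    return load_batch(idx)  # sequential fallback in the main process

if __name__ == '__main__':
  with mp.Pool(4) as pool:
    for i in range(5):
      print(get_batch(pool, i))

As the commenter notes, if the pool itself is stuck, every batch pays the full timeout before the sequential fallback kicks in.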

4 reactions
fhennecker commented, Sep 25, 2019

We are running into this issue a lot after writing a subclass of keras.utils.Sequence that works with np.arrays. Our trainings hang on that same self.queue.get(block=True).get() call in data_utils.py. We had the issue on Keras 2.1.6 and 2.2.4.
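
For reference, a minimal array-backed Sequence of the kind described above looks roughly like this (an illustrative sketch, not the commenter's actual code); per the report, even something this simple hangs once use_multiprocessing=True is set:

import numpy as np
from keras.utils import Sequence

class ArraySequence(Sequence):
  # Minimal Sequence over in-memory numpy arrays.
  def __init__(self, x, y, batch_size):
    self.x, self.y, self.batch_size = x, y, batch_size

  def __len__(self):
    return int(np.ceil(len(self.x) / self.batch_size))

  def __getitem__(self, idx):
    batch = slice(idx * self.batch_size, (idx + 1) * self.batch_size)
    return self.x[batch], self.y[batch]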

Read more comments on GitHub >

Top Results From Across the Web

python - Tensorflow / Keras Deadlock in fit_generator for data ...
Running keras.model.fit_generator with use_multiprocessing=True and multiple workers on a data generator that itself contains a tensorflow ...

Diagnose keras deadlock: MWE - Kaggle
Diagnose keras deadlock: MWE ... loss = old_loss_fn) history = model.fit_generator( generator = train_generator, validation_data = valid_generator, ...

How to Grid Search Hyperparameters for Deep Learning ...
So I'm not calling model.fit but model.fit_generator for the actual training. This does not seem to be supported through the grid search.

[Example code]-Having issues with TheadPoolExecutor ...
Having issues with TheadPoolExecutor _wait_for_tstate_lock python (thread deadlock?) keras thread safe generator for model.fit_generator with Python 3.6.x ...

Advances in Computer Science for Engineering and Education II
seems to occur approximately on the 14–16th day of operation, and the rest do ... Keras library uses the fit_generator function, which automatically...
