
fit_generator seems to deadlock

See original GitHub issue
  • [x] Check that you are up-to-date with the master branch of Keras. You can update with: pip install git+git://github.com/keras-team/keras.git --upgrade --no-deps

  • [x] If running on TensorFlow, check that you are up-to-date with the latest version. The installation instructions can be found here.

  • [x] If running on Theano, check that you are up-to-date with the master branch of Theano. You can update with: pip install git+git://github.com/Theano/Theano.git --upgrade --no-deps

  • [x] Provide a link to a GitHub Gist of a Python script that can reproduce your issue (or just copy the script here if it is short).


I’m using fit_generator with multiprocessing. The first epoch runs normally, but the second epoch does not finish even after a long time, and the CPU usage of the worker processes is very low (about 0%). After I kill the process, the traceback shows that the main process is waiting for a lock.

Environment: Ubuntu 16.04, Python 3.6.5, Keras 2.1.6, TensorFlow 1.8.0 with GPU.

Some Log:

2018-06-02 10:08:08.448699: Total params: 21,519,259
2018-06-02 10:08:08.448710: Trainable params: 1,514,965
2018-06-02 10:08:08.448718: Non-trainable params: 20,004,294
2018-06-02 10:08:10.959486: Epoch 1/50
2018-06-02 10:28:38.247401:  - 1227s - loss: 0.2332 - rmse: 0.2332 - val_loss: 0.2218 - val_rmse: 0.2218
2018-06-02 10:28:38.248712: 
Epoch 00001: val_rmse improved from inf to 0.22183, saving model to ../log/2018-06-02 10:03:57/keras_nn_0.hdf5
2018-06-02 10:28:40.006534: Epoch 2/50
2018-06-02 11:23:44.784851: Traceback (most recent call last):

Main Process Traceback:

2018-06-02 11:23:44.785217: KeyboardInterrupt
2018-06-02 11:23:44.803122: Traceback (most recent call last):
2018-06-02 11:23:44.803174:   File "/home/**********/image_model.py", line 498, in <module>
2018-06-02 11:23:44.803383: validation_data=CustomSequence(X_valid, y_valid, batch_size),
2018-06-02 11:23:44.803390:   File "/home/****/miniconda3/lib/python3.6/site-packages/keras/legacy/interfaces.py", line 91, in wrapper
2018-06-02 11:23:44.803792: return func(*args, **kwargs)
2018-06-02 11:23:44.803800:   File "/home/****/miniconda3/lib/python3.6/site-packages/keras/engine/training.py", line 2194, in fit_generator
2018-06-02 11:23:44.805133: generator_output = next(output_generator)
2018-06-02 11:23:44.805141:   File "/home/****/miniconda3/lib/python3.6/site-packages/keras/utils/data_utils.py", line 578, in get
2018-06-02 11:23:44.805721: inputs = self.queue.get(block=True).get()
2018-06-02 11:23:44.805728:   File "/home/****/miniconda3/lib/python3.6/multiprocessing/pool.py", line 638, in get
2018-06-02 11:23:44.805851: self.wait(timeout)
2018-06-02 11:23:44.805857:   File "/home/****/miniconda3/lib/python3.6/multiprocessing/pool.py", line 635, in wait
2018-06-02 11:23:44.805962: self._event.wait(timeout)
2018-06-02 11:23:44.805969:   File "/home/****/miniconda3/lib/python3.6/threading.py", line 551, in wait
2018-06-02 11:23:44.806527: signaled = self._cond.wait(timeout)
2018-06-02 11:23:44.806534:   File "/home/****/miniconda3/lib/python3.6/threading.py", line 295, in wait 
2018-06-02 11:23:44.806606: waiter.acquire()
2018-06-02 11:23:44.806620: KeyboardInterrupt
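
For context, the frame the main process is stuck in (inputs = self.queue.get(block=True).get() in keras/utils/data_utils.py) is where Keras's enqueuer pulls an AsyncResult off its internal queue and waits for a pool worker to deliver the batch. The following is only a minimal sketch of that pattern, not Keras's actual implementation; load_batch is a hypothetical stand-in for Sequence.__getitem__:

import multiprocessing as mp
import queue

def load_batch(idx):
  # Hypothetical stand-in for Sequence.__getitem__(idx).
  return idx * 2

if __name__ == '__main__':
  pool = mp.Pool(4)
  pending = queue.Queue()
  for i in range(10):
    # Each batch is scheduled in a worker; the AsyncResult goes onto a queue.
    pending.put(pool.apply_async(load_batch, (i,)))

  while not pending.empty():
    result = pending.get(block=True)  # like self.queue.get(block=True)
    batch = result.get()              # blocks forever if the worker never produces the batch
    print(batch)

  pool.close()
  pool.join()

If a forked worker never gets to produce its batch (for example because it inherited a lock that was already held at fork time), that final .get() never returns, which matches the hang reported here.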

Related Code:

# Note: tokenizer, title_maxlen, desp_maxlen, get_image_data, categorical, continous,
# batch_size, model, the train/valid data and the callbacks are defined elsewhere in the script.
import numpy as np
from keras.preprocessing import sequence
from keras.utils import Sequence

class CustomSequence(Sequence):
  def __init__(self, df, y_set=None, batch_size=batch_size, isTrainDataOrTest=True):
    self.X, self.y = df, y_set
    self.batch_size = batch_size
    self.isTrainDataOrTest = isTrainDataOrTest
    self.counter = 0
    self.epoch_counter = 0

  def __len__(self):
    # Number of batches per epoch.
    return int(np.ceil(len(self.X) / self.batch_size))

  def __getitem__(self, idx):
    # Slice one batch out of the dataframe, clamping the last batch to the data size.
    last_index = min((idx + 1) * self.batch_size, self.X.shape[0])
    batch_x = self.X[idx * self.batch_size:last_index]

    # Tokenize and pad the text columns.
    title = tokenizer.texts_to_sequences(batch_x['title'])
    title = sequence.pad_sequences(title, maxlen=title_maxlen)
    desp = tokenizer.texts_to_sequences(batch_x['description'])
    desp = sequence.pad_sequences(desp, maxlen=desp_maxlen)

    # Load the image for each row of the batch.
    img_data = np.zeros(shape=(batch_x.shape[0], 128, 128, 3))
    for index, im_id in enumerate(batch_x['image']):
      img_data[index] = get_image_data(im_id, self.isTrainDataOrTest)

    self.counter += 1
    if self.isTrainDataOrTest:
      batch_y = self.y[idx * self.batch_size:last_index]
      return [batch_x[categorical], batch_x[continous], title, desp, img_data], np.array(batch_y)
    else:
      return [batch_x[categorical], batch_x[continous], title, desp, img_data]

  def on_epoch_end(self):
    print('\nEpoch end: ' + str(self.epoch_counter) + ' counter: ' + str(self.counter))
    self.epoch_counter += 1


model.fit_generator(generator=CustomSequence(X_train, y_train, batch_size),
                    epochs=50,
                    verbose=2,
                    use_multiprocessing=True,
                    workers=4,
                    validation_data=CustomSequence(X_valid, y_valid, batch_size),
                    callbacks=[check_point, early_stop, rlrop])
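
For anyone hitting the same hang, a commonly reported workaround (a sketch, not a fix confirmed in this thread) is to keep the Sequence but run the workers as threads instead of forked processes; the trade-off is that a CPU-heavy __getitem__ will be throttled by the GIL:

model.fit_generator(generator=CustomSequence(X_train, y_train, batch_size),
                    epochs=50,
                    verbose=2,
                    use_multiprocessing=False,  # thread workers: no fork, no inherited locks
                    workers=4,
                    max_queue_size=10,          # the default queue size, shown here for clarity
                    validation_data=CustomSequence(X_valid, y_valid, batch_size),
                    callbacks=[check_point, early_stop, rlrop])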

Issue Analytics

  • State:closed
  • Created 5 years ago
  • Reactions:5
  • Comments:24 (4 by maintainers)

Top GitHub Comments

5 reactions
jgabrito commented, Oct 5, 2019

I was facing the same issue here. What did the trick for me was switching the multiprocessing library's process start method from the default ‘fork’ to ‘spawn’ or ‘forkserver’, as pointed out by @saulvargas. At the end of each epoch, the multiprocessing library complains about some leaked semaphores, which it releases, but the show goes on.

The main script should look like:

if __name__ == '__main__':
  import multiprocessing as mp
  mp.set_start_method('spawn')
  # import everything else and run your code
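
One practical consequence of ‘spawn’ (and ‘forkserver’): worker processes re-import the __main__ module instead of inheriting the parent's memory, so the code that starts training must sit under the if __name__ == '__main__': guard, and the Sequence together with its constructor arguments must be picklable so it can be handed to the freshly started workers.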

I am using Keras 2.3.30 with Tensorflow 1.13.1 (same results with CPU and GPU versions alike) under Linux (Arch). My input pipeline uses only numpy, scipy.ndimage and pydicom. Everything is installed in a python virtual env.

The PR cited above, which has since been merged into the Keras master branch, did not help much. It simply uses a 30-second timeout to detect when the worker process pool hangs for some batch, prints a warning message, and falls back to sequential code for that batch.

That fix was designed for the case where the culprit for the hang is the data-generating code running in the worker processes for some batch, but that is not my case. When the hang does happen, it happens at the very beginning of an epoch, or when validation starts, i.e. when the worker process pool is spawned, and the workers never reach the actual data-generating code. They hang even before the initialization code in keras/utils/data_utils.py gets called. So with the fix in the PR, you end up with a sequential input pipeline that sleeps for 30 seconds before producing each batch.
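
To make the described behaviour concrete, here is a generic sketch of a timeout-and-fall-back pattern of the kind the comment describes; it is illustrative only and not the actual code from the PR, and load_batch is a hypothetical stand-in for the per-batch generator code:

import multiprocessing as mp
import warnings

def load_batch(idx):
  # Hypothetical stand-in for the per-batch data-generating code.
  return idx * 2

def get_batch(pool, idx, timeout=30):
  try:
    # Ask a worker for the batch, but give up after `timeout` seconds.
    return pool.apply_async(load_batch, (idx,)).get(timeout=timeout)
  except mp.TimeoutError:
    warnings.warn('Worker pool appears stuck; producing this batch sequentially.')
    return load_batch(idx)  # sequential fallback in the main process

if __name__ == '__main__':
  with mp.Pool(4) as pool:
    for i in range(5):
      print(get_batch(pool, i))

As the commenter notes, if the pool itself is stuck, every batch pays the full timeout before the sequential fallback kicks in.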

4 reactions
fhennecker commented, Sep 25, 2019

We are running into this issue a lot after writing a subclass of keras.utils.Sequence that works with np.arrays. Our trainings hang on that same self.queue.get(block=True).get() call in data_utils.py. We had the issue on Keras 2.1.6 and 2.2.4.
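
For reference, a minimal array-backed Sequence of the kind described above looks roughly like this (an illustrative sketch, not the commenter's actual code); per the report, even something this simple hangs once use_multiprocessing=True is set:

import numpy as np
from keras.utils import Sequence

class ArraySequence(Sequence):
  # Minimal Sequence over in-memory numpy arrays.
  def __init__(self, x, y, batch_size):
    self.x, self.y, self.batch_size = x, y, batch_size

  def __len__(self):
    return int(np.ceil(len(self.x) / self.batch_size))

  def __getitem__(self, idx):
    batch = slice(idx * self.batch_size, (idx + 1) * self.batch_size)
    return self.x[batch], self.y[batch]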

Read more comments on GitHub >

Top Results From Across the Web

python - Tensorflow / Keras Deadlock in fit_generator for data ...
Running keras.model.fit_generator with use_multiprocessing=True and multiple workers on a data generator that itself contains a tensorflow ...

Diagnose keras deadlock: MWE - Kaggle
Diagnose keras deadlock: MWE ... loss = old_loss_fn) history = model.fit_generator( generator = train_generator, validation_data = valid_generator, ...

How to Grid Search Hyperparameters for Deep Learning ...
So I'm not calling model.fit but model.fit_generator for the actual training. This does not seem to be supported through the grid search.

[Example code]-Having issues with TheadPoolExecutor ...
Having issues with TheadPoolExecutor _wait_for_tstate_lock python (thread deadlock?) keras thread safe generator for model.fit_generator with Python 3.6.x ...

Advances in Computer Science for Engineering and Education II
seems to occur approximately on the 14–16th day of operation, and the rest do ... Keras library uses the fit_generator function, which automatically...
