fit_generator seems to deadlock
See original GitHub issue.
- [x] Check that you are up-to-date with the master branch of Keras. You can update with: `pip install git+git://github.com/keras-team/keras.git --upgrade --no-deps`
- [x] If running on TensorFlow, check that you are up-to-date with the latest version. The installation instructions can be found here.
- [x] If running on Theano, check that you are up-to-date with the master branch of Theano. You can update with: `pip install git+git://github.com/Theano/Theano.git --upgrade --no-deps`
- [x] Provide a link to a GitHub Gist of a Python script that can reproduce your issue (or just copy the script here if it is short).
I’m using fit_generator with multiprocessing. The first epoch runs normally, but the second epoch does not finish for a long time, and the CPU usage of the worker processes is quite low (about 0%). After I kill the process, it turns out that the main process is waiting for a lock.

Environment: Ubuntu 16.04, Python 3.6.5, Keras 2.1.6, TensorFlow 1.8.0 with GPU.
Some log output:
2018-06-02 10:08:08.448699: Total params: 21,519,259
2018-06-02 10:08:08.448710: Trainable params: 1,514,965
2018-06-02 10:08:08.448718: Non-trainable params: 20,004,294
2018-06-02 10:08:10.959486: Epoch 1/50
2018-06-02 10:28:38.247401: - 1227s - loss: 0.2332 - rmse: 0.2332 - val_loss: 0.2218 - val_rmse: 0.2218
2018-06-02 10:28:38.248712:
Epoch 00001: val_rmse improved from inf to 0.22183, saving model to ../log/2018-06-02 10:03:57/keras_nn_0.hdf5
2018-06-02 10:28:40.006534: Epoch 2/50
2018-06-02 11:23:44.784851: Traceback (most recent call last):
Main Process Traceback:
2018-06-02 11:23:44.785217: KeyboardInterrupt
2018-06-02 11:23:44.803122: Traceback (most recent call last):
2018-06-02 11:23:44.803174: File "/home/**********/image_model.py", line 498, in <module>
2018-06-02 11:23:44.803383: validation_data=CustomSequence(X_valid, y_valid, batch_size),
2018-06-02 11:23:44.803390: File "/home/****/miniconda3/lib/python3.6/site-packages/keras/legacy/interfaces.py", line 91, in wrapper
2018-06-02 11:23:44.803792: return func(*args, **kwargs)
2018-06-02 11:23:44.803800: File "/home/****/miniconda3/lib/python3.6/site-packages/keras/engine/training.py", line 2194, in fit_generator
2018-06-02 11:23:44.805133: generator_output = next(output_generator)
2018-06-02 11:23:44.805141: File "/home/****/miniconda3/lib/python3.6/site-packages/keras/utils/data_utils.py", line 578, in get
2018-06-02 11:23:44.805721: inputs = self.queue.get(block=True).get()
2018-06-02 11:23:44.805728: File "/home/****/miniconda3/lib/python3.6/multiprocessing/pool.py", line 638, in get
2018-06-02 11:23:44.805851: self.wait(timeout)
2018-06-02 11:23:44.805857: File "/home/****/miniconda3/lib/python3.6/multiprocessing/pool.py", line 635, in wait
2018-06-02 11:23:44.805962: self._event.wait(timeout)
2018-06-02 11:23:44.805969: File "/home/****/miniconda3/lib/python3.6/threading.py", line 551, in wait
2018-06-02 11:23:44.806527: signaled = self._cond.wait(timeout)
2018-06-02 11:23:44.806534: File "/home/****/miniconda3/lib/python3.6/threading.py", line 295, in wait
2018-06-02 11:23:44.806606: waiter.acquire()
2018-06-02 11:23:44.806620: KeyboardInterrupt
Related Code:
```python
class CustomSequence(Sequence):
    def __init__(self, df, y_set=None, batch_size=batch_size, isTrainDataOrTest=True):
        self.X, self.y = df, y_set
        self.batch_size = batch_size
        self.isTrainDataOrTest = isTrainDataOrTest
        self.counter = 0
        self.epoch_counter = 0

    def __len__(self):
        return int(np.ceil(len(self.X) / self.batch_size))

    def __getitem__(self, idx):
        # print(f'Batch #: {idx}')
        # print(f'From {idx * self.batch_size} to {(idx + 1) * self.batch_size}')
        last_index = (idx + 1) * self.batch_size
        if last_index > self.X.shape[0]:
            last_index = self.X.shape[0]
        batch_x = self.X[idx * self.batch_size:last_index]
        title = tokenizer.texts_to_sequences(batch_x['title'])
        title = sequence.pad_sequences(title, maxlen=title_maxlen)
        desp = tokenizer.texts_to_sequences(batch_x['description'])
        desp = sequence.pad_sequences(desp, maxlen=desp_maxlen)
        img_data = np.zeros(shape=(batch_x.shape[0], 128, 128, 3))
        for index, im_id in enumerate(batch_x['image']):
            img_data[index] = get_image_data(im_id, self.isTrainDataOrTest)
        # print(f'Returned batch size: {len(batch_x)}')
        self.counter += 1
        if self.isTrainDataOrTest:
            batch_y = self.y[idx * self.batch_size:last_index]
            return [batch_x[categorical], batch_x[continous], title, desp, img_data], np.array(batch_y)
        else:
            return [batch_x[categorical], batch_x[continous], title, desp, img_data]

    def on_epoch_end(self):
        print('\nEpoch end: ' + str(self.epoch_counter) + ' counter: ' + str(self.counter))
        self.epoch_counter += 1
```
```python
model.fit_generator(generator=CustomSequence(X_train, y_train, batch_size),
                    epochs=50,
                    verbose=2,
                    use_multiprocessing=True,
                    workers=4,
                    validation_data=CustomSequence(X_valid, y_valid, batch_size),
                    callbacks=[check_point, early_stop, rlrop])
```
I was facing the same issue here. What did the trick for me was switching the multiprocessing process start method from the default ‘fork’ to ‘spawn’ or ‘forkserver’, as pointed out by @saulvargas. At the end of each epoch, the multiprocessing library complains about some leaked semaphores, which it releases, but the show goes on.
The main script should look like:
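Roughly, something like the following (a minimal sketch; `main()` is a placeholder for whatever builds the model and the `CustomSequence` objects and calls `fit_generator`):

```python
import multiprocessing

def main():
    # Build the model and the Sequence generators here, then call
    # model.fit_generator(..., use_multiprocessing=True, workers=4, ...)
    ...

if __name__ == '__main__':
    # Must be called once, before Keras spawns any worker pools.
    multiprocessing.set_start_method('spawn')  # 'forkserver' also works
    main()
```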
I am using Keras 2.3.30 with TensorFlow 1.13.1 (same results with the CPU and GPU versions alike) under Linux (Arch). My input pipeline uses only numpy, scipy.ndimage and pydicom. Everything is installed in a Python virtual environment.
The PR cited above, which has since been merged into the Keras master branch, did not help much. It simply uses a 30-second timeout to detect when the worker process pool hangs for some batch, prints a warning message, and falls back to sequential code for that batch.
The fix was designed for the case where the culprit for the hanging is the data-generating code in the worker processes for some batch, but that is not my case. When hanging does happen, it happens at the very beginning of an epoch, or when validation starts, i.e. when the worker process pool is spawned, and the workers never get to the actual data-generating code. They hang even before the initialization code in keras/utils/data_utils.py gets called. So with the fix in the PR, you end up with a sequential input pipeline that sleeps for 30 seconds before producing each batch.
We are running into this issue a lot after writing a subclass of keras.utils.Sequence which works with np.arrays. Our training runs hang on that same `self.queue.get(block=True).get()` in data_utils.py. We had the issue on 2.1.6 and 2.2.4.
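For reference, a minimal sketch of the kind of array-backed Sequence subclass in question (the class name and fields are illustrative, not our actual project code):

```python
import numpy as np
from keras.utils import Sequence

class ArraySequence(Sequence):
    """Serves mini-batches straight from in-memory numpy arrays."""

    def __init__(self, x, y, batch_size=32):
        self.x, self.y = x, y
        self.batch_size = batch_size

    def __len__(self):
        # Number of batches per epoch.
        return int(np.ceil(len(self.x) / self.batch_size))

    def __getitem__(self, idx):
        # Slice out one batch; the last batch may be smaller.
        sl = slice(idx * self.batch_size, (idx + 1) * self.batch_size)
        return self.x[sl], self.y[sl]
```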