Flaky test, 10-minute timeout on Travis: `inputs = self.queue.get(block=True).get()` hangs
This issue is a discussion to track the 10-minute timeout bug on Travis. The test suite sometimes hangs there, and Travis reports that no output has been received for the last 10 minutes. By adding pytest-timeout to our test suite, we were able to get a traceback of the problem. Note that it happens most of the time with TensorFlow, but sometimes with Theano and CNTK as well. Here is the traceback:
```
=================================== FAILURES ===================================
________________________________ test_warnings _________________________________
[gw0] linux -- Python 3.6.6 /home/travis/miniconda/envs/test-environment/bin/python

    @pytest.mark.skipif(sys.version_info < (3,),
                        reason='Cannot catch warnings in python 2')
    def test_warnings():
        a = Input(shape=(3,), name='input_a')
        b = Input(shape=(3,), name='input_b')
        a_2 = Dense(4, name='dense_1')(a)
        dp = Dropout(0.5, name='dropout')
        b_2 = dp(b)
        model = Model([a, b], [a_2, b_2])
        optimizer = 'rmsprop'
        loss = 'mse'
        loss_weights = [1., 0.5]
        model.compile(optimizer, loss, metrics=[], loss_weights=loss_weights,
                      sample_weight_mode=None)

        @threadsafe_generator
        def gen_data(batch_sz):
            while True:
                yield ([np.random.random((batch_sz, 3)),
                        np.random.random((batch_sz, 3))],
                       [np.random.random((batch_sz, 4)),
                        np.random.random((batch_sz, 3))])

        with pytest.warns(Warning) as w:
            out = model.fit_generator(gen_data(4),
                                      steps_per_epoch=10,
                                      use_multiprocessing=True,
>                                     workers=2)

tests/keras/engine/test_training.py:604:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
keras/legacy/interfaces.py:91: in wrapper
    return func(*args, **kwargs)
keras/engine/training.py:1418: in fit_generator
    initial_epoch=initial_epoch)
keras/engine/training_generator.py:179: in fit_generator
    generator_output = next(output_generator)
keras/utils/data_utils.py:685: in get
    inputs = self.queue.get(block=True).get()
../../../miniconda/envs/test-environment/lib/python3.6/multiprocessing/pool.py:638: in get
    self.wait(timeout)
../../../miniconda/envs/test-environment/lib/python3.6/multiprocessing/pool.py:635: in wait
    self._event.wait(timeout)
../../../miniconda/envs/test-environment/lib/python3.6/threading.py:551: in wait
    signaled = self._cond.wait(timeout)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <Condition(<unlocked _thread.lock object at 0x7fd35883ce18>, 0)>
timeout = None

    def wait(self, timeout=None):
        """Wait until notified or until a timeout occurs.

        If the calling thread has not acquired the lock when this method is
        called, a RuntimeError is raised.

        This method releases the underlying lock, and then blocks until it is
        awakened by a notify() or notify_all() call for the same condition
        variable in another thread, or until the optional timeout occurs. Once
        awakened or timed out, it re-acquires the lock and returns.

        When the timeout argument is present and not None, it should be a
        floating point number specifying a timeout for the operation in seconds
        (or fractions thereof).

        When the underlying lock is an RLock, it is not released using its
        release() method, since this may not actually unlock the lock when it
        was acquired multiple times recursively. Instead, an internal interface
        of the RLock class is used, which really unlocks it even when it has
        been recursively acquired several times. Another internal interface is
        then used to restore the recursion level when the lock is reacquired.

        """
        if not self._is_owned():
            raise RuntimeError("cannot wait on un-acquired lock")
        waiter = _allocate_lock()
        waiter.acquire()
        self._waiters.append(waiter)
        saved_state = self._release_save()
        gotit = False
        try:    # restore state no matter what (e.g., KeyboardInterrupt)
            if timeout is None:
>               waiter.acquire()
E               Failed: Timeout >720.0s

../../../miniconda/envs/test-environment/lib/python3.6/threading.py:295: Failed
----------------------------- Captured stdout call -----------------------------
Epoch 1/1
----------------------------- Captured stderr call -----------------------------
+++++++++++++++++++++++++++++++++++ Timeout ++++++++++++++++++++++++++++++++++++
```
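For reference, the traceback above was produced by pytest-timeout, which aborts and reports a test once it exceeds a fixed wall-clock limit. A minimal sketch of how it can be enabled, assuming a 720-second limit to match the `Timeout >720.0s` line above (the test name here is hypothetical):

```python
import time

import pytest


# pytest-timeout exposes a per-test marker; the suite-wide equivalent
# is `timeout = 720` in pytest.ini or `pytest --timeout=720` on the
# command line.
@pytest.mark.timeout(720)
def test_something_slow():
    time.sleep(1)  # stand-in for real work; fails if it exceeds 720s
```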
To see the full logs, here are two builds that failed with this error:
https://travis-ci.org/keras-team/keras/jobs/444740251
https://travis-ci.org/keras-team/keras/jobs/441817913
This is likely linked to issue #11288 as well.
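The hang itself comes from `multiprocessing.pool.AsyncResult.get()` being called without a timeout: if the worker process that owns the promise dies, the result is never set and the call blocks forever. A minimal, self-contained sketch of the failure mode and of how a timeout would surface it (illustrative only, not Keras code):

```python
import multiprocessing
import time


def slow_batch(x):
    # Simulate a worker that never answers (e.g. a crashed process).
    time.sleep(3600)
    return x


if __name__ == '__main__':
    with multiprocessing.Pool(2) as pool:
        promise = pool.apply_async(slow_batch, (1,))
        # Calling promise.get() with no timeout, as data_utils.py does,
        # blocks forever if the worker never delivers. With a timeout,
        # the failure becomes visible instead.
        try:
            promise.get(timeout=5)
        except multiprocessing.TimeoutError:
            print('worker did not answer within 5s')
```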
Any news here? I found out that in my case it always hangs here: https://github.com/keras-team/keras/blob/d6b5c5ebb410e3366c9d7aca41977a60134bfe10/keras/engine/training_generator.py#L181
We could add a mechanism such that after 5 seconds we drop the promise if it's not done (see the sketch below). We would lose a sample, but with proper warnings it should be easy to understand. Or maybe requeue the dropped samples at the end of the epoch?
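A minimal sketch of that mechanism, assuming a queue of `AsyncResult` promises like the one the enqueuer builds; the names here are illustrative, not the actual Keras implementation:

```python
import queue
import warnings
from multiprocessing import TimeoutError


def iter_batches(promise_queue, timeout=5):
    """Yield batches from a queue of AsyncResult promises, dropping any
    promise that is not done within `timeout` seconds."""
    while True:
        try:
            promise = promise_queue.get(block=True, timeout=timeout)
        except queue.Empty:
            return  # no more promises to consume
        try:
            yield promise.get(timeout=timeout)
        except TimeoutError:
            # Drop the sample rather than blocking forever; an
            # alternative would be to requeue the promise and retry it
            # at the end of the epoch.
            warnings.warn('A worker took more than %ds to produce a '
                          'batch; the sample was dropped.' % timeout)
```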
I’ll try to propose something this weekend.
One cool thing is that the pool is restarted at each epoch, so if we lose a worker in the middle of an epoch, we will use N-1 workers for the rest of that epoch and then jump back up to N workers afterwards.
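A toy illustration of that per-epoch restart behaviour, again with hypothetical names rather than the real training loop:

```python
from multiprocessing import Pool


def process_batch(batch):
    return batch * 2  # stand-in for producing one batch of data


def run(batches, epochs, workers=2):
    for epoch in range(epochs):
        # A fresh pool each epoch: a worker lost mid-epoch only degrades
        # that epoch (N-1 workers); the next epoch starts with N again.
        with Pool(workers) as pool:
            results = pool.map(process_batch, batches)
        print('epoch %d done with %d batches' % (epoch, len(results)))


if __name__ == '__main__':
    run(list(range(8)), epochs=2)
```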