ProcessPoolExecutor ends in an unrecoverable state when a worker is terminated abruptly
When a worker process from a ProcessPoolExecutor is terminated by an outside actor (e.g. Kubernetes, the OOM killer, or possibly a segfault), the pool enters a 'broken' state.
From then on, every attempt to execute a task in the pool executor raises a concurrent.futures.process.BrokenProcessPool
exception until the pool is stopped and re-created.
Expected Behavior
Either:
- The scheduler terminates completely, allowing something external to restart it and thereby restore the process pool, or
- the executor recovers itself.
Either way, new jobs can eventually be executed in the pool.
Current Behavior
No further jobs can be executed in the pool until manual intervention.
Steps to Reproduce
Here’s a pytest test that more or less replicates the issue:

```python
import os
import signal
import time
from unittest.mock import Mock

import pytest
from apscheduler import events
from apscheduler.executors.pool import ProcessPoolExecutor

# The `scheduler` fixture and the `null_op` job function are defined
# elsewhere in the test suite.


@pytest.fixture
def executor():
    return ProcessPoolExecutor()


def test_broken_pool(scheduler, executor):
    # Set up some event hooks so we can assert that things don't fail
    ev_mock = Mock()
    scheduler.add_listener(ev_mock.executed, events.EVENT_JOB_EXECUTED)
    scheduler.add_listener(ev_mock.error, events.EVENT_JOB_ERROR)
    scheduler.resume()

    scheduler.add_job(null_op, 'date')
    time.sleep(0.5)
    assert ev_mock.executed.call_count == 1
    assert ev_mock.error.call_count == 0

    # Terminate a process in the pool abruptly/externally
    pid = list(executor._pool._processes.keys())[0]
    os.kill(pid, signal.SIGKILL)
    time.sleep(0.5)

    # Re-schedule a job
    scheduler.add_job(null_op, 'date')
    time.sleep(0.5)
    assert ev_mock.executed.call_count == 2
    assert ev_mock.error.call_count == 0
```
Context (Environment)
```
Python 3.6.6 :: Anaconda, Inc.
$ pip freeze | grep "APS"
APScheduler==3.5.3
```
This issue was discovered when something (the actual cause is unknown) terminated one of the processes in the ProcessPool. Currently, the only mechanism we have to recover from the error is to restart the service running APScheduler.
Ideally, either the error should be raised to the parent process, or the executor should handle the fault itself.
Detailed Description
I have worked around the issue myself with the following, but it’s not the cleanest solution.
```python
from concurrent.futures import ProcessPoolExecutor
from concurrent.futures.process import BrokenProcessPool

from apscheduler.executors.pool import ProcessPoolExecutor as _PoolExecutor


class FixedPoolExecutor(_PoolExecutor):
    def __init__(self, max_workers=10):
        self._max_workers = max_workers
        super().__init__(max_workers)

    def _do_submit_job(self, job, run_times):
        try:
            return super()._do_submit_job(job, run_times)
        except BrokenProcessPool:
            self._logger.warning('Process pool is broken. Restarting executor.')
            self._pool.shutdown(wait=True)
            self._pool = ProcessPoolExecutor(int(self._max_workers))
            return super()._do_submit_job(job, run_times)
```
Issue Analytics
- Created 5 years ago
- Reactions: 11
- Comments: 6 (2 by maintainers)
Top GitHub Comments
I have also discovered there is another situation where a BrokenProcessPool exception is raised but no events are triggered: when the process is terminated while the job is running. My solution to this is a more complex work-around based on @xlevus’s code.

I’m afraid not. A proper solution is coming in v4.0 (not yet available in 4.0.0a2 but likely in 4.0.0a3).