ProcessPoolExecutor ends in an unrecoverable state when a worker is terminated abruptly

When a worker process from a ProcessPoolExecutor is terminated by an outside actor (e.g. Kubernetes, the OOM killer, or possibly a segfault), the pool enters a ‘broken’ state.

From then on, every attempt to execute a task in the pool executor raises a concurrent.futures.process.BrokenProcessPool exception until the pool is stopped and re-created.
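The broken state described above can be reproduced with the standard library alone. The following is a minimal sketch, not code from the report: the function name `demonstrate_broken_pool` is hypothetical, it assumes a POSIX system (it uses `signal.SIGKILL`), and it pins the `fork` start method purely so the example behaves the same regardless of the platform default.

```python
import multiprocessing
import os
import signal
import time
from concurrent.futures import ProcessPoolExecutor
from concurrent.futures.process import BrokenProcessPool


def demonstrate_broken_pool():
    """Kill the only worker, then report whether the next job hits BrokenProcessPool.

    Hypothetical helper for illustration only.
    """
    # Assumption: POSIX with the "fork" start method available.
    ctx = multiprocessing.get_context("fork")
    pool = ProcessPoolExecutor(max_workers=1, mp_context=ctx)
    try:
        # Ask the worker for its own PID so no private attributes are needed.
        worker_pid = pool.submit(os.getpid).result(timeout=10)

        # Simulate an external actor (OOM killer, orchestrator) killing the worker.
        os.kill(worker_pid, signal.SIGKILL)
        time.sleep(0.5)

        try:
            # Either submit() or result() raises once the pool is marked broken.
            pool.submit(os.getpid).result(timeout=10)
        except BrokenProcessPool:
            return True
        return False
    finally:
        pool.shutdown(wait=False)
```

The pool never replaces the dead worker on its own; the only way forward is to discard the executor and build a new one.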

Expected Behavior

Either:

  1. Scheduler terminates completely, allowing something external to restart it, restoring the state of the process pool,
  2. Executor recovers itself.

Either way, new jobs should ultimately be able to execute in the pool.

Current Behavior

No further jobs can be executed in the pool until manual intervention.

Steps to Reproduce

Here’s a pytest-test that more or less replicates the issue:

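import os
import signal
import time
from unittest.mock import Mock

import pytest
from apscheduler import events
from apscheduler.executors.pool import ProcessPoolExecutor

# `scheduler` and `null_op` are not shown in the original report; they are assumed to be
# a scheduler fixture and a module-level no-op function defined elsewhere in the test suite.

@pytest.fixture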
def executor():
    return ProcessPoolExecutor()

def test_broken_pool(scheduler, executor):
    # Set up some event hooks so we can assert that things don't fail
    ev_mock = Mock()
    scheduler.add_listener(ev_mock.executed, events.EVENT_JOB_EXECUTED)
    scheduler.add_listener(ev_mock.error, events.EVENT_JOB_ERROR)

    scheduler.resume()
    scheduler.add_job(null_op, 'date')

    time.sleep(0.5)
    assert ev_mock.executed.call_count == 1
    assert ev_mock.error.call_count == 0

    # Terminate a process in the pool abruptly/externally
    pid = list(executor._pool._processes.keys())[0]
    os.kill(pid, signal.SIGKILL)
    time.sleep(0.5)

    # Re-schedule a job
    scheduler.add_job(null_op, 'date')

    time.sleep(0.5)
    assert ev_mock.executed.call_count == 2
    assert ev_mock.error.call_count == 0

Context (Environment)

Python 3.6.6 :: Anaconda, Inc.
$ pip freeze | grep "APS"
APScheduler==3.5.3

This issue was discovered when something (actual cause is unknown) terminates one of the processes in the ProcessPool. The only mechanism we have (currently) to recover from the error is to restart the service running APScheduler.

Ideally, either the error should be raised to the parent process, or the executor handles the fault itself.

Detailed Description

I have worked around the issue myself with the following, but it’s not the cleanest solution.


from concurrent.futures import ProcessPoolExecutor
from concurrent.futures.process import BrokenProcessPool
from apscheduler.executors.pool import ProcessPoolExecutor as _PoolExecutor


class FixedPoolExecutor(_PoolExecutor):
    def __init__(self, max_workers=10):
        self._max_workers = max_workers
        super().__init__(max_workers)

    def _do_submit_job(self, job, run_times):
        try:
            return super()._do_submit_job(job, run_times)
        except BrokenProcessPool:
            self._logger.warning('Process pool is broken. Restarting executor.')
            self._pool.shutdown(wait=True)
            self._pool = ProcessPoolExecutor(int(self._max_workers))

            return super()._do_submit_job(job, run_times)
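
The workaround above depends on APScheduler's private `_pool` attribute and `_do_submit_job` hook. The same recreate-and-retry idea can be sketched against the standard library only, which makes it easy to try in isolation. The class name `SelfHealingPool` and its methods are hypothetical, and the sketch assumes a POSIX system with the `fork` start method.

```python
import multiprocessing
import os
import signal
import time
from concurrent.futures import ProcessPoolExecutor
from concurrent.futures.process import BrokenProcessPool


class SelfHealingPool:
    """Hypothetical wrapper that rebuilds the pool once when submit() finds it broken."""

    def __init__(self, max_workers=2):
        self._max_workers = max_workers
        # Assumption: "fork" is pinned only to keep the sketch platform-independent on POSIX.
        self._ctx = multiprocessing.get_context("fork")
        self._pool = self._new_pool()

    def _new_pool(self):
        return ProcessPoolExecutor(self._max_workers, mp_context=self._ctx)

    def submit(self, fn, *args, **kwargs):
        try:
            return self._pool.submit(fn, *args, **kwargs)
        except BrokenProcessPool:
            # Discard the broken executor and retry exactly once on a fresh one.
            self._pool.shutdown(wait=True)
            self._pool = self._new_pool()
            return self._pool.submit(fn, *args, **kwargs)

    def shutdown(self):
        self._pool.shutdown(wait=True)
```

A possible usage, again only as an illustration:

```python
pool = SelfHealingPool(max_workers=1)
old_pid = pool.submit(os.getpid).result(timeout=10)
os.kill(old_pid, signal.SIGKILL)   # simulate an external kill
time.sleep(1.0)                    # let the pool notice the dead worker
new_pid = pool.submit(os.getpid).result(timeout=10)  # served by a new worker
pool.shutdown()
```

Like the workaround, this only recovers when the failure surfaces at submission time; a job that is already running when its worker dies still fails.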

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Reactions: 11
  • Comments: 6 (2 by maintainers)

Top GitHub Comments

2 reactions
pdscopes commented, Mar 16, 2020

I have also discovered another situation where a BrokenProcessPool exception is raised but no events are triggered: when the process is terminated while the job is running. My solution is a more complex workaround based on @xlevus’s code:

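import apscheduler.events
from concurrent.futures import ProcessPoolExecutor
from concurrent.futures.process import BrokenProcessPool
from apscheduler.executors.pool import ProcessPoolExecutor as _ProcessPoolExecutor

# dt_now is not defined in the original comment; it is assumed to be a helper
# that returns the current datetime.

class FixedProcessPoolExecutor(_ProcessPoolExecutor):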
    def __init__(self, max_workers=10):
        self._max_workers = max_workers
        self._store = dict()
        super().__init__(max_workers)

    def _do_submit_job(self, job, run_times):
        try:
            self._store[job.id] = (job._jobstore_alias, run_times)
            return super()._do_submit_job(job, run_times)
        except BrokenProcessPool:
            self._logger.warning('Process pool is broken. Restarting executor.')
            self._pool.shutdown(wait=True)
            self._pool = ProcessPoolExecutor(int(self._max_workers))

            return super()._do_submit_job(job, run_times)

    def _run_job_success(self, job_id, events):
        # Call to handle job success as normal
        super()._run_job_success(job_id, events)

        # Tidy up the store
        self._store.pop(job_id, None)

    def _run_job_error(self, job_id, exc, traceback=None):
        # Call to handle job error as normal
        super()._run_job_error(job_id, exc, traceback)

        # Fire an event to say the job failed
        jobstore, run_times = self._store.get(job_id, ('default', [dt_now()]))
        event = apscheduler.events.JobExecutionEvent(apscheduler.events.EVENT_JOB_ERROR, job_id,
                                                     jobstore, run_times[0],
                                                     exception=exc, traceback=traceback)
        self._scheduler._dispatch_event(event)

        # If this was a BrokenProcessPool exception
        if isinstance(exc, BrokenProcessPool):
            self._logger.warning('Process pool broke during execution. Restarting executor.')
            self._pool.shutdown(wait=True)
            self._pool = ProcessPoolExecutor(int(self._max_workers))

        # Tidy up the store
        self._store.pop(job_id, None)
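
The mid-job case handled in `_run_job_error` above can be observed with plain `concurrent.futures`: when a worker dies while a job is in flight, `submit()` has already returned, so the failure only shows up on the future itself. The sketch below is illustrative; `kill_worker_mid_job` is a hypothetical name, and it assumes a POSIX system with the `fork` start method.

```python
import multiprocessing
import os
import signal
import time
from concurrent.futures import ProcessPoolExecutor
from concurrent.futures.process import BrokenProcessPool


def kill_worker_mid_job():
    """Return the exception stored on a future whose worker was killed mid-run.

    Hypothetical helper for illustration only.
    """
    ctx = multiprocessing.get_context("fork")  # assumption: POSIX
    pool = ProcessPoolExecutor(max_workers=1, mp_context=ctx)
    try:
        worker_pid = pool.submit(os.getpid).result(timeout=10)

        # Start a long-running job on the single worker; submit() succeeds.
        future = pool.submit(time.sleep, 30)
        time.sleep(0.5)  # give the worker time to pick the job up

        # Kill the worker while the job is in flight.
        os.kill(worker_pid, signal.SIGKILL)

        # The failure is reported through the future, not through submit().
        return future.exception(timeout=10)
    finally:
        pool.shutdown(wait=False)
```

This is why wrapping only the submission call is not enough: recovery also has to be triggered from wherever the future's outcome is inspected, such as an error callback.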

1 reaction
agronholm commented, Nov 16, 2022

I’m afraid not. A proper solution is coming in v4.0 (not yet available in 4.0.0a2 but likely in 4.0.0a3).
