loky TerminatedWorkerError
See original GitHub issue

I run a script with a parallel loop in it every 5 minutes to update work assignments stored in the cloud. The updates need to happen in near real time, so the script is set up in the Task Scheduler on a Windows server. The script has to be very fault tolerant because an entire department runs on it.
I use Anaconda3 and have seen this error with joblib versions 0.12.5 and 0.13.0.
Roughly every other run of the script throws this exception (calls to my own functions removed):
File "C:\Workforce Scripts\MU_Sync.py", line 187, in syncCounty
Parallel(n_jobs=n_jobs)(delayed(syncWorkOrderId)(workOrderId) for workOrderId in workOrderIds)
File "C:\ProgramData\Anaconda3\lib\site-packages\joblib\parallel.py", line 930, in __call__
self.retrieve()
File "C:\ProgramData\Anaconda3\lib\site-packages\joblib\parallel.py", line 833, in retrieve
self._output.extend(job.get(timeout=self.timeout))
File "C:\ProgramData\Anaconda3\lib\site-packages\joblib\_parallel_backends.py", line 521, in wrap_future_result
return future.result(timeout=timeout)
File "C:\ProgramData\Anaconda3\lib\concurrent\futures\_base.py", line 432, in result
return self.__get_result()
File "C:\ProgramData\Anaconda3\lib\concurrent\futures\_base.py", line 384, in __get_result
raise self._exception
File "C:\ProgramData\Anaconda3\lib\site-packages\joblib\externals\loky\_base.py", line 625, in _invoke_callbacks
callback(self)
File "C:\ProgramData\Anaconda3\lib\site-packages\joblib\parallel.py", line 309, in __call__
self.parallel.dispatch_next()
File "C:\ProgramData\Anaconda3\lib\site-packages\joblib\parallel.py", line 731, in dispatch_next
if not self.dispatch_one_batch(self._original_iterator):
File "C:\ProgramData\Anaconda3\lib\site-packages\joblib\parallel.py", line 759, in dispatch_one_batch
self._dispatch(tasks)
File "C:\ProgramData\Anaconda3\lib\site-packages\joblib\parallel.py", line 716, in _dispatch
job = self._backend.apply_async(batch, callback=cb)
File "C:\ProgramData\Anaconda3\lib\site-packages\joblib\_parallel_backends.py", line 510, in apply_async
future = self._workers.submit(SafeFunction(func))
File "C:\ProgramData\Anaconda3\lib\site-packages\joblib\externals\loky\reusable_executor.py", line 151, in submit
fn, *args, **kwargs)
File "C:\ProgramData\Anaconda3\lib\site-packages\joblib\externals\loky\process_executor.py", line 1022, in submit
raise self._flags.broken
joblib.externals.loky.process_executor.TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker.
I could not find a reference to this issue anywhere else. Can someone help me figure out what is causing it so we can fix it? Let me know what else you need me to attach.
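Not part of the original report, but since the job is scheduled and must tolerate occasional worker crashes, one common workaround is to retry the whole Parallel call when the executor reports a dead worker. A minimal sketch under that assumption (the square task, retry count, and delay are illustrative placeholders, not from the issue):

```python
import time
from joblib import Parallel, delayed
from joblib.externals.loky.process_executor import TerminatedWorkerError

def square(x):
    return x * x

def run_with_retries(items, n_jobs=2, retries=3, delay=1):
    """Retry the whole Parallel call if a worker dies unexpectedly."""
    for attempt in range(retries):
        try:
            return Parallel(n_jobs=n_jobs)(delayed(square)(i) for i in items)
        except TerminatedWorkerError:
            if attempt == retries - 1:
                raise  # give up after the last attempt
            time.sleep(delay)  # back off before rebuilding the worker pool

results = run_with_retries(range(5))  # [0, 1, 4, 9, 16]
```

Retrying does not address the underlying crash (segfault or OOM kill), but it keeps a scheduled job limping along while the root cause is investigated.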
Issue Analytics
- Created 5 years ago
- Reactions: 3
- Comments: 13 (4 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
I solved this problem by setting max_nbytes to a large number, e.g. max_nbytes=5000, like:

Parallel(n_jobs=n_cpu, max_nbytes=5000)
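A runnable sketch of that suggestion (my_func and n_cpu are placeholders for the commenter's real work function and worker count; note that max_nbytes also accepts strings like '1M' and that None disables memmapping entirely):

```python
from joblib import Parallel, delayed

def my_func(x):
    # Placeholder for the real per-item work.
    return x + 1

n_cpu = 2  # placeholder worker count

# max_nbytes is the array-size threshold (in bytes) above which joblib
# memmaps worker arguments to a temporary file instead of pickling them.
results = Parallel(n_jobs=n_cpu, max_nbytes=5000)(
    delayed(my_func)(i) for i in range(4)
)
```

Whether this helps depends on whether the crash is triggered by memmapped inputs, as the comment further down suggests.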
I patched sklearn’s _bagging.py / _parallel_predict_proba() to use faulthandler to dump tracebacks to a file so that the worker processes that die could output some information.
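A minimal sketch of that faulthandler approach, applied to an ordinary joblib task rather than the sklearn code the commenter patched (the task function is illustrative):

```python
import faulthandler
from joblib import Parallel, delayed

def task(x):
    # Enable faulthandler inside the worker so that a segfault in native
    # code dumps a Python-level traceback before the process dies.
    # In practice, pass file=open(path, "w") so the dump survives even
    # when the worker's stderr is lost with the process.
    faulthandler.enable()
    return x * 2

results = Parallel(n_jobs=2)(delayed(task)(i) for i in range(4))
```

Because loky workers are separate processes, faulthandler must be enabled inside the task (or via the PYTHONFAULTHANDLER environment variable); enabling it only in the parent does not cover the children.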
Fatal Python error: Segmentation fault

Current thread 0x00000001113b6000 (most recent call first):
  File "/Volumes/Phil/projects/research/holistic/venv/lib/python3.8/site-packages/sklearn/svm/_base.py", line 361 in _dense_predict
  File "/Volumes/Phil/projects/research/holistic/venv/lib/python3.8/site-packages/sklearn/svm/_base.py", line 344 in predict
  File "/Volumes/Phil/projects/research/holistic/venv/lib/python3.8/site-packages/sklearn/svm/_base.py", line 624 in predict
  File "/Volumes/Phil/projects/research/holistic/venv/lib/python3.8/site-packages/sklearn/ensemble/_bagging.py", line 145 in _parallel_predict_proba
  File "/Volumes/Phil/projects/research/holistic/venv/lib/python3.8/site-packages/sklearn/utils/fixes.py", line 222 in __call__
  File "/Volumes/Phil/projects/research/holistic/venv/lib/python3.8/site-packages/joblib/parallel.py", line 262 in <listcomp>
  File "/Volumes/Phil/projects/research/holistic/venv/lib/python3.8/site-packages/joblib/parallel.py", line 262 in __call__
  File "/Volumes/Phil/projects/research/holistic/venv/lib/python3.8/site-packages/joblib/_parallel_backends.py", line 595 in __call__
  File "/Volumes/Phil/projects/research/holistic/venv/lib/python3.8/site-packages/joblib/externals/loky/process_executor.py", line 285 in __call__
  File "/Volumes/Phil/projects/research/holistic/venv/lib/python3.8/site-packages/joblib/externals/loky/process_executor.py", line 431 in _process_worker
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/process.py", line 108 in run
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/process.py", line 315 in _bootstrap
  File "/Volumes/Phil/projects/research/holistic/venv/lib/python3.8/site-packages/joblib/externals/loky/backend/popen_loky_posix.py", line 203 in <module>
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/runpy.py", line 87 in _run_code
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/runpy.py", line 194 in _run_module_as_main
So the fundamental problem looks to be in sklearn's SVM code: the train/test data I am using is sparse (I think), so seeing the last frame refer to _dense_predict gives me a clue as to what is happening. I assume the segfault shouldn't happen at all, so this is not an issue with joblib per se; it (with loky) is working as intended. With multiprocessing, I guess both child processes fall over and execution gets suspended.
Hopefully this write-up helps someone else determine where their segfault is happening and why.
UPDATE: I thought of something and created a small test program to reproduce the problem. Whether the ndarrays being passed around are memmapped or not seems to be the real factor: I can make different test datasets break the sklearn code (or not) just by varying the max_nbytes parameter of Parallel. With it set to None, even the largest test datasets work, and as quickly as I'd expect. So this is more a joblib issue than an sklearn one. I will create a joblib issue for this if I don't find anything that seems related.
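The memmapping behaviour described in the update can be observed directly. This sketch (not the commenter's actual test program) reports what type each worker actually receives, assuming joblib's default memmapping threshold of 1 MB:

```python
import numpy as np
from joblib import Parallel, delayed

def arg_type(arr):
    # Report how the argument arrived in the worker process.
    return type(arr).__name__

big = np.zeros(2_000_000)  # ~16 MB, well above the default 1 MB threshold

# Default max_nbytes ('1M'): large arrays are memmapped to a temp file,
# so workers see numpy.memmap objects.
with_memmap = Parallel(n_jobs=2)(delayed(arg_type)(big) for _ in range(2))

# max_nbytes=None disables memmapping: workers see plain ndarrays.
without = Parallel(n_jobs=2, max_nbytes=None)(delayed(arg_type)(big) for _ in range(2))
```

If code only misbehaves in the first configuration, that points at a memmap-handling problem rather than the computation itself, which matches the update's conclusion.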