loky TerminatedWorkerError
See original GitHub issue

I run a script with a parallel loop in it every 5 minutes to update work assignments stored in the cloud. The updates need to happen in near real time, so the script is set up in the Task Scheduler on a Windows server. The script has to be very fault tolerant because an entire department runs on it.
I use Anaconda3 and have seen this error with joblib versions 0.12.5 and 0.13.0.
Roughly every other run of the script throws this exception (calls to my own functions removed):
File "C:\Workforce Scripts\MU_Sync.py", line 187, in syncCounty
Parallel(n_jobs=n_jobs)(delayed(syncWorkOrderId)(workOrderId) for workOrderId in workOrderIds)
File "C:\ProgramData\Anaconda3\lib\site-packages\joblib\parallel.py", line 930, in __call__
self.retrieve()
File "C:\ProgramData\Anaconda3\lib\site-packages\joblib\parallel.py", line 833, in retrieve
self._output.extend(job.get(timeout=self.timeout))
File "C:\ProgramData\Anaconda3\lib\site-packages\joblib\_parallel_backends.py", line 521, in wrap_future_result
return future.result(timeout=timeout)
File "C:\ProgramData\Anaconda3\lib\concurrent\futures\_base.py", line 432, in result
return self.__get_result()
File "C:\ProgramData\Anaconda3\lib\concurrent\futures\_base.py", line 384, in __get_result
raise self._exception
File "C:\ProgramData\Anaconda3\lib\site-packages\joblib\externals\loky\_base.py", line 625, in _invoke_callbacks
callback(self)
File "C:\ProgramData\Anaconda3\lib\site-packages\joblib\parallel.py", line 309, in __call__
self.parallel.dispatch_next()
File "C:\ProgramData\Anaconda3\lib\site-packages\joblib\parallel.py", line 731, in dispatch_next
if not self.dispatch_one_batch(self._original_iterator):
File "C:\ProgramData\Anaconda3\lib\site-packages\joblib\parallel.py", line 759, in dispatch_one_batch
self._dispatch(tasks)
File "C:\ProgramData\Anaconda3\lib\site-packages\joblib\parallel.py", line 716, in _dispatch
job = self._backend.apply_async(batch, callback=cb)
File "C:\ProgramData\Anaconda3\lib\site-packages\joblib\_parallel_backends.py", line 510, in apply_async
future = self._workers.submit(SafeFunction(func))
File "C:\ProgramData\Anaconda3\lib\site-packages\joblib\externals\loky\reusable_executor.py", line 151, in submit
fn, *args, **kwargs)
File "C:\ProgramData\Anaconda3\lib\site-packages\joblib\externals\loky\process_executor.py", line 1022, in submit
raise self._flags.broken
joblib.externals.loky.process_executor.TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker.
I could not find a reference to this issue anywhere else. Can someone help me figure out what is causing it so we can fix it? Let me know what else you need me to attach.
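Not part of the original report, but since the job is scheduled and must tolerate occasional worker crashes, one common workaround is to retry the whole Parallel call when the executor reports a dead worker. A minimal sketch under that assumption (the square task, retry count, and delay are illustrative placeholders, not from the issue):

```python
import time
from joblib import Parallel, delayed
from joblib.externals.loky.process_executor import TerminatedWorkerError

def square(x):
    return x * x

def run_with_retries(items, n_jobs=2, retries=3, delay=1):
    """Retry the whole Parallel call if a worker dies unexpectedly."""
    for attempt in range(retries):
        try:
            return Parallel(n_jobs=n_jobs)(delayed(square)(i) for i in items)
        except TerminatedWorkerError:
            if attempt == retries - 1:
                raise  # give up after the last attempt
            time.sleep(delay)  # back off before rebuilding the worker pool

results = run_with_retries(range(5))  # [0, 1, 4, 9, 16]
```

Retrying does not address the underlying crash (segfault or OOM kill), but it keeps a scheduled job limping along while the root cause is investigated.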
Issue Analytics
- Created 5 years ago
- Reactions: 3
- Comments: 13 (4 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
I solved this problem by setting max_nbytes to a large number, e.g. max_nbytes=5000, like:

Parallel(n_jobs=n_cpu, max_nbytes=5000)
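A runnable sketch of that suggestion (my_func and n_cpu are placeholders for the commenter's real work function and worker count; note that max_nbytes also accepts strings like '1M' and that None disables memmapping entirely):

```python
from joblib import Parallel, delayed

def my_func(x):
    # Placeholder for the real per-item work.
    return x + 1

n_cpu = 2  # placeholder worker count

# max_nbytes is the array-size threshold (in bytes) above which joblib
# memmaps worker arguments to a temporary file instead of pickling them.
results = Parallel(n_jobs=n_cpu, max_nbytes=5000)(
    delayed(my_func)(i) for i in range(4)
)
```

Whether this helps depends on whether the crash is triggered by memmapped inputs, as the comment further down suggests.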
I patched sklearn’s _bagging.py / _parallel_predict_proba() to use faulthandler to dump tracebacks to a file so that the worker processes that die could output some information.
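A minimal sketch of that faulthandler approach, applied to an ordinary joblib task rather than the sklearn code the commenter patched (the task function is illustrative):

```python
import faulthandler
from joblib import Parallel, delayed

def task(x):
    # Enable faulthandler inside the worker so that a segfault in native
    # code dumps a Python-level traceback before the process dies.
    # In practice, pass file=open(path, "w") so the dump survives even
    # when the worker's stderr is lost with the process.
    faulthandler.enable()
    return x * 2

results = Parallel(n_jobs=2)(delayed(task)(i) for i in range(4))
```

Because loky workers are separate processes, faulthandler must be enabled inside the task (or via the PYTHONFAULTHANDLER environment variable); enabling it only in the parent does not cover the children.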
Fatal Python error: Segmentation fault

Current thread 0x00000001113b6000 (most recent call first):
  File "/Volumes/Phil/projects/research/holistic/venv/lib/python3.8/site-packages/sklearn/svm/_base.py", line 361 in _dense_predict
  File "/Volumes/Phil/projects/research/holistic/venv/lib/python3.8/site-packages/sklearn/svm/_base.py", line 344 in predict
  File "/Volumes/Phil/projects/research/holistic/venv/lib/python3.8/site-packages/sklearn/svm/_base.py", line 624 in predict
  File "/Volumes/Phil/projects/research/holistic/venv/lib/python3.8/site-packages/sklearn/ensemble/_bagging.py", line 145 in _parallel_predict_proba
  File "/Volumes/Phil/projects/research/holistic/venv/lib/python3.8/site-packages/sklearn/utils/fixes.py", line 222 in __call__
  File "/Volumes/Phil/projects/research/holistic/venv/lib/python3.8/site-packages/joblib/parallel.py", line 262 in <listcomp>
  File "/Volumes/Phil/projects/research/holistic/venv/lib/python3.8/site-packages/joblib/parallel.py", line 262 in __call__
  File "/Volumes/Phil/projects/research/holistic/venv/lib/python3.8/site-packages/joblib/_parallel_backends.py", line 595 in __call__
  File "/Volumes/Phil/projects/research/holistic/venv/lib/python3.8/site-packages/joblib/externals/loky/process_executor.py", line 285 in __call__
  File "/Volumes/Phil/projects/research/holistic/venv/lib/python3.8/site-packages/joblib/externals/loky/process_executor.py", line 431 in _process_worker
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/process.py", line 108 in run
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/process.py", line 315 in _bootstrap
  File "/Volumes/Phil/projects/research/holistic/venv/lib/python3.8/site-packages/joblib/externals/loky/backend/popen_loky_posix.py", line 203 in <module>
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/runpy.py", line 87 in _run_code
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/runpy.py", line 194 in _run_module_as_main
So the fundamental problem looks to be in sklearn's SVM code: the train/test data I am using is sparse (I think), so seeing the last frame refer to _dense_predict gives me a clue as to what is happening. I assume the segfault shouldn't happen at all, so this is not an issue with joblib per se; it (with loky) is working as intended. With multiprocessing, I guess both child processes fall over and execution gets suspended.
Hopefully this write-up helps someone else determine where their segfault is happening and why.
UPDATE: I thought of something and created a small test program to reproduce the problem. Whether the ndarrays being passed around are memmapped or not seems to be the real factor: I can make different test datasets break the sklearn code (or not) just by varying the max_nbytes parameter of Parallel. With it set to None, even the largest test datasets work, and as quickly as I'd expect. So this is more a joblib issue than an sklearn one. I will create a joblib issue for this if I don't find anything that seems related.
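The memmapping behaviour described in the update can be observed directly. This sketch (not the commenter's actual test program) reports what type each worker actually receives, assuming joblib's default memmapping threshold of 1 MB:

```python
import numpy as np
from joblib import Parallel, delayed

def arg_type(arr):
    # Report how the argument arrived in the worker process.
    return type(arr).__name__

big = np.zeros(2_000_000)  # ~16 MB, well above the default 1 MB threshold

# Default max_nbytes ('1M'): large arrays are memmapped to a temp file,
# so workers see numpy.memmap objects.
with_memmap = Parallel(n_jobs=2)(delayed(arg_type)(big) for _ in range(2))

# max_nbytes=None disables memmapping: workers see plain ndarrays.
without = Parallel(n_jobs=2, max_nbytes=None)(delayed(arg_type)(big) for _ in range(2))
```

If code only misbehaves in the first configuration, that points at a memmap-handling problem rather than the computation itself, which matches the update's conclusion.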