Interpreter hangs when running parallel function in subprocess with workqueue backend
Reporting a bug
- I have tried using the latest released version of Numba (most recent is visible in the change log: https://github.com/numba/numba/blob/main/CHANGE_LOG).
 - I have included a self-contained code sample to reproduce the problem, i.e. it's possible to run as 'python bug.py'.
 
Description
The Python interpreter hangs when using Numba in the following way:
- Select the "forksafe" workqueue threading backend
- Run a Numba parallel function
- Start a subprocess (e.g. using Python multiprocessing)
- Run a Numba parallel function in the subprocess
 
If I understand correctly, this pattern should be allowed when selecting a forksafe threading layer. Is this correct? Or is one only allowed to run parallel functions in a subprocess, but not mix parent and subprocess work?
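For context, my reading of the Numba docs is that a threading layer can be requested either by a specific name (workqueue, omp, tbb) or by a safety category such as 'forksafe', and that numba.threading_layer() reports which layer was actually chosen. A minimal sketch of what I mean (the layer Numba ends up selecting depends on what is installed):

import numba
# Ask for any layer documented as safe to use with fork();
# a concrete name such as 'workqueue' can be used instead.
numba.config.THREADING_LAYER = 'forksafe'

import numpy as np

@numba.njit(parallel=True)
def double(a):
    return 2 * a

double(np.arange(10))
print(numba.threading_layer())  # layer actually selected after the first parallel run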
Versions
I ran into the issue on Numba 0.55.1 (installed from pip), then installed from master (0.56.0dev0+298.g53ea89fee) and hit the same problem. I thought it might be an issue specific to my local machine, but it also fails on GitHub CI in the same way.
I've been testing with Python 3.9 on Linux; GitHub CI fails on Python 3.7-3.10.
https://github.com/numba/numba/issues/5890 seemed related, so I tried installing the PR from https://github.com/numba/numba/pull/7625, but it has no effect on this problem.
I haven’t tried a conda install or a TBB threading layer yet; will try that and report back.
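For the record, the TBB attempt would look roughly like the following (untested on my side so far; it assumes the TBB runtime is available, e.g. via pip install tbb or conda install tbb):

import numba
# Explicitly request the TBB threading layer instead of workqueue;
# setting the environment variable NUMBA_THREADING_LAYER=tbb is equivalent.
numba.config.THREADING_LAYER = 'tbb'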
Debugging Attempts
Valgrind gives a pleasantly coherent view of what's happening. It appears the child process segfaults while locking a workqueue mutex, leaving the parent process hanging as it waits for the child to return its result. This segfault story also meshes with the GitHub CI result from above, where pytest seems to notice that something segfaulted.
Note the PID changes after a few lines, as Valgrind starts tracking the child.
(venv) lgarrison@ccalin008:~/abacusutils$ valgrind --suppressions=valgrind-python.supp python3 repro_gh47.py
==1400719== Memcheck, a memory error detector
==1400719== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==1400719== Using Valgrind-3.17.0 and LibVEX; rerun with -h for copyright info
==1400719== Command: python3 repro_gh47.py
==1400719== 
Single-process result: [ 0  2  4  6  8 10 12 14 16 18]
numba.threading_layer()='workqueue'
==1401030== Invalid read of size 4
==1401030==    at 0x58B7D00: pthread_mutex_lock (in /usr/lib64/libpthread-2.17.so)
==1401030==    by 0xE845DA63: queue_condition_lock (workqueue.c:89)
==1401030==    by 0xE845DA63: queue_state_wait (workqueue.c:260)
==1401030==    by 0xE845DA63: ready (workqueue.c:554)
==1401030==    by 0xE845DA63: parallel_for (workqueue.c:401)
==1401030==    by 0x404E19A: ???
==1401030==    by 0x3: ???
==1401030==    by 0x17: ???
==1401030==    by 0xFEFF706F: ???
==1401030==    by 0x8: ???
==1401030==  Address 0x40 is not stack'd, malloc'd or (recently) free'd
==1401030== 
==1401030== 
==1401030== Process terminating with default action of signal 11 (SIGSEGV)
==1401030==  Access not within mapped region at address 0x40
==1401030==    at 0x58B7D00: pthread_mutex_lock (in /usr/lib64/libpthread-2.17.so)
==1401030==    by 0xE845DA63: queue_condition_lock (workqueue.c:89)
==1401030==    by 0xE845DA63: queue_state_wait (workqueue.c:260)
==1401030==    by 0xE845DA63: ready (workqueue.c:554)
==1401030==    by 0xE845DA63: parallel_for (workqueue.c:401)
==1401030==    by 0x404E19A: ???
==1401030==    by 0x3: ???
==1401030==    by 0x17: ???
==1401030==    by 0xFEFF706F: ???
==1401030==    by 0x8: ???
==1401030==  If you believe this happened as a result of a stack
==1401030==  overflow in your program's main thread (unlikely but
==1401030==  possible), you can try to increase the size of the
==1401030==  main thread stack using the --main-stacksize= flag.
==1401030==  The main thread stack size used in this run was 8388608.
==1401030== 
==1401030== HEAP SUMMARY:
==1401030==     in use at exit: 24,269,054 bytes in 24,666 blocks
==1401030==   total heap usage: 639,280 allocs, 614,614 frees, 1,540,075,539 bytes allocated
==1401030== 
==1401030== LEAK SUMMARY:
==1401030==    definitely lost: 576 bytes in 12 blocks
==1401030==    indirectly lost: 0 bytes in 0 blocks
==1401030==      possibly lost: 985,934 bytes in 4,064 blocks
==1401030==    still reachable: 23,282,544 bytes in 20,590 blocks
==1401030==                       of which reachable via heuristic:
==1401030==                         multipleinheritance: 6,896 bytes in 8 blocks
==1401030==         suppressed: 0 bytes in 0 blocks
==1401030== Rerun with --leak-check=full to see details of leaked memory
==1401030== 
==1401030== For lists of detected and suppressed errors, rerun with: -s
==1401030== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 84136 from 1044)
Reproducer
#!/usr/bin/env python3
import numba
numba.config.THREADING_LAYER = 'workqueue'
import numpy as np
# === Run a Numba parallel function in a single process ===
@numba.njit(parallel=True)
def f(a):
    return 2*a
res = f(np.arange(10))
print(f'Single-process result: {res}', flush=True)
print(f'{numba.threading_layer()=}')
# === Now run a Numba parallel function in a forked process ===
@numba.njit(parallel=True)
def g(a):
    return 3*a
    
import multiprocessing
with multiprocessing.Pool(1) as p:
    mres = p.map(g, [np.arange(10)])  # hangs
print(f'Forked process result: {mres}', flush=True)
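For comparison, and only as an untested sketch (I have not confirmed it avoids the hang, and it relies on the dispatcher being picklable for the spawned worker), the child work could be dispatched with the 'spawn' start method so the child does not inherit the parent's threading-layer state via fork():

#!/usr/bin/env python3
import multiprocessing
import numba
import numpy as np

numba.config.THREADING_LAYER = 'workqueue'

@numba.njit(parallel=True)
def g(a):
    return 3*a

if __name__ == '__main__':
    # 'spawn' starts a fresh interpreter rather than fork()ing the parent,
    # so the child sets up its own workqueue state from scratch.
    ctx = multiprocessing.get_context('spawn')
    with ctx.Pool(1) as p:
        mres = p.map(g, [np.arange(10)])
    print(f'Spawned process result: {mres}', flush=True)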
Issue Analytics
- Created 2 years ago
- Comments: 12 (8 by maintainers)
From the comments:
I think I’ve got a patch for this; it’s written on top of https://github.com/numba/numba/pull/7625, with a view to getting that merged shortly.
I think https://github.com/numba/numba/pull/7625 ought to go in first, otherwise the conflicts, both in moving the same areas of the code base and in handling the complexity of the change, will end up prohibitively hard to manage. Will hopefully finish reviewing that next week and then propose a patch on top of it.