Interpreter hangs when running parallel function in subprocess with workqueue backend
Reporting a bug
- I have tried using the latest released version of Numba (most recent is visible in the change log: https://github.com/numba/numba/blob/main/CHANGE_LOG).
 - I have included a self-contained code sample to reproduce the problem, i.e. it's possible to run as 'python bug.py'.
 
Description
The Python interpreter hangs when using Numba in the following way:
- Select the "forksafe" workqueue threading backend
- Run a Numba parallel function
- Start a subprocess (e.g. using Python multiprocessing)
- Run a Numba parallel function in the subprocess
 
If I understand correctly, this pattern should be allowed when selecting a forksafe threading layer. Is this correct? Or is one only allowed to run parallel functions in a subprocess, but not mix parent and subprocess work?
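For context, my reading of the Numba docs is that a threading layer can be requested either by a specific name (workqueue, omp, tbb) or by a safety category such as 'forksafe', and that numba.threading_layer() reports which layer was actually chosen. A minimal sketch of what I mean (the layer Numba ends up selecting depends on what is installed):

import numba
# Ask for any layer documented as safe to use with fork();
# a concrete name such as 'workqueue' can be used instead.
numba.config.THREADING_LAYER = 'forksafe'

import numpy as np

@numba.njit(parallel=True)
def double(a):
    return 2 * a

double(np.arange(10))
print(numba.threading_layer())  # layer actually selected after the first parallel run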
Versions
I ran into the issue on Numba 0.55.1 (installed from pip), then installed from master (0.56.0dev0+298.g53ea89fee) and hit the same problem. I thought it might be an issue specific to my local machine, but it also fails on GitHub CI in the same way.
I've been testing with Python 3.9 on Linux; GitHub CI fails on Python 3.7-3.10.
https://github.com/numba/numba/issues/5890 seemed related, so I tried installing the PR from https://github.com/numba/numba/pull/7625, but it has no effect on this problem.
I haven’t tried a conda install or a TBB threading layer yet; will try that and report back.
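For the record, the TBB attempt would look roughly like the following (untested on my side so far; it assumes the TBB runtime is available, e.g. via pip install tbb or conda install tbb):

import numba
# Explicitly request the TBB threading layer instead of workqueue;
# setting the environment variable NUMBA_THREADING_LAYER=tbb is equivalent.
numba.config.THREADING_LAYER = 'tbb'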
Debugging Attempts
Valgrind gives a pleasantly coherent view of what's happening. It appears the child process segfaults while locking a workqueue mutex, leaving the parent process hanging as it waits for the child to return its result. This segfault story also meshes with the GitHub CI result from above, where pytest seems to notice that something segfaulted.
Note the PID changes after a few lines, as Valgrind starts tracking the child.
(venv) lgarrison@ccalin008:~/abacusutils$ valgrind --suppressions=valgrind-python.supp python3 repro_gh47.py
==1400719== Memcheck, a memory error detector
==1400719== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==1400719== Using Valgrind-3.17.0 and LibVEX; rerun with -h for copyright info
==1400719== Command: python3 repro_gh47.py
==1400719== 
Single-process result: [ 0  2  4  6  8 10 12 14 16 18]
numba.threading_layer()='workqueue'
==1401030== Invalid read of size 4
==1401030==    at 0x58B7D00: pthread_mutex_lock (in /usr/lib64/libpthread-2.17.so)
==1401030==    by 0xE845DA63: queue_condition_lock (workqueue.c:89)
==1401030==    by 0xE845DA63: queue_state_wait (workqueue.c:260)
==1401030==    by 0xE845DA63: ready (workqueue.c:554)
==1401030==    by 0xE845DA63: parallel_for (workqueue.c:401)
==1401030==    by 0x404E19A: ???
==1401030==    by 0x3: ???
==1401030==    by 0x17: ???
==1401030==    by 0xFEFF706F: ???
==1401030==    by 0x8: ???
==1401030==  Address 0x40 is not stack'd, malloc'd or (recently) free'd
==1401030== 
==1401030== 
==1401030== Process terminating with default action of signal 11 (SIGSEGV)
==1401030==  Access not within mapped region at address 0x40
==1401030==    at 0x58B7D00: pthread_mutex_lock (in /usr/lib64/libpthread-2.17.so)
==1401030==    by 0xE845DA63: queue_condition_lock (workqueue.c:89)
==1401030==    by 0xE845DA63: queue_state_wait (workqueue.c:260)
==1401030==    by 0xE845DA63: ready (workqueue.c:554)
==1401030==    by 0xE845DA63: parallel_for (workqueue.c:401)
==1401030==    by 0x404E19A: ???
==1401030==    by 0x3: ???
==1401030==    by 0x17: ???
==1401030==    by 0xFEFF706F: ???
==1401030==    by 0x8: ???
==1401030==  If you believe this happened as a result of a stack
==1401030==  overflow in your program's main thread (unlikely but
==1401030==  possible), you can try to increase the size of the
==1401030==  main thread stack using the --main-stacksize= flag.
==1401030==  The main thread stack size used in this run was 8388608.
==1401030== 
==1401030== HEAP SUMMARY:
==1401030==     in use at exit: 24,269,054 bytes in 24,666 blocks
==1401030==   total heap usage: 639,280 allocs, 614,614 frees, 1,540,075,539 bytes allocated
==1401030== 
==1401030== LEAK SUMMARY:
==1401030==    definitely lost: 576 bytes in 12 blocks
==1401030==    indirectly lost: 0 bytes in 0 blocks
==1401030==      possibly lost: 985,934 bytes in 4,064 blocks
==1401030==    still reachable: 23,282,544 bytes in 20,590 blocks
==1401030==                       of which reachable via heuristic:
==1401030==                         multipleinheritance: 6,896 bytes in 8 blocks
==1401030==         suppressed: 0 bytes in 0 blocks
==1401030== Rerun with --leak-check=full to see details of leaked memory
==1401030== 
==1401030== For lists of detected and suppressed errors, rerun with: -s
==1401030== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 84136 from 1044)
Reproducer
#!/usr/bin/env python3
import numba
numba.config.THREADING_LAYER = 'workqueue'
import numpy as np
# === Run a Numba parallel function in a single process ===
@numba.njit(parallel=True)
def f(a):
    return 2*a
res = f(np.arange(10))
print(f'Single-process result: {res}', flush=True)
print(f'{numba.threading_layer()=}')
# === Now run a Numba parallel function in a forked process ===
@numba.njit(parallel=True)
def g(a):
    return 3*a
    
import multiprocessing
with multiprocessing.Pool(1) as p:
    mres = p.map(g, [np.arange(10)])  # hangs
print(f'Forked process result: {mres}', flush=True)
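For comparison, and only as an untested sketch (I have not confirmed it avoids the hang, and it relies on the dispatcher being picklable for the spawned worker), the child work could be dispatched with the 'spawn' start method so the child does not inherit the parent's threading-layer state via fork():

#!/usr/bin/env python3
import multiprocessing
import numba
import numpy as np

numba.config.THREADING_LAYER = 'workqueue'

@numba.njit(parallel=True)
def g(a):
    return 3*a

if __name__ == '__main__':
    # 'spawn' starts a fresh interpreter rather than fork()ing the parent,
    # so the child sets up its own workqueue state from scratch.
    ctx = multiprocessing.get_context('spawn')
    with ctx.Pool(1) as p:
        mres = p.map(g, [np.arange(10)])
    print(f'Spawned process result: {mres}', flush=True)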
Issue Analytics
- Created 2 years ago
- Comments: 12 (8 by maintainers)
From the comments:
I think I’ve got a patch for this; it’s written on top of https://github.com/numba/numba/pull/7625, with a view to getting that merged shortly.
I think https://github.com/numba/numba/pull/7625 ought to go in first, otherwise the conflicts, both in moving the same areas of the code base and in handling the complexity of the change, will end up prohibitively hard to manage. Will hopefully finish reviewing that next week and then propose a patch on top of it.