
Seemingly random segfault on macOS if function is in larger library

See original GitHub issue

Reporting a bug

Hi,

First of all, sorry for the small report and the few examples, but at this point the issue seems so untraceable to me that I'm hoping for any input to track it down. Maybe it's even a severely stupid mistake on my part that I just can't find. I was getting segfaults from a Numba function and traced it down to the state I will outline here, but beyond that I can't find anything anymore.

I have a function which operates on arrays. I have simplified it so far that it no longer makes much sense, but here it goes.

import numpy as np
import numba

@numba.jit("f8[:,:](f8[:,:,:],f8[:,:],f8,f8,f8[:])", nopython=True, parallel=True,
           nogil=True)
def evalManyIndividual(individuals, X, p1, p2, p3):
    # Simplified for the report: the loop body just writes random fitnesses;
    # `individual` and `P` are leftover locals from the original code.
    fitnesses = np.zeros((individuals.shape[0], 4))
    nF = individuals.shape[0]
    for i in numba.prange(nF):
        individual = individuals[i]
        P = np.random.random((3, 4))
        fitnesses[i] = np.random.random((4,))
    return fitnesses

For debugging, I'm using synthetic input data:

n = 15
m = 3
inputInd = np.random.random((500, n, m))
inputArray = np.random.random((n, m))
p1 = 25e-3
p2 = 55.
p3 = np.array([320., 240.])

ret = evalManyIndividual(inputInd, inputArray, p1, p2, p3)

Running this as a small script works. Running it interactively works. However, I have a large library of Numba functions in which the one above is included, just somewhere in there, same syntax, copy & paste. If I then add the same call with the same synthetic input data after the library (compiling the full library, including the above function), only calling the function as above:

ret = evalManyIndividual(inputInd, inputArray, p1, p2, p3)

I'm getting a segfault. No traceback, nothing. In the terminal it is zsh: segmentation fault; Jupyter just hangs completely.

This happens on macOS 10.15. A difference worth mentioning: during compilation of the library, I'm getting some warnings

NumbaPerformanceWarning: '@' is faster on contiguous arrays, called on (array(float64, 2d, A), array(float64, 2d, A))
  warnings.warn(NumbaPerformanceWarning(msg))

Those products are neither in the crashing function nor connected to it!
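As an aside (per the report above, this warning is unrelated to the crash): the NumbaPerformanceWarning fires when '@' (matmul) is applied to non-contiguous arrays. A minimal NumPy-only illustration, with hypothetical array names, of what triggers it and how it is typically silenced:

```python
import numpy as np

a = np.random.random((6, 6))[::2, :]  # a strided view: rows are non-contiguous
b = np.random.random((6, 4))

print(a.flags['C_CONTIGUOUS'])   # False: this layout is what the warning is about
c = np.ascontiguousarray(a) @ b  # a contiguous copy avoids the slow matmul path
print(c.shape)                   # (3, 4)
```

Making the operands contiguous trades a copy for a faster matmul; whether that is a net win depends on the array sizes.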

At this point, I'm happy for any kind of input, since I can't find a reason.

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 58 (31 by maintainers)

Top GitHub Comments

2 reactions
stuartarchibald commented, Jul 21, 2020

I think the problem here is as follows… using this code as an example:

import numpy as np
import numba


@numba.njit(parallel=True, debug=True)
def f(x):
    x[:] = 1
    return


if __name__ == '__main__':
    # Setting the config value here, after numba has already been imported
    # (and NUM_THREADS evaluated at import time), is what sets up the mismatch.
    numba.config.NUMBA_NUM_THREADS = 2
    f(np.ones(100))

    from numba import threading_layer
    print(threading_layer())

When this script is run the following sequence occurs:

  1. Near the top, import numba, via its __init__ (https://github.com/numba/numba/blob/b4badb5f0ecae44ce3fbc57d83a85d24488699a3/numba/__init__.py#L38-L39), imports vectorize, which goes via numba.np.ufunc.__init__ (https://github.com/numba/numba/blob/b4badb5f0ecae44ce3fbc57d83a85d24488699a3/numba/np/ufunc/__init__.py#L3) and has the side effect of this import too: https://github.com/numba/numba/blob/b4badb5f0ecae44ce3fbc57d83a85d24488699a3/numba/np/ufunc/__init__.py#L6-L7
  2. As a result of 1., numba.np.ufunc.parallel is imported as part of numba.__init__ and this module global is evaluated: https://github.com/numba/numba/blob/b4badb5f0ecae44ce3fbc57d83a85d24488699a3/numba/np/ufunc/parallel.py#L47. The result is that NUM_THREADS is e.g. 4 on a 4-core machine.
  3. Python continues with the script and runs the if __name__ == "__main__" block, first setting numba.config.NUMBA_NUM_THREADS = 2 and then calling the @numba.njit(parallel=True, debug=True) decorated function f.
  4. As part of the compilation of f, _launch_threads is called to start the actual thread pool (https://github.com/numba/numba/blob/b4badb5f0ecae44ce3fbc57d83a85d24488699a3/numba/parfors/parfor_lowering.py#L1419); the pool is sized with the NUM_THREADS value from above (https://github.com/numba/numba/blob/b4badb5f0ecae44ce3fbc57d83a85d24488699a3/numba/np/ufunc/parallel.py#L493), so there is a thread pool of size e.g. 4. Then _load_num_threads_funcs() is called (https://github.com/numba/numba/blob/b4badb5f0ecae44ce3fbc57d83a85d24488699a3/numba/np/ufunc/parallel.py#L495), which calls the backend-specific _set_num_threads function so that the main thread has NUM_THREADS as the number of threads in the pool in its TLS slot, here: https://github.com/numba/numba/blob/b4badb5f0ecae44ce3fbc57d83a85d24488699a3/numba/np/ufunc/parallel.py#L511 and here (for OpenMP): https://github.com/numba/numba/blob/b4badb5f0ecae44ce3fbc57d83a85d24488699a3/numba/np/ufunc/omppool.cpp#L59-L63
  5. Further on in the compilation of f, the parfors lowering queries the Python function numba.np.ufunc.parallel.get_thread_count (https://github.com/numba/numba/blob/b4badb5f0ecae44ce3fbc57d83a85d24488699a3/numba/parfors/parfor_lowering.py#L1503), which looks like https://github.com/numba/numba/blob/b4badb5f0ecae44ce3fbc57d83a85d24488699a3/numba/np/ufunc/parallel.py#L37-L44 and reads the numba.config variable, so sched_size ends up based on the value 2. However, when the memory allocated with sched_size is used at run time in a call to do_scheduling (https://github.com/numba/numba/blob/b4badb5f0ecae44ce3fbc57d83a85d24488699a3/numba/parfors/parfor_lowering.py#L1530-L1535), the number of threads comes from a call made at run time (https://github.com/numba/numba/blob/b4badb5f0ecae44ce3fbc57d83a85d24488699a3/numba/parfors/parfor_lowering.py#L1521), whose value is e.g. 4 because it is read from the TLS slot in the threading backend, e.g. for OpenMP: https://github.com/numba/numba/blob/b4badb5f0ecae44ce3fbc57d83a85d24488699a3/numba/np/ufunc/omppool.cpp#L65-L76
  6. The result of all this is that a schedule based on size 2 is baked in at compile time while a thread count of e.g. 4 is present at run time. This results in invalid access which, assuming I've got this right, is probably the cause of the somewhat hard-to-trace segfault.
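The failure mode in the six steps above can be sketched in plain Python, without Numba: a per-thread schedule is sized for one thread count at "compile time" while a different number of workers indexes into it at run time. The functions build_schedule and run below are hypothetical illustrations, not Numba internals:

```python
def build_schedule(n_items, sched_threads):
    # One (start, stop) chunk per thread the compiler *thinks* exists.
    chunk = n_items // sched_threads
    return [(t * chunk, (t + 1) * chunk) for t in range(sched_threads)]

def run(n_items, sched_threads, runtime_threads):
    # Schedule baked at compile time; runtime_threads workers use it later.
    sched = build_schedule(n_items, sched_threads)
    for tid in range(runtime_threads):
        if tid >= len(sched):
            # In native code this out-of-bounds read is the segfault.
            return "invalid access by thread %d" % tid
        start, stop = sched[tid]
    return "ok"

print(run(100, 2, 2))  # ok: schedule size and pool size agree
print(run(100, 2, 4))  # invalid access by thread 2: pool outgrew the schedule
```

In pure Python this is a catchable bounds error; in the compiled parfors code the same mismatch is a raw out-of-bounds memory access, hence the untraceable crash.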
1 reaction
joseph-long commented, Jul 20, 2021

Enough of the keywords in this issue line up with things going on in my own debugging that I thought I'd chime in (and watch for further updates). I have been using @njit(parallel=True, cache=True) on a function, and the test case intermittently fails by hanging the pytest process so that I can't interrupt it and have to kill it. Triggering recompilation of the njit-ed function returns it to working.

I've been unable to track down the root cause, but @stuartarchibald's analysis above is comprehensive (seriously impressive!) and it seems likely that the "wrong" value is getting baked in somewhere in my case too. As a workaround, I'm not using cache=True on those functions and I'm no longer calling numba.set_num_threads at all.

Is #6025 the best hope for resolving this on macOS?
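One commonly suggested way to sidestep this whole class of mismatch is to pin the thread count in the environment before Numba is ever imported, so the value read at import time and the value used at run time cannot diverge. A hedged sketch (the commented-out import marks where numba would be loaded in a real script; "4" is an arbitrary example value):

```python
import os

# Must happen before `import numba`: Numba reads NUMBA_NUM_THREADS
# when its config module is first imported.
os.environ["NUMBA_NUM_THREADS"] = "4"

# import numba  # in a real script, import numba only after setting the variable
print(os.environ["NUMBA_NUM_THREADS"])  # 4
```

Launching via the shell (NUMBA_NUM_THREADS=4 python script.py) achieves the same thing without touching the code.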
