NPY_SIGINT_{ON,OFF} have a race condition that can cause control-C to segfault Python
In the FFTPACK code, we have some “clever” code that uses the pattern `NPY_SIGINT_ON`, then run some inner loop, then `NPY_SIGINT_OFF`. And what this does is: `NPY_SIGINT_ON` uses `setjmp` to make a jump buffer target, saves the current signal handler for SIGINT (control-C), and then registers a new SIGINT handler that attempts to handle control-C by `longjmp`ing out to the `setjmp` buffer. The idea is that this allows us to cancel long-running calculations.
The actual implementation of these macros is:

```c
#define NPY_SIGINT_ON {                                                   \
        PyOS_sighandler_t _npy_sig_save;                                  \
        _npy_sig_save = PyOS_setsig(SIGINT, _PyArray_SigintHandler);      \
        if (NPY_SIGSETJMP(*((NPY_SIGJMP_BUF *)_PyArray_GetSigintBuf()),   \
                          1) == 0) {

#define NPY_SIGINT_OFF }                                                  \
        PyOS_setsig(SIGINT, _npy_sig_save);                               \
   }
```
So this has two race conditions:
**Minor problem:** If we receive a control-C in between the call to `PyOS_setsig` and the call to `NPY_SIGSETJMP`, then we’ll `longjmp` out of the signal handler into an uninitialized buffer, and will not go to space today.
**Major problem:** Suppose that the following sequence occurs:

- thread 1 enters `NPY_SIGINT_ON`, stashes Python’s default SIGINT handler in its `_npy_sig_save`
- thread 2 enters `NPY_SIGINT_ON`, stashes our SIGINT handler in its `_npy_sig_save`
- thread 1 leaves via `NPY_SIGINT_OFF`, restores Python’s default SIGINT handler from its `_npy_sig_save`
- thread 2 leaves via `NPY_SIGINT_OFF`, restores our SIGINT handler from its `_npy_sig_save`
Now our signal handler gets left installed indefinitely, and eventually when control-C gets hit it will attempt to jump out to a stack frame that was reused ages ago. This is a bad problem and you will not go to space today.
By adding some `printf` checks to `NPY_SIGINT_OFF`, I’ve confirmed that this is the cause of the segfault that Oscar reported on numpy-discussion. It’s very easy to replicate even from Python code:
```python
import numpy
numpy.test()
import os
import signal
os.kill(os.getpid(), signal.SIGINT)
# -> segfault
```
Also, the whole architecture here is somewhat screwed up, because the Python model is that all signals are directed to the main thread (by setting them to masked on all other threads). So it doesn’t even make sense to be installing a signal handler to interrupt operations on a thread that can’t receive signals.
(If you look at the definitions of `_PyArray_SigintHandler` and `_PyArray_GetSigintBuf` in `multiarraymodule.c`, then you’ll see that there are some basic defenses against this: we keep our jump buffer in TLS, along with a flag `sigint_buf_init`, so the idea is that if we start running the signal handler in a thread that hasn’t actually allocated a jump buffer, then we throw away the signal. This is also pretty buggy: (1) nothing ever sets the flag back to 0, so if a thread has ever had a valid jump buffer on its stack then we will happily jump to it, even if it’s no longer valid; (2) we shouldn’t throw away the signal, we should pass it on to Python.)
So the simple solution is: (1) check whether we are actually the one thread that can receive signals; (2) if so, register our handler etc.; (3) if not, make these macros into no-ops.
The one wrinkle here is that in theory it’s possible for someone to call `sigprocmask` themselves and enable signal receiving on a non-default thread. We might want to be robust to this. A nice extra constraint is that there’s no C API to check whether we are in the main thread; the only way I can think of to check that is actually to call `sigprocmask` and check whether SIGINT delivery is enabled for this thread.
Issue Analytics
- Created 7 years ago
- Comments: 5 (5 by maintainers)
Top GitHub Comments
I will make a mini PR to just delete all traces of the interrupt handling stuff (I must have missed two lines). I honestly would be 👍 for just deleting the header.
But if not now… let’s mark it to maybe do when a major release comes around (i.e. any moment).
As to this issue, closing as a duplicate of the newer (and more focused) gh-12541.