NPY_SIGINT_{ON,OFF} have a race condition that can cause control-C to segfault Python
In the FFTPACK code, we have some “clever” code that uses the pattern `NPY_SIGINT_ON`, then run some inner loop, then `NPY_SIGINT_OFF`. And what this does is: `NPY_SIGINT_ON` uses `setjmp` to make a jump buffer target, saves the current signal handler for SIGINT (control-C), and then registers a new SIGINT handler that attempts to handle control-C by `longjmp`ing out to the `setjmp` buffer. The idea is that this allows us to cancel long-running calculations.
The actual implementation of these macros is:

```c
#define NPY_SIGINT_ON {                                                   \
        PyOS_sighandler_t _npy_sig_save;                                  \
        _npy_sig_save = PyOS_setsig(SIGINT, _PyArray_SigintHandler);      \
        if (NPY_SIGSETJMP(*((NPY_SIGJMP_BUF *)_PyArray_GetSigintBuf()),   \
                          1) == 0) {

#define NPY_SIGINT_OFF }                                                  \
        PyOS_setsig(SIGINT, _npy_sig_save);                               \
   }
```
So this has two race conditions:
**Minor problem:** If we receive a control-C in between the call to `PyOS_setsig` and the call to `NPY_SIGSETJMP`, then we’ll `longjmp` out of the signal handler into an uninitialized buffer, and will not go to space today.
**Major problem:** Suppose that the following sequence occurs:

- thread 1 enters `NPY_SIGINT_ON`, stashes Python’s default SIGINT handler in its `_npy_sig_save`
- thread 2 enters `NPY_SIGINT_ON`, stashes our SIGINT handler in its `_npy_sig_save`
- thread 1 leaves via `NPY_SIGINT_OFF`, restores Python’s default SIGINT handler from its `_npy_sig_save`
- thread 2 leaves via `NPY_SIGINT_OFF`, restores our SIGINT handler from its `_npy_sig_save`
Now our signal handler gets left installed indefinitely, and eventually when control-C gets hit it will attempt to jump out to a stack frame that was reused ages ago. This is a bad problem and you will not go to space today.
By adding some `printf` checks to `NPY_SIGINT_OFF`, I’ve confirmed that this is the cause of the segfault that Oscar reported on numpy-discussion. It’s very easy to replicate even from Python code:
```python
import numpy
numpy.test()
import os
import signal
os.kill(os.getpid(), signal.SIGINT)
# -> segfault
```
Also, the whole architecture here is somewhat screwed up, because the Python model is that all signals are directed to the main thread (by setting them to masked on all other threads). So it doesn’t even make sense to be installing a signal handler to interrupt operations on a thread that can’t receive signals.
(If you look at the definitions of `_PyArray_SigintHandler` and `_PyArray_GetSigintBuf` in `multiarraymodule.c`, then you’ll see that there are some basic defenses against this: we keep our jump buffer in TLS, along with a flag `sigint_buf_init`, so the idea is that if we start running the signal handler in a thread that hasn’t actually allocated a jump buffer, then we throw away the signal. This is also pretty buggy: (1) nothing ever sets the flag back to 0, so if a thread has ever had a valid jump buffer on its stack then we will happily jump to it, even if it’s no longer valid; (2) we shouldn’t throw away the signal, we should pass it on to Python.)
So the simple solution is: (1) check whether we are actually the one thread that can receive signals; (2) if so, register our handler etc.; (3) if not, make these macros into no-ops.
The one wrinkle here is that in theory it’s possible for someone to call `sigprocmask` themselves and enable signal receiving on a non-default thread. We might want to be robust to this. A nice extra constraint is that there’s no C API to check whether we are in the main thread; the only way I can think of to check that is actually to call `sigprocmask` and check whether SIGINT delivery is enabled for this thread.
Issue Analytics
- Created 7 years ago
- Comments: 5 (5 by maintainers)
Top GitHub Comments
I will make a mini PR to just delete all traces of the interrupt handling stuff (I must have missed two lines). I honestly would be 👍 for just deleting the header.
But if not now… let’s mark it to maybe do when a major release comes around (i.e. any moment).
As to this issue, closing as a duplicate of the newer (and more focused) gh-12541.