
Compilation with cuda.jit randomly fails with segfault

See original GitHub issue

I’m porting code over to Python and using Numba with CUDA. I’m getting random segfaults that appear to occur during CUDA compilation. It usually runs, but about one time in five it segfaults.

This is an example that (sometimes) reproduces the problem.

import numpy as np
import math
from numba import cuda, float32

NUM_SAMPLES = 2
NUM_THREADS = 128

@cuda.jit
def computeValues(data_input, angle, total_size):
  threadID = cuda.threadIdx.x + cuda.blockIdx.x * cuda.blockDim.x
  if threadID >= total_size:
    return

  s1 = cuda.local.array(shape=(NUM_SAMPLES,4), dtype=float32)

  for i in range(NUM_SAMPLES):
    for j in range(4):
      s1[i,j] = data_input[threadID,i,j]

    s1[i,2] = math.sin(angle) * s1[i,1] + math.cos(angle) * s1[i,2]


if __name__ == "__main__":
    total_size = 200

    data_input = np.zeros((total_size, NUM_SAMPLES, 4), dtype='float32')

    BlockSize = int(math.ceil(total_size / NUM_THREADS))

    num_gpu = len(cuda.gpus)
    print(f"number of CUDA devices: {num_gpu}")
    cuda.select_device(2)

    for value in [280.0, 285.0, 290.0]:
      print(f"value = {value} starting\n")
      computeValues[BlockSize, NUM_THREADS](data_input, 0.0, total_size)     
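One incidental note on the launch configuration above: `BlockSize` here is actually the number of blocks in the grid, computed with `math.ceil` over Python 3's true division. An equivalent integer-only idiom (the helper name below is hypothetical, not from the original code) avoids the float round trip:

```python
def blocks_needed(total_size, threads_per_block=128):
    # Ceiling division without floating point: equivalent to
    # int(math.ceil(total_size / threads_per_block)) for positive ints.
    return (total_size + threads_per_block - 1) // threads_per_block

print(blocks_needed(200))  # 2 blocks of 128 threads cover 200 work items
```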

Most times it works as expected:

$ PYTHONFAULTHANDLER=1 python3 ../test.py
number of CUDA devices: 4
value = 280.0 starting

value = 285.0 starting

value = 290.0 starting

Other times it fails like this:

$ PYTHONFAULTHANDLER=1 python3 ../test.py 
number of CUDA devices: 4
value = 280.0 starting

Fatal Python error: Segmentation fault

Current thread 0x00007f169eaf4740 (most recent call first):
  File "/lib/python3.7/site-packages/numba/cuda/cudadrv/nvvm.py", line 230 in compile
  File "/lib/python3.7/site-packages/numba/cuda/cudadrv/nvvm.py", line 512 in llvm_to_ptx
  File "/lib/python3.7/site-packages/numba/cuda/compiler.py", line 451 in get
  File "/lib/python3.7/site-packages/numba/cuda/compiler.py", line 480 in get
  File "/lib/python3.7/site-packages/numba/cuda/compiler.py", line 603 in bind
  File "/lib/python3.7/site-packages/numba/cuda/compiler.py", line 862 in compile
  File "/lib/python3.7/site-packages/numba/cuda/compiler.py", line 843 in specialize
  File "/lib/python3.7/site-packages/numba/cuda/compiler.py", line 832 in __call__
  File "../test.py", line 38 in <module>
Segmentation fault (core dumped)
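The traceback format above ("most recent call first") comes from CPython's faulthandler module, which the PYTHONFAULTHANDLER=1 environment variable enables. It can also be turned on programmatically, which is handy when you cannot control the environment the script runs in:

```python
import faulthandler

# Install the low-level fault handler so SIGSEGV, SIGFPE, SIGABRT, etc.
# dump a Python traceback to stderr before the process dies.
faulthandler.enable()

print(faulthandler.is_enabled())  # True
```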

It looks like this is happening during CUDA compilation. Is there anything I can change in my code to fix this?

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 11 (7 by maintainers)

Top GitHub Comments

manyfeatures commented on Nov 11, 2020 (1 reaction)

(Quoting gmarkall’s analysis below.)
That was really helpful!

gmarkall commented on Jul 23, 2020 (1 reaction)

Actually the valgrind error might have been spurious / a red herring. I notice that prior to optimization there are no memset intrinsics in the IR. However, after optimization, we have:

  call void @llvm.memset.p0i8.i64(i8* nonnull align 4 %0, i8 0, i64 32, i1 false)

The LLVM 3.4 IR specification (upon which NVVM is based) expects 5 parameters to the memset intrinsic (ref):

declare void @llvm.memset.p0i8.i32(i8* <dest>, i8 <val>,
                                   i32 <len>, i32 <align>, i1 <isvolatile>)
declare void @llvm.memset.p0i8.i64(i8* <dest>, i8 <val>,
                                   i64 <len>, i32 <align>, i1 <isvolatile>)

Whereas LLVM 9 only has 4 arguments for memset (ref):

declare void @llvm.memset.p0i8.i32(i8* <dest>, i8 <val>,
                                   i32 <len>, i1 <isvolatile>)
declare void @llvm.memset.p0i8.i64(i8* <dest>, i8 <val>,
                                   i64 <len>, i1 <isvolatile>)

My understanding right now is that NVVM parses the optimized IR expecting the 5-argument form, so whatever it reads as the last parameter is junk, leading to an occasional segfault.

Also, the segfault goes away if I disable optimization prior to sending the IR to NVVM, as in https://github.com/numba/numba/issues/5576#issuecomment-646548553

I think this is another argument for not optimizing the IR with llvmlite’s LLVM prior to sending the IR to NVVM.
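To make the mismatch concrete, here is a minimal sketch of the kind of rewriting that could bridge the two signatures: inserting an explicit `i32 <align>` operand before the `isvolatile` flag of a modern 4-operand memset call. This is a hypothetical string-level illustration, not Numba's actual implementation (real IR rewriting would go through the LLVM APIs, and the alignment in new IR lives in the `align` attribute on the pointer operand):

```python
import re

# Match a 4-operand @llvm.memset call up to its trailing "i1 <isvolatile>".
FOUR_ARG = re.compile(
    r"(@llvm\.memset\.p0i8\.i(?:32|64)\(.+, )i1 (?P<vol>true|false)\)"
)

def downgrade_memset(line, align=0):
    """Insert an explicit 'i32 <align>' operand before the isvolatile flag,
    producing the 5-operand form expected by an LLVM 3.4-based parser."""
    return FOUR_ARG.sub(
        lambda m: f"{m.group(1)}i32 {align}, i1 {m.group('vol')})", line
    )

call = ("call void @llvm.memset.p0i8.i64("
        "i8* nonnull align 4 %0, i8 0, i64 32, i1 false)")
print(downgrade_memset(call, align=4))
# call void @llvm.memset.p0i8.i64(i8* nonnull align 4 %0, i8 0, i64 32, i32 4, i1 false)
```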

