
Compilation with cuda.jit randomly fails with segfault

See original GitHub issue

I’m porting code over to Python and using Numba with CUDA. I’m getting random segfaults that appear to occur during CUDA compilation. It usually runs, but about one time in five it segfaults.

This is an example that (sometimes) reproduces the problem.

import numpy as np
import math
from numba import cuda, float32

NUM_SAMPLES = 2
NUM_THREADS = 128

@cuda.jit
def computeValues(data_input, angle, total_size):
  threadID = cuda.threadIdx.x + cuda.blockIdx.x * cuda.blockDim.x
  if threadID >= total_size:
    return

  s1 = cuda.local.array(shape=(NUM_SAMPLES,4), dtype=float32)

  for i in range(NUM_SAMPLES):
    for j in range(4):
      s1[i,j] = data_input[threadID,i,j]

    s1[i,2] = math.sin(angle) * s1[i,1] + math.cos(angle) * s1[i,2]


if __name__ == "__main__":
    total_size = 200

    data_input = np.zeros((total_size, NUM_SAMPLES, 4), dtype='float32')

    BlockSize = int(math.ceil(total_size / NUM_THREADS))

    num_gpu = len(cuda.gpus)
    print(f"number of CUDA devices: {num_gpu}")
    cuda.select_device(2)

    for value in [280.0, 285.0, 290.0]:
      print(f"value = {value} starting\n")
      computeValues[BlockSize, NUM_THREADS](data_input, 0.0, total_size)     
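One incidental note on the launch configuration above: `BlockSize` here is actually the number of blocks in the grid, computed with `math.ceil` over Python 3's true division. An equivalent integer-only idiom (the helper name below is hypothetical, not from the original code) avoids the float round trip:

```python
def blocks_needed(total_size, threads_per_block=128):
    # Ceiling division without floating point: equivalent to
    # int(math.ceil(total_size / threads_per_block)) for positive ints.
    return (total_size + threads_per_block - 1) // threads_per_block

print(blocks_needed(200))  # 2 blocks of 128 threads cover 200 work items
```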

Most times it works as expected:

$ PYTHONFAULTHANDLER=1 python3 ../test.py
number of CUDA devices: 4
value = 280.0 starting

value = 285.0 starting

value = 290.0 starting

Other times it fails like this:

$ PYTHONFAULTHANDLER=1 python3 ../test.py 
number of CUDA devices: 4
value = 280.0 starting

Fatal Python error: Segmentation fault

Current thread 0x00007f169eaf4740 (most recent call first):
  File "/lib/python3.7/site-packages/numba/cuda/cudadrv/nvvm.py", line 230 in compile
  File "/lib/python3.7/site-packages/numba/cuda/cudadrv/nvvm.py", line 512 in llvm_to_ptx
  File "/lib/python3.7/site-packages/numba/cuda/compiler.py", line 451 in get
  File "/lib/python3.7/site-packages/numba/cuda/compiler.py", line 480 in get
  File "/lib/python3.7/site-packages/numba/cuda/compiler.py", line 603 in bind
  File "/lib/python3.7/site-packages/numba/cuda/compiler.py", line 862 in compile
  File "/lib/python3.7/site-packages/numba/cuda/compiler.py", line 843 in specialize
  File "/lib/python3.7/site-packages/numba/cuda/compiler.py", line 832 in __call__
  File "../test.py", line 38 in <module>
Segmentation fault (core dumped)
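The traceback format above ("most recent call first") comes from CPython's faulthandler module, which the PYTHONFAULTHANDLER=1 environment variable enables. It can also be turned on programmatically, which is handy when you cannot control the environment the script runs in:

```python
import faulthandler

# Install the low-level fault handler so SIGSEGV, SIGFPE, SIGABRT, etc.
# dump a Python traceback to stderr before the process dies.
faulthandler.enable()

print(faulthandler.is_enabled())  # True
```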

It looks like this is happening during CUDA compilation. Is there anything I can change in my code to fix this?

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 11 (7 by maintainers)

Top GitHub Comments

manyfeatures commented on Nov 11, 2020 (1 reaction)

(Quoting gmarkall’s analysis below.)
That was really helpful!

gmarkall commented on Jul 23, 2020 (1 reaction)

Actually the valgrind error might have been spurious / a red herring. I notice that prior to optimization there are no memset intrinsics in the IR. However, after optimization, we have:

  call void @llvm.memset.p0i8.i64(i8* nonnull align 4 %0, i8 0, i64 32, i1 false)

The LLVM 3.4 IR specification (upon which NVVM is based) expects 5 parameters to the memset intrinsic (ref):

declare void @llvm.memset.p0i8.i32(i8* <dest>, i8 <val>,
                                   i32 <len>, i32 <align>, i1 <isvolatile>)
declare void @llvm.memset.p0i8.i64(i8* <dest>, i8 <val>,
                                   i64 <len>, i32 <align>, i1 <isvolatile>)

Whereas LLVM 9 only has 4 arguments for memset (ref):

declare void @llvm.memset.p0i8.i32(i8* <dest>, i8 <val>,
                                   i32 <len>, i1 <isvolatile>)
declare void @llvm.memset.p0i8.i64(i8* <dest>, i8 <val>,
                                   i64 <len>, i1 <isvolatile>)

My understanding right now is that NVVM parses the optimized IR expecting the 5-argument form, so whatever it reads as the last parameter is junk, leading to an occasional segfault.

Also, the segfault goes away if I disable optimization prior to sending the IR to NVVM, as in https://github.com/numba/numba/issues/5576#issuecomment-646548553

I think this is another argument for not optimizing the IR with llvmlite’s LLVM prior to sending the IR to NVVM.
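To make the mismatch concrete, here is a minimal sketch of the kind of rewriting that could bridge the two signatures: inserting an explicit `i32 <align>` operand before the `isvolatile` flag of a modern 4-operand memset call. This is a hypothetical string-level illustration, not Numba's actual implementation (real IR rewriting would go through the LLVM APIs, and the alignment in new IR lives in the `align` attribute on the pointer operand):

```python
import re

# Match a 4-operand @llvm.memset call up to its trailing "i1 <isvolatile>".
FOUR_ARG = re.compile(
    r"(@llvm\.memset\.p0i8\.i(?:32|64)\(.+, )i1 (?P<vol>true|false)\)"
)

def downgrade_memset(line, align=0):
    """Insert an explicit 'i32 <align>' operand before the isvolatile flag,
    producing the 5-operand form expected by an LLVM 3.4-based parser."""
    return FOUR_ARG.sub(
        lambda m: f"{m.group(1)}i32 {align}, i1 {m.group('vol')})", line
    )

call = ("call void @llvm.memset.p0i8.i64("
        "i8* nonnull align 4 %0, i8 0, i64 32, i1 false)")
print(downgrade_memset(call, align=4))
# call void @llvm.memset.p0i8.i64(i8* nonnull align 4 %0, i8 0, i64 32, i32 4, i1 false)
```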

