Numba CUDA kernel very slow compared to CuPy fuse
Using the Numba vectorize decorator, I defined an element-wise function for a CUDA device:
@vectorize([types.float32(types.float32, types.float32, types.float32)], target="cuda")
def nb_function(xx, yy, xy):
    sqrt_term = math.sqrt(max(0., xx * xx - 2. * xx * yy + 4. * xy * xy + yy * yy))
    return .5 * (xx + yy - sqrt_term)
Being just a simple element-wise function, the definition is pretty much the same in CuPy:
@cp.fuse
def cp_function(xx, yy, xy):
    sqrt_term = cp.sqrt(cp.maximum(0., xx * xx - 2. * xx * yy + 4. * xy * xy + yy * yy))
    return .5 * (xx + yy - sqrt_term)
Oddly enough, the Numba kernel is between 5x and 10x slower than the CuPy kernel (pure kernel execution time, no memory allocations/transfers). Timings with CUDA 10.2 on an NVIDIA 2070, array size 4096x4096:
- Numba: 5.1 ms
- CuPy: 0.8 ms
I also tried implementing the method with numba.cuda.jit (manual thread/block handling, roughly as in the sketch below), but the timing was almost identical to the vectorize version.
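For reference, here is a minimal sketch of what such a manual kernel could look like; the kernel name, the 16x16 block size, and the launch line are illustrative assumptions, not taken from the original benchmark:

from numba import cuda
import math

# Hypothetical manual-kernel equivalent of nb_function: thread/block
# handling written out explicitly instead of relying on @vectorize.
@cuda.jit
def nb_kernel(xx, yy, xy, out):
    i, j = cuda.grid(2)  # absolute (row, col) index of this thread
    if i < out.shape[0] and j < out.shape[1]:
        x, y, z = xx[i, j], yy[i, j], xy[i, j]
        sqrt_term = math.sqrt(max(0., x * x - 2. * x * y + 4. * z * z + y * y))
        out[i, j] = .5 * (x + y - sqrt_term)

# Example launch for 4096x4096 arrays: 16x16 threads per block.
# threads = (16, 16)
# blocks = (4096 // threads[0], 4096 // threads[1])
# nb_kernel[blocks, threads](a1, a2, a3, a_out)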
Does this indicate that something “bad” is happening to the element-wise code that is generated for the CUDA compiler? Naively I wouldn’t expect any significant difference.
Here is the full benchmark code for Numba and CuPy:
# Numba version
import math
import numpy as np
from numba import cuda, vectorize, types
@vectorize([types.float32(types.float32, types.float32, types.float32)], target="cuda")
def nb_function(xx, yy, xy):
    sqrt_term = math.sqrt(max(0., xx * xx - 2. * xx * yy + 4. * xy * xy + yy * yy))
    return .5 * (xx + yy - sqrt_term)
sz = 4096
a1, a2, a3, a_out = (cuda.device_array((sz, sz), dtype=np.float32) for _ in range(4))
# Warmup
for _ in range(3):
    nb_function(a1, a2, a3, out=a_out)
# Timeit
e1, e2, stream = cuda.event(), cuda.event(), cuda.stream()
e1.record(stream)
nb_function(a1, a2, a3, out=a_out, stream=stream)
e2.record(stream)
e2.synchronize()
print(f"Numba: {e1.elapsed_time(e2):.2f} ms")
# CuPy version
import numpy as np
import cupy as cp
@cp.fuse
def cp_function(xx, yy, xy):
    sqrt_term = cp.sqrt(cp.maximum(0., xx * xx - 2. * xx * yy + 4. * xy * xy + yy * yy))
    return .5 * (xx + yy - sqrt_term)
sz = 4096
a1, a2, a3 = (cp.empty((sz, sz), dtype=np.float32) for _ in range(3))
# Warmup
for _ in range(3):
    cp_function(a1, a2, a3)
# Timeit
e1, e2, stream = cp.cuda.Event(), cp.cuda.Event(), cp.cuda.Stream()
with stream:
    e1.record(stream)
    cp_function(a1, a2, a3)
    e2.record(stream)
e2.synchronize()
print(f"Cupy: {cp.cuda.get_elapsed_time(e1, e2):.2f} ms")
All right, here is what fuse is doing: the expression is transformed into roughly 100 lines of code, with every operation explicitly written out and assigned to a temporary variable, one by one. In addition, every value that goes into or out of each intermediate operation is wrapped in static_cast<float>. After burning some time trying to find a deeper meaning behind the specific order the operations were transformed into… the solution was dead simple:
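(The code block with the fix did not survive here; the following is a sketch of what “typecast your constants” means in this context, assuming the fix simply rewrites the Python float literals as float32 constants, e.g. via numpy.float32 — the exact form of the original change may differ.)

import math
import numpy as np
from numba import vectorize, types

# Float32 constants instead of Python float literals (which are float64),
# so the whole expression stays in single precision on the device.
C0, C2, C4, C05 = (np.float32(v) for v in (0., 2., 4., .5))

@vectorize([types.float32(types.float32, types.float32, types.float32)], target="cuda")
def nb_function(xx, yy, xy):
    sqrt_term = math.sqrt(max(C0, xx * xx - C2 * xx * yy + C4 * xy * xy + yy * yy))
    return C05 * (xx + yy - sqrt_term)

With the constants typed as float32, no intermediate value is promoted to double precision.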
Same execution time. Conclusion: always typecast your constants!
I suppose there isn’t an easy way to prevent such accidents? Perhaps by requiring constants to be typed?
@pwuertz thanks for following up and closing this issue!