
Numba CUDA kernel very slow compared to CuPy fuse

See original GitHub issue

Using the Numba vectorize decorator I defined an element-wise function for a CUDA device:

@vectorize([types.float32(types.float32, types.float32, types.float32)], target="cuda")
def nb_function(xx, yy, xy):
    sqrt_term = math.sqrt(max(0., xx * xx - 2. * xx * yy + 4. * xy * xy + yy * yy))
    return .5 * (xx + yy - sqrt_term)

Since this is just a simple element-wise function, the definition is pretty much the same in CuPy:

@cp.fuse
def cp_function(xx, yy, xy):
    sqrt_term = cp.sqrt(cp.maximum(0., xx * xx - 2. * xx * yy + 4. * xy * xy + yy * yy))
    return .5 * (xx + yy - sqrt_term)

Oddly enough, the Numba kernel is between 5x and 10x slower than the CuPy kernel (pure kernel execution time, no memory allocations/transfers). Timings with CUDA 10.2, NVIDIA 2070, size 4096x4096:

  • Numba: 5.1 ms
  • CuPy: 0.8 ms

I tried implementing the method with numba.cuda.jit (manual thread / block handling), but the timing was almost identical to the vectorize version.
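
For reference, here is a minimal sketch of what such a numba.cuda.jit kernel might look like (the manual version mentioned above is not included in the issue; the kernel body and the grid/block configuration here are illustrative):

import math
import numpy as np
from numba import cuda


@cuda.jit
def nb_kernel(xx, yy, xy, out):
    # One thread per output element, addressed via a 2D grid.
    i, j = cuda.grid(2)
    if i < out.shape[0] and j < out.shape[1]:
        a, b, c = xx[i, j], yy[i, j], xy[i, j]
        sqrt_term = math.sqrt(max(0., a * a - 2. * a * b + 4. * c * c + b * b))
        out[i, j] = .5 * (a + b - sqrt_term)


sz = 4096
a1, a2, a3, a_out = (cuda.device_array((sz, sz), dtype=np.float32) for _ in range(4))
threads = (16, 16)
blocks = ((sz + threads[0] - 1) // threads[0], (sz + threads[1] - 1) // threads[1])
nb_kernel[blocks, threads](a1, a2, a3, a_out)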

Does this indicate that something “bad” is happening to the element-wise code that is generated for the CUDA compiler? Naively, I wouldn’t expect any significant difference.

Here is the full benchmark code for Numba and CuPy:

# Numba version
import math
import numpy as np
from numba import cuda, vectorize, types


@vectorize([types.float32(types.float32, types.float32, types.float32)], target="cuda")
def nb_function(xx, yy, xy):
    sqrt_term = math.sqrt(max(0., xx * xx - 2. * xx * yy + 4. * xy * xy + yy * yy))
    return .5 * (xx + yy - sqrt_term)


sz = 4096
a1, a2, a3, a_out = (cuda.device_array((sz, sz), dtype=np.float32) for _ in range(4))

# Warmup
for _ in range(3):
    nb_function(a1, a2, a3, out=a_out)

# Timeit
e1, e2, stream = cuda.event(), cuda.event(), cuda.stream()
e1.record(stream)
nb_function(a1, a2, a3, out=a_out, stream=stream)
e2.record(stream)
e2.synchronize()
print(f"Numba: {e1.elapsed_time(e2):.2f} ms")


# CuPy version
import numpy as np
import cupy as cp


@cp.fuse
def cp_function(xx, yy, xy):
    sqrt_term = cp.sqrt(cp.maximum(0., xx * xx - 2. * xx * yy + 4. * xy * xy + yy * yy))
    return .5 * (xx + yy - sqrt_term)

sz = 4096
a1, a2, a3 = (cp.empty((sz, sz), dtype=np.float32) for _ in range(3))

# Warmup
for _ in range(3):
    cp_function(a1, a2, a3)

# Timeit
e1, e2, stream = cp.cuda.Event(), cp.cuda.Event(), cp.cuda.Stream()
with stream:
    e1.record(stream)
    cp_function(a1, a2, a3)
    e2.record(stream)
e2.synchronize()
print(f"Cupy: {cp.cuda.get_elapsed_time(e1, e2):.2f} ms")

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 7 (7 by maintainers)

Top GitHub Comments

1 reaction
pwuertz commented, May 22, 2020

All right, here is what fuse is doing: the expression is transformed into ~100 lines of code, with every operation explicitly written out and assigned to a temporary variable, one by one. Also, every single value that goes into or out of each intermediate operation is wrapped in a static_cast<float>.
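
To illustrate, here is a rough Python-level analogue of that structure (illustrative only; the actual output is generated CUDA C, and the temporaries shown are abbreviated):

# Rough Python-level sketch of the code cp.fuse generates: one operation per
# temporary, with every value cast back to float32 (the real output wraps each
# value in static_cast<float>).
import numpy as np

def fused_like(xx, yy, xy):
    f32 = np.float32
    t0 = f32(xx * xx)
    t1 = f32(f32(2.) * xx * yy)
    t2 = f32(f32(4.) * xy * xy)
    t3 = f32(yy * yy)
    t4 = f32(t0 - t1 + t2 + t3)
    t5 = f32(np.maximum(f32(0.), t4))
    t6 = f32(np.sqrt(t5))
    return f32(f32(.5) * (xx + yy - t6))

xx = yy = xy = np.float32(1.0)
print(fused_like(xx, yy, xy))  # stays float32 end to end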

After burning some time trying to figure out whether there was a deeper meaning behind the specific order in which the operations were transformed… the solution was dead simple:

import numba as nb  # needed for the nb.float32 casts below

@vectorize([types.float32(types.float32, types.float32, types.float32)], target="cuda")
def nb_function(xx, yy, xy):
    sqrt_term = math.sqrt(max(nb.float32(0.), xx * xx - nb.float32(2.) * xx * yy + nb.float32(4.) * xy * xy + yy * yy))
    return nb.float32(.5) * (xx + yy - sqrt_term)

Same execution time. Conclusion: always typecast your constants!

I suppose there isn’t an easy way to prevent such accidents? Require constants to be typed?
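
A note on why the typed constants matter (an explanation assumed here, not spelled out in the thread): Numba types a bare literal such as 2. as float64, so the whole expression is promoted to float64, and GeForce-class GPUs execute float64 arithmetic at a small fraction of their float32 throughput. A quick CPU-side sketch of the promotion, using numba.njit:

# Illustration of the type promotion (an assumed explanation, not taken from
# the thread): a bare Python literal is typed as float64 by Numba, promoting
# the float32 input, whereas a typed constant keeps the result in float32.
import numpy as np
from numba import njit


@njit
def promoted(x):
    return x * 2.               # literal typed as float64 -> float64 result


@njit
def kept(x):
    return x * np.float32(2.)   # typed constant -> result stays float32


x = np.float32(1.5)
promoted(x)
kept(x)
print(promoted.nopython_signatures)  # [(float32,) -> float64]
print(kept.nopython_signatures)      # [(float32,) -> float32]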

0 reactions
esc commented, May 25, 2020

@pwuertz thanks for following up and closing this issue!

Read more comments on GitHub >
