Numba CUDA kernel very slow compared to CuPy fuse
Using the Numba vectorize decorator, I defined an element-wise function for a CUDA device:
@vectorize([types.float32(types.float32, types.float32, types.float32)], target="cuda")
def nb_function(xx, yy, xy):
    sqrt_term = math.sqrt(max(0., xx * xx - 2. * xx * yy + 4. * xy * xy + yy * yy))
    return .5 * (xx + yy - sqrt_term)
Being just a simple element-wise function, the definition is pretty much the same in CuPy:
@cp.fuse
def cp_function(xx, yy, xy):
    sqrt_term = cp.sqrt(cp.maximum(0., xx * xx - 2. * xx * yy + 4. * xy * xy + yy * yy))
    return .5 * (xx + yy - sqrt_term)
Oddly enough, the Numba kernel is between 5x and 10x slower than the CuPy kernel (pure kernel execution time, no memory allocations/transfers). Timings with CUDA 10.2 on an NVIDIA 2070, array size 4096x4096:
- Numba: 5.1 ms
- CuPy: 0.8 ms
I also tried implementing the method with numba.cuda.jit (manual thread/block handling, roughly as in the sketch below), but the timing was almost identical to the vectorize version.
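For reference, here is a minimal sketch of what such a manual kernel could look like; the kernel name, the 16x16 block size, and the launch line are illustrative assumptions, not taken from the original benchmark:

from numba import cuda
import math

# Hypothetical manual-kernel equivalent of nb_function: thread/block
# handling written out explicitly instead of relying on @vectorize.
@cuda.jit
def nb_kernel(xx, yy, xy, out):
    i, j = cuda.grid(2)  # absolute (row, col) index of this thread
    if i < out.shape[0] and j < out.shape[1]:
        x, y, z = xx[i, j], yy[i, j], xy[i, j]
        sqrt_term = math.sqrt(max(0., x * x - 2. * x * y + 4. * z * z + y * y))
        out[i, j] = .5 * (x + y - sqrt_term)

# Example launch for 4096x4096 arrays: 16x16 threads per block.
# threads = (16, 16)
# blocks = (4096 // threads[0], 4096 // threads[1])
# nb_kernel[blocks, threads](a1, a2, a3, a_out)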
Does this indicate that something “bad” is happening to the element-wise code that is generated for the CUDA compiler? Naively I wouldn’t expect any significant difference.
Here is the full benchmark code for Numba and CuPy:
# Numba version
import math
import numpy as np
from numba import cuda, vectorize, types
@vectorize([types.float32(types.float32, types.float32, types.float32)], target="cuda")
def nb_function(xx, yy, xy):
    sqrt_term = math.sqrt(max(0., xx * xx - 2. * xx * yy + 4. * xy * xy + yy * yy))
    return .5 * (xx + yy - sqrt_term)
sz = 4096
a1, a2, a3, a_out = (cuda.device_array((sz, sz), dtype=np.float32) for _ in range(4))
# Warmup
for _ in range(3):
    nb_function(a1, a2, a3, out=a_out)
# Timeit
e1, e2, stream = cuda.event(), cuda.event(), cuda.stream()
e1.record(stream)
nb_function(a1, a2, a3, out=a_out, stream=stream)
e2.record(stream)
e2.synchronize()
print(f"Numba: {e1.elapsed_time(e2):.2f} ms")
# CuPy version
import numpy as np
import cupy as cp
@cp.fuse
def cp_function(xx, yy, xy):
    sqrt_term = cp.sqrt(cp.maximum(0., xx * xx - 2. * xx * yy + 4. * xy * xy + yy * yy))
    return .5 * (xx + yy - sqrt_term)
sz = 4096
a1, a2, a3 = (cp.empty((sz, sz), dtype=np.float32) for _ in range(3))
# Warmup
for _ in range(3):
    cp_function(a1, a2, a3)
# Timeit
e1, e2, stream = cp.cuda.Event(), cp.cuda.Event(), cp.cuda.Stream()
with stream:
    e1.record(stream)
    cp_function(a1, a2, a3)
    e2.record(stream)
e2.synchronize()
print(f"Cupy: {cp.cuda.get_elapsed_time(e1, e2):.2f} ms")
All right, here is what fuse is doing: the expression is transformed into roughly 100 lines of code, with every operation explicitly written out and assigned to a temporary variable, one by one. In addition, every value that goes into or out of each intermediate operation is wrapped in static_cast<float>. After burning some time trying to find a deeper meaning behind the specific order the operations were transformed into… the solution was dead simple:
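(The code block with the fix did not survive here; the following is a sketch of what “typecast your constants” means in this context, assuming the fix simply rewrites the Python float literals as float32 constants, e.g. via numpy.float32 — the exact form of the original change may differ.)

import math
import numpy as np
from numba import vectorize, types

# Float32 constants instead of Python float literals (which are float64),
# so the whole expression stays in single precision on the device.
C0, C2, C4, C05 = (np.float32(v) for v in (0., 2., 4., .5))

@vectorize([types.float32(types.float32, types.float32, types.float32)], target="cuda")
def nb_function(xx, yy, xy):
    sqrt_term = math.sqrt(max(C0, xx * xx - C2 * xx * yy + C4 * xy * xy + yy * yy))
    return C05 * (xx + yy - sqrt_term)

With the constants typed as float32, no intermediate value is promoted to double precision.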
Same execution time. Conclusion: always typecast your constants!
I suppose there isn’t an easy way to prevent such accidents? Perhaps by requiring constants to be typed?
@pwuertz thanks for following up and closing this issue!