`cupy.convolve` is slow compared to `cupyx.scipy.ndimage.convolve1d`
See original GitHub issue.

- Conditions:
CuPy Version : 8.0.0b4
CUDA Root : /usr/local/cuda
CUDA Build Version : 10020
CUDA Driver Version : 10020
CUDA Runtime Version : 10020
cuBLAS Version : 10202
cuFFT Version : 10102
cuRAND Version : 10102
cuSOLVER Version : (10, 3, 0)
cuSPARSE Version : 10301
NVRTC Version : (10, 2)
Thrust Version : 100907
cuDNN Build Version : 7605
cuDNN Version : 7605
NCCL Build Version : 2604
NCCL Runtime Version : 2604
CUB Version : None
cuTENSOR Version : 10001
- Code to reproduce:
```python
import cupy
import cupyx.scipy.ndimage

for a_len in [1000, 10000, 100000]:
    for v_len in [10, 25, 100, 1000]:
        print(a_len, v_len)
        a = cupy.random.rand(a_len)
        v = cupy.random.rand(v_len)
        %timeit cupy.convolve(a, v, 'same'); cupy.cuda.Stream.null.synchronize()
        %timeit cupyx.scipy.ndimage.convolve1d(a, v, mode='constant'); cupy.cuda.Stream.null.synchronize()
```
- Generates the timings (a few extra rows were added beyond the loop above):

| a_len | v_len | `cupy.convolve` | `ndimage.convolve1d` |
|---|---|---|---|
| 1000 | 10 | 3.7 ms | 43 µs |
| 1000 | 25 | 878 µs | 45 µs |
| 1000 | 100 | 3.8 ms | 56 µs |
| 1000 | 1000 | 3.8 ms | 171 µs |
| 10000 | 10 | 1.9 ms | 43 µs |
| 10000 | 25 | 1.8 ms | 46 µs |
| 10000 | 100 | 1.9 ms | 57 µs |
| 10000 | 1000 | 2 ms | 173 µs |
| 10000 | 10000 | 2.7 ms | 1.4 ms |
| 100000 | 10 | 4.7 ms | 53 µs |
| 100000 | 25 | 5.1 ms | 57 µs |
| 100000 | 100 | 4.7 ms | 73 µs |
| 100000 | 1000 | 4.7 ms | 234 µs |
| 100000 | 10000 | 4.4 ms | 1.8 ms |
| 100000 | 100000 | 6 ms | 17 ms |
Except for the last entry, `cupyx.scipy.ndimage.convolve1d` is one to two orders of magnitude faster than `cupy.convolve` (a few cases are only about twice as fast). The last case, however, is actually much faster with `cupy.convolve`.

So it seems that, except for some very large inputs and kernels, `cupy.convolve` is very slow. I think that transition could be taken care of with `cupyx.scipy.signal.choose_conv_method`, so that `'fft'` is used for the large inputs.
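The size-based dispatch being proposed can be sketched on the CPU with NumPy (the CUDA version would follow the same shape). This is a minimal illustration, not CuPy's implementation; the `DIRECT_THRESHOLD` value is a made-up placeholder, since the real crossover point depends on the hardware, and it assumes `len(a) >= len(v)`.

```python
import numpy as np

# Hypothetical crossover point; the real value would have to be benchmarked.
DIRECT_THRESHOLD = 10_000

def convolve_auto(a, v, mode='full'):
    """Pick direct vs. FFT convolution by input size (NumPy sketch,
    assuming len(a) >= len(v))."""
    if len(a) * len(v) < DIRECT_THRESHOLD:
        # Small problem: the direct O(len(a) * len(v)) method wins.
        return np.convolve(a, v, mode)
    # Large problem: convolve via pointwise product in the frequency domain.
    n = len(a) + len(v) - 1
    full = np.fft.irfft(np.fft.rfft(a, n) * np.fft.rfft(v, n), n)
    if mode == 'full':
        return full
    if mode == 'same':
        start = (len(v) - 1) // 2          # centered slice of the full output
        return full[start:start + len(a)]
    if mode == 'valid':
        return full[len(v) - 1:len(a)]     # fully-overlapped region only
    raise ValueError(mode)
```

Both branches produce the same values (up to floating-point error), which is what lets a `choose_conv_method`-style helper switch freely between them.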
Note that this does not test the other modes of `cupy.convolve`. The `'valid'` mode should also be faster with ndimage, since it could be implemented by slicing the output. The `'full'` mode will be closer in timings and is the main drawback of the ndimage version, since it would require padding the input ahead of time (incurring a large duplication of data). However, some adjustments to the algorithm could be made to pad while computing, which would then accommodate that issue.
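The two mode tricks mentioned above are standard convolution identities, shown here with NumPy as a stand-in for the GPU kernels: `'valid'` is just a contiguous slice of the `'full'` output, and `'full'` can be obtained by zero-padding the input ahead of time and running a `'valid'` convolution (which is the data-duplication cost being discussed). Assumes `len(a) >= len(v)`.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.random(1000)
v = rng.random(25)

# 'valid' by slicing: keep only the fully-overlapped part of 'full'.
full = np.convolve(a, v, 'full')
valid_via_slice = full[len(v) - 1:len(a)]

# 'full' by padding: zero-pad a by len(v)-1 on each side, then 'valid'.
pad = len(v) - 1
a_padded = np.pad(a, pad)                      # the duplicated/padded copy
full_via_pad = np.convolve(a_padded, v, 'valid')
```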
Issue Analytics
- Created: 3 years ago
- Reactions: 2
- Comments: 8 (8 by maintainers)
Top GitHub Comments
Agreed that when both inputs are large, the “dot” approach is slow and an FFT-based approach will be faster. For ndimage applications it is typical to have a “large” image and a “small” filter kernel, so the dot approach seems appropriate for `cupyx.scipy.ndimage`, but I agree that dynamically choosing seems preferable for the general `cupy.convolve`.

There are two issues to resolve these performance differences:

1. `cupy.convolve` always uses `_fft_convolve` for float inputs and `_dot_convolve` for integer inputs, but it should switch between a dot convolution kernel and FFT based on the input sizes, as @leofang commented in https://github.com/cupy/cupy/issues/3526#issuecomment-653139480.
2. `cupyx.scipy.ndimage.convolve1d` has only a dot convolution kernel, so it is slow for large inputs.
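For reference, the “dot” approach the comment contrasts with FFT is the naive shifted-multiply-accumulate. A minimal NumPy sketch (illustrative only; the actual `_dot_convolve` and ndimage kernels are CUDA code, and the name `dot_convolve_full` here is made up) makes the O(len(a) × len(v)) cost visible, which is why it loses to the FFT once both inputs are large:

```python
import numpy as np

def dot_convolve_full(a, v):
    """Naive 'dot' convolution in 'full' mode: each filter tap
    contributes a scaled, shifted copy of the signal, so the total
    work is len(a) * len(v) multiply-adds."""
    n, m = len(a), len(v)
    out = np.zeros(n + m - 1)
    for i in range(m):
        out[i:i + n] += v[i] * a
    return out
```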