`cupy.convolve` is slow compared to `cupyx.scipy.ndimage.convolve1d`
See original GitHub issue.

- Conditions:
CuPy Version : 8.0.0b4
CUDA Root : /usr/local/cuda
CUDA Build Version : 10020
CUDA Driver Version : 10020
CUDA Runtime Version : 10020
cuBLAS Version : 10202
cuFFT Version : 10102
cuRAND Version : 10102
cuSOLVER Version : (10, 3, 0)
cuSPARSE Version : 10301
NVRTC Version : (10, 2)
Thrust Version : 100907
cuDNN Build Version : 7605
cuDNN Version : 7605
NCCL Build Version : 2604
NCCL Runtime Version : 2604
CUB Version : None
cuTENSOR Version : 10001
- Code to reproduce:
```python
import cupy
import cupyx.scipy.ndimage

for a_len in [1000, 10000, 100000]:
    for v_len in [10, 25, 100, 1000]:
        print(a_len, v_len)
        a = cupy.random.rand(a_len)
        v = cupy.random.rand(v_len)
        %timeit cupy.convolve(a, v, 'same'); cupy.cuda.Stream.null.synchronize()
        %timeit cupyx.scipy.ndimage.convolve1d(a, v, mode='constant'); cupy.cuda.Stream.null.synchronize()
```
- Generates the timings (a few extra rows were added beyond the loop above):

| a_len | v_len | `cupy.convolve` | `ndimage.convolve1d` |
|---|---|---|---|
| 1000 | 10 | 3.7 ms | 43 µs |
| 1000 | 25 | 878 µs | 45 µs |
| 1000 | 100 | 3.8 ms | 56 µs |
| 1000 | 1000 | 3.8 ms | 171 µs |
| 10000 | 10 | 1.9 ms | 43 µs |
| 10000 | 25 | 1.8 ms | 46 µs |
| 10000 | 100 | 1.9 ms | 57 µs |
| 10000 | 1000 | 2 ms | 173 µs |
| 10000 | 10000 | 2.7 ms | 1.4 ms |
| 100000 | 10 | 4.7 ms | 53 µs |
| 100000 | 25 | 5.1 ms | 57 µs |
| 100000 | 100 | 4.7 ms | 73 µs |
| 100000 | 1000 | 4.7 ms | 234 µs |
| 100000 | 10000 | 4.4 ms | 1.8 ms |
| 100000 | 100000 | 6 ms | 17 ms |
Except for the last entry, `cupyx.scipy.ndimage.convolve1d` is one to two orders of magnitude faster than `cupy.convolve` (a few cases are only about twice as fast). The last case, however, is actually much faster with `cupy.convolve`.

So it seems that, except for some very large inputs and kernels, `cupy.convolve` is very slow. I think that transition could be taken care of with `cupyx.scipy.signal.choose_conv_method`, so that `'fft'` is used for the large inputs.
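The size-based dispatch being proposed can be sketched on the CPU with NumPy (the CUDA version would follow the same shape). This is a minimal illustration, not CuPy's implementation; the `DIRECT_THRESHOLD` value is a made-up placeholder, since the real crossover point depends on the hardware, and it assumes `len(a) >= len(v)`.

```python
import numpy as np

# Hypothetical crossover point; the real value would have to be benchmarked.
DIRECT_THRESHOLD = 10_000

def convolve_auto(a, v, mode='full'):
    """Pick direct vs. FFT convolution by input size (NumPy sketch,
    assuming len(a) >= len(v))."""
    if len(a) * len(v) < DIRECT_THRESHOLD:
        # Small problem: the direct O(len(a) * len(v)) method wins.
        return np.convolve(a, v, mode)
    # Large problem: convolve via pointwise product in the frequency domain.
    n = len(a) + len(v) - 1
    full = np.fft.irfft(np.fft.rfft(a, n) * np.fft.rfft(v, n), n)
    if mode == 'full':
        return full
    if mode == 'same':
        start = (len(v) - 1) // 2          # centered slice of the full output
        return full[start:start + len(a)]
    if mode == 'valid':
        return full[len(v) - 1:len(a)]     # fully-overlapped region only
    raise ValueError(mode)
```

Both branches produce the same values (up to floating-point error), which is what lets a `choose_conv_method`-style helper switch freely between them.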
Note that this does not test the other modes of `cupy.convolve`. The `'valid'` mode should also be faster with ndimage, since it could be implemented by slicing the output. The `'full'` mode will be closer in timings and is the main drawback of the ndimage version, since it would require padding the input ahead of time (incurring a large duplication of data). However, some adjustments to the algorithm could be made to pad while computing, which would then accommodate that issue.
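The two mode tricks mentioned above are standard convolution identities, shown here with NumPy as a stand-in for the GPU kernels: `'valid'` is just a contiguous slice of the `'full'` output, and `'full'` can be obtained by zero-padding the input ahead of time and running a `'valid'` convolution (which is the data-duplication cost being discussed). Assumes `len(a) >= len(v)`.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.random(1000)
v = rng.random(25)

# 'valid' by slicing: keep only the fully-overlapped part of 'full'.
full = np.convolve(a, v, 'full')
valid_via_slice = full[len(v) - 1:len(a)]

# 'full' by padding: zero-pad a by len(v)-1 on each side, then 'valid'.
pad = len(v) - 1
a_padded = np.pad(a, pad)                      # the duplicated/padded copy
full_via_pad = np.convolve(a_padded, v, 'valid')
```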
Issue Analytics
- Created: 3 years ago
- Reactions: 2
- Comments: 8 (8 by maintainers)
Top GitHub Comments
Agreed that when both inputs are large, the “dot” approach is slow and an FFT-based approach will be faster. For ndimage applications it is typical to have a “large” image and a “small” filter kernel, so the dot approach seems appropriate for `cupyx.scipy.ndimage`, but I agree that dynamically choosing seems preferable for the general `cupy.convolve`.

There are two issues to resolve these performance differences:

1. `cupy.convolve` always uses `_fft_convolve` for float inputs and `_dot_convolve` for integer inputs, but it should switch between a dot convolution kernel and FFT based on the input sizes, as @leofang commented in https://github.com/cupy/cupy/issues/3526#issuecomment-653139480.
2. `cupyx.scipy.ndimage.convolve1d` has only a dot convolution kernel, so it is slow for large inputs.
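For reference, the “dot” approach the comment contrasts with FFT is the naive shifted-multiply-accumulate. A minimal NumPy sketch (illustrative only; the actual `_dot_convolve` and ndimage kernels are CUDA code, and the name `dot_convolve_full` here is made up) makes the O(len(a) × len(v)) cost visible, which is why it loses to the FFT once both inputs are large:

```python
import numpy as np

def dot_convolve_full(a, v):
    """Naive 'dot' convolution in 'full' mode: each filter tap
    contributes a scaled, shifted copy of the signal, so the total
    work is len(a) * len(v) multiply-adds."""
    n, m = len(a), len(v)
    out = np.zeros(n + m - 1)
    for i in range(m):
        out[i:i + n] += v[i] * a
    return out
```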