`cupy.convolve` is slow compared to `cupyx.scipy.ndimage.convolve1d`

  • Conditions:
CuPy Version          : 8.0.0b4
CUDA Root             : /usr/local/cuda
CUDA Build Version    : 10020
CUDA Driver Version   : 10020
CUDA Runtime Version  : 10020
cuBLAS Version        : 10202
cuFFT Version         : 10102
cuRAND Version        : 10102
cuSOLVER Version      : (10, 3, 0)
cuSPARSE Version      : 10301
NVRTC Version         : (10, 2)
Thrust Version        : 100907
cuDNN Build Version   : 7605
cuDNN Version         : 7605
NCCL Build Version    : 2604
NCCL Runtime Version  : 2604
CUB Version           : None
cuTENSOR Version      : 10001
  • Code to reproduce:
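# Run in IPython or Jupyter (%timeit is an IPython magic).
import cupy
import cupyx.scipy.ndimage

for a_len in [1000, 10000, 100000]:
    for v_len in [10, 25, 100, 1000]:
        print(a_len, v_len)
        a = cupy.random.rand(a_len)
        v = cupy.random.rand(v_len)
        # Synchronize inside the timed statement so the GPU work is included.
        %timeit cupy.convolve(a, v, 'same'); cupy.cuda.Stream.null.synchronize()
        %timeit cupyx.scipy.ndimage.convolve1d(a, v, mode='constant'); cupy.cuda.Stream.null.synchronize()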
  • Resulting timings (a few extra size combinations were added):
a_len    v_len    cupy.convolve   ndimage.convolve1d
1000     10       3.7 ms          43 µs
1000     25       878 µs          45 µs
1000     100      3.8 ms          56 µs
1000     1000     3.8 ms          171 µs
10000    10       1.9 ms          43 µs
10000    25       1.8 ms          46 µs
10000    100      1.9 ms          57 µs
10000    1000     2 ms            173 µs
10000    10000    2.7 ms          1.4 ms
100000   10       4.7 ms          53 µs
100000   25       5.1 ms          57 µs
100000   100      4.7 ms          73 µs
100000   1000     4.7 ms          234 µs
100000   10000    4.4 ms          1.8 ms
100000   100000   6 ms            17 ms

Except for the last few entries, cupyx.scipy.ndimage.convolve1d is one to two orders of magnitude faster than cupy.convolve; with the 10000-element kernel it is only about twice as fast. In the last case (both inputs of length 100000), however, cupy.convolve is actually faster (6 ms vs. 17 ms).

So it seems that, except for very large inputs and kernels, cupy.convolve is very slow. I think that transition could be handled with cupyx.scipy.signal.choose_conv_method, so that 'fft' is used only for the large inputs.
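As a rough illustration of that kind of switch, here is a minimal sketch that picks between a direct and an FFT-based convolution from the input sizes. It uses NumPy on the CPU as a stand-in so it runs without a GPU. The names `DIRECT_WORK_LIMIT`, `pick_method`, `fft_convolve_full` and `convolve_auto` are hypothetical, the cutoff value is a placeholder that would have to be tuned by benchmarking, and none of this is meant to describe how cupy.convolve or choose_conv_method is actually implemented.

```python
import numpy as np  # CPU stand-in for illustration only

# Hypothetical cutoff on len(a) * len(v); a placeholder, not a measured value.
DIRECT_WORK_LIMIT = 10**7


def fft_convolve_full(a, v):
    """Full linear convolution of two real 1-D arrays via zero-padded real FFTs."""
    n = len(a) + len(v) - 1
    return np.fft.irfft(np.fft.rfft(a, n) * np.fft.rfft(v, n), n)


def pick_method(a_len, v_len, limit=DIRECT_WORK_LIMIT):
    """Return 'direct' when the multiply-add count is small, otherwise 'fft'."""
    return "direct" if a_len * v_len <= limit else "fft"


def convolve_auto(a, v, limit=DIRECT_WORK_LIMIT):
    """Full convolution that dispatches on the input sizes."""
    if pick_method(len(a), len(v), limit) == "direct":
        return np.convolve(a, v, mode="full")
    return fft_convolve_full(a, v)
```

Both branches compute the same 'full' result up to floating-point rounding, so the choice only affects speed. A size product is the simplest possible heuristic; a measured crossover per device would likely be more reliable.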

Note that this does not test the other modes of cupy.convolve. The 'valid' mode should also be faster with ndimage, since it could be implemented by slicing the output. The 'full' mode will be closer in timing and is the main drawback of the ndimage version, since it would require padding the input ahead of time (incurring a large duplication of data). However, the algorithm could be adjusted to pad while computing, which would address that issue.
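The index arithmetic behind the 'valid' and 'full' remarks can be sketched with numpy.convolve on the CPU, assumed here as a stand-in with the same mode semantics as cupy.convolve. The helper names are hypothetical, the sketch assumes len(a) >= len(v), and it pads eagerly, which is exactly the data duplication mentioned above.

```python
import numpy as np  # CPU stand-in for illustration only


def valid_from_same(a, v):
    """Recover the 'valid' result by slicing a 'same' result (needs len(a) >= len(v))."""
    n, m = len(a), len(v)
    same = np.convolve(a, v, mode="same")
    # 'same' starts (m - 1) // 2 samples into 'full'; 'valid' starts m - 1 samples in.
    start = m // 2
    return same[start:start + n - m + 1]


def full_from_same(a, v):
    """Recover the 'full' result by zero-padding the input before a 'same' convolution."""
    m = len(v)
    # m - 1 zeros in total, split so the 'same' window lines up with 'full'.
    padded = np.pad(a, ((m - 1) // 2, m // 2))
    return np.convolve(padded, v, mode="same")
```

The slice for 'valid' touches no extra memory, while the padded copy for 'full' grows with the kernel length, which is why padding on the fly inside the kernel would be the more attractive option.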

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Reactions:2
  • Comments:8 (8 by maintainers)

Top GitHub Comments

3 reactions
grlee77 commented, Jul 7, 2020

Quoting asi1024: "cupyx.scipy.ndimage.convolve1d has only dot convolution kernel. So it is slow for large inputs."

Agreed that when both inputs are large, the “dot” approach is slow and an FFT-based approach will be faster. For ndimage applications it is typical that you have a “large” image and a “small” filter kernel so the dot approach seems appropriate for cupyx.scipy.ndimage, but I agree that dynamically choosing seems preferable for general cupy.convolve.

2 reactions
asi1024 commented, Jul 7, 2020

Two issues need to be resolved to address these performance differences.

  1. Currently, cupy.convolve always uses _fft_convolve for float inputs and _dot_convolve for integer inputs, but it should switch between a dot convolution kernel and FFT based on the input sizes, as @leofang commented in https://github.com/cupy/cupy/issues/3526#issuecomment-653139480.
  2. cupyx.scipy.ndimage.convolve1d has only a dot convolution kernel, so it is slow for large inputs.
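
To see why the input sizes matter for both points, a back-of-the-envelope operation count can help. The sketch below assumes the textbook cost models (one multiply-add per input/kernel pair for the dot approach, three transforms of length n = a_len + v_len - 1 at about n * log2(n) each for FFT). The function names are hypothetical, and constant factors, memory traffic and kernel launch overhead are all ignored, so this only indicates the trend, not actual timings.

```python
import math


def direct_ops(a_len, v_len):
    """Rough multiply-add count for a dot (direct) convolution."""
    return a_len * v_len


def fft_ops(a_len, v_len):
    """Rough operation estimate for FFT convolution: two forward transforms plus one inverse."""
    n = a_len + v_len - 1
    return 3 * n * math.log2(n)


# Size pairs taken from the timing table in the issue.
for a_len, v_len in [(100000, 10), (100000, 1000), (100000, 100000)]:
    d, f = direct_ops(a_len, v_len), fft_ops(a_len, v_len)
    cheaper = "direct" if d < f else "fft"
    print(f"{a_len:>7} x {v_len:<7} direct={d:.2e} fft={f:.2e} cheaper={cheaper}")
```

Under this model the dot approach needs fewer operations for the 100000 x 10 case, while for 100000 x 100000 the FFT estimate is smaller by a wide margin, which is at least consistent with the direction of the measured crossover.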