
convolve_fft is >10x slower than scipy.signal.fftconvolve


It has been reported in the statmorph package that astropy.convolution.convolve_fft is at least an order of magnitude slower than scipy.signal.fftconvolve. Plot of runtimes vs. kernel size:

https://github.com/vrodgom/statmorph/commit/1fdd41033cf86952ac6c878dfc2685f333e241ef#commitcomment-43813226

CC: @vrodgom
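
For reference, a minimal timing comparison in the spirit of the statmorph report (the image and kernel sizes here are illustrative assumptions, not the original benchmark):

import numpy as np
import scipy.signal
import astropy.convolution

image = np.random.default_rng(0).random((1024, 1024))
kernel = np.ones((101, 101)) / 101**2  # normalized box kernel

# In IPython/Jupyter:
%timeit scipy.signal.fftconvolve(image, kernel, mode='same')
%timeit astropy.convolution.convolve_fft(image, kernel)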

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Reactions: 1
  • Comments: 12 (11 by maintainers)

Top GitHub Comments

dhomeier commented, Apr 12, 2021 (2 reactions)

https://github.com/scipy/scipy/blob/master/scipy/signal/signaltools.py#L453 shows what scipy is doing at the bottom of the call stack.

I’m really surprised by the order-of-magnitude difference, though. I wouldn’t have thought performance that poor was achievable without swapping to disk…

I have been looking in a bit more detail at what astropy is doing for fft_pad compared to scipy, and it seems the padding scheme is partly outdated and partly simply mistaken. From the Notes in the scipy.fft.fft documentation (https://docs.scipy.org/doc/scipy/reference/generated/scipy.fft.fft.html):

The symmetry is highest when n is a power of 2, and the transform is therefore most efficient for these sizes. For poorly factorizable sizes, scipy.fft uses Bluestein’s algorithm [2] and so is never worse than O(n log n). Further performance improvements may be seen by zero-padding the input using next_fast_len.

The bottom line I read from that is that the scipy.fft implementation is still reasonably fast for any arbitrary size, and, more importantly, scipy.fft.next_fast_len will in general find a size much closer to the original than the next power of 2 (a small sketch of the gap follows the list below). In contrast, our convolve_fft

  1. pads to the next power of 2, potentially almost doubling the array size in one dimension (especially if an original 2^N-sized array is first padded by a small amount by psf_pad);

  2. then expands every dimension to the size from (1) for the largest dimension, i.e. the padded array ends up square at the largest axis size.
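
As a quick illustration of that gap (the power-of-2 line is my reconstruction of the scheme, not astropy's literal code):

import scipy.fft

for n in (1229, 4099):  # the prime axis sizes used in the timings below
    pow2 = 1 << (n - 1).bit_length()   # next power of 2, as convolve_fft pads
    fast = scipy.fft.next_fast_len(n)  # next "fast" (smooth-factor) FFT size
    print(f"{n}: power of 2 -> {pow2}, next_fast_len -> {fast}")
# next_fast_len stays much closer to n than the next power of 2 does.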

I simply could not find the rationale for the second step; as I understand the scipy docs, n-dimensional FFTs are basically done recursively over the dimensions, and _freq_domain_conv is separately padding each dimension to its optimal size:

# Speed up FFT by padding to optimal size.
fshape = [sp_fft.next_fast_len(shape[a], not complex_result) for a in axes]

Timing tests indicate that this is the most efficient approach for np.fft.*fft, too (as well as for scipy.fftpack).

For convolve_fft this becomes really obvious with non-square inputs. Modifying the above tests for ny, nx = 1229, 4099 (prime-number-sized arrays, which supposedly have the poorest FFT performance), I ran the corresponding timings for

%timeit -n 3 result3 = scipy.signal.fftconvolve(image, kernel, mode='same')
%timeit -n 3 result4 = astropy.convolution.convolve_fft(image, kernel, normalize_kernel=False, psf_pad=False, fft_pad=False, nan_treatment='fill')
# Single run only for the following: it creates an 8192*8192 complex128 array of 1 GB,
# but the actual memory footprint grows to > 10 GB; time ~ 40 s.
%timeit -n 3 result5 = astropy.convolution.convolve_fft(image, kernel, normalize_kernel=False, psf_pad=True, fft_pad=True, nan_treatment='fill', allow_huge=True)

Next I tested a modified version setting newshape to the next power of 2, but individually for each axis. This creates a more manageable 2048*8192 array, but still doubles the time compared to no padding at all.

Finally, I replaced the powers of 2 with values from scipy.fft.next_fast_len() (it has two options, for complex or real input, which seem to produce very similar results; possibly real=True only brings real advantages for fft.rfft, which we currently cannot use since the input is always unconditionally cast to complex). That brings the time within a factor of ~2 of fftconvolve, and still closer to fftconvolve for complex input; using the scipy.fft functions brings some 10-20% further speedup, so the remaining difference can probably be attributed to the extra pre- and post-processing.

Note that disabling psf_pad had little impact in these latter tests (psf_pad=False actually becomes slower with fft_pad=False in this case, possibly because the padding at least makes the sizes non-prime), but it introduces noticeable differences in the result close to the borders.

[Plot: convolve timings for the different padding schemes]
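
A sketch of that per-axis scheme (fast_newshape is a hypothetical helper, not astropy's internal code; growing each axis to s1 + s2 - 1 is the standard linear-convolution size that psf_pad aims for):

from scipy import fft as sp_fft

def fast_newshape(arrayshape, kernshape, psf_pad=True):
    # Grow each axis by the kernel extent (avoids edge-wrapping), then pad
    # each axis independently to its own fast FFT length.
    shape = [s1 + s2 - 1 if psf_pad else max(s1, s2)
             for s1, s2 in zip(arrayshape, kernshape)]
    return [sp_fft.next_fast_len(s) for s in shape]

print(fast_newshape((1229, 4099), (25, 25)))  # modest, non-square padding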

keflavich commented, Jan 14, 2021 (2 reactions)

OK, as I said, this is because of the selected options. psf_pad is required to avoid edge-wrapping with FFT convolution. With it off, the performance improves ~10x because the convolution operates on a 1024^2 image rather than the 1536^2 image used in the previous tests.
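
To see the edge-wrapping in isolation, here is a tiny demonstration with a raw (circular) FFT convolution; psf_pad exists to prevent exactly this:

import numpy as np

img = np.zeros((8, 8)); img[7, 7] = 1.0      # impulse in the far corner
kern = np.zeros((8, 8)); kern[:3, :3] = 1.0  # 3x3 box, zero-embedded to 8x8
out = np.real(np.fft.ifft2(np.fft.fft2(img) * np.fft.fft2(kern)))
print(out[0, 0], out[1, 1])  # 1.0 1.0: the kernel response wrapped around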

As far as I can tell, the difference for small kernels - which is a large factor but a small absolute time - comes from overheads, possibly from promoting the kernel to a complex data type.

Also, scipy defaults to using rfft, which is faster, if both the image and kernel are real. We could add that check to astropy convolution; that would get us a little boost.
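
A sketch of that check (pick_fft is a hypothetical helper, assuming scipy.fft as the backend; not astropy code):

import numpy as np
from scipy import fft as sp_fft

def pick_fft(image, kernel):
    # Real-input transforms do roughly half the work of complex ones.
    if np.isrealobj(image) and np.isrealobj(kernel):
        return sp_fft.rfftn, sp_fft.irfftn
    return sp_fft.fftn, sp_fft.ifftn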

scipy handles array padding in a different way that might actually be better, particularly for small kernels. I haven’t figured out what it does yet because it’s buried down close to the C layer. They may be taking the fft of the kernel, then padding, instead of padding, then taking the fft, which is clever. I’m not willing to commit to that being correct until I’ve had more coffee.

scipy’s fftconvolve is doing one other thing I don’t understand that lets them avoid edge-wrapping without using the same padding approach we do. I suspect it’s what I just mentioned, and it may be responsible for scipy being just a smidgen slower for the biggest kernel size.
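
For reference, a minimal sketch of the padding-based way to avoid wrap-around: padding each axis to the full linear-convolution size s1 + s2 - 1 (rounded up to a fast length) makes the circular convolution equal to the linear one by construction. This is my reading of what scipy's fftconvolve effectively does, including the 'same'-mode cropping; treat it as a reconstruction, not scipy's actual code:

import numpy as np
from scipy import fft as sp_fft

def fft_convolve_same(image, kernel):
    s1, s2 = np.array(image.shape), np.array(kernel.shape)
    shape = s1 + s2 - 1  # full linear-convolution size: no circular wrap
    fshape = [sp_fft.next_fast_len(int(n), real=True) for n in shape]
    sp = sp_fft.rfftn(image, fshape) * sp_fft.rfftn(kernel, fshape)
    full = sp_fft.irfftn(sp, fshape)[tuple(slice(int(n)) for n in shape)]
    start = (s2 - 1) // 2  # crop the central region to mimic mode='same'
    return full[tuple(slice(int(a), int(a + n)) for a, n in zip(start, s1))]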

[Plot: astropy_vs_scipy_performance, runtime vs. kernel size from the script below]

import numpy as np
import matplotlib.pyplot as plt
import time
import scipy.signal
import astropy.convolution
from astropy.modeling import models

plt.clf()

# Create base image
ny, nx = 1024, 1024
y, x = np.mgrid[0:ny, 0:nx]
sersic_model = models.Sersic2D(amplitude=1, r_eff=80, n=1.5, x_0=0.5*nx, y_0=0.4*ny,
                               ellip=0.5, theta=0.5)
image = sersic_model(x, y)

kernel_sizes = [3, 5, 9, 15, 25, 51, 101, 251, 501, ]

# Benchmark scipy.signal.fftconvolve
times_scipy = []
for kernel_size in kernel_sizes:
    kernel = np.ones((kernel_size, kernel_size), dtype=np.float64)
    start = time.time()
    result3 = scipy.signal.fftconvolve(image, kernel, mode='same')
    times_scipy.append(time.time() - start)

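# Same benchmark with a complex kernel: forces complex FFTs throughout,
# comparable to astropy's unconditional cast of the input to complex.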
times_scipy_complex = []
for kernel_size in kernel_sizes:
    kernel = np.ones((kernel_size, kernel_size), dtype=np.complex64)
    start = time.time()
    result3 = scipy.signal.fftconvolve(image, kernel, mode='same')
    times_scipy_complex.append(time.time() - start)

# Benchmark astropy.convolution.convolve_fft:
times_astropy_default = []
for kernel_size in kernel_sizes:
    kernel = np.ones((kernel_size, kernel_size), dtype=np.float64)
    start = time.time()
    result4 = astropy.convolution.convolve_fft(image, kernel, normalize_kernel=False)
    times_astropy_default.append(time.time() - start)

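# Repeat with psf_pad=False: skips the edge-wrap padding, so the FFTs run
# on the original 1024^2 grid rather than a padded one.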
times_astropy_no_psf_pad = []
for kernel_size in kernel_sizes:
    kernel = np.ones((kernel_size, kernel_size), dtype=np.float64)
    start = time.time()
    result4 = astropy.convolution.convolve_fft(image, kernel, normalize_kernel=False, psf_pad=False)
    times_astropy_no_psf_pad.append(time.time() - start)

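# As above, plus nan_treatment='fill': avoids the default NaN
# interpolation step.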
times_astropy_no_psf_pad_nanfill = []
for kernel_size in kernel_sizes:
    kernel = np.ones((kernel_size, kernel_size), dtype=np.float64)
    start = time.time()
    result4 = astropy.convolution.convolve_fft(image, kernel, normalize_kernel=False, psf_pad=False, nan_treatment='fill')
    times_astropy_no_psf_pad_nanfill.append(time.time() - start)

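# As above, plus fft_pad=False: no padding to the next power of 2.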
times_astropy_no_psf_pad_nanfill_nofftpad = []
for kernel_size in kernel_sizes:
    kernel = np.ones((kernel_size, kernel_size), dtype=np.float64)
    start = time.time()
    result4 = astropy.convolution.convolve_fft(image, kernel, normalize_kernel=False, psf_pad=False, nan_treatment='fill', fft_pad=False)
    times_astropy_no_psf_pad_nanfill_nofftpad.append(time.time() - start)


plt.plot(kernel_sizes, times_astropy_default, label='astropy.convolution.convolve_fft default')
plt.plot(kernel_sizes, times_astropy_no_psf_pad, label='astropy.convolution.convolve_fft psf_pad=False')
plt.plot(kernel_sizes, times_astropy_no_psf_pad_nanfill, label='astropy.convolution.convolve_fft psf_pad=False fill nans')
plt.plot(kernel_sizes, times_astropy_no_psf_pad_nanfill_nofftpad, label='astropy.convolution.convolve_fft psf_pad=False fft_pad=False fill nans')
plt.plot(kernel_sizes, times_scipy, label='scipy.signal.fftconvolve')
plt.plot(kernel_sizes, times_scipy_complex, label='scipy.signal.fftconvolve complex')
plt.xscale('log')
plt.yscale('log')
plt.title('Image Size = %dx%d' % (ny, nx))
plt.xlabel('Kernel Size')
plt.ylabel('Time [s]')
plt.legend(bbox_to_anchor=(0.0,-0.45), loc="lower left")
plt.savefig("astropy_vs_scipy_performance.png", bbox_inches='tight')
plt.show()
