Need in-place and planned C2C & Z2Z FFT for better performance
Hello, it’s me again. I am migrating another set of codes to CuPy. This program relies heavily on C2C & Z2Z (complex-to-complex, single/double precision) FFT (we spend >50% of runtime in FFT), and I notice that CuPy’s FFT wrapper always does an “out-of-place” transform; that is, all routines in the `cupy.fft` family allocate an additional buffer during FFT at this line: https://github.com/cupy/cupy/blob/5e17c157faa60bedacd6d6cbdc51d63e7145c80d/cupy/fft/fft.py#L74, causing highly redundant data movement in our code (nvprof shows ~25% of time spent here).
However, according to the cuFFT doc it supports in-place C2C & Z2Z FFT, so it’d be nice if this feature can be added. Perhaps we can have an additional flag for users to make their own decision, or we can also compare the in and out buffers to decide if in-place or not under the hood, see for example scikit-cuda’s implementation at https://github.com/lebedov/scikit-cuda/blob/87f34fbe09d45825b2214665d5aa8c4da9e2ffdb/skcuda/fft.py#L195 (this may not be easy to achieve based on the current API design though). If you do not have the bandwidth to support this in the near future, I might be able to create a PR. Let me know what you think.
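To make the "compare the in and out buffers" idea concrete, here is a minimal sketch of such a check. It uses NumPy arrays so it runs without a GPU, and `choose_fft_mode` is a hypothetical helper, not part of CuPy or scikit-cuda. With CuPy arrays the same comparison could presumably be made on the device pointers (`a.data.ptr`) instead.

```python
import numpy as np


def choose_fft_mode(a, out=None):
    """Decide whether a C2C/Z2Z transform would run in-place.

    Hypothetical helper for illustration only; it performs no FFT.
    """
    # Only complex64 (C2C) and complex128 (Z2Z) inputs are considered here.
    if a.dtype not in (np.complex64, np.complex128):
        raise TypeError("expected a complex64 or complex128 array")
    if not a.flags.c_contiguous:
        raise ValueError("expected a C-contiguous input array")
    if out is None:
        # No output buffer supplied: a fresh one would have to be allocated.
        return "out-of-place"
    if out.shape != a.shape or out.dtype != a.dtype or not out.flags.c_contiguous:
        raise ValueError("out must match the input's shape, dtype and layout")
    # Compare the starting addresses of the two buffers.
    a_ptr = a.__array_interface__["data"][0]
    out_ptr = out.__array_interface__["data"][0]
    return "in-place" if a_ptr == out_ptr else "out-of-place"


x = np.zeros(8, dtype=np.complex64)
print(choose_fft_mode(x))                        # no out buffer given
print(choose_fft_mode(x, out=x))                 # same buffer for input and output
print(choose_fft_mode(x, out=np.empty_like(x)))  # distinct preallocated buffer
```

An explicit flag would be simpler to document, whereas the pointer comparison keeps the call signature closer to what users already pass.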
Thank you very much.
UPDATE on Feb 16, 2019: See https://github.com/cupy/cupy/issues/1669#issuecomment-454488982 for a todo list.
Issue Analytics
- Created 5 years ago
- Reactions: 3
- Comments: 20 (20 by maintainers)
Top GitHub Comments
I would think having separate PRs for adding the plan argument and the plan caching would be appropriate. The former should be fairly simple to implement and review, while there is a wider range of possibilities for the caching.
Hi @leofang, I recently implemented this for `fftn` and `ifftn` and was planning to make a PR. If you want to test it or collaborate on improvements, that would be welcome.
I am not sure exactly how it can best fit into `cupy`, as it introduces a couple of keyword arguments that are not part of the NumPy API.
Specifically, if you want to try it you can use the following branch: https://github.com/grlee77/cupy/tree/cufftn/ I have only modified `cupy.fft.fftn` and `cupy.fft.ifftn` to use n-dimensional plans and potential in-place operation. To try it, you need to set `plan_type='nd'` and pass in your preallocated array via the `out` kwarg. The PR also allows precomputing and storing the plan via a new function, `cupy.fft.get_cufft_plan_nd`, which can also be passed in via the `plan` kwarg. For the 2D and 3D FFTs I have been using, the n-d planning and preallocating the plan make a much bigger difference than the use of an in-place array. I have seen an order of magnitude improvement in some situations.
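As a rough illustration of why precomputing and reusing a plan helps, the sketch below caches a stand-in plan object keyed on shape, dtype and axes, and lets the caller override it with an explicit plan. Everything here (`FakePlan`, `cached_plan`, `resolve_plan`) is hypothetical and pure Python; it does not call cuFFT and does not reflect the actual signature of `get_cufft_plan_nd` in the branch above.

```python
from functools import lru_cache


class FakePlan:
    """Stand-in for an expensive-to-build FFT plan (NOT a cuFFT object)."""

    created = 0  # counts how many plans were actually constructed

    def __init__(self, shape, dtype, axes):
        FakePlan.created += 1
        self.shape, self.dtype, self.axes = shape, dtype, axes


@lru_cache(maxsize=16)
def cached_plan(shape, dtype, axes):
    # Built once per distinct (shape, dtype, axes) key, then reused.
    return FakePlan(shape, dtype, axes)


def resolve_plan(shape, dtype="complex64", axes=None, plan=None):
    """Return the plan a transform would use: the caller's, or a cached one."""
    if plan is not None:
        return plan  # an explicitly supplied plan always wins
    if axes is None:
        axes = tuple(range(len(shape)))
    return cached_plan(tuple(shape), dtype, tuple(axes))


p1 = resolve_plan((64, 64))
p2 = resolve_plan((64, 64))      # same key: served from the cache
p3 = resolve_plan((64, 64, 64))  # new shape: a second plan is built
print(p1 is p2, p3 is p1, FakePlan.created)
```

Separating the explicit `plan` argument from the cache mirrors the suggestion earlier in the thread to handle the two in separate PRs.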