Need in-place and planned C2C & Z2Z FFT for better performance

Hello, it’s me again. I am migrating another codebase to CuPy. This program relies heavily on C2C and Z2Z (complex-to-complex, single and double precision) FFTs; we spend more than 50% of the runtime in FFT. I noticed that CuPy’s FFT wrapper always performs an out-of-place transform: every routine in the cupy.fft family allocates an additional buffer during the FFT at this line, https://github.com/cupy/cupy/blob/5e17c157faa60bedacd6d6cbdc51d63e7145c80d/cupy/fft/fft.py#L74 , which causes highly redundant data movement in our code (nvprof shows ~25% of the time spent there).

However, according to the cuFFT documentation, cuFFT supports in-place C2C and Z2Z transforms, so it would be nice to have this feature in CuPy. Perhaps we could add a flag so that users can decide for themselves, or we could compare the input and output buffers under the hood to decide whether the transform is in-place; see, for example, scikit-cuda’s implementation at https://github.com/lebedov/scikit-cuda/blob/87f34fbe09d45825b2214665d5aa8c4da9e2ffdb/skcuda/fft.py#L195 (this may not be easy to achieve with the current API design, though). If you do not have the bandwidth to support this in the near future, I might be able to create a PR. Let me know what you think.
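
The buffer-comparison idea can be sketched without a GPU. The snippet below is an illustration under stated assumptions, not CuPy or scikit-cuda code: `c2c_fft` and `is_in_place` are hypothetical names, and NumPy stands in for cuFFT, so the "in-place" branch still computes into a temporary before overwriting the input. What it shows is only the dispatch: when the caller passes the input array itself as `out`, the wrapper recognises the call as in-place.

```python
import numpy as np


def is_in_place(x, out):
    """Return True when `out` starts at the same memory address as `x`."""
    if out is None:
        return False
    # __array_interface__['data'][0] is the address of the first element.
    # (On a CuPy array the analogous value would be `arr.data.ptr`.)
    return (x.__array_interface__["data"][0]
            == out.__array_interface__["data"][0])


def c2c_fft(x, out=None):
    """Hypothetical C2C/Z2Z wrapper; returns (result, ran_in_place)."""
    if x.dtype not in (np.complex64, np.complex128):
        raise TypeError("expected a complex64 or complex128 array")
    if out is None:
        # Out-of-place: allocate the extra buffer the issue complains about.
        out = np.empty_like(x)
    elif out.shape != x.shape or out.dtype != x.dtype:
        raise ValueError("out must match x in shape and dtype")
    in_place = is_in_place(x, out)
    # NumPy stand-in for the cuFFT exec call. A real in-place transform
    # would hand the same device pointer to cuFFT as input and output.
    out[...] = np.fft.fft(x)
    return out, in_place
```

Calling `c2c_fft(x, out=x)` overwrites `x` and reports an in-place run, while `c2c_fft(x)` allocates a new array. Comparing start addresses is the simplest possible check; it does not detect partially overlapping views.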

Thank you very much.

UPDATE on Feb 16, 2019: See https://github.com/cupy/cupy/issues/1669#issuecomment-454488982 for a todo list.

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Reactions: 3
  • Comments: 20 (20 by maintainers)

Top GitHub Comments

2 reactions
grlee77 commented, Jan 7, 2019

I think separate PRs for adding the plan argument and for the plan caching would be appropriate. The former should be fairly simple to implement and review, while there is a wider range of possibilities for the caching.
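
As one of the many possible caching designs, here is a minimal sketch that keys plans by array shape and dtype. Everything in it is an assumption made for illustration: `StandInPlan`, `get_plan` and `fftn_cached` are hypothetical names, the cache size of 16 is arbitrary, and NumPy stands in for cuFFT so the snippet runs without a GPU.

```python
from functools import lru_cache

import numpy as np


class StandInPlan:
    """Placeholder for an FFT plan; a real one would own a cuFFT handle."""

    def __init__(self, shape, dtype_name):
        self.shape = shape
        self.dtype = np.dtype(dtype_name)

    def execute(self, x):
        if x.shape != self.shape or x.dtype != self.dtype:
            raise ValueError("array does not match the plan")
        # NumPy stands in for the actual cuFFT execution.
        return np.fft.fftn(x)


@lru_cache(maxsize=16)
def get_plan(shape, dtype_name):
    """Build one plan per (shape, dtype); later calls reuse it."""
    return StandInPlan(shape, dtype_name)


def fftn_cached(x):
    """Look up (or build) the plan for `x`, then run it."""
    return get_plan(x.shape, x.dtype.name).execute(x)
```

Two calls on arrays of the same shape and dtype build the plan once, which `get_plan.cache_info()` reports as one miss and one hit. An LRU policy is only one option; a real cache would also have to account for the device memory that each plan holds.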

2 reactions
grlee77 commented, Sep 30, 2018

Hi @leofang, I recently implemented this for fftn and ifftn and was planning to make a PR. If you want to test it or collaborate on improvements, that would be welcome.

I am not sure exactly how it can best fit into cupy as it introduces a couple of keyword arguments that are not part of the NumPy API.

Specifically, if you want to try it, you can use the following branch: https://github.com/grlee77/cupy/tree/cufftn/ . I have only modified cupy.fft.fftn and cupy.fft.ifftn to use n-dimensional plans and, optionally, to operate in place. To try it, set plan_type='nd' and pass in your preallocated array via the out kwarg. The PR also allows precomputing and storing the plan via a new function, cupy.fft.get_cufft_plan_nd; the resulting plan can then be passed in via the plan kwarg. For the 2D and 3D FFTs I have been using, the n-d planning and the precomputed plan make a much bigger difference than the use of an in-place array. I have seen an order of magnitude improvement in some situations.
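
To make the contrast with n-dimensional plans concrete, the sketch below assumes that the alternative is a sequence of 1-D transforms, one pass per axis. It uses NumPy rather than CuPy so it runs anywhere, and `fftn_axis_by_axis` is a hypothetical helper written for this illustration. It shows that the per-axis passes give the same result as a single n-d transform, while each pass produces its own intermediate array.

```python
import numpy as np


def fftn_axis_by_axis(x):
    """Compute an n-d FFT as a sequence of 1-D FFTs, one per axis."""
    result = np.asarray(x, dtype=np.complex128)
    for axis in range(result.ndim):
        # Each pass is a separate batch of 1-D transforms and
        # produces a new intermediate array.
        result = np.fft.fft(result, axis=axis)
    return result


rng = np.random.default_rng(0)
x = rng.standard_normal((4, 6, 8)) + 1j * rng.standard_normal((4, 6, 8))

# The per-axis passes agree with a single n-d transform.
assert np.allclose(fftn_axis_by_axis(x), np.fft.fftn(x))
```

An n-d plan lets the FFT library handle all axes in one call, which is one plausible reason for the speedups reported above; the snippet itself makes no performance claim.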
