Compilation time is inconsistent between different environments
Description
I’m working on a project that uses CuPy to accelerate quantum computing simulations. We employ CuPy with custom CUDA kernels loaded with RawModule. We are trying to benchmark our code to assess the JIT approach with respect to alternatives. During benchmarks, we found out that the compilation times are inconsistent between different versions of CuPy and CUDA.
For example, I’ve written a gist, https://gist.github.com/mlazzarin/1e2128e90d78c4cb1a220075f64bc297, that loads our custom kernels with `cp.RawModule` and compiles them with the `.compile()` method.
I ran the gist with different versions of CuPy and the CUDA toolkit:
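For readers without access to the gist, the measurement can be sketched roughly as follows. This is a minimal illustration, not the gist itself: `KERNEL_SOURCE`, `time_call`, and `compile_once` are hypothetical names, and the toy kernel stands in for the project's real kernels.

```python
import time

# Hypothetical stand-in kernel; the real kernels live in the linked gist.
KERNEL_SOURCE = r'''
extern "C" __global__ void scale(float* x, float a, int n) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) x[i] *= a;
}
'''


def time_call(fn, repeats=3):
    """Return the wall-clock duration in seconds of each of `repeats` calls to `fn`."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return timings


def compile_once():
    """Build a RawModule from KERNEL_SOURCE and force its compilation."""
    # Imported lazily so that time_call can be used without CuPy or a GPU.
    import cupy as cp

    module = cp.RawModule(code=KERNEL_SOURCE)
    module.compile()
```

On a machine with CuPy and a GPU, `time_call(compile_once)` returns one duration per call. Later calls in the same process may be served from CuPy's caches, so the first value is usually the one that reflects actual compilation.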
Environment | Compilation time |
---|---|
System installation of CUDA 11.5, cupy-cuda115 from pip | ~ 3.2 s |
cudatoolkit=11.5.0 and cupy=9.6.0 from conda-forge | ~ 3.2 s |
cudatoolkit=11.4.2 and cupy=9.6.0 from conda-forge | ~ 3.2 s |
cudatoolkit=11.3.1 and cupy=9.6.0 from conda-forge | ~ 3.2 s |
cudatoolkit=11.2.2 and cupy=9.6.0 from conda-forge | ~ 3.4 s |
cudatoolkit=11.1.1 and cupy=9.6.0 from conda-forge | ~ 1.6 s |
cudatoolkit=11.5.0 and cupy=9.5.0 from conda-forge | ~ 2.1 s (3.2 s first exec) |
cudatoolkit=11.4.2 and cupy=9.5.0 from conda-forge | ~ 2.1 s (3.2 s first exec) |
cudatoolkit=11.3.1 and cupy=9.5.0 from conda-forge | ~ 2.1 s (3.2 s first exec) |
cudatoolkit=11.2.2 and cupy=9.5.0 from conda-forge | ~ 2.3 s (3.5 s first exec) |
cudatoolkit=11.1.1 and cupy=9.5.0 from conda-forge | ~ 1.0 s (1.7 s first exec) |
We also found out that the first execution with CuPy v9.5.0 is slower than the following ones, while this doesn’t happen with CuPy v9.6.0. This holds for the very first execution in a new environment.
Is this expected?
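One way to reproduce the "first execution in a new environment" timing on demand is to point CuPy's on-disk kernel cache at an empty directory before each run. The sketch below assumes the `CUPY_CACHE_DIR` environment variable described in CuPy's documentation; `use_fresh_cache_dir` is a hypothetical helper name, and setting the variable before importing CuPy is a precaution rather than a documented requirement.

```python
import os
import tempfile


def use_fresh_cache_dir():
    """Point CuPy's disk cache at a new, empty temporary directory.

    Call this before importing CuPy so that no previously compiled
    kernels can be picked up from an older cache directory.
    """
    cache_dir = tempfile.mkdtemp(prefix="cupy_cache_")
    os.environ["CUPY_CACHE_DIR"] = cache_dir
    return cache_dir
```

Running the benchmark once after `use_fresh_cache_dir()` and once more with the same directory left in place should separate cold-cache timings from warm-cache timings.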
Issue Analytics
- Created 2 years ago
- Comments:11 (8 by maintainers)
Top GitHub Comments
The only situation in which CuPy’s disk cache does not work is when the `nvrtc` backend is used AND `name_expressions` are specified.
I know that the nvcc that comes with CUDA 11.1 behaves differently from the ones that come with 11.0 and 11.2+. Interestingly, nvcc 11.1 builds faster but generates a larger binary than the others. I’m unsure whether this applies to NVRTC, but I guess so according to the benchmark.
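To make the disk-cache condition concrete, here is a small sketch. `disk_cache_expected` is a hypothetical helper that only restates the rule given in this comment; it is not part of CuPy. `build_templated_module` shows the kind of `RawModule` call that falls into the uncached case, using a toy template kernel as an assumption.

```python
def disk_cache_expected(backend="nvrtc", name_expressions=None):
    """Restate the rule above: the disk cache is skipped only when
    the nvrtc backend is combined with name_expressions."""
    return not (backend == "nvrtc" and bool(name_expressions))


# Toy C++ template kernel, used only for illustration.
TEMPLATE_SOURCE = r'''
template<typename T>
__global__ void scale(T* x, T a, int n) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) x[i] *= a;
}
'''


def build_templated_module():
    """Build a module in the uncached configuration (nvrtc + name_expressions)."""
    import cupy as cp  # lazy import: requires CuPy and a GPU

    names = ["scale<float>", "scale<double>"]
    module = cp.RawModule(
        code=TEMPLATE_SOURCE,
        options=("-std=c++11",),
        name_expressions=names,
    )
    # Template instantiations are looked up by their name expression.
    return module.get_function("scale<float>")
```

Kernels declared `extern "C"` and loaded without `name_expressions` stay on the cached path, according to the rule above.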
@mlazzarin Anyway, the code path in CuPy is the same for CUDA 11.1 and onwards, so I think this is unlikely to be a CuPy bug. I’d suggest forgetting about past CUDA releases 😃
Ok, thank you very much!