CuPy garbage collection is 100 times slower than GPU computing
I found an interesting phenomenon in CuPy. If Python's garbage collection is on, the computation takes 100 times longer than when it is off.
This may be a known issue, but it was enough to surprise me that [a: cp.ndarray * 3.0 for i in range(2)] is 100 times slower than [a: cp.ndarray * 3.0 for i in range(1)].
(Note that %timeit forces garbage collection off inside its own timing loop, so the loop that %timeit runs does not slow down the computation.)
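The note about %timeit can be checked with the standard-library timeit module, which also disables garbage collection while timing unless it is re-enabled in the setup string. The following is a minimal CPU-only sketch: the statement being timed and the loop counts are illustrative assumptions, not taken from the issue, and no CuPy is involved.

```python
import gc
import timeit

# Hypothetical workload: allocates many small container objects, which is
# the kind of activity that can trigger the cyclic garbage collector.
STMT = "c = [[i] for i in range(1000)]"
NUMBER = 200  # loops per run (illustrative)
REPEAT = 5    # runs (illustrative)

# By default, timeit switches garbage collection off while timing.
gc_off = min(timeit.repeat(STMT, number=NUMBER, repeat=REPEAT))

# Garbage collection can be switched back on from the setup string.
gc_on = min(
    timeit.repeat(STMT, setup="import gc; gc.enable()", number=NUMBER, repeat=REPEAT)
)

print(f"GC disabled (timeit default): {gc_off / NUMBER * 1e6:.1f} us per loop")
print(f"GC enabled via setup:         {gc_on / NUMBER * 1e6:.1f} us per loop")
```

Comparing the two figures shows how much of a measured difference comes from the collector rather than from the statement itself.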
$ pip freeze | grep cupy
cupy==5.0.0b1
Python 3.5.2 (default, Nov 23 2017, 16:37:01)
Type 'copyright', 'credits' or 'license' for more information
IPython 6.4.0 -- An enhanced Interactive Python. Type '?' for help.
In [1]: import cupy as cp
In [2]: a = cp.arange(8*16*100*100*100, dtype=cp.float32).reshape((8, 16, 100, 100, 100))
In [3]: %timeit -n100 -r10 c = [a * 2.0 for i in range(1)]
The slowest run took 5.26 times longer than the fastest. This could mean that an intermediate result is being cached.
110 us +- 89.9 us per loop (mean +- std. dev. of 10 runs, 100 loops each)
In [4]: %timeit -n100 -r10 c = [a * 2.0 for i in range(2)]
The slowest run took 40.09 times longer than the fastest. This could mean that an intermediate result is being cached.
2.66 ms +- 2.53 ms per loop (mean +- std. dev. of 10 runs, 100 loops each)
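The closing comments on this issue attribute the slowdown to kernels piling up in a queue. Assuming that refers to CuPy launching kernels asynchronously, a timing that returns before the GPU has finished would mostly measure launch overhead. The sketch below is one way to include the queued work in the measurement. The helper name time_synchronized and the demo function are hypothetical; cp.cuda.Stream.null.synchronize is the CuPy call used to wait for the default stream.

```python
import time


def time_synchronized(fn, synchronize, repeats=10):
    """Return mean seconds per call of fn, including queued asynchronous work."""
    synchronize()  # drain anything already queued before the clock starts
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    synchronize()  # wait until all launched work has actually finished
    return (time.perf_counter() - start) / repeats


def demo():
    """Repeat the measurement from the issue; needs CuPy and a CUDA device."""
    import cupy as cp  # imported here so the helper above works without CuPy

    a = cp.arange(8 * 16 * 100 * 100 * 100, dtype=cp.float32).reshape(
        (8, 16, 100, 100, 100)
    )
    sync = cp.cuda.Stream.null.synchronize  # blocks until the default stream is idle
    for n in (1, 2):
        seconds = time_synchronized(lambda: [a * 2.0 for _ in range(n)], sync)
        print(f"range({n}): {seconds * 1e3:.3f} ms per loop")
```

Calling demo() on a machine with a GPU prints a per-loop time for range(1) and range(2) that includes kernel execution, which makes the two cases directly comparable.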
Issue Analytics
- Created 5 years ago
- Comments: 9 (3 by maintainers)
Top GitHub Comments
@brandondube I now fully understand why the queue backed up with roughly 1024 kernels. Thank you very much for your kind reply and thoughtful explanation.
@kmaehashi Now that I understand why my code is so slow, this issue can be closed. (I'll close it.)
Thanks for the heads-up! We’re going to release Windows wheels for CUDA 9.2 in the next release.