Performance of CuPy `cp.random.rand` and other ufuncs when run on a new stream
Description
Hi!
When profiling a threaded version of an app, I noticed that random number generation was significantly slower when running on a newly created user (non-null) stream.
At first I thought this was just the overhead of creating the stream, but the overhead scales with problem size.
n = 1M
On New: 0.00053140998352319
On Default: 0.00011886999709531665
n = 10M
On New: 0.0020523089915513992
On Default: 0.0002732370048761368
What's going on here?
To Reproduce
import cupy as cp
import time

n = 10000000

for i in range(10):
    with cp.cuda.Device(0) as device:
        A = cp.zeros([n, 1])
        B = cp.zeros([n, 1])

        stream = cp.cuda.Stream()
        device.synchronize()

        start = time.perf_counter()
        with stream as stream:
            A[:] = cp.random.rand(n, 1)
        device.synchronize()
        end = time.perf_counter()
        print("On New: ", end - start)

        start = time.perf_counter()
        A[:] = cp.random.rand(n, 1)
        device.synchronize()
        end = time.perf_counter()
        print("On Default: ", end - start)
Installation
Conda-Forge (conda install ...)
Environment
OS : Linux-4.14.0-115.35.1.3chaos.ch6a.ppc64le-ppc64le-with-glibc2.17
Python Version : 3.8.13
CuPy Version : 11.1.0
CuPy Platform : NVIDIA CUDA
NumPy Version : 1.22.4
SciPy Version : 1.9.1
Cython Build Version : 0.29.32
Cython Runtime Version : 0.29.32
CUDA Root : /user/.conda/envs/test
nvcc PATH : None
CUDA Build Version : 10020
CUDA Driver Version : 11060
CUDA Runtime Version : 10020
cuBLAS Version : (available)
cuFFT Version : 10102
cuRAND Version : 10102
cuSOLVER Version : (10, 3, 0)
cuSPARSE Version : (available)
NVRTC Version : (10, 2)
Thrust Version : 100907
CUB Build Version : <unknown>
Jitify Build Version : 3ecec55
cuDNN Build Version : None
cuDNN Version : None
NCCL Build Version : None
NCCL Runtime Version : None
cuTENSOR Version : None
cuSPARSELt Build Version : None
Device 0 Name : Tesla V100-SXM2-16GB
Device 0 Compute Capability : 70
Device 0 PCI Bus ID : 0004:04:00.0
Device 1 Name : Tesla V100-SXM2-16GB
Device 1 Compute Capability : 70
Device 1 PCI Bus ID : 0004:05:00.0
Device 2 Name : Tesla V100-SXM2-16GB
Device 2 Compute Capability : 70
Device 2 PCI Bus ID : 0035:03:00.0
Device 3 Name : Tesla V100-SXM2-16GB
Device 3 Compute Capability : 70
Device 3 PCI Bus ID : 0035:04:00.0
Additional Information
No response
Top GitHub Comments
Ok yep, each stream having a separate memory pool appears to be it. Explicitly clearing all active memory pools before the kernels leads to the same time (0.002 seconds) for all runs, regardless of where and how they are launched. Thanks!
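A minimal sketch of that check (my illustration, not the original poster's code; the timed helper and the n value are assumptions): freeing the default pool's cached blocks before each timed region forces both the new-stream and default-stream runs to pay the allocation cost, so their timings converge.

import time
import cupy as cp

n = 10_000_000
pool = cp.get_default_memory_pool()

def timed(label, stream=None):
    # Drop cached blocks so every run pays the full allocation cost.
    pool.free_all_blocks()
    cp.cuda.Device(0).synchronize()
    start = time.perf_counter()
    if stream is not None:
        with stream:
            cp.random.rand(n, 1)
    else:
        cp.random.rand(n, 1)
    cp.cuda.Device(0).synchronize()
    print(label, time.perf_counter() - start)

timed("On New:    ", cp.cuda.Stream())
timed("On Default:")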
Using the new/experimental MemoryAsyncPool is also a workaround. For the random kernel it can closely match the default-stream performance; a sketch of this approach follows the next comment.

CuPy manages cuBLAS handles per-thread, so this should be a penalty only for the first iteration. https://github.com/cupy/cupy/blob/9d3bf3f208e40eff8bea14a7b0b2b3bf14fb488e/cupy/cuda/device.pyx#L259-L285
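A hedged sketch of the MemoryAsyncPool workaround (my illustration; MemoryAsyncPool is experimental and requires CUDA 11.2 or later, which the CUDA 10.2 runtime in the environment above would not satisfy; the n value and label are assumptions):

import time
import cupy as cp

# Route all CuPy allocations through the stream-ordered async pool.
cp.cuda.set_allocator(cp.cuda.MemoryAsyncPool().malloc)

n = 10_000_000
stream = cp.cuda.Stream()
cp.cuda.Device(0).synchronize()

start = time.perf_counter()
with stream:
    cp.random.rand(n, 1)
cp.cuda.Device(0).synchronize()
print("On New (MemoryAsyncPool):", time.perf_counter() - start)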
This sounds like a cause. CuPy has a separate memory pool per stream to avoid invalid reallocation across streams, so recreating a stream will not benefit from the pool.
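A minimal sketch of the practical implication (my illustration, not from the thread): creating the stream once and reusing it across iterations means only the first iteration pays the allocation cost, since later allocations can be served from blocks already cached for that stream.

import time
import cupy as cp

n = 10_000_000
stream = cp.cuda.Stream()  # created once; its pool arena stays warm across iterations

for i in range(5):
    cp.cuda.Device(0).synchronize()
    start = time.perf_counter()
    with stream:
        cp.random.rand(n, 1)
    cp.cuda.Device(0).synchronize()
    # Expect the first iteration to be slow (fresh allocations) and the rest
    # to approach the default-stream timings.
    print(f"iteration {i}:", time.perf_counter() - start)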