
Performance of CuPy `cp.random.rand` and other ufuncs when run on a new stream


Description

Hi!

When profiling a threaded version of an app, I noticed that random number generation was significantly slower when running on a newly created user (non-null) stream.

At first I thought this was just the overhead of creating the stream, but the overhead scales with problem size.

n = 1M
On New:  0.00053140998352319
On Default:  0.00011886999709531665

n = 10M
On New:  0.0020523089915513992
On Default:  0.0002732370048761368

What's going on here?

To Reproduce

import cupy as cp
import time

n = 10000000

for i in range(10):
    with cp.cuda.Device(0) as device:
        A = cp.zeros([n, 1])
        B = cp.zeros([n, 1])

        # A fresh (non-default) stream is created on every iteration.
        stream = cp.cuda.Stream()

        device.synchronize()

        # Time cp.random.rand on the newly created stream.
        start = time.perf_counter()
        with stream:
            A[:] = cp.random.rand(n, 1)
        device.synchronize()
        end = time.perf_counter()
        print("On New: ", end - start)

        # Time the same call on the default (null) stream.
        start = time.perf_counter()
        A[:] = cp.random.rand(n, 1)
        device.synchronize()
        end = time.perf_counter()
        print("On Default: ", end - start)

Installation

Conda-Forge (conda install ...)

Environment

OS                           : Linux-4.14.0-115.35.1.3chaos.ch6a.ppc64le-ppc64le-with-glibc2.17
Python Version               : 3.8.13
CuPy Version                 : 11.1.0
CuPy Platform                : NVIDIA CUDA
NumPy Version                : 1.22.4
SciPy Version                : 1.9.1
Cython Build Version         : 0.29.32
Cython Runtime Version       : 0.29.32
CUDA Root                    : /user/.conda/envs/test
nvcc PATH                    : None
CUDA Build Version           : 10020
CUDA Driver Version          : 11060
CUDA Runtime Version         : 10020
cuBLAS Version               : (available)
cuFFT Version                : 10102
cuRAND Version               : 10102
cuSOLVER Version             : (10, 3, 0)
cuSPARSE Version             : (available)
NVRTC Version                : (10, 2)
Thrust Version               : 100907
CUB Build Version            : <unknown>
Jitify Build Version         : 3ecec55
cuDNN Build Version          : None
cuDNN Version                : None
NCCL Build Version           : None
NCCL Runtime Version         : None
cuTENSOR Version             : None
cuSPARSELt Build Version     : None
Device 0 Name                : Tesla V100-SXM2-16GB
Device 0 Compute Capability  : 70
Device 0 PCI Bus ID          : 0004:04:00.0
Device 1 Name                : Tesla V100-SXM2-16GB
Device 1 Compute Capability  : 70
Device 1 PCI Bus ID          : 0004:05:00.0
Device 2 Name                : Tesla V100-SXM2-16GB
Device 2 Compute Capability  : 70
Device 2 PCI Bus ID          : 0035:03:00.0
Device 3 Name                : Tesla V100-SXM2-16GB
Device 3 Compute Capability  : 70
Device 3 PCI Bus ID          : 0035:04:00.0

Additional Information

No response


Top GitHub Comments

2 reactions
dialecticDolt commented, Oct 20, 2022

OK, yep, each stream having a separate memory pool appears to be it. Explicitly clearing all active memory pools before the kernels run leads to the same time (0.002 seconds) in every case, regardless of where and how they are launched. Thanks!
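
A minimal sketch of that pool-clearing experiment (the variable names and the single timed call are illustrative, not taken from the original benchmark): freeing all cached blocks in the default memory pool before the timed region means no stream starts with a warm arena, so every stream pays the fresh cudaMalloc cost.

import cupy as cp
import time

n = 10_000_000
stream = cp.cuda.Stream()

# Drop every cached block so no stream's arena starts "warm"; the next
# allocation on any stream then has to go through cudaMalloc again.
cp.get_default_memory_pool().free_all_blocks()

start = time.perf_counter()
with stream:
    A = cp.random.rand(n, 1)
stream.synchronize()
print("On New (cold pool): ", time.perf_counter() - start)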

Using the new/experimental MemoryAsyncPool is also a workaround. For the random kernel it closely matches default-stream performance:

On Stream:  0.0003380347043275833
On Default:  0.00033360160887241364
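
For reference, a rough sketch of how that workaround can be switched on (assuming CuPy's experimental cp.cuda.MemoryAsyncPool, which requires CUDA 11.2+ and driver support; the small array size is only for illustration):

import cupy as cp

# Route allocations through CUDA's stream-ordered allocator
# (cudaMallocAsync). The driver manages the pool rather than CuPy
# keeping per-stream arenas, so a fresh stream does not start cold.
cp.cuda.set_allocator(cp.cuda.MemoryAsyncPool().malloc)

stream = cp.cuda.Stream()
with stream:
    A = cp.random.rand(1000, 1)
stream.synchronize()
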

2 reactions
kmaehashi commented, Oct 17, 2022

> Although it could potentially be resetting any workspaces associated with the handle (https://github.com/cupy/cupy/issues/4676)?

CuPy manages cuBLAS handles per-thread, so this should be a penalty only for the first iteration. https://github.com/cupy/cupy/blob/9d3bf3f208e40eff8bea14a7b0b2b3bf14fb488e/cupy/cuda/device.pyx#L259-L285

> For the Random Fill example, the stream version does show 9 more calls to cudaMalloc which (I think) accounts for 0.0016 seconds and the bulk of the overhead.

This sounds like a cause. CuPy has a separate memory pool per stream to avoid invalid reallocation across streams, so recreating a stream will not benefit from the pool.
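
One way to see the per-stream pool effect directly (a sketch, not part of the issue): create the stream once and reuse it across iterations, so its arena holds cached blocks after the first pass and later iterations should avoid the extra cudaMalloc calls.

import cupy as cp
import time

n = 10_000_000

# Reuse a single non-default stream instead of constructing a new one
# per iteration; only the first iteration should pay the allocation cost.
stream = cp.cuda.Stream()

for i in range(5):
    start = time.perf_counter()
    with stream:
        A = cp.random.rand(n, 1)
    stream.synchronize()
    print(f"iteration {i}: {time.perf_counter() - start:.6f} s")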
