Performance of CuPy `cp.random.rand` and other ufuncs when run on a new stream
Description
Hi!
When profiling a threaded version of an app, I noticed that random number generation was significantly slower when running on a newly created user (non-null) stream.
At first I thought this was just the overhead of creating the stream, but the overhead scales with problem size.
n = 1M
On New: 0.00053140998352319
On Default: 0.00011886999709531665
n = 10M
On New: 0.0020523089915513992
On Default: 0.0002732370048761368
What's going on here?
To Reproduce
import cupy as cp
import time

n = 10000000

for i in range(10):
    with cp.cuda.Device(0) as device:
        A = cp.zeros([n, 1])
        B = cp.zeros([n, 1])

        stream = cp.cuda.Stream()
        device.synchronize()

        start = time.perf_counter()
        with stream as stream:
            A[:] = cp.random.rand(n, 1)
        device.synchronize()
        end = time.perf_counter()
        print("On New: ", end - start)

        start = time.perf_counter()
        A[:] = cp.random.rand(n, 1)
        device.synchronize()
        end = time.perf_counter()
        print("On Default: ", end - start)
Installation
Conda-Forge (conda install ...)
Environment
OS : Linux-4.14.0-115.35.1.3chaos.ch6a.ppc64le-ppc64le-with-glibc2.17
Python Version : 3.8.13
CuPy Version : 11.1.0
CuPy Platform : NVIDIA CUDA
NumPy Version : 1.22.4
SciPy Version : 1.9.1
Cython Build Version : 0.29.32
Cython Runtime Version : 0.29.32
CUDA Root : /user/.conda/envs/test
nvcc PATH : None
CUDA Build Version : 10020
CUDA Driver Version : 11060
CUDA Runtime Version : 10020
cuBLAS Version : (available)
cuFFT Version : 10102
cuRAND Version : 10102
cuSOLVER Version : (10, 3, 0)
cuSPARSE Version : (available)
NVRTC Version : (10, 2)
Thrust Version : 100907
CUB Build Version : <unknown>
Jitify Build Version : 3ecec55
cuDNN Build Version : None
cuDNN Version : None
NCCL Build Version : None
NCCL Runtime Version : None
cuTENSOR Version : None
cuSPARSELt Build Version : None
Device 0 Name : Tesla V100-SXM2-16GB
Device 0 Compute Capability : 70
Device 0 PCI Bus ID : 0004:04:00.0
Device 1 Name : Tesla V100-SXM2-16GB
Device 1 Compute Capability : 70
Device 1 PCI Bus ID : 0004:05:00.0
Device 2 Name : Tesla V100-SXM2-16GB
Device 2 Compute Capability : 70
Device 2 PCI Bus ID : 0035:03:00.0
Device 3 Name : Tesla V100-SXM2-16GB
Device 3 Compute Capability : 70
Device 3 PCI Bus ID : 0035:04:00.0
Additional Information
No response
Top GitHub Comments
Ok yep, each stream having a separate memory pool appears to be it. Explicitly clearing all active memory pools before the kernels leads to the same time (0.002 seconds) for all runs, regardless of where and how they are launched. Thanks!
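A minimal sketch of that check (my illustration, not the original poster's code; the timed helper and the n value are assumptions): freeing the default pool's cached blocks before each timed region forces both the new-stream and default-stream runs to pay the allocation cost, so their timings converge.

import time
import cupy as cp

n = 10_000_000
pool = cp.get_default_memory_pool()

def timed(label, stream=None):
    # Drop cached blocks so every run pays the full allocation cost.
    pool.free_all_blocks()
    cp.cuda.Device(0).synchronize()
    start = time.perf_counter()
    if stream is not None:
        with stream:
            cp.random.rand(n, 1)
    else:
        cp.random.rand(n, 1)
    cp.cuda.Device(0).synchronize()
    print(label, time.perf_counter() - start)

timed("On New:    ", cp.cuda.Stream())
timed("On Default:")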
Using the new/experimental MemoryAsyncPool is also a workaround. For the random kernel it can closely match the default-stream performance; a sketch of this approach follows the next comment.

CuPy manages cuBLAS handles per-thread, so this should be a penalty only for the first iteration. https://github.com/cupy/cupy/blob/9d3bf3f208e40eff8bea14a7b0b2b3bf14fb488e/cupy/cuda/device.pyx#L259-L285
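A hedged sketch of the MemoryAsyncPool workaround (my illustration; MemoryAsyncPool is experimental and requires CUDA 11.2 or later, which the CUDA 10.2 runtime in the environment above would not satisfy; the n value and label are assumptions):

import time
import cupy as cp

# Route all CuPy allocations through the stream-ordered async pool.
cp.cuda.set_allocator(cp.cuda.MemoryAsyncPool().malloc)

n = 10_000_000
stream = cp.cuda.Stream()
cp.cuda.Device(0).synchronize()

start = time.perf_counter()
with stream:
    cp.random.rand(n, 1)
cp.cuda.Device(0).synchronize()
print("On New (MemoryAsyncPool):", time.perf_counter() - start)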
This sounds like a cause. CuPy has a separate memory pool per stream to avoid invalid reallocation across streams, so recreating a stream will not benefit from the pool.
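A minimal sketch of the practical implication (my illustration, not from the thread): creating the stream once and reusing it across iterations means only the first iteration pays the allocation cost, since later allocations can be served from blocks already cached for that stream.

import time
import cupy as cp

n = 10_000_000
stream = cp.cuda.Stream()  # created once; its pool arena stays warm across iterations

for i in range(5):
    cp.cuda.Device(0).synchronize()
    start = time.perf_counter()
    with stream:
        cp.random.rand(n, 1)
    cp.cuda.Device(0).synchronize()
    # Expect the first iteration to be slow (fresh allocations) and the rest
    # to approach the default-stream timings.
    print(f"iteration {i}:", time.perf_counter() - start)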