
Performance measurements - `cp.matmul` slower than `torch.matmul`


I just installed cupy and did some simple performance benchmarks for comparison. I chose matrix multiplication since it's the simplest problem to start with.

I measure cupy time with the following code snippet:

from contextlib import contextmanager
import cupy

@contextmanager
def timing():
    class Foo:  # dummy class to pass the result out of the contextmanager
        pass
    res = Foo()
    start = cupy.cuda.Event(disable_timing=False)
    end = cupy.cuda.Event(disable_timing=False)
    start.record()
    yield res
    end.record()

    # I'm not sure about this line, just guessed by analogy from torch.
    # Without it, a DeviceNotReady error is raised.
    end.synchronize()
    res.result = cupy.cuda.get_elapsed_time(start, end) / 1000  # ms -> s

I have a similar script for measuring time in torch, based on this thread: https://discuss.pytorch.org/t/how-to-measure-time-in-pytorch/26964 The above contextmanager is used in the following way:

N = 2000  # one of the sizes used in the measurements below
x = cupy.random.rand(N, N, dtype=cupy.float32)
y = cupy.random.rand(N, N, dtype=cupy.float32)
with timing() as t:
    z = cupy.matmul(x, y)
print('time', t.result)

Link to the full measurement script: https://github.com/danlkv/QTensor/tree/merged_ix/scratchpad/bench/matmul
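
The torch-side measurement follows the same event-based pattern from the linked PyTorch forum thread. A minimal sketch of what it looks like (an illustration, not the exact script from the repository above; it assumes the tensors are created directly on the GPU):

import torch

N = 2000
x = torch.rand(N, N, dtype=torch.float32, device='cuda')
y = torch.rand(N, N, dtype=torch.float32, device='cuda')

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
z = torch.matmul(x, y)
end.record()
torch.cuda.synchronize()  # wait for the recorded events to complete
print('time', start.elapsed_time(end) / 1000)  # elapsed_time returns milliseconds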

I know that torch also uses CUDA, so I would expect the times for torch and cupy to be similar, since most of the work is done by the same (?) CUDA backend function. I do 10 runs of a square N×N matrix multiplication; here are some of the results I get:

library  size of matrix (N)  FLOP/s   average time of 10 runs (s)
torch    2000                879.23G  0.00909883852005005
torch    3000                1.06T    0.02553908138275146
torch    3001                1.22T    0.02214127025604248
cupy     2000                690.78G  0.011581078433990479
cupy     3000                741.55G  0.036410070610046384
cupy     3001                720.97G  0.037487194061279296

So it looks like torch is somehow ~50% faster… It is also ~15% faster for size 3001 than for 3000, which is strange, but probably not related to cupy.
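
For context, the FLOP/s column appears to follow the convention of N³ operations per N×N matmul (counting a fused multiply-add as one operation) divided by the average time; a quick check against the first torch row:

# Assumed convention: FLOP/s = N**3 / average_time
N = 2000
avg_time = 0.00909883852005005   # torch, N=2000, from the table above
print(N**3 / avg_time / 1e9)     # ~879.2, matching the 879.23G entry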

My guess would be that some time is spent on data transfer to the GPU, and while I don't include .to('cuda') in the torch measurements, cupy does the tensor movements inside cupy.matmul.
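
One way to check that guess: arrays created with cupy.random.rand already live in GPU memory, and their placement can be inspected via the .device attribute of a CuPy array (a small sketch, not part of the original script):

import cupy

x = cupy.random.rand(2000, 2000, dtype=cupy.float32)
print(type(x))    # <class 'cupy.ndarray'> -- a device array
print(x.device)   # e.g. <CUDA Device 0>, i.e. the data already lives on the GPU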

System specs

  1. System: WSL Ubuntu 20.04
  2. Cupy
» python -c 'import cupy; cupy.show_config()'
OS                           : Linux-4.19.128-microsoft-standard-x86_64-with-glibc2.29
CuPy Version                 : 8.6.0
NumPy Version                : 1.19.4
SciPy Version                : 1.3.3
Cython Build Version         : 0.29.22
CUDA Root                    : /usr/local/cuda
CUDA Build Version           : 11020
CUDA Driver Version          : 11030
CUDA Runtime Version         : 11020
cuBLAS Version               : 11401
cuFFT Version                : 10401
cuRAND Version               : 10203
cuSOLVER Version             : (11, 1, 0)
cuSPARSE Version             : 11401
NVRTC Version                : (11, 2)
Thrust Version               : 101000
CUB Build Version            : 101000
cuDNN Build Version          : None
cuDNN Version                : None
NCCL Build Version           : 2804
NCCL Runtime Version         : 2804
cuTENSOR Version             : None
Device 0 Name                : NVIDIA GeForce GTX 1650 with Max-Q Design
Device 0 Compute Capability  : 75
  3. CUDA 11.2, NVIDIA driver 470.14

Issue Analytics

  • State: open
  • Created: 2 years ago
  • Reactions: 1
  • Comments: 37 (21 by maintainers)

Top GitHub Comments

leofang commented on Jul 1, 2021 (3 reactions)

Just a wild thought: Could it be possible that you have multiple CUDA installations on your system, and CuPy and PyTorch accidentally picked up different versions? How were CuPy and PyTorch installed?

btw @huaxuan250 your PyTorch and CuPy versions do not exist…maybe typos?
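
One quick way to check which CUDA toolkit each library actually picked up (a small sketch, not from the original thread):

import torch
import cupy

print(torch.__version__, cupy.__version__)    # library versions
print(torch.version.cuda)                     # CUDA version PyTorch was built with
print(cupy.cuda.runtime.runtimeGetVersion())  # CUDA runtime CuPy is using, e.g. 11020
cupy.show_config()                            # full CuPy build/runtime configuration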

emcastillo commented on Apr 14, 2021 (3 reactions)

I just traced both matmuls and they take the same amount of time.

I ran your script and saw the time discrepancies; however, they are gone if you create the cuBLAS handle beforehand. On the first iteration CuPy creates the cuBLAS handle, and that takes a lot of time. We defer this creation because the handle eats up a significant amount of GPU memory, whereas PyTorch creates all these handles at import time (this is part of why PyTorch consumes a lot of GPU memory once you import it).

If you call cupy.cuda.device.get_cublas_handle() before the actual matmul, your script will get better timings.
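
A minimal sketch of that warm-up (a simplified benchmark, not the exact script from the issue):

import cupy

# Create the cuBLAS handle up front so the first timed matmul
# does not pay the one-time handle-creation cost.
cupy.cuda.device.get_cublas_handle()

N = 3000
x = cupy.random.rand(N, N, dtype=cupy.float32)
y = cupy.random.rand(N, N, dtype=cupy.float32)

start = cupy.cuda.Event(disable_timing=False)
end = cupy.cuda.Event(disable_timing=False)
start.record()
z = cupy.matmul(x, y)
end.record()
end.synchronize()
print('time', cupy.cuda.get_elapsed_time(start, end) / 1000)  # seconds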
