
Performance measurements - `cp.matmul` slower than `torch.matmul`


I just installed cupy and did some simple performance benchmarks for comparison. I chose matrix multiplication since it's the simplest problem to start with.

I measure cupy time with the following code snippet:

from contextlib import contextmanager
import cupy

@contextmanager
def timing():
    class Foo:  # dummy class to pass the result out of the contextmanager
        pass
    res = Foo()
    start = cupy.cuda.Event(disable_timing=False)
    end = cupy.cuda.Event(disable_timing=False)
    start.record()
    yield res
    end.record()

    # I'm not sure about this line, just guessed by analogy from torch.
    # Without it, a DeviceNotReady error is raised.
    end.synchronize()
    res.result = cupy.cuda.get_elapsed_time(start, end) / 1000  # ms -> s

I have a similar script for measuring time in torch, based on this thread: https://discuss.pytorch.org/t/how-to-measure-time-in-pytorch/26964 The above contextmanager is used in the following way:

N = 2000  # one of the sizes used in the measurements below
x = cupy.random.rand(N, N, dtype=cupy.float32)
y = cupy.random.rand(N, N, dtype=cupy.float32)
with timing() as t:
    z = cupy.matmul(x, y)
print('time', t.result)

Link to the full measurement script: https://github.com/danlkv/QTensor/tree/merged_ix/scratchpad/bench/matmul
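
The torch-side measurement follows the same event-based pattern from the linked PyTorch forum thread. A minimal sketch of what it looks like (an illustration, not the exact script from the repository above; it assumes the tensors are created directly on the GPU):

import torch

N = 2000
x = torch.rand(N, N, dtype=torch.float32, device='cuda')
y = torch.rand(N, N, dtype=torch.float32, device='cuda')

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
z = torch.matmul(x, y)
end.record()
torch.cuda.synchronize()  # wait for the recorded events to complete
print('time', start.elapsed_time(end) / 1000)  # elapsed_time returns milliseconds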

I know that torch also uses CUDA, so I would expect the times for torch and cupy to be similar, since most of the work is done by the same (?) CUDA backend function. I do 10 runs of a square N×N matrix multiplication; here are some of the results I get:

library  size of matrix (N)  FLOP/s   average time of 10 runs (s)
torch    2000                879.23G  0.00909883852005005
torch    3000                1.06T    0.02553908138275146
torch    3001                1.22T    0.02214127025604248
cupy     2000                690.78G  0.011581078433990479
cupy     3000                741.55G  0.036410070610046384
cupy     3001                720.97G  0.037487194061279296

So it looks like torch is somehow ~50% faster… It is also ~15% faster for size 3001 than for 3000, which is strange, but probably not related to cupy.
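
For context, the FLOP/s column appears to follow the convention of N³ operations per N×N matmul (counting a fused multiply-add as one operation) divided by the average time; a quick check against the first torch row:

# Assumed convention: FLOP/s = N**3 / average_time
N = 2000
avg_time = 0.00909883852005005   # torch, N=2000, from the table above
print(N**3 / avg_time / 1e9)     # ~879.2, matching the 879.23G entry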

My guess would be that some time is spent on data transfer to the GPU, and while I don't include .to('cuda') in the torch measurements, cupy does the tensor movements inside cupy.matmul.
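
One way to check that guess: arrays created with cupy.random.rand already live in GPU memory, and their placement can be inspected via the .device attribute of a CuPy array (a small sketch, not part of the original script):

import cupy

x = cupy.random.rand(2000, 2000, dtype=cupy.float32)
print(type(x))    # <class 'cupy.ndarray'> -- a device array
print(x.device)   # e.g. <CUDA Device 0>, i.e. the data already lives on the GPU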

System specs

  1. System: WSL Ubuntu 20.04
  2. Cupy
» python -c 'import cupy; cupy.show_config()'
OS                           : Linux-4.19.128-microsoft-standard-x86_64-with-glibc2.29
CuPy Version                 : 8.6.0
NumPy Version                : 1.19.4
SciPy Version                : 1.3.3
Cython Build Version         : 0.29.22
CUDA Root                    : /usr/local/cuda
CUDA Build Version           : 11020
CUDA Driver Version          : 11030
CUDA Runtime Version         : 11020
cuBLAS Version               : 11401
cuFFT Version                : 10401
cuRAND Version               : 10203
cuSOLVER Version             : (11, 1, 0)
cuSPARSE Version             : 11401
NVRTC Version                : (11, 2)
Thrust Version               : 101000
CUB Build Version            : 101000
cuDNN Build Version          : None
cuDNN Version                : None
NCCL Build Version           : 2804
NCCL Runtime Version         : 2804
cuTENSOR Version             : None
Device 0 Name                : NVIDIA GeForce GTX 1650 with Max-Q Design
Device 0 Compute Capability  : 75
  3. CUDA 11.2, NVIDIA driver 470.14

Issue Analytics

  • State: open
  • Created: 2 years ago
  • Reactions: 1
  • Comments: 37 (21 by maintainers)

Top GitHub Comments

leofang commented on Jul 1, 2021 (3 reactions)

Just a wild thought: Could it be possible that you have multiple CUDA installations on your system, and CuPy and PyTorch accidentally picked up different versions? How were CuPy and PyTorch installed?

btw @huaxuan250 your PyTorch and CuPy versions do not exist…maybe typos?
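
One quick way to check which CUDA toolkit each library actually picked up (a small sketch, not from the original thread):

import torch
import cupy

print(torch.__version__, cupy.__version__)    # library versions
print(torch.version.cuda)                     # CUDA version PyTorch was built with
print(cupy.cuda.runtime.runtimeGetVersion())  # CUDA runtime CuPy is using, e.g. 11020
cupy.show_config()                            # full CuPy build/runtime configuration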

emcastillo commented on Apr 14, 2021 (3 reactions)

I just traced both matmuls and they take the same amount of time.

I ran your script and saw the time discrepancies; however, they are gone if you create the cuBLAS handle beforehand. On the first iteration CuPy creates the cuBLAS handle, and that takes a lot of time. We defer this creation because the handle eats up a significant amount of GPU memory, whereas PyTorch creates all these handles at import time (this is part of why PyTorch consumes a lot of GPU memory once you import it).

If you call cupy.cuda.device.get_cublas_handle() before the actual matmul, your script will get better timings.
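
A minimal sketch of that warm-up (a simplified benchmark, not the exact script from the issue):

import cupy

# Create the cuBLAS handle up front so the first timed matmul
# does not pay the one-time handle-creation cost.
cupy.cuda.device.get_cublas_handle()

N = 3000
x = cupy.random.rand(N, N, dtype=cupy.float32)
y = cupy.random.rand(N, N, dtype=cupy.float32)

start = cupy.cuda.Event(disable_timing=False)
end = cupy.cuda.Event(disable_timing=False)
start.record()
z = cupy.matmul(x, y)
end.record()
end.synchronize()
print('time', cupy.cuda.get_elapsed_time(start, end) / 1000)  # seconds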
