`cupy.vdot` is way slower than naive implementation

I am on CUDA 10.0 + RTX 2080 Ti + master branch, and this is what I see (with CUPY_ACCELERATORS=cub):

>>> import cupy as cp
>>> from cupyx.time import repeat
>>>
>>> a = cp.random.random(1000000)
>>> b = cp.random.random(1000000)
>>> def my_vdot(a, b):
...     return cp.sum(a * b)
... 
>>> print(repeat(cp.vdot, (a, b)))
vdot                :    CPU:   17.272 us   +/- 0.938 (min:   16.300 / max:   39.416) us     GPU-0:  640.646 us   +/-18.809 (min:  630.496 / max:  774.144) us
>>> print(repeat(my_vdot, (a, b)))
my_vdot             :    CPU:   26.312 us   +/- 3.240 (min:   25.204 / max:   97.959) us     GPU-0:   74.100 us   +/-12.585 (min:   70.560 / max:  274.432) us
>>> my_vdot(a, b)  == cp.vdot(a, b)
array(True)
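
The exact `==` match above holds for this particular input. In general the two reductions may add terms in a different order, so the results can differ in the last bits, and a tolerance-based comparison is the safer check. A minimal sketch, using NumPy as a stand-in so it runs without a GPU (assumption: `cupy.vdot`, `cupy.sum` and `cupy.allclose` mirror their NumPy counterparts here):

```python
import numpy as np  # stand-in for cupy so the sketch runs without a GPU

rng = np.random.default_rng(0)
a = rng.random(1_000_000)
b = rng.random(1_000_000)


def my_vdot(a, b):
    # Elementwise product into a temporary array, then a sum reduction.
    return np.sum(a * b)


# Different summation orders can differ in the last bits,
# so compare with a tolerance rather than with ==.
same = bool(np.allclose(my_vdot(a, b), np.vdot(a, b)))
print(same)
```

With CuPy installed, replacing `np` with `cp` should give the same check on the device.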

I could probably look into this myself, just opening a ticket in case I forget and someone can pick up 😅

Top GitHub Comments

asi1024 commented, Aug 3, 2020

CUPY_ACCELERATORS=cub was not set for the performance numbers in my previous comment (Aug 2). With it set, tensordot_naive is faster in every case I checked:

tensordot_kernel    :    CPU:   12.353 us   +/-12.194 (min:   11.217 / max:  429.384) us     GPU-0:  838.373 us   +/-26.176 (min:  820.512 / max: 1243.168) us
tensordot_naive     :    CPU:   40.067 us   +/- 1.072 (min:   38.911 / max:   65.483) us     GPU-0:   44.603 us   +/- 1.460 (min:   42.528 / max:   80.096) us
tensordot_kernel    :    CPU:   11.715 us   +/- 0.469 (min:   11.194 / max:   19.401) us     GPU-0:  438.515 us   +/- 1.323 (min:  430.464 / max:  446.176) us
tensordot_naive     :    CPU:   39.927 us   +/- 0.948 (min:   38.747 / max:   59.839) us     GPU-0:   44.409 us   +/- 1.052 (min:   42.496 / max:   64.640) us
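
For anyone reproducing these numbers, one way to enable the CUB backend from inside a script is to set the environment variable before CuPy is imported. This is a sketch under the assumption that CuPy reads `CUPY_ACCELERATORS` when it is imported; the `have_cupy` flag is only there so the snippet also runs on a machine without CuPy.

```python
import os

# Set this before importing CuPy (assumption: the variable is read at import time).
os.environ["CUPY_ACCELERATORS"] = "cub"

try:
    import cupy  # noqa: F401
    have_cupy = True
except ImportError:
    # CuPy is not installed here; the variable is still set for child processes.
    have_cupy = False

print(os.environ["CUPY_ACCELERATORS"], have_cupy)
```

Setting the variable in the shell that launches the benchmark has the same effect and avoids any dependence on import order.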

asi1024 commented, Aug 2, 2020

@leofang Your my_vdot computes a * b into a contiguous temporary array and then calls a reduction on it, whereas cupy.vdot does not. Your implementation seems faster in many cases, but the current implementation is sometimes faster for certain memory layouts.

import cupy
import cupyx.time


tensordot_kernel = cupy.ReductionKernel(
    'S x, T y', 'U out',
    'static_cast<U>(x) * static_cast<U>(y)',
    'a + b', 'out = a', '0',
    'tensordot_kernel')


def tensordot_naive(a, b, out):
    return (a.ravel() * b.ravel()).sum(out=out)


x = cupy.arange(2 ** 20, dtype='float32')
y = cupy.arange(2 ** 20, dtype='float32')
out = cupy.empty((), dtype='float32')

print(cupyx.time.repeat(tensordot_kernel, (x, y, out), max_duration=1))
print(cupyx.time.repeat(tensordot_naive, (x, y, out), max_duration=1))

print(cupyx.time.repeat(tensordot_kernel, (x, x, out), max_duration=1))
print(cupyx.time.repeat(tensordot_naive, (x, x, out), max_duration=1))

tensordot_kernel    :    CPU:   11.923 us   +/-12.392 (min:   10.815 / max:  437.032) us     GPU-0:  836.028 us   +/-25.847 (min:  820.384 / max: 1331.936) us
tensordot_naive     :    CPU:   29.429 us   +/- 2.777 (min:   28.100 / max:   83.901) us     GPU-0:  576.043 us   +/- 3.707 (min:  567.008 / max:  624.992) us
tensordot_kernel    :    CPU:   11.374 us   +/- 0.452 (min:   10.863 / max:   18.317) us     GPU-0:  438.565 us   +/- 1.325 (min:  429.984 / max:  444.256) us
tensordot_naive     :    CPU:   29.346 us   +/- 5.537 (min:   28.081 / max:  262.398) us     GPU-0:  537.016 us   +/- 5.473 (min:  533.664 / max:  765.152) us

The above tensordot_naive requires additional temporary memory, but I personally agree with the change in #3678. What do you think? @kmaehashi @emcastillo
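
To put a rough number on that temporary memory, the sketch below measures the intermediate product for the same array size as the benchmark. It uses NumPy as a stand-in so it runs without a GPU; the assumption is that CuPy allocates a buffer of the same size on the device and that its `ravel()` copies non-contiguous inputs the way NumPy's does.

```python
import numpy as np  # stand-in for cupy so the sketch runs without a GPU

n = 2 ** 20
x = np.arange(n, dtype=np.float32)
y = np.arange(n, dtype=np.float32)

# The naive version materializes the elementwise product before reducing it.
tmp = x.ravel() * y.ravel()
tmp_mib = tmp.nbytes / 2 ** 20
print(tmp_mib)  # 4.0 (2**20 elements * 4 bytes each)

# For a non-contiguous input, ravel() itself has to copy as well,
# so the naive version needs yet another temporary buffer.
strided = x[::2]
ravel_copies = not np.shares_memory(strided, strided.ravel())
print(ravel_copies)
```

So for this benchmark the extra cost is about 4 MiB per call, and it scales linearly with the input size.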
