`cupy.vdot` is way slower than naive implementation

See original GitHub issue

I am on CUDA 10.0 + RTX 2080 Ti + master branch, and this is what I see (with CUPY_ACCELERATORS=cub):

>>> import cupy as cp
>>> from cupyx.time import repeat
>>>
>>> a = cp.random.random(1000000)
>>> b = cp.random.random(1000000)
>>> def my_vdot(a, b):
...     return cp.sum(a * b)
... 
>>> print(repeat(cp.vdot, (a, b)))
vdot                :    CPU:   17.272 us   +/- 0.938 (min:   16.300 / max:   39.416) us     GPU-0:  640.646 us   +/-18.809 (min:  630.496 / max:  774.144) us
>>> print(repeat(my_vdot, (a, b)))
my_vdot             :    CPU:   26.312 us   +/- 3.240 (min:   25.204 / max:   97.959) us     GPU-0:   74.100 us   +/-12.585 (min:   70.560 / max:  274.432) us
>>> my_vdot(a, b)  == cp.vdot(a, b)
array(True)

I could probably look into this myself; I'm just opening a ticket in case I forget, so that someone else can pick it up 😅
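
The `my_vdot(a, b) == cp.vdot(a, b)` check above compares floating-point results exactly. Two reductions that visit the products in a different order can legitimately differ in the last bits, so a tolerance-based comparison is usually safer. The following is a minimal CPU-only sketch in plain Python (no CuPy needed). `dot_in_order` is a hypothetical helper, not part of CuPy, and the summation orders it uses are only an illustration, not what any GPU kernel actually does.

```python
import math


def dot_in_order(a, b, order):
    """Accumulate a[i] * b[i] sequentially, visiting indices in `order`."""
    acc = 0.0
    for i in order:
        acc += a[i] * b[i]
    return acc


a = [0.1, 0.2, 0.3]
b = [1.0, 1.0, 1.0]

forward = dot_in_order(a, b, [0, 1, 2])   # (0.1 + 0.2) + 0.3
backward = dot_in_order(a, b, [2, 1, 0])  # (0.3 + 0.2) + 0.1

print(forward == backward)              # False: the rounding differs
print(math.isclose(forward, backward))  # True: equal within tolerance
```

With CuPy arrays, the analogous check would be `cp.allclose(my_vdot(a, b), cp.vdot(a, b))`.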

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 5 (5 by maintainers)

Top GitHub Comments

1 reaction
asi1024 commented, Aug 3, 2020

CUPY_ACCELERATORS=cub was not set for the performance numbers in my previous comment (Aug 2). With it set, tensordot_naive is faster in every case I checked:

tensordot_kernel    :    CPU:   12.353 us   +/-12.194 (min:   11.217 / max:  429.384) us     GPU-0:  838.373 us   +/-26.176 (min:  820.512 / max: 1243.168) us
tensordot_naive     :    CPU:   40.067 us   +/- 1.072 (min:   38.911 / max:   65.483) us     GPU-0:   44.603 us   +/- 1.460 (min:   42.528 / max:   80.096) us
tensordot_kernel    :    CPU:   11.715 us   +/- 0.469 (min:   11.194 / max:   19.401) us     GPU-0:  438.515 us   +/- 1.323 (min:  430.464 / max:  446.176) us
tensordot_naive     :    CPU:   39.927 us   +/- 0.948 (min:   38.747 / max:   59.839) us     GPU-0:   44.409 us   +/- 1.052 (min:   42.496 / max:   64.640) us

1 reaction
asi1024 commented, Aug 2, 2020

@leofang Your my_vdot computes a * b into a contiguous temporary array and then calls a reduction on it, whereas cupy.vdot does not. Your implementation seems faster in many cases, but the current implementation is sometimes faster for certain memory layouts.

import cupy
import cupyx.time


# Fused kernel: multiply element pairs and sum them in a single reduction,
# without materializing the products.
tensordot_kernel = cupy.ReductionKernel(
    'S x, T y', 'U out',
    'static_cast<U>(x) * static_cast<U>(y)',
    'a + b', 'out = a', '0',
    'tensordot_kernel')


# Naive version: write the elementwise product to a temporary array,
# then reduce it with sum().
def tensordot_naive(a, b, out):
    return (a.ravel() * b.ravel()).sum(out=out)


x = cupy.arange(2 ** 20, dtype='float32')
y = cupy.arange(2 ** 20, dtype='float32')
out = cupy.empty((), dtype='float32')

# Two distinct input arrays.
print(cupyx.time.repeat(tensordot_kernel, (x, y, out), max_duration=1))
print(cupyx.time.repeat(tensordot_naive, (x, y, out), max_duration=1))

# The same array passed as both inputs.
print(cupyx.time.repeat(tensordot_kernel, (x, x, out), max_duration=1))
print(cupyx.time.repeat(tensordot_naive, (x, x, out), max_duration=1))

tensordot_kernel    :    CPU:   11.923 us   +/-12.392 (min:   10.815 / max:  437.032) us     GPU-0:  836.028 us   +/-25.847 (min:  820.384 / max: 1331.936) us
tensordot_naive     :    CPU:   29.429 us   +/- 2.777 (min:   28.100 / max:   83.901) us     GPU-0:  576.043 us   +/- 3.707 (min:  567.008 / max:  624.992) us
tensordot_kernel    :    CPU:   11.374 us   +/- 0.452 (min:   10.863 / max:   18.317) us     GPU-0:  438.565 us   +/- 1.325 (min:  429.984 / max:  444.256) us
tensordot_naive     :    CPU:   29.346 us   +/- 5.537 (min:   28.081 / max:  262.398) us     GPU-0:  537.016 us   +/- 5.473 (min:  533.664 / max:  765.152) us
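
The difference between the two strategies timed above can be sketched on the CPU in plain Python. This is only an analogy under stated assumptions: `dot_with_temporary` and `dot_fused` are hypothetical names, and real GPU kernels parallelize the reduction rather than looping sequentially. The first function mirrors `tensordot_naive` (materialize every product, then reduce). The second mirrors the map/reduce structure of `tensordot_kernel` (multiply and accumulate in one pass, with no intermediate buffer).

```python
def dot_with_temporary(a, b):
    # Step 1: write every product into a temporary list (extra memory).
    products = [x * y for x, y in zip(a, b)]
    # Step 2: reduce the temporary in a second pass.
    total = 0.0
    for p in products:
        total += p
    return total


def dot_fused(a, b):
    # Single pass: each product is folded into the accumulator immediately.
    total = 0.0
    for x, y in zip(a, b):
        total += x * y
    return total


a = [1.0, 2.0, 3.0]
b = [4.0, 5.0, 6.0]
print(dot_with_temporary(a, b))  # 32.0
print(dot_fused(a, b))           # 32.0
```

Both functions do the same arithmetic. As the timings in this thread suggest, which approach is faster on a GPU depends on the kernel implementation and memory layout, not only on the amount of work.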

The tensordot_naive above requires additional temporary memory, but I personally agree with the change in #3678. What do you think, @kmaehashi @emcastillo?
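
For a sense of scale, the size of that temporary can be estimated with a little arithmetic. The sketch below assumes the `2 ** 20`-element `float32` inputs from the script above, and that the product `a.ravel() * b.ravel()` is stored as a single `float32` array of the same length. `temporary_bytes` is a hypothetical helper, not a CuPy function.

```python
import struct


def temporary_bytes(n_elements, struct_format="=f"):
    # "=f" is a 32-bit float with the standard size of 4 bytes.
    return n_elements * struct.calcsize(struct_format)


size = temporary_bytes(2 ** 20)
print(size)            # 4194304 (bytes)
print(size / 2 ** 20)  # 4.0 (MiB)
```

Under these assumptions the temporary is about 4 MiB per call, which is small next to typical GPU memory but grows linearly with the input length.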
