`cupy.vdot` is way slower than naive implementation
I am on CUDA 10.0 + GTX 2080 Ti + master branch, and this is what I see (with `CUPY_ACCELERATORS=cub`):
>>> import cupy as cp
>>> from cupyx.time import repeat
>>>
>>> a = cp.random.random(1000000)
>>> b = cp.random.random(1000000)
>>> def my_vdot(a, b):
...     return cp.sum(a * b)
...
>>> print(repeat(cp.vdot, (a, b)))
vdot : CPU: 17.272 us +/- 0.938 (min: 16.300 / max: 39.416) us GPU-0: 640.646 us +/-18.809 (min: 630.496 / max: 774.144) us
>>> print(repeat(my_vdot, (a, b)))
my_vdot : CPU: 26.312 us +/- 3.240 (min: 25.204 / max: 97.959) us GPU-0: 74.100 us +/-12.585 (min: 70.560 / max: 274.432) us
>>> my_vdot(a, b) == cp.vdot(a, b)
array(True)
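The equality check above uses real-valued inputs. As a rough sketch of a more tolerant check: NumPy's `vdot` conjugates its first argument, and CuPy's `vdot` is meant to be NumPy-compatible, so a naive replacement would need `conj()` to agree on complex data. The helper name `naive_vdot` and the fallback to NumPy when no GPU is usable are assumptions added for illustration, not part of the issue.

```python
import numpy as np

# Assumption: fall back to NumPy so the sketch still runs without a usable GPU.
try:
    import cupy as xp
    xp.zeros(1)  # raises if CuPy is installed but no device is available
except Exception:
    xp = np


def naive_vdot(a, b):
    # Hypothetical helper: my_vdot from the issue plus conj() on the first argument.
    return xp.sum(a.conj() * b)


rng = np.random.default_rng(0)
a_real = xp.asarray(rng.random(1000))
b_real = xp.asarray(rng.random(1000))
a_cplx = xp.asarray(rng.random(1000) + 1j * rng.random(1000))
b_cplx = xp.asarray(rng.random(1000) + 1j * rng.random(1000))

# allclose tolerates small differences in floating-point summation order.
real_match = bool(xp.allclose(naive_vdot(a_real, b_real), xp.vdot(a_real, b_real)))
cplx_match = bool(xp.allclose(naive_vdot(a_cplx, b_cplx), xp.vdot(a_cplx, b_cplx)))

# Without conj(), the plain product disagrees with vdot on complex inputs.
plain_match = bool(xp.allclose(xp.sum(a_cplx * b_cplx), xp.vdot(a_cplx, b_cplx)))

print(real_match, cplx_match, plain_match)
```

For real arrays `conj()` changes nothing, so this stays consistent with the `my_vdot` comparison in the issue.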
I could probably look into this myself, just opening a ticket in case I forget and someone can pick up 😅
Top GitHub Comments
`CUPY_ACCELERATORS=cub` is not set in the above performance numbers. If set, `tensordot_naive` in any case as far as I checked.
@leofang Your `my_vdot` computes `a * b` into a contiguous memory space and then calls a reduction operation, whereas `cupy.vdot` does not. Your implementation seems faster in many cases, but the current implementation is sometimes faster for some memory layouts.
The above `tensordot_naive` requires additional temporary memory space, but I personally agree with the change in #3678. What do you think? @kmaehashi @emcastillo
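To make the memory-layout and temporary-buffer points concrete, here is a minimal sketch that builds one contiguous and one strided pair of vectors, reports the size of the temporary that the multiply-then-sum approach materializes, and confirms both approaches agree. The names `naive_vdot`, `layouts`, and `results` are hypothetical, and the NumPy fallback is an assumption so the sketch runs without a GPU. It does not measure speed.

```python
import numpy as np

# Assumption: fall back to NumPy so the sketch still runs without a usable GPU.
try:
    import cupy as xp
    xp.zeros(1)  # raises if CuPy is installed but no device is available
except Exception:
    xp = np


def naive_vdot(a, b):
    # Same idea as my_vdot in the issue: materialize a * b, then reduce.
    return xp.sum(a * b)


n = 1_000_000
rng = np.random.default_rng(0)
base_a = xp.asarray(rng.random(2 * n))
base_b = xp.asarray(rng.random(2 * n))

# Two layouts of the same length: a contiguous slice and a stride-2 view.
layouts = {
    "contiguous": (base_a[:n], base_b[:n]),
    "strided": (base_a[::2], base_b[::2]),
}

results = {}
for name, (a, b) in layouts.items():
    is_contiguous = bool(a.flags.c_contiguous)
    temp_bytes = (a * b).nbytes  # temporary allocated by the naive version
    agrees = bool(xp.allclose(naive_vdot(a, b), xp.vdot(a, b)))
    results[name] = (is_contiguous, temp_bytes, agrees)
    print(name, is_contiguous, temp_bytes, agrees)
```

To compare speed per layout, each pair in `layouts` could be passed to the same `repeat` helper used in the issue.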