`cupy.cumsum()` doesn't work in ROCm 5.0
See the original GitHub issue (related: #6459).
`cupy.cumsum()`, in the code path that CuPy implements independently without calling CUB, does not work on ROCm 5.0:
```
$ python -c 'import cupy; print(cupy.cumsum(cupy.array([0,0])))'
[ 0 1304722626565153334]
```
The implementation uses shared memory to communicate across threads. As discussed in #4366, threads in a warp on ROCm run in lock-step at all times, so synchronization instructions within a warp are not required; however, a memory fence is still needed to enforce the ordering of accesses to shared memory.
Issue Analytics
- Created: 2 years ago
- Reactions: 2
- Comments: 5 (5 by maintainers)
Top GitHub Comments
No, I understand it doesn’t. In the following part of cumsum, for example, the value stored to shared memory by thread 0 is not necessarily visible to thread 1 before thread 1 reads that location with its load instruction. https://github.com/cupy/cupy/blob/812b0f5301de8896f105ed974d84b03fcb331d91/cupy/_core/_routines_math.pyx#L307-L309
I couldn’t find an exact explanation of this in the ROCm documentation, but the CUDA documentation covers it here: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#memory-fence-functions
Thanks! If HIP supports memory fences, maybe it is better to redefine it as such …; that would keep the semantics, since the warp advances in lock-step.
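A minimal sketch of that idea, assuming HIP's lock-step wavefront execution makes the actual synchronization part of a warp sync redundant. The macro name `cupy_syncwarp` and the `#ifdef` guard are illustrative only, not CuPy's actual code; `__threadfence_block()` is the block-scoped memory fence intrinsic available in both CUDA and HIP device code.

```cpp
// Hypothetical device-code fragment: on HIP, a warp (wavefront) executes
// in lock-step, so a warp-level sync can be replaced by a fence that only
// enforces memory ordering on shared-memory accesses.
#ifdef __HIP_DEVICE_COMPILE__
#define cupy_syncwarp() __threadfence_block()  // ordering only; lock-step gives the sync
#else
#define cupy_syncwarp() __syncwarp()           // CUDA: real intra-warp synchronization
#endif
```

The design point is that the fence preserves the memory-visibility half of `__syncwarp()`'s contract, and the lock-step execution model supplies the execution-ordering half for free.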