Use workspace for performance of reduction operations
Idea:
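import cupy
import cupyx

x = cupy.arange(2 ** 24)

# Baseline: default cupy.sum.
# (Newer CuPy releases provide this timer as cupyx.profiler.benchmark.)
perf = cupyx.time.repeat(cupy.sum, (x,), max_duration=2)
print(perf)


def fast_sum(x):
    # Two-pass reduction; assumes x.size is divisible by workspace_size.
    workspace_size = 2 ** 10
    workspace = cupy.empty(workspace_size, dtype=x.dtype)
    x = x.reshape(workspace_size, x.size // workspace_size)
    x.sum(axis=1, out=workspace)  # pass 1: one partial sum per row
    return workspace.sum()  # pass 2: reduce the partial sums


perf = cupyx.time.repeat(fast_sum, (x,), max_duration=2)
print(perf)

# For comparison: cupy.sum with the CUB accelerator enabled.
cupy._core.set_routine_accelerators(['cub'])
perf = cupyx.time.repeat(cupy.sum, (x,), max_duration=2)
print(perf)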
Result (CUDA 11.2, NVIDIA A100), in the order printed: default sum, fast_sum, sum with the CUB accelerator:
sum : CPU: 19.114 us +/-11.902 (min: 16.481 / max: 118.542) us GPU-0:16290.183 us +/-273.802 (min:16187.391 / max:17508.352) us
fast_sum : CPU: 38.679 us +/- 2.807 (min: 37.246 / max: 292.944) us GPU-0: 134.390 us +/- 2.664 (min: 126.976 / max: 372.736) us
sum : CPU: 15.940 us +/- 2.425 (min: 15.333 / max: 185.851) us GPU-0: 111.900 us +/- 2.404 (min: 109.568 / max: 272.384) us
TODO: more benchmarks for various shapes and contiguity of inputs.
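Before benchmarking other shapes, note that fast_sum as written needs x.size to be a multiple of the workspace size. A minimal sketch of one way to lift that restriction for 1-D inputs follows. The name fast_sum_any_size is hypothetical, and NumPy stands in for CuPy so the sketch runs without a GPU; the array calls used here also exist in CuPy, but the CuPy behavior and performance are not verified here.

```python
import numpy as np  # stand-in for cupy so the sketch runs without a GPU


def fast_sum_any_size(x, workspace_size=2 ** 10):
    """Two-pass sum of a 1-D array whose size need not divide evenly.

    Hypothetical variant of fast_sum: the evenly divisible head goes
    through the workspace, and the short tail is summed directly.
    """
    head_len = (x.size // workspace_size) * workspace_size
    total = x[head_len:].sum()  # tail: fewer than workspace_size elements
    if head_len:
        workspace = np.empty(workspace_size, dtype=x.dtype)
        # Pass 1: one partial sum per workspace slot.
        x[:head_len].reshape(workspace_size, -1).sum(axis=1, out=workspace)
        # Pass 2: reduce the partial sums and add the tail.
        total = total + workspace.sum()
    return total


# Compare against the closed form 0 + 1 + ... + (n - 1) = n * (n - 1) / 2.
sizes = [0, 1, 1023, 1024, 1025, 5000]
checks = [
    int(fast_sum_any_size(np.arange(n, dtype=np.int64))) == n * (n - 1) // 2
    for n in sizes
]
```

The tail costs one extra small reduction, so whether this variant keeps the speedup for awkward sizes would still need measuring on a GPU.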
Top GitHub Comments
@leofang Yes. That’s right!
This algorithm can also be used for non-contiguous input. Here is an example of reducing along a non-contiguous axis:
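The example that followed this comment is not preserved here. As a minimal sketch of the idea, assume a C-ordered 2-D input reduced along axis 0, the axis whose elements are not adjacent in memory. The rows are split into chunks, each chunk is reduced into one row of a workspace, and the workspace is reduced again. The name two_pass_sum_axis0 and the chunk count are hypothetical, and NumPy stands in for CuPy so the sketch runs without a GPU.

```python
import numpy as np  # stand-in for cupy; the calls used here also exist in CuPy


def two_pass_sum_axis0(x, n_chunks=4):
    """Sum a 2-D array over axis 0 in two passes through a workspace.

    Hypothetical helper: assumes x.shape[0] is divisible by n_chunks.
    """
    n, m = x.shape
    assert n % n_chunks == 0, "sketch assumes an evenly divisible row count"
    # The workspace holds one partial sum row per chunk.
    workspace = np.empty((n_chunks, m), dtype=x.dtype)
    # Pass 1: view as (n_chunks, n // n_chunks, m) and reduce the middle axis.
    x.reshape(n_chunks, n // n_chunks, m).sum(axis=1, out=workspace)
    # Pass 2: reduce the small workspace along the chunk axis.
    return workspace.sum(axis=0)


x = np.arange(8 * 3, dtype=np.int64).reshape(8, 3)
result = two_pass_sum_axis0(x)
```

The result should match x.sum(axis=0); only the way the work is split across the two passes differs.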
@leofang For an input ndarray of shape (n, m), CuPy’s current naive reduction operation cupy.sum(x, axis=1) uses at most n (= the length of the elementwise axis) blocks. So only 1 block is used for a full-reduction operation. The fast_sum function in my sample program splits the operation into 2 kernel calls and avoids the full-reduction kernel.