Use workspace for performance of reduction operations

See original GitHub issue

Idea: split a full reduction into two passes that go through a small workspace buffer.

import cupy
import cupyx

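x = cupy.arange(2 ** 24)
# Baseline: time the default full reduction.
# Note: newer CuPy releases provide cupyx.profiler.benchmark in place of cupyx.time.repeat.
perf = cupyx.time.repeat(cupy.sum, (x,), max_duration=2)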
print(perf)

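def fast_sum(x):
    # Assumes x.size is a multiple of workspace_size.
    workspace_size = 2 ** 10
    workspace = cupy.empty(workspace_size, dtype=x.dtype)
    # Stage 1: view x as (workspace_size, chunk) and reduce each row into the workspace.
    x = x.reshape(workspace_size, x.size // workspace_size)
    x.sum(axis=1, out=workspace)
    # Stage 2: reduce the small workspace to a scalar.
    return workspace.sum()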

perf = cupyx.time.repeat(fast_sum, (x,), max_duration=2)
print(perf)

# Enable the CUB-backed reduction routines, then time cupy.sum again.
cupy._core.set_routine_accelerators(['cub'])
perf = cupyx.time.repeat(cupy.sum, (x,), max_duration=2)
print(perf)

Result (CUDA 11.2, NVIDIA A100). The rows are, in order: cupy.sum, fast_sum, and cupy.sum with the CUB accelerator enabled:

sum                 :    CPU:   19.114 us   +/-11.902 (min:   16.481 / max:  118.542) us     GPU-0:16290.183 us   +/-273.802 (min:16187.391 / max:17508.352) us
fast_sum            :    CPU:   38.679 us   +/- 2.807 (min:   37.246 / max:  292.944) us     GPU-0:  134.390 us   +/- 2.664 (min:  126.976 / max:  372.736) us
sum                 :    CPU:   15.940 us   +/- 2.425 (min:   15.333 / max:  185.851) us     GPU-0:  111.900 us   +/- 2.404 (min:  109.568 / max:  272.384) us
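
To put the GPU timings in proportion, the speedups over plain cupy.sum can be worked out from the mean values reported above. This is only arithmetic on those reported numbers, not a new measurement; the dictionary keys are labels chosen here for illustration.

```python
# Mean GPU times in microseconds, copied from the benchmark output above.
gpu_time_us = {
    "sum": 16290.183,
    "fast_sum": 134.390,
    "sum (CUB)": 111.900,
}

# Speedup of each variant relative to the default cupy.sum kernel.
baseline = gpu_time_us["sum"]
speedup = {name: baseline / t for name, t in gpu_time_us.items()}

for name, factor in speedup.items():
    print(f"{name:<10} {gpu_time_us[name]:>10.3f} us  {factor:6.1f}x")
```

On these figures the workspace version is roughly two orders of magnitude faster than the default kernel, and about 20% slower than the CUB-accelerated path.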

TODO: more benchmarks for various shapes and contiguity of inputs.

Issue Analytics

  • State: open
  • Created 2 years ago
  • Reactions: 3
  • Comments: 6 (6 by maintainers)

Top GitHub Comments

1 reaction
asi1024 commented, May 22, 2021

@leofang

> OK so you’re saying converting to a two-stage reduction helps. It’s actually what’s done in _cub_two_pass_launch too! 😄

Yes. That’s right!

> I am wondering if there’s a way to do non-contiguous reduction in two passes with some performance boost (even by 50% faster is good enough). It’d be great to have. AFAIK the only performant way to do this is to use cuTENSOR.

This algorithm can also be used for non-contiguous input. Here is an example that reduces over a non-contiguous axis:

import cupy
import cupyx

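def fast_sum(x):
    # Assumes 2-D x, with x.shape[1] dividing workspace_size and workspace_size dividing x.size.
    workspace_size = 2 ** 10
    # Workspace of shape (workspace_size // x.shape[1], x.shape[1]).
    workspace = cupy.empty(workspace_size, dtype=cupy.int64).reshape(-1, x.shape[1])
    # Stage 1: fold the rows of x onto the workspace rows.
    x = x.reshape(-1, *workspace.shape)
    x.sum(axis=0, out=workspace)
    # Stage 2: reduce the workspace rows to one value per column.
    return workspace.sum(axis=0)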

x = cupy.arange(2 ** 24, dtype=cupy.int32).reshape(-1, 2)

print(x.sum(axis=0))
print(fast_sum(x))

perf = cupyx.time.repeat(cupy.sum, (x,), {'axis': 0}, max_duration=2)
print(perf)

perf = cupyx.time.repeat(fast_sum, (x,), max_duration=2)
print(perf)

Result:

[70368735789056 70368744177664]
[70368735789056 70368744177664]
sum                 :    CPU:   80.187 us   +/-79.567 (min:   20.415 / max:  208.176) us     GPU-0:228061.525 us   +/-399.609 (min:227557.373 / max:228855.804) us
fast_sum            :    CPU:   39.783 us   +/- 5.905 (min:   38.235 / max:  284.604) us     GPU-0:  564.138 us   +/- 5.622 (min:  561.152 / max:  793.600) us
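
The index arithmetic behind the axis-0 version may be easier to follow on a small CPU model. The sketch below is plain Python (so it runs without a GPU) and is an illustration only, not CuPy's implementation; the function names and the `slots` parameter are made up here. Row `r` of the input is folded into workspace row `r % slots`, which is what the reshape to `(-1, slots, n_cols)` followed by `sum(axis=0)` does, and the workspace rows are then summed per column.

```python
def column_sums_direct(rows):
    """Reference: sum each column of a list of equal-length rows."""
    n_cols = len(rows[0])
    return [sum(row[j] for row in rows) for j in range(n_cols)]


def column_sums_two_stage(rows, slots):
    """Two-stage column sum; `slots` is the (hypothetical) number of workspace rows."""
    if len(rows) % slots != 0:
        raise ValueError("number of rows must be a multiple of slots")
    n_cols = len(rows[0])
    # Stage 1: input row r = a * slots + b is accumulated into workspace[b],
    # mirroring x.reshape(-1, slots, n_cols).sum(axis=0, out=workspace).
    workspace = [[0] * n_cols for _ in range(slots)]
    for r, row in enumerate(rows):
        b = r % slots
        for j in range(n_cols):
            workspace[b][j] += row[j]
    # Stage 2: reduce the small workspace, mirroring workspace.sum(axis=0).
    return column_sums_direct(workspace)


# Small stand-in for cupy.arange(128).reshape(-1, 2).
rows = [[2 * r, 2 * r + 1] for r in range(64)]
two_stage = column_sums_two_stage(rows, slots=8)
assert two_stage == column_sums_direct(rows)
print(two_stage)
```

Stage 1 produces `slots * n_cols` independent partial sums instead of only `n_cols`, which is where the extra parallelism comes from on the GPU.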
1 reaction
asi1024 commented, May 14, 2021

@leofang For an input ndarray of shape (n, m), CuPy’s current naive reduction cupy.sum(x, axis=1) uses at most n blocks, where n is the length of the elementwise (non-reduced) axis. A full reduction therefore uses only 1 block. The fast_sum function in my sample program splits the operation into 2 kernel calls and avoids the full-reduction kernel.
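
As a rough illustration of the point above, the following plain-Python model (a sketch, not CuPy code; `two_stage_sum` is a name chosen here) performs the same two-stage reduction on a flat list and reports how many independent reductions each stage exposes. Under the rule described in the comment, that count is the upper bound on the number of blocks the naive kernel could use.

```python
def two_stage_sum(values, workspace_size):
    """CPU model of a two-stage full reduction over a flat sequence.

    Returns (total, stage1_outputs, stage2_outputs), where the output counts
    are the number of independent reductions in each stage.
    """
    if len(values) % workspace_size != 0:
        raise ValueError("len(values) must be a multiple of workspace_size")
    chunk = len(values) // workspace_size
    # Stage 1: treat the data as (workspace_size, chunk) and reduce along axis 1.
    workspace = [
        sum(values[i * chunk:(i + 1) * chunk]) for i in range(workspace_size)
    ]
    # Stage 2: full reduction of the small workspace (a single output).
    total = sum(workspace)
    return total, len(workspace), 1


values = list(range(2 ** 12))
total, stage1_outputs, stage2_outputs = two_stage_sum(values, workspace_size=2 ** 4)
assert total == sum(values)
print(total, stage1_outputs, stage2_outputs)
```

A single-pass full reduction has one output and so, by the comment's reasoning, one block working through every element. In the two-stage form only the second, much smaller pass is limited to one block.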
