Use workspace for performance of reduction operations

Idea:

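import cupy
import cupyx

x = cupy.arange(2 ** 24)

# Baseline: the default cupy.sum full reduction.
# (Newer CuPy releases provide cupyx.profiler.benchmark in place of cupyx.time.repeat.)
perf = cupyx.time.repeat(cupy.sum, (x,), max_duration=2)
print(perf)

def fast_sum(x):
    # Two-stage reduction; assumes x.size is divisible by workspace_size.
    workspace_size = 2 ** 10
    workspace = cupy.empty(workspace_size, dtype=x.dtype)
    # Stage 1: reduce each of the workspace_size rows into the workspace.
    x = x.reshape(workspace_size, x.size // workspace_size)
    x.sum(axis=1, out=workspace)
    # Stage 2: reduce the small workspace to a scalar.
    return workspace.sum()

perf = cupyx.time.repeat(fast_sum, (x,), max_duration=2)
print(perf)

# Same cupy.sum call, now with the CUB routine accelerator enabled.
cupy._core.set_routine_accelerators(['cub'])
perf = cupyx.time.repeat(cupy.sum, (x,), max_duration=2)
print(perf)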

Result (CUDA 11.2, NVIDIA A100):

sum                 :    CPU:   19.114 us   +/-11.902 (min:   16.481 / max:  118.542) us     GPU-0:16290.183 us   +/-273.802 (min:16187.391 / max:17508.352) us
fast_sum            :    CPU:   38.679 us   +/- 2.807 (min:   37.246 / max:  292.944) us     GPU-0:  134.390 us   +/- 2.664 (min:  126.976 / max:  372.736) us
sum                 :    CPU:   15.940 us   +/- 2.425 (min:   15.333 / max:  185.851) us     GPU-0:  111.900 us   +/- 2.404 (min:  109.568 / max:  272.384) us
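To put the GPU timings above in proportion, the mean values can be divided directly. This is only arithmetic on the three reported means; the row labels in the dictionary are made up for readability, and the third row is taken to be cupy.sum with the CUB accelerator enabled, as in the script.

```python
# Mean GPU times in microseconds, copied from the result rows above.
gpu_mean_us = {
    "sum (default)": 16290.183,
    "fast_sum": 134.390,
    "sum (cub)": 111.900,
}

baseline = gpu_mean_us["sum (default)"]

# Speedup of each variant relative to the default cupy.sum.
speedups = {name: baseline / t for name, t in gpu_mean_us.items()}

for name, factor in speedups.items():
    print(f"{name:15s} {factor:8.1f}x")
```

On these numbers the two-stage fast_sum is roughly 121x faster than the default reduction, and the CUB-accelerated sum roughly 146x.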

TODO: more benchmarks for various shapes and contiguity of inputs.

Top GitHub Comments

asi1024 commented, May 22, 2021

@leofang

> OK so you’re saying converting to a two-stage reduction helps. It’s actually what’s done in _cub_two_pass_launch too! 😄

Yes. That’s right!

> I am wondering if there’s a way to do non-contiguous reduction in two passes with some performance boost (even by 50% faster is good enough). It’d be great to have. AFAIK the only performant way to do this is to use cuTENSOR.

This algorithm can also be used for non-contiguous input. Here is an example that reduces over a non-contiguous axis:

import cupy
import cupyx

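def fast_sum(x):
    workspace_size = 2 ** 10
    # Workspace of shape (workspace_size // m, m) for an input of shape (n, m).
    workspace = cupy.empty(workspace_size, dtype=cupy.int64).reshape(-1, x.shape[1])
    # Stage 1: view x as (k, workspace_size // m, m) and sum over the leading axis.
    x = x.reshape(-1, *workspace.shape)
    x.sum(axis=0, out=workspace)
    # Stage 2: sum the workspace rows to get the per-column totals.
    return workspace.sum(axis=0)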

x = cupy.arange(2 ** 24, dtype=cupy.int32).reshape(-1, 2)

print(x.sum(axis=0))
print(fast_sum(x))

perf = cupyx.time.repeat(cupy.sum, (x,), {'axis': 0}, max_duration=2)
print(perf)

perf = cupyx.time.repeat(fast_sum, (x,), max_duration=2)
print(perf)

[70368735789056 70368744177664]
[70368735789056 70368744177664]
sum                 :    CPU:   80.187 us   +/-79.567 (min:   20.415 / max:  208.176) us     GPU-0:228061.525 us   +/-399.609 (min:227557.373 / max:228855.804) us
fast_sum            :    CPU:   39.783 us   +/- 5.905 (min:   38.235 / max:  284.604) us     GPU-0:  564.138 us   +/- 5.622 (min:  561.152 / max:  793.600) us
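The index mapping behind the axis-0 version can be checked on the CPU. The sketch below uses NumPy as a stand-in for CuPy (an assumption made so it runs without a GPU); the function name two_stage_sum_axis0 and the rows parameter are hypothetical, and it says nothing about GPU performance.

```python
import numpy as np  # CPU stand-in for cupy; the calls used here have the same shape semantics


def two_stage_sum_axis0(x, rows=512):
    """Sum a 2-D array over axis 0 in two passes through a (rows, m) workspace."""
    n, m = x.shape
    if n % rows != 0:
        raise ValueError("this sketch requires x.shape[0] to be divisible by rows")
    workspace = np.empty((rows, m), dtype=np.int64)
    # Stage 1: row i of x is accumulated into workspace[i % rows],
    # so rows * m partial sums are produced independently.
    x.reshape(-1, rows, m).sum(axis=0, dtype=np.int64, out=workspace)
    # Stage 2: a small reduction over the workspace rows.
    return workspace.sum(axis=0)


x = np.arange(2 ** 12, dtype=np.int32).reshape(-1, 2)
result = two_stage_sum_axis0(x, rows=16)
print(result, x.sum(axis=0, dtype=np.int64))
```

With rows=512 and m=2 this matches the 2 ** 10 element workspace used in the example above.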

asi1024 commented, May 14, 2021

@leofang For an input ndarray of shape (n, m), CuPy’s current naive reduction cupy.sum(x, axis=1) uses at most n blocks, where n is the length of the elementwise (non-reduced) axis. So a full reduction uses only 1 block. The fast_sum function in my sample program splits the operation into 2 kernel calls and avoids the full-reduction kernel.
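A rough CPU sketch of that split, assuming NumPy as a stand-in for CuPy: the first pass produces workspace_size independent partial sums (one per output row, which is what lets the GPU use many blocks), and the second pass reduces the small workspace. The name two_stage_sum and the handling of a leftover tail are additions for illustration; the original fast_sum assumes the size divides evenly.

```python
import numpy as np  # CPU stand-in for cupy, used only to check the arithmetic


def two_stage_sum(x, workspace_size=2 ** 10):
    """Full reduction in two passes: many partial sums, then one small sum."""
    flat = x.ravel()
    chunk = flat.size // workspace_size
    # Largest prefix that splits evenly into workspace_size rows.
    head = flat[: chunk * workspace_size]
    # Leftover elements when the size is not a multiple of workspace_size.
    tail = flat[chunk * workspace_size:]
    # Stage 1: workspace_size independent row reductions.
    partial = head.reshape(workspace_size, chunk).sum(axis=1)
    # Stage 2: reduce the partial sums, then add the (short) tail.
    return partial.sum() + tail.sum()


x = np.arange(2 ** 16 + 7, dtype=np.int64)
total = two_stage_sum(x)
print(total, x.sum())
```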
