Use cuTENSOR in reduction routines
See original GitHub issue.
For performance, _AbstractReductionKernel should use cuTENSOR by default when cupy.cuda.cutensor_enabled is True.
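The backend selection the issue asks for was later exposed to users; as a hedged sketch (assuming a machine with CuPy, a CUDA GPU, and the cuTENSOR library installed), CuPy's documented CUPY_ACCELERATORS environment variable can opt reductions into the cuTENSOR backend. It must be set before cupy is imported:

```python
# Sketch: selecting CuPy's accelerator backends via the documented
# CUPY_ACCELERATORS environment variable. This must be set before
# `import cupy`; actually exercising it requires a CUDA GPU with
# cuTENSOR installed, so the CuPy calls below are shown as comments.
import os

# Try the cuTENSOR backend first, falling back to CUB.
os.environ["CUPY_ACCELERATORS"] = "cutensor,cub"

# On a GPU machine, reductions then dispatch through the accelerator:
# import cupy as cp
# x = cp.random.rand(256, 1024, dtype=cp.float32)
# y = x.sum(axis=1)  # may run through cuTENSOR's reduction routine
```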
Issue Analytics
- State:
- Created: 4 years ago
- Comments: 11 (11 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
After applying #2921, CUB is faster in all cases.
I compared the performance of CUB and cuTENSOR. Benchmark script: https://gist.github.com/asi1024/ee62c50fd1254acb0e9431473862a014. Output on a V100:
cuTENSOR is faster in batch reduction, and CUB is faster in full reduction.
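The distinction the benchmark draws can be illustrated with a NumPy stand-in (the CuPy benchmark itself needs a GPU): a full reduction collapses every axis to a scalar, while a batch (partial) reduction keeps some axes and produces one result per batch element.

```python
# Full vs. batch reduction, illustrated with NumPy (no GPU needed).
import numpy as np

x = np.arange(24, dtype=np.float32).reshape(4, 6)

full = x.sum()         # full reduction: all axes collapse to a scalar
batch = x.sum(axis=1)  # batch reduction: one sum per row, shape (4,)

print(full.shape)   # -> ()
print(batch.shape)  # -> (4,)
```

In the benchmark above, cuTENSOR wins on the `axis=...` (batch) case while CUB wins when everything is reduced to a single scalar.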