Use cuTENSOR in reduction routines

For better performance, _AbstractReductionKernel should use cuTENSOR by default when cupy.cuda.cutensor_enabled is True.

Issue Analytics

  • State: open
  • Created 4 years ago
  • Comments: 11 (11 by maintainers)

Top GitHub Comments

2 reactions
emcastillo commented, Jan 9, 2020

After applying #2921, CUB is faster than cuTENSOR in all cases:

time_reduction       - case all naive                     :    16.432 us   +/- 2.531 (min:   14.376 / max:   21.252) us  18450.842 us   +/-27.035 (min:18412.544 / max:18479.103) us
time_reduction       - case all cub                       :    30.325 us   +/- 7.241 (min:   25.795 / max:   44.700) us    262.554 us   +/- 6.528 (min:  258.048 / max:  275.456) us
time_reduction       - case all cute                      :   108.133 us   +/- 5.533 (min:  102.569 / max:  116.556) us    334.029 us   +/- 4.733 (min:  328.704 / max:  340.992) us
time_reduction       - case first naive                    :    13.955 us   +/- 0.377 (min:   13.324 / max:   14.455) us    244.326 us   +/- 0.819 (min:  243.712 / max:  245.760) us
time_reduction       - case first cub                     :    20.176 us   +/- 1.597 (min:   18.873 / max:   23.197) us    251.085 us   +/- 2.996 (min:  248.832 / max:  257.024) us
time_reduction       - case first cute                    :   109.144 us   +/-11.078 (min:   98.060 / max:  128.639) us    342.630 us   +/-11.154 (min:  330.752 / max:  361.472) us
time_reduction       - case mid naive                     :    16.435 us   +/- 0.724 (min:   15.675 / max:   17.591) us    331.776 us   +/- 0.916 (min:  330.752 / max:  332.800) us
time_reduction       - case mid cub                       :    22.627 us   +/- 2.220 (min:   20.463 / max:   25.401) us    336.486 us   +/- 2.939 (min:  333.824 / max:  340.992) us
time_reduction       - case mid cute                      :   121.863 us   +/-35.807 (min:   99.245 / max:  192.610) us    350.413 us   +/-35.279 (min:  327.680 / max:  419.840) us
time_reduction       - case batch naive                    :    17.367 us   +/- 2.506 (min:   15.790 / max:   22.336) us    950.682 us   +/- 3.402 (min:  948.224 / max:  957.440) us
time_reduction       - case batch cub                     :    57.497 us   +/- 1.720 (min:   55.409 / max:   60.489) us    272.384 us   +/- 2.148 (min:  270.336 / max:  276.480) us
time_reduction       - case batch cute                    :   127.743 us   +/-46.858 (min:   96.462 / max:  220.364) us    362.906 us   +/-45.989 (min:  331.776 / max:  453.632) us
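To compare the backends row by row, the output above can be parsed with a short script. This is a minimal sketch rather than part of the original benchmark: the names `ROWS`, `ROW_RE`, and `parse_rows` are hypothetical, the rows are copied from the output above with whitespace condensed, and the script reads only the two mean values per row without assuming what each column measures.

```python
import re

# Rows copied from the benchmark output above (whitespace condensed).
ROWS = """\
time_reduction - case all naive : 16.432 us +/- 2.531 (min: 14.376 / max: 21.252) us 18450.842 us +/-27.035 (min:18412.544 / max:18479.103) us
time_reduction - case all cub : 30.325 us +/- 7.241 (min: 25.795 / max: 44.700) us 262.554 us +/- 6.528 (min: 258.048 / max: 275.456) us
time_reduction - case all cute : 108.133 us +/- 5.533 (min: 102.569 / max: 116.556) us 334.029 us +/- 4.733 (min: 328.704 / max: 340.992) us
time_reduction - case first naive : 13.955 us +/- 0.377 (min: 13.324 / max: 14.455) us 244.326 us +/- 0.819 (min: 243.712 / max: 245.760) us
time_reduction - case first cub : 20.176 us +/- 1.597 (min: 18.873 / max: 23.197) us 251.085 us +/- 2.996 (min: 248.832 / max: 257.024) us
time_reduction - case first cute : 109.144 us +/-11.078 (min: 98.060 / max: 128.639) us 342.630 us +/-11.154 (min: 330.752 / max: 361.472) us
time_reduction - case mid naive : 16.435 us +/- 0.724 (min: 15.675 / max: 17.591) us 331.776 us +/- 0.916 (min: 330.752 / max: 332.800) us
time_reduction - case mid cub : 22.627 us +/- 2.220 (min: 20.463 / max: 25.401) us 336.486 us +/- 2.939 (min: 333.824 / max: 340.992) us
time_reduction - case mid cute : 121.863 us +/-35.807 (min: 99.245 / max: 192.610) us 350.413 us +/-35.279 (min: 327.680 / max: 419.840) us
time_reduction - case batch naive : 17.367 us +/- 2.506 (min: 15.790 / max: 22.336) us 950.682 us +/- 3.402 (min: 948.224 / max: 957.440) us
time_reduction - case batch cub : 57.497 us +/- 1.720 (min: 55.409 / max: 60.489) us 272.384 us +/- 2.148 (min: 270.336 / max: 276.480) us
time_reduction - case batch cute : 127.743 us +/-46.858 (min: 96.462 / max: 220.364) us 362.906 us +/-45.989 (min: 331.776 / max: 453.632) us
"""

# Captures: case name, backend name, first mean (us), second mean (us).
ROW_RE = re.compile(
    r"case\s+(\w+)\s+(\w+)\s*:\s*([\d.]+) us.*?\) us\s+([\d.]+) us"
)


def parse_rows(text):
    """Return {case: {backend: (first_mean_us, second_mean_us)}}."""
    table = {}
    for line in text.splitlines():
        match = ROW_RE.search(line)
        if match is None:
            continue
        case, backend, first_mean, second_mean = match.groups()
        table.setdefault(case, {})[backend] = (float(first_mean), float(second_mean))
    return table


table = parse_rows(ROWS)
for case, timings in table.items():
    # Rank backends by the second timing column of each row.
    fastest = min(timings, key=lambda name: timings[name][1])
    ratio = timings["cute"][1] / timings["cub"][1]
    print(f"{case:>5}: fastest={fastest}, cute/cub={ratio:.2f}x")
```

Ranking by the second column puts cub ahead of cute in every case, while naive stays marginally ahead of cub in the first and mid cases.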
1 reaction
asi1024 commented, Jan 8, 2020

I compared the performance of CUB and cuTENSOR.

Benchmark script: https://gist.github.com/asi1024/ee62c50fd1254acb0e9431473862a014

Output on V100:

basic    (axes:  (0, 1, 2)):    54.037 us   +/-44.718 (min:   14.485 / max:  348.380) us  17436.959 us   +/-50.810 (min:17367.041 / max:17750.015) us
cub      (axes:  (0, 1, 2)):    51.916 us   +/-387.385 (min:   23.083 / max:16699.770) us    154.005 us   +/-390.641 (min:  128.000 / max:16706.560) us
cutensor (axes:  (0, 1, 2)):    73.594 us   +/-475.080 (min:   31.735 / max:24820.344) us    192.569 us   +/-473.173 (min:  152.576 / max:24834.047) us
basic    (axes:       (0,)):    19.748 us   +/- 1.972 (min:   14.864 / max:   35.450) us    158.065 us   +/- 1.978 (min:  153.600 / max:  173.056) us
cub      (axes:       (0,)):    29.280 us   +/-219.529 (min:   13.376 / max:19073.978) us    169.111 us   +/-255.464 (min:  151.552 / max:19080.193) us
cutensor (axes:       (0,)):    68.540 us   +/-398.813 (min:   27.978 / max:18368.091) us    186.627 us   +/-406.490 (min:  145.408 / max:18482.176) us
basic    (axes:       (1,)):    23.490 us   +/- 4.151 (min:   20.438 / max:   47.350) us    312.504 us   +/- 4.082 (min:  309.248 / max:  335.872) us
cub      (axes:       (1,)):    38.303 us   +/-390.924 (min:   15.284 / max:22703.214) us    327.374 us   +/-388.341 (min:  304.128 / max:22990.849) us
cutensor (axes:       (1,)):    77.555 us   +/-583.095 (min:   28.172 / max:22202.181) us    195.864 us   +/-585.698 (min:  145.408 / max:22317.057) us
basic    (axes:       (2,)):    19.400 us   +/- 6.439 (min:   12.033 / max:   38.020) us    850.309 us   +/- 6.084 (min:  842.752 / max:  869.376) us
cub      (axes:       (2,)):    36.753 us   +/-124.664 (min:   15.517 / max: 9828.809) us    867.743 us   +/-124.581 (min:  845.824 / max:10658.816) us
cutensor (axes:       (2,)):    75.536 us   +/-439.583 (min:   27.872 / max:16241.652) us    202.909 us   +/-489.361 (min:  150.528 / max:21430.271) us

cuTENSOR is faster in batch reduction, and CUB is faster in full reduction.
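
One way to act on that observation would be to pick the backend from the shape of the reduction. The sketch below is purely illustrative and is not how CuPy dispatches reductions: `choose_reduction_backend` and its parameters are hypothetical names, and the two boolean flags stand in for checks such as cupy.cuda.cutensor_enabled.

```python
def choose_reduction_backend(ndim, axes, cub_enabled=True, cutensor_enabled=True):
    """Hypothetical heuristic: CUB for full reductions, cuTENSOR for partial ones.

    ndim: number of dimensions of the input array.
    axes: iterable of axes being reduced (negative values allowed).
    Returns one of "cub", "cutensor", or "basic".
    """
    # Normalize negative axes and drop duplicates.
    reduced = {axis % ndim for axis in axes}
    full_reduction = len(reduced) == ndim

    # Preference order follows the measurements quoted above.
    if full_reduction:
        preferred = ("cub", "cutensor")
    else:
        preferred = ("cutensor", "cub")

    available = {"cub": cub_enabled, "cutensor": cutensor_enabled}
    for backend in preferred:
        if available[backend]:
            return backend
    # Neither accelerator is available: fall back to the plain kernel.
    return "basic"


print(choose_reduction_backend(3, (0, 1, 2)))  # full reduction
print(choose_reduction_backend(3, (2,)))       # partial (batch) reduction
```

The preference order is the only part tied to the data, so it is kept in one place; the measurements taken with #2921 applied rank the backends differently, and the order would need to change accordingly.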
