CuPy 8.0.0 kernel launch overhead/time

Is there anything different about how CUDA kernels are launched between CuPy 7.3 and 8.0? I’m seeing a small drop in performance in nearly all of my CuPy raw kernels with 8.0.0b2. It’s not much, but it seems pretty consistent.

Most CuPy calls seem to be faster (which is nice).

I’ve attached a pytest-benchmark comparison pytest_730vs800.txt.

As an example, here are the Convolve2d results:

-------------------------------------------------------------------------------------------------------- benchmark 'Convolve2d': 36 tests -------------------------------------------------------------------------------------------------------
Name (time in us)                                                  Min                    Max                  Mean                StdDev                Median                 IQR            Outliers         OPS            Rounds  Iterations
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
bench_convolve2d_gpu[valid-symm-5-256] (0001_cupy730)         114.9480 (1.0)         215.6430 (1.0)        118.5592 (1.0)          6.1043 (1.30)       117.3260 (1.0)        1.0840 (1.0)       365;760  8,434.6081 (1.0)        8510           1
bench_convolve2d_gpu[valid-wrap-5-256] (0001_cupy730)         115.0150 (1.00)     35,550.1420 (164.86)     129.4123 (1.09)       394.3276 (84.01)      117.3320 (1.00)       1.5945 (1.47)      11;1365  7,727.2425 (0.92)       8543           1
bench_convolve2d_gpu[valid-fill-5-256] (0001_cupy730)         119.1470 (1.04)      7,633.0270 (35.40)      131.7770 (1.11)       130.5539 (27.81)      121.5260 (1.04)       1.5328 (1.41)      28;1259  7,588.5763 (0.90)       8135           1
bench_convolve2d_gpu[valid-symm-5-256] (0002_cupy800)         131.2820 (1.14)     10,547.4940 (48.91)      146.7623 (1.24)       165.0781 (35.17)      134.6530 (1.15)       2.3915 (2.21)      20;1344  6,813.7405 (0.81)       7465           1
bench_convolve2d_gpu[valid-wrap-5-256] (0002_cupy800)         131.8490 (1.15)        233.9780 (1.09)       136.6235 (1.15)         7.2768 (1.55)       134.6980 (1.15)       1.5060 (1.39)      567;949  7,319.3836 (0.87)       7379           1
bench_convolve2d_gpu[valid-fill-5-256] (0002_cupy800)         135.7890 (1.18)        317.2970 (1.47)       141.2737 (1.19)         9.9237 (2.11)       138.6090 (1.18)       1.6240 (1.50)      549;983  7,078.4567 (0.84)       7224           1
bench_convolve2d_gpu[same-fill-5-256] (0001_cupy730)          220.0250 (1.91)     21,787.2470 (101.03)     248.6903 (2.10)       402.6031 (85.77)      223.3620 (1.90)       3.1948 (2.95)       13;585  4,021.0650 (0.48)       4469           1
bench_convolve2d_gpu[full-fill-5-256] (0001_cupy730)          225.5980 (1.96)      8,674.5710 (40.23)      255.7666 (2.16)       248.0299 (52.84)      230.4640 (1.96)       5.0588 (4.67)       16;690  3,909.8146 (0.46)       4303           1
bench_convolve2d_gpu[same-wrap-5-256] (0001_cupy730)          249.6160 (2.17)        404.5100 (1.88)       258.5249 (2.18)         8.6587 (1.84)       256.9800 (2.19)       3.9755 (3.67)      290;342  3,868.0996 (0.46)       3929           1
bench_convolve2d_gpu[same-fill-5-256] (0002_cupy800)          254.1690 (2.21)        583.6530 (2.71)       260.1703 (2.19)        11.0884 (2.36)       258.1140 (2.20)       2.4165 (2.23)      161;398  3,843.6364 (0.46)       3735           1
bench_convolve2d_gpu[full-wrap-5-256] (0001_cupy730)          255.5680 (2.22)        463.3020 (2.15)       262.0389 (2.21)        10.6966 (2.28)       260.0770 (2.22)       2.6040 (2.40)      186;305  3,816.2270 (0.45)       3830           1
bench_convolve2d_gpu[same-symm-5-256] (0001_cupy730)          255.8330 (2.23)        385.0860 (1.79)       263.3686 (2.22)         4.6941 (1.0)        263.0425 (2.24)       2.4610 (2.27)      392;312  3,796.9595 (0.45)       3830           1
bench_convolve2d_gpu[full-fill-5-256] (0002_cupy800)          261.2810 (2.27)     28,152.8840 (130.55)     294.4878 (2.48)       476.4559 (101.50)     266.9770 (2.28)       6.1625 (5.68)       14;572  3,395.7262 (0.40)       3727           1
bench_convolve2d_gpu[full-symm-5-256] (0001_cupy730)          261.8460 (2.28)     13,590.6950 (63.02)      319.9241 (2.70)       261.0537 (55.61)      269.5450 (2.30)      53.2543 (49.13)      20;613  3,125.7414 (0.37)       3715           1
bench_convolve2d_gpu[same-wrap-5-256] (0002_cupy800)          287.3020 (2.50)     10,319.2490 (47.85)      326.2510 (2.75)       273.3862 (58.24)      296.2840 (2.53)       6.7770 (6.25)       24;591  3,065.1244 (0.36)       3434           1
bench_convolve2d_gpu[same-symm-5-256] (0002_cupy800)          292.1490 (2.54)      6,791.0310 (31.49)      329.1211 (2.78)       198.8153 (42.35)      299.6780 (2.55)       6.6305 (6.12)       80;570  3,038.3953 (0.36)       3376           1
bench_convolve2d_gpu[full-wrap-5-256] (0002_cupy800)          294.3020 (2.56)      6,894.8440 (31.97)      325.7672 (2.75)       179.1243 (38.16)      299.9030 (2.56)       8.9795 (8.28)       82;458  3,069.6764 (0.36)       2665           1
bench_convolve2d_gpu[full-symm-5-256] (0002_cupy800)          297.6980 (2.59)     13,736.2680 (63.70)      328.8248 (2.77)       329.1252 (70.12)      302.8480 (2.58)       4.5440 (4.19)       22;522  3,041.1335 (0.36)       3294           1
bench_convolve2d_gpu[valid-wrap-100-256] (0001_cupy730)     1,606.8980 (13.98)     1,765.0200 (8.18)     1,621.5128 (13.68)       15.9099 (3.39)     1,614.5380 (13.76)     11.6590 (10.76)     119;112    616.7080 (0.07)        621           1
bench_convolve2d_gpu[valid-symm-100-256] (0001_cupy730)     1,606.9310 (13.98)    10,364.9360 (48.07)    1,699.3196 (14.33)      490.6496 (104.53)   1,615.9810 (13.77)     28.5055 (26.30)        9;92    588.4708 (0.07)        620           1
bench_convolve2d_gpu[valid-fill-100-256] (0001_cupy730)     1,608.2400 (13.99)     2,453.6990 (11.38)    1,700.3170 (14.34)      213.3465 (45.45)    1,615.1715 (13.77)     11.2620 (10.39)       61;62    588.1256 (0.07)        446           1
bench_convolve2d_gpu[valid-wrap-100-256] (0002_cupy800)     1,623.3870 (14.12)     6,947.2920 (32.22)    1,709.7970 (14.42)      389.0638 (82.88)    1,629.9425 (13.89)     26.4200 (24.37)       9;104    584.8648 (0.07)        614           1
bench_convolve2d_gpu[valid-symm-100-256] (0002_cupy800)     1,623.8970 (14.13)     1,661.2350 (7.70)     1,630.7732 (13.75)        4.8489 (1.03)     1,629.0260 (13.88)      5.6245 (5.19)       142;17    613.2060 (0.07)        613           1
bench_convolve2d_gpu[valid-fill-100-256] (0002_cupy800)     1,627.8100 (14.16)     6,960.3760 (32.28)    1,836.7233 (15.49)      410.2788 (87.40)    1,665.5840 (14.20)    292.8815 (270.19)       18;7    544.4478 (0.06)        451           1
bench_convolve2d_gpu[same-fill-100-256] (0001_cupy730)      3,217.3610 (27.99)     3,583.7770 (16.62)    3,231.7568 (27.26)       28.4001 (6.05)     3,225.8565 (27.49)     10.1075 (9.32)        13;19    309.4292 (0.04)        308           1
bench_convolve2d_gpu[same-wrap-100-256] (0001_cupy730)      3,244.5420 (28.23)    19,070.7760 (88.44)    3,478.2051 (29.34)    1,183.4700 (252.12)   3,257.8750 (27.77)    121.8700 (112.43)       7;22    287.5046 (0.03)        227           1
bench_convolve2d_gpu[same-fill-100-256] (0002_cupy800)      3,252.2680 (28.29)    26,789.1500 (124.23)   3,445.3604 (29.06)    1,394.2662 (297.03)   3,269.3370 (27.87)     75.4260 (69.58)        5;32    290.2454 (0.03)        307           1
bench_convolve2d_gpu[same-symm-100-256] (0001_cupy730)      3,252.8540 (28.30)    10,608.9280 (49.20)    3,570.5240 (30.12)      736.2826 (156.85)   3,301.2800 (28.14)    228.0785 (210.40)      30;32    280.0709 (0.03)        227           1
bench_convolve2d_gpu[same-wrap-100-256] (0002_cupy800)      3,283.0430 (28.56)     4,481.9060 (20.78)    3,406.6877 (28.73)      329.8262 (70.26)    3,300.7760 (28.13)     18.5910 (17.15)       20;29    293.5403 (0.03)        226           1
bench_convolve2d_gpu[same-symm-100-256] (0002_cupy800)      3,286.3810 (28.59)     4,452.7300 (20.65)    3,347.9291 (28.24)      237.7220 (50.64)    3,294.7930 (28.08)     10.1937 (9.40)        10;13    298.6921 (0.04)        225           1
bench_convolve2d_gpu[full-fill-100-256] (0001_cupy730)      6,152.5650 (53.52)    12,701.7910 (58.90)    6,492.6414 (54.76)      852.2611 (181.56)   6,168.5815 (52.58)    339.4900 (313.18)       8;22    154.0205 (0.02)        136           1
bench_convolve2d_gpu[full-symm-100-256] (0001_cupy730)      6,183.2290 (53.79)     6,642.7990 (30.80)    6,231.9070 (52.56)       72.5656 (15.46)    6,203.6550 (52.88)     42.3120 (39.03)       13;14    160.4645 (0.02)        118           1
bench_convolve2d_gpu[full-wrap-100-256] (0001_cupy730)      6,185.1640 (53.81)    15,431.1240 (71.56)    6,613.9338 (55.79)    1,087.9448 (231.77)   6,332.8605 (53.98)    448.9280 (414.14)        7;9    151.1959 (0.02)        122           1
bench_convolve2d_gpu[full-fill-100-256] (0002_cupy800)      6,186.1080 (53.82)     6,794.2840 (31.51)    6,288.3675 (53.04)      202.2969 (43.10)    6,197.0730 (52.82)     17.9327 (16.54)       20;21    159.0238 (0.02)        119           1
bench_convolve2d_gpu[full-wrap-100-256] (0002_cupy800)      6,217.8560 (54.09)     6,549.1520 (30.37)    6,239.1438 (52.62)       35.2079 (7.50)     6,232.9380 (53.12)     12.0200 (11.09)         6;8    160.2784 (0.02)        118           1
bench_convolve2d_gpu[full-symm-100-256] (0002_cupy800)      6,224.7550 (54.15)     6,446.2190 (29.89)    6,249.2002 (52.71)       33.0107 (7.03)     6,236.9630 (53.16)     20.6007 (19.00)       11;11    160.0205 (0.02)        117           1
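
One way to check whether the gap looks like a fixed per-call cost rather than a proportional slowdown is to compare the absolute difference and the ratio of the medians for a small and a large input. The sketch below does that for two rows; the median values are copied from the table above, while the dictionary layout and the `compare` helper are illustrative names and not part of pytest-benchmark.

```python
# Median times in microseconds, copied from the benchmark table above.
# Keys are (benchmark parameters, CuPy version); this layout is illustrative.
medians_us = {
    ("valid-symm-5-256", "7.3.0"): 117.3260,
    ("valid-symm-5-256", "8.0.0b2"): 134.6530,
    ("valid-symm-100-256", "7.3.0"): 1615.9810,
    ("valid-symm-100-256", "8.0.0b2"): 1629.0260,
}


def compare(name, old="7.3.0", new="8.0.0b2"):
    """Return (absolute difference in us, ratio new/old) for one benchmark."""
    t_old = medians_us[(name, old)]
    t_new = medians_us[(name, new)]
    return t_new - t_old, t_new / t_old


for name in ("valid-symm-5-256", "valid-symm-100-256"):
    diff, ratio = compare(name)
    print(f"{name}: +{diff:.2f} us, x{ratio:.3f}")
```

For these two rows the absolute gap is roughly 17 us and 13 us, while the ratio drops from about 1.15 to about 1.01. That pattern is consistent with a roughly constant per-call overhead, although other rows in the table are noisier and this alone does not prove it.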

I’m comparing the following configurations.

CuPy Version          : 7.3.0
CUDA Root             : /usr/local/cuda
CUDA Build Version    : 10010
CUDA Driver Version   : 10020
CUDA Runtime Version  : 10010
cuBLAS Version        : 10201
cuFFT Version         : 10101
cuRAND Version        : 10101
cuSOLVER Version      : (10, 2, 0)
cuSPARSE Version      : 10300
NVRTC Version         : (10, 1)
cuDNN Build Version   : 7605
cuDNN Version         : 7600
NCCL Build Version    : 2406
NCCL Runtime Version  : 2604

CuPy Version          : 8.0.0b2
CUDA Root             : /usr/local/cuda
CUDA Build Version    : 10020
CUDA Driver Version   : 10020
CUDA Runtime Version  : 10020
cuBLAS Version        : 10202
cuFFT Version         : 10102
cuRAND Version        : 10102
cuSOLVER Version      : (10, 3, 0)
cuSPARSE Version      : 10301
NVRTC Version         : (10, 2)
cuDNN Build Version   : None
cuDNN Version         : None
NCCL Build Version    : None
NCCL Runtime Version  : None
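
Because the two dumps are long, it can help to list only the fields that differ. The following is a minimal sketch that parses the two listings above (which resemble `cupy.show_config()` output) and reports the differences; `parse_config` and `diff_configs` are hypothetical helper names introduced here for illustration.

```python
CONFIG_730 = """
CuPy Version          : 7.3.0
CUDA Root             : /usr/local/cuda
CUDA Build Version    : 10010
CUDA Driver Version   : 10020
CUDA Runtime Version  : 10010
cuBLAS Version        : 10201
cuFFT Version         : 10101
cuRAND Version        : 10101
cuSOLVER Version      : (10, 2, 0)
cuSPARSE Version      : 10300
NVRTC Version         : (10, 1)
cuDNN Build Version   : 7605
cuDNN Version         : 7600
NCCL Build Version    : 2406
NCCL Runtime Version  : 2604
"""

CONFIG_800B2 = """
CuPy Version          : 8.0.0b2
CUDA Root             : /usr/local/cuda
CUDA Build Version    : 10020
CUDA Driver Version   : 10020
CUDA Runtime Version  : 10020
cuBLAS Version        : 10202
cuFFT Version         : 10102
cuRAND Version        : 10102
cuSOLVER Version      : (10, 3, 0)
cuSPARSE Version      : 10301
NVRTC Version         : (10, 2)
cuDNN Build Version   : None
cuDNN Version         : None
NCCL Build Version    : None
NCCL Runtime Version  : None
"""


def parse_config(text):
    """Parse 'Key : value' lines into a dict of stripped strings."""
    config = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(":")
        config[key.strip()] = value.strip()
    return config


def diff_configs(old, new):
    """Return {key: (old_value, new_value)} for every field that differs."""
    return {
        key: (old.get(key), new.get(key))
        for key in old.keys() | new.keys()
        if old.get(key) != new.get(key)
    }


diffs = diff_configs(parse_config(CONFIG_730), parse_config(CONFIG_800B2))
for key in sorted(diffs):
    print(f"{key}: {diffs[key][0]} -> {diffs[key][1]}")
```

Only `CUDA Root` and `CUDA Driver Version` are identical between the two environments. In particular, the CUDA build and runtime versions move from 10010 to 10020 along with the CuPy version, so the benchmark gap may not be attributable to CuPy alone.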

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Reactions: 1
  • Comments: 35 (35 by maintainers)

Top GitHub Comments

2 reactions
emcastillo commented, May 27, 2020

Please clean the temporaries with git clean -f X

1 reaction
mnicely commented, May 27, 2020

Right, or git clean -fdx, or my lazy way: #3341 (comment)

I’ll eventually get this GitHub thing figured out! Haha
