CuPy 8.0.0 kernel launch overhead/time
Is there anything different about how CUDA kernels are launched between CuPy 7.3 and 8.0? I’m seeing a small drop in performance in nearly all of my CuPy raw kernels with 8.0.0b2. It’s not much, but it seems pretty consistent.
Most CuPy calls seem to be faster (which is nice).
I’ve attached a pytest-benchmark comparison pytest_730vs800.txt.
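To isolate per-launch overhead from kernel execution time, the minimum over many rounds is the most useful statistic, since outliers (JIT compilation, OS jitter) only inflate the upper tail. A minimal pure-Python sketch of that measurement methodology (the `launch` callable here is a stand-in for a synchronized raw-kernel call, not actual CuPy API, so it runs without a GPU):

```python
import statistics
import time

def bench(launch, rounds=1000):
    """Time `launch` once per round; return (min, median) in microseconds.

    The minimum is the most robust estimate of fixed per-call overhead;
    the median shows the typical case.
    """
    times = []
    for _ in range(rounds):
        t0 = time.perf_counter()
        launch()
        times.append((time.perf_counter() - t0) * 1e6)  # seconds -> microseconds
    return min(times), statistics.median(times)

# Stand-in workload so the harness is runnable anywhere.
mn, med = bench(lambda: sum(range(100)))
print(f"min={mn:.2f} us, median={med:.2f} us")
```

With a real raw kernel you would call `kernel(grid, block, args)` followed by a stream synchronize inside `launch`, so that each round measures launch plus completion.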
As an example:
-------------------------------------------------------------------------------------------------------- benchmark 'Convolve2d': 36 tests -------------------------------------------------------------------------------------------------------
Name (time in us) Min Max Mean StdDev Median IQR Outliers OPS Rounds Iterations
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
bench_convolve2d_gpu[valid-symm-5-256] (0001_cupy730) 114.9480 (1.0) 215.6430 (1.0) 118.5592 (1.0) 6.1043 (1.30) 117.3260 (1.0) 1.0840 (1.0) 365;760 8,434.6081 (1.0) 8510 1
bench_convolve2d_gpu[valid-wrap-5-256] (0001_cupy730) 115.0150 (1.00) 35,550.1420 (164.86) 129.4123 (1.09) 394.3276 (84.01) 117.3320 (1.00) 1.5945 (1.47) 11;1365 7,727.2425 (0.92) 8543 1
bench_convolve2d_gpu[valid-fill-5-256] (0001_cupy730) 119.1470 (1.04) 7,633.0270 (35.40) 131.7770 (1.11) 130.5539 (27.81) 121.5260 (1.04) 1.5328 (1.41) 28;1259 7,588.5763 (0.90) 8135 1
bench_convolve2d_gpu[valid-symm-5-256] (0002_cupy800) 131.2820 (1.14) 10,547.4940 (48.91) 146.7623 (1.24) 165.0781 (35.17) 134.6530 (1.15) 2.3915 (2.21) 20;1344 6,813.7405 (0.81) 7465 1
bench_convolve2d_gpu[valid-wrap-5-256] (0002_cupy800) 131.8490 (1.15) 233.9780 (1.09) 136.6235 (1.15) 7.2768 (1.55) 134.6980 (1.15) 1.5060 (1.39) 567;949 7,319.3836 (0.87) 7379 1
bench_convolve2d_gpu[valid-fill-5-256] (0002_cupy800) 135.7890 (1.18) 317.2970 (1.47) 141.2737 (1.19) 9.9237 (2.11) 138.6090 (1.18) 1.6240 (1.50) 549;983 7,078.4567 (0.84) 7224 1
bench_convolve2d_gpu[same-fill-5-256] (0001_cupy730) 220.0250 (1.91) 21,787.2470 (101.03) 248.6903 (2.10) 402.6031 (85.77) 223.3620 (1.90) 3.1948 (2.95) 13;585 4,021.0650 (0.48) 4469 1
bench_convolve2d_gpu[full-fill-5-256] (0001_cupy730) 225.5980 (1.96) 8,674.5710 (40.23) 255.7666 (2.16) 248.0299 (52.84) 230.4640 (1.96) 5.0588 (4.67) 16;690 3,909.8146 (0.46) 4303 1
bench_convolve2d_gpu[same-wrap-5-256] (0001_cupy730) 249.6160 (2.17) 404.5100 (1.88) 258.5249 (2.18) 8.6587 (1.84) 256.9800 (2.19) 3.9755 (3.67) 290;342 3,868.0996 (0.46) 3929 1
bench_convolve2d_gpu[same-fill-5-256] (0002_cupy800) 254.1690 (2.21) 583.6530 (2.71) 260.1703 (2.19) 11.0884 (2.36) 258.1140 (2.20) 2.4165 (2.23) 161;398 3,843.6364 (0.46) 3735 1
bench_convolve2d_gpu[full-wrap-5-256] (0001_cupy730) 255.5680 (2.22) 463.3020 (2.15) 262.0389 (2.21) 10.6966 (2.28) 260.0770 (2.22) 2.6040 (2.40) 186;305 3,816.2270 (0.45) 3830 1
bench_convolve2d_gpu[same-symm-5-256] (0001_cupy730) 255.8330 (2.23) 385.0860 (1.79) 263.3686 (2.22) 4.6941 (1.0) 263.0425 (2.24) 2.4610 (2.27) 392;312 3,796.9595 (0.45) 3830 1
bench_convolve2d_gpu[full-fill-5-256] (0002_cupy800) 261.2810 (2.27) 28,152.8840 (130.55) 294.4878 (2.48) 476.4559 (101.50) 266.9770 (2.28) 6.1625 (5.68) 14;572 3,395.7262 (0.40) 3727 1
bench_convolve2d_gpu[full-symm-5-256] (0001_cupy730) 261.8460 (2.28) 13,590.6950 (63.02) 319.9241 (2.70) 261.0537 (55.61) 269.5450 (2.30) 53.2543 (49.13) 20;613 3,125.7414 (0.37) 3715 1
bench_convolve2d_gpu[same-wrap-5-256] (0002_cupy800) 287.3020 (2.50) 10,319.2490 (47.85) 326.2510 (2.75) 273.3862 (58.24) 296.2840 (2.53) 6.7770 (6.25) 24;591 3,065.1244 (0.36) 3434 1
bench_convolve2d_gpu[same-symm-5-256] (0002_cupy800) 292.1490 (2.54) 6,791.0310 (31.49) 329.1211 (2.78) 198.8153 (42.35) 299.6780 (2.55) 6.6305 (6.12) 80;570 3,038.3953 (0.36) 3376 1
bench_convolve2d_gpu[full-wrap-5-256] (0002_cupy800) 294.3020 (2.56) 6,894.8440 (31.97) 325.7672 (2.75) 179.1243 (38.16) 299.9030 (2.56) 8.9795 (8.28) 82;458 3,069.6764 (0.36) 2665 1
bench_convolve2d_gpu[full-symm-5-256] (0002_cupy800) 297.6980 (2.59) 13,736.2680 (63.70) 328.8248 (2.77) 329.1252 (70.12) 302.8480 (2.58) 4.5440 (4.19) 22;522 3,041.1335 (0.36) 3294 1
bench_convolve2d_gpu[valid-wrap-100-256] (0001_cupy730) 1,606.8980 (13.98) 1,765.0200 (8.18) 1,621.5128 (13.68) 15.9099 (3.39) 1,614.5380 (13.76) 11.6590 (10.76) 119;112 616.7080 (0.07) 621 1
bench_convolve2d_gpu[valid-symm-100-256] (0001_cupy730) 1,606.9310 (13.98) 10,364.9360 (48.07) 1,699.3196 (14.33) 490.6496 (104.53) 1,615.9810 (13.77) 28.5055 (26.30) 9;92 588.4708 (0.07) 620 1
bench_convolve2d_gpu[valid-fill-100-256] (0001_cupy730) 1,608.2400 (13.99) 2,453.6990 (11.38) 1,700.3170 (14.34) 213.3465 (45.45) 1,615.1715 (13.77) 11.2620 (10.39) 61;62 588.1256 (0.07) 446 1
bench_convolve2d_gpu[valid-wrap-100-256] (0002_cupy800) 1,623.3870 (14.12) 6,947.2920 (32.22) 1,709.7970 (14.42) 389.0638 (82.88) 1,629.9425 (13.89) 26.4200 (24.37) 9;104 584.8648 (0.07) 614 1
bench_convolve2d_gpu[valid-symm-100-256] (0002_cupy800) 1,623.8970 (14.13) 1,661.2350 (7.70) 1,630.7732 (13.75) 4.8489 (1.03) 1,629.0260 (13.88) 5.6245 (5.19) 142;17 613.2060 (0.07) 613 1
bench_convolve2d_gpu[valid-fill-100-256] (0002_cupy800) 1,627.8100 (14.16) 6,960.3760 (32.28) 1,836.7233 (15.49) 410.2788 (87.40) 1,665.5840 (14.20) 292.8815 (270.19) 18;7 544.4478 (0.06) 451 1
bench_convolve2d_gpu[same-fill-100-256] (0001_cupy730) 3,217.3610 (27.99) 3,583.7770 (16.62) 3,231.7568 (27.26) 28.4001 (6.05) 3,225.8565 (27.49) 10.1075 (9.32) 13;19 309.4292 (0.04) 308 1
bench_convolve2d_gpu[same-wrap-100-256] (0001_cupy730) 3,244.5420 (28.23) 19,070.7760 (88.44) 3,478.2051 (29.34) 1,183.4700 (252.12) 3,257.8750 (27.77) 121.8700 (112.43) 7;22 287.5046 (0.03) 227 1
bench_convolve2d_gpu[same-fill-100-256] (0002_cupy800) 3,252.2680 (28.29) 26,789.1500 (124.23) 3,445.3604 (29.06) 1,394.2662 (297.03) 3,269.3370 (27.87) 75.4260 (69.58) 5;32 290.2454 (0.03) 307 1
bench_convolve2d_gpu[same-symm-100-256] (0001_cupy730) 3,252.8540 (28.30) 10,608.9280 (49.20) 3,570.5240 (30.12) 736.2826 (156.85) 3,301.2800 (28.14) 228.0785 (210.40) 30;32 280.0709 (0.03) 227 1
bench_convolve2d_gpu[same-wrap-100-256] (0002_cupy800) 3,283.0430 (28.56) 4,481.9060 (20.78) 3,406.6877 (28.73) 329.8262 (70.26) 3,300.7760 (28.13) 18.5910 (17.15) 20;29 293.5403 (0.03) 226 1
bench_convolve2d_gpu[same-symm-100-256] (0002_cupy800) 3,286.3810 (28.59) 4,452.7300 (20.65) 3,347.9291 (28.24) 237.7220 (50.64) 3,294.7930 (28.08) 10.1937 (9.40) 10;13 298.6921 (0.04) 225 1
bench_convolve2d_gpu[full-fill-100-256] (0001_cupy730) 6,152.5650 (53.52) 12,701.7910 (58.90) 6,492.6414 (54.76) 852.2611 (181.56) 6,168.5815 (52.58) 339.4900 (313.18) 8;22 154.0205 (0.02) 136 1
bench_convolve2d_gpu[full-symm-100-256] (0001_cupy730) 6,183.2290 (53.79) 6,642.7990 (30.80) 6,231.9070 (52.56) 72.5656 (15.46) 6,203.6550 (52.88) 42.3120 (39.03) 13;14 160.4645 (0.02) 118 1
bench_convolve2d_gpu[full-wrap-100-256] (0001_cupy730) 6,185.1640 (53.81) 15,431.1240 (71.56) 6,613.9338 (55.79) 1,087.9448 (231.77) 6,332.8605 (53.98) 448.9280 (414.14) 7;9 151.1959 (0.02) 122 1
bench_convolve2d_gpu[full-fill-100-256] (0002_cupy800) 6,186.1080 (53.82) 6,794.2840 (31.51) 6,288.3675 (53.04) 202.2969 (43.10) 6,197.0730 (52.82) 17.9327 (16.54) 20;21 159.0238 (0.02) 119 1
bench_convolve2d_gpu[full-wrap-100-256] (0002_cupy800) 6,217.8560 (54.09) 6,549.1520 (30.37) 6,239.1438 (52.62) 35.2079 (7.50) 6,232.9380 (53.12) 12.0200 (11.09) 6;8 160.2784 (0.02) 118 1
bench_convolve2d_gpu[full-symm-100-256] (0002_cupy800) 6,224.7550 (54.15) 6,446.2190 (29.89) 6,249.2002 (52.71) 33.0107 (7.03) 6,236.9630 (53.16) 20.6007 (19.00) 11;11 160.0205 (0.02) 117 1
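Reading the median column above, the 8.0 builds look slower by a near-constant ~17 µs on the fast small-kernel cases, while the slow large-kernel cases are within noise; that pattern is what a fixed per-launch cost (rather than a proportional slowdown) would look like. A quick sanity check on the table's own medians:

```python
# Median times (us) copied from the table: (CuPy 7.3.0, CuPy 8.0.0b2).
small = {
    "valid-symm-5-256": (117.33, 134.65),
    "valid-wrap-5-256": (117.33, 134.70),
    "valid-fill-5-256": (121.53, 138.61),
}
large = {
    "full-fill-100-256": (6168.58, 6197.07),
    "full-wrap-100-256": (6332.86, 6232.94),  # actually faster in 8.0
}

# Absolute and relative shift per case: small cases show a steady ~17 us
# increase; large cases shift by well under 1% in either direction.
for name, (old, new) in {**small, **large}.items():
    print(f"{name}: {new - old:+.1f} us ({(new / old - 1) * 100:+.1f}%)")
```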
I’m comparing the following configurations.
CuPy Version : 7.3.0
CUDA Root : /usr/local/cuda
CUDA Build Version : 10010
CUDA Driver Version : 10020
CUDA Runtime Version : 10010
cuBLAS Version : 10201
cuFFT Version : 10101
cuRAND Version : 10101
cuSOLVER Version : (10, 2, 0)
cuSPARSE Version : 10300
NVRTC Version : (10, 1)
cuDNN Build Version : 7605
cuDNN Version : 7600
NCCL Build Version : 2406
NCCL Runtime Version : 2604
CuPy Version : 8.0.0b2
CUDA Root : /usr/local/cuda
CUDA Build Version : 10020
CUDA Driver Version : 10020
CUDA Runtime Version : 10020
cuBLAS Version : 10202
cuFFT Version : 10102
cuRAND Version : 10102
cuSOLVER Version : (10, 3, 0)
cuSPARSE Version : 10301
NVRTC Version : (10, 2)
cuDNN Build Version : None
cuDNN Version : None
NCCL Build Version : None
NCCL Runtime Version : None
Issue Analytics: Created 3 years ago · Reactions: 1 · Comments: 35 (35 by maintainers)
Top GitHub Comments
please clean the temporaries with git clean -f X
I’ll eventually get this Github figured out! Haha