CUDA 10.2 faster than 11 on older hardware
While testing #1129 on the ADS side on bireli, we found this weird behavior: downgrading CUDA from 11.1 to 10.2 speeds up inference (almost twice as fast). bireli has a GeForce GTX TITAN X (2014). This is the output of time on our quick integrity test on bireli:
CUDA 11.1:
real 0m16.210s user 0m52.484s sys 0m3.251s
CUDA 10.2:
real 0m9.065s user 0m6.337s sys 0m1.478s
The performance is different with newer hardware; e.g. here it is on romane, which has an RTX A6000 (2020):
CUDA 11.1:
real 0m8.021s user 0m12.308s sys 0m2.939s
My guess:
So no, it's not as slow there. I'm guessing this is because bireli's GPUs are older. It seems a lot of people have reported slower inference with CUDA 11 than with 10.2: https://github.com/pytorch/pytorch/issues/47908. CUDA 11.1 ships with cuDNN 8 whereas CUDA 10.2 uses the older cuDNN 7. According to that issue, it could be related to the cuDNN version when torch.backends.cudnn.benchmark == True, which seems to be the case for ivadomed, e.g. here: https://github.com/ivadomed/ivadomed/blob/7b76bf81a025cde3096fd1d686d6f3c0b8ce8f02/ivadomed/main.py#L28
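As a side note, the cuDNN build that PyTorch actually loads can be printed directly, which makes the 7-vs-8 difference easy to confirm on each machine (versions shown in the comments are just examples):

```python
import torch

# Versions bundled with the installed PyTorch build, plus the visible GPU.
print("CUDA:", torch.version.cuda)                # e.g. "10.2" or "11.1"
print("cuDNN:", torch.backends.cudnn.version())   # e.g. 7605 (cuDNN 7) or 8005 (cuDNN 8)
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))  # e.g. "GeForce GTX TITAN X"
```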
However, after setting cudnn.benchmark to False in main.py and testing.py, I wasn't able to observe a meaningful time difference, but maybe I messed something up.
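For completeness, the change I tried amounts to flipping the global cuDNN flag; a minimal sketch of what that looks like (the exact placement in ivadomed's main.py/testing.py may differ):

```python
import torch

# benchmark=True lets cuDNN autotune convolution algorithms on the first
# forward pass; with cuDNN 8 the selected kernels (and the tuning cost)
# can differ from cuDNN 7, especially on older GPUs.
torch.backends.cudnn.benchmark = False  # what I tried: disable autotuning
```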
With this being said, this is low-priority stuff IMO because it shouldn’t have any impact on newer hardware. Just wanted to let everyone know about this.
Hm very curious. The complexity of this issue just keeps getting worse as we dig into this can of worms.
The reason I said the quick benchmark I did wasn't rigorous enough is that if you rely only on time, you need to run the function multiple times and average the durations to get a (somewhat) accurate representation.
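A more robust way to time this than wrapping the whole script in time is to synchronize the GPU and average over several forward passes; here is a minimal sketch (model and batch are placeholders, not actual ivadomed objects):

```python
import time
import torch

def benchmark(model, batch, n_warmup=5, n_runs=20):
    """Average GPU inference time over n_runs, excluding warm-up iterations."""
    model.eval()
    with torch.no_grad():
        for _ in range(n_warmup):    # warm-up triggers cuDNN autotuning, lazy init, etc.
            model(batch)
        torch.cuda.synchronize()     # wait for queued kernels before starting the clock
        start = time.perf_counter()
        for _ in range(n_runs):
            model(batch)
        torch.cuda.synchronize()     # wait for the last forward pass to finish
    return (time.perf_counter() - start) / n_runs
```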
With that being said, I found something unexpected while profiling with CUDA 10 vs 11.
[Profiler screenshot: CUDA 10.2 on bireli]
[Profiler screenshot: CUDA 11 on bireli]
I know it's pretty hard to see anything on these (you can click on the image, and the hover tooltip gives you more info). We were internally aware of a performance difference when using ONNX vs PT models. From what I can see, CUDA 10 uses the PyTorch model whereas CUDA 11 uses the ONNX version.
I think this is probably the main reason why the performance is so volatile.
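I haven't checked exactly where the backend gets chosen, but conceptually the fork we see in the profiles would look something like this (the function, paths, and condition below are hypothetical, purely for illustration):

```python
import torch
import onnxruntime as ort

def load_inference_model(pt_path, onnx_path, use_gpu):
    # Hypothetical selection logic: run the native PyTorch model on GPU,
    # fall back to the exported ONNX model (ONNX Runtime) otherwise.
    if use_gpu and torch.cuda.is_available():
        return torch.load(pt_path), "pt"
    return ort.InferenceSession(onnx_path), "onnx"
```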
IIUC the benchmarks reported in this issue are related to inference… Maybe we would observe some improvement in training time as well, since the newer GPUs reduce training time; at least that is what the benchmarks portray.