CUDA 10.2 faster than 11 on older hardware
While testing #1129 on the ADS side on bireli, we found this weird behavior: downgrading CUDA from 11.1 to 10.2 speeds up inference (almost twice as fast). bireli has a GeForce GTX TITAN X (2014). This is the output of time on our quick integrity test on bireli:
CUDA 11.1:
real 0m16.210s user 0m52.484s sys 0m3.251s
CUDA 10.2:
real 0m9.065s user 0m6.337s sys 0m1.478s
The performance is different with newer hardware; e.g. here it is on romane, which has an RTX A6000 (2020):
CUDA 11.1:
real 0m8.021s user 0m12.308s sys 0m2.939s
My guess:
So no, it's not as slow there. I'm guessing this is because bireli's GPUs are older. It seems a lot of people have reported slower inference with CUDA 11 than with 10.2: https://github.com/pytorch/pytorch/issues/47908. CUDA 11.1 ships with cuDNN 8 whereas CUDA 10.2 uses the older cuDNN 7. According to that issue, it could be related to the cuDNN version when torch.backends.cudnn.benchmark == True, which seems to be the case for ivadomed, e.g. here: https://github.com/ivadomed/ivadomed/blob/7b76bf81a025cde3096fd1d686d6f3c0b8ce8f02/ivadomed/main.py#L28
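As a side note, the cuDNN build that PyTorch actually loads can be printed directly, which makes the 7-vs-8 difference easy to confirm on each machine (versions shown in the comments are just examples):

```python
import torch

# Versions bundled with the installed PyTorch build, plus the visible GPU.
print("CUDA:", torch.version.cuda)                # e.g. "10.2" or "11.1"
print("cuDNN:", torch.backends.cudnn.version())   # e.g. 7605 (cuDNN 7) or 8005 (cuDNN 8)
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))  # e.g. "GeForce GTX TITAN X"
```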
However, after setting cudnn.benchmark to False in main.py and testing.py, I wasn't able to observe a meaningful time difference, but maybe I messed something up.
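For completeness, the change I tried amounts to flipping the global cuDNN flag; a minimal sketch of what that looks like (the exact placement in ivadomed's main.py/testing.py may differ):

```python
import torch

# benchmark=True lets cuDNN autotune convolution algorithms on the first
# forward pass; with cuDNN 8 the selected kernels (and the tuning cost)
# can differ from cuDNN 7, especially on older GPUs.
torch.backends.cudnn.benchmark = False  # what I tried: disable autotuning
```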
With this being said, this is low-priority stuff IMO because it shouldn’t have any impact on newer hardware. Just wanted to let everyone know about this.
Hm very curious. The complexity of this issue just keeps getting worse as we dig into this can of worms.
The reason I said the quick benchmark I did wasn't rigorous enough is that if you rely only on time, you need to run the function multiple times and average the durations to get a (somewhat) accurate representation.
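A more robust way to time this than wrapping the whole script in time is to synchronize the GPU and average over several forward passes; here is a minimal sketch (model and batch are placeholders, not actual ivadomed objects):

```python
import time
import torch

def benchmark(model, batch, n_warmup=5, n_runs=20):
    """Average GPU inference time over n_runs, excluding warm-up iterations."""
    model.eval()
    with torch.no_grad():
        for _ in range(n_warmup):    # warm-up triggers cuDNN autotuning, lazy init, etc.
            model(batch)
        torch.cuda.synchronize()     # wait for queued kernels before starting the clock
        start = time.perf_counter()
        for _ in range(n_runs):
            model(batch)
        torch.cuda.synchronize()     # wait for the last forward pass to finish
    return (time.perf_counter() - start) / n_runs
```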
With that being said, I found something unexpected while profiling with CUDA 10 vs 11.
[Profiler screenshot: CUDA 10.2 on bireli]
[Profiler screenshot: CUDA 11 on bireli]
I know it's pretty hard to see anything on these (you can click on the image, and the hover tooltip gives you more info). We were internally aware of a performance difference when using ONNX vs PT models. From what I can see, CUDA 10 uses the PyTorch model whereas CUDA 11 uses the ONNX version.
I think this is probably the main reason why the performance is so volatile.
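I haven't checked exactly where the backend gets chosen, but conceptually the fork we see in the profiles would look something like this (the function, paths, and condition below are hypothetical, purely for illustration):

```python
import torch
import onnxruntime as ort

def load_inference_model(pt_path, onnx_path, use_gpu):
    # Hypothetical selection logic: run the native PyTorch model on GPU,
    # fall back to the exported ONNX model (ONNX Runtime) otherwise.
    if use_gpu and torch.cuda.is_available():
        return torch.load(pt_path), "pt"
    return ort.InferenceSession(onnx_path), "onnx"
```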
IIUC the benchmarks reported in this issue are related to inference… Maybe we would observe some improvement in training time as well, since the newer GPUs reduce training time; at least that is what the benchmarks portray.