
Tests fail with CUDA 10.1

See original GitHub issue

I am launching the tests on a server with 4 V100 GPUs and CUDA 10.1. The tests fail with the following error: RuntimeError: The NVIDIA driver on your system is too old (found version 10010). Please update your GPU driver [...]. Alternatively, go to the PyTorch website and install a PyTorch version that has been compiled with your version of the CUDA driver. I also recreated the environment from scratch with conda env create -f environment.yml.

Tests successfully complete with CPU.

Any ideas? How can I specify the CUDA version when installing PyTorch from the environment file?
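For reference, a short diagnostic snippet like the following (a generic sketch, not part of the Avalanche repository) shows which CUDA toolkit the installed PyTorch build was compiled against and whether the local driver can actually use it; a mismatch between the two produces exactly the error quoted above.

# Generic diagnostic sketch: compare the CUDA toolkit PyTorch was built
# against with what the local driver can actually run.
import torch

print("PyTorch version:          ", torch.__version__)
print("Built against cudatoolkit:", torch.version.cuda)         # None for CPU-only builds
print("CUDA usable:              ", torch.cuda.is_available())  # False if the driver is too old
if torch.cuda.is_available():
    print("Visible GPUs:             ", torch.cuda.device_count())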

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments:17 (11 by maintainers)

Top GitHub Comments

3 reactions
AndreaCossu commented, Jan 29, 2021

I agree with @AntonioCarta. I specified the PyTorch version in the .yml (e.g. pytorch::pytorch==1.7.1) but the error remains. I think it’s up to the user to specify their CUDA version, with something like:

# CUDA_VERSION is one of: 9.2, 10.1, 10.2, 11.0, cpu
CUDA_VERSION=$1
conda env create -f environment.yml
conda activate avalanche-env
if [ "$CUDA_VERSION" = "cpu" ]; then
    # CPU-only PyTorch build
    conda install pytorch torchvision torchaudio cpuonly -c pytorch
else
    # GPU build matching the CUDA toolkit supported by the local driver
    conda install pytorch torchvision torchaudio cudatoolkit="$CUDA_VERSION" -c pytorch
fi
1 reaction
ggraffieti commented, Jan 29, 2021

Yes, but it depends on the drivers installed for the GPUs. We cannot control or force the installation of the drivers, so the user should provide the CUDA toolkit version that their currently installed driver supports. As an example, one of our servers has NVIDIA driver version 440.100, which supports a cudatoolkit only up to 10.2. If you install the conda environment and try to run some tests, the same message appears and the computation is done on the CPU.

We probably didn’t notice this bug because we have never recreated our environments from scratch, and remote testing is performed on the CPU.
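This would also explain why the test suite quietly runs on the CPU: the usual device-selection pattern (a generic sketch, not the project’s actual test code) falls back to the CPU whenever torch.cuda.is_available() returns False, which is what happens when the driver is older than the cudatoolkit the installed wheel was built for.

# Generic device-selection sketch: with a too-old driver, torch.cuda.is_available()
# returns False and everything silently lands on the CPU despite the GPUs being present.
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = torch.nn.Linear(8, 2).to(device)
print("Running on:", device)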

Read more comments on GitHub >

Top Results From Across the Web

  • Issues - GitHub: The following unit tests fail with CUDA 10.1/CUDNN 7.5: BatchNormTest.PositiveTestCase, BatchNormTest. …
  • CUDA Unit Tests failing on CUDA 10.1 - VTK-m - GitLab: It appears that a large number of unit tests are failing on CUDA 10.1. $ ctest --rerun-failed Total Test time (real) = 12.26…
  • CUDA 12.0 Release Notes - NVIDIA Documentation Center: Tegra: Application binaries built with CUDA 11.8 or older toolkit using the LTO feature may fail when running with CUDA 12.0 compat driver…
  • CUDA 11.4 error - Install, Configure and Update: However, the output of nvcc --version displays CUDA 10.1. … of CUDA-10.1 CRYSPARC_CUDA_PATH and CUDA-11.4 driver with a test workflow like …
  • Make check failing when GPU enabled - GROMACS forums: Hi folks, I've been trying to get a CUDA-enabled gmx to pass make check but it's timing out on a lot of tests…
