
Tests fail with CUDA 10.1

See original GitHub issue

I am launching the tests on a server with 4 V100 GPUs and CUDA 10.1. The tests fail with the following error: RuntimeError: The NVIDIA driver on your system is too old (found version 10010). Please update your GPU driver [...]. Alternatively, go to the PyTorch website and install a PyTorch version that has been compiled with your version of the CUDA driver. I also recreated the environment from scratch with conda env create -f environment.yml.

Tests successfully complete with CPU.

Any ideas? How can I specify the CUDA version when installing PyTorch from the environment file?
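For reference, a short diagnostic snippet like the following (a generic sketch, not part of the Avalanche repository) shows which CUDA toolkit the installed PyTorch build was compiled against and whether the local driver can actually use it; a mismatch between the two produces exactly the error quoted above.

# Generic diagnostic sketch: compare the CUDA toolkit PyTorch was built
# against with what the local driver can actually run.
import torch

print("PyTorch version:          ", torch.__version__)
print("Built against cudatoolkit:", torch.version.cuda)         # None for CPU-only builds
print("CUDA usable:              ", torch.cuda.is_available())  # False if the driver is too old
if torch.cuda.is_available():
    print("Visible GPUs:             ", torch.cuda.device_count())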

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments:17 (11 by maintainers)

Top GitHub Comments

3 reactions
AndreaCossu commented, Jan 29, 2021

I agree with @AntonioCarta. I specified the PyTorch version in the .yml (e.g. pytorch::pytorch==1.7.1) but the error remains. I think it’s up to the user to specify their CUDA version, with something like:

# CUDA_VERSION is one of: 9.2, 10.1, 10.2, 11.0, cpu
CUDA_VERSION=$1
conda env create -f environment.yml
conda activate avalanche-env
if [ "$CUDA_VERSION" = "cpu" ]; then
    # CPU-only PyTorch build
    conda install pytorch torchvision torchaudio cpuonly -c pytorch
else
    # GPU build matching the CUDA toolkit supported by the local driver
    conda install pytorch torchvision torchaudio cudatoolkit="$CUDA_VERSION" -c pytorch
fi
1 reaction
ggraffieti commented, Jan 29, 2021

Yes, but it depends on the drivers installed for the GPUs. We cannot control or force the installation of the drivers, so the user should provide the CUDA toolkit version that their currently installed driver supports. As an example, one of our servers has NVIDIA driver version 440.100, which supports a cudatoolkit only up to 10.2. If you install the conda environment and try to run some tests, the same message appears and the computation is done on the CPU.

We probably didn’t notice this bug because we have never recreated our environments from scratch, and remote testing is performed on the CPU.
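This would also explain why the test suite quietly runs on the CPU: the usual device-selection pattern (a generic sketch, not the project’s actual test code) falls back to the CPU whenever torch.cuda.is_available() returns False, which is what happens when the driver is older than the cudatoolkit the installed wheel was built for.

# Generic device-selection sketch: with a too-old driver, torch.cuda.is_available()
# returns False and everything silently lands on the CPU despite the GPUs being present.
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = torch.nn.Linear(8, 2).to(device)
print("Running on:", device)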

Read more comments on GitHub >

Top Results From Across the Web

  • Issues - GitHub: The following unit tests fail with CUDA 10.1/CUDNN 7.5: BatchNormTest.PositiveTestCase, BatchNormTest. …
  • CUDA Unit Tests failing on CUDA 10.1 - VTK-m - GitLab: It appears that a large number of unit tests are failing on CUDA 10.1. $ ctest --rerun-failed Total Test time (real) = 12.26…
  • CUDA 12.0 Release Notes - NVIDIA Documentation Center: Tegra: Application binaries built with CUDA 11.8 or older toolkit using the LTO feature may fail when running with CUDA 12.0 compat driver…
  • CUDA 11.4 error - Install, Configure and Update: However, the output of nvcc --version displays CUDA 10.1. … of CUDA-10.1 CRYSPARC_CUDA_PATH and CUDA-11.4 driver with a test workflow like …
  • Make check failing when GPU enabled - GROMACS forums: Hi folks, I've been trying to get a CUDA-enabled gmx to pass make check but it's timing out on a lot of tests…
