`torch.cuda.current_device()` is changed by CuPy after 10.0

Description

Recently, @Yanqi-Chen reported a bug when training a PyTorch module accelerated by CuPy with Distributed Data Parallel (DDP).

In DDP training, each process uses torch.cuda.current_device() as its default device. He found that CuPy changes the value returned by torch.cuda.current_device(). For example, when training with 4 GPUs, torch.cuda.current_device() should return 0, 1, 2, and 3 in the four processes respectively. After CuPy is used, torch.cuda.current_device() returns 0 in every process.
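
A per-process sanity check can make this symptom visible early. The sketch below is a hypothetical helper, not part of PyTorch or CuPy. It assumes the launcher exports a LOCAL_RANK environment variable (as torchrun does) and that each process is meant to use the GPU whose index equals its local rank. The device getter is passed in as a callable, so torch.cuda.current_device can be supplied in a real run.

```python
import os
from typing import Callable, Mapping, Optional


def expected_local_device(env: Mapping[str, str]) -> Optional[int]:
    """Return the GPU index this process is expected to use, or None if unknown.

    Assumption: the launcher sets LOCAL_RANK and local rank i is pinned to GPU i.
    """
    raw = env.get("LOCAL_RANK")
    return int(raw) if raw is not None else None


def device_matches_rank(get_device: Callable[[], int],
                        env: Mapping[str, str] = os.environ) -> bool:
    """True if the current device equals the expected one (or none is expected)."""
    expected = expected_local_device(env)
    return expected is None or get_device() == expected
```

In a DDP worker, calling device_matches_rank(torch.cuda.current_device) before and after the CuPy-accelerated module would be expected to return True and then False on an affected setup, for every process whose local rank is not 0.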

To Reproduce

I ran the following code to reproduce the problem:

import torch
import cupy
kernel_code = r'''
extern "C" __global__
        void relu(const float* x, float *y, const int &N)
        {
            const int index = blockIdx.x * blockDim.x + threadIdx.x;
            if (index < N)
            {
                y[index] = (float) (x[index] >= 0.0f);
            }
        }
'''

def relu(x: torch.Tensor):
    device_id = x.get_device()
    torch.cuda.set_device(device_id)
    print('1:', torch.cuda.current_device())
    y = torch.zeros_like(x)
    assert device_id >= 0

    with cupy.cuda.Device(device_id):

        kernel = cupy.RawKernel(kernel_code, 'relu')
        threads = 1024
        N = x.numel()

        blocks = (N + threads - 1) // threads
        x = x.contiguous()
        y = y.contiguous()
        N = cupy.asarray(N)
        kernel((blocks,), (threads,), (x.data_ptr(), y.data_ptr(), N))
    print('2:', torch.cuda.current_device())
    return y
device = 'cuda:1'
x = torch.rand([8], device=device) - 0.5
y = relu(x)
print(f'x={x}')
print(f'y={y}')

On machine A, I get the following output:

(pytorch-env) wfang@ubuntu:~/temp_dir$ python test.py 
1: 1
2: 0
x=tensor([-0.1473, -0.3093, -0.0547, -0.1389,  0.4446, -0.3286, -0.4435,  0.0105],
       device='cuda:1')
y=tensor([0., 0., 0., 0., 1., 0., 0., 1.], device='cuda:1')

The default device changes from 1 to 0 after the with cupy.cuda.Device(device_id) block.

However, on machine B, I get the following output:

(pytorch-env) wfang@Precision-5820-Tower-X-Series:~/tempdir$ python test.py 
1: 1
2: 1
x=tensor([ 0.0060,  0.0141,  0.4118, -0.4813,  0.4609, -0.3557,  0.3739, -0.3464],
       device='cuda:1')
y=tensor([1., 1., 1., 0., 1., 0., 1., 0.], device='cuda:1')

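On machine B, the default device is not changed. Note that machine A has cupy-cuda113 10.2.0 installed, while machine B has cupy-cuda111 9.4.0 (see Environment).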

Installation

Source (pip install cupy)

Environment

On machine A:

(pytorch-env) wfang@ubuntu:~/temp_dir$ conda list torch
# packages in environment at /home/wfang/anaconda3/envs/pytorch-env:
#
# Name                    Version                   Build  Channel
pytorch                   1.10.1          py3.9_cuda11.3_cudnn8.2.0_0    pytorch
pytorch-mutex             1.0                        cuda    pytorch
torchaudio                0.10.1               py39_cu113    pytorch
torchvision               0.11.2               py39_cu113    pytorch

(pytorch-env) wfang@ubuntu:~/temp_dir$ conda list cu
# packages in environment at /home/wfang/anaconda3/envs/pytorch-env:
#
# Name                    Version                   Build  Channel
cudatoolkit               11.3.1               h2bc3f7f_2    defaults
cupy-cuda113              10.2.0                   pypi_0    pypi
ncurses                   6.3                  h7f8727e_2    defaults

(pytorch-env) wfang@ubuntu:~/temp_dir$ nvidia-smi
Sun Mar 20 19:41:32 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.74       Driver Version: 470.74       CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  On   | 00000000:18:00.0 Off |                    0 |
| N/A   56C    P0   241W / 400W |  13828MiB / 81251MiB |     81%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM...  On   | 00000000:3B:00.0 Off |                    0 |
| N/A   61C    P0   218W / 400W |  13830MiB / 81251MiB |     97%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM...  On   | 00000000:86:00.0 Off |                   98 |
| N/A   56C    P0   222W / 400W |  13828MiB / 81251MiB |     96%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-SXM...  On   | 00000000:AF:00.0 Off |                    0 |
| N/A   57C    P0   215W / 400W |  13828MiB / 81251MiB |     88%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     20830      C   ...vs/pytorch-env/bin/python    13817MiB |
|    1   N/A  N/A     20831      C   ...vs/pytorch-env/bin/python    13819MiB |
|    2   N/A  N/A     20832      C   ...vs/pytorch-env/bin/python    13817MiB |
|    3   N/A  N/A     20833      C   ...vs/pytorch-env/bin/python    13817MiB |
+-----------------------------------------------------------------------------+

(pytorch-env) wfang@ubuntu:~/temp_dir$ gpustat 
ubuntu                   Sun Mar 20 19:44:07 2022  470.74
[0] NVIDIA A100-SXM-80GB | 58'C, 100 % | 13828 / 81251 MB | wfang(13817M)
[1] NVIDIA A100-SXM-80GB | 63'C,  81 % | 13830 / 81251 MB | wfang(13819M)
[2] NVIDIA A100-SXM-80GB | 57'C,  91 % | 13828 / 81251 MB | wfang(13817M)
[3] NVIDIA A100-SXM-80GB | 59'C,  81 % | 13828 / 81251 MB | wfang(13817M)

(pytorch-env) wfang@ubuntu:/usr/local/cuda/bin$ ./nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Mon_May__3_19:15:13_PDT_2021
Cuda compilation tools, release 11.3, V11.3.109
Build cuda_11.3.r11.3/compiler.29920130_0

On machine B:

(pytorch-env) wfang@Precision-5820-Tower-X-Series:~/tempdir$ conda list torch
# packages in environment at /home/wfang/anaconda3/envs/pytorch-env:
#
# Name                    Version                   Build  Channel
pytorch                   1.10.1          py3.9_cuda11.3_cudnn8.2.0_0    pytorch
pytorch-mutex             1.0                        cuda    pytorch
torch-tb-profiler         0.3.1                    pypi_0    pypi
torchaudio                0.10.1               py39_cu113    pytorch
torchvision               0.11.2               py39_cu113    pytorch

(pytorch-env) wfang@Precision-5820-Tower-X-Series:~/tempdir$ conda list cu
# packages in environment at /home/wfang/anaconda3/envs/pytorch-env:
#
# Name                    Version                   Build  Channel
cudatoolkit               11.3.1               h2bc3f7f_2    defaults
cupy-cuda111              9.4.0                    pypi_0    pypi
ncurses                   6.2                  h58526e2_4    conda-forge

(pytorch-env) wfang@Precision-5820-Tower-X-Series:~/tempdir$ nvidia-smi
Sun Mar 20 19:42:23 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.19.01    Driver Version: 465.19.01    CUDA Version: 11.3     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:17:00.0 Off |                  N/A |
| 19%   36C    P8    16W / 250W |   1448MiB / 11011MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  On   | 00000000:B3:00.0 Off |                  N/A |
| 18%   36C    P8    21W / 250W |      3MiB / 11019MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A   2335870      C   python                           1445MiB |
+-----------------------------------------------------------------------------+

(pytorch-env) wfang@Precision-5820-Tower-X-Series:~/tempdir$ gpustat 
Precision-5820-Tower-X-Series  Sun Mar 20 19:44:46 2022  465.19.01
[0] NVIDIA GeForce RTX 2080 Ti | 36'C,   0 % |  1448 / 11011 MB | wfang(1445M)
[1] NVIDIA GeForce RTX 2080 Ti | 36'C,   0 % |     3 / 11019 MB |

(pytorch-env) wfang@Precision-5820-Tower-X-Series:/usr/local/cuda/bin$ ./nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Sun_Mar_21_19:15:46_PDT_2021
Cuda compilation tools, release 11.3, V11.3.58
Build cuda_11.3.r11.3/compiler.29745058_0

Additional Information

No response

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 6 (2 by maintainers)

Top GitHub Comments

kmaehashi commented, Mar 22, 2022 (3 reactions)

Remove this line and it should work:

with cupy.cuda.Device(device_id):

The “current device” is semantics provided by CUDA and not by each library. torch.cuda.set_device() will change the current device of the current thread, so it will take effect on CuPy as well. Mixing multiple libraries to switch the current device may cause unexpected behavior.
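
If the CuPy device context cannot simply be removed (for example, when it lives in third-party code), one defensive option is to save the current device before the call and put it back afterwards using a single library. The following is a minimal sketch of such a guard. preserve_current_device is a hypothetical helper, not an API of PyTorch or CuPy, and the getter and setter are injected so that torch.cuda.current_device and torch.cuda.set_device can be passed in.

```python
from contextlib import contextmanager
from typing import Callable, Iterator


@contextmanager
def preserve_current_device(get_device: Callable[[], int],
                            set_device: Callable[[int], None]) -> Iterator[int]:
    """Restore the current device on exit if the wrapped code changed it.

    Hypothetical helper: get_device/set_device are supplied by the caller,
    e.g. torch.cuda.current_device and torch.cuda.set_device.
    """
    saved = get_device()  # device that was current when the block was entered
    try:
        yield saved
    finally:
        # Only touch the device if something inside the block switched it.
        if get_device() != saved:
            set_device(saved)
```

Wrapping the kernel launch in with preserve_current_device(torch.cuda.current_device, torch.cuda.set_device): keeps the restore step on the PyTorch side, in line with the advice above not to mix libraries when switching the current device. Removing the cupy.cuda.Device block remains the simpler fix when that is possible.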

fangwei123456 commented, Mar 22, 2022 (1 reaction)

OK, thanks!
