CUDA-aware Ireduce and Iallreduce operations for PyTorch GPU tensors segfault
When calling either Ireduce or Iallreduce on PyTorch GPU tensors, a segfault occurs. I haven't exhaustively tested all of the ops, but I don't have problems with Reduce, Allreduce, Isend/Irecv, or Ibcast when tested the same way. I haven't tested CuPy arrays, but it might be worthwhile.

It might just be something I'm doing wrong when using these functions, so here is a minimal script that demonstrates the behavior. The errors only appear when running on GPU:
# mpirun -np 2 python repro.py gpu Ireduce
from mpi4py import MPI
import torch
import sys

if len(sys.argv) < 3:
    print('Usage: python repro.py [cpu|gpu] [MPI function to test]')
    sys.exit(1)

use_gpu = sys.argv[1] == 'gpu'
func_name = sys.argv[2]

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

if use_gpu:
    device = torch.device('cuda:' + str(rank % torch.cuda.device_count()))
else:
    device = torch.device('cpu')

def test_Iallreduce():
    sendbuf = torch.ones(1, device=device)
    recvbuf = torch.empty_like(sendbuf)
    torch.cuda.synchronize()
    req = comm.Iallreduce(sendbuf, recvbuf, op=MPI.SUM)  # also fails with MPI.MAX
    req.wait()
    assert recvbuf[0] == size

def test_Ireduce():
    buf = torch.ones(1, device=device)
    if rank == 0:
        sendbuf = MPI.IN_PLACE
        recvbuf = buf
    else:
        sendbuf = buf
        recvbuf = None
    torch.cuda.synchronize()
    req = comm.Ireduce(sendbuf, recvbuf, root=0, op=MPI.SUM)  # also fails with MPI.MAX
    req.wait()
    if rank == 0:
        assert buf[0] == size

eval('test_' + func_name + '()')
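
For what it's worth, since both the blocking collectives and the CPU path work, I can sidestep the crash by staging through host memory. This is just an untested sketch of that fallback (the helper iallreduce_via_host is my own, not an mpi4py API), not a proposed fix:

# Untested fallback sketch (my own helper, not an mpi4py API): copy the GPU
# tensors to host memory, run the nonblocking allreduce on the host copies,
# and copy the result back to the device once the request completes.
from mpi4py import MPI
import torch

def iallreduce_via_host(comm, send_gpu, recv_gpu, op=MPI.SUM):
    send_host = send_gpu.cpu()                 # device -> host copy
    recv_host = torch.empty_like(send_host)
    req = comm.Iallreduce(send_host, recv_host, op=op)
    req.wait()
    recv_gpu.copy_(recv_host)                  # host -> device copy

It obviously gives up the CUDA-aware zero-copy path and any compute/communication overlap, so it's only a stopgap.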
Software/Hardware Versions:
- Open MPI 4.1.2, 4.1.1, 4.1.0, and 4.0.7 (built with the --with-cuda flag)
- mpi4py 3.1.3 (built against the above MPI versions)
- CUDA 11.0
- Python 3.6 (also tested under 3.8)
- NVIDIA K80 GPU (also tested with a V100)
- OS: Ubuntu 18.04 (also tested in a containerized environment)
- torch 1.10.1 (with GPU support)
You can reproduce my environment setup with the following commands:
wget https://www.open-mpi.org/software/ompi/v4.1/downloads/openmpi-4.1.2.tar.gz
tar xvf openmpi-4.1.2.tar.gz
cd openmpi-4.1.2
./configure --with-cuda --prefix=/opt/openmpi-4.1.2
sudo make -j4 all install
export PATH=/opt/openmpi-4.1.2/bin:$PATH
export LD_LIBRARY_PATH=/opt/openmpi-4.1.2/lib:$LD_LIBRARY_PATH
env MPICC=/opt/openmpi-4.1.2/bin/mpicc pip install mpi4py
pip install torch numpy
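
After installing, a quick sanity check confirms that mpi4py is linked against the intended build and that CUDA support was compiled in (mpi_built_with_cuda_support is Open MPI's own indicator for this):

# Sanity check: which MPI library is mpi4py linked against, and was that
# Open MPI build compiled with CUDA support?
import subprocess
from mpi4py import MPI

print(MPI.Get_library_version())   # reports "Open MPI v4.1.2" here

info = subprocess.check_output(['ompi_info', '--parsable', '--all']).decode()
print([line for line in info.splitlines()
       if 'mpi_built_with_cuda_support:value' in line])   # expect ...:value:true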
The error message for Ireduce is the following:
[<host>:25864] *** Process received signal ***
[<host>:25864] Signal: Segmentation fault (11)
[<host>:25864] Signal code: Invalid permissions (2)
[<host>:25864] Failing at address: 0x1201220000
[<host>:25864] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x3f040)[0x7f00efcf3040]
[<host>:25864] [ 1] /opt/openmpi-4.1.2/lib/openmpi/mca_op_avx.so(+0xc079)[0x7f00e41c0079]
[<host>:25864] [ 2] /opt/openmpi-4.1.2/lib/openmpi/mca_coll_libnbc.so(+0x7385)[0x7f00d3330385]
[<host>:25864] [ 3] /opt/openmpi-4.1.2/lib/openmpi/mca_coll_libnbc.so(NBC_Progress+0x1f3)[0x7f00d3330033]
[<host>:25864] [ 4] /opt/openmpi-4.1.2/lib/openmpi/mca_coll_libnbc.so(ompi_coll_libnbc_progress+0x8e)[0x7f00d332e84e]
[<host>:25864] [ 5] /opt/openmpi-4.1.2/lib/libopen-pal.so.40(opal_progress+0x2c)[0x7f00edefba3c]
[<host>:25864] [ 6] /opt/openmpi-4.1.2/lib/libopen-pal.so.40(ompi_sync_wait_mt+0xc5)[0x7f00edf025a5]
[<host>:25864] [ 7] /opt/openmpi-4.1.2/lib/libmpi.so.40(ompi_request_default_wait+0x1f9)[0x7f00ee4eafa9]
[<host>:25864] [ 8] /opt/openmpi-4.1.2/lib/libmpi.so.40(PMPI_Wait+0x52)[0x7f00ee532e02]
[<host>:25864] [ 9] /home/ubuntu/venv/lib/python3.6/site-packages/mpi4py/MPI.cpython-36m-x86_64-linux-gnu.so(+0xa81e2)[0x7f00ee8911e2]
[<host>:25864] [10] python[0x50a865]
[<host>:25864] [11] python(_PyEval_EvalFrameDefault+0x444)[0x50c274]
[<host>:25864] [12] python[0x509989]
[<host>:25864] [13] python[0x50a6bd]
[<host>:25864] [14] python(_PyEval_EvalFrameDefault+0x444)[0x50c274]
[<host>:25864] [15] python[0x507f94]
[<host>:25864] [16] python(PyRun_StringFlags+0xaf)[0x63500f]
[<host>:25864] [17] python[0x600911]
[<host>:25864] [18] python[0x50a4ef]
[<host>:25864] [19] python(_PyEval_EvalFrameDefault+0x444)[0x50c274]
[<host>:25864] [20] python[0x507f94]
[<host>:25864] [21] python(PyEval_EvalCode+0x23)[0x50b0d3]
[<host>:25864] [22] python[0x634dc2]
[<host>:25864] [23] python(PyRun_FileExFlags+0x97)[0x634e77]
[<host>:25864] [24] python(PyRun_SimpleFileExFlags+0x17f)[0x63862f]
[<host>:25864] [25] python(Py_Main+0x591)[0x6391d1]
[<host>:25864] [26] python(main+0xe0)[0x4b0d30]
[<host>:25864] [27] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7)[0x7f00efcd5bf7]
[<host>:25864] [28] python(_start+0x2a)[0x5b2a5a]
[<host>:25864] *** End of error message ***
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node <host> exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
I appreciate any guidance!
Not sure if it’s a regression or not, I didn’t test earlier versions. But it would affect any CUDA arrays implementing either __cuda_array_interface__ or __dlpack__.

I have no idea what I was thinking 😂
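
To make that concrete: mpi4py 3.1 recognizes device buffers through the CUDA Array Interface (and DLPack), so any array object advertising __cuda_array_interface__ hands its raw device pointer to the MPI library and would hit the same code path. A small illustration (the CuPy lines are left commented since they weren't tested here):

# Both PyTorch CUDA tensors and CuPy arrays expose __cuda_array_interface__;
# its 'data' field holds the raw device pointer that mpi4py passes to MPI.
import torch

t = torch.ones(1, device='cuda')
print(hasattr(t, '__cuda_array_interface__'))   # True
print(t.__cuda_array_interface__['data'])       # (device pointer, read-only flag)

# CuPy advertises the same interface (not tested here):
# import cupy
# a = cupy.ones(1)
# print(a.__cuda_array_interface__['data'])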