
CUDA-aware Ireduce and Iallreduce operations for PyTorch GPU tensors segfault


When calling either Ireduce or Iallreduce on PyTorch GPU tensors, a segfault occurs. I haven’t exhaustively tested all of the ops, but Reduce, Allreduce, Isend / Irecv, and Ibcast work fine when tested the same way. I haven’t tested CuPy arrays, but it might be worthwhile.

It might just be something I’m doing wrong when using these functions, so here is a minimal script that demonstrates the behavior. The errors only appear when running on the GPU:

# mpirun -np 2 python repro.py gpu Ireduce
from mpi4py import MPI
import torch
import sys

if len(sys.argv) < 3:
    print('Usage: python repro.py [cpu|gpu] [MPI function to test]')
    sys.exit(1)

use_gpu = sys.argv[1] == 'gpu'
func_name = sys.argv[2]

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()
if use_gpu:
    device = torch.device('cuda:' + str(rank % torch.cuda.device_count()))
else:
    device = torch.device('cpu')

def test_Iallreduce():
    sendbuf = torch.ones(1, device=device)
    recvbuf = torch.empty_like(sendbuf)
    torch.cuda.synchronize()
    req = comm.Iallreduce(sendbuf, recvbuf, op=MPI.SUM)  # also fails with MPI.MAX
    req.wait()
    assert recvbuf[0] == size

def test_Ireduce():
    buf = torch.ones(1, device=device)
    if rank == 0:
        sendbuf = MPI.IN_PLACE
        recvbuf = buf
    else:
        sendbuf = buf
        recvbuf = None
    torch.cuda.synchronize()
    req = comm.Ireduce(sendbuf, recvbuf, root=0, op=MPI.SUM)  # also fails with MPI.MAX
    req.wait()
    if rank == 0:
        assert buf[0] == size

eval('test_' + func_name + '()')
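
Since the blocking Reduce / Allreduce calls reportedly work on the same GPU buffers, one possible interim workaround (a minimal sketch, not part of the original report) is to run the nonblocking collective on host copies of the tensors and move the result back afterwards:

# Workaround sketch (not from the original report): perform the nonblocking
# reduction on host copies of the GPU tensors, then copy the result back.
# Reuses the `comm` communicator created in repro.py above.
import torch
from mpi4py import MPI

def iallreduce_via_host(comm, gpu_tensor, op=MPI.SUM):
    send_host = gpu_tensor.detach().cpu()          # device -> host copy
    recv_host = torch.empty_like(send_host)
    req = comm.Iallreduce(send_host.numpy(), recv_host.numpy(), op=op)
    req.Wait()                                     # complete the collective
    return recv_host.to(gpu_tensor.device)         # host -> device copy

The extra copies cost bandwidth, so this only makes sense as a stopgap until the nonblocking GPU path works.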

Software/Hardware Versions:

  • OpenMPI 4.1.2, 4.1.1, 4.1.0, and 4.0.7 (built w/ --with-cuda flag)
  • mpi4py 3.1.3 (built against above MPI version)
  • CUDA 11.0
  • Python 3.6 (also tested under 3.8)
  • Nvidia K80 GPU (also tested with V100)
  • OS Ubuntu 18.04 (also tested in containerized environment)
  • torch 1.10.1 (w/ GPU support)

You can reproduce my environment setup with the following commands:

wget https://www.open-mpi.org/software/ompi/v4.1/downloads/openmpi-4.1.2.tar.gz
tar xvf openmpi-4.1.2.tar.gz
cd openmpi-4.1.2
./configure --with-cuda --prefix=/opt/openmpi-4.1.2
sudo make -j4 all install
export PATH=/opt/openmpi-4.1.2/bin:$PATH
export LD_LIBRARY_PATH=/opt/openmpi-4.1.2/lib:$LD_LIBRARY_PATH
env MPICC=/opt/openmpi-4.1.2/bin/mpicc pip install mpi4py
pip install torch numpy
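
Before re-running the repro it may be worth a quick sanity check (not part of the original report) that the Open MPI build mpi4py links against really was compiled with CUDA support, for example by inspecting ompi_info output:

# Hypothetical sanity check: ask ompi_info whether this Open MPI build was
# compiled with CUDA support; the matching line should end in ':value:true'.
import subprocess

info = subprocess.run(['ompi_info', '--parsable', '--all'],
                      stdout=subprocess.PIPE, universal_newlines=True,
                      check=True).stdout
print([line for line in info.splitlines()
       if 'mpi_built_with_cuda_support:value' in line])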

The error message for Ireduce is the following:

[<host>:25864] *** Process received signal ***
[<host>:25864] Signal: Segmentation fault (11)
[<host>:25864] Signal code: Invalid permissions (2)
[<host>:25864] Failing at address: 0x1201220000
[<host>:25864] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x3f040)[0x7f00efcf3040]
[<host>:25864] [ 1] /opt/openmpi-4.1.2/lib/openmpi/mca_op_avx.so(+0xc079)[0x7f00e41c0079]
[<host>:25864] [ 2] /opt/openmpi-4.1.2/lib/openmpi/mca_coll_libnbc.so(+0x7385)[0x7f00d3330385]
[<host>:25864] [ 3] /opt/openmpi-4.1.2/lib/openmpi/mca_coll_libnbc.so(NBC_Progress+0x1f3)[0x7f00d3330033]
[<host>:25864] [ 4] /opt/openmpi-4.1.2/lib/openmpi/mca_coll_libnbc.so(ompi_coll_libnbc_progress+0x8e)[0x7f00d332e84e]
[<host>:25864] [ 5] /opt/openmpi-4.1.2/lib/libopen-pal.so.40(opal_progress+0x2c)[0x7f00edefba3c]
[<host>:25864] [ 6] /opt/openmpi-4.1.2/lib/libopen-pal.so.40(ompi_sync_wait_mt+0xc5)[0x7f00edf025a5]
[<host>:25864] [ 7] /opt/openmpi-4.1.2/lib/libmpi.so.40(ompi_request_default_wait+0x1f9)[0x7f00ee4eafa9]
[<host>:25864] [ 8] /opt/openmpi-4.1.2/lib/libmpi.so.40(PMPI_Wait+0x52)[0x7f00ee532e02]
[<host>:25864] [ 9] /home/ubuntu/venv/lib/python3.6/site-packages/mpi4py/MPI.cpython-36m-x86_64-linux-gnu.so(+0xa81e2)[0x7f00ee8911e2]
[<host>:25864] [10] python[0x50a865]
[<host>:25864] [11] python(_PyEval_EvalFrameDefault+0x444)[0x50c274]
[<host>:25864] [12] python[0x509989]
[<host>:25864] [13] python[0x50a6bd]
[<host>:25864] [14] python(_PyEval_EvalFrameDefault+0x444)[0x50c274]
[<host>:25864] [15] python[0x507f94]
[<host>:25864] [16] python(PyRun_StringFlags+0xaf)[0x63500f]
[<host>:25864] [17] python[0x600911]
[<host>:25864] [18] python[0x50a4ef]
[<host>:25864] [19] python(_PyEval_EvalFrameDefault+0x444)[0x50c274]
[<host>:25864] [20] python[0x507f94]
[<host>:25864] [21] python(PyEval_EvalCode+0x23)[0x50b0d3]
[<host>:25864] [22] python[0x634dc2]
[<host>:25864] [23] python(PyRun_FileExFlags+0x97)[0x634e77]
[<host>:25864] [24] python(PyRun_SimpleFileExFlags+0x17f)[0x63862f]
[<host>:25864] [25] python(Py_Main+0x591)[0x6391d1]
[<host>:25864] [26] python(main+0xe0)[0x4b0d30]
[<host>:25864] [27] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7)[0x7f00efcd5bf7]
[<host>:25864] [28] python(_start+0x2a)[0x5b2a5a]
[<host>:25864] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node <host> exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

I appreciate any guidance!

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 12 (8 by maintainers)

Top GitHub Comments

1 reaction
leofang commented, Jan 7, 2022

Not sure whether it’s a regression; I didn’t test earlier versions. But it would affect any CUDA array implementing either __cuda_array_interface__ or __dlpack__.
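
For context, a minimal illustration (not part of the thread): PyTorch CUDA tensors expose __cuda_array_interface__, and on torch >= 1.10 also __dlpack__, which is how mpi4py ends up handing a raw device pointer to the MPI library.

# Illustration only: a CUDA tensor advertises its device pointer through
# __cuda_array_interface__, so mpi4py passes that pointer straight to MPI.
import torch

t = torch.ones(4, device='cuda')
print(t.__cuda_array_interface__['data'])   # (device_pointer, read_only_flag)
print(hasattr(t, '__dlpack__'))             # True on torch >= 1.10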

0 reactions
leofang commented, Jan 26, 2022

If the buffer protocol is not available, then mpi4py queries for DLPack!

I have no idea what I was thinking 😂

