Gatherv problems: conversion error and receive buffer size segfaults

See original GitHub issue
    if mpicomm.rank == 0:
        F = np.empty(sum(sendcounts), dtype=float)
    else:
        F = None

    mpicomm.comm.Gatherv(sendbuf=local_F, recvbuf=(F, sendcounts), root=0)

The data exchanged by a single process (in the 128-processor case) is roughly 34 MB.
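
As a side note not in the original report: MPI element counts are C ints, so what matters for the errors below is the total element count on the root rather than the per-process volume. A minimal pre-flight check, sketched here using the sendcounts variable from the snippet above:

    # Sketch: MPI element counts are C ints, so the total number of elements
    # the root receives must fit in a signed 32-bit integer.
    INT_MAX = 2**31 - 1  # about 2.1 billion elements

    total_count = sum(sendcounts)
    if total_count > INT_MAX:
        raise ValueError(
            f"Gatherv receive count {total_count} exceeds the 32-bit MPI limit; "
            "split the gather into several calls or use a large-count workaround"
        )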

  1. Gatherv problem in the 1-processor case:
OverflowError: value too large to convert to int
Traceback (most recent call last):
  File "/data/backup/ARR1905/alsalihi/venv_AARR1905/Test_spacepartitioning/Case20/SMARTA/O2-V2-Huge/SMARTA.py", line 120, in <module>
    F = view_factors(mpicomm, universe)
  File "/data/backup/ARR1905/alsalihi/venv_AARR1905/Test_spacepartitioning/Case20/SMARTA/O2-V2-Huge/rarfunc.py", line 84, in view_factors
    mpicomm.comm.Gatherv(sendbuf=local_F, recvbuf=(F, sendcounts), root=0)
  File "mpi4py/MPI/Comm.pyx", line 601, in mpi4py.MPI.Comm.Gatherv
  File "mpi4py/MPI/msgbuffer.pxi", line 506, in mpi4py.MPI._p_msg_cco.for_gather
  File "mpi4py/MPI/msgbuffer.pxi", line 456, in mpi4py.MPI._p_msg_cco.for_cco_recv
  File "mpi4py/MPI/msgbuffer.pxi", line 300, in mpi4py.MPI.message_vector
  File "mpi4py/MPI/asarray.pxi", line 22, in mpi4py.MPI.chkarray
  File "mpi4py/MPI/asarray.pxi", line 15, in mpi4py.MPI.getarray
OverflowError: value too large to convert to int
  2. Similar problem in parallel:
[node13.fk.private.vki.eu:20373] Read -1, expected 270734080, errno = 14
[node13:20373:0:20373] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x7f4771e93820)
==== backtrace ====
    0  /lib64/libucs.so.0(+0x1b25f) [0x7f53d7ac025f]
    1  /lib64/libucs.so.0(+0x1b42a) [0x7f53d7ac042a]
    2  /lib64/libc.so.6(+0x15f396) [0x7f53fda7e396]
    3  /software/alternate/fk/openmpi/4.0.2/lib/libopen-pal.so.40(opal_convertor_unpack+0x85) [0x7f53dda64895]
    4  /software/alternate/fk/openmpi/4.0.2/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_recv_request_progress_frag+0x1a7) [0x7f53d7fe4ed7]
    5  /software/alternate/fk/openmpi/4.0.2/lib/openmpi/mca_btl_vader.so(mca_btl_vader_poll_handle_frag+0x7f) [0x7f53d7ff6a7f]
    6  /software/alternate/fk/openmpi/4.0.2/lib/openmpi/mca_btl_vader.so(+0x4d77) [0x7f53d7ff6d77]
    7  /software/alternate/fk/openmpi/4.0.2/lib/libopen-pal.so.40(opal_progress+0x2c) [0x7f53dda53f3c]
    8  /software/alternate/fk/openmpi/4.0.2/lib/libopen-pal.so.40(ompi_sync_wait_mt+0xb5) [0x7f53dda5a585]
    9  /software/alternate/fk/openmpi/4.0.2/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_recv+0x803) [0x7f53d7fd7ba3]
   10  /software/alternate/fk/openmpi/4.0.2/lib/openmpi/mca_coll_basic.so(mca_coll_basic_gatherv_intra+0x18a) [0x7f53d7fc383a]
   11  /software/alternate/fk/openmpi/4.0.2/lib/libmpi.so.40(MPI_Gatherv+0xf0) [0x7f53ddc54140]
   12  /data/backup/ARR1905/alsalihi/venv_AARR1905/lib64/python3.7/site-packages/mpi4py/MPI.cpython-37m-x86_64-linux-gnu.so(+0x136939) [0x7f53dde38939]
   13  /lib64/libpython3.7m.so.1.0(_PyMethodDef_RawFastCallKeywords+0x334) [0x7f53fd6e0154]
   14  /lib64/libpython3.7m.so.1.0(_PyCFunction_FastCallKeywords+0x23) [0x7f53fd6e01b3]
   15  /lib64/libpython3.7m.so.1.0(+0x140473) [0x7f53fd712473]
   16  /lib64/libpython3.7m.so.1.0(_PyEval_EvalFrameDefault+0x192e) [0x7f53fd74913e]
   17  /lib64/libpython3.7m.so.1.0(_PyEval_EvalCodeWithName+0x2f0) [0x7f53fd6ff7e0]
   18  /lib64/libpython3.7m.so.1.0(_PyFunction_FastCallKeywords+0x2a2) [0x7f53fd700822]
   19  /lib64/libpython3.7m.so.1.0(+0x14035f) [0x7f53fd71235f]
   20  /lib64/libpython3.7m.so.1.0(_PyEval_EvalFrameDefault+0xb5a) [0x7f53fd74836a]
   21  /lib64/libpython3.7m.so.1.0(_PyEval_EvalCodeWithName+0x2f0) [0x7f53fd6ff7e0]
   22  /lib64/libpython3.7m.so.1.0(PyEval_EvalCodeEx+0x39) [0x7f53fd700579]
   23  /lib64/libpython3.7m.so.1.0(PyEval_EvalCode+0x1b) [0x7f53fd78fccb]
   24  /lib64/libpython3.7m.so.1.0(+0x1ffc63) [0x7f53fd7d1c63]
   25  /lib64/libpython3.7m.so.1.0(PyRun_FileExFlags+0x97) [0x7f53fd7d21d7]
   26  /lib64/libpython3.7m.so.1.0(PyRun_SimpleFileExFlags+0x19a) [0x7f53fd7d893a]
   27  /lib64/libpython3.7m.so.1.0(+0x208701) [0x7f53fd7da701]
   28  /lib64/libpython3.7m.so.1.0(_Py_UnixMain+0x3c) [0x7f53fd7da8ac]
   29  /lib64/libc.so.6(__libc_start_main+0xf3) [0x7f53fd942f43]
   30  python3(_start+0x2e) [0x557f8aedd08e]
===================

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 11 (5 by maintainers)

Top GitHub Comments

1 reaction
dalcinl commented, Dec 15, 2020

Did you read my previous comment? For N=60,000, you have a total of 3.6G (G = 1 billion) elements, and that’s above the MPI 32-bit limit of around 2.1G elements. You cannot communicate such a large array with a single communication call; you have to chunk it somehow. Again, this is not mpi4py’s fault or laziness, it is a limitation of MPI that has not yet been officially addressed by the standard.
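
To make the arithmetic concrete: 60,000 × 60,000 = 3.6 billion elements, while a signed 32-bit int tops out at 2**31 - 1, about 2.1 billion. Below is a minimal sketch of one way to chunk the gather, assuming every rank contributes a flat 1-D NumPy array of the same dtype; the function name, the max_elems parameter, and the chunking scheme are illustrative and not taken from the original code:

    import numpy as np

    def gatherv_in_chunks(comm, local_flat, root=0, max_elems=100_000_000):
        """Gather 1-D arrays whose combined size may exceed 2**31 - 1 elements
        by issuing several smaller Gatherv calls.

        max_elems is the per-rank element count sent per call; choose it so
        that nranks * max_elems stays below 2**31 - 1."""
        rank = comm.Get_rank()
        all_counts = comm.allgather(local_flat.size)         # total elements per rank
        offsets = [0]
        for c in all_counts:
            offsets.append(offsets[-1] + c)                  # global start of each rank's data
        result = np.empty(offsets[-1], dtype=local_flat.dtype) if rank == root else None

        n_chunks = max((c + max_elems - 1) // max_elems for c in all_counts)
        for i in range(n_chunks):                            # every rank loops the same number of times
            lo = min(i * max_elems, local_flat.size)
            hi = min((i + 1) * max_elems, local_flat.size)
            piece = np.ascontiguousarray(local_flat[lo:hi])  # may be empty on some ranks
            counts = comm.allgather(piece.size)
            recv = np.empty(sum(counts), dtype=local_flat.dtype) if rank == root else None
            comm.Gatherv(sendbuf=piece,
                         recvbuf=(recv, counts) if rank == root else None,
                         root=root)
            if rank == root:
                pos = 0
                for r, c in enumerate(counts):               # copy each rank's piece to its global slot
                    start = offsets[r] + i * max_elems
                    result[start:start + c] = recv[pos:pos + c]
                    pos += c
        return result

Because each rank's piece i is written back at that rank's global offset, the array assembled on the root has the same layout a single Gatherv would have produced.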

Another, perhaps easier, way to implement chunking is to use user-defined datatypes. Look at @jeffhammond’s BigMPI; all of the ideas and tricks in there are easy to implement with mpi4py.
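
For illustration, here is a BigMPI-style sketch in mpi4py: the buffers are described in units of a contiguous derived datatype, so the counts and displacements handed to MPI stay far below the 32-bit limit. The function name and the BLOCK size are made up for this example, and it assumes every entry of sendcounts is an exact multiple of BLOCK (BigMPI itself also deals with the remainder):

    from mpi4py import MPI
    import numpy as np

    BLOCK = 1 << 20  # 1,048,576 float64 elements per derived-datatype unit

    def gatherv_with_big_type(comm, local_flat, sendcounts, root=0):
        """Gatherv more than 2**31 - 1 float64 elements by counting in
        BLOCK-sized units of a contiguous derived datatype."""
        assert all(c % BLOCK == 0 for c in sendcounts)
        big = MPI.DOUBLE.Create_contiguous(BLOCK)            # one unit = BLOCK doubles
        big.Commit()
        try:
            counts = [c // BLOCK for c in sendcounts]        # per-rank counts, in units
            displs = [0]
            for c in counts[:-1]:
                displs.append(displs[-1] + c)                # displacements, also in units
            if comm.Get_rank() == root:
                F = np.empty(sum(sendcounts), dtype=np.float64)
                recvbuf = [F, counts, displs, big]
            else:
                F = None
                recvbuf = None
            comm.Gatherv(sendbuf=[local_flat, local_flat.size // BLOCK, big],
                         recvbuf=recvbuf, root=root)
            return F
        finally:
            big.Free()

Because displacements are measured in units of the derived datatype's extent, a displacement of d addresses d * BLOCK doubles, which is the trick that keeps all integer arguments passed to MPI small.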

0 reactions
jeffhammond commented, Jan 25, 2021

Again, this is not mpi4py’s fault or laziness, it is a limitation of MPI that has not yet been officially addressed by the standard.

MPI 4.0 will have large-count support (second vote shown in https://www.mpi-forum.org/meetings/2020/09/votes), although implementation work is still in progress (https://github.com/pmodels/mpich/issues/4880).
