MPI bug when multiple GPUs are used per calculation

I wanted to create an issue about this in openmmtools as well, since we have completed transferring the multistate code from YANK.

We’re still experiencing mixing problems with the latest version of MPI when multiple GPUs are available (see choderalab/yank#1130 and #407). I’ve added a test to test_sampling that checks for this in #407, but I still haven’t figured out the reason for the bug.
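One common way to inspect replica mixing is to look at the empirical transition matrix between thermodynamic states. The following is only a minimal sketch of that idea, not the test added in #407: the function name and the `state_indices` layout (one list of per-replica state indices per iteration) are assumptions made for illustration.

```python
def empirical_transition_matrix(state_indices, n_states):
    """Estimate state-to-state transition probabilities.

    state_indices[t][r] is assumed to hold the state index of replica r
    at iteration t (a hypothetical layout chosen for this sketch).
    """
    counts = [[0] * n_states for _ in range(n_states)]
    # Count every observed (state before, state after) pair for each replica.
    for before, after in zip(state_indices, state_indices[1:]):
        for i, j in zip(before, after):
            counts[i][j] += 1
    # Normalize each row; rows with no observations stay at zero.
    matrix = []
    for row in counts:
        total = sum(row)
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix


# Replicas that never leave their initial state give an identity matrix,
# which is the signature of a sampler that is not mixing at all.
stuck = [[0, 1], [0, 1], [0, 1]]
stuck_matrix = empirical_transition_matrix(stuck, n_states=2)
```

A run with the same number of GPUs but without MPI could serve as the reference matrix to compare against.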

I’m sure this was working correctly in YANK during the SAMPLing challenge (i.e., YANK 0.20.1, right before adding the multistate module) so the next step would probably be trying binary search on the YANK versions after that to identify where the problem was introduced.

Top GitHub Comments

4 reactions

ijpulidos commented, Mar 29, 2022

Thanks a lot for all the work and effort from @zhang-ivy, especially for pointing me to the differences between YANK versions 0.20.1 and 0.21.0, which is when this issue first appeared. I think I managed to come up with a solution.

I made a PR with a probable solution. As far as I could tell, the _replica_thermodynamic_states attribute was not being broadcast to the MPI context. More details are in the PR.

@zhang-ivy, if you can confirm that this solves it with all your examples and systems, that would be a really nice validation. You just need to install the fix-mpi-replica-mix branch with something like pip install "git+https://github.com/choderalab/openmmtools.git@fix-mpi-replica-mix"
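To illustrate why a missing broadcast could show up as a mixing problem, here is a toy model in plain Python with no MPI involved. It is not the openmmtools implementation: `ToyRank`, `mix_replicas`, and `states_used_for_propagation` are hypothetical names, and the assumptions that only rank 0 computes the new assignment and that replica i is propagated by rank `i % n_ranks` are made purely for the sketch.

```python
class ToyRank:
    """Stand-in for one MPI process holding its own copy of the assignment."""

    def __init__(self, rank, replica_states):
        self.rank = rank
        self.replica_states = list(replica_states)


def mix_replicas(ranks, new_assignment, broadcast):
    # Assumption: only rank 0 computes the new replica-to-state assignment.
    ranks[0].replica_states = list(new_assignment)
    if broadcast:
        # The fix, in spirit: copy rank 0's result to every other rank.
        for other in ranks[1:]:
            other.replica_states = list(ranks[0].replica_states)


def states_used_for_propagation(ranks, n_replicas):
    # Assumption: replica i is propagated by rank i % n_ranks,
    # which consults its own local (possibly stale) copy.
    return [ranks[i % len(ranks)].replica_states[i] for i in range(n_replicas)]


initial = [0, 1, 2, 3]
swapped = [1, 0, 3, 2]

buggy = [ToyRank(r, initial) for r in range(2)]
mix_replicas(buggy, swapped, broadcast=False)
stale = states_used_for_propagation(buggy, 4)   # rank 1 still sees the old states

fixed = [ToyRank(r, initial) for r in range(2)]
mix_replicas(fixed, swapped, broadcast=True)
consistent = states_used_for_propagation(fixed, 4)
```

In the toy model the unbroadcast case propagates replicas 1 and 3 in their old states, so two replicas end up sharing a state. With a single process there is only one copy of the assignment, which would be consistent with the bug appearing only when multiple GPUs are used. In real code using mpi4py, the analogous step would be a call such as `comm.bcast(obj, root=0)`.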

1 reaction

jchodera commented, Nov 9, 2019

I’m sure this was working correctly in YANK during the SAMPLing challenge (i.e., YANK 0.20.1, right before adding the multistate module) so the next step would probably be trying binary search on the YANK versions after that to identify where the problem was introduced.

That sounds like the best thing to try now that you have a working test! You could use git bisect run to automate this process. Since the samplers moved from YANK to openmmtools, it could be a simple matter of testing the last version of YANK that included the multistate samplers (which presumably still have the bug) and then bisecting between that and 0.20.1.
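As a rough sketch of what a script for git bisect run might look like, the helper below maps the outcome of a test command onto the exit codes git bisect run understands: 0 marks the commit good, 125 asks git to skip it, and any other code from 1 to 127 marks it bad. The script name, the optional build step, and the test command are placeholders, not part of the YANK or openmmtools repositories.

```python
import subprocess
import sys

GOOD, BAD, SKIP = 0, 1, 125  # exit codes interpreted by `git bisect run`


def bisect_exit_code(build_ok, test_returncode):
    """Translate a build/test outcome into a `git bisect run` exit code."""
    if not build_ok:
        return SKIP  # commit cannot be tested, so let git skip it
    return GOOD if test_returncode == 0 else BAD


def run_bisect_step(test_command, build_command=None):
    """Run an optional build step, then the test, at the current commit."""
    if build_command is not None:
        if subprocess.run(build_command).returncode != 0:
            return bisect_exit_code(False, None)
    test = subprocess.run(test_command)
    return bisect_exit_code(True, test.returncode)


# Only act when a test command is passed on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(run_bisect_step(sys.argv[1:]))
```

Saved under a hypothetical name such as bisect_step.py, this could be driven with git bisect start, followed by git bisect bad and git bisect good on the two endpoint versions, and then git bisect run python bisect_step.py followed by whatever command launches the MPI mixing test.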