MPI bug when multiple GPUs are used per calculation

I wanted to create an issue about this in openmmtools as well, since we have finished transferring the multistate code from YANK.

We’re still experiencing mixing problems with the latest version of MPI when multiple GPUs are available (see choderalab/yank#1130 and #407). I’ve added a test to test_sampling checking this in #407, but I still haven’t figured out the reason for the bug.

I’m sure this was working correctly in YANK during the SAMPLing challenge (i.e., YANK 0.20.1, right before adding the multistate module) so the next step would probably be trying binary search on the YANK versions after that to identify where the problem was introduced.

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 12 (10 by maintainers)

Top GitHub Comments

4 reactions
ijpulidos commented, Mar 29, 2022

Thanks a lot to @zhang-ivy for all the work and effort, especially for pointing me to the differences between YANK versions 0.20.1 and 0.21.0, which is when this issue first appeared. I think I managed to come up with a solution.

I made a PR with a probable solution. As far as I could tell, the _replica_thermodynamic_states attribute was not being broadcast to the MPI context. More details are in the PR.
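
To illustrate the failure mode described above, here is a toy model in plain Python. It is not openmmtools code and does not use MPI: each "rank" is just a list holding its own copy of the replica-to-state assignment, the rotation stands in for a real mixing step, and the assumption that only the root rank performs the mixing is mine, not something stated in the thread. All function names are hypothetical.

```python
def rotate(state_indices):
    """Stand-in for a replica-mixing step: shift every replica to the next state."""
    return state_indices[1:] + state_indices[:1]


def run_iterations(n_ranks, n_replicas, n_iterations, broadcast):
    """Simulate ranks that each hold a private copy of the state assignment.

    Assumption for this sketch: only rank 0 mixes the replicas.
    """
    rank_copies = [list(range(n_replicas)) for _ in range(n_ranks)]
    for _ in range(n_iterations):
        mixed = rotate(rank_copies[0])
        rank_copies[0] = mixed
        if broadcast:
            # In a real MPI program this would be something like
            # comm.bcast(mixed, root=0) with mpi4py.
            for rank in range(1, n_ranks):
                rank_copies[rank] = list(mixed)
    return rank_copies


def ranks_agree(rank_copies):
    """True if every rank holds the same assignment as rank 0."""
    return all(copy == rank_copies[0] for copy in rank_copies)


if __name__ == "__main__":
    print("with broadcast:   ", ranks_agree(run_iterations(2, 4, 3, broadcast=True)))
    print("without broadcast:", ranks_agree(run_iterations(2, 4, 3, broadcast=False)))
```

Without the broadcast step, the non-root rank keeps propagating its replicas at the original states, which is one way a mixing problem could appear only when several processes (for example, one per GPU) are in use.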

@zhang-ivy, if you can confirm that this solves it with all your examples and systems, that would be really nice as validation. You just need to install the fix-mpi-replica-mix branch with something like pip install "git+https://github.com/choderalab/openmmtools.git@fix-mpi-replica-mix"

1 reaction
jchodera commented, Nov 9, 2019

I’m sure this was working correctly in YANK during the SAMPLing challenge (i.e., YANK 0.20.1, right before adding the multistate module) so the next step would probably be trying binary search on the YANK versions after that to identify where the problem was introduced.

That sounds like the best thing to try now that you have a working test! You could use git bisect run to automate this process. Since the samplers moved from YANK to openmmtools, it could be a simple matter of testing the last version of YANK that included the multistate samplers (which presumably still have the bug) and then bisecting between that and 0.20.1.
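
As a rough sketch of how git bisect run could be driven here, the wrapper below runs a test command and converts its return code into the exit codes git bisect run understands: 0 marks the commit as good, 125 asks git to skip it, and other codes from 1 to 127 mark it as bad. The script name bisect_check.py is hypothetical, and the mapping assumes the test command follows pytest's convention of returning 1 when tests fail.

```python
import subprocess
import sys


def bisect_exit_code(returncode):
    """Translate a test runner's return code into a `git bisect run` exit code."""
    if returncode == 0:
        return 0    # tests passed: commit is good
    if returncode == 1:
        return 1    # tests ran and failed: commit is bad
    return 125      # could not run the tests (e.g. import error): skip commit


def run_and_classify(command):
    """Run the test command and return the exit code to hand back to git."""
    result = subprocess.run(command)
    return bisect_exit_code(result.returncode)


# The test command is passed on the command line, so nothing is hard-coded.
if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(run_and_classify(sys.argv[1:]))
```

Usage would look something like git bisect start <bad-revision> <good-revision> followed by git bisect run python bisect_check.py python -m pytest -x <path to the mixing test>, where the revisions and the test path are placeholders to fill in. Skipping commits that fail for unrelated reasons keeps a broken intermediate commit from being blamed for the bug.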


