
File locking error when resuming a parallel tempering simulation on environment with MPI


Description

I found that if I try to extend a finished parallel tempering simulation from its NetCDF file, this error is thrown:

Traceback (most recent call last):
  File "test_extend.py", line 4, in <module>
    simulation = ms.ParallelTemperingSampler.from_storage('./output.nc')
  File "/scratch/midway2/yihengwu917/.conda/envs/openmm_hh/lib/python3.7/site-packages/openmmtools/multistate/multistatesampler.py", line 296, in from_storage
    broadcast_result=False, sync_nodes=False)
  File "/scratch/midway2/yihengwu917/.conda/envs/openmm_hh/lib/python3.7/site-packages/mpiplus/mpiplus.py", line 220, in run_single_node
    result = task(*args, **kwargs)
  File "/scratch/midway2/yihengwu917/.conda/envs/openmm_hh/lib/python3.7/site-packages/openmmtools/multistate/multistatereporter.py", line 280, in open 
    mode, version=netcdf_format)
  File "/scratch/midway2/yihengwu917/.conda/envs/openmm_hh/lib/python3.7/site-packages/openmmtools/multistate/multistatereporter.py", line 391, in _open_dataset_robustly
    return netcdf.Dataset(*args, **kwargs)
  File "src/netCDF4/_netCDF4.pyx", line 2307, in netCDF4._netCDF4.Dataset.__init__
  File "src/netCDF4/_netCDF4.pyx", line 1925, in netCDF4._netCDF4._ensure_nc_success
OSError: [Errno -101] NetCDF: HDF error: b'./output.nc'

What is intriguing is that when I put the code for extending the simulation in the same script as the code for running it, it works fine. The error is only raised when I first run the simulation script and then try to extend it from a separate file. This doesn’t make sense to me at all.
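
To narrow down where the failure happens, a minimal isolation step (a sketch I have not run, assuming plain netCDF4 is enough to trigger the same lock) is to reopen the file in append mode without openmmtools:

#!/usr/bin/env python
# Sketch: reopen output.nc in append mode with netCDF4 alone.
# If this also raises "NetCDF: HDF error", the problem sits in the
# netCDF4/HDF5 layer (e.g. a stale file lock), not in openmmtools.
import netCDF4

ds = netCDF4.Dataset('./output.nc', 'a')
print(ds.dimensions)
ds.close()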

Another thing worth mentioning: this error does not show up on my local machine. After comparing the logs, the difference I found is that on my local machine MPI is not found and is disabled. The failing jobs were run on a single GPU card; I also tried a single CPU core and it still throws the error.
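
One way to confirm what mpiplus sees on each machine (a sketch; it assumes mpiplus.get_mpicomm() is the detection entry point behind the "MPI initialized"/"MPI disabled" debug messages shown in the logs below):

#!/usr/bin/env python
# Sketch: report whether mpiplus finds an MPI environment on this node.
import mpiplus

comm = mpiplus.get_mpicomm()
print(comm)  # None when MPI is disabled; an MPI communicator otherwise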

Version

openmm 7.7.0, openmmtools 0.21.5, mpiplus v0.0.1, netcdf4 1.5.8, h5py 3.6.0. I noticed that two issues, one in yank (https://github.com/choderalab/yank/issues/1165) and one in perses (https://github.com/choderalab/openmmtools/pull/515), might be relevant, but it seems they were already solved and the fixes are in openmmtools. @mikemhenry, maybe you know better about this.

Procedure to reproduce

Producing errors with two scripts

  1. Run the parallel tempering simulation of alanine dipeptide:
#!/usr/bin/env python
from openmm import unit
from openmmtools import testsystems, states, mcmc
from openmmtools import multistate as ms
import logging
logging.basicConfig(level=logging.DEBUG)

testsystem = testsystems.AlanineDipeptideImplicit()
n_replicas = 3  # Number of temperature replicas.
T_min = 298.0 * unit.kelvin  # Minimum temperature.
T_max = 600.0 * unit.kelvin  # Maximum temperature.
reference_state = states.ThermodynamicState(system=testsystem.system, temperature=T_min)

move = mcmc.GHMCMove(timestep=2.0*unit.femtoseconds, n_steps=50)
simulation = ms.ParallelTemperingSampler(mcmc_moves=move, number_of_iterations=2)

reporter = ms.MultiStateReporter('./output.nc', checkpoint_interval=1)
simulation.create(reference_state,
                  states.SamplerState(testsystem.positions),
                  reporter, min_temperature=T_min,
                  max_temperature=T_max, n_temperatures=n_replicas)

simulation.run()

  2. Run the following script (in a separate file):
#!/usr/bin/env python
from openmmtools import multistate as ms
import logging
logging.basicConfig(level=logging.DEBUG)

simulation = ms.ParallelTemperingSampler.from_storage('./output.nc')
simulation.extend(1)

This is the whole error log:

WARNING:openmmtools.multistate.multistatereporter:Warning: The openmmtools.multistate API is experimental and may change in future releases
DEBUG:openmmtools.multistate.multistatereporter:Initial checkpoint file automatically chosen as ./output_checkpoint.nc
DEBUG:openmmtools.multistate.multistatereporter:checkpoint_interval != on-file checkpoint interval! Using on file analysis interval of 1.
WARNING:openmmtools.multistate.multistatesampler:Warning: The openmmtools.multistate API is experimental and may change in future releases
DEBUG:openmmtools.multistate.multistatesampler:CUDA devices available: (['0', ' GeForce GTX 1080'],)
DEBUG:mpiplus.mpiplus:MPI initialized on node 1/1 
DEBUG:mpiplus.mpiplus:Node 1/1: executing <function ReplicaExchangeSampler._display_citations at 0x7fa6bfe3cb90>
DEBUG:mpiplus.mpiplus:Node 1/1: executing <function MultiStateSampler._display_citations at 0x7fa6d62b40e0>
DEBUG:openmmtools.multistate.multistatesampler:Reading storage file ./output.nc...
DEBUG:openmmtools.utils:Reading thermodynamic states from storage took    0.008s
DEBUG:openmmtools.multistate.multistatereporter:read_replica_thermodynamic_states: iteration = 2 
DEBUG:mpiplus.mpiplus:Node 1/1: executing <bound method MultiStateReporter.open of <openmmtools.multistate.multistatereporter.MultiStateReporter object at 0x7fa6f7216690>>
DEBUG:openmmtools.multistate.multistatereporter:Attempt 1/5 to open ./output.nc failed. Retrying in 2 seconds
DEBUG:openmmtools.multistate.multistatereporter:Attempt 2/5 to open ./output.nc failed. Retrying in 2 seconds
DEBUG:openmmtools.multistate.multistatereporter:Attempt 3/5 to open ./output.nc failed. Retrying in 2 seconds
DEBUG:openmmtools.multistate.multistatereporter:Attempt 4/5 to open ./output.nc failed. Retrying in 2 seconds
Traceback (most recent call last):
  File "test_extend.py", line 6, in <module>
    simulation = ms.ParallelTemperingSampler.from_storage('./output.nc')
  File "/scratch/midway2/yihengwu917/.conda/envs/openmm_hh/lib/python3.7/site-packages/openmmtools/multistate/multistatesampler.py", line 296, in from_storage
    broadcast_result=False, sync_nodes=False)
  File "/scratch/midway2/yihengwu917/.conda/envs/openmm_hh/lib/python3.7/site-packages/mpiplus/mpiplus.py", line 220, in run_single_node
    result = task(*args, **kwargs)
  File "/scratch/midway2/yihengwu917/.conda/envs/openmm_hh/lib/python3.7/site-packages/openmmtools/multistate/multistatereporter.py", line 280, in open
    mode, version=netcdf_format)
  File "/scratch/midway2/yihengwu917/.conda/envs/openmm_hh/lib/python3.7/site-packages/openmmtools/multistate/multistatereporter.py", line 391, in _open_dataset_robustly
    return netcdf.Dataset(*args, **kwargs)
  File "src/netCDF4/_netCDF4.pyx", line 2307, in netCDF4._netCDF4.Dataset.__init__
  File "src/netCDF4/_netCDF4.pyx", line 1925, in netCDF4._netCDF4._ensure_nc_success
OSError: [Errno -101] NetCDF: HDF error: b'./output.nc'
CRITICAL:mpiplus.mpiplus:MPI node 1/1 raised an exception and called Abort()! The exception traceback follows
Traceback (most recent call last):
  File "test_extend.py", line 6, in <module>
    simulation = ms.ParallelTemperingSampler.from_storage('./output.nc')
  File "/scratch/midway2/yihengwu917/.conda/envs/openmm_hh/lib/python3.7/site-packages/openmmtools/multistate/multistatesampler.py", line 296, in from_storage
    broadcast_result=False, sync_nodes=False)
  File "/scratch/midway2/yihengwu917/.conda/envs/openmm_hh/lib/python3.7/site-packages/mpiplus/mpiplus.py", line 220, in run_single_node
    result = task(*args, **kwargs)
  File "/scratch/midway2/yihengwu917/.conda/envs/openmm_hh/lib/python3.7/site-packages/openmmtools/multistate/multistatereporter.py", line 280, in open
    mode, version=netcdf_format)
  File "/scratch/midway2/yihengwu917/.conda/envs/openmm_hh/lib/python3.7/site-packages/openmmtools/multistate/multistatereporter.py", line 391, in _open_dataset_robustly
    return netcdf.Dataset(*args, **kwargs)
  File "src/netCDF4/_netCDF4.pyx", line 2307, in netCDF4._netCDF4.Dataset.__init__
  File "src/netCDF4/_netCDF4.pyx", line 1925, in netCDF4._netCDF4._ensure_nc_success
OSError: [Errno -101] NetCDF: HDF error: b'./output.nc'
DEBUG:mpiplus.mpiplus:Node 1/1: executing <bound method MultiStateReporter.close of <openmmtools.multistate.multistatereporter.MultiStateReporter object at 0x7fa6f7216690>>                                                                                                                                                                          

Here are the outputs and error logs of the first and second scripts: two_scripts.zip

No error when combined in one script

Running this script does not raise any error:

#!/usr/bin/env python
from openmm import unit
from openmmtools import testsystems, states, mcmc
from openmmtools import multistate as ms
import logging
logging.basicConfig(level=logging.DEBUG)

testsystem = testsystems.AlanineDipeptideImplicit()
n_replicas = 3  # Number of temperature replicas.
T_min = 298.0 * unit.kelvin  # Minimum temperature.
T_max = 600.0 * unit.kelvin  # Maximum temperature.
reference_state = states.ThermodynamicState(system=testsystem.system, temperature=T_min)

move = mcmc.GHMCMove(timestep=2.0*unit.femtoseconds, n_steps=50)
simulation = ms.ParallelTemperingSampler(mcmc_moves=move, number_of_iterations=2)

reporter = ms.MultiStateReporter('./output.nc', checkpoint_interval=1)
simulation.create(reference_state,
                  states.SamplerState(testsystem.positions),
                  reporter, min_temperature=T_min,
                  max_temperature=T_max, n_temperatures=n_replicas)

simulation.run()

# Drop the sampler so its reporter releases the NetCDF file handles
# before the storage file is reopened below.
del simulation

simulation2 = ms.ParallelTemperingSampler.from_storage('./output.nc')
simulation2.extend(1)
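
A variant of the same idea (my sketch, untested): close the reporter explicitly instead of relying on del to release the file handles. MultiStateReporter.close appears in the debug log above, so the method should exist.

# Sketch: explicitly release the NetCDF file handles before reopening.
reporter.close()
del simulation

simulation2 = ms.ParallelTemperingSampler.from_storage('./output.nc')
simulation2.extend(1)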

Here is the zip of the scripts and the full output and error log: one_script.zip

No error when the two scripts are run locally (without MPI)

These are the same two scripts as above; here is the zip of everything: local.zip

Highlighting the vimdiff here (local machine first, then the GPU node):

Local machine (MPI disabled):

WARNING:openmmtools.multistate.multistatesampler:Warning: The openmmtools.multistate API is experimental and may change in future releases
WARNING:openmmtools.multistate.multistatereporter:Warning: The openmmtools.multistate API is experimental and may change in future releases
DEBUG:openmmtools.multistate.multistatereporter:Initial checkpoint file automatically chosen as ./output_checkpoint.nc
DEBUG:openmmtools.multistate.paralleltempering:using temperatures [298.         422.84749024 600.        ] K
DEBUG:mpiplus.mpiplus:Cannot find MPI environment. MPI disabled.
DEBUG:mpiplus.mpiplus:Single node: executing <bound method MultiStateReporter.storage_exists of <openmmtools.multistate.multistatereporter...
DEBUG:mpiplus.mpiplus:Single node: executing <function ReplicaExchangeSampler._display_citations at 0x7f2b000483b0>
DEBUG:mpiplus.mpiplus:Single node: executing <function MultiStateSampler._display_citations at 0x7f2b066b5d40>
DEBUG:mpiplus.mpiplus:Single node: executing <function MultiStateSampler._initialize_reporter at 0x7f2b066b8050>
DEBUG:openmmtools.multistate.multistatereporter:Serialized state thermodynamic_states/0 is  3480B | 3.398KB | 0.003MB
DEBUG:openmmtools.utils:Storing thermodynamic states took    0.004s
DEBUG:openmmtools.multistate.multistatesampler:Storing general ReplicaExchange options...
DEBUG:mpiplus.mpiplus:Single node: executing <function MultiStateSampler._report_iteration at 0x7f2b066b8170>
DEBUG:mpiplus.mpiplus:Single node: executing <function MultiStateSampler._report_iteration_items at 0x7f2b066b8440>
DEBUG:openmmtools.utils:Storing sampler states took    0.002s
DEBUG:openmmtools.utils:Writing iteration information to storage took    0.011s
DEBUG:mpiplus.mpiplus:Running _compute_replica_energies serially.
DEBUG:openmmtools.utils:Computing energy matrix took    0.021s
DEBUG:mpiplus.mpiplus:Single node: executing <bound method MultiStateReporter.write_energies of <openmmtools.multistate.multistatereporter...

GPU node (MPI enabled):

WARNING:openmmtools.multistate.multistatesampler:Warning: The openmmtools.multistate API is experimental and may change in future releases
DEBUG:openmmtools.multistate.multistatesampler:CUDA devices available: (['0', ' GeForce GTX 1080'],)
WARNING:openmmtools.multistate.multistatereporter:Warning: The openmmtools.multistate API is experimental and may change in future releases
DEBUG:openmmtools.multistate.multistatereporter:Initial checkpoint file automatically chosen as ./output_checkpoint.nc
DEBUG:openmmtools.multistate.paralleltempering:using temperatures [298.         422.84749024 600.        ] K
DEBUG:mpiplus.mpiplus:MPI initialized on node 1/1
DEBUG:mpiplus.mpiplus:Node 1/1: executing <bound method MultiStateReporter.storage_exists of <openmmtools.multistate.multistatereporter.Mu...
DEBUG:mpiplus.mpiplus:Node 1/1: waiting for broadcast of <bound method MultiStateReporter.storage_exists of <openmmtools.multistate.multis...
DEBUG:mpiplus.mpiplus:Node 1/1: executing <function ReplicaExchangeSampler._display_citations at 0x7f62a2fb3a70>
DEBUG:mpiplus.mpiplus:Node 1/1: executing <function MultiStateSampler._display_citations at 0x7f62a3a4bf80>
DEBUG:mpiplus.mpiplus:Node 1/1: executing <function MultiStateSampler._initialize_reporter at 0x7f62a39cc290>
DEBUG:openmmtools.multistate.multistatereporter:Serialized state thermodynamic_states/0 is  3480B | 3.398KB | 0.003MB
DEBUG:openmmtools.utils:Storing thermodynamic states took    0.012s
DEBUG:openmmtools.multistate.multistatesampler:Storing general ReplicaExchange options...
DEBUG:mpiplus.mpiplus:Node 1/1: executing <function MultiStateSampler._report_iteration at 0x7f62a39cc3b0>
DEBUG:mpiplus.mpiplus:Node 1/1: executing <function MultiStateSampler._report_iteration_items at 0x7f62a39cc680>
DEBUG:openmmtools.utils:Storing sampler states took    0.005s
DEBUG:openmmtools.utils:Writing iteration information to storage took    0.019s
DEBUG:mpiplus.mpiplus:Node 1/1: waiting for barrier after <function MultiStateSampler._initialize_reporter at 0x7f62a39cc290>
DEBUG:mpiplus.mpiplus:Node 1/1: execute _compute_replica_energies(0)
DEBUG:mpiplus.mpiplus:Node 1/1: execute _compute_replica_energies(1)

The key differences: the GPU node detects CUDA and initializes MPI ("MPI initialized on node 1/1"), adds broadcast/barrier synchronization around the reporter calls, and dispatches _compute_replica_energies through MPI, while the local run reports "Cannot find MPI environment. MPI disabled." and executes everything serially.


Top GitHub Comments

1 reaction
mikemhenry commented, Sep 21, 2022

@yihengwuKP thank you for this very detailed report! 😍

I don’t have time to look into this right now, but I will make sure this gets triaged into a milestone. I know that @zhang-ivy has had success in using MPI, so perhaps there is something borked with the MPI environment where you are running your scripts. Do you have another HPC environment you can use to test if you can reproduce this error?

Thanks! Mike

0 reactions
mikemhenry commented, Oct 11, 2022

Hmm, it could break things, but you could try something like export SLURM_PROCID=foo to get around that check.
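
For concreteness, a sketch of how one might try this alongside HDF5's documented file-locking switch, a common culprit behind "NetCDF: HDF error" on cluster filesystems. Both environment variables here are assumptions to experiment with, not confirmed fixes:

#!/usr/bin/env python
# Sketch (untested): set both workarounds before anything imports
# netCDF4/HDF5 or mpiplus. HDF5_USE_FILE_LOCKING=FALSE disables HDF5's
# POSIX file locking (HDF5 >= 1.10); overriding SLURM_PROCID targets the
# MPI-environment check Mike mentions, and may itself break things.
import os
os.environ['HDF5_USE_FILE_LOCKING'] = 'FALSE'
os.environ['SLURM_PROCID'] = 'foo'

from openmmtools import multistate as ms

simulation = ms.ParallelTemperingSampler.from_storage('./output.nc')
simulation.extend(1)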
