File locking error when resuming a parallel tempering simulation on environment with MPI
Description
I found that if I try to extend a finished parallel tempering simulation from the NetCDF file, this error is thrown:
Traceback (most recent call last):
File "test_extend.py", line 4, in <module>
simulation = ms.ParallelTemperingSampler.from_storage('./output.nc')
File "/scratch/midway2/yihengwu917/.conda/envs/openmm_hh/lib/python3.7/site-packages/openmmtools/multistate/multistatesampler.py", line 296, in from_storage
broadcast_result=False, sync_nodes=False)
File "/scratch/midway2/yihengwu917/.conda/envs/openmm_hh/lib/python3.7/site-packages/mpiplus/mpiplus.py", line 220, in run_single_node
result = task(*args, **kwargs)
File "/scratch/midway2/yihengwu917/.conda/envs/openmm_hh/lib/python3.7/site-packages/openmmtools/multistate/multistatereporter.py", line 280, in open
mode, version=netcdf_format)
File "/scratch/midway2/yihengwu917/.conda/envs/openmm_hh/lib/python3.7/site-packages/openmmtools/multistate/multistatereporter.py", line 391, in _open_dataset_robustly
return netcdf.Dataset(*args, **kwargs)
File "src/netCDF4/_netCDF4.pyx", line 2307, in netCDF4._netCDF4.Dataset.__init__
File "src/netCDF4/_netCDF4.pyx", line 1925, in netCDF4._netCDF4._ensure_nc_success
OSError: [Errno -101] NetCDF: HDF error: b'./output.nc'
What is intriguing is that when I put the code for extending the simulation in the same script as the code for running it, everything works fine. The error only appears when I first run the simulation script and then try to extend it from a separate script. This doesn’t make sense to me at all.
Another thing worth mentioning is that this error does not show up on my local machine; after comparing the logs, the only difference I found is that on my local machine MPI is not found and is therefore disabled. The failing jobs were run on one GPU card, and running on a single CPU core still throws the error.
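One workaround that might be worth trying (a sketch, not something I have verified on this cluster) is disabling HDF5 file locking before the storage file is opened, since the "NetCDF: HDF error" points at the HDF5 layer. HDF5_USE_FILE_LOCKING is a standard HDF5 (>= 1.10) environment variable and has to be set before netCDF4/h5py open any file:

#!/usr/bin/env python
import os
# Must be set before netCDF4/h5py open any file (HDF5 >= 1.10 honors this variable).
os.environ['HDF5_USE_FILE_LOCKING'] = 'FALSE'

from openmmtools import multistate as ms
simulation = ms.ParallelTemperingSampler.from_storage('./output.nc')
simulation.extend(1)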
Version
- openmm 7.7.0
- openmmtools 0.21.5
- mpiplus v0.0.1
- netcdf4 1.5.8
- h5py 3.6.0

I noticed that two earlier threads, https://github.com/choderalab/yank/issues/1165 in yank and https://github.com/choderalab/openmmtools/pull/515 in openmmtools, might be relevant, but it seems they have already been resolved and the fixes are included in openmmtools. @mikemhenry maybe you know more about this.
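Since the failure happens at the HDF5 layer, it may also matter which HDF5 library each package was built against. A short snippet to print that (a sketch; it assumes the usual version attributes exposed by netCDF4 and h5py):

import netCDF4, h5py
# Library versions the Python bindings were built against.
print('netCDF4:', netCDF4.__version__, 'netcdf-c:', netCDF4.__netcdf4libversion__, 'hdf5:', netCDF4.__hdf5libversion__)
print('h5py:', h5py.__version__, 'hdf5:', h5py.version.hdf5_version)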
Procedure to reproduce
Reproducing the error with two scripts
- Run the parallel tempering of alanine dipeptide with the following script:
#!/usr/bin/env python
from openmm import unit
from openmmtools import testsystems, states, mcmc
from openmmtools import multistate as ms
import logging
logging.basicConfig(level=logging.DEBUG)
testsystem = testsystems.AlanineDipeptideImplicit()
n_replicas = 3 # Number of temperature replicas.
T_min = 298.0 * unit.kelvin # Minimum temperature.
T_max = 600.0 * unit.kelvin # Maximum temperature.
reference_state = states.ThermodynamicState(system=testsystem.system, temperature=T_min)
move = mcmc.GHMCMove(timestep=2.0*unit.femtoseconds, n_steps=50)
simulation = ms.ParallelTemperingSampler(mcmc_moves=move, number_of_iterations=2)
reporter = ms.MultiStateReporter('./output.nc', checkpoint_interval=1)
simulation.create(reference_state,
                  states.SamplerState(testsystem.positions),
                  reporter, min_temperature=T_min,
                  max_temperature=T_max, n_temperatures=n_replicas)
simulation.run()
- Then run the following script (in a separate file):
#!/usr/bin/env python
from openmmtools import multistate as ms
import logging
logging.basicConfig(level=logging.DEBUG)
simulation = ms.ParallelTemperingSampler.from_storage('./output.nc')
simulation.extend(1)
This is the full error log:
WARNING:openmmtools.multistate.multistatereporter:Warning: The openmmtools.multistate API is experimental and may change in future releases
DEBUG:openmmtools.multistate.multistatereporter:Initial checkpoint file automatically chosen as ./output_checkpoint.nc
DEBUG:openmmtools.multistate.multistatereporter:checkpoint_interval != on-file checkpoint interval! Using on file analysis interval of 1.
WARNING:openmmtools.multistate.multistatesampler:Warning: The openmmtools.multistate API is experimental and may change in future releases
DEBUG:openmmtools.multistate.multistatesampler:CUDA devices available: (['0', ' GeForce GTX 1080'],)
DEBUG:mpiplus.mpiplus:MPI initialized on node 1/1
DEBUG:mpiplus.mpiplus:Node 1/1: executing <function ReplicaExchangeSampler._display_citations at 0x7fa6bfe3cb90>
DEBUG:mpiplus.mpiplus:Node 1/1: executing <function MultiStateSampler._display_citations at 0x7fa6d62b40e0>
DEBUG:openmmtools.multistate.multistatesampler:Reading storage file ./output.nc...
DEBUG:openmmtools.utils:Reading thermodynamic states from storage took 0.008s
DEBUG:openmmtools.multistate.multistatereporter:read_replica_thermodynamic_states: iteration = 2
DEBUG:mpiplus.mpiplus:Node 1/1: executing <bound method MultiStateReporter.open of <openmmtools.multistate.multistatereporter.MultiStateReporter object at 0x7fa6f7216690>>
DEBUG:openmmtools.multistate.multistatereporter:Attempt 1/5 to open ./output.nc failed. Retrying in 2 seconds
DEBUG:openmmtools.multistate.multistatereporter:Attempt 2/5 to open ./output.nc failed. Retrying in 2 seconds
DEBUG:openmmtools.multistate.multistatereporter:Attempt 3/5 to open ./output.nc failed. Retrying in 2 seconds
DEBUG:openmmtools.multistate.multistatereporter:Attempt 4/5 to open ./output.nc failed. Retrying in 2 seconds
Traceback (most recent call last):
File "test_extend.py", line 6, in <module>
simulation = ms.ParallelTemperingSampler.from_storage('./output.nc')
File "/scratch/midway2/yihengwu917/.conda/envs/openmm_hh/lib/python3.7/site-packages/openmmtools/multistate/multistatesampler.py", line 296, in from_storage
broadcast_result=False, sync_nodes=False)
File "/scratch/midway2/yihengwu917/.conda/envs/openmm_hh/lib/python3.7/site-packages/mpiplus/mpiplus.py", line 220, in run_single_node
result = task(*args, **kwargs)
File "/scratch/midway2/yihengwu917/.conda/envs/openmm_hh/lib/python3.7/site-packages/openmmtools/multistate/multistatereporter.py", line 280, in open
mode, version=netcdf_format)
File "/scratch/midway2/yihengwu917/.conda/envs/openmm_hh/lib/python3.7/site-packages/openmmtools/multistate/multistatereporter.py", line 391, in _open_dataset_robustly
return netcdf.Dataset(*args, **kwargs)
File "src/netCDF4/_netCDF4.pyx", line 2307, in netCDF4._netCDF4.Dataset.__init__
File "src/netCDF4/_netCDF4.pyx", line 1925, in netCDF4._netCDF4._ensure_nc_success
OSError: [Errno -101] NetCDF: HDF error: b'./output.nc'
CRITICAL:mpiplus.mpiplus:MPI node 1/1 raised an exception and called Abort()! The exception traceback follows
Traceback (most recent call last):
File "test_extend.py", line 6, in <module>
simulation = ms.ParallelTemperingSampler.from_storage('./output.nc')
File "/scratch/midway2/yihengwu917/.conda/envs/openmm_hh/lib/python3.7/site-packages/openmmtools/multistate/multistatesampler.py", line 296, in from_storage
broadcast_result=False, sync_nodes=False)
File "/scratch/midway2/yihengwu917/.conda/envs/openmm_hh/lib/python3.7/site-packages/mpiplus/mpiplus.py", line 220, in run_single_node
result = task(*args, **kwargs)
File "/scratch/midway2/yihengwu917/.conda/envs/openmm_hh/lib/python3.7/site-packages/openmmtools/multistate/multistatereporter.py", line 280, in open
mode, version=netcdf_format)
File "/scratch/midway2/yihengwu917/.conda/envs/openmm_hh/lib/python3.7/site-packages/openmmtools/multistate/multistatereporter.py", line 391, in _open_dataset_robustly
return netcdf.Dataset(*args, **kwargs)
File "src/netCDF4/_netCDF4.pyx", line 2307, in netCDF4._netCDF4.Dataset.__init__
File "src/netCDF4/_netCDF4.pyx", line 1925, in netCDF4._netCDF4._ensure_nc_success
OSError: [Errno -101] NetCDF: HDF error: b'./output.nc'
DEBUG:mpiplus.mpiplus:Node 1/1: executing <bound method MultiStateReporter.close of <openmmtools.multistate.multistatereporter.MultiStateReporter object at 0x7fa6f7216690>>
Here are the output and error logs of the first and second scripts: two_scripts.zip
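Independently of the sampler, the storage file written by the first script can be checked with a plain netCDF4 read (a sketch, just a diagnostic I would run between the two scripts to see whether the file itself is intact):

#!/usr/bin/env python
import netCDF4 as netcdf
# Open the storage file read-only and list its contents to confirm it is readable
# outside of openmmtools.
with netcdf.Dataset('./output.nc', 'r') as ds:
    print(list(ds.dimensions))
    print(list(ds.variables))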
No error when combined in one script
Running this script does not raise any error:
#!/usr/bin/env python
from openmm import unit
from openmmtools import testsystems, states, mcmc
from openmmtools import multistate as ms
import logging
logging.basicConfig(level=logging.DEBUG)
testsystem = testsystems.AlanineDipeptideImplicit()
n_replicas = 3 # Number of temperature replicas.
T_min = 298.0 * unit.kelvin # Minimum temperature.
T_max = 600.0 * unit.kelvin # Maximum temperature.
reference_state = states.ThermodynamicState(system=testsystem.system, temperature=T_min)
move = mcmc.GHMCMove(timestep=2.0*unit.femtoseconds, n_steps=50)
simulation = ms.ParallelTemperingSampler(mcmc_moves=move, number_of_iterations=2)
reporter = ms.MultiStateReporter('./output.nc', checkpoint_interval=1)
simulation.create(reference_state,
                  states.SamplerState(testsystem.positions),
                  reporter, min_temperature=T_min,
                  max_temperature=T_max, n_temperatures=n_replicas)
simulation.run()
del simulation
#print(simulation)
simulation2 = ms.ParallelTemperingSampler.from_storage('./output.nc')
simulation2.extend(1)
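One variant that might be worth testing (a sketch, untested) is closing the reporter explicitly before reloading, instead of relying on del simulation to release the file handles; MultiStateReporter.close() is the same call that shows up in the debug log above:

reporter.close()   # explicitly release the NetCDF/HDF5 file handles
del simulation

simulation2 = ms.ParallelTemperingSampler.from_storage('./output.nc')
simulation2.extend(1)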
Here is the zip of the scripts and the full output and error log: one_script.zip
No error when the two scripts are run locally (without MPI)
These are the same two scripts as above; here is the zip of everything: local.zip
Highlights of the vimdiff:
WARNING:openmmtools.multistate.multistatesampler:Warning: The openmmtools.multistate API is experimental and may change in future releases| WARNING:openmmtools.multistate.multistatesampler:Warning: The openmmtools.multistate API is experimental and may change in future releases
------------------------------------------------------------------------------------------------------------------------------------------| DEBUG:openmmtools.multistate.multistatesampler:CUDA devices available: (['0', ' GeForce GTX 1080'],)
WARNING:openmmtools.multistate.multistatereporter:Warning: The openmmtools.multistate API is experimental and may change in future release| WARNING:openmmtools.multistate.multistatereporter:Warning: The openmmtools.multistate API is experimental and may change in future release
DEBUG:openmmtools.multistate.multistatereporter:Initial checkpoint file automatically chosen as ./output_checkpoint.nc | DEBUG:openmmtools.multistate.multistatereporter:Initial checkpoint file automatically chosen as ./output_checkpoint.nc
DEBUG:openmmtools.multistate.paralleltempering:using temperatures [298. 422.84749024 600. ] K | DEBUG:openmmtools.multistate.paralleltempering:using temperatures [298. 422.84749024 600. ] K
DEBUG:mpiplus.mpiplus:Cannot find MPI environment. MPI disabled. | DEBUG:mpiplus.mpiplus:MPI initialized on node 1/1
DEBUG:mpiplus.mpiplus:Single node: executing <bound method MultiStateReporter.storage_exists of <openmmtools.multistate.multistatereporter| DEBUG:mpiplus.mpiplus:Node 1/1: executing <bound method MultiStateReporter.storage_exists of <openmmtools.multistate.multistatereporter.Mu
DEBUG:mpiplus.mpiplus:Single node: executing <function ReplicaExchangeSampler._display_citations at 0x7f2b000483b0> | DEBUG:mpiplus.mpiplus:Node 1/1: waiting for broadcast of <bound method MultiStateReporter.storage_exists of <openmmtools.multistate.multis
DEBUG:mpiplus.mpiplus:Single node: executing <function MultiStateSampler._display_citations at 0x7f2b066b5d40> | DEBUG:mpiplus.mpiplus:Node 1/1: executing <function ReplicaExchangeSampler._display_citations at 0x7f62a2fb3a70>
DEBUG:mpiplus.mpiplus:Single node: executing <function MultiStateSampler._initialize_reporter at 0x7f2b066b8050> | DEBUG:mpiplus.mpiplus:Node 1/1: executing <function MultiStateSampler._display_citations at 0x7f62a3a4bf80>
------------------------------------------------------------------------------------------------------------------------------------------| DEBUG:mpiplus.mpiplus:Node 1/1: executing <function MultiStateSampler._initialize_reporter at 0x7f62a39cc290>
DEBUG:openmmtools.multistate.multistatereporter:Serialized state thermodynamic_states/0 is 3480B | 3.398KB | 0.003MB | DEBUG:openmmtools.multistate.multistatereporter:Serialized state thermodynamic_states/0 is 3480B | 3.398KB | 0.003MB
DEBUG:openmmtools.utils:Storing thermodynamic states took 0.004s | DEBUG:openmmtools.utils:Storing thermodynamic states took 0.012s
DEBUG:openmmtools.multistate.multistatesampler:Storing general ReplicaExchange options... | DEBUG:openmmtools.multistate.multistatesampler:Storing general ReplicaExchange options...
DEBUG:mpiplus.mpiplus:Single node: executing <function MultiStateSampler._report_iteration at 0x7f2b066b8170> | DEBUG:mpiplus.mpiplus:Node 1/1: executing <function MultiStateSampler._report_iteration at 0x7f62a39cc3b0>
DEBUG:mpiplus.mpiplus:Single node: executing <function MultiStateSampler._report_iteration_items at 0x7f2b066b8440> | DEBUG:mpiplus.mpiplus:Node 1/1: executing <function MultiStateSampler._report_iteration_items at 0x7f62a39cc680>
DEBUG:openmmtools.utils:Storing sampler states took 0.002s | DEBUG:openmmtools.utils:Storing sampler states took 0.005s
DEBUG:openmmtools.utils:Writing iteration information to storage took 0.011s | DEBUG:openmmtools.utils:Writing iteration information to storage took 0.019s
DEBUG:mpiplus.mpiplus:Running _compute_replica_energies serially. | DEBUG:mpiplus.mpiplus:Node 1/1: waiting for barrier after <function MultiStateSampler._initialize_reporter at 0x7f62a39cc290>
DEBUG:openmmtools.utils:Computing energy matrix took 0.021s | DEBUG:mpiplus.mpiplus:Node 1/1: execute _compute_replica_energies(0)
DEBUG:mpiplus.mpiplus:Single node: executing <bound method MultiStateReporter.write_energies of <openmmtools.multistate.multistatereporter| DEBUG:mpiplus.mpiplus:Node 1/1: execute _compute_replica_energies(1)
On the left is the log from my local machine; on the right is the log from the GPU node.
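To check what mpiplus detects in the job environment (the key difference highlighted by the diff above), something like this can be run at the start of the job (a sketch; it assumes mpiplus.get_mpicomm(), the same helper openmmtools uses, which returns None when no MPI environment is found):

#!/usr/bin/env python
import logging
logging.basicConfig(level=logging.DEBUG)

import mpiplus
# None means mpiplus could not find an MPI environment (the local-machine case);
# otherwise an mpi4py communicator is returned (the GPU-node case).
mpicomm = mpiplus.get_mpicomm()
print('mpicomm:', mpicomm)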
Top GitHub Comments
@yihengwuKP thank you for this very detailed report! 😍
I don’t have time to look into this right now, but I will make sure this gets triaged into a milestone. I know that @zhang-ivy has had success in using MPI, so perhaps there is something borked with the MPI environment where you are running your scripts. Do you have another HPC environment you can use to test if you can reproduce this error?
Thanks! Mike
Hmm, it could break things, but you could try something like
export SLURM_PROCID=foo
to get around that check.
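In Python terms, and keeping the caveat that it could break things, the equivalent of that suggestion would be something like this at the very top of the extend script, before openmmtools or mpiplus are imported (a sketch, untested):

import os
# Same effect as `export SLURM_PROCID=foo` in the shell; must happen before
# mpiplus inspects the environment.
os.environ['SLURM_PROCID'] = 'foo'

from openmmtools import multistate as ms
simulation = ms.ParallelTemperingSampler.from_storage('./output.nc')
simulation.extend(1)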