File locking error when resuming a h-repex simulation on multiple GPUs

I was trying to resume a single repex simulation (using 4 GPUs) with the following code:

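from openmmtools.multistate import MultiStateReporter
from perses.samplers.multistate import HybridRepexSampler

# reporter_file is the path to the existing .nc storage file
reporter = MultiStateReporter(reporter_file, checkpoint_interval=10)
simulation = HybridRepexSampler.from_storage(reporter)
# Determine how many more iterations are needed
total_iterations = 5000
iterations = total_iterations - simulation.iteration
# Resume simulation
simulation.extend(n_iterations=iterations)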

However, I got the following error:

OSError: [Errno -101] NetCDF: HDF error: b'/data/chodera/zhangi/perses_benchmark/repex/31/7/0/0_complex.nc'
CRITICAL:mpiplus.mpiplus:MPI node 1/4 raised an exception and called Abort()! The exception traceback follows

This problem was found (and solved) in Yank as well: https://github.com/choderalab/yank/issues/1165

To fix this, I added the following line to my bash script: export HDF5_USE_FILE_LOCKING=FALSE
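
If the job is launched from a Python entry point rather than a bash script, the same workaround can be applied inside the process. The following is a minimal sketch, not code from perses or openmmtools: the helper name `disable_hdf5_file_locking` is hypothetical, and it assumes the variable is set before any HDF5-backed library (such as netCDF4) is imported, which is the conservative ordering.

```python
import os


def disable_hdf5_file_locking():
    """Set HDF5_USE_FILE_LOCKING=FALSE for this process (hypothetical helper)."""
    os.environ["HDF5_USE_FILE_LOCKING"] = "FALSE"
    return os.environ["HDF5_USE_FILE_LOCKING"]


# Call this at the very top of the script, before importing netCDF4,
# openmmtools, or perses, so the setting is in place when HDF5 is first used.
disable_hdf5_file_locking()
```

Under MPI, every rank runs the script, so each process sets the variable for itself.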

@jchodera suggests we fix this in a similar way to how Yank fixes this: https://github.com/choderalab/yank/pull/1168

However, the Yank code containing the fixes above has already been ported to openmmtools (see here), so I’m not sure what the appropriate fix for perses is (or why the lines in openmmtools aren’t sufficient).

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 12 (12 by maintainers)

Top GitHub Comments

1 reaction
mikemhenry commented, Jun 29, 2021

The issue is that in Python 3.3+, IOError was merged into OSError (https://docs.python.org/3/library/exceptions.html#OSError), i.e.:

>>> try:
...  raise OSError
... except (IOError):
...  print("oh no")
... 
oh no
>>>

So the issue is that instead of just retrying a few times, we hit the code path that assumes the file doesn’t exist and exits early. Looking at the source code (and docs) of netCDF4 (https://github.com/Unidata/netcdf4-python/blob/master/src/netCDF4/_netCDF4.pyx#L1939), it isn’t obvious which error will be thrown if the file doesn’t exist versus when there is some issue with HDF locking. So what I will do is remove the logic that handles the case of the file not existing, since the worst-case scenario is that instead of failing right away, it will fail in ~10 seconds.
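
The approach described above can be sketched as a plain retry loop that treats every OSError the same way, without a special case for a missing file. This is an illustration only, not the code in openmmtools: the names `open_with_retries` and `flaky_opener`, the attempt count, and the wait time are all assumptions chosen for the example.

```python
import time


def open_with_retries(opener, path, n_attempts=5, wait_seconds=2.0, sleep=time.sleep):
    """Call opener(path), retrying on OSError (hypothetical helper).

    A missing file and a locking failure both surface as OSError here,
    so both are retried and the last error is re-raised at the end.
    """
    last_error = None
    for attempt in range(n_attempts):
        try:
            return opener(path)
        except OSError as error:  # IOError is the same class in Python 3.3+
            last_error = error
            if attempt < n_attempts - 1:
                sleep(wait_seconds)
    raise last_error


# Stand-in for a netCDF open call that fails twice, then succeeds.
calls = []


def flaky_opener(path):
    calls.append(path)
    if len(calls) < 3:
        raise OSError(-101, "NetCDF: HDF error")
    return "dataset:" + path


result = open_with_retries(flaky_opener, "0_complex.nc", wait_seconds=0.0)
```

With these illustrative defaults, a file that is genuinely missing takes several seconds to fail instead of failing immediately, which is the trade-off the comment accepts.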

0 reactions
mikemhenry commented, Jul 8, 2021

This will take a new release, but it is fixed by choderalab/openmmtools#515.
