Getting `CUDA_ERROR_OUT_OF_MEMORY` error when running replica exchange (with lots of replicas) on large system
I'm trying to run replica exchange (REST) on a system with 185K atoms, and it runs fine when I use a REST region of ~70 atoms with 12 replicas. However, I get the following error when I run with a REST region of ~1000 atoms with 24 or 36 replicas.
I’m currently using this as the context cache:
context_cache = cache.ContextCache(capacity=None, time_to_live=None)
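For completeness, here is a minimal, self-contained version of that setup. This is a sketch only: the actual rest script may wire the cache in differently (e.g. by passing it to the sampler or the MCMC moves rather than replacing the module-level default).

```python
from openmmtools import cache

# Unbounded cache: with capacity=None and time_to_live=None nothing is ever
# evicted, so every Context that gets created stays resident on the GPU for
# the life of the process.
context_cache = cache.ContextCache(capacity=None, time_to_live=None)

# One possible way to have the MCMC moves pick it up is to replace the
# module-level default cache; the real script may do this differently.
cache.global_context_cache = context_cache
```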
DEBUG:openmmtools.utils:Mixing of replicas took 3.747s
DEBUG:openmmtools.multistate.replicaexchange:Accepted 12284/93312 attempted swaps (13.2%)
DEBUG:openmmtools.multistate.multistatesampler:Propagating all replicas...
DEBUG:mpiplus.mpiplus:Running _propagate_replica serially.
Traceback (most recent call last):
File "/home/zhangi/miniconda3/envs/perses-rbd-ace2/lib/python3.8/site-packages/openmmtools/cache.py", line 445, in get_context
context = self._lru[context_id]
File "/home/zhangi/miniconda3/envs/perses-rbd-ace2/lib/python3.8/site-packages/openmmtools/cache.py", line 147, in __getitem__
entry = self._data.pop(key)
KeyError: (366470656340979585, 4946973717435834380)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "generate_rest2_cache_interface.py", line 153, in <module>
simulation.run()
File "/home/zhangi/miniconda3/envs/perses-rbd-ace2/lib/python3.8/site-packages/openmmtools/multistate/multistatesampler.py", line 681, in run
self._propagate_replicas()
File "/home/zhangi/miniconda3/envs/perses-rbd-ace2/lib/python3.8/site-packages/openmmtools/utils.py", line 90, in _wrapper
return func(*args, **kwargs)
File "/home/zhangi/miniconda3/envs/perses-rbd-ace2/lib/python3.8/site-packages/openmmtools/multistate/multistatesampler.py", line 1196, in _propagate_replicas
propagated_states, replica_ids = mpiplus.distribute(self._propagate_replica, range(self.n_replicas),
File "/home/zhangi/miniconda3/envs/perses-rbd-ace2/lib/python3.8/site-packages/mpiplus/mpiplus.py", line 512, in distribute
all_results = [task(job_args, *other_args, **kwargs) for job_args in distributed_args]
File "/home/zhangi/miniconda3/envs/perses-rbd-ace2/lib/python3.8/site-packages/mpiplus/mpiplus.py", line 512, in <listcomp>
all_results = [task(job_args, *other_args, **kwargs) for job_args in distributed_args]
File "/home/zhangi/miniconda3/envs/perses-rbd-ace2/lib/python3.8/site-packages/openmmtools/multistate/multistatesampler.py", line 1223, in _propagate_replica
mcmc_move.apply(thermodynamic_state, sampler_state)
File "/home/zhangi/miniconda3/envs/perses-rbd-ace2/lib/python3.8/site-packages/openmmtools/mcmc.py", line 1114, in apply
super(LangevinDynamicsMove, self).apply(thermodynamic_state, sampler_state)
File "/home/zhangi/miniconda3/envs/perses-rbd-ace2/lib/python3.8/site-packages/openmmtools/mcmc.py", line 655, in apply
context, integrator = context_cache.get_context(thermodynamic_state, integrator)
File "/home/zhangi/miniconda3/envs/perses-rbd-ace2/lib/python3.8/site-packages/openmmtools/cache.py", line 447, in get_context
context = thermodynamic_state.create_context(integrator, self._platform, self._platform_properties)
File "/home/zhangi/miniconda3/envs/perses-rbd-ace2/lib/python3.8/site-packages/openmmtools/states.py", line 1172, in create_context
return openmm.Context(system, integrator)
File "/home/zhangi/miniconda3/envs/perses-rbd-ace2/lib/python3.8/site-packages/simtk/openmm/openmm.py", line 4948, in __init__
_openmm.Context_swiginit(self, _openmm.new_Context(*args))
simtk.openmm.OpenMMException: Error creating array savedForces: CUDA_ERROR_OUT_OF_MEMORY (2)
DEBUG:mpiplus.mpiplus:Single node: executing <bound method MultiStateReporter.close of <openmmtools.multistate.multistatereporter.MultiStateReporter object at 0x2b547fb44ac0>>
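For debugging, one way to confirm whether GPU memory really grows between iterations is to poll the driver from within the script. Below is a minimal sketch, assuming the pynvml (nvidia-ml-py) package is installed; this is not something the rest script currently does.

```python
# Hypothetical helper for watching GPU memory while the sampler runs.
import pynvml


def log_gpu_memory(device_index=0):
    """Print used/total memory for one GPU, in MiB."""
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
    info = pynvml.nvmlDeviceGetMemoryInfo(handle)
    print(f"GPU {device_index}: {info.used / 1024**2:.0f} MiB used "
          f"of {info.total / 1024**2:.0f} MiB")
    pynvml.nvmlShutdown()
```

Calling this once per iteration (e.g. right before propagation) should show whether the usage climbs monotonically until the `CUDA_ERROR_OUT_OF_MEMORY` is hit.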
To reproduce the issue, I can point to where the files are located on lilac (the files are large, so it doesn't make sense to drop them here). In /data/chodera/zhangi/perses_benchmark/neq/14/147/for_debugging/ there are:
- perses-rbd-ace2-direct.yml – contains a yaml of the env I'm using
- generate_rest_cache_interface.py – rest script
- run_rest2_complex_for_debugging.sh – bash script
- 147_complex_1.pickle – pickled RepartitionedHybridTopologyFactory needed for the rest script
- 147_complex_1_rest.pickle – pickled RESTTopologyFactory needed for the rest script
- Note that to use the above two pickled factories with the rest script, you'll need to create a dir called 147/ and place the factories and scripts in that directory.
- You'll also need to make sure you set outdir in the rest script to point to your 147/ directory.
- There are also xmls for the system (the REST system), state (at the thermodynamic state with temp 298 K), and integrator (attached to the context cache, though I don't think this is actually being used anywhere); see the loading sketch after this list.
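For reference, a hypothetical snippet for loading these files once they are in the 147/ directory. Only the pickle name is taken from the list above; the XML filenames below are placeholders, since the exact names aren't given here.

```python
import pickle
from simtk import openmm  # this env predates the standalone `openmm` namespace

# Pickled RESTTopologyFactory (filename taken from the list above).
with open("147/147_complex_1_rest.pickle", "rb") as f:
    rest_factory = pickle.load(f)

# Serialized system/state/integrator; these filenames are placeholders.
with open("147/system.xml") as f:
    system = openmm.XmlSerializer.deserialize(f.read())
with open("147/state.xml") as f:
    state = openmm.XmlSerializer.deserialize(f.read())
with open("147/integrator.xml") as f:
    integrator = openmm.XmlSerializer.deserialize(f.read())
```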
Some observations I made:
- The error happens more often when I use job arrays than when I don’t use job arrays. However, I do see the error in both scenarios.
- From glancing at which GPUs have been pulled for my jobs, it doesn't seem like one type is responsible for the error. I've seen the error happen on lt, ly, lx, and lu, and I've also seen jobs run without the error on those same nodes.
Top GitHub Comments
@ijpulidos can you try using the dummy cache and see if memory use still grows without bounds? Also try setting the cache to keep only like 12 contexts (assuming 12 replicas):
context_cache = cache.ContextCache(capacity=12, time_to_live=None)
And see if that stops the memory from growing.

It definitely looks like the bug is with ContextCache then.
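A sketch of the two suggestions above, assuming the rest script can be pointed at whichever cache object it constructs (the exact wiring in the script may differ):

```python
from openmmtools import cache

# Option 1: bound the cache so at most one Context per replica is kept
# (12 replicas here); least-recently-used Contexts are evicted beyond that.
bounded_cache = cache.ContextCache(capacity=12, time_to_live=None)

# Option 2: the dummy cache creates a fresh Context on every request and lets
# it be garbage-collected afterwards, so memory cannot accumulate in the cache
# at the cost of re-creating Contexts for each move.
dummy_cache = cache.DummyContextCache()
```

If memory stops growing with either of these, that points back at the unbounded ContextCache as the source of the leak.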