Getting `CUDA_ERROR_OUT_OF_MEMORY` error when running replica exchange (with lots of replicas) on large system
I'm trying to run replica exchange (REST) on a system with 185K atoms, and it runs fine when I use a REST region of ~70 atoms with 12 replicas. However, I get the following error when I run with a REST region of ~1000 atoms with 24 or 36 replicas.
I’m currently using this as the context cache:
context_cache = cache.ContextCache(capacity=None, time_to_live=None)
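For completeness, here is a minimal, self-contained version of that setup. This is a sketch only: the actual rest script may wire the cache in differently (e.g. by passing it to the sampler or the MCMC moves rather than replacing the module-level default).

```python
from openmmtools import cache

# Unbounded cache: with capacity=None and time_to_live=None nothing is ever
# evicted, so every Context that gets created stays resident on the GPU for
# the life of the process.
context_cache = cache.ContextCache(capacity=None, time_to_live=None)

# One possible way to have the MCMC moves pick it up is to replace the
# module-level default cache; the real script may do this differently.
cache.global_context_cache = context_cache
```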
DEBUG:openmmtools.utils:Mixing of replicas took 3.747s
DEBUG:openmmtools.multistate.replicaexchange:Accepted 12284/93312 attempted swaps (13.2%)
DEBUG:openmmtools.multistate.multistatesampler:Propagating all replicas...
DEBUG:mpiplus.mpiplus:Running _propagate_replica serially.
Traceback (most recent call last):
File "/home/zhangi/miniconda3/envs/perses-rbd-ace2/lib/python3.8/site-packages/openmmtools/cache.py", line 445, in get_context
context = self._lru[context_id]
File "/home/zhangi/miniconda3/envs/perses-rbd-ace2/lib/python3.8/site-packages/openmmtools/cache.py", line 147, in __getitem__
entry = self._data.pop(key)
KeyError: (366470656340979585, 4946973717435834380)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "generate_rest2_cache_interface.py", line 153, in <module>
simulation.run()
File "/home/zhangi/miniconda3/envs/perses-rbd-ace2/lib/python3.8/site-packages/openmmtools/multistate/multistatesampler.py", line 681, in run
self._propagate_replicas()
File "/home/zhangi/miniconda3/envs/perses-rbd-ace2/lib/python3.8/site-packages/openmmtools/utils.py", line 90, in _wrapper
return func(*args, **kwargs)
File "/home/zhangi/miniconda3/envs/perses-rbd-ace2/lib/python3.8/site-packages/openmmtools/multistate/multistatesampler.py", line 1196, in _propagate_replicas
propagated_states, replica_ids = mpiplus.distribute(self._propagate_replica, range(self.n_replicas),
File "/home/zhangi/miniconda3/envs/perses-rbd-ace2/lib/python3.8/site-packages/mpiplus/mpiplus.py", line 512, in distribute
all_results = [task(job_args, *other_args, **kwargs) for job_args in distributed_args]
File "/home/zhangi/miniconda3/envs/perses-rbd-ace2/lib/python3.8/site-packages/mpiplus/mpiplus.py", line 512, in <listcomp>
all_results = [task(job_args, *other_args, **kwargs) for job_args in distributed_args]
File "/home/zhangi/miniconda3/envs/perses-rbd-ace2/lib/python3.8/site-packages/openmmtools/multistate/multistatesampler.py", line 1223, in _propagate_replica
mcmc_move.apply(thermodynamic_state, sampler_state)
File "/home/zhangi/miniconda3/envs/perses-rbd-ace2/lib/python3.8/site-packages/openmmtools/mcmc.py", line 1114, in apply
super(LangevinDynamicsMove, self).apply(thermodynamic_state, sampler_state)
File "/home/zhangi/miniconda3/envs/perses-rbd-ace2/lib/python3.8/site-packages/openmmtools/mcmc.py", line 655, in apply
context, integrator = context_cache.get_context(thermodynamic_state, integrator)
File "/home/zhangi/miniconda3/envs/perses-rbd-ace2/lib/python3.8/site-packages/openmmtools/cache.py", line 447, in get_context
context = thermodynamic_state.create_context(integrator, self._platform, self._platform_properties)
File "/home/zhangi/miniconda3/envs/perses-rbd-ace2/lib/python3.8/site-packages/openmmtools/states.py", line 1172, in create_context
return openmm.Context(system, integrator)
File "/home/zhangi/miniconda3/envs/perses-rbd-ace2/lib/python3.8/site-packages/simtk/openmm/openmm.py", line 4948, in __init__
_openmm.Context_swiginit(self, _openmm.new_Context(*args))
simtk.openmm.OpenMMException: Error creating array savedForces: CUDA_ERROR_OUT_OF_MEMORY (2)
DEBUG:mpiplus.mpiplus:Single node: executing <bound method MultiStateReporter.close of <openmmtools.multistate.multistatereporter.MultiStateReporter object at 0x2b547fb44ac0>>
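For debugging, one way to confirm whether GPU memory really grows between iterations is to poll the driver from within the script. Below is a minimal sketch, assuming the pynvml (nvidia-ml-py) package is installed; this is not something the rest script currently does.

```python
# Hypothetical helper for watching GPU memory while the sampler runs.
import pynvml


def log_gpu_memory(device_index=0):
    """Print used/total memory for one GPU, in MiB."""
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
    info = pynvml.nvmlDeviceGetMemoryInfo(handle)
    print(f"GPU {device_index}: {info.used / 1024**2:.0f} MiB used "
          f"of {info.total / 1024**2:.0f} MiB")
    pynvml.nvmlShutdown()
```

Calling this once per iteration (e.g. right before propagation) should show whether the usage climbs monotonically until the `CUDA_ERROR_OUT_OF_MEMORY` is hit.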
To reproduce the issue, I can point to where the files are located on lilac (the files are large, so it doesn't make sense to drop them here). In /data/chodera/zhangi/perses_benchmark/neq/14/147/for_debugging/ there are:
- perses-rbd-ace2-direct.yml – contains a yaml of the env I'm using
- generate_rest_cache_interface.py – rest script
- run_rest2_complex_for_debugging.sh – bash script
- 147_complex_1.pickle – pickled RepartitionedHybridTopologyFactory needed for the rest script
- 147_complex_1_rest.pickle – pickled RESTTopologyFactory needed for the rest script
- Note that to use the above two pickled factories with the rest script, you'll need to create a dir called 147/ and place the factories and scripts in that directory.
- You'll also need to make sure you set outdir in the rest script to point to your 147/ directory.
- There are also xmls for the system (the REST system), state (at the thermodynamic state with temp 298 K), and integrator (attached to the context cache, though I don't think this is actually being used anywhere); see the loading sketch after this list.
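For reference, a hypothetical snippet for loading these files once they are in the 147/ directory. Only the pickle name is taken from the list above; the XML filenames below are placeholders, since the exact names aren't given here.

```python
import pickle
from simtk import openmm  # this env predates the standalone `openmm` namespace

# Pickled RESTTopologyFactory (filename taken from the list above).
with open("147/147_complex_1_rest.pickle", "rb") as f:
    rest_factory = pickle.load(f)

# Serialized system/state/integrator; these filenames are placeholders.
with open("147/system.xml") as f:
    system = openmm.XmlSerializer.deserialize(f.read())
with open("147/state.xml") as f:
    state = openmm.XmlSerializer.deserialize(f.read())
with open("147/integrator.xml") as f:
    integrator = openmm.XmlSerializer.deserialize(f.read())
```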
Some observations I made:
- The error happens more often when I use job arrays than when I don’t use job arrays. However, I do see the error in both scenarios.
- From glancing at which GPUs have been pulled for my jobs, it doesn't seem like one type is responsible for the error. I've seen the error happen on lt, ly, lx, and lu, and I've also seen jobs run without the error on those same nodes.
Top GitHub Comments
@ijpulidos can you try using the dummy cache and see if memory use still grows without bounds? Also try setting the cache to keep only like 12 contexts (assuming 12 replicas):
context_cache = cache.ContextCache(capacity=12, time_to_live=None)
And see if that stops the memory from growing.

It definitely looks like the bug is with ContextCache then.
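A sketch of the two suggestions above, assuming the rest script can be pointed at whichever cache object it constructs (the exact wiring in the script may differ):

```python
from openmmtools import cache

# Option 1: bound the cache so at most one Context per replica is kept
# (12 replicas here); least-recently-used Contexts are evicted beyond that.
bounded_cache = cache.ContextCache(capacity=12, time_to_live=None)

# Option 2: the dummy cache creates a fresh Context on every request and lets
# it be garbage-collected afterwards, so memory cannot accumulate in the cache
# at the cost of re-creating Contexts for each move.
dummy_cache = cache.DummyContextCache()
```

If memory stops growing with either of these, that points back at the unbounded ContextCache as the source of the leak.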