RayOutOfMemoryError when running ULTRA experiments
Issue
After running an ULTRA training experiment (this experiment was run with the baseline DQN policy) for about half a day, the program stops because of a RayOutOfMemoryError.
Error
2021-02-09 09:09:34,529 ERROR worker.py:987 -- Possible unhandled error from worker: ray::ultra.evaluate.evaluate() (pid=4336, ip=10.208.237.111)
File "python/ray/_raylet.pyx", line 408, in ray._raylet.execute_task
File "/SMARTS/.venv/lib/python3.7/site-packages/ray/memory_monitor.py", line 128, in raise_if_low_memory
self.error_threshold))
ray.memory_monitor.RayOutOfMemoryError: More than 95% of the memory on node gpu-machine is used (30.17 / 31.29 GB). The top 10 memory consumers are:
PID MEM COMMAND
4331 18.49GiB ray::__main__.train()
17726 4.02GiB /SMARTS/.venv/bin/python3.7 /SMARTS/.venv/bin/scl envision start -s ./ultra/scenarios -p 8081
4336 3.29GiB ray::IDLE
4644 0.22GiB ray::__main__.train()
4213 0.21GiB /SMARTS/.venv/bin/python3.7 /SMARTS/.venv/bin/tensorboard --logdir_spec=BDQN:logs/experiment-2021.2.
4274 0.09GiB python -u ultra/train.py --task 1 --level easy
4325 0.09GiB ray::IDLE
4335 0.09GiB ray::IDLE
4334 0.09GiB ray::IDLE
4330 0.09GiB ray::IDLE
In addition, up to 0.04 GiB of shared memory is currently being used by the Ray object store. You can set the object store size with the `object_store_memory` parameter when starting Ray.
---
--- Tip: Use the `ray memory` command to list active objects in the cluster.
---
Will send @Gamenot the full log of the program execution internally as I am unable to upload the log to this public post.
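For anyone comparing several of these reports, the memory table in the error message can be pulled out of a log file programmatically. The following is a minimal sketch, assuming the rows keep the `PID MEM COMMAND` layout shown above with sizes in GiB; `Consumer` and `parse_consumers` are illustrative names, not part of Ray or SMARTS.

```python
import re
from typing import List, NamedTuple


class Consumer(NamedTuple):
    pid: int
    mem_gib: float
    command: str


# Matches rows such as "4331 18.49GiB ray::__main__.train()".
ROW = re.compile(r"^\s*(\d+)\s+([\d.]+)GiB\s+(.+?)\s*$")


def parse_consumers(log_text: str) -> List[Consumer]:
    """Return the memory consumers listed in the error message, largest first."""
    rows = []
    for line in log_text.splitlines():
        match = ROW.match(line)
        if match:
            rows.append(
                Consumer(int(match.group(1)), float(match.group(2)), match.group(3))
            )
    return sorted(rows, key=lambda row: row.mem_gib, reverse=True)


# Three rows copied from the error message above.
SAMPLE = """PID MEM COMMAND
4331 18.49GiB ray::__main__.train()
17726 4.02GiB /SMARTS/.venv/bin/python3.7 /SMARTS/.venv/bin/scl envision start -s ./ultra/scenarios -p 8081
4336 3.29GiB ray::IDLE
"""

if __name__ == "__main__":
    for consumer in parse_consumers(SAMPLE):
        print(f"{consumer.pid:>6} {consumer.mem_gib:6.2f} GiB  {consumer.command}")
```

In the report above, the `ray::__main__.train()` worker (PID 4331) accounts for most of the memory, followed by the Envision server process.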
Configuration
The experiment was run in a Docker container with Ubuntu 18.04. The command $ nvidia-smi
outputs:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01 Driver Version: 440.33.01 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce RTX 2080 Off | 00000000:01:00.0 Off | N/A |
| 25% 36C P2 34W / 215W | 896MiB / 7979MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
The command $ sumo
outputs:
Eclipse SUMO sumo Version 1.8.0
Build features: Linux-4.15.0-124-generic x86_64 GNU 7.5.0 Release Proj GUI SWIG GDAL GL2PS
Copyright (C) 2001-2020 German Aerospace Center (DLR) and others; https://sumo.dlr.de
License EPL-2.0: Eclipse Public License Version 2 <https://eclipse.org/legal/epl-v20.html>
Use --help to get the list of options.
In the virtual environment, $ pip list
yields the following:
Package Version Location
---------------------- --------- --------
absl-py 0.11.0
aiohttp 3.7.3
apipkg 1.5
argon2-cffi 20.1.0
astunparse 1.6.3
async-timeout 3.0.1
atari-py 0.2.6
attrs 20.3.0
Automat 20.2.0
backcall 0.2.0
beautifulsoup4 4.9.3
bleach 3.2.1
cachetools 4.2.0
certifi 2020.12.5
cffi 1.14.4
chardet 3.0.4
click 7.1.2
cloudpickle 1.3.0
colorama 0.4.4
commonmark 0.9.1
constantly 15.1.0
coverage 5.3.1
cycler 0.10.0
decorator 4.4.2
defusedxml 0.6.0
dill 0.3.3
dm-tree 0.1.5
entrypoints 0.3
evdev 1.4.0
execnet 1.7.1
filelock 3.0.12
future 0.18.2
gast 0.3.3
gitdb 4.0.5
GitPython 3.1.12
google 3.0.0
google-auth 1.24.0
google-auth-oauthlib 0.4.2
google-pasta 0.2.0
grpcio 1.30.0
gym 0.18.0
h5py 2.10.0
hyperlink 21.0.0
idna 2.10
imageio 2.9.0
importlib-metadata 3.4.0
importlib-resources 5.0.0
incremental 17.5.0
iniconfig 1.1.1
ipykernel 5.4.3
ipython 7.19.0
ipython-genutils 0.2.0
jedi 0.18.0
Jinja2 2.11.2
joblib 1.0.0
jsonpatch 1.28
jsonpointer 2.0
jsonschema 3.2.0
jupyter-client 6.1.11
jupyter-core 4.7.0
Keras-Preprocessing 1.1.2
kiwisolver 1.3.1
lz4 3.1.2
Markdown 3.3.3
MarkupSafe 1.1.1
matplotlib 3.3.3
mistune 0.8.4
msgpack 1.0.2
multidict 5.1.0
nbconvert 5.6.1
nbdime 2.1.0
nbformat 5.1.2
networkx 2.5
notebook 6.2.0
numpy 1.18.5
oauthlib 3.1.0
opencv-python 4.5.1.48
opencv-python-headless 4.5.1.48
opt-einsum 3.3.0
packaging 20.8
panda3d 1.10.8
panda3d-gltf 0.12
panda3d-simplepbr 0.7
pandas 1.2.0
pandocfilters 1.4.3
parso 0.8.1
pexpect 4.8.0
pickleshare 0.7.5
Pillow 7.2.0
pip 20.3.3
pluggy 0.13.1
prometheus-client 0.9.0
prompt-toolkit 3.0.10
protobuf 3.14.0
psutil 5.8.0
ptyprocess 0.7.0
py 1.10.0
py-cpuinfo 7.0.0
py-spy 0.3.4
pyasn1 0.4.8
pyasn1-modules 0.2.8
pybullet 3.0.8
pycparser 2.20
pyglet 1.5.0
Pygments 2.7.4
PyHamcrest 2.0.2
pynput 1.7.2
pyparsing 2.4.7
pyrsistent 0.17.3
pytest 6.2.1
pytest-benchmark 3.2.3
pytest-cov 2.11.0
pytest-forked 1.3.0
pytest-notebook 0.6.1
pytest-xdist 2.2.0
python-dateutil 2.8.1
python-xlib 0.29
pytz 2020.5
PyWavelets 1.1.1
PyYAML 5.3.1
pyzmq 21.0.1
ray 0.8.6
redis 3.4.1
requests 2.25.1
requests-oauthlib 1.3.0
rich 9.8.2
rsa 4.7
Rtree 0.9.7
scikit-image 0.18.1
scikit-learn 0.24.0
scipy 1.4.1
Send2Trash 1.5.0
setuptools 47.1.0
sh 1.14.1
Shapely 1.7.1
six 1.15.0
sklearn 0.0
smarts 0.4.11 /SMARTS
smmap 3.0.4
soupsieve 2.1
supervisor 4.2.1
tableprint 0.9.1
tabulate 0.8.7
tensorboard 2.2.2
tensorboard-plugin-wit 1.7.0
tensorboardX 2.1
tensorflow 2.2.1
tensorflow-estimator 2.2.0
termcolor 1.1.0
terminado 0.9.2
testpath 0.4.4
threadpoolctl 2.1.0
tifffile 2021.1.14
toml 0.10.2
torch 1.4.0
torchfile 0.1.0
torchvision 0.5.0
tornado 6.1
traitlets 5.0.5
trimesh 3.9.1
Twisted 20.3.0
typing-extensions 3.7.4.3
urllib3 1.26.2
visdom 0.1.8.9
wcwidth 0.2.5
webencodings 0.5.1
websocket-client 0.57.0
Werkzeug 1.0.1
wheel 0.36.2
wrapt 1.12.1
yarl 1.6.3
yattag 1.14.0
zipp 3.4.0
zope.interface 5.2.0
Steps to Reproduce
- Once the repository is downloaded, SUMO is installed, and the virtual environment is created and activated with the listed packages:
$ cd SMARTS/
$ git checkout master # The current master branch available (latest commit: ebd72d6)
$ scl scenario build-all ultra/scenarios/pool
$ python ultra/scenarios/interface.py generate --task 1 --level easy
$ ./ultra/env/envision_base.sh
- Then go into ultra/train.py and modify the line:
policy_class = "ultra.baselines.sac:sac-v0"
to
policy_class = "ultra.baselines.dqn:dqn-v0"
- And then finally run the training, redirecting the output to a text file:
$ ray stop
$ nohup python -u ultra/train.py --task 1 --level easy > log.txt &
View the file with
$ tail -f log.txt
Notes
- I do not encounter this error when running the training in headless mode, i.e. when running
nohup python -u ultra/train.py --task 1 --level easy --headless True > log.txt &
- I was running the training in the ULTRA Docker container created from the ULTRA Dockerfile; however, I expect the error would occur independently of Docker.
Let me know what I missed adding or if any other information would be helpful.
Issue Analytics
- State:
- Created 3 years ago
- Comments:21 (16 by maintainers)
Top GitHub Comments
Glad it could help! Sure, I can bring that up with the ULTRA team.
Update: Headless mode is now the default (#703).
Hi @Yuanzhuo-Liu : We did find some memory leaks in SMARTS as a result of this. See Issues #794 and #805, which led to PR #852. We also opened issue #870 for more work on this.
I believe PR #852 has only been merged into the develop branch so far, but it should be in our next release (0.4.17).
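The `pip list` output in the report shows smarts 0.4.11, which predates the release mentioned here. A quick way to check whether an installed version is at least that release is to compare the dotted version numbers, as sketched below. This assumes plain numeric versions such as `0.4.11`, and that 0.4.17 is the release that ends up carrying the fix, which the comment above expects but does not confirm. `parse_version` and `includes_fix` are illustrative helpers, not part of SMARTS.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn a plain dotted version such as "0.4.11" into (0, 4, 11)."""
    return tuple(int(part) for part in version.split("."))


def includes_fix(installed: str, fixed_in: str = "0.4.17") -> bool:
    """Return True if `installed` is at least the release expected to carry the fix."""
    return parse_version(installed) >= parse_version(fixed_in)


if __name__ == "__main__":
    # Version taken from the pip list output in the report.
    print(includes_fix("0.4.11"))
```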