RayOutOfMemoryError when running ULTRA experiments

Issue

After running an ULTRA training experiment (this experiment was run with the baseline DQN policy) for about half a day, the program stops because of a RayOutOfMemoryError.

Error

2021-02-09 09:09:34,529	ERROR worker.py:987 -- Possible unhandled error from worker: ray::ultra.evaluate.evaluate() (pid=4336, ip=10.208.237.111)
  File "python/ray/_raylet.pyx", line 408, in ray._raylet.execute_task
  File "/SMARTS/.venv/lib/python3.7/site-packages/ray/memory_monitor.py", line 128, in raise_if_low_memory
    self.error_threshold))
ray.memory_monitor.RayOutOfMemoryError: More than 95% of the memory on node gpu-machine is used (30.17 / 31.29 GB). The top 10 memory consumers are:

PID	MEM	COMMAND
4331	18.49GiB	ray::__main__.train()
17726	4.02GiB	/SMARTS/.venv/bin/python3.7 /SMARTS/.venv/bin/scl envision start -s ./ultra/scenarios -p 8081
4336	3.29GiB	ray::IDLE
4644	0.22GiB	ray::__main__.train()
4213	0.21GiB	/SMARTS/.venv/bin/python3.7 /SMARTS/.venv/bin/tensorboard --logdir_spec=BDQN:logs/experiment-2021.2.
4274	0.09GiB	python -u ultra/train.py --task 1 --level easy
4325	0.09GiB	ray::IDLE
4335	0.09GiB	ray::IDLE
4334	0.09GiB	ray::IDLE
4330	0.09GiB	ray::IDLE

In addition, up to 0.04 GiB of shared memory is currently being used by the Ray object store. You can set the object store size with the `object_store_memory` parameter when starting Ray.
---
--- Tip: Use the `ray memory` command to list active objects in the cluster.
---
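The figures in this message can be checked mechanically. The sketch below is only an illustration, not part of Ray or SMARTS: it parses a shortened copy of the report above to recover the used-memory fraction and the largest consumer. The helper names `parse_usage` and `parse_consumers` are made up for this example, and the regular expressions assume the exact message layout shown in this issue.

```python
import re

# Shortened excerpt of the error message above (columns are tab-separated).
REPORT = """\
More than 95% of the memory on node gpu-machine is used (30.17 / 31.29 GB). The top 10 memory consumers are:

PID\tMEM\tCOMMAND
4331\t18.49GiB\tray::__main__.train()
17726\t4.02GiB\t/SMARTS/.venv/bin/python3.7 /SMARTS/.venv/bin/scl envision start -s ./ultra/scenarios -p 8081
4336\t3.29GiB\tray::IDLE
"""


def parse_usage(report):
    """Return (used, total) from the '(used / total GB)' part of the message."""
    match = re.search(r"\(([\d.]+) / ([\d.]+) GB\)", report)
    if match is None:
        raise ValueError("no '(used / total GB)' figure found")
    return float(match.group(1)), float(match.group(2))


def parse_consumers(report):
    """Return a list of (pid, mem_gib, command) rows from the consumer table."""
    rows = []
    for line in report.splitlines():
        # Only data rows start with a numeric PID; the header row is skipped.
        match = re.match(r"(\d+)\s+([\d.]+)GiB\s+(.+)", line)
        if match:
            rows.append((int(match.group(1)), float(match.group(2)), match.group(3)))
    return rows


used, total = parse_usage(REPORT)
consumers = parse_consumers(REPORT)
fraction_used = used / total
top_pid, top_gib, top_command = max(consumers, key=lambda row: row[1])

print(f"{fraction_used:.1%} of node memory in use")
print(f"largest consumer: pid {top_pid} ({top_command}) at {top_gib} GiB")
```

On this report the single `ray::__main__.train()` worker (PID 4331) holds well over half of the node's memory, which is why the training process is the first place to look.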

I will send @Gamenot the full log of the program execution internally, as I am unable to upload the log to this public post.

Configuration

The experiment was run in a Docker container with Ubuntu 18.04. The command $ nvidia-smi outputs:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 2080    Off  | 00000000:01:00.0 Off |                  N/A |
| 25%   36C    P2    34W / 215W |    896MiB /  7979MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

The command $ sumo outputs:

Eclipse SUMO sumo Version 1.8.0
 Build features: Linux-4.15.0-124-generic x86_64 GNU 7.5.0 Release Proj GUI SWIG GDAL GL2PS
 Copyright (C) 2001-2020 German Aerospace Center (DLR) and others; https://sumo.dlr.de
 License EPL-2.0: Eclipse Public License Version 2 <https://eclipse.org/legal/epl-v20.html>
 Use --help to get the list of options.

In the virtual environment, $ pip list yields the following:

Package                Version   Location
---------------------- --------- --------
absl-py                0.11.0
aiohttp                3.7.3
apipkg                 1.5
argon2-cffi            20.1.0
astunparse             1.6.3
async-timeout          3.0.1
atari-py               0.2.6
attrs                  20.3.0
Automat                20.2.0
backcall               0.2.0
beautifulsoup4         4.9.3
bleach                 3.2.1
cachetools             4.2.0
certifi                2020.12.5
cffi                   1.14.4
chardet                3.0.4
click                  7.1.2
cloudpickle            1.3.0
colorama               0.4.4
commonmark             0.9.1
constantly             15.1.0
coverage               5.3.1
cycler                 0.10.0
decorator              4.4.2
defusedxml             0.6.0
dill                   0.3.3
dm-tree                0.1.5
entrypoints            0.3
evdev                  1.4.0
execnet                1.7.1
filelock               3.0.12
future                 0.18.2
gast                   0.3.3
gitdb                  4.0.5
GitPython              3.1.12
google                 3.0.0
google-auth            1.24.0
google-auth-oauthlib   0.4.2
google-pasta           0.2.0
grpcio                 1.30.0
gym                    0.18.0
h5py                   2.10.0
hyperlink              21.0.0
idna                   2.10
imageio                2.9.0
importlib-metadata     3.4.0
importlib-resources    5.0.0
incremental            17.5.0
iniconfig              1.1.1
ipykernel              5.4.3
ipython                7.19.0
ipython-genutils       0.2.0
jedi                   0.18.0
Jinja2                 2.11.2
joblib                 1.0.0
jsonpatch              1.28
jsonpointer            2.0
jsonschema             3.2.0
jupyter-client         6.1.11
jupyter-core           4.7.0
Keras-Preprocessing    1.1.2
kiwisolver             1.3.1
lz4                    3.1.2
Markdown               3.3.3
MarkupSafe             1.1.1
matplotlib             3.3.3
mistune                0.8.4
msgpack                1.0.2
multidict              5.1.0
nbconvert              5.6.1
nbdime                 2.1.0
nbformat               5.1.2
networkx               2.5
notebook               6.2.0
numpy                  1.18.5
oauthlib               3.1.0
opencv-python          4.5.1.48
opencv-python-headless 4.5.1.48
opt-einsum             3.3.0
packaging              20.8
panda3d                1.10.8
panda3d-gltf           0.12
panda3d-simplepbr      0.7
pandas                 1.2.0
pandocfilters          1.4.3
parso                  0.8.1
pexpect                4.8.0
pickleshare            0.7.5
Pillow                 7.2.0
pip                    20.3.3
pluggy                 0.13.1
prometheus-client      0.9.0
prompt-toolkit         3.0.10
protobuf               3.14.0
psutil                 5.8.0
ptyprocess             0.7.0
py                     1.10.0
py-cpuinfo             7.0.0
py-spy                 0.3.4
pyasn1                 0.4.8
pyasn1-modules         0.2.8
pybullet               3.0.8
pycparser              2.20
pyglet                 1.5.0
Pygments               2.7.4
PyHamcrest             2.0.2
pynput                 1.7.2
pyparsing              2.4.7
pyrsistent             0.17.3
pytest                 6.2.1
pytest-benchmark       3.2.3
pytest-cov             2.11.0
pytest-forked          1.3.0
pytest-notebook        0.6.1
pytest-xdist           2.2.0
python-dateutil        2.8.1
python-xlib            0.29
pytz                   2020.5
PyWavelets             1.1.1
PyYAML                 5.3.1
pyzmq                  21.0.1
ray                    0.8.6
redis                  3.4.1
requests               2.25.1
requests-oauthlib      1.3.0
rich                   9.8.2
rsa                    4.7
Rtree                  0.9.7
scikit-image           0.18.1
scikit-learn           0.24.0
scipy                  1.4.1
Send2Trash             1.5.0
setuptools             47.1.0
sh                     1.14.1
Shapely                1.7.1
six                    1.15.0
sklearn                0.0
smarts                 0.4.11    /SMARTS
smmap                  3.0.4
soupsieve              2.1
supervisor             4.2.1
tableprint             0.9.1
tabulate               0.8.7
tensorboard            2.2.2
tensorboard-plugin-wit 1.7.0
tensorboardX           2.1
tensorflow             2.2.1
tensorflow-estimator   2.2.0
termcolor              1.1.0
terminado              0.9.2
testpath               0.4.4
threadpoolctl          2.1.0
tifffile               2021.1.14
toml                   0.10.2
torch                  1.4.0
torchfile              0.1.0
torchvision            0.5.0
tornado                6.1
traitlets              5.0.5
trimesh                3.9.1
Twisted                20.3.0
typing-extensions      3.7.4.3
urllib3                1.26.2
visdom                 0.1.8.9
wcwidth                0.2.5
webencodings           0.5.1
websocket-client       0.57.0
Werkzeug               1.0.1
wheel                  0.36.2
wrapt                  1.12.1
yarl                   1.6.3
yattag                 1.14.0
zipp                   3.4.0
zope.interface         5.2.0

Steps to Reproduce

  • Once the repository is downloaded, SUMO is installed, and the virtual environment is created and activated with the listed packages:
    $ cd SMARTS/
    $ git checkout master  # The current master branch available (latest commit: ebd72d6)
    $ scl scenario build-all ultra/scenarios/pool
    $ python ultra/scenarios/interface.py generate --task 1 --level easy
    $ ./ultra/env/envision_base.sh
    
  • Then go into ultra/train.py and modify the line:
    policy_class = "ultra.baselines.sac:sac-v0"
    
    to
    policy_class = "ultra.baselines.dqn:dqn-v0"
    
  • And then finally run the training, redirecting the output to a text file:
    $ ray stop
    $ nohup python -u ultra/train.py --task 1 --level easy > log.txt &
    
    View the file with $ tail -f log.txt

Notes

  • I do not encounter this error when running the training in headless mode, i.e. when running nohup python -u ultra/train.py --task 1 --level easy --headless True > log.txt &
  • I was running the training in the ULTRA Docker container created from the ULTRA Dockerfile; however, I expect the error to occur independently of Docker.

Let me know if I missed anything or if any other information would be helpful.
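Since the failure only appears after roughly half a day, and only outside headless mode, it can help to compare how fast memory grows in each configuration. The following is a minimal sketch under stated assumptions: the sample readings are hypothetical (they are not measurements from this issue), the helper names `growth_rate` and `hours_until_limit` are invented here, and the limit is taken as 95% of the 31.29 GB reported in the error.

```python
def growth_rate(samples):
    """Least-squares slope of memory (GiB) against time (hours)."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must span more than one point in time")
    cov = sum((t - mean_t) * (m - mean_m) for t, m in samples)
    return cov / var_t


def hours_until_limit(samples, limit_gib):
    """Estimate the hours left before memory reaches limit_gib.

    Returns None when memory is not growing, so no limit is predicted.
    """
    slope = growth_rate(samples)
    if slope <= 0:
        return None
    _, latest = samples[-1]
    return max(0.0, (limit_gib - latest) / slope)


# Hypothetical (hours elapsed, GiB used) readings for one training process.
samples = [(0.0, 2.0), (1.0, 3.5), (2.0, 5.0), (3.0, 6.5)]

rate = growth_rate(samples)
remaining = hours_until_limit(samples, limit_gib=0.95 * 31.29)

print(f"growth: {rate:.2f} GiB/hour")
print(f"estimated time to the 95% threshold: {remaining:.1f} hours")
```

Running the same measurement once with and once without `--headless True` would show whether the growth is tied to the non-headless run, as the note above suggests.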

Issue Analytics

  • State: open
  • Created 3 years ago
  • Comments: 21 (16 by maintainers)

Top GitHub Comments

christianjans commented, Mar 23, 2021 (1 reaction)

Glad it could help! Sure, I can bring that up with the ULTRA team.

Update: Headless mode is now the default (#703).

sah-huawei commented, Jun 4, 2021 (0 reactions)

Hi @Yuanzhuo-Liu : We did find some memory leaks in SMARTS as a result of this. See Issues #794 and #805, which led to PR #852. We also opened issue #870 for more work on this.

I believe PR #852 has only been merged into the develop branch so far, but it should be in our next release (0.4.17).
