[RLlib] Memory leaks during RLlib training.
See original GitHub issue
What is the problem?
Ray version and other system information (Python version, TensorFlow version, OS): OS: Docker on CentOS; Ray: 0.8.4; Python: 3.6
Reproduction (REQUIRED)
Please provide a script that can be run to reproduce the issue. The script should have no external library dependencies (i.e., use fake or mock data / environments):
If we cannot run your script, we cannot fix your issue.
- I have verified my script runs in a clean environment and reproduces the issue.
- I have verified the issue also occurs with the latest wheels.
Recently, we found that our RL model trained with RLlib depletes memory and throws an OOM error. I then ran an RLlib DQN job as below, and its memory usage grows over time:
rllib train --run=DQN --env=Breakout-v0 --config='{"output": "dqn_breakout_1M/", "output_max_file_size": 50000000,"num_workers":3}' --stop='{"timesteps_total": 1000000}'
Memory grows as time goes on:
Hope someone can give some help.
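One environment-free way to confirm the growth reported above is to log the process's resident set size between training iterations. A minimal sketch using only the standard library (`peak_rss_mb` is a hypothetical helper, not part of Ray or the original report):

```python
import resource
import sys

def peak_rss_mb():
    """Peak resident set size of the current process, in MiB."""
    raw = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in KiB on Linux but in bytes on macOS.
    kib = raw / 1024 if sys.platform == "darwin" else raw
    return kib / 1024

# Log this between training iterations; a value that keeps climbing
# across hundreds of iterations (rather than plateauing after the
# replay buffer fills) points at a leak rather than normal warm-up.
print(f"peak RSS: {peak_rss_mb():.1f} MiB")
```

Because `ru_maxrss` is a high-water mark, it only ever increases; for a leak check that is enough, since a healthy training run should plateau once its buffers are full.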
Issue Analytics
- State:
- Created 3 years ago
- Reactions: 4
- Comments: 20 (7 by maintainers)
Top GitHub Comments
[closing as stale]
I experience the same problem with APEX-DQN running in local mode with multiple workers. Memory usage rises linearly, and the experiments eventually fail with RayOutOfMemoryError.
I have tried setting buffer_size to a smaller value, though even after some investigation in the docs I could not figure out what the number is supposed to mean (is it a count of samples, or bytes?), and it did not stop the memory error.
The traceback shows RolloutWorker occupying 56 of 64 GB. Feels like a memory leak to me.
Running on 0.8.5
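Regarding the buffer_size question above: in DQN-family configs of this era it counts stored timesteps, not bytes (worth verifying against the docs for your exact RLlib version), so the buffer's footprint scales with observation size. A rough back-of-the-envelope sketch, assuming raw 210x160x3 uint8 Breakout frames and a hypothetical per-step overhead:

```python
# Rough replay-buffer memory estimate for image observations.
# Assumes raw 210x160x3 uint8 Atari frames; RLlib's preprocessing
# (grayscale, resize, frame stacking) changes the per-step cost.
obs_bytes = 210 * 160 * 3        # one uint8 frame
per_step = 2 * obs_bytes + 16    # obs + next_obs + action/reward/done
buffer_size = 50_000             # timesteps, not bytes

total_gib = buffer_size * per_step / (1024 ** 3)
print(f"~{total_gib:.1f} GiB")   # prints "~9.4 GiB"
```

Even a "smaller" buffer of 50k raw frames can plausibly occupy several GiB on its own, which is easy to mistake for a leak; the steady growth reported above only becomes suspicious once the buffer is full and usage keeps rising.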