
[RLlib] Memory leaks during RLlib training.

See original GitHub issue

What is the problem?

Ray version and other system information (Python version, TensorFlow version, OS): OS: CentOS (Docker), Ray: 0.8.4, Python: 3.6

Reproduction (REQUIRED)

Please provide a script that can be run to reproduce the issue. The script should have no external library dependencies (i.e., use fake or mock data / environments):

If we cannot run your script, we cannot fix your issue.

  • I have verified my script runs in a clean environment and reproduces the issue.
  • I have verified the issue also occurs with the latest wheels.

Recently, we found that our RL model trained with RLlib depletes memory and throws an OOM error. I then ran the RLlib DQN job below, and its memory usage grows over time.

rllib train --run=DQN --env=Breakout-v0 --config='{"output": "dqn_breakout_1M/", "output_max_file_size": 50000000,"num_workers":3}' --stop='{"timesteps_total": 1000000}' 
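For comparison, here is a minimal self-contained sketch of roughly the same run via RLlib's Python/Tune API (Ray 0.8.x-era usage), with CartPole-v0 swapped in for Breakout-v0 so there is no Atari dependency and with the output/offline-data options omitted; this is illustrative only, not the reporter's script:

# Illustrative reproduction sketch, assuming ray[rllib] ~0.8.x
import ray
from ray import tune

if __name__ == "__main__":
    ray.init()
    tune.run(
        "DQN",
        config={
            "env": "CartPole-v0",  # stand-in env, avoids the Atari/ALE dependency
            "num_workers": 3,
        },
        stop={"timesteps_total": 1000000},
    )

Watching the driver and worker processes (e.g. in top/htop) while this runs should show whether resident memory climbs in the same way.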

Memory grows as time goes on:

[Screenshot: memory usage climbing steadily over the course of training]

I hope someone can help.
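One generic way to narrow such a leak down (a sketch added here for illustration, not part of the original report) is to log the resident set size of the driver and each worker process between iterations with psutil and see which process actually grows:

import os
import psutil

def log_rss(tag=""):
    # Print the resident memory of the calling process in MB.
    rss_mb = psutil.Process(os.getpid()).memory_info().rss / (1024 * 1024)
    print("%s pid=%d rss=%.1f MB" % (tag, os.getpid(), rss_mb))

# Call log_rss("driver") around each training iteration, or log_rss("env")
# from inside the environment, to see whether the driver or the rollout
# workers are the ones accumulating memory.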

Issue Analytics

  • State: open
  • Created: 3 years ago
  • Reactions: 4
  • Comments: 20 (7 by maintainers)

Top GitHub Comments

4 reactions
ericl commented, Aug 2, 2021

[closing as stale]

2 reactions
wullli commented, Aug 24, 2020

I experience the same problem with APEX-DQN running in local mode with multiple workers. Memory usage linearly rises, and the experiments fail with RayOutOfMemoryError at some point.

I have tried setting buffer_size to a smaller value, though even after some investigation in the docs I could not figure out what the number is actually supposed to mean (is it # of samples or bytes?), and it did not stop the memory error.

The traceback shows RolloutWorker occupying 56 of 64 GB. Feels like a memory leak to me.

Running on 0.8.5
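For reference, a sketch of where buffer_size sits in an Ape-X/DQN config of that era; as far as I can tell from RLlib's DQN defaults it counts stored timesteps (transitions), not bytes, but treat that as an assumption to check against the docs for your version:

from ray import tune

# Hypothetical Ape-X run, RLlib ~0.8.x-style config keys
config = {
    "env": "CartPole-v0",   # placeholder env
    "num_workers": 4,
    "buffer_size": 100000,  # replay capacity in timesteps, not bytes (assumption)
}
tune.run("APEX", config=config, stop={"timesteps_total": 1000000})

A smaller buffer_size bounds the replay memory itself, so if usage still climbs well past that bound, the growth is likely coming from somewhere else (e.g. the rollout workers, as the traceback above suggests).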

Read more comments on GitHub.

Top Results From Across the Web

Memory Leak when training PPO on a single agent environment
I noticed that when using a single worker that I wasn't running out of space on a short test of a couple hundred...

RAM Usage Keeps Going Up While Training an RL Network ...
While training a single-agent network, I have been experiencing some issues with exceeding RAM utilization. See TensorBoard screenshot below of ...

Understanding memory leaks - Help Yourself - LinkedIn
Follow a similar course of action to discover and eliminate memory leaks you find in your computer. Closing the program is all you...

ray-dev - Google Groups
Ray RLlib with custom simulator. Seems like the image wasn't attached properly. ... time per epoch increases and contains ocillations during k-fold training....

Learning Ray
With Early Release ebooks, you get books in their earliest ... Machine Learning and the Data Science Workflow ... Reinforcement Learning with Ray...
