[RLlib] Memory leaks during RLlib training.
See original GitHub issue
What is the problem?
Ray version and other system information (Python version, TensorFlow version, OS): OS: Docker on CentOS; Ray: 0.8.4; Python: 3.6
Reproduction (REQUIRED)
Please provide a script that can be run to reproduce the issue. The script should have no external library dependencies (i.e., use fake or mock data / environments):
If we cannot run your script, we cannot fix your issue.
- I have verified my script runs in a clean environment and reproduces the issue.
- I have verified the issue also occurs with the latest wheels.
Recently, we found that our RL model trained with RLlib depletes memory and throws an OOM error. I then ran an RLlib DQN job as below, and its memory usage grows over time:
rllib train --run=DQN --env=Breakout-v0 --config='{"output": "dqn_breakout_1M/", "output_max_file_size": 50000000,"num_workers":3}' --stop='{"timesteps_total": 1000000}'
Memory grows as time goes on:
Hope someone can give some help.
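One environment-free way to confirm the growth reported above is to log the process's resident set size between training iterations. A minimal sketch using only the standard library (`peak_rss_mb` is a hypothetical helper, not part of Ray or the original report):

```python
import resource
import sys

def peak_rss_mb():
    """Peak resident set size of the current process, in MiB."""
    raw = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in KiB on Linux but in bytes on macOS.
    kib = raw / 1024 if sys.platform == "darwin" else raw
    return kib / 1024

# Log this between training iterations; a value that keeps climbing
# across hundreds of iterations (rather than plateauing after the
# replay buffer fills) points at a leak rather than normal warm-up.
print(f"peak RSS: {peak_rss_mb():.1f} MiB")
```

Because `ru_maxrss` is a high-water mark, it only ever increases; for a leak check that is enough, since a healthy training run should plateau once its buffers are full.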
Issue Analytics
- State:
- Created 3 years ago
- Reactions: 4
- Comments: 20 (7 by maintainers)
Top GitHub Comments
[closing as stale]
I experience the same problem with APEX-DQN running in local mode with multiple workers. Memory usage rises linearly, and the experiments eventually fail with RayOutOfMemoryError.
I have tried setting buffer_size to a smaller value, though even after some investigation in the docs I could not figure out what the number is supposed to mean (is it a count of samples, or bytes?), and it did not stop the memory error.
The traceback shows RolloutWorker occupying 56 of 64 GB. Feels like a memory leak to me.
Running on 0.8.5
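Regarding the buffer_size question above: in DQN-family configs of this era it counts stored timesteps, not bytes (worth verifying against the docs for your exact RLlib version), so the buffer's footprint scales with observation size. A rough back-of-the-envelope sketch, assuming raw 210x160x3 uint8 Breakout frames and a hypothetical per-step overhead:

```python
# Rough replay-buffer memory estimate for image observations.
# Assumes raw 210x160x3 uint8 Atari frames; RLlib's preprocessing
# (grayscale, resize, frame stacking) changes the per-step cost.
obs_bytes = 210 * 160 * 3        # one uint8 frame
per_step = 2 * obs_bytes + 16    # obs + next_obs + action/reward/done
buffer_size = 50_000             # timesteps, not bytes

total_gib = buffer_size * per_step / (1024 ** 3)
print(f"~{total_gib:.1f} GiB")   # prints "~9.4 GiB"
```

Even a "smaller" buffer of 50k raw frames can plausibly occupy several GiB on its own, which is easy to mistake for a leak; the steady growth reported above only becomes suspicious once the buffer is full and usage keeps rising.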