
PPO2 with MlpLstmPolicy crashes GPU


Describe the bug
While training with PPO2 and MlpLstmPolicy on a custom env, my computer intermittently freezes yet training continues. When I attempt to monitor the GPUs with watch -n0.5 nvidia-smi, it loads the first GPU’s data and then seems to hang for a while, until I see that my second GPU reports an error. Even after training, anything that uses a GPU glitches, and I have to reset the computer just to run another model. I’ve run the same training on the same env using plain MlpPolicy and it trains just fine (although my problem needs a recurrent network, so I always get bad results), and I can monitor everything without GPU glitches. I thought it might be memory overload, but I don’t get anywhere near using all the RAM or GPU memory.

Code example

import gym
import money_maker  # custom package that registers the 'maker-v0' environment

from stable_baselines.common.policies import MlpLstmPolicy
from stable_baselines.common.vec_env import SubprocVecEnv
from stable_baselines import PPO2

# multiprocess environment
n_cpu = 128
env = SubprocVecEnv([lambda: gym.make('maker-v0') for _ in range(n_cpu)])

model = PPO2(MlpLstmPolicy, env, verbose=1, nminibatches=32,
             tensorboard_log="./ppo2_lstm_21_jan_morn_tensorboard/")

model.learn(total_timesteps=10000000)
model.save("ppo2_maker_lstm")
del model
env.close()
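
The comments below trace the crashes to oversized TensorBoard event files. As a quick cross-check, here is a minimal sketch of the same run with TensorBoard logging switched off (tensorboard_log defaults to None in PPO2, so omitting it disables event-file writing); it assumes the same custom maker-v0 env from the report:

# Sketch: identical configuration, but without TensorBoard logging, to test
# whether event-file writing is what triggers the GPU hang.
import gym
import money_maker  # custom package that registers 'maker-v0'

from stable_baselines.common.policies import MlpLstmPolicy
from stable_baselines.common.vec_env import SubprocVecEnv
from stable_baselines import PPO2

n_cpu = 128
env = SubprocVecEnv([lambda: gym.make('maker-v0') for _ in range(n_cpu)])

# No tensorboard_log argument: PPO2 defaults it to None and writes no event files.
model = PPO2(MlpLstmPolicy, env, verbose=1, nminibatches=32)
model.learn(total_timesteps=10000000)
env.close()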

System Info
Describe the characteristics of your environment:

  • installed using pip according to the docs
  • 2× GTX 1080 Ti GPUs, driver 410.93
  • Python 3.6.5 in a conda environment
  • TensorFlow version: 1.12.0

Additional context
(image attached in the original issue)

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Comments: 11 (1 by maintainers)

Top GitHub Comments

1 reaction
SerialIterator commented, Feb 13, 2019

Awesome!!! No more crashing and logs are about 100x smaller. Good job guys!

1 reaction
hill-a commented, Jan 29, 2019

Hey,

The logging parameters are usually found in def setup_model(self): in the model file (e.g. stable-baselines/ppo2/ppo2.py), and look like tf.summary.[type]([name], [value]).

If you comment out everything except the scalars, that should reduce the logging size by an order of magnitude.
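
For a sense of scale, here is a small standalone sketch (illustrative placeholder names, not code from ppo2.py, assuming the reported TensorFlow 1.x API) comparing the serialized size of a scalar summary with a histogram summary over the same data; histograms and image summaries are what inflate the event files, which is why keeping only the scalars helps:

# Illustrative comparison of per-step summary payload sizes (TF 1.x style).
import numpy as np
import tensorflow as tf

values = tf.placeholder(tf.float32, shape=[None], name='values')

# A scalar summary serializes a single float per logging step...
scalar_summary = tf.summary.scalar('values_mean', tf.reduce_mean(values))
# ...while a histogram summary serializes bucket boundaries and counts for the
# whole tensor, which adds up quickly at a high logging frequency.
histogram_summary = tf.summary.histogram('values_hist', values)

with tf.Session() as sess:
    data = np.random.randn(100000).astype(np.float32)
    scalar_bytes = sess.run(scalar_summary, feed_dict={values: data})
    histogram_bytes = sess.run(histogram_summary, feed_dict={values: data})
    print('scalar summary:   ', len(scalar_bytes), 'bytes')
    print('histogram summary:', len(histogram_bytes), 'bytes')

On data of this size the histogram payload comes out far larger than the scalar one, consistent with the log-size reduction reported above once the non-scalar summaries are commented out.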
