PPO2 MlpLnLstm taking exponentially longer between updates
I am training a PPO2 agent with an MlpLnLstm policy on samples with 7 columns (features), making a total of 7 million 32-bit floats; a relatively small dataset.
My hyperparams are
- n_steps: 1024,
- gamma: 0.999,
- learning_rate: 0.0005,
- ent_coef: 0.04,
- vf_coef: 0.6,
- cliprange: 0.25,
- noptepochs: 4,
- lam: 0.85,
- nminibatches: 1
(everything else, including the network architecture, is left at its default)
and the hardware is
- CPU: AMD Ryzen Threadripper, 12 cores (24 logical CPUs)
- GPU: EVGA (NVIDIA) RTX 2070 Super
- RAM: Corsair Vengeance 32 GB (2 x 16 GB)
I am using 24 actors in parallel to utilise all 24 CPUs when training.
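For reference, here is a minimal sketch of how this setup is wired together on my side; MyEnv, data and make_env are placeholder names standing in for my custom environment and dataset, not the exact code:

```python
# Minimal sketch of the training setup; MyEnv and data are placeholders for
# the custom gym environment and dataframe described in this issue.
from stable_baselines import PPO2
from stable_baselines.common.vec_env import SubprocVecEnv

def make_env():
    # hypothetical factory for the custom environment
    return MyEnv(data=data)

env = SubprocVecEnv([make_env for _ in range(24)])  # 24 parallel actors

model = PPO2(
    "MlpLnLstmPolicy",
    env,
    n_steps=1024,
    gamma=0.999,
    learning_rate=0.0005,
    ent_coef=0.04,
    vf_coef=0.6,
    cliprange=0.25,
    noptepochs=4,
    lam=0.85,
    nminibatches=1,
    verbose=1,
)
model.learn(total_timesteps=1050 * 1024 * 24)  # roughly the ~1050 updates mentioned below
```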
I find that when training the model there is massive overhead between batch updates, and it is increasing exponentially with every update: when n_updates was between 1 and 10 it took about 10 seconds between each update; when n_updates was around 180 it took 21 minutes between updates; now that n_updates is 205 it is taking 88 minutes between updates (with the hyperparams and actors set above, we get a total of around 1050 updates). When an update takes place, I see the GPU spin up and complete the update very quickly (around 5 seconds). But in between updates the GPU utilization is at 0% while CPU usage oscillates like a sine wave between 20% and 80%.
I would like to better understand how the CPU and GPU are being utilized by stable-baselines.
Why is there so much CPU overhead between updates? My (custom) gym environment is very simple: the observations are taken directly from the raw data (a csv file loaded into a pandas dataframe), and no transformation or calculation is applied to the observations at each step. The reward is also very quick to calculate (just a simple numpy operation on an array of elements; this array grows by one element with every serial timestep). At first I thought that the untrained agent would “die” (done=True) a lot in the first few iterations, which would make the time between updates very short. But even when the agent stays “alive” for more steps in later iterations, 88 minutes between updates seems far too long. How can each actor (each CPU) stepping through the environment for 1024 steps take so long?
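To give an idea of what the environment does, here is a hypothetical sketch; the class and attribute names are illustrative, not my actual code, and the reward is shown as a plain numpy reduction over the growing array:

```python
# Hypothetical sketch of the environment described above (illustrative names).
import gym
import numpy as np
from gym import spaces

class MyEnv(gym.Env):
    def __init__(self, data):
        super().__init__()
        self.df = data  # pandas DataFrame with the 7 feature columns
        self.observation_space = spaces.Box(-np.inf, np.inf, shape=(7,), dtype=np.float32)
        self.action_space = spaces.Discrete(2)  # placeholder action space
        self.t = 0
        self.history = []  # grows by one element every serial timestep

    def reset(self):
        self.t = 0
        self.history = []
        return self.df.iloc[self.t].values.astype(np.float32)

    def step(self, action):
        self.t += 1
        obs = self.df.iloc[self.t].values.astype(np.float32)  # raw row, no transforms
        self.history.append(float(action))    # the array that grows each step
        reward = float(np.sum(self.history))  # cheap numpy calculation
        done = self.t >= len(self.df) - 1     # or whenever the agent "dies"
        return obs, reward, done, {}
```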
Is there some hidden “minimum loss decrease” parameter? If the algorithm only updates when some minimum loss change between updates is achieved, then maybe that could explain why updates take so long. Something analogous would be the min_delta argument of the Keras EarlyStopping callback. If this is the case, can we change it via some PPO2 parameter or kwarg?
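For clarity, this is the Keras construct I have in mind, shown purely for comparison; the monitor and threshold values here are arbitrary examples:

```python
# Keras analogue, for comparison only; the values are arbitrary examples.
from tensorflow.keras.callbacks import EarlyStopping

early_stop = EarlyStopping(monitor="loss", min_delta=1e-4, patience=3)
# model.fit(..., callbacks=[early_stop])  # stops once loss improvements fall below min_delta
```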
Should I increase the architecture size? Issue #308 seems to suggest that increasing the network size would at least increase GPU utilization at update time (which may also help convergence). I am open to trying this, but I don’t think it really answers my first question above.
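For example (a sketch, not a recommendation; n_lstm=512 is an arbitrary value larger than the policy’s default, and env is the vectorized environment from the setup sketch above):

```python
# Hedged sketch of the "increase the architecture size" idea from issue #308;
# n_lstm=512 is an arbitrary example value.
from stable_baselines import PPO2

bigger_model = PPO2(
    "MlpLnLstmPolicy",
    env,  # the SubprocVecEnv from the setup sketch above
    policy_kwargs=dict(n_lstm=512),
    n_steps=1024,
    nminibatches=1,
    verbose=1,
)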
My ultimate goal is just to have a fully trained agent that has traversed the entire dataset of 1 million samples in a reasonable amount of time (even 2-3 days). Any changes I can make to my gym environment / PPO2 params, or any explanation of how stable-baselines utilises the hardware, would be much appreciated. Thanks
Top GitHub Comments
That behaviour does not sound normal at all, especially if you are using GPU with stable-baselines (it should utilize GPU in spikes for training and CPU only for environments).
Two things pop to my mind:
- Double-check how you create the environment with your data (env = MyEnv(data=data)), just to make sure this is not breaking things.
- Try DummyVecEnv instead of SubprocVecEnv. Since your environment is computationally very simple, using different Python processes (SubprocVecEnv) adds considerable overhead. See the note here for more info.

Hmm, on a glimpse this environment seems alright and should work fine. Have you tried the “MlpLstm” policy rather than the “MlpLnLstm” policy? That one has worked as expected for me in my experiments. Other than that I do not have other suggestions to give, other than starting to debug timings and trying to pin down what takes so long 😕