[question] What is the proper way to log metrics at the end of each epoch when epochs are variable in length?
Problem description
I am training a PPO model for stock trading using a custom gym environment called StockTradingEnv. Each "epoch" of training is variable in length, since the epoch ends under one of two conditions: 1) the agent loses all of its initial money, or 2) the agent reaches the end of the data frame/time series (without having lost all of its money). I would like to log the net change in the agent's balance at the end of each of these epochs. To do so, I maintain an array within the environment, StockTradingEnv.list_networth, containing the agent's net worth at each time step, and reset it (i.e. empty the array) at the start of each new epoch.
I attempted to create a subclass of BaseCallback, called TensorboardCallback, with a very simple _on_step() method: it checks StockTradingEnv.done, and if it is True, logs the net_change for that epoch (the difference between the values at the last and first indexes of StockTradingEnv.list_networth). However, it appears that PPO only invokes its callbacks every n_steps, and n_steps=1 is not permitted, as per the documentation:
:param n_steps: The number of steps to run for each environment per update (i.e. rollout buffer size is n_steps * n_envs where n_envs is number of environment copies running in parallel) NOTE: n_steps * n_envs must be greater than 1 (because of the advantage normalization)
Even with n_steps=2, it is possible that an epoch ends on, say, step 1001 (not divisible by 2), and thus no net_change will be logged for that epoch.
What is the proper solution, using stable-baselines3, to log metrics from the environment systematically at the end of each epoch, when the epoch lengths are not a constant number of steps?
Code
For the sake of brevity, I did not include the code for the custom environment here. I can always add this if someone deems it necessary.
import pandas as pd

from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import DummyVecEnv
from stable_baselines3.common.callbacks import BaseCallback

from env import StockTradingEnv  # a custom gym environment for stock trading


# A custom callback
class TensorboardCallback(BaseCallback):
    """Logs the net change in cash between the beginning and end of each epoch/run."""

    def __init__(self, verbose=0):
        super(TensorboardCallback, self).__init__(verbose)
        self.env = self.training_env.envs[0]

    def _on_step(self) -> bool:
        if self.env.done:
            net_change = self.env.list_networth[-1] - self.env.list_networth[0]
            self.logger.record("net_change", net_change)
        return True


# Load training data
WMT_Train = pd.read_csv("WMT_Train.csv")

# Instantiate the custom environment
env = DummyVecEnv([lambda: StockTradingEnv(WMT_Train, start=0, end=10000, look_back=10)])

# Instantiate model
model = PPO('MlpPolicy', env, learning_rate=0.0001, verbose=0, ent_coef=0.5,
            tensorboard_log="./ppo_log", n_steps=128)

# Fit model using the custom callback
model.learn(total_timesteps=500000, tb_log_name="PPO_log", callback=TensorboardCallback())
System Info
- Python version: 3.9.7
- Stable-baselines3 (version 1.2.0), installed via pip: pip install 'stable-baselines3[extra]'
- Tensorflow version: 2.6.0
- Gym version: 0.20.0
Top GitHub Comments
Hmm, I am a bit confused about the concept of an epoch here. It sounds like what you mean is an episode (from reset to done=True in an environment)? If that is the case, a simple Monitor wrapper (see the examples on how to add this) would do the trick: it saves data on each individual episode into a csv file you can then load up later. At least, this is what I understood from your description (sorry for not suggesting this earlier, I was under the impression you might have tried this). You should probably open this issue on the stable-baselines3 repository 😃.
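For reference, a minimal sketch of what that Monitor setup could look like. The output path and the assumption that the environment puts a "net_change" key into its info dict on the step where done=True are illustrative, not taken from the original issue:

import pandas as pd

from stable_baselines3.common.monitor import Monitor
from stable_baselines3.common.vec_env import DummyVecEnv

from env import StockTradingEnv  # the custom environment from the question

WMT_Train = pd.read_csv("WMT_Train.csv")

# Wrap the raw environment with Monitor *before* vectorizing it. Monitor writes
# the reward, length and wall-clock time of every finished episode to a csv file,
# plus any extra keys listed in info_keywords, provided the environment reports
# those keys in the `info` dict at the end of an episode.
env = DummyVecEnv([
    lambda: Monitor(
        StockTradingEnv(WMT_Train, start=0, end=10000, look_back=10),
        filename="./monitor_logs/ppo_stock",   # hypothetical output path
        info_keywords=("net_change",),         # assumes the env exposes this key in info
    )
])

The csv produced by Monitor then has one row per episode, so per-episode statistics can be computed afterwards regardless of how many steps each episode lasted.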
But to answer your question: if I understand correctly, you want to log stats each time PPO is updated. In that case you should use _on_rollout_start and _on_rollout_end (the former is called when new samples are collected, the latter when the sampling is done; training also happens once per rollout start/end).