Episode mean reward is not properly logged on tensorboard when using SAC
I’m training an agent on a custom environment using SAC. The environment is wrapped in a Monitor, which is wrapped in a DummyVecEnv, which is wrapped in a VecNormalize with norm_reward = True.
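For reference, the wrapping order described above looks roughly like this (a minimal sketch only; MyCustomEnv stands in for the custom environment, and the full script is further down):

from my_custom_env import MyCustomEnv
from stable_baselines3.common.monitor import Monitor
from stable_baselines3.common.vec_env import DummyVecEnv, VecNormalize

env = DummyVecEnv([lambda: Monitor(MyCustomEnv())])  # Monitor records raw episode statistics
env = VecNormalize(env, norm_reward = True)          # normalizes observations and rewards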
This is the tensorboard graph for the episode mean reward:
[Screenshots: ep_rew_mean curve, shown with no smoothing and with 0.9 smoothing]
As you can see, the graph has some weird loops, for example at around 170k steps and 450k steps.
Edit: Training is conducted in epochs of 50k steps.
The program starts by calling ./start.sh.
start.sh
#!/bin/bash
# Keep re-running main.py for as long as it exits successfully (exit code 0).
while [ "$?" -eq 0 ]; do
    python3 main.py
done
main.py
import os.path

from my_custom_env import MyCustomEnv
from stable_baselines3 import SAC
from stable_baselines3.common.callbacks import BaseCallback
from stable_baselines3.common.monitor import Monitor
from stable_baselines3.common.vec_env import DummyVecEnv, VecNormalize


class SaveCheckpoint(BaseCallback):
    def __init__(self, save_freq, verbose = 0):
        super(SaveCheckpoint, self).__init__(verbose)
        self.save_freq = save_freq

    def _on_step(self):
        # save model and normalization stats every save_freq steps
        if self.num_timesteps % self.save_freq == 0:
            self.model.save("model.zip")
            self.training_env.save("stats.pkl")
        return True


if __name__ == '__main__':
    # inits
    env = DummyVecEnv([lambda: Monitor(MyCustomEnv())])
    model = None

    # load recent checkpoint
    if os.path.isfile("model.zip") and os.path.isfile("stats.pkl"):
        env = VecNormalize.load("stats.pkl", env)
        env.reset()
        model = SAC.load("model.zip", env)
    else:
        env = VecNormalize(env)
        model = SAC('MlpPolicy', env, verbose = 1, tensorboard_log = ".")

    # replay buffer
    if os.path.isfile("replay_buffer.pkl"):
        model.load_replay_buffer("replay_buffer.pkl")

    # train
    model.learn(50000,
        callback = SaveCheckpoint(10000),
        log_interval = 1,
        reset_num_timesteps = False
    )

    # save replay buffer
    model.save_replay_buffer(".")

    env.close()
> pip3 freeze | grep 'stable-baselines3'
stable-baselines3==0.7.0a1
Issue Analytics
- Created 3 years ago
- Comments: 7 (7 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
Updated the post with more detailed info
Yup, that seems to have been the problem. Thanks!