PPO rollouts not terminating with `done == True`
I am using a custom environment, and I’ve already checked the following:
from stable_baselines3.common.env_checker import check_env
env = CustomEnv(arg1, ...)
# It will check your custom environment and output additional warnings if needed
check_env(env)
But the PPO algorithm keeps calling step(action) after the environment returns done == True (i.e. once the state is no longer within the bounds).
This is how I am interfacing with the algorithm:
from stable_baselines3 import PPO
from stable_baselines3.common.env_checker import check_env
from stable_baselines3.common.vec_env import DummyVecEnv


class Agent:
    def __init__(self, environment, name, net_arch=[100, 100], n_env=1, n_steps=10000):
        # vectorise the environment
        self.environment = environment
        check_env(self.environment)
        venv = DummyVecEnv([lambda: environment] * n_env)

        # load an existing model, or create a new one if loading fails
        assert isinstance(name, str)
        self.name = name
        try:
            self.model = PPO.load(self.name, venv)
        except Exception:
            self.model = PPO(
                'MlpPolicy',
                venv,
                use_sde=True,
                sde_sample_freq=5,
                gae_lambda=0.9,
                learning_rate=1e-2,
                verbose=1,
                policy_kwargs=dict(net_arch=net_arch),
                n_steps=n_steps,
            )

    def train(self, time_steps):
        # learn and save
        self.model.learn(total_timesteps=time_steps)
        self.model.save(self.name)

    def evaluate(self):
        # simulate a single episode with the unvectorised environment
        obs = self.environment.reset()
        while True:
            action, _ = self.model.predict(obs, deterministic=True)
            obs, rew, done, _ = self.environment.step(action)
            if done:
                break
        # plot
        self.environment.system.plot(fname='{}.pdf'.format(self.name))
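For context, stepping past done is expected Stable-Baselines3 behaviour rather than a bug in the snippet above: during collect_rollouts, PPO gathers a fixed number of n_steps transitions per environment, and the wrapping VecEnv automatically resets an environment as soon as it returns done=True, so subsequent step() calls belong to a fresh episode. A minimal sketch of that auto-reset behaviour, assuming CustomEnv is the custom environment from the question (its constructor arguments are placeholders):

from stable_baselines3.common.vec_env import DummyVecEnv

env = CustomEnv(arg1, ...)          # placeholder: the custom environment above
venv = DummyVecEnv([lambda: env])

obs = venv.reset()
for _ in range(1000):
    action = [venv.action_space.sample()]
    obs, rewards, dones, infos = venv.step(action)
    if dones[0]:
        # DummyVecEnv has already reset the underlying env at this point:
        # `obs` is the first observation of the next episode, and the final
        # observation of the finished episode is stored in the info dict
        terminal_obs = infos[0]["terminal_observation"]
        break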
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
@Miffyli, thanks. But, after more thought, I actually need the agent to stay within the state-space bounds during training, because some of my underlying code requires that. How can I enforce this?
If I understood this right (you want the agent to avoid specific situations), then yes: a correct reward and/or terminal states should be able to teach the agent to avoid those situations.
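As a concrete illustration of that suggestion, here is a minimal sketch (not the poster's actual environment) of a custom env whose step() ends the episode and applies a penalty as soon as the state leaves the allowed range; the bounds, rewards and dynamics are hypothetical placeholders:

import numpy as np
import gym
from gym import spaces


class BoundedEnv(gym.Env):
    """Hypothetical example: the episode ends with a penalty when the state leaves [-1, 1]."""

    def __init__(self):
        self.observation_space = spaces.Box(low=-1.0, high=1.0, shape=(1,), dtype=np.float32)
        self.action_space = spaces.Box(low=-0.1, high=0.1, shape=(1,), dtype=np.float32)
        self.state = np.zeros(1, dtype=np.float32)

    def reset(self):
        self.state = np.zeros(1, dtype=np.float32)
        return self.state.copy()

    def step(self, action):
        self.state = self.state + action                      # placeholder dynamics
        out_of_bounds = bool(np.any(np.abs(self.state) > 1.0))
        done = out_of_bounds
        reward = -10.0 if out_of_bounds else 1.0              # penalise leaving the bounds
        # keep the returned observation inside the declared observation space
        obs = np.clip(self.state, -1.0, 1.0).astype(np.float32)
        return obs, reward, done, {}

During training the VecEnv will still reset and step past these terminations, but the negative reward on leaving the bounds is what teaches the policy to stay inside them; if the underlying simulation genuinely cannot tolerate out-of-bounds states even for one step, clipping or rejecting the state inside step() (as above) keeps it valid.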