PPO rollouts not terminating with `done == True`
I am using a custom environment, and I’ve already checked the following:
from stable_baselines3.common.env_checker import check_env
env = CustomEnv(arg1, ...)
# It will check your custom environment and output additional warnings if needed
check_env(env)
But the PPO algorithm keeps calling step(action) after the environment returns done == True (i.e. once the state is no longer within the bounds).
This is how I am interfacing with the algorithm:
from stable_baselines3 import PPO
from stable_baselines3.common.env_checker import check_env
from stable_baselines3.common.vec_env import DummyVecEnv


class Agent:
    def __init__(self, environment, name, net_arch=[100, 100], n_env=1, n_steps=10000):
        # vectorise the environment
        self.environment = environment
        check_env(self.environment)
        venv = DummyVecEnv([lambda: environment] * n_env)

        # load an existing model, or create a new one if loading fails
        assert isinstance(name, str)
        self.name = name
        try:
            self.model = PPO.load(self.name, venv)
        except Exception:
            self.model = PPO(
                'MlpPolicy',
                venv,
                use_sde=True,
                sde_sample_freq=5,
                gae_lambda=0.9,
                learning_rate=1e-2,
                verbose=1,
                policy_kwargs=dict(net_arch=net_arch),
                n_steps=n_steps,
            )

    def train(self, time_steps):
        # learn and save
        self.model.learn(total_timesteps=time_steps)
        self.model.save(self.name)

    def evaluate(self):
        # simulate a single episode with the unvectorised environment
        obs = self.environment.reset()
        while True:
            action, _ = self.model.predict(obs, deterministic=True)
            obs, rew, done, _ = self.environment.step(action)
            if done:
                break
        # plot
        self.environment.system.plot(fname='{}.pdf'.format(self.name))
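For context, stepping past done is expected Stable-Baselines3 behaviour rather than a bug in the snippet above: during collect_rollouts, PPO gathers a fixed number of n_steps transitions per environment, and the wrapping VecEnv automatically resets an environment as soon as it returns done=True, so subsequent step() calls belong to a fresh episode. A minimal sketch of that auto-reset behaviour, assuming CustomEnv is the custom environment from the question (its constructor arguments are placeholders):

from stable_baselines3.common.vec_env import DummyVecEnv

env = CustomEnv(arg1, ...)          # placeholder: the custom environment above
venv = DummyVecEnv([lambda: env])

obs = venv.reset()
for _ in range(1000):
    action = [venv.action_space.sample()]
    obs, rewards, dones, infos = venv.step(action)
    if dones[0]:
        # DummyVecEnv has already reset the underlying env at this point:
        # `obs` is the first observation of the next episode, and the final
        # observation of the finished episode is stored in the info dict
        terminal_obs = infos[0]["terminal_observation"]
        break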
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
@Miffyli, thanks. But, after more thought, I actually need the agent to stay within the state-space bounds during training, because some of my underlying code requires that. How can I enforce this?
If I understood this right (you want the agent to avoid specific situations), then yes: a correct reward and/or terminal states should be able to teach the agent to avoid those situations.
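As a concrete illustration of that suggestion, here is a minimal sketch (not the poster's actual environment) of a custom env whose step() ends the episode and applies a penalty as soon as the state leaves the allowed range; the bounds, rewards and dynamics are hypothetical placeholders:

import numpy as np
import gym
from gym import spaces


class BoundedEnv(gym.Env):
    """Hypothetical example: the episode ends with a penalty when the state leaves [-1, 1]."""

    def __init__(self):
        self.observation_space = spaces.Box(low=-1.0, high=1.0, shape=(1,), dtype=np.float32)
        self.action_space = spaces.Box(low=-0.1, high=0.1, shape=(1,), dtype=np.float32)
        self.state = np.zeros(1, dtype=np.float32)

    def reset(self):
        self.state = np.zeros(1, dtype=np.float32)
        return self.state.copy()

    def step(self, action):
        self.state = self.state + action                      # placeholder dynamics
        out_of_bounds = bool(np.any(np.abs(self.state) > 1.0))
        done = out_of_bounds
        reward = -10.0 if out_of_bounds else 1.0              # penalise leaving the bounds
        # keep the returned observation inside the declared observation space
        obs = np.clip(self.state, -1.0, 1.0).astype(np.float32)
        return obs, reward, done, {}

During training the VecEnv will still reset and step past these terminations, but the negative reward on leaving the bounds is what teaches the policy to stay inside them; if the underlying simulation genuinely cannot tolerate out-of-bounds states even for one step, clipping or rejecting the state inside step() (as above) keeps it valid.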