Filtering out artificial terminal states

In many gym environments, such as MountainCarContinuous, there is an episode step limit. This ends the episode before the agent actually reaches the end of the trajectory (which in this case means reaching the top of the hill).

Saving these experiences to the replay buffer without changing the artificial terminal flags to False (as is done here, for example) leads to incorrect TD errors. I think the agent’s prediction of future rewards should still be taken into account when it has not yet reached the real end of the trajectory.
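To make the effect concrete, here is a minimal sketch of a one-step TD target. The function name `td_target` and the numbers are made up for illustration and are not taken from any library; the point is only that storing `done=True` at the time limit drops the bootstrap term.

```python
def td_target(reward, next_value, done, gamma=0.99):
    """One-step TD target: r + gamma * (1 - done) * V(s')."""
    return reward + gamma * (1.0 - float(done)) * next_value


# Hypothetical step where the time limit was hit but the goal was not reached.
reward, next_value = -0.1, 5.0

# Time-limit "done" stored as True: the value of the next state is ignored.
biased = td_target(reward, next_value, done=True)

# Time-limit "done" stored as False: the target still bootstraps from V(s').
corrected = td_target(reward, next_value, done=False)

print(biased, corrected)
```

With these illustrative numbers, the first target keeps only the immediate reward, while the second also includes the discounted value estimate of the next state.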

This is why some implementations, like OpenAI SpinningUp, change these terminal flags before saving the experience, like this:

"""From OpenAI SpinningUp source code"""

# Ignore the "done" signal if it comes from hitting the time
# horizon (that is when it's an artificial terminal signal
# that isn't based on the agent's state)
d = False if ep_len==max_ep_len else d

# Store experience to replay buffer
replay_buffer.store(o, a, r, o2, d)
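An alternative to comparing episode lengths is to read the flag that gym's `TimeLimit` wrapper writes into `info`. The sketch below assumes an older gym version where `step` returns a single `done` and the wrapper sets `info["TimeLimit.truncated"]`; the helper name `stored_done` is hypothetical.

```python
def stored_done(done, info):
    """Return the done flag to store in the replay buffer.

    Assumes the environment is wrapped in gym's TimeLimit wrapper, which
    sets info["TimeLimit.truncated"] = True when the episode ended only
    because the step limit was reached.
    """
    truncated = info.get("TimeLimit.truncated", False)
    return bool(done) and not truncated


# Episode cut off by the time limit: store False so the target bootstraps.
print(stored_done(True, {"TimeLimit.truncated": True}))
# Real terminal state: keep True.
print(stored_done(True, {}))
```

In newer APIs such as Gymnasium, `step` returns `terminated` and `truncated` separately, so only `terminated` would need to be stored.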

Issue Analytics

Created 3 years ago
Comments: 6

Top GitHub Comments

araffin commented, Mar 23, 2022 (1 reaction)

Answered here https://github.com/DLR-RM/stable-baselines3/issues/829

araffin commented, Sep 30, 2020 (1 reaction)

I created a branch on SB3, but it is in fact a bit trickier than expected (notably because VecEnv resets automatically): https://github.com/DLR-RM/stable-baselines3/compare/feat/remove-timelimit

For A2C/PPO or any n-step methods, we would need to keep track of two types of termination signals…
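As a rough illustration of what tracking two signals could look like for n-step returns, here is a sketch in plain Python. It is not the SB3 implementation; the function name and the `final_values` argument (value estimates of the last observation before an automatic reset) are assumptions made for this example.

```python
def n_step_returns(rewards, terminated, truncated, final_values,
                   last_value, gamma=0.99):
    """Discounted returns over a rollout with two kinds of episode end.

    terminated[t]: real terminal state, no bootstrapping.
    truncated[t]:  time limit hit, bootstrap from final_values[t], the
                   (hypothetical) value estimate of the observation seen
                   just before the environment was reset.
    last_value:    value estimate of the observation after the last step.
    """
    returns = [0.0] * len(rewards)
    running = last_value
    for t in reversed(range(len(rewards))):
        if terminated[t]:
            running = rewards[t]
        elif truncated[t]:
            running = rewards[t] + gamma * final_values[t]
        else:
            running = rewards[t] + gamma * running
        returns[t] = running
    return returns


rets = n_step_returns(
    rewards=[1.0, 1.0, 1.0],
    terminated=[False, False, False],
    truncated=[False, True, False],
    final_values=[0.0, 10.0, 0.0],
    last_value=2.0,
    gamma=0.5,
)
print(rets)  # [4.0, 6.0, 2.0]
```

At the truncated step the return bootstraps from the pre-reset value rather than from the first state of the next episode, which is the part that automatic resets make awkward.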