[Bug] Infinite horizon tasks are handled like episodic tasks
Hi,
I wonder how to correctly use SAC with infinite-horizon environments. I saw @araffin's answer in https://github.com/hill-a/stable-baselines/issues/776, where he points out that the algorithms are step-based. Our environments could always return `done = False`, but then we would have to reset the environment manually. As a consequence, we would add transitions to the replay buffer going from the last state to the initial state, which is bad.
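A rough sketch of the failure mode described above, with placeholder names (`env`, `policy`, `buffer` are illustrative, not SB3 objects): the env never returns `done=True`, so the loop resets it manually and ends up storing a transition that stitches two unrelated states together.

```python
def collect_with_manual_resets(env, policy, buffer, total_steps: int, max_steps: int):
    """Naive collection loop for an env that never signals done=True."""
    obs = env.reset()
    for step in range(total_steps):
        action = policy(obs)
        next_obs, reward, done, info = env.step(action)  # done is always False here
        if (step + 1) % max_steps == 0:
            # Manual reset: `next_obs` now belongs to a fresh episode, so the
            # stored (obs, action, reward, next_obs) links two unrelated states.
            next_obs = env.reset()
        buffer.add(obs, next_obs, action, reward, done)
        obs = next_obs
```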
Is the only solution to include a time feature? That means messing with the `observation_space` size, handling dict spaces correctly, and explaining what this “time feature” is in papers. Let me know if I’ve missed a thread that already treats this issue 😄
Greetings!
🐛 Bug / Background
My understanding is that SAC skips the bootstrapped target when s' is a terminal state:
```python
q_backup = replay_data.rewards + (1 - replay_data.dones) * self.gamma * target_q
```
In infinite-horizon tasks, we wrap our env with `gym.wrappers.TimeLimit`, which sets `done = True` when the maximum episode length is reached. This ends the episode in SAC, and the transition is saved in the replay buffer for learning.
However, according to “Time Limits in Reinforcement Learning” (https://arxiv.org/abs/1712.00378), we should not treat that last state as “terminal”, since the termination has nothing to do with the MDP. Ignoring this amounts to “state aliasing” and violates the Markov property.
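A minimal sketch (not the actual SB3 code) of how the target from the snippet above could account for time-limit truncations, assuming we also stored a `timeouts` flag alongside `dones`:

```python
import torch as th

def td_target(rewards: th.Tensor, dones: th.Tensor, timeouts: th.Tensor,
              target_q: th.Tensor, gamma: float) -> th.Tensor:
    # Count a transition as terminal only if the MDP itself terminated,
    # not if the TimeLimit wrapper cut the episode short: keep bootstrapping
    # through time-limit truncations.
    real_dones = dones * (1.0 - timeouts)
    return rewards + (1.0 - real_dones) * gamma * target_q
```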
Top GitHub Comments
TimeFeature is one solution and is equivalent in performance to specific handling of timeouts. We have an implementation in SB3-Contrib that already handles dict spaces, and it is used for all PyBullet envs in the zoo: https://github.com/DLR-RM/rl-baselines3-zoo/blob/master/hyperparams/sac.yml#L142
Personally, this is the solution I recommend (and you can use the wrapper's test mode at evaluation time too).
Note: timeout handling is indeed important; see the appendix of https://arxiv.org/abs/2005.05719
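A usage sketch of the TimeFeature approach mentioned above (assumes sb3-contrib is installed; the env id and `max_steps` value are just examples and should match your own environment's time limit):

```python
import gym
from sb3_contrib.common.wrappers import TimeFeatureWrapper
from stable_baselines3 import SAC

# Append a normalized "remaining time" feature to the observation.
env = TimeFeatureWrapper(gym.make("Pendulum-v1"), max_steps=200)
model = SAC("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=10_000)

# At evaluation time, the wrapper's test mode keeps the time feature constant.
eval_env = TimeFeatureWrapper(gym.make("Pendulum-v1"), max_steps=200, test_mode=True)
```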
Related issues
Linking to all relevant issues:
Experimental branch
"I created a branch on SB3 but it in fact a bit more tricky than expected (notably because VecEnv resets automatically): " As mentioned, I already created an experimental branch here: https://github.com/DLR-RM/stable-baselines3/compare/feat/remove-timelimit
You don’t need `gamma=1`; this is independent from the infinite-horizon problem.
To summarize, so I’m sure I understood things right: you have a non-episodic task (never truly “done”), but you use `TimeLimit` to reset the game every now and then, and to train correctly you cannot apply terminal boundaries during training (they do not reflect the true agent setup).
There should not be a problem with this while using SAC, as long as you always feed in `done=False`. The biggest problem then is that the final timestep does not reflect the environment's behaviour (it was reset under the hood). The easiest fix is not to include it in the training data: where the transition is stored, you can add a check like `if not infos.get("TimeLimit.truncated", False): buffer.add(...)`. This flag is added to the `info` dictionary when an episode is truncated by the time limit.
A more sophisticated solution would indeed be a nice enhancement though, as errors like these are easy to miss. I will mark it as an enhancement for a later version of stable-baselines.