[Bug] Infinite horizon tasks are handled like episodic tasks
Hi,
I wonder how to correctly use SAC with infinite-horizon environments. I saw @araffin's answer in https://github.com/hill-a/stable-baselines/issues/776, where he points out that the algorithms are step-based. Our environments could always return `done = False`, but then we would have to reset the environment manually. As a consequence, we would add transitions to the replay buffer going from the last state to the initial state, which is bad.
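A rough sketch of the failure mode described above, with placeholder names (`env`, `policy`, `buffer` are illustrative, not SB3 objects): the env never returns `done=True`, so the loop resets it manually and ends up storing a transition that stitches two unrelated states together.

```python
def collect_with_manual_resets(env, policy, buffer, total_steps: int, max_steps: int):
    """Naive collection loop for an env that never signals done=True."""
    obs = env.reset()
    for step in range(total_steps):
        action = policy(obs)
        next_obs, reward, done, info = env.step(action)  # done is always False here
        if (step + 1) % max_steps == 0:
            # Manual reset: `next_obs` now belongs to a fresh episode, so the
            # stored (obs, action, reward, next_obs) links two unrelated states.
            next_obs = env.reset()
        buffer.add(obs, next_obs, action, reward, done)
        obs = next_obs
```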
Is the only solution to include a time feature? That means messing with the `observation_space` size, handling dict spaces correctly, and explaining what this “time feature” is in papers. Let me know if I’ve missed a thread that already treats this issue 😄
Greetings!
🐛 Bug / Background
My understanding is that SAC skips the bootstrapped target when s' is a terminal state:
```python
q_backup = replay_data.rewards + (1 - replay_data.dones) * self.gamma * target_q
```
In infinite-horizon tasks, we wrap our env with `gym.wrappers.TimeLimit`, which sets `done = True` when the maximum episode length is reached. This ends the episode in SAC, and the transition is saved in the replay buffer for learning.
However, according to “Time Limits in Reinforcement Learning” (https://arxiv.org/abs/1712.00378), we should not treat that last state as “terminal”, since the termination has nothing to do with the MDP. Ignoring this amounts to “state aliasing” and violates the Markov property.
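A minimal sketch (not the actual SB3 code) of how the target from the snippet above could account for time-limit truncations, assuming we also stored a `timeouts` flag alongside `dones`:

```python
import torch as th

def td_target(rewards: th.Tensor, dones: th.Tensor, timeouts: th.Tensor,
              target_q: th.Tensor, gamma: float) -> th.Tensor:
    # Count a transition as terminal only if the MDP itself terminated,
    # not if the TimeLimit wrapper cut the episode short: keep bootstrapping
    # through time-limit truncations.
    real_dones = dones * (1.0 - timeouts)
    return rewards + (1.0 - real_dones) * gamma * target_q
```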
Top GitHub Comments
TimeFeature is one solution and is equivalent in performance to specific handling of timeouts. We have an implementation in SB3-Contrib that already handles dict spaces, and it is used for all PyBullet envs in the zoo: https://github.com/DLR-RM/rl-baselines3-zoo/blob/master/hyperparams/sac.yml#L142
Personally, this is the solution I recommend (and you can use the wrapper's test mode at evaluation time too).
Note: timeout handling is indeed important; see the appendix of https://arxiv.org/abs/2005.05719
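A usage sketch of the TimeFeature approach mentioned above (assumes sb3-contrib is installed; the env id and `max_steps` value are just examples and should match your own environment's time limit):

```python
import gym
from sb3_contrib.common.wrappers import TimeFeatureWrapper
from stable_baselines3 import SAC

# Append a normalized "remaining time" feature to the observation.
env = TimeFeatureWrapper(gym.make("Pendulum-v1"), max_steps=200)
model = SAC("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=10_000)

# At evaluation time, the wrapper's test mode keeps the time feature constant.
eval_env = TimeFeatureWrapper(gym.make("Pendulum-v1"), max_steps=200, test_mode=True)
```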
Related issues
Linking to all relevant issues:
Experimental branch
"I created a branch on SB3 but it in fact a bit more tricky than expected (notably because VecEnv resets automatically): " As mentioned, I already created an experimental branch here: https://github.com/DLR-RM/stable-baselines3/compare/feat/remove-timelimit
You don’t need `gamma=1`; this is independent from the infinite-horizon problem.
To summarize, so I’m sure I understood things right: you have a non-episodic task (never truly “done”), but you use `TimeLimit` to reset the game every now and then, and to train correctly you cannot apply terminal boundaries during training (they do not reflect the true agent setup).
There should not be a problem with this while using SAC, as long as you always feed in `done=False`. The biggest problem then is that the final timestep does not reflect the environment's behaviour (it was reset under the hood). The easiest fix is not to include it in the training data: where the transition is stored, you can add a check like `if not infos.get("TimeLimit.truncated", False): buffer.add(...)`. This flag is added to the `info` dictionary when an episode is truncated by the time limit.
A more sophisticated solution would indeed be a nice enhancement though, as errors like these are easy to miss. I will mark it as an enhancement for a later version of stable-baselines.