
[Bug] Infinite horizon tasks are handled like episodic tasks


Hi, I wonder how to correctly use SAC with infinite-horizon environments. I saw @araffin's answer to https://github.com/hill-a/stable-baselines/issues/776, where he points out that the algorithms are step-based. Our environments could always return done = False, but then we would have to reset the environment manually. As a consequence, we would add transitions to the replay buffer going from the last state to the initial state, which is bad.

Is the only solution to include a time-feature? That means messing with the observation_space size and handling dict spaces correctly + explaining what this “time-feature” is in papers. Let me know if I’ve missed a thread treating this issue already 😄
Greetings!

🐛 Bug / Background

My understanding is that SAC skips the target if s' is a terminal state:

q_backup = replay_data.rewards + (1 - replay_data.dones) * self.gamma * target_q

In infinite horizon tasks, we wrap our env with gym.wrappers.TimeLimit, which sets done = True when the maximum episode length is reached. This stops the episode in SAC and the transition is saved in the replay buffer for learning.

However, according to “Time Limits in Reinforcement Learning” (https://arxiv.org/abs/1712.00378), we should not see that last state as a “terminal” state, since the termination has nothing to do with the MDP. If we ignore this, we are doing “state aliasing” and violating the Markov Property.
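For concreteness, here is a minimal sketch (not code from the issue or from SB3) of what the paper's proposed fix, partial-episode bootstrapping, would look like for the target above. The timeouts mask is an assumption of the sketch: it marks transitions whose done = True comes only from the TimeLimit wrapper, not from a true terminal state.

import torch

def q_target_naive(rewards, dones, gamma, target_q):
    # Current behaviour: any done=True (including time-limit truncations)
    # zeroes out the bootstrap term.
    return rewards + (1 - dones) * gamma * target_q

def q_target_timeout_aware(rewards, dones, timeouts, gamma, target_q):
    # Partial-episode bootstrapping: keep bootstrapping through time-limit
    # truncations, and only cut it at true terminal states.
    true_terminals = dones * (1 - timeouts)
    return rewards + (1 - true_terminals) * gamma * target_q

# Toy batch: the second transition ended only because of the time limit.
rewards = torch.tensor([[1.0], [1.0]])
dones = torch.tensor([[0.0], [1.0]])
timeouts = torch.tensor([[0.0], [1.0]])
target_q = torch.tensor([[10.0], [10.0]])

print(q_target_naive(rewards, dones, 0.99, target_q))                    # [[10.9], [1.0]]
print(q_target_timeout_aware(rewards, dones, timeouts, 0.99, target_q))  # [[10.9], [10.9]]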


Top GitHub Comments

1 reaction
araffin commented, Jan 7, 2021

Is the only solution to include a time-feature? That means messing with the observation_space size and handling dict spaces correctly + explaining what this “time-feature” is in papers

TimeFeature is one solution, and it is equivalent in performance to specific handling of timeouts. We have an implementation in SB3-Contrib that already handles dict observation spaces, and it is used for all PyBullet envs in the zoo: https://github.com/DLR-RM/rl-baselines3-zoo/blob/master/hyperparams/sac.yml#L142

Personally, this is the solution I would recommend (and you can use the test mode at test time too).

Note: timeout handling is indeed important; see the appendix of https://arxiv.org/abs/2005.05719
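For reference, a minimal sketch of what that looks like in code, assuming the TimeFeatureWrapper from sb3-contrib; the environment id and step counts are placeholders, not values from the issue.

import gym
from sb3_contrib.common.wrappers import TimeFeatureWrapper
from stable_baselines3 import SAC

# Placeholder environment; any continuous-control env wrapped in a TimeLimit works.
env = gym.make("Pendulum-v1")
# Appends a "remaining time" feature to the observation. Pass test_mode=True
# at evaluation/deployment time to keep the time feature constant.
env = TimeFeatureWrapper(env)

model = SAC("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=10_000)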

Related issues

linking to all relevant issues:

Experimental branch

You can add a check here: if not infos.get("TimeLimit.truncated", False): buffer.add(…). Such a flag is added to the info dictionary when the episode is truncated by the time limit.

As mentioned, I already created an experimental branch here: https://github.com/DLR-RM/stable-baselines3/compare/feat/remove-timelimit. It is in fact a bit trickier than expected (notably because VecEnv resets automatically).

For my work, I’ll use a time-feature and set gamma = 1.

You don't need gamma=1; this is independent of the infinite-horizon problem.

1 reaction
Miffyli commented, Jan 7, 2021

To summarize, so that I know I understood things right: you have a non-episodic task (never truly “done”), but you use TimeLimit to reset the game every now and then, and to train correctly you cannot apply terminal boundaries during training (they do not reflect the true agent setup).

There should not be a problem with this while using SAC, as long as you always feed in done=False. The biggest problem then is that the final timestep does not reflect environment behaviour (it was reset under the hood). The easiest fix is not to include it in the training data: you can add a check like if not infos.get("TimeLimit.truncated", False): buffer.add(...). This flag is added to the info dictionary when the episode is truncated by the time limit.
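As a rough, self-contained sketch of that check (not code from SB3; in SB3 the transition is added inside the algorithm's rollout collection, so applying the fix there means patching the library), written against the gym step API in use at the time of the issue, with a plain list standing in for the replay buffer and a placeholder environment:

import gym

env = gym.make("Pendulum-v1")  # placeholder env, not taken from the issue
replay_buffer = []             # stand-in for SB3's ReplayBuffer

obs = env.reset()
for _ in range(1_000):
    action = env.action_space.sample()
    next_obs, reward, done, info = env.step(action)
    # gym's TimeLimit wrapper sets info["TimeLimit.truncated"] = True when
    # the episode was cut short by the step limit rather than ended by the MDP.
    if not info.get("TimeLimit.truncated", False):
        replay_buffer.append((obs, action, reward, next_obs, done))
    obs = env.reset() if done else next_obs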

A more sophisticated solution would indeed be a nice enhancement though, as errors like these are easy to miss. I will mark it as an enhancement for some later versions of stable-baselines.
