[Feature Request] Proper TimeLimit/Infinite Horizon Handling for On-Policy algorithms
See original GitHub issue
Top GitHub Comments
Yup, and I understood your idea. Turns out I was wrong ^^'. I only noticed it now that I tried to type it out.
1) Current setup when termination is encountered in the next step (the terms between ~ ~ are struck out because next_non_terminal = 0 at a done step):

delta = self.rewards[step] + ~self.gamma * next_values * next_non_terminal~ - self.values[step]
last_gae_lam = delta + ~self.gamma * self.gae_lambda * next_non_terminal * last_gae_lam~

2) Ideal setup where timeouts are handled correctly (next state is a timeout termination; see the sketch after this list):

delta = self.rewards[step] + self.gamma * next_values ~* next_non_terminal~ - self.values[step]   <— bootstrap even though the step is done
last_gae_lam = delta + ~self.gamma * self.gae_lambda * next_non_terminal * last_gae_lam~   <— avoid leaking from the next episode

3) Changing the timeout reward to reward + next_value * gamma:

delta = (reward + next_value * self.gamma) + ~self.gamma * next_values * next_non_terminal~ - self.values[step]   <— same delta as in the example above
last_gae_lam = delta + ~self.gamma * self.gae_lambda * next_non_terminal * last_gae_lam~   <— avoid leaking from the next episode
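To make setup (2) concrete, here is a minimal standalone sketch of the GAE backward pass it implies. The timeouts and timeout_values arrays (a per-step truncation flag and V(terminal_observation)) are assumptions for illustration only; they are not fields of the actual RolloutBuffer:

import numpy as np

def gae_with_timeout_bootstrap(rewards, values, episode_starts, timeouts, timeout_values,
                               last_values, last_dones, gamma=0.99, gae_lambda=0.95):
    # Sketch of setup (2): bootstrap through TimeLimit truncations, but still cut
    # the lambda-trace at every episode boundary so nothing leaks across episodes.
    # Hypothetical inputs (not part of the real buffer):
    #   timeouts[t]       -- 1.0 if the episode ending at step t was cut by a TimeLimit
    #   timeout_values[t] -- V(terminal_observation) for that step, 0.0 otherwise
    buffer_size = len(rewards)
    advantages = np.zeros(buffer_size, dtype=np.float32)
    last_gae_lam = 0.0
    for step in reversed(range(buffer_size)):
        if step == buffer_size - 1:
            next_non_terminal = 1.0 - last_dones
            next_values = last_values
        else:
            next_non_terminal = 1.0 - episode_starts[step + 1]
            next_values = values[step + 1]
        # Bootstrap from the next stored state if the episode continues, or from the
        # truncated terminal observation if a TimeLimit was hit at this step.
        bootstrap = next_values * next_non_terminal + timeout_values[step] * timeouts[step]
        delta = rewards[step] + gamma * bootstrap - values[step]
        # next_non_terminal (not the timeout flag) still gates the recursion, so
        # advantages from the next episode never leak into this one.
        last_gae_lam = delta + gamma * gae_lambda * next_non_terminal * last_gae_lam
        advantages[step] = last_gae_lam
    returns = advantages + values
    return advantages, returns

The only difference from the current code is the bootstrap term inside delta; the last_gae_lam recursion is untouched, matching the two annotations in setup (2) above.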
Yup, at least in this part of the code ^^. I would still think the whole process through carefully, as “hacks” like this often break something (and sadly that is hard to test).
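For the “hack” in (3), the patch can live at rollout-collection time instead of in the buffer. A minimal sketch, assuming a value_fn hook that returns the critic’s estimate for an observation and the TimeLimit.truncated / terminal_observation info keys discussed in this issue; everything else here is illustrative, not the actual stable-baselines3 code:

def patch_truncated_rewards(rewards, dones, infos, value_fn, gamma=0.99):
    # Approach (3): fold the bootstrap into the reward at collection time, so the
    # unmodified GAE code computes the same delta as "bootstrap even though done".
    for idx, done in enumerate(dones):
        truncated = infos[idx].get("TimeLimit.truncated", False)
        terminal_obs = infos[idx].get("terminal_observation")
        if done and truncated and terminal_obs is not None:
            # reward + gamma * V(s_terminal), i.e. the rewritten timeout reward in (3)
            rewards[idx] += gamma * value_fn(terminal_obs)
    return rewards

The appeal is that compute_returns_and_advantage stays unchanged; the downside is that the stored reward no longer matches what the environment actually returned, which is exactly the sort of “hack” the comment above warns can quietly break something.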