[Feature Request] Proper TimeLimit/Infinite Horizon Handling for On-Policy algorithms
See original GitHub issue
Top GitHub Comments
Yup, and I understood your idea. Turns out I was wrong ^^'. I only noticed it now that I tried to type it out.
1) Current setup when termination is encountered in the next step (the terms between ~ ~ are struck out because next_non_terminal = 0 at a done step):

delta = self.rewards[step] + ~self.gamma * next_values * next_non_terminal~ - self.values[step]
last_gae_lam = delta + ~self.gamma * self.gae_lambda * next_non_terminal * last_gae_lam~

2) Ideal setup where timeouts are handled correctly (next state is a timeout termination; see the sketch after this list):

delta = self.rewards[step] + self.gamma * next_values ~* next_non_terminal~ - self.values[step]   <— bootstrap even though the step is done
last_gae_lam = delta + ~self.gamma * self.gae_lambda * next_non_terminal * last_gae_lam~   <— avoid leaking from the next episode

3) Changing the timeout reward to reward + next_value * gamma:

delta = (reward + next_value * self.gamma) + ~self.gamma * next_values * next_non_terminal~ - self.values[step]   <— same delta as in the example above
last_gae_lam = delta + ~self.gamma * self.gae_lambda * next_non_terminal * last_gae_lam~   <— avoid leaking from the next episode
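To make setup (2) concrete, here is a minimal standalone sketch of the GAE backward pass it implies. The timeouts and timeout_values arrays (a per-step truncation flag and V(terminal_observation)) are assumptions for illustration only; they are not fields of the actual RolloutBuffer:

import numpy as np

def gae_with_timeout_bootstrap(rewards, values, episode_starts, timeouts, timeout_values,
                               last_values, last_dones, gamma=0.99, gae_lambda=0.95):
    # Sketch of setup (2): bootstrap through TimeLimit truncations, but still cut
    # the lambda-trace at every episode boundary so nothing leaks across episodes.
    # Hypothetical inputs (not part of the real buffer):
    #   timeouts[t]       -- 1.0 if the episode ending at step t was cut by a TimeLimit
    #   timeout_values[t] -- V(terminal_observation) for that step, 0.0 otherwise
    buffer_size = len(rewards)
    advantages = np.zeros(buffer_size, dtype=np.float32)
    last_gae_lam = 0.0
    for step in reversed(range(buffer_size)):
        if step == buffer_size - 1:
            next_non_terminal = 1.0 - last_dones
            next_values = last_values
        else:
            next_non_terminal = 1.0 - episode_starts[step + 1]
            next_values = values[step + 1]
        # Bootstrap from the next stored state if the episode continues, or from the
        # truncated terminal observation if a TimeLimit was hit at this step.
        bootstrap = next_values * next_non_terminal + timeout_values[step] * timeouts[step]
        delta = rewards[step] + gamma * bootstrap - values[step]
        # next_non_terminal (not the timeout flag) still gates the recursion, so
        # advantages from the next episode never leak into this one.
        last_gae_lam = delta + gamma * gae_lambda * next_non_terminal * last_gae_lam
        advantages[step] = last_gae_lam
    returns = advantages + values
    return advantages, returns

The only difference from the current code is the bootstrap term inside delta; the last_gae_lam recursion is untouched, matching the two annotations in setup (2) above.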
Yup, at least in this part of the code ^^. I would still think the whole process through carefully, as “hacks” like this often break something (and sadly that is hard to test).
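For the “hack” in (3), the patch can live at rollout-collection time instead of in the buffer. A minimal sketch, assuming a value_fn hook that returns the critic’s estimate for an observation and the TimeLimit.truncated / terminal_observation info keys discussed in this issue; everything else here is illustrative, not the actual stable-baselines3 code:

def patch_truncated_rewards(rewards, dones, infos, value_fn, gamma=0.99):
    # Approach (3): fold the bootstrap into the reward at collection time, so the
    # unmodified GAE code computes the same delta as "bootstrap even though done".
    for idx, done in enumerate(dones):
        truncated = infos[idx].get("TimeLimit.truncated", False)
        terminal_obs = infos[idx].get("terminal_observation")
        if done and truncated and terminal_obs is not None:
            # reward + gamma * V(s_terminal), i.e. the rewritten timeout reward in (3)
            rewards[idx] += gamma * value_fn(terminal_obs)
    return rewards

The appeal is that compute_returns_and_advantage stays unchanged; the downside is that the stored reward no longer matches what the environment actually returned, which is exactly the sort of “hack” the comment above warns can quietly break something.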