[question] Enabling agents to keep bootstrapping in the last step of each episode
I am using stable-baselines 2.10.1 to train AC/ACER agents in a custom environment with a time limit of [0, T] per episode. In the last update of each episode, the value target is normally computed as
V(S^{T-1}) = r + 0
which treats state S^T as an absorbing state from which no further value is accrued. In the code, the factor (1. - done) implements this:
def discount_with_dones(rewards, dones, gamma):
    discounted = []
    ret = 0  # return: discounted reward
    for reward, done in zip(rewards[::-1], dones[::-1]):
        ret = reward + gamma * ret * (1. - done)  # fixed off by one bug
        discounted.append(ret)
    return discounted[::-1]
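For illustration, here is what the function returns for a short made-up rollout that ends on its last step (the numbers are just an example, gamma = 0.99):

rewards = [1.0, 1.0, 1.0]
dones = [False, False, True]  # episode ends on the last transition
print(discount_with_dones(rewards, dones, gamma=0.99))
# [2.9701, 1.99, 1.0] -- the last return is just r; nothing is bootstrapped past S^T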
However, in my time-limited case the update should instead be
V(S^{T-1}) = r + gamma * V(S^T)
because training terminates not because a terminal state was reached, but because time ran out; V(S^T) still has a meaningful value, so the agent should keep bootstrapping through this last step.
I skimmed the source code but neither found this functionality nor figured out where to change it. How can I enable this?
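What I have in mind is something like the following sketch. The names (discount_with_timeout_bootstrap, truncateds, last_value) are placeholders of my own, not stable-baselines API; the truncation flag would have to come from the environment, e.g. from info['TimeLimit.truncated'] when Gym's TimeLimit wrapper is used, and the sketch assumes the rollout ends exactly at an episode boundary:

def discount_with_timeout_bootstrap(rewards, dones, truncateds, last_value, gamma):
    # Like discount_with_dones, but when the last transition ended only because
    # of the time limit (truncated), seed the return with V(S^T) so that
    # bootstrapping continues through the final step.
    ret = last_value if truncateds[-1] else 0.0
    discounted = []
    for reward, done, truncated in zip(rewards[::-1], dones[::-1], truncateds[::-1]):
        if done and not truncated:
            ret = 0.0  # a real terminal state: no value afterwards
        ret = reward + gamma * ret
        discounted.append(ret)
    return discounted[::-1]

With this, the last return becomes r + gamma * V(S^T) instead of just r.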
Top GitHub Comments
That is OK. I have little experience with PPO, so I am trying ACER. Thank you, Miffyli, your comments and quotes are very helpful. I will test it myself.
Related to #863
There is no functionality to support this per se (indicating that an episode ended because of a timeout is not standardized in Gym, although some environments provide this in the info dict). An easy workaround is to include the episode time in the observations, as suggested in #863.
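As a rough sketch of that suggestion (not part of stable-baselines; the max_steps argument and the assumption of a Box observation space are mine), a Gym wrapper that appends the normalized elapsed time to each observation could look like this:

import numpy as np
import gym
from gym import spaces

class TimeAwareObservation(gym.ObservationWrapper):
    # Appends normalized elapsed time (t / max_steps) to a Box observation,
    # so the value function can tell "near the time limit" states apart.
    def __init__(self, env, max_steps):
        super().__init__(env)
        self.max_steps = max_steps
        self._t = 0
        low = np.append(self.observation_space.low, 0.0)
        high = np.append(self.observation_space.high, 1.0)
        self.observation_space = spaces.Box(low=low, high=high, dtype=np.float32)

    def reset(self, **kwargs):
        self._t = 0
        return self.observation(self.env.reset(**kwargs))

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self._t += 1
        return self.observation(obs), reward, done, info

    def observation(self, obs):
        return np.append(obs, self._t / self.max_steps).astype(np.float32)

# usage (YourEnv is a placeholder for the custom environment):
# env = TimeAwareObservation(YourEnv(), max_steps=T)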