
[question] Enabling agents to keep bootstrapping in the last step per episode

See original GitHub issue

I am using stable-baselines 2.10.1 to train A2C/ACER agents in a custom environment with a time limit [0, T] per episode. In the last update of each episode, the value function is normally updated with the target

V(S^{T-1}) = r + 0

which treats the state S^T as an absorbing state from which no further value is incurred. In the code, the factor (1. - done) implements this:

def discount_with_dones(rewards, dones, gamma):
    discounted = []
    ret = 0  # running discounted return
    # Iterate backwards through the rollout; (1. - done) zeroes the bootstrap at
    # episode boundaries, so a step with done=True contributes only its reward.
    for reward, done in zip(rewards[::-1], dones[::-1]):
        ret = reward + gamma * ret * (1. - done)
        discounted.append(ret)
    return discounted[::-1]
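
To illustrate (the example inputs here are mine, not from the issue), on the step where done is True only the immediate reward survives, and earlier steps discount back from it:

rewards = [1.0, 1.0, 1.0]
dones = [False, False, True]  # episode ends on the last step
gamma = 0.99

# Returns computed by discount_with_dones:
#   step 2: 1.0                       (done, so nothing is bootstrapped)
#   step 1: 1.0 + 0.99 * 1.0  = 1.99
#   step 0: 1.0 + 0.99 * 1.99 = 2.9701
print(discount_with_dones(rewards, dones, gamma))  # [2.9701, 1.99, 1.0]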

However, in my time-limited case, the update should instead be

V(S^{T-1}) = r + gamma * V(S^T)

since the episode ends not because a terminal state has been reached but because time runs out; V(S^T) still carries value, so the update should keep bootstrapping through this last step.
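
As an illustration only (these names are mine and do not exist in stable-baselines), a variant of discount_with_dones could keep bootstrapping when the episode is cut by the time limit, assuming the rollout covers a single episode, a separate timeouts flag marks time-limit terminations, and last_value is the critic's estimate of V(S^T):

def discount_with_timeout_bootstrap(rewards, dones, timeouts, last_value, gamma):
    # Hypothetical sketch: like discount_with_dones, except a step that ended
    # only because of the time limit keeps bootstrapping from last_value.
    discounted = []
    ret = last_value  # V(S^T); used when the rollout ends on a timeout
    for reward, done, timeout in zip(rewards[::-1], dones[::-1], timeouts[::-1]):
        if done and not timeout:
            ret = 0.0  # true terminal state: nothing to bootstrap from
        ret = reward + gamma * ret
        discounted.append(ret)
    return discounted[::-1]

With timeouts[-1] = True this gives V(S^{T-1}) = r + gamma * V(S^T) for the last step, while a true terminal step still yields r + 0.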

I skimmed through the source code but neither found this functionality nor figured out where it could be changed. How can I enable this?

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 10

Top GitHub Comments

1 reaction
guoyangqin commented, Sep 15, 2020

It is OK. I have little experience with PPO, so I am trying ACER. Thank you, Miffyli, your comments and quotes are very helpful. I will test it myself.

1 reaction
Miffyli commented, Sep 15, 2020

Related to #863

There is no functionality to support this per se (indicating the episode ended on timeout is not standardized in Gym, although some environments provide this in the info dict). An easy solution for this problem is to provide episode time in observations as suggested in #863.
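
A minimal sketch of that idea (a hypothetical wrapper, not code from #863 or stable-baselines; it assumes the wrapped environment has a Box observation space and a known step limit max_steps) could append the normalized remaining time to each observation:

import numpy as np
import gym
from gym import spaces

class RemainingTimeWrapper(gym.Wrapper):
    """Append normalized remaining episode time to each observation (sketch)."""

    def __init__(self, env, max_steps):
        super().__init__(env)
        self.max_steps = max_steps
        self._t = 0
        # Extend the (assumed Box) observation space with one extra value in [0, 1].
        low = np.append(env.observation_space.low, 0.0).astype(np.float32)
        high = np.append(env.observation_space.high, 1.0).astype(np.float32)
        self.observation_space = spaces.Box(low=low, high=high, dtype=np.float32)

    def reset(self, **kwargs):
        self._t = 0
        return self._add_time(self.env.reset(**kwargs))

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self._t += 1
        return self._add_time(obs), reward, done, info

    def _add_time(self, obs):
        remaining = 1.0 - self._t / self.max_steps
        return np.append(obs, remaining).astype(np.float32)

With the remaining time visible in the observation, the value function can distinguish a state seen just before the deadline from the same state early in the episode, which is the effect the time-aware update above is after.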


Top Results From Across the Web

Time Limits in Reinforcement Learning
(d) The agent with partial-episode bootstrapping maximizes its return over an indefinite horizon, it learns to go for the most rewarding goal.

Deep Exploration via Bootstrapped DQN
Each episode of interaction lasts N + 9 steps after which point the agent resets to the initial state s2. These are toy...

Reinforcement Learning Tutorial: Semi-gradient n-step Sarsa ...
However, since the selection of actions in an episode is stochastic, there will be a high variance among the returns. Also, because Gt...

TIME LIMITS IN REINFORCEMENT LEARNING
(c) An agent with the proposed partial-episode bootstrapping that continues to ...

Time Limits in Reinforcement Learning
The episodes terminate after a fixed number of steps T. The goal of the game is thus to jump at the last moment...
