[Bug] optimize_memory_usage not compatible with handle_timeout_termination
🐛 Bug
When using the `ReplayBuffer` class, setting both `optimize_memory_usage = True` and `handle_timeout_termination = True` leads to incorrect behaviour. This is because when `handle_timeout_termination = True`, the replay buffer sets the `done` value at the end of an episode to `False` if the episode ended due to a timeout:
https://github.com/DLR-RM/stable-baselines3/blob/d68f0a2411766beb6da58ee0e989d1a6a72869bc/stable_baselines3/common/buffers.py#L300-L302
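For reference, the linked lines compute the effective `done` flag roughly as `dones * (1 - timeouts)`. The snippet below paraphrases that logic with plain NumPy arrays; it is not a verbatim copy of the linked source.

```python
import numpy as np

# Paraphrase of the linked buffers.py logic (not verbatim): a transition
# that ended only because of a time limit is treated as non-terminal.
dones = np.array([1.0])     # the episode ended here ...
timeouts = np.array([1.0])  # ... but only because of a timeout
effective_done = dones * (1 - timeouts)
print(effective_done)  # [0.] -> the algorithm will bootstrap from next_observation
```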
In an algorithm like DQN, this means the Q-value target will bootstrap using the `next_observation` variable:
https://github.com/DLR-RM/stable-baselines3/blob/d68f0a2411766beb6da58ee0e989d1a6a72869bc/stable_baselines3/dqn/dqn.py#L196
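The linked line computes the standard one-step TD target; the sketch below paraphrases it with dummy values (the variable names mirror SB3's, but this is an illustration, not the exact library code).

```python
import numpy as np

# Paraphrase of the linked dqn.py target computation (not verbatim):
# with dones == 0 (forced by the timeout handling above), the target
# bootstraps from the greedy Q-value of next_observation.
rewards = np.array([[-1.0]])
dones = np.array([[0.0]])            # set to 0 because of the timeout
gamma = 0.99
next_q_values = np.array([[-42.0]])  # max_a Q(next_observation, a)
target_q_values = rewards + (1 - dones) * gamma * next_q_values
print(target_q_values)  # bootstraps from whatever next_observation holds
```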
However, this leads to incorrect behaviour if `optimize_memory_usage = True`, because in that case the `next_observation` variable is defined as:
https://github.com/DLR-RM/stable-baselines3/blob/d68f0a2411766beb6da58ee0e989d1a6a72869bc/stable_baselines3/common/buffers.py#L291-L292
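Roughly, with memory optimization there is no separate `next_observations` buffer: the next observation for index `i` is read back as `observations[(i + 1) % buffer_size]`. The snippet below illustrates that indexing in isolation; it paraphrases the linked code rather than copying it.

```python
import numpy as np

# With optimize_memory_usage=True the next observation is simply the
# observations entry at index + 1 (modulo the buffer size).
buffer_size = 4
observations = np.arange(buffer_size)
batch_inds = np.array([3])
next_obs = observations[(batch_inds + 1) % buffer_size]  # wraps around to index 0
```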
This `observations[index + 1]` entry gets overwritten by the first state of the new episode. So, when `optimize_memory_usage = True`, the replay buffer will not return the last state of the episode (which would be the correct behaviour). As a result, if both `optimize_memory_usage = True` and `handle_timeout_termination = True`, a reinforcement learning algorithm will sometimes have target values that bootstrap from the wrong state (the first state of the next episode, rather than the last state of the current episode).
Note that this behaviour does not happen if `optimize_memory_usage = False`, because in that case the `next_observation` variable is stored in a separate buffer and won't be overwritten by the first state of the new episode.
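To make the difference concrete, here is a simplified sketch of the two storage schemes (single environment, scalar observations, no normalization details; this paraphrases the buffer's behaviour and is not SB3 code):

```python
import numpy as np

buffer_size = 5

# optimize_memory_usage=False: two buffers; next_obs has dedicated storage
# and is never clobbered by later transitions.
observations = np.zeros(buffer_size)
next_observations = np.zeros(buffer_size)

def add_separate(pos: int, obs: float, next_obs: float) -> None:
    observations[pos] = obs
    next_observations[pos] = next_obs  # safe: separate buffer

# optimize_memory_usage=True: one buffer; next_obs lives at pos + 1, so the
# first observation of the next episode overwrites the terminal state.
shared_observations = np.zeros(buffer_size)

def add_shared(pos: int, obs: float, next_obs: float) -> None:
    shared_observations[pos] = obs
    shared_observations[(pos + 1) % buffer_size] = next_obs  # clobbered by the next add()
```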
To Reproduce
This bug only affects the Q-value updates when episodes end due to timeouts in the environment. In practice, the difference between the correct target and the incorrect target can be small. For this reason, I wasn't able to find a simple scenario in which this leads to noticeably incorrect training results. However, the code segment below illustrates the problem and shows why the behaviour is incorrect if both `optimize_memory_usage = True` and `handle_timeout_termination = True`.
```python
import gym
import numpy as np

from stable_baselines3 import DQN

env = gym.make('MountainCar-v0')
model = DQN("MlpPolicy", env, optimize_memory_usage=False,
            replay_buffer_kwargs={'handle_timeout_termination': True})
# run for 250 steps so that a single timeout (after 200 steps) ends up in the replay buffer
model.learn(total_timesteps=250)

index_of_done = np.where(model.replay_buffer.dones == 1)[0]
print(f"current observation: {model.replay_buffer.observations[index_of_done]}")
# if optimize_memory_usage is False, the next observation is stored in a
# separate buffer called next_observations
print(f"next observation if optimize_memory_usage is False: {model.replay_buffer.next_observations[index_of_done]}")
# if optimize_memory_usage is True, the next observation is stored in the same
# buffer as the current observations, but at index + 1,
# which is the state after an environment reset
print(f"next observation if optimize_memory_usage were True: {model.replay_buffer.observations[index_of_done + 1]}")
print(f"done: {model.replay_buffer.dones[index_of_done]}")
print(f"timeout: {model.replay_buffer.timeouts[index_of_done]}")
# the replay buffer will return done == False, so DQN will bootstrap the return
# from the next observation, which is the state after a reset if optimize_memory_usage is True
print(f"value of done returned by ReplayBuffer.sample(): {model.replay_buffer.dones[index_of_done] * (1 - model.replay_buffer.timeouts[index_of_done])}")
```
Output:

```
current observation: [[[-0.49728924  0.00461391]]]
next observation if optimize_memory_usage is False: [[[-0.49387246  0.00341679]]]
next observation if optimize_memory_usage were True: [[[-0.51673186  0.        ]]]
done: [[1.]]
timeout: [[1.]]
value of done returned by ReplayBuffer.sample(): [[0.]]
```
Expected behavior
If `handle_timeout_termination = True`, the replay buffer should return the last (terminal) state as the `next_observation` variable, regardless of whether `optimize_memory_usage` is `True` or `False`.
System Info
- OS: Ubuntu 20.04.4 LTS
- Python: 3.9.7
- Stable-Baselines3: 1.5.1a6
- PyTorch: 1.11.0
- GPU Enabled: True
- Numpy: 1.21.2
- Gym: 0.21.0
Checklist
- I have checked that there is no similar issue in the repo (required)
- I have read the documentation (required)
- I have provided a minimal working example to reproduce the bug (required)
Top GitHub Comments
Cheers! Will take a look over it.
Hmm you could submit a PR for that, but I personally would not have time to 100% verify it works correctly (well, that is what tests are for 😃). There are also no guarantees if it would be merged, but if you wish to do so, I won’t say no!
Edit: If you do not plan to make the PR, please close this issue 😃
Nice spot, and thank you for all the details! Sorry for the delay in response ^^.
Even if minor, this should be flagged by the code (or ideally, fixed). Sadly, the `optimize_memory_usage` path is somewhat headachy code that can mess with many things. The most obvious solution is to raise an exception if both flags are used; I lean towards an exception as this could potentially destroy someone's runs, and they can still run them without memory optimization. Of course, the ideal solution would be to fix the issue, and you seem to have the code to test out the implementation ready 😃 We would be happy to review a PR that addresses this issue, if you have the time to offer.
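A minimal sketch of such a guard, assuming it would live in `ReplayBuffer.__init__` (hypothetical, not the maintainers' actual patch):

```python
def _check_replay_buffer_flags(optimize_memory_usage: bool,
                               handle_timeout_termination: bool) -> None:
    # Hypothetical guard (a sketch, not actual SB3 code): refuse the flag
    # combination that silently produces wrong bootstrap targets.
    if optimize_memory_usage and handle_timeout_termination:
        raise ValueError(
            "optimize_memory_usage = True is incompatible with "
            "handle_timeout_termination = True; disable one of the two."
        )
```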