[Bug] optimize_memory_usage not compatible with handle_timeout_termination
🐛 Bug
When using the `ReplayBuffer` class, setting both `optimize_memory_usage = True` and `handle_timeout_termination = True` leads to incorrect behaviour. This is because when `handle_timeout_termination = True`, the replay buffer sets the `done` value at the end of an episode to `False` if the episode ended due to a timeout:
https://github.com/DLR-RM/stable-baselines3/blob/d68f0a2411766beb6da58ee0e989d1a6a72869bc/stable_baselines3/common/buffers.py#L300-L302
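For reference, the linked lines compute the effective `done` flag roughly as `dones * (1 - timeouts)`. The snippet below paraphrases that logic with plain NumPy arrays; it is not a verbatim copy of the linked source.

```python
import numpy as np

# Paraphrase of the linked buffers.py logic (not verbatim): a transition
# that ended only because of a time limit is treated as non-terminal.
dones = np.array([1.0])     # the episode ended here ...
timeouts = np.array([1.0])  # ... but only because of a timeout
effective_done = dones * (1 - timeouts)
print(effective_done)  # [0.] -> the algorithm will bootstrap from next_observation
```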
In an algorithm like DQN, this means the Q-value target will bootstrap using the `next_observation` variable:
https://github.com/DLR-RM/stable-baselines3/blob/d68f0a2411766beb6da58ee0e989d1a6a72869bc/stable_baselines3/dqn/dqn.py#L196
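The linked line computes the standard one-step TD target; the sketch below paraphrases it with dummy values (the variable names mirror SB3's, but this is an illustration, not the exact library code).

```python
import numpy as np

# Paraphrase of the linked dqn.py target computation (not verbatim):
# with dones == 0 (forced by the timeout handling above), the target
# bootstraps from the greedy Q-value of next_observation.
rewards = np.array([[-1.0]])
dones = np.array([[0.0]])            # set to 0 because of the timeout
gamma = 0.99
next_q_values = np.array([[-42.0]])  # max_a Q(next_observation, a)
target_q_values = rewards + (1 - dones) * gamma * next_q_values
print(target_q_values)  # bootstraps from whatever next_observation holds
```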
However, this leads to incorrect behaviour if `optimize_memory_usage = True`, because in that case the `next_observation` variable is defined as:
https://github.com/DLR-RM/stable-baselines3/blob/d68f0a2411766beb6da58ee0e989d1a6a72869bc/stable_baselines3/common/buffers.py#L291-L292
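Roughly, with memory optimization there is no separate `next_observations` buffer: the next observation for index `i` is read back as `observations[(i + 1) % buffer_size]`. The snippet below illustrates that indexing in isolation; it paraphrases the linked code rather than copying it.

```python
import numpy as np

# With optimize_memory_usage=True the next observation is simply the
# observations entry at index + 1 (modulo the buffer size).
buffer_size = 4
observations = np.arange(buffer_size)
batch_inds = np.array([3])
next_obs = observations[(batch_inds + 1) % buffer_size]  # wraps around to index 0
```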
This `observations[index + 1]` entry gets overwritten by the first state of the new episode. So, when `optimize_memory_usage = True`, the replay buffer will not return the last state of the episode (which would be the correct behaviour). As a result, if both `optimize_memory_usage = True` and `handle_timeout_termination = True`, a reinforcement learning algorithm will sometimes have target values that bootstrap from the wrong state (the first state of the next episode, rather than the last state of the current episode).
Note that this behaviour does not happen if `optimize_memory_usage = False`, because in that case the `next_observation` variable is stored in a separate buffer and won't be overwritten by the first state of the new episode.
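To make the difference concrete, here is a simplified sketch of the two storage schemes (single environment, scalar observations, no normalization details; this paraphrases the buffer's behaviour and is not SB3 code):

```python
import numpy as np

buffer_size = 5

# optimize_memory_usage=False: two buffers; next_obs has dedicated storage
# and is never clobbered by later transitions.
observations = np.zeros(buffer_size)
next_observations = np.zeros(buffer_size)

def add_separate(pos: int, obs: float, next_obs: float) -> None:
    observations[pos] = obs
    next_observations[pos] = next_obs  # safe: separate buffer

# optimize_memory_usage=True: one buffer; next_obs lives at pos + 1, so the
# first observation of the next episode overwrites the terminal state.
shared_observations = np.zeros(buffer_size)

def add_shared(pos: int, obs: float, next_obs: float) -> None:
    shared_observations[pos] = obs
    shared_observations[(pos + 1) % buffer_size] = next_obs  # clobbered by the next add()
```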
To Reproduce
This bug only affects the Q-value updates when episodes end due to timeouts in the environment. In practice, the difference between the correct target and the incorrect target can be small. For this reason, I wasn't able to find a simple scenario in which this leads to noticeably incorrect training results. However, the code segment below illustrates the problem and shows why the behaviour is incorrect if both `optimize_memory_usage = True` and `handle_timeout_termination = True`.
```python
import gym
import numpy as np

from stable_baselines3 import DQN

env = gym.make('MountainCar-v0')
model = DQN("MlpPolicy", env, optimize_memory_usage=False,
            replay_buffer_kwargs={'handle_timeout_termination': True})
# run for 250 steps so that a single timeout (after 200 steps) ends up in the replay buffer
model.learn(total_timesteps=250)

index_of_done = np.where(model.replay_buffer.dones == 1)[0]
print(f"current observation: {model.replay_buffer.observations[index_of_done]}")
# if optimize_memory_usage is False, the next observation is stored in a
# separate buffer called next_observations
print(f"next observation if optimize_memory_usage is False: {model.replay_buffer.next_observations[index_of_done]}")
# if optimize_memory_usage is True, the next observation is stored in the same
# buffer as the current observations, but at index + 1,
# which is the state after an environment reset
print(f"next observation if optimize_memory_usage were True: {model.replay_buffer.observations[index_of_done + 1]}")
print(f"done: {model.replay_buffer.dones[index_of_done]}")
print(f"timeout: {model.replay_buffer.timeouts[index_of_done]}")
# the replay buffer will return done == False, so DQN will bootstrap the return
# from the next observation, which is the state after a reset if optimize_memory_usage is True
print(f"value of done returned by ReplayBuffer.sample(): {model.replay_buffer.dones[index_of_done] * (1 - model.replay_buffer.timeouts[index_of_done])}")
```
Output:

```
current observation: [[[-0.49728924  0.00461391]]]
next observation if optimize_memory_usage is False: [[[-0.49387246  0.00341679]]]
next observation if optimize_memory_usage were True: [[[-0.51673186  0.        ]]]
done: [[1.]]
timeout: [[1.]]
value of done returned by ReplayBuffer.sample(): [[0.]]
```
Expected behavior
If `handle_timeout_termination = True`, the replay buffer should return the last (terminal) state as the `next_observation` variable, regardless of whether `optimize_memory_usage` is `True` or `False`.
System Info
- OS: Ubuntu 20.04.4 LTS
- Python: 3.9.7
- Stable-Baselines3: 1.5.1a6
- PyTorch: 1.11.0
- GPU Enabled: True
- Numpy: 1.21.2
- Gym: 0.21.0
Checklist
- I have checked that there is no similar issue in the repo (required)
- I have read the documentation (required)
- I have provided a minimal working example to reproduce the bug (required)
Top GitHub Comments
Cheers! Will take a look over it.
Hmm you could submit a PR for that, but I personally would not have time to 100% verify it works correctly (well, that is what tests are for 😃). There are also no guarantees if it would be merged, but if you wish to do so, I won’t say no!
Edit: If you do not plan to make the PR, please close this issue 😃
Nice spot, and thank you for all the details! Sorry for the delay in response ^^.
Even if minor, this should be flagged by the code (or ideally, fixed). Sadly, the `optimize_memory_usage` path is somewhat headachy code that can mess with many things. The most obvious solution is to raise an exception if both flags are used; I lean towards an exception as this could potentially destroy someone's runs, and they can still run them without memory optimization. Of course, the ideal solution would be to fix the issue, and you seem to have the code to test out the implementation ready 😃 We would be happy to review a PR that addresses this issue, if you have the time to offer.
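A minimal sketch of such a guard, assuming it would live in `ReplayBuffer.__init__` (hypothetical, not the maintainers' actual patch):

```python
def _check_replay_buffer_flags(optimize_memory_usage: bool,
                               handle_timeout_termination: bool) -> None:
    # Hypothetical guard (a sketch, not actual SB3 code): refuse the flag
    # combination that silently produces wrong bootstrap targets.
    if optimize_memory_usage and handle_timeout_termination:
        raise ValueError(
            "optimize_memory_usage = True is incompatible with "
            "handle_timeout_termination = True; disable one of the two."
        )
```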