[Feature Request] Double DQN

🚀 Feature

Add a Double DQN variant of the DQN algorithm.

Motivation

It’s on the roadmap: https://github.com/DLR-RM/stable-baselines3/issues/1.
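
For background, Double DQN (van Hasselt et al., 2016, “Deep Reinforcement Learning with Double Q-learning”) reduces the overestimation bias of vanilla DQN by decoupling action selection from action evaluation: the online network (weights \theta) picks the next action, while the target network (weights \theta^-) evaluates it. Only the TD target changes:

y_\text{DQN}    = r + \gamma \max_{a'} Q_{\theta^-}(s', a')
y_\text{Double} = r + \gamma \, Q_{\theta^-}\big(s', \arg\max_{a'} Q_{\theta}(s', a')\big)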

Pitch

I suggest we go from:

with th.no_grad():
    # Compute the next Q-values using the target network
    next_q_values = self.q_net_target(replay_data.next_observations)
    # Follow greedy policy: use the one with the highest value
    next_q_values, _ = next_q_values.max(dim=1)

to:

with th.no_grad():
    # Compute the next Q-values using the target network
    next_q_values = self.q_net_target(replay_data.next_observations)
    if self.double_dqn:
        # Use the current (online) network to select the action with the maximal Q-value
        max_actions = th.argmax(self.q_net(replay_data.next_observations), dim=1)
        # Evaluate the Q-value of that action using the frozen target network
        next_q_values = th.gather(next_q_values, dim=1, index=max_actions.unsqueeze(-1))
    else:
        # Follow greedy policy: use the one with the highest value
        next_q_values, _ = next_q_values.max(dim=1)

with double_dqn as an additional flag passed to the DQN constructor.
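
To make the pitch concrete, here is a minimal sketch of how the flag could be wired up by subclassing SB3’s DQN and overriding train(). This is illustrative only: the DoubleDQN class and the double_dqn argument are hypothetical (they are not part of the SB3 API), and the surrounding loop mirrors the structure of DQN.train() rather than reproducing it exactly:

import numpy as np
import torch as th
import torch.nn.functional as F
from stable_baselines3 import DQN


class DoubleDQN(DQN):  # hypothetical subclass, not part of SB3
    def __init__(self, *args, double_dqn: bool = True, **kwargs):
        super().__init__(*args, **kwargs)
        self.double_dqn = double_dqn

    def train(self, gradient_steps: int, batch_size: int = 100) -> None:
        self.policy.set_training_mode(True)
        self._update_learning_rate(self.policy.optimizer)
        losses = []
        for _ in range(gradient_steps):
            # Sample a batch of transitions from the replay buffer
            replay_data = self.replay_buffer.sample(batch_size, env=self._vec_normalize_env)
            with th.no_grad():
                next_q_values = self.q_net_target(replay_data.next_observations)
                if self.double_dqn:
                    # Online network selects the action, target network evaluates it
                    max_actions = th.argmax(self.q_net(replay_data.next_observations), dim=1)
                    next_q_values = th.gather(next_q_values, dim=1, index=max_actions.unsqueeze(-1))
                else:
                    # Vanilla DQN: target network both selects and evaluates
                    next_q_values, _ = next_q_values.max(dim=1)
                    next_q_values = next_q_values.reshape(-1, 1)
                # 1-step TD target
                target_q_values = replay_data.rewards + (1 - replay_data.dones) * self.gamma * next_q_values
            # Q-values for the actions that were actually taken
            current_q_values = th.gather(self.q_net(replay_data.observations), dim=1, index=replay_data.actions.long())
            loss = F.smooth_l1_loss(current_q_values, target_q_values)
            losses.append(loss.item())
            self.policy.optimizer.zero_grad()
            loss.backward()
            th.nn.utils.clip_grad_norm_(self.policy.parameters(), self.max_grad_norm)
            self.policy.optimizer.step()
        self._n_updates += gradient_steps
        self.logger.record("train/n_updates", self._n_updates, exclude="tensorboard")
        self.logger.record("train/loss", np.mean(losses))


# Example usage (CartPole just as a smoke test)
model = DoubleDQN("MlpPolicy", "CartPole-v1", verbose=1)
model.learn(total_timesteps=10_000)

Subclassing keeps the change opt-in without touching the upstream train() signature.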

Checklist

  • [x] I have checked that there is no similar issue in the repo (required)

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 18 (7 by maintainers)

Top GitHub Comments

2 reactions
ercoargante commented, Oct 22, 2021

As a lecturer in an AI program, I recommend that our students use Stable Baselines for their projects because of its ease of use and clear documentation (thanks for that!). From an educational point of view, Q-learning and DQN are a good introduction to RL, so students start off using DQN. Results with SB3’s DQN are much, much worse than with SB2 (both with default parameter values), which hampers the adoption of SB3 (and the enthusiasm of students for RL). I have not yet understood or investigated the reason for this difference; obvious candidates are the missing extensions such as PER and Double DQN, but of course that is an assumption. The goal of this comment is just to mention that progress in SB3 on this topic is much appreciated. If I can be of help, for example in testing improvements, let me know. Best regards, Erco Argante

2 reactions
NickLucche commented, Sep 16, 2021

Sorry for the long inactivity. I managed to run a few experiments with the proposed change on Pong and Breakout. I’ll leave the training curves here, although I can’t see much of a difference (which is at least on par with the original paper’s findings).

Vanilla DQN Pong: [training episodic reward curve]

Vanilla DQN Breakout: [training episodic reward curve]

Double DQN Pong: [training episodic reward curve]

Double DQN Breakout: [training episodic reward curve]
