[Bug] DQN Exploration divides by 0 when learn steps are small
Training a DQN agent for a few steps fails because of a divide-by-zero bug here:
class LinearSchedule(Schedule):
    #
    # .... other code ....
    #
    def value(self, step):
        # here self.schedule_timesteps == 0, so the division below raises ZeroDivisionError
        fraction = min(float(step) / self.schedule_timesteps, 1.0)
        return self.initial_p + fraction * (self.final_p - self.initial_p)
This is a consequence of the following in DQN's learn function:
def learn(self, total_timesteps, ...):
    self.exploration = LinearSchedule(
        schedule_timesteps=int(self.exploration_fraction * total_timesteps),
        initial_p=self.exploration_initial_eps,
        final_p=self.exploration_final_eps)
The bug occurs when self.exploration_fraction * total_timesteps is less than 1, because int() then truncates the product to 0 and schedule_timesteps becomes a zero divisor.
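To make the arithmetic concrete, here is a minimal sketch; the exploration_fraction value of 0.1 is an assumption (it is DQN's usual default, not quoted in this issue):
exploration_fraction = 0.1   # assumed default, not quoted in the issue
total_timesteps = 1          # as in agent.learn(1) below
schedule_timesteps = int(exploration_fraction * total_timesteps)
print(schedule_timesteps)    # 0 -- later used as the divisor in LinearSchedule.value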
Reproduce
import gym
import stable_baselines as sb
env = gym.make('CartPole-v1')
agent = sb.DQN('MlpPolicy', env)
agent.learn(1)
Traceback:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/stelios/anaconda3/envs/thesis37/lib/python3.7/site-packages/stable_baselines/deepq/dqn.py", line 201, in learn
update_eps = self.exploration.value(self.num_timesteps)
File "/Users/stelios/anaconda3/envs/thesis37/lib/python3.7/site-packages/stable_baselines/common/schedules.py", line 107, in value
fraction = min(float(step) / self.schedule_timesteps, 1.0)
ZeroDivisionError: float division by zero
Secondary issue
Assuming this is fixed, sizing the schedule with total_timesteps assumes that .learn will be called only once; ideally, the exploration rate should be independent of the number of calls.
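A minimal sketch of the effect, using LinearSchedule directly with illustrative numbers (the exploration_fraction scaling and learn's reset_num_timesteps behaviour are left out for brevity):
from stable_baselines.common.schedules import LinearSchedule

# One learn(200) call: at step 100 epsilon is halfway through its decay.
one_call = LinearSchedule(schedule_timesteps=200, final_p=0.02, initial_p=1.0)
print(one_call.value(100))     # 0.51

# The same 200 steps split into two learn(100) calls: each call rebuilds the
# schedule from its own total_timesteps, so at overall step 100 the freshly
# built 100-step schedule has already collapsed to final_p.
split_calls = LinearSchedule(schedule_timesteps=100, final_p=0.02, initial_p=1.0)
print(split_calls.value(100))  # 0.02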
Proposed solution
class LinearSchedule(Schedule):
    def __init__(self, schedule_timesteps, final_p, initial_p=1.0):
        # clamp to at least 1 so value() can never divide by zero
        self.schedule_timesteps = max(schedule_timesteps, 1)
        self.final_p = final_p
        self.initial_p = initial_p
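A quick, self-contained check of the idea; value() is repeated from the snippet at the top so the sketch runs on its own, and the Schedule base class (not shown in the issue) is left out:
class PatchedLinearSchedule:
    def __init__(self, schedule_timesteps, final_p, initial_p=1.0):
        # Clamp to at least 1 so value() can never divide by zero.
        self.schedule_timesteps = max(schedule_timesteps, 1)
        self.final_p = final_p
        self.initial_p = initial_p

    def value(self, step):
        fraction = min(float(step) / self.schedule_timesteps, 1.0)
        return self.initial_p + fraction * (self.final_p - self.initial_p)

# int(0.1 * 1) == 0, the case that previously crashed:
schedule = PatchedLinearSchedule(schedule_timesteps=int(0.1 * 1), final_p=0.02)
print(schedule.value(1))  # 0.02 instead of ZeroDivisionError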
Issue Analytics
- Created 4 years ago
- Comments: 10
Top GitHub Comments
Thanks @araffin. Due to some other issues that I encountered, I will derive the classes. For future reference, and for anyone else who runs into this: the callback doesn't have access to done, observation, etc. To access them, one needs to wrap the environment in something that keeps track of everything and read it back through BaseCallback.training_env. Indeed, DQN learns every n_step; however, it compares against self.num_timesteps, so it does work. The only callback that can exit the loop (which is a condition for the algorithm to work) is on_step. Exiting on on_step means skipping the network update, not storing the transition, and no logging.
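A minimal sketch of the callback pattern described above, assuming the stable_baselines.common.callbacks.BaseCallback API; the callback name and the 10,000-step cutoff are illustrative:
import gym
from stable_baselines import DQN
from stable_baselines.common.callbacks import BaseCallback

class TrackingCallback(BaseCallback):
    def _on_step(self):
        # done / observation are not passed to the callback directly; anything
        # needed per step has to be recorded by an env wrapper and read back
        # through self.training_env.
        # Returning False exits the training loop, which skips the network
        # update, transition storage and logging for the remaining steps.
        return self.num_timesteps < 10_000

env = gym.make('CartPole-v1')
agent = DQN('MlpPolicy', env)
agent.learn(total_timesteps=20_000, callback=TrackingCallback())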