
[Bug] DQN Exploration divides by 0 when learn steps are small

See original GitHub issue

Training a DQN agent for a few steps fails because of a divide by zero bug here:

class LinearSchedule(Schedule):
    # 
    # .... other code ....
    # 
    def value(self, step):
        # self.schedule_timesteps can be 0 here, making this division fail
        fraction = min(float(step) / self.schedule_timesteps, 1.0)
        return self.initial_p + fraction * (self.final_p - self.initial_p)

This is a consequence of the following in DQN's learn function:

def learn(total_timesteps, ....):
    self.exploration = LinearSchedule(
          schedule_timesteps=int(self.exploration_fraction * total_timesteps),
          initial_p=self.exploration_initial_eps,
          final_p=self.exploration_final_eps)

The bug occurs when self.exploration_fraction * total_timesteps is less than 1, because int() then truncates schedule_timesteps to 0.
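
For illustration, here is the arithmetic with an exploration_fraction of 0.1 (assumed here to be the library default; check your installed version):

exploration_fraction = 0.1
total_timesteps = 1                  # as in agent.learn(1) below
schedule_timesteps = int(exploration_fraction * total_timesteps)
print(schedule_timesteps)            # 0 -> value() divides by zero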

Reproduce

import gym
import stable_baselines as sb

env = gym.make('CartPole-v1')
agent = sb.DQN('MlpPolicy', env)
agent.learn(1)

Traceback:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/stelios/anaconda3/envs/thesis37/lib/python3.7/site-packages/stable_baselines/deepq/dqn.py", line 201, in learn
    update_eps = self.exploration.value(self.num_timesteps)
  File "/Users/stelios/anaconda3/envs/thesis37/lib/python3.7/site-packages/stable_baselines/common/schedules.py", line 107, in value
    fraction = min(float(step) / self.schedule_timesteps, 1.0)
ZeroDivisionError: float division by zero

Secondary issue

Even if this is fixed, tying the schedule to total_timesteps assumes the agent will call .learn only once, when ideally the exploration rate should be independent of the number of .learn calls (see the sketch below).
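
One way to decouple exploration from individual .learn calls is a schedule that tracks its own cumulative step count. This is only a sketch, not the library's API; the class name and the steps_seen attribute are made up:

class CumulativeLinearSchedule:
    """Hypothetical: anneals over its own cumulative step count, so repeated
    .learn() calls continue the schedule instead of restarting it."""

    def __init__(self, schedule_timesteps, final_p, initial_p=1.0):
        self.schedule_timesteps = max(schedule_timesteps, 1)  # guard against 0
        self.final_p = final_p
        self.initial_p = initial_p
        self.steps_seen = 0

    def value(self, step=None):  # `step` ignored; kept for interface parity
        self.steps_seen += 1
        fraction = min(float(self.steps_seen) / self.schedule_timesteps, 1.0)
        return self.initial_p + fraction * (self.final_p - self.initial_p)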

Proposed solution

class LinearSchedule(Schedule):

    def __init__(self, schedule_timesteps, final_p, initial_p=1.0):
        # Clamp to at least 1 step so value() can never divide by zero
        self.schedule_timesteps = max(schedule_timesteps, 1)
        self.final_p = final_p
        self.initial_p = initial_p
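
Combined with the value method shown earlier, the clamp makes the degenerate case collapse to a one-step schedule instead of crashing:

schedule = LinearSchedule(schedule_timesteps=0, final_p=0.02)
print(schedule.value(0))  # 1.0  (initial_p; fraction is 0/1)
print(schedule.value(1))  # 0.02 (fraction clamps to 1.0 -> final_p)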

Issue Analytics

  • State: open
  • Created: 4 years ago
  • Comments: 10

Top GitHub Comments

1 reaction
PartiallyTyped commented, Mar 22, 2020

Thanks @araffin. Due to some other issues I encountered, I will derive from the classes. For future reference, and for anyone else who encounters this: the callback doesn't have access to done, observation, etc. To access them, one needs to wrap the environment in something that keeps track of everything, and then access it through BaseCallback.training_env.
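
A minimal sketch of such a wrapper (the name is hypothetical; the exact way to reach it through the vectorized training_env depends on the stable-baselines version):

import gym

class TransitionRecorder(gym.Wrapper):
    """Hypothetical wrapper recording fields a callback cannot see directly."""

    def __init__(self, env):
        super().__init__(env)
        self.last_obs = None
        self.last_done = False

    def reset(self, **kwargs):
        self.last_obs = self.env.reset(**kwargs)
        self.last_done = False
        return self.last_obs

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self.last_obs, self.last_done = obs, done
        return obs, reward, done, info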

1 reaction
PartiallyTyped commented, Mar 21, 2020

Indeed, DQN learns only every n steps; however, the check compares against self.num_timesteps, so it does work:

if can_sample and self.num_timesteps > self.learning_starts \
                        and self.num_timesteps % self.train_freq == 0:

The only callback that can exit the loop (which is a condition for the algorithm to work) is on_step. Exiting via on_step means skipping the network update, not storing the transition, and no logging.
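
As an illustration of that on_step contract (the class below is hypothetical; only BaseCallback is from the library):

from stable_baselines.common.callbacks import BaseCallback

class EarlyStop(BaseCallback):
    """Hypothetical: returning False from _on_step ends .learn() early,
    skipping the update, transition storage, and logging for that step."""

    def __init__(self, max_steps):
        super().__init__()
        self.max_steps = max_steps
        self.steps = 0

    def _on_step(self):
        self.steps += 1
        return self.steps < self.max_steps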

Read more comments on GitHub >

Top Results From Across the Web

Deep Reinforcement learning: DQN, Double DQN, Dueling ...
In this blog article we will discuss deep Q-learning and four of its most important supplements. Double DQN, Dueling DQN, Noisy DQN and...
Read more >
Why Dividing by Zero is Undefined - University of North Georgia
In this video we're going to explore why dividing by zero is undefined. But first, what we need to do is familiarize ourselves...
Read more >
Deep Exploration via Bootstrapped DQN
Bootstrapping means to approximate the population distribution using a sample distribution. How to bootstrap? Step 1: Sample population data D with ...
Read more >
How to Avoid Exploding Gradients With Gradient Clipping
Exploding gradients can be avoided in general by careful configuration of the network model, such as choice of small learning rate, scaled ...
Read more >
Epsilon-Greedy Algorithm in Reinforcement Learning
The epsilon-greedy, where epsilon refers to the probability of choosing to explore, exploits most of the time with a small chance of exploring....
Read more >
