[Bug] DQN Exploration divides by 0 when learn steps are small
Training a DQN agent for a few steps fails because of a divide-by-zero bug here:
class LinearSchedule(Schedule):
    #
    # .... other code ....
    #
    def value(self, step):
        # here self.schedule_timesteps == 0, so the division below raises ZeroDivisionError
        fraction = min(float(step) / self.schedule_timesteps, 1.0)
        return self.initial_p + fraction * (self.final_p - self.initial_p)
This is a consequence of the following in DQN's learn function:
def learn(self, total_timesteps, ...):
    self.exploration = LinearSchedule(
        schedule_timesteps=int(self.exploration_fraction * total_timesteps),
        initial_p=self.exploration_initial_eps,
        final_p=self.exploration_final_eps)
The bug occurs when self.exploration_fraction * total_timesteps is less than 1, because int() then truncates the product to 0 and schedule_timesteps becomes a zero divisor.
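To make the arithmetic concrete, here is a minimal sketch; the exploration_fraction value of 0.1 is an assumption (it is DQN's usual default, not quoted in this issue):
exploration_fraction = 0.1   # assumed default, not quoted in the issue
total_timesteps = 1          # as in agent.learn(1) below
schedule_timesteps = int(exploration_fraction * total_timesteps)
print(schedule_timesteps)    # 0 -- later used as the divisor in LinearSchedule.value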
Reproduce
import gym
import stable_baselines as sb
env = gym.make('CartPole-v1')
agent = sb.DQN('MlpPolicy', env)
agent.learn(1)
Traceback:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/stelios/anaconda3/envs/thesis37/lib/python3.7/site-packages/stable_baselines/deepq/dqn.py", line 201, in learn
update_eps = self.exploration.value(self.num_timesteps)
File "/Users/stelios/anaconda3/envs/thesis37/lib/python3.7/site-packages/stable_baselines/common/schedules.py", line 107, in value
fraction = min(float(step) / self.schedule_timesteps, 1.0)
ZeroDivisionError: float division by zero
Secondary issue
Assuming this is fixed, sizing the schedule with total_timesteps assumes that .learn will be called only once; ideally, the exploration rate should be independent of the number of calls.
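A minimal sketch of the effect, using LinearSchedule directly with illustrative numbers (the exploration_fraction scaling and learn's reset_num_timesteps behaviour are left out for brevity):
from stable_baselines.common.schedules import LinearSchedule

# One learn(200) call: at step 100 epsilon is halfway through its decay.
one_call = LinearSchedule(schedule_timesteps=200, final_p=0.02, initial_p=1.0)
print(one_call.value(100))     # 0.51

# The same 200 steps split into two learn(100) calls: each call rebuilds the
# schedule from its own total_timesteps, so at overall step 100 the freshly
# built 100-step schedule has already collapsed to final_p.
split_calls = LinearSchedule(schedule_timesteps=100, final_p=0.02, initial_p=1.0)
print(split_calls.value(100))  # 0.02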
Proposed solution
class LinearSchedule(Schedule):
    def __init__(self, schedule_timesteps, final_p, initial_p=1.0):
        # clamp to at least 1 so value() can never divide by zero
        self.schedule_timesteps = max(schedule_timesteps, 1)
        self.final_p = final_p
        self.initial_p = initial_p
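A quick, self-contained check of the idea; value() is repeated from the snippet at the top so the sketch runs on its own, and the Schedule base class (not shown in the issue) is left out:
class PatchedLinearSchedule:
    def __init__(self, schedule_timesteps, final_p, initial_p=1.0):
        # Clamp to at least 1 so value() can never divide by zero.
        self.schedule_timesteps = max(schedule_timesteps, 1)
        self.final_p = final_p
        self.initial_p = initial_p

    def value(self, step):
        fraction = min(float(step) / self.schedule_timesteps, 1.0)
        return self.initial_p + fraction * (self.final_p - self.initial_p)

# int(0.1 * 1) == 0, the case that previously crashed:
schedule = PatchedLinearSchedule(schedule_timesteps=int(0.1 * 1), final_p=0.02)
print(schedule.value(1))  # 0.02 instead of ZeroDivisionError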
Issue Analytics
- Created 4 years ago
- Comments: 10
Top GitHub Comments
Thanks @araffin. Due to some other issues that I encountered, I will derive the classes. For future reference, and for anyone else who runs into this: the callback doesn't have access to done, observation, etc. To access them, one needs to wrap the environment in something that keeps track of everything and read it back through BaseCallback.training_env. Indeed, DQN learns every n_step; however, it compares against self.num_timesteps, so it does work. The only callback that can exit the loop (which is a condition for the algorithm to work) is on_step. Exiting on on_step means skipping the network update, not storing the transition, and no logging.
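A minimal sketch of the callback pattern described above, assuming the stable_baselines.common.callbacks.BaseCallback API; the callback name and the 10,000-step cutoff are illustrative:
import gym
from stable_baselines import DQN
from stable_baselines.common.callbacks import BaseCallback

class TrackingCallback(BaseCallback):
    def _on_step(self):
        # done / observation are not passed to the callback directly; anything
        # needed per step has to be recorded by an env wrapper and read back
        # through self.training_env.
        # Returning False exits the training loop, which skips the network
        # update, transition storage and logging for the remaining steps.
        return self.num_timesteps < 10_000

env = gym.make('CartPole-v1')
agent = DQN('MlpPolicy', env)
agent.learn(total_timesteps=20_000, callback=TrackingCallback())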