Reward remains .nan in all iterations during training
System information
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 18.04
- Ray installed from (source or binary): pip version
- Ray version: 0.5.3
- Python version: Python 3.6.6
Describe the problem
I am trying to train a very basic DQN agent using the Python API code from the documentation:
import gym, os, ray
import ray.rllib.agents.dqn as dqn
from ray.tune.logger import pretty_print

ray.init()
config = dqn.DEFAULT_CONFIG.copy()
env = gym.make("myenv-v02")
agent = dqn.DQNAgent(config=config, env="myenv-v02")

for i in range(1000):
    result = agent.train()
    print(pretty_print(result))
    if i % 100 == 0:
        checkpoint = agent.save()
        print("checkpoint saved at", checkpoint)
The problem is that in every iteration I get:
episode_len_mean: .nan
episode_reward_max: .nan
episode_reward_mean: .nan
episode_reward_min: .nan
episodes: 0
The reward never changes; it stays .nan from the first iteration to the last.
I tested my environment with the following script:
import gym

if __name__ == "__main__":
    env = gym.make('myenv-v02')
    env.reset()
    for i_episode in range(2):
        for t in range(2):
            action = env.action_space.sample()
            observation, reward, done, info = env.step(action)
            print(observation)
            print(reward)
The observations (states) and rewards change as expected.
What am I doing wrong in the Ray code? Why is my reward .nan?
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
Is it possible your environment hasn’t terminated yet? Rewards are only reported once an episode completes.
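For illustration, here is a minimal sketch of an environment whose episodes do terminate; the observation/action spaces, the reward, and the 200-step cap are hypothetical placeholders, not taken from myenv-v02:

import gym
import numpy as np
from gym import spaces

class TerminatingEnv(gym.Env):
    # Hypothetical example env; spaces and step limit are placeholders.
    def __init__(self):
        self.observation_space = spaces.Box(-1.0, 1.0, shape=(4,), dtype=np.float32)
        self.action_space = spaces.Discrete(2)
        self.max_steps = 200
        self.steps = 0

    def reset(self):
        self.steps = 0
        return self.observation_space.sample()

    def step(self, action):
        self.steps += 1
        obs = self.observation_space.sample()
        reward = 1.0  # placeholder reward
        # Returning done=True is what lets RLlib close the episode and
        # report episode_reward_mean; if done is never True, the episode
        # stats stay .nan and the episodes counter stays 0.
        done = self.steps >= self.max_steps
        return obs, reward, done, {}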
Setting the horizon config option does the job; you don't need to change your environment if the episode has no terminal state.
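A sketch of what that looks like with the config from the question; the horizon value of 200 is an arbitrary example, not a recommendation:

import ray
import ray.rllib.agents.dqn as dqn

ray.init()
config = dqn.DEFAULT_CONFIG.copy()
# Cap episodes at a fixed number of steps so RLlib can finish them and
# report rewards even if the env never returns done=True.
config["horizon"] = 200  # example value; tune to your task
agent = dqn.DQNAgent(config=config, env="myenv-v02")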