Irreproducible zoo trials
Hi, I am using the zoo to optimise the hyperparameters for SAC with a customised env. The command I used was
python3 train.py --algo sac --env FullFilterEnv-v0 --gym-packages gym_environment -n 50000 -optimize --eval-episodes 40 --n-trials 1000 --n-jobs 2 --sampler random --pruner median
I use --eval-episodes 40 to get a more stable estimate of each agent's performance.
Some details about the env: each episode is at most 5 steps long. The reward for an ordinary step is the negative of a Euclidean norm, say -||x - x_target||, and a successful step gets reward +100. Once +100 is received, the episode ends.
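To make the setup concrete, here is a minimal sketch of that reward structure (the class name, dynamics and success threshold are made up for illustration):

import gym
import numpy as np

class SketchEnv(gym.Env):
    # Hypothetical illustration of the reward structure described above (all names are made up).
    def reset(self):
        self.step_count = 0
        self.x = np.zeros(3)
        self.x_target = np.ones(3)
        return self.x

    def step(self, action):
        self.x = self.x + action                  # placeholder dynamics
        self.step_count += 1
        dist = np.linalg.norm(self.x - self.x_target)
        if dist < 1e-2:                           # hypothetical success condition
            reward, done = 100.0, True            # a successful step gives +100 and ends the episode
        else:
            reward, done = -dist, self.step_count >= 5  # ordinary step: -||x - x_target||; at most 5 steps
        return self.x, reward, done, {}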
In the zoo, I get some results like
[I 2020-09-29 07:35:38,656] Trial 697 finished with value: -100.0 and parameters: {'gamma': 0.5, 'lr': 0.009853989305797941, 'learning_starts': 50, 'batch_size': 64, 'buffer_size': 100000, 'train_freq': 256, 'tau': 0.01, 'ent_coef': 'auto', 'net_arch': 'deep', 'target_entropy': -100}. Best is trial 650 with value: -100.0.
That means that over the last 40 evaluation episodes after 50,000 timesteps, every episode finished in just one step and directly got the +100 reward, which seems too good to be true. So I took the recommended parameters and did a real training run on the same env, using the last 40 episodes to compute the mean ep_reward. But after 50,000 timesteps the mean ep_reward was only around -900, which is far from a success in each episode.
Notice that there are two trials that give -100. The same kind of "irreproducibility" happens with other trials as well. Is this something known about the zoo, or did I do something wrong?
BTW, I use the same random seed as in the zoo, i.e.,
import numpy as np

SEED = 0
np.random.seed(SEED)
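For reference, a sketch of the extra seeding that I believe also matters, assuming Stable Baselines 2 and a standard gym.Env:

import gym
import gym_environment  # registers FullFilterEnv-v0
from stable_baselines.common import set_global_seeds

set_global_seeds(SEED)  # should also cover Python's random module and TensorFlow
env = gym.make("FullFilterEnv-v0")
env.seed(SEED)          # seed the customised env itself (assumes a standard gym.Env)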
Here is the code I use in the callback to calculate the mean ep_reward:
import numpy as np
from stable_baselines.results_plotter import load_results, ts2xy

def _on_step(self) -> bool:
    if self.n_calls % self.check_freq == 0:
        # Retrieve training reward
        x, y = ts2xy(load_results(self.log_dir), 'timesteps')
        if len(x) > 0:
            # Mean training reward over the last 40 episodes
            mean_reward = np.mean(y[-40:])
            if self.verbose > 0:
                print("Num timesteps: {}".format(self.num_timesteps))
                print("Best mean reward: {:.2f} - Last mean reward per episode: {:.2f}".format(self.best_mean_reward, mean_reward))
            # New best model, you could save the agent here
            if mean_reward > self.best_mean_reward:
                self.best_mean_reward = mean_reward
                # Example for saving best model
                if self.verbose > 0:
                    print("Saving new best model to {}".format(self.save_path))
                self.model.save(self.save_path)
    return True
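For context, this is roughly how the callback is wired up (a sketch: the class name and attributes follow the Stable Baselines callback example, and the hyperparameters from the recommended trial are omitted):

import os
import numpy as np
import gym
import gym_environment  # registers FullFilterEnv-v0
from stable_baselines import SAC
from stable_baselines.bench import Monitor
from stable_baselines.common.callbacks import BaseCallback

class SaveOnBestTrainingRewardCallback(BaseCallback):
    # Hypothetical wrapper holding the _on_step() shown above;
    # name and attributes follow the Stable Baselines callback example.
    def __init__(self, check_freq, log_dir, verbose=1):
        super(SaveOnBestTrainingRewardCallback, self).__init__(verbose)
        self.check_freq = check_freq
        self.log_dir = log_dir
        self.save_path = os.path.join(log_dir, 'best_model')
        self.best_mean_reward = -np.inf
    # _on_step() as shown above goes here

log_dir = "./logs/"
os.makedirs(log_dir, exist_ok=True)
# Monitor writes the episode rewards that load_results() reads in the callback
env = Monitor(gym.make("FullFilterEnv-v0"), log_dir)

model = SAC("MlpPolicy", env, verbose=1)  # recommended hyperparameters would go here
model.learn(total_timesteps=50000,
            callback=SaveOnBestTrainingRewardCallback(check_freq=1000, log_dir=log_dir))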
You should raise an exception (assertion error) and the trial will be ignored. See https://github.com/araffin/rl-baselines-zoo/blob/master/utils/hyperparams_opt.py#L112
Please read the documentation for that.
Hi, thanks for the suggestions. I think I found the problem. I am using ent_coef='auto' in SAC. At a certain point the action becomes NaN, which makes the state of the env NaN as well. Since NaN is not covered by the condition checks in the step function, the done flag ends up being True even with a NaN state.
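Following the suggestion above, a minimal NaN guard in step() would probably have caught this (a sketch; _apply_action and the state variables are hypothetical):

import numpy as np

def step(self, action):
    # Fail fast instead of letting NaN propagate: an AssertionError here stops
    # training and makes the optimisation trial fail, so it cannot be reported
    # as a spuriously "successful" run.
    assert np.all(np.isfinite(action)), "Non-finite action: {}".format(action)
    self.x = self._apply_action(self.x, action)   # hypothetical dynamics
    assert np.all(np.isfinite(self.x)), "Non-finite state: {}".format(self.x)
    # ... compute reward and done as before ...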
I guess it is similar to this.
Question: the previous hyperparameter combination was recommended by the zoo. Can trials with NaNs be eliminated in the zoo itself, so that they are never recommended as the best trial (or are pruned)?
I saw that we can apply VecCheckNan to the env, but it seems step_async and step_wait are needed. Is there an example of what these functions look like?
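For reference, my current guess is that step_async and step_wait come from the VecEnv wrapper rather than from the env itself, so something like this sketch (assuming Stable Baselines 2) might be enough:

import gym
import gym_environment  # registers FullFilterEnv-v0
from stable_baselines.common.vec_env import DummyVecEnv, VecCheckNan

# DummyVecEnv supplies step_async()/step_wait(), so the custom env itself does not
# need them; VecCheckNan then raises as soon as a NaN/inf shows up in actions,
# observations or rewards.
env = DummyVecEnv([lambda: gym.make("FullFilterEnv-v0")])
env = VecCheckNan(env, raise_exception=True)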