
Basic CnnLstm policy not working with PPO on Atari Pong

See original GitHub issue

Bug description

Simply changing the policy from CnnPolicy to CnnLstmPolicy when training PPO2 on Atari Pong makes the training fail. With the standard CnnPolicy, training reaches roughly maximal performance within 10M steps.

Code

Here is the code:

import os
import gym
import numpy as np
import matplotlib.pyplot as plt

from stable_baselines.common.policies import CnnLstmPolicy
from stable_baselines import PPO2
from stable_baselines.common.cmd_util import make_atari_env
from stable_baselines.common.evaluation import evaluate_policy

# Single Pong environment; frame stacking is disabled here (the issue reports
# the same failure with or without it)
env = make_atari_env('PongNoFrameskip-v4', num_env=1, seed=0,
                     wrapper_kwargs={"frame_stack": False})

# nminibatches=1 because recurrent policies require the number of parallel
# environments to be a multiple of nminibatches (here num_env=1)
model = PPO2(CnnLstmPolicy, env, nminibatches=1, verbose=1,
             tensorboard_log="ppo2_atari_comparison")

# Train the agent
time_steps = 10000000
model.learn(total_timesteps=time_steps)
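
As an aside, the script above imports evaluate_policy but never calls it. With a recurrent policy, rolling out the trained model also means threading the LSTM state through model.predict. A minimal sketch, assuming the model and env defined above (recurrent policies expect the same number of environments at prediction time as at training time, here num_env=1):

# Roll out the trained recurrent policy, passing the LSTM state and the
# episode-done mask back into predict() at every step
obs = env.reset()
state = None          # LSTM hidden state; None at the very first step
done = [False]        # mask used to reset the hidden state between episodes
total_reward = 0.0
for _ in range(3000):
    action, state = model.predict(obs, state=state, mask=done)
    obs, reward, done, info = env.step(action)
    total_reward += reward[0]
print("Rollout reward:", total_reward)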

Additional notes

  • Please note that the result is the same whether one stacks frames or not.
  • Do you have any hint on how to address this? On such a simple test it shouldn’t be a matter of hyperparameters…

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 11

Top GitHub Comments

1 reaction
Miffyli commented, May 14, 2020

For Atari and PPO specifically, see here (obtained with some hyperparameter search, I believe).

1 reaction
araffin commented, May 14, 2020

Without frame-stacking:

| ep_reward_mean | steps  |
| -19.4          | 512000 |
| -18.5          | 614400 |
| -11.1          | 716800 |
|   2.36         | 819200 |
|  12.2          | 921600 |

How come you did not change the cliprange parameter, yet instead of the default 0.2 it is (‘cliprange’, ‘lin_0.1’)?

I’m using the hyperparams from the zoo (cf. the docs).
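
For reference, a zoo-style PPO2 configuration for Atari looks roughly like the sketch below. The values are only assumed from rl-baselines-zoo defaults (the exact ones live in the zoo's ppo2.yml); ‘lin_0.1’ denotes a cliprange annealed linearly from 0.1 to 0, which PPO2 accepts as a callable of the remaining training progress:

from stable_baselines import PPO2
from stable_baselines.common.policies import CnnLstmPolicy
from stable_baselines.common.cmd_util import make_atari_env

def linear_schedule(initial_value):
    # PPO2 calls the schedule with the remaining progress, going from 1 to 0
    def schedule(progress_remaining):
        return progress_remaining * initial_value
    return schedule

env = make_atari_env('PongNoFrameskip-v4', num_env=8, seed=0)

model = PPO2(
    CnnLstmPolicy,
    env,
    n_steps=128,
    nminibatches=4,      # num_env must be a multiple of nminibatches for LSTM policies
    noptepochs=4,
    ent_coef=0.01,
    learning_rate=linear_schedule(2.5e-4),
    cliprange=linear_schedule(0.1),   # the 'lin_0.1' entry from the zoo config
    verbose=1,
)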

Why did you deactivate the value-function clipping? I mean, is there a particular reason for that?

Not really; the original PPO does not have such a feature. And from experience, it does not help that much.
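
For context: in stable-baselines’ PPO2, value-function clipping is controlled by the cliprange_vf argument; per the documentation, passing a negative value deactivates it and recovers the behaviour of the original PPO paper. A minimal sketch, reusing the env from the issue’s script:

# Deactivate value-function clipping: a negative cliprange_vf turns it off,
# so only the policy objective is clipped, as in the original PPO paper
model = PPO2(CnnLstmPolicy, env, nminibatches=1, cliprange_vf=-1, verbose=1)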

I guess we can close this issue?

Read more comments on GitHub.
