
Basic CnnLstm policy not working with PPO on Atari Pong

See original GitHub issue

Bug description

Simply changing the policy from CnnPolicy to CnnLstmPolicy when training PPO2 on Atari Pong makes the training fail. With the standard CnnPolicy, training reaches roughly maximal performance within 10M steps.

Code

Here is the code:

import os
import gym
import numpy as np
import matplotlib.pyplot as plt

from stable_baselines.common.policies import CnnLstmPolicy
from stable_baselines import PPO2
from stable_baselines.common.cmd_util import make_atari_env
from stable_baselines.common.evaluation import evaluate_policy

# Single Pong environment; frame stacking is disabled here (the issue reports
# the same failure with or without it)
env = make_atari_env('PongNoFrameskip-v4', num_env=1, seed=0,
                     wrapper_kwargs={"frame_stack": False})

# nminibatches=1 because recurrent policies require the number of parallel
# environments to be a multiple of nminibatches (here num_env=1)
model = PPO2(CnnLstmPolicy, env, nminibatches=1, verbose=1,
             tensorboard_log="ppo2_atari_comparison")

# Train the agent
time_steps = 10000000
model.learn(total_timesteps=time_steps)
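
As an aside, the script above imports evaluate_policy but never calls it. With a recurrent policy, rolling out the trained model also means threading the LSTM state through model.predict. A minimal sketch, assuming the model and env defined above (recurrent policies expect the same number of environments at prediction time as at training time, here num_env=1):

# Roll out the trained recurrent policy, passing the LSTM state and the
# episode-done mask back into predict() at every step
obs = env.reset()
state = None          # LSTM hidden state; None at the very first step
done = [False]        # mask used to reset the hidden state between episodes
total_reward = 0.0
for _ in range(3000):
    action, state = model.predict(obs, state=state, mask=done)
    obs, reward, done, info = env.step(action)
    total_reward += reward[0]
print("Rollout reward:", total_reward)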

Additional notes

  • Please note that the result is the same whether one stacks frames or not.
  • Do you have any hint on how to address this? On such a simple test it shouldn’t be a matter of hyperparameters…

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 11

Top GitHub Comments

1 reaction
Miffyli commented, May 14, 2020

For Atari and PPO specifically, see here (obtained with some hyperparameter search, I believe).

1 reaction
araffin commented, May 14, 2020

Without frame-stacking:

| ep_reward_mean | steps  |
| -19.4          | 512000 |
| -18.5          | 614400 |
| -11.1          | 716800 |
|   2.36         | 819200 |
|  12.2          | 921600 |

How come you did not change the cliprange parameter, yet instead of the default 0.2 it is (‘cliprange’, ‘lin_0.1’)?

I’m using the hyperparams from the zoo (cf. the docs).
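
For reference, a zoo-style PPO2 configuration for Atari looks roughly like the sketch below. The values are only assumed from rl-baselines-zoo defaults (the exact ones live in the zoo's ppo2.yml); ‘lin_0.1’ denotes a cliprange annealed linearly from 0.1 to 0, which PPO2 accepts as a callable of the remaining training progress:

from stable_baselines import PPO2
from stable_baselines.common.policies import CnnLstmPolicy
from stable_baselines.common.cmd_util import make_atari_env

def linear_schedule(initial_value):
    # PPO2 calls the schedule with the remaining progress, going from 1 to 0
    def schedule(progress_remaining):
        return progress_remaining * initial_value
    return schedule

env = make_atari_env('PongNoFrameskip-v4', num_env=8, seed=0)

model = PPO2(
    CnnLstmPolicy,
    env,
    n_steps=128,
    nminibatches=4,      # num_env must be a multiple of nminibatches for LSTM policies
    noptepochs=4,
    ent_coef=0.01,
    learning_rate=linear_schedule(2.5e-4),
    cliprange=linear_schedule(0.1),   # the 'lin_0.1' entry from the zoo config
    verbose=1,
)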

Why did you deactivate the value-function clipping? I mean, is there a particular reason for that?

Not really; the original PPO does not have such a feature. And from experience, it does not help that much.
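
For context: in stable-baselines’ PPO2, value-function clipping is controlled by the cliprange_vf argument; per the documentation, passing a negative value deactivates it and recovers the behaviour of the original PPO paper. A minimal sketch, reusing the env from the issue’s script:

# Deactivate value-function clipping: a negative cliprange_vf turns it off,
# so only the policy objective is clipped, as in the original PPO paper
model = PPO2(CnnLstmPolicy, env, nminibatches=1, cliprange_vf=-1, verbose=1)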

I guess we can close this issue?

Read more comments on GitHub.
