[question] Custom PPO implementation does not train with Pong
Hi,
I am not sure whether this counts as a question I can ask you. I have been working on a PPO agent that seemed to train on my (custom) environment. To test how good or bad this PPO implementation actually is, I am now trying to train it on Atari games.
This, however, has not been a very rewarding exercise: I cannot get the agent to train. I had to include
env = NoopResetEnv(env, noop_max=30)
env = MaxAndSkipEnv(env, skip=4)
otherwise it did not produce any variety in actions. After adding these it runs for some time trying out different actions, but within a few epochs it starts producing the same action over and over again, and I cannot understand why. I looked into the stable-baselines code and saw that there are a few differences:
- The network for the actor (policy) and the critic (value) is the same.
- Your network architecture is different (I copied it into my actor and critic networks).
- The way the loss is calculated is very different. There are a few things that were not in the original paper, such as clipping of the value function. The PPO surrogate loss is also written differently: I was using the minimum of the two terms (r * advantage and the clipped term), whereas you take the maximum of the two negated terms. I understand that this is numerically the same, but I did not understand why it was done this way (see the sketch after this list).
- In the rl-baselines-zoo code there are even more wrappers for the environment, which come from the wrap_deepmind routine in atari_wrappers.py.
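For concreteness, here is how I read those two loss tricks, as a small NumPy sketch (the variable names are mine, purely for illustration, not taken from either codebase):

```python
# Small NumPy sketch of the two loss points above (names are mine, purely illustrative).
import numpy as np

clip_range = 0.2       # PPO clipping parameter (epsilon)
clip_range_vf = 0.2    # value-function clipping parameter

ratio = np.array([0.7, 1.0, 1.3])   # pi_new(a|s) / pi_old(a|s)
adv = np.array([1.0, -0.5, 2.0])    # advantage estimates

# Surrogate policy loss as in the paper: minimum of the unclipped and clipped terms.
pg_loss = -np.mean(np.minimum(ratio * adv,
                              np.clip(ratio, 1.0 - clip_range, 1.0 + clip_range) * adv))

# Clipped value loss (one of the tricks not in the original paper): the new value
# prediction may only move clip_range_vf away from the old prediction.
values = np.array([1.2, 0.4, -0.1])
old_values = np.array([1.0, 0.5, 0.0])
returns = np.array([1.5, 0.3, 0.2])
values_clipped = old_values + np.clip(values - old_values, -clip_range_vf, clip_range_vf)
vf_loss = 0.5 * np.mean(np.maximum((values - returns) ** 2,
                                   (values_clipped - returns) ** 2))
```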
The reason I still want my own agent to work is that I would then have a comparable measure for both my environment and the Atari environments. I understand that one suggestion could be to use the baselines agent, but then I suspect I would not really understand how it works, either for the Atari games or for my custom environment.
I am attaching a zip of the code I am trying to run: PongNoFrameskip-v4_our_Agents.zip
System Info
Describe the characteristic of your environment:
- source install of stable-baselines
- GPU (V100 and sometimes M4000)
- Python version: 3.6.8
- TensorFlow version: 1.14
Please don't mind my writing about this issue here; if you consider it a violation of the submission guidelines, I apologize. I think your repo is a very good and useful thing for any RL practitioner who wishes to understand why an agent trains, and who wishes to replicate results to gain confidence that it is not something written once only for a publication.
I will continue hunting this down, as there are (as I also note above) still differences between the code I attach here and yours. It is just that I do not completely follow why all those wrappers were added to the environment, nor why the loss is written the way it is.
With kind regards, Rohit
Top GitHub Comments
Sorry, I overlooked this issue. So the question is essentially: what are the tricks that make PPO work, particularly on Atari games?
For Atari, the preprocessing and the action repeat really matter; you can find a good explanation here.
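As a rough sketch of what that standard preprocessing looks like with the SB wrappers (the stack below mirrors what make_atari + wrap_deepmind apply; check stable_baselines/common/atari_wrappers.py for the exact defaults):

```python
# Sketch of the standard Atari preprocessing using the stable-baselines wrappers.
import gym
from stable_baselines.common.atari_wrappers import (NoopResetEnv, MaxAndSkipEnv,
                                                    EpisodicLifeEnv, FireResetEnv,
                                                    WarpFrame, ClipRewardEnv, FrameStack)

env = gym.make('PongNoFrameskip-v4')
env = NoopResetEnv(env, noop_max=30)   # random number of no-ops at reset
env = MaxAndSkipEnv(env, skip=4)       # action repeat (frame skip) + max over last frames
env = EpisodicLifeEnv(env)             # end the episode on life loss
if 'FIRE' in env.unwrapped.get_action_meanings():
    env = FireResetEnv(env)            # press FIRE to start games that require it
env = WarpFrame(env)                   # grayscale + resize to 84x84
env = ClipRewardEnv(env)               # clip rewards to {-1, 0, +1}
env = FrameStack(env, 4)               # stack the last 4 frames
```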
Regarding PPO, there are additional tricks beyond the original paper; the value-function clipping you noticed is one of them.
Regarding the min of the two terms vs. the max of the two negated terms: this does not matter, it is the same thing. As for why it is written that way, you would have to ask the people from OpenAI.
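If it helps, a one-line check (plain NumPy, not from the SB code) shows the equivalence:

```python
# min(a, b) == -max(-a, -b), so negating the maximum of the two negated terms
# gives exactly the minimum of the two terms used in the paper.
import numpy as np

a, b = np.random.randn(1000), np.random.randn(1000)
assert np.allclose(np.minimum(a, b), -np.maximum(-a, -b))
```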
In general, because of all this and the possible bugs in a custom implementation, I would recommend using a fully tested implementation (like the one from SB) instead of a custom one, unless your goal is to learn how to implement RL (I'm currently writing a PR that may help you too: #536).
EDIT: I forgot one point mentioned by @Miffyli: hyperparameters matter a lot too (including the number of workers, i.e. the number of envs).
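Something along these lines usually trains on Pong with SB (a rough sketch only; the hyperparameter values here are indicative, the tuned ones are in the rl-baselines-zoo ppo2 config):

```python
# Rough sketch: PPO2 on Pong with several parallel envs and the standard Atari preprocessing.
from stable_baselines import PPO2
from stable_baselines.common.cmd_util import make_atari_env
from stable_baselines.common.vec_env import VecFrameStack

# 8 parallel environments; make_atari_env applies the usual Atari wrappers.
env = make_atari_env('PongNoFrameskip-v4', num_env=8, seed=0)
env = VecFrameStack(env, n_stack=4)   # stack 4 frames, as in the DeepMind setup

model = PPO2('CnnPolicy', env, n_steps=128, nminibatches=4, noptepochs=4,
             ent_coef=0.01, learning_rate=2.5e-4, cliprange=0.1, verbose=1)
model.learn(total_timesteps=int(1e7))
```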
As you mentioned yourself, this is not the place for tech support. However, since the questions you raised are good ones, I will try to answer them:
Your best bet is to start with known, working hyperparameters like the ones available in the rl-zoo, and tune from there for your custom environment.
Edit: See the better answer above ^^