[question] Custom PPO implementation does not train with Pong
Hi,
I am not sure whether this counts as a question I can ask you. I have been working on a PPO agent that seemed to train on my (custom) environment. To test how good or bad this PPO implementation actually is, I am now trying to train it on Atari games.
This, however, has not been a very rewarding exercise: I cannot get the agent to train. I had to include
env = NoopResetEnv(env, noop_max=30)
env = MaxAndSkipEnv(env, skip=4)
otherwise it did not produce any variety in actions. After adding these it runs for some time trying out different actions, but within a few epochs it starts producing the same action over and over again, and I cannot understand why. I looked into the stable-baselines code and saw that there are a few differences:
- The network for the actor (policy) and the critic (value) is the same.
- Your network architecture is different (I copied it into my actor and critic networks).
- The way the loss is calculated is very different. There are a few things that were not in the original paper, such as clipping of the value function. The PPO surrogate loss is also written differently: I was using the minimum of the two terms (r * advantage and the clipped term), whereas you take the maximum of the two negated terms. I understand that this is numerically the same, but I did not understand why it was done this way (see the sketch after this list).
- In the rl-baselines-zoo code there are even more wrappers for the environment, which come from the wrap_deepmind routine in atari_wrappers.py.
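For concreteness, here is how I read those two loss tricks, as a small NumPy sketch (the variable names are mine, purely for illustration, not taken from either codebase):

```python
# Small NumPy sketch of the two loss points above (names are mine, purely illustrative).
import numpy as np

clip_range = 0.2       # PPO clipping parameter (epsilon)
clip_range_vf = 0.2    # value-function clipping parameter

ratio = np.array([0.7, 1.0, 1.3])   # pi_new(a|s) / pi_old(a|s)
adv = np.array([1.0, -0.5, 2.0])    # advantage estimates

# Surrogate policy loss as in the paper: minimum of the unclipped and clipped terms.
pg_loss = -np.mean(np.minimum(ratio * adv,
                              np.clip(ratio, 1.0 - clip_range, 1.0 + clip_range) * adv))

# Clipped value loss (one of the tricks not in the original paper): the new value
# prediction may only move clip_range_vf away from the old prediction.
values = np.array([1.2, 0.4, -0.1])
old_values = np.array([1.0, 0.5, 0.0])
returns = np.array([1.5, 0.3, 0.2])
values_clipped = old_values + np.clip(values - old_values, -clip_range_vf, clip_range_vf)
vf_loss = 0.5 * np.mean(np.maximum((values - returns) ** 2,
                                   (values_clipped - returns) ** 2))
```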
The reason I still want my own agent to work is that I would then have a comparable measure for both my environment and the Atari environments. I understand that one suggestion could be to use the baselines agent, but then I suspect I would not really understand how it works, either for the Atari games or for my custom environment.
I am attaching a zip of the code I am trying to run: PongNoFrameskip-v4_our_Agents.zip
System Info
Describe the characteristic of your environment:
- source install of stable-baselines
- GPU (V100 and sometimes M4000)
- Python version: 3.6.8
- TensorFlow version: 1.14
Please don't mind my writing about this issue here; if you consider it a violation of the submission guidelines, I apologize. I think your repo is a very good and useful thing for any RL practitioner who wishes to understand why an agent trains, and who wishes to replicate results to gain confidence that it is not something written once only for a publication.
I will continue hunting this down, as there are (as I also note above) still differences between the code I attach here and yours. It is just that I do not completely follow why all those wrappers were added to the environment, nor why the loss is written the way it is.
With kind regards, Rohit
Top GitHub Comments
Sorry, I overlooked this issue. So the question is essentially: what are the tricks that make PPO work, particularly on Atari games?
For Atari, the preprocessing and the action repeat really matter; you can find a good explanation here.
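As a rough sketch of what that standard preprocessing looks like with the SB wrappers (the stack below mirrors what make_atari + wrap_deepmind apply; check stable_baselines/common/atari_wrappers.py for the exact defaults):

```python
# Sketch of the standard Atari preprocessing using the stable-baselines wrappers.
import gym
from stable_baselines.common.atari_wrappers import (NoopResetEnv, MaxAndSkipEnv,
                                                    EpisodicLifeEnv, FireResetEnv,
                                                    WarpFrame, ClipRewardEnv, FrameStack)

env = gym.make('PongNoFrameskip-v4')
env = NoopResetEnv(env, noop_max=30)   # random number of no-ops at reset
env = MaxAndSkipEnv(env, skip=4)       # action repeat (frame skip) + max over last frames
env = EpisodicLifeEnv(env)             # end the episode on life loss
if 'FIRE' in env.unwrapped.get_action_meanings():
    env = FireResetEnv(env)            # press FIRE to start games that require it
env = WarpFrame(env)                   # grayscale + resize to 84x84
env = ClipRewardEnv(env)               # clip rewards to {-1, 0, +1}
env = FrameStack(env, 4)               # stack the last 4 frames
```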
Regarding PPO, there are additional tricks beyond the original paper; the value-function clipping you noticed is one of them.
Regarding the min of the two terms vs. the max of the two negated terms: this does not matter, it is the same thing. As for why it is written that way, you would have to ask the people from OpenAI.
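If it helps, a one-line check (plain NumPy, not from the SB code) shows the equivalence:

```python
# min(a, b) == -max(-a, -b), so negating the maximum of the two negated terms
# gives exactly the minimum of the two terms used in the paper.
import numpy as np

a, b = np.random.randn(1000), np.random.randn(1000)
assert np.allclose(np.minimum(a, b), -np.maximum(-a, -b))
```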
In general, because of all this and the possible bugs in a custom implementation, I would recommend using a fully tested implementation (like the one from SB) instead of a custom one, unless your goal is to learn how to implement RL (I'm currently writing a PR that may help you too: #536).
EDIT: I forgot one point mentioned by @Miffyli: hyperparameters matter a lot too (including the number of workers, i.e. the number of envs).
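Something along these lines usually trains on Pong with SB (a rough sketch only; the hyperparameter values here are indicative, the tuned ones are in the rl-baselines-zoo ppo2 config):

```python
# Rough sketch: PPO2 on Pong with several parallel envs and the standard Atari preprocessing.
from stable_baselines import PPO2
from stable_baselines.common.cmd_util import make_atari_env
from stable_baselines.common.vec_env import VecFrameStack

# 8 parallel environments; make_atari_env applies the usual Atari wrappers.
env = make_atari_env('PongNoFrameskip-v4', num_env=8, seed=0)
env = VecFrameStack(env, n_stack=4)   # stack 4 frames, as in the DeepMind setup

model = PPO2('CnnPolicy', env, n_steps=128, nminibatches=4, noptepochs=4,
             ent_coef=0.01, learning_rate=2.5e-4, cliprange=0.1, verbose=1)
model.learn(total_timesteps=int(1e7))
```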
As you mentioned yourself, this is not the place for tech support. However, since the questions you raised are good ones, I will try to answer them:
Your best bet is to start with known, working hyperparameters like the ones available in the rl-zoo, and tune from there for your custom environment.
Edit: See the better answer above ^^