[Question] Why is it so difficult to learn 0?
I’ve been working a lot with environments that have continuous action spaces, and I’ve noticed some strange behavior: agents seem to have a very hard time learning the optimal action when the optimal action is zero. To test this, I’ve created a very simple environment where the agent simply chooses continuous values. The reward is shaped so that the agent is encouraged to choose 1 for the first N values and 0 for the next N values, like so:
def step(self, action):
    first_N = action[:self.N]
    second_N = action[self.N:]
    first_N_should_be = np.ones(self.N)
    second_N_should_be = np.zeros(self.N)
    # penalize the distance from the target vector (ones, then zeros)
    reward = np.linalg.norm(first_N_should_be - first_N) + np.linalg.norm(second_N_should_be - second_N)
    return obs, -reward, done, info
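For reference, a self-contained sketch of the toy environment could look roughly like this (the class name OnesAndZerosEnv, the constant dummy observation, and the single-step episodes are illustrative assumptions, not my exact code):

import gym
import numpy as np
from gym import spaces

class OnesAndZerosEnv(gym.Env):
    """Toy environment: the first N action dims should be 1, the next N should be 0."""

    def __init__(self, N=5):
        super().__init__()
        self.N = N
        # normalized action space in [-1, 1], as the docs recommend
        self.action_space = spaces.Box(low=-1.0, high=1.0, shape=(2 * N,), dtype=np.float32)
        # the observation carries no information here; a constant placeholder
        self.observation_space = spaces.Box(low=0.0, high=1.0, shape=(1,), dtype=np.float32)

    def reset(self):
        return np.zeros(1, dtype=np.float32)

    def step(self, action):
        first_N = action[:self.N]
        second_N = action[self.N:]
        # distance from the target (ones, then zeros), negated so closer is better
        reward = np.linalg.norm(np.ones(self.N) - first_N) + np.linalg.norm(np.zeros(self.N) - second_N)
        return np.zeros(1, dtype=np.float32), -reward, True, {}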
I ran this with PPO2, A2C, and ACKTR for 3 million steps. Each time, the agent is able to learn 1 for the first N values very quickly, but it seems to have a very hard time learning 0 for the second_N values. Here is a graph demonstrating the average action taken over 200 steps for policies trained for 1mil, 2mil, and 3mil steps with PPO2. The black dots are the average and the flat lines are 1 standard deviation away.
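The training runs themselves were just the stock learners, something along the lines of the sketch below (the DummyVecEnv wrapping and the MlpPolicy choice are assumptions about the setup):

from stable_baselines import PPO2
from stable_baselines.common.vec_env import DummyVecEnv

# wrap the toy environment for stable-baselines and train for 3 million steps
env = DummyVecEnv([lambda: OnesAndZerosEnv(N=5)])
model = PPO2('MlpPolicy', env, verbose=1)
model.learn(total_timesteps=3_000_000)
# the A2C and ACKTR runs are the same, just swapping the algorithm class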
The agent does seem to be learning to choose 0 better over time, because the standard deviation shrinks for longer-trained policies, but it takes MUCH longer than learning 1. I find this very strange. Why is it so difficult for the agent to explore 0?
This is similar to #473, but the answers there don’t address my question. For the record, I am using a normalized action space of [-1, 1].
System Info
- Stable baselines 2.8.0 installed with pip
- Python version: 3.7.4
- Tensorflow version: 1.14
- Numpy version: 1.17
Top GitHub Comments
Thank you for clarifying this. It seems that the right way to think about this is not that it is having a hard time “learning 0” but that it is really good at “learning 1” because of clipping the Gaussian distribution.
I’m curious why the documentation says that we should normalize the continuous action space to [-1, 1]. If PPO2 really starts with a Gaussian distribution with mu 0 and std 1, then this action space normalization will clip about 32% of the sampled actions. It seems it would be better to use bounds of [-2, 2] (only ~5% clipped) or even [-3, 3] (~0.3%). Furthermore, why use a Gaussian distribution instead of a Uniform distribution across the entire action space? A Uniform would cover the space better…
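A quick back-of-the-envelope check of those percentages, using plain NumPy sampling rather than anything from stable-baselines:

import numpy as np

# fraction of a standard Normal (mu=0, std=1) falling outside various bounds,
# i.e. the actions that would get clipped by the action-space limits
samples = np.random.randn(1_000_000)
for bound in (1.0, 2.0, 3.0):
    clipped = np.mean(np.abs(samples) > bound)
    print(f"bounds [-{bound:.0f}, {bound:.0f}]: ~{clipped:.1%} of samples clipped")
# roughly 31.7%, 4.6%, and 0.3% respectively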
PPO samples from a distribution to select the next action. This means that even if 0 is the best action as dictated by the policy, there is still a chance of selecting other values. Maybe half the Gaussian is truncated at the bounds [-1, 1]? If half the probability mass was above 1 but all of those samples were clipped to 1, you would see it ‘converge’ twice as fast to 1. Just a thought.
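One way to see that effect numerically, assuming a unit-std Gaussian whose mean already sits on each target (plain NumPy, not the actual PPO2 policy):

import numpy as np

# same std for both targets; actions are clipped to the [-1, 1] action space
std = 1.0
for target in (1.0, 0.0):
    actions = np.clip(np.random.normal(target, std, 1_000_000), -1.0, 1.0)
    print(f"target {target}: mean |error| of executed actions = {np.abs(actions - target).mean():.3f}")
# With the mean at 1, about half the samples are clipped to exactly 1, so the
# executed actions already sit close to the target. At 0 there is no such help
# from clipping, and only shrinking the std tightens the distribution.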