
[Question] Why so difficult to learn 0?

See original GitHub issue

I’ve been working a lot with environments that have continuous action spaces, and I’ve noticed some strange behavior: agents seem to have a very hard time learning the optimal action when that optimal action is zero. To test this, I’ve created a very simple environment where the agent simply chooses continuous values. The reward is shaped so that the agent is encouraged to choose 1 for the first N values and 0 for the next N values, like so:

def step(self, action):
    first_N = action[:self.N]
    second_N = action[self.N:]

    # Targets: ones for the first N dimensions, zeros for the next N.
    first_N_should_be = np.ones(self.N)
    second_N_should_be = np.zeros(self.N)

    # Distance from the targets; negated so that matching them maximizes reward.
    reward = np.linalg.norm(first_N_should_be - first_N) + np.linalg.norm(second_N_should_be - second_N)
    return obs, -reward, done, info
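
For context, a fully self-contained version of this toy environment looks roughly like the following. The class name, the N=4 default, the single-step episodes, and the constant dummy observation are illustrative details I’ve filled in, not part of the original code:

import gym
import numpy as np
from gym import spaces

class OnesAndZerosEnv(gym.Env):
    """Toy env: reward is highest when the first N actions are 1 and the next N are 0."""

    def __init__(self, N=4):
        self.N = N
        # Normalized action space, as the Stable Baselines docs recommend.
        self.action_space = spaces.Box(low=-1.0, high=1.0, shape=(2 * N,), dtype=np.float32)
        # Constant dummy observation; the task does not depend on any state.
        self.observation_space = spaces.Box(low=0.0, high=1.0, shape=(1,), dtype=np.float32)

    def reset(self):
        return np.zeros(1, dtype=np.float32)

    def step(self, action):
        first_N = action[:self.N]
        second_N = action[self.N:]
        # Negative distance from the targets (ones, then zeros); 0 is the best possible reward.
        reward = -(np.linalg.norm(np.ones(self.N) - first_N)
                   + np.linalg.norm(np.zeros(self.N) - second_N))
        # Single-step episodes: every step ends the episode.
        return np.zeros(1, dtype=np.float32), reward, True, {}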

I ran this with PPO2, A2C, and ACKTR for 3 million steps. Each time, the agent is able to learn 1 for the first N values very quickly, but it seems to have a very hard time learning 0 for the second_N values. Here is a graph demonstrating the average action taken over 200 steps for policies trained for 1mil, 2mil, and 3mil steps with PPO2. The black dots are the average and the flat lines are 1 standard deviation away.

[Screenshot: average action per dimension for policies trained for 1, 2, and 3 million steps]

The agent does seem to be learning to choose 0 better over time, since the standard deviation shrinks for longer-trained policies, but it is taking MUCH longer than learning 1. I find this very strange. Why is it so difficult for the agent to explore 0?
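
For reference, the per-dimension averages and standard deviations in a plot like this can be produced roughly as follows. This is a sketch assuming the hypothetical OnesAndZerosEnv above; the training length and variable names are illustrative:

import numpy as np
from stable_baselines import PPO2

env = OnesAndZerosEnv(N=4)            # hypothetical env from the sketch above
model = PPO2("MlpPolicy", env, verbose=0)
model.learn(total_timesteps=1000000)

# Sample stochastic actions for 200 evaluation steps (predict is stochastic by default).
actions = []
obs = env.reset()
for _ in range(200):
    action, _ = model.predict(obs)    # deterministic=False by default
    actions.append(action)
    obs = env.reset()                 # every episode in this toy env is a single step

actions = np.asarray(actions)         # shape (200, 2 * N)
mean_per_dim = actions.mean(axis=0)   # the black dots in the figure
std_per_dim = actions.std(axis=0)     # the +/- 1 standard deviation lines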

This is similar to #473, but the answers there don’t address my question. For the record, I am using a normalized action space of [-1, 1].

System Info
Describe the characteristics of your environment:

  • Stable Baselines 2.8.0 installed with pip
  • Python version: 3.7.4
  • Tensorflow version: 1.14
  • Numpy version: 1.17

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Comments: 7

Top GitHub Comments

1 reaction
rusu24edward commented, Mar 5, 2020

PPO samples from a distribution to select the next action. This means that even if 0 is the best action as dictated by the policy, there is still a chance of selecting other values. Maybe half the Gaussian is truncated at the bounds [-1, 1]? If half the probability mass was >1 but all samples were rounded to 1, you may see it ‘converge’ 2x as fast to 1. Just a thought.

Thank you for clarifying this. It seems that the right way to think about this is not that it is having a hard time “learning 0” but that it is really good at “learning 1” because of clipping the Gaussian distribution.

I’m curious why the documentation says that we should normalize the continuous action space to [-1, 1]. If PPO2 really starts with a Gaussian distribution with mu 0 and std 1, then this action space normalization will clip about 32% of the actions. It seems it would be better to clip at [-2, 2] to cover 95% of the mass, or even [-3, 3] for 99%. Furthermore, why use a Gaussian distribution instead of a Uniform distribution across the entire action space? Uniform would cover the space better…
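
A quick numpy sketch of the clipping asymmetry (assuming a policy standard deviation of roughly 1, which is what PPO2 starts from):

import numpy as np

rng = np.random.RandomState(0)
n = 100000

# Policy whose Gaussian mean has reached the target 1: about half of the
# probability mass lies above the bound, and clipping puts it exactly on 1.
at_one = np.clip(rng.normal(loc=1.0, scale=1.0, size=n), -1.0, 1.0)

# Policy whose Gaussian mean has reached the target 0: roughly 32% of the mass
# falls outside [-1, 1], but it is split across both bounds and the remaining
# samples still spread widely around 0.
at_zero = np.clip(rng.normal(loc=0.0, scale=1.0, size=n), -1.0, 1.0)

print((at_one == 1.0).mean())           # ~0.50 of sampled actions land exactly on 1
print((np.abs(at_zero) == 1.0).mean())  # ~0.32 of sampled actions are clipped in total
print(at_one.std(), at_zero.std())      # observed spread is noticeably smaller for the "1" policy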

1 reaction
smorad commented, Mar 2, 2020

why don’t you use the deterministic policy for evaluation?

PPO samples from a distribution to select the next action. This means that even if 0 is the best action as dictated by the policy, there is still a chance of selecting other values. Maybe half the Gaussian is truncated at the bounds [-1, 1]? If half the probability mass was >1 but all samples were rounded to 1, you may see it ‘converge’ 2x as fast to 1. Just a thought.
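
For reference, evaluating with the deterministic action (the mean of the Gaussian rather than a sample) is just a flag on predict in Stable Baselines; a minimal sketch, assuming a trained model and env as in the question:

# Use the policy mean instead of sampling from the Gaussian.
obs = env.reset()
action, _states = model.predict(obs, deterministic=True)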

Read more comments on GitHub.
