
Confused about HER+DDPG policy loss

See original GitHub issue

The policy loss in the HER+DDPG implementation is defined as follows:

self.pi_loss_tf = -tf.reduce_mean(self.main.Q_pi_tf)
self.pi_loss_tf += self.action_l2 * tf.reduce_mean(tf.square(self.main.pi_tf / self.max_u))

This can be found here: https://github.com/openai/baselines/blob/f2729693253c0ef4d4086231d36e0a4307ec1cb3/baselines/her/ddpg.py#L274

I understand why we are using the first part, self.pi_loss_tf = -tf.reduce_mean(self.main.Q_pi_tf). However, I do not understand the purpose of the second part (which I will call mean_sqr_action from now on): self.pi_loss_tf += self.action_l2 * tf.reduce_mean(tf.square(self.main.pi_tf / self.max_u))
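For completeness, here is a minimal, self-contained sketch of the same two-term loss in plain TensorFlow 2; the function name, tensor shapes, and dummy data are my own illustrations and not part of the baselines code:

import tensorflow as tf

def actor_loss(q_pi, pi, action_l2=1.0, max_u=1.0):
    # First term: maximize the critic's estimate of the actor's own actions
    # by minimizing the negative mean of Q(s, pi(s)).
    loss = -tf.reduce_mean(q_pi)
    # Second term ("mean_sqr_action"): quadratic penalty on the normalized
    # action magnitude, scaled by the action_l2 coefficient.
    loss += action_l2 * tf.reduce_mean(tf.square(pi / max_u))
    return loss

# Dummy batch: 64 critic values and 64 four-dimensional actions.
q_pi = tf.random.normal([64, 1])
pi = tf.random.uniform([64, 4], minval=-1.0, maxval=1.0)
print(actor_loss(q_pi, pi))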

As far as I know, the second part is never mentioned in the paper, and it is also not used in your vanilla baselines DDPG implementation. Additionally, in my two experiments, removing the mean_sqr_action term from the loss function improved learning significantly.

The first experiment was in the FetchReach-v1 environment with the default settings. There, the mean_sqr_action implementation needs 5 epochs to reach a success rate of 1, whereas the modified version needs only 3 epochs.

In the more complex HandReach-v0 environment, the mean_sqr_action version needed 20 epochs to reach a success rate of 0.4, whereas the modified version already reached 0.5 after 11 epochs and 0.55 after 20 epochs.
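To make the comparison concrete, the "modified version" in the experiments above presumably corresponds to the following one-line change in ddpg.py (or, equivalently, to setting the action_l2 coefficient to zero), leaving only the Q term in the actor loss:

self.pi_loss_tf = -tf.reduce_mean(self.main.Q_pi_tf)
# self.pi_loss_tf += self.action_l2 * tf.reduce_mean(tf.square(self.main.pi_tf / self.max_u))  # mean_sqr_action term removed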

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Reactions: 1
  • Comments: 8

Top GitHub Comments

2 reactions
JBLanier commented, Sep 10, 2018

Referring to the actor loss component at https://github.com/openai/baselines/blob/f2729693253c0ef4d4086231d36e0a4307ec1cb3/baselines/her/ddpg.py#L274: what would be a reason to want to penalize the magnitude of actions, as is done here?

1 reaction
astier commented, Aug 2, 2018

Yes.

Read more comments on GitHub >

