Confused about HER+DDPG policy loss
The policy loss in the HER+DDPG implementation is defined as follows:
self.pi_loss_tf = -tf.reduce_mean(self.main.Q_pi_tf)
self.pi_loss_tf += self.action_l2 * tf.reduce_mean(tf.square(self.main.pi_tf / self.max_u))
This can be found here: https://github.com/openai/baselines/blob/f2729693253c0ef4d4086231d36e0a4307ec1cb3/baselines/her/ddpg.py#L274
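For reference, here is the same computation restated as a small standalone function (a NumPy stand-in I wrote for this issue; q_pi, pi, max_u, and action_l2 mirror main.Q_pi_tf, main.pi_tf, self.max_u, and self.action_l2, but the names and setup are mine, not from the repo):

import numpy as np

def actor_loss(q_pi, pi, max_u, action_l2):
    # Term 1: push the policy towards actions the critic rates highly
    # (minimize the negative mean Q-value over the batch).
    q_term = -np.mean(q_pi)
    # Term 2: L2 penalty on the policy's actions, normalized by max_u
    # and scaled by the action_l2 coefficient.
    l2_term = action_l2 * np.mean(np.square(pi / max_u))
    return q_term + l2_term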
I understand why we are using the first part:
self.pi_loss_tf = -tf.reduce_mean(self.main.Q_pi_tf)
However, I do not understand the purpose of the second part (which I will refer to as mean_sqr_action from now on):
self.pi_loss_tf += self.action_l2 * tf.reduce_mean(tf.square(self.main.pi_tf / self.max_u))
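Just to make concrete what this term computes: it is the mean squared action (after normalizing by max_u), so it grows with the magnitude of the actions the policy outputs. A quick numeric check with made-up action batches:

import numpy as np

max_u = 1.0
small_actions = np.array([[0.1, -0.2], [0.05, 0.1]])
large_actions = np.array([[0.9, -1.0], [0.95, 0.8]])

print(np.mean(np.square(small_actions / max_u)))  # ~0.016
print(np.mean(np.square(large_actions / max_u)))  # ~0.84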
As far as I know, the second part is never mentioned in the paper, and it is not used in the vanilla baselines DDPG implementation either. Additionally, in my two experiments, removing the mean_sqr_action term from the loss function improved learning significantly.
The first experiment used the FetchReach-v1 environment with the default settings. There, the mean_sqr_action version needs 5 epochs to reach a success rate of 1.0, whereas the modified version needs only 3 epochs.
In the more complex HandReach-v0 environment, the mean_sqr_action version needed 20 epochs to reach a success rate of 0.4, whereas the modified version already reached 0.5 after 11 epochs and 0.55 after 20 epochs.
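For anyone who wants to reproduce the comparison: I simply dropped the term from the loss, but the same effect should be achievable by setting the action_l2 coefficient to zero. A sketch, assuming the coefficient is read from DEFAULT_PARAMS in baselines/her/experiment/config.py (as it appears to be at the commit linked above):

from baselines.her.experiment import config

# Zero out the action-L2 coefficient so the mean_sqr_action term
# contributes nothing to the actor loss (assumes 'action_l2' is a key
# in DEFAULT_PARAMS, as it appears to be at this commit).
config.DEFAULT_PARAMS['action_l2'] = 0.0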
Top GitHub Comments
Referring to the actor loss component (https://github.com/openai/baselines/blob/f2729693253c0ef4d4086231d36e0a4307ec1cb3/baselines/her/ddpg.py#L274): what would be a reason to want to penalize the magnitude of actions, as is done here?
Yes.