Confused about HER+DDPG policy loss
The policy loss in the HER+DDPG implementation is defined as follows:
self.pi_loss_tf = -tf.reduce_mean(self.main.Q_pi_tf)
self.pi_loss_tf += self.action_l2 * tf.reduce_mean(tf.square(self.main.pi_tf / self.max_u))
This can be found here: https://github.com/openai/baselines/blob/f2729693253c0ef4d4086231d36e0a4307ec1cb3/baselines/her/ddpg.py#L274
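For reference, here is the same computation restated as a small standalone function (a NumPy stand-in I wrote for this issue; q_pi, pi, max_u, and action_l2 mirror main.Q_pi_tf, main.pi_tf, self.max_u, and self.action_l2, but the names and setup are mine, not from the repo):

import numpy as np

def actor_loss(q_pi, pi, max_u, action_l2):
    # Term 1: push the policy towards actions the critic rates highly
    # (minimize the negative mean Q-value over the batch).
    q_term = -np.mean(q_pi)
    # Term 2: L2 penalty on the policy's actions, normalized by max_u
    # and scaled by the action_l2 coefficient.
    l2_term = action_l2 * np.mean(np.square(pi / max_u))
    return q_term + l2_term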
I understand why we are using the first part:
self.pi_loss_tf = -tf.reduce_mean(self.main.Q_pi_tf)
However, I do not understand the purpose of the second part (which I will refer to as mean_sqr_action from now on):
self.pi_loss_tf += self.action_l2 * tf.reduce_mean(tf.square(self.main.pi_tf / self.max_u))
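Just to make concrete what this term computes: it is the mean squared action (after normalizing by max_u), so it grows with the magnitude of the actions the policy outputs. A quick numeric check with made-up action batches:

import numpy as np

max_u = 1.0
small_actions = np.array([[0.1, -0.2], [0.05, 0.1]])
large_actions = np.array([[0.9, -1.0], [0.95, 0.8]])

print(np.mean(np.square(small_actions / max_u)))  # ~0.016
print(np.mean(np.square(large_actions / max_u)))  # ~0.84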
As far as I know, the second part is never mentioned in the paper, and it is not used in the vanilla baselines DDPG implementation either. Additionally, in my two experiments, removing the mean_sqr_action term from the loss function improved learning significantly.
The first experiment used the FetchReach-v1 environment with the default settings. There, the mean_sqr_action version needs 5 epochs to reach a success rate of 1.0, whereas the modified version needs only 3 epochs.
In the more complex HandReach-v0 environment, the mean_sqr_action version needed 20 epochs to reach a success rate of 0.4, whereas the modified version already reached 0.5 after 11 epochs and 0.55 after 20 epochs.
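For anyone who wants to reproduce the comparison: I simply dropped the term from the loss, but the same effect should be achievable by setting the action_l2 coefficient to zero. A sketch, assuming the coefficient is read from DEFAULT_PARAMS in baselines/her/experiment/config.py (as it appears to be at the commit linked above):

from baselines.her.experiment import config

# Zero out the action-L2 coefficient so the mean_sqr_action term
# contributes nothing to the actor loss (assumes 'action_l2' is a key
# in DEFAULT_PARAMS, as it appears to be at this commit).
config.DEFAULT_PARAMS['action_l2'] = 0.0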
Top GitHub Comments
Referring to the actor loss component (https://github.com/openai/baselines/blob/f2729693253c0ef4d4086231d36e0a4307ec1cb3/baselines/her/ddpg.py#L274): what would be a reason to want to penalize the magnitude of actions, as is done here?
Yes.