Choice of Q value in the policy loss of SAC algorithm

Hello,

Why is Q value 1 chosen to calculate the policy loss in the SAC algorithm? Shouldn’t it be the min of the two Q values? If not, can you briefly explain why?

In the following file, line 237:

stable_baselines/sac/sac.py

# Take the min of the two Q-Values (Double-Q Learning)
min_qf_pi = tf.minimum(qf1_pi, qf2_pi)

# ...

# Compute the policy loss
# Alternative: policy_kl_loss = tf.reduce_mean(logp_pi - min_qf_pi)
policy_kl_loss = tf.reduce_mean(self.ent_coef * logp_pi - qf1_pi) # min_qf_pi instead of qf1_pi?

Thank you for your help,

Top GitHub Comments

haarnoja commented, May 14, 2019

Yeah, I believe there is no particular reason for that choice – they all work pretty much equally well.

hartikainen commented, May 14, 2019

Good question. Based on my tests, there was no difference at all between using the min vs. a single value, and I converged to using the min just to be consistent with the usage of Q in the TD-update.
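For readers who want to see the two variants side by side, here is a minimal sketch in plain Python. It is not the stable_baselines code: the helper `policy_loss` and the batch values are hypothetical, chosen only to illustrate the formula `mean(ent_coef * logp_pi - Q)` from the snippet in the question. It computes the loss once with `qf1_pi` and once with the element-wise min of `qf1_pi` and `qf2_pi`.

```python
def policy_loss(logp_pi, q_pi, ent_coef):
    """Batch mean of ent_coef * log pi(a|s) - Q(s, a)."""
    terms = [ent_coef * lp - q for lp, q in zip(logp_pi, q_pi)]
    return sum(terms) / len(terms)


# Hypothetical batch values, for illustration only (not from a real run).
logp_pi = [-1.0, -0.5, -2.0]   # log-probabilities of the sampled actions
qf1_pi = [1.0, 2.0, 3.0]       # first Q-network evaluated at those actions
qf2_pi = [1.5, 1.0, 3.5]       # second Q-network evaluated at those actions
ent_coef = 0.2                 # entropy coefficient (assumed constant here)

# Element-wise min of the two Q estimates, as in the TD-update.
min_qf_pi = [min(q1, q2) for q1, q2 in zip(qf1_pi, qf2_pi)]

loss_q1 = policy_loss(logp_pi, qf1_pi, ent_coef)      # single-Q variant
loss_min = policy_loss(logp_pi, min_qf_pi, ent_coef)  # min-Q variant

print("loss with qf1_pi:   ", loss_q1)
print("loss with min_qf_pi:", loss_min)
```

Because the min is never larger than `qf1_pi`, the min-based loss is never smaller than the single-Q loss on the same batch. The two differ only on samples where `qf2_pi` is below `qf1_pi`. This toy comparison says nothing about training performance, which the comments above report as essentially equal.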
