Choice of Q-value in the policy loss of the SAC algorithm
Hello,
Why is Q-value 1 chosen to calculate the policy loss in the SAC algorithm? Shouldn't it be the min of the two Q-values? If not, could you briefly explain why?
In the following file, line 237:
stable_baselines/sac/sac.py
# Take the min of the two Q-Values (Double-Q Learning)
min_qf_pi = tf.minimum(qf1_pi, qf2_pi)
# ...
# Compute the policy loss
# Alternative: policy_kl_loss = tf.reduce_mean(logp_pi - min_qf_pi)
policy_kl_loss = tf.reduce_mean(self.ent_coef * logp_pi - qf1_pi) # min_qf_pi instead of qf1_pi?
Thank you for your help,
Issue Analytics
- State:
- Created 4 years ago
- Comments:5
Top GitHub Comments
Yeah, I believe there is no particular reason for that choice – they all work pretty much equally well.
Good question. In my tests there was no difference at all between using the min and using a single Q-value, so I settled on the min just to be consistent with how Q is used in the TD update.
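For readers who want to see the two variants side by side, here is a minimal pure-Python sketch of both policy losses on a toy batch. It is not the stable_baselines implementation: the function names and all numeric values are made up for illustration, and only `logp_pi`, `qf1_pi`, `qf2_pi` and `ent_coef` mirror names from the quoted snippet.

```python
from statistics import mean


def policy_loss_single_q(logp_pi, qf1_pi, ent_coef):
    """Mean of ent_coef * logp_pi - qf1_pi (mirrors the quoted sac.py line)."""
    return mean([ent_coef * lp - q1 for lp, q1 in zip(logp_pi, qf1_pi)])


def policy_loss_min_q(logp_pi, qf1_pi, qf2_pi, ent_coef):
    """Same loss, but using the element-wise min of the two Q estimates."""
    return mean([
        ent_coef * lp - min(q1, q2)
        for lp, q1, q2 in zip(logp_pi, qf1_pi, qf2_pi)
    ])


# Hypothetical toy batch of three sampled actions (illustrative values only).
logp_pi = [-1.0, -0.5, -2.0]   # log-probabilities of the sampled actions
qf1_pi = [1.0, 2.0, 0.5]       # first critic's estimates
qf2_pi = [0.8, 2.5, 0.5]       # second critic's estimates
ent_coef = 0.2                 # entropy coefficient

single = policy_loss_single_q(logp_pi, qf1_pi, ent_coef)
min_q = policy_loss_min_q(logp_pi, qf1_pi, qf2_pi, ent_coef)

print("single-Q policy loss:", single)
print("min-Q policy loss:   ", min_q)
```

Because min(Q1, Q2) is never larger than Q1, the min-based loss is never smaller than the single-Q loss. The two differ only on samples where the second critic is the lower one, which is consistent with the comments above reporting little practical difference.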