[Question] Is A2C deterministic during training?
The documentation states that A2C is deterministic:
A synchronous, deterministic variant of Asynchronous Advantage Actor Critic (A3C). It uses multiple workers to avoid the use of a replay buffer.
However, the learn method of the OnPolicyAlgorithm class collects rollouts through ActorCriticPolicy.forward, which is called with its default deterministic=False, as follows.
class OnPolicyAlgorithm(BaseAlgorithm):
    def learn(...) -> "OnPolicyAlgorithm":
        (...)
        continue_training = self.collect_rollouts(self.env, callback, self.rollout_buffer, n_rollout_steps=self.n_steps)

    def collect_rollouts(self, env: VecEnv, callback: BaseCallback, rollout_buffer: RolloutBuffer, n_rollout_steps: int) -> bool:
        (...)
        # deterministic is not passed here, so forward() uses its default value (False)
        actions, values, log_probs = self.policy.forward(obs_tensor)

class ActorCriticPolicy(BasePolicy):
    (...)
    def forward(self, obs: th.Tensor, deterministic: bool = False) -> Tuple[th.Tensor, th.Tensor, th.Tensor]:
        (...)
        actions = distribution.get_actions(deterministic=deterministic)
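To see this in practice, here is a minimal sketch (not from the issue; it assumes an untrained A2C model on CartPole-v1): repeated calls to policy.forward on the same observation typically return different actions, because the default deterministic=False samples from the action distribution.

import torch as th
from stable_baselines3 import A2C
from stable_baselines3.common.utils import obs_as_tensor

model = A2C("MlpPolicy", "CartPole-v1")
obs = model.env.reset()
obs_tensor = obs_as_tensor(obs, model.policy.device)

with th.no_grad():
    # Same code path as collect_rollouts: deterministic defaults to False
    actions = [model.policy.forward(obs_tensor)[0].item() for _ in range(20)]

print(set(actions))  # usually contains more than one distinct action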
So it appears that A2C samples actions from a stochastic policy during the learning phase, as the original paper describes. It also seems that A2C is only optionally deterministic when calling the predict method.
Am I missing something? Is the A2C implementation really deterministic?
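For reference, the deterministic flag of predict can be checked in the same way; the snippet below (again only an illustrative sketch on CartPole-v1) compares stochastic and deterministic action selection on a fixed observation.

from stable_baselines3 import A2C

model = A2C("MlpPolicy", "CartPole-v1")
obs = model.env.reset()

# deterministic=False samples from the distribution; deterministic=True returns its mode
sampled = [model.predict(obs, deterministic=False)[0].item() for _ in range(20)]
greedy = [model.predict(obs, deterministic=True)[0].item() for _ in range(20)]

print(set(sampled))  # typically more than one action for an untrained policy
print(set(greedy))   # a single action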
Checklist
- I have read the documentation (required)
- I have checked that there is no similar issue in the repo (required)
Issue Analytics
- Created: 2 years ago
- Comments: 16 (15 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
Just to pitch in here: I think the whole “deterministic” part in the docs regarding A2C is confusing. I do not see how A2C is “deterministic” in any way (sure, the results match if you fix seed, but that applies to all algorithms once you fix the PRNG seed for everything). We should remove that mention from A2C and other algorithms, or at least clarify what it means.
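As an illustration of that reproducibility point, a minimal sketch (assuming CartPole-v1 and SB3's seed argument): two training runs with the same seed should end up with matching policy parameters, which is the only sense in which training is "deterministic" here.

import torch as th
from stable_baselines3 import A2C

# Two runs with the same seed; learn() returns the model itself
run_a = A2C("MlpPolicy", "CartPole-v1", seed=0).learn(total_timesteps=1_000)
run_b = A2C("MlpPolicy", "CartPole-v1", seed=0).learn(total_timesteps=1_000)

params_a = run_a.policy.state_dict()
params_b = run_b.policy.state_dict()

# On CPU the two runs should produce identical weights
print(all(th.equal(params_a[k], params_b[k]) for k in params_a))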
You should not need to set the seed manually (and should not set it to anything fixed! Otherwise you will be stuck with deterministic behaviour). Simply doing predict with deterministic=False should be enough. Sorry if this is not helpful though, writing this message at late hours ^^'

I understand that models initialized with the same parameters and trained equally will lead to models that behave exactly alike. But, as @Miffyli pointed out, my issue is that even with predict(deterministic=False), in practice the results will still be deterministic and not stochastic. So it seems there is no way to make the trained model genuinely stochastic.

In conclusion, what I get from this thread is:
- The deterministic argument changes how the trained model selects actions (deterministic=False -> sample(), deterministic=True -> mode()). However, in practice the model will behave deterministically either way, given that the trained model samples from a probability distribution learned from the same initialization.
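To make that mapping concrete, here is a small sketch (an illustration only; it assumes ActorCriticPolicy.get_distribution is available in the installed SB3 version): the same distribution object either samples or returns its mode depending on the flag.

import torch as th
from stable_baselines3 import A2C
from stable_baselines3.common.utils import obs_as_tensor

model = A2C("MlpPolicy", "CartPole-v1")
obs_tensor = obs_as_tensor(model.env.reset(), model.policy.device)

with th.no_grad():
    dist = model.policy.get_distribution(obs_tensor)
    sampled = dist.get_actions(deterministic=False)  # calls distribution.sample()
    mode = dist.get_actions(deterministic=True)      # calls distribution.mode()

print(sampled, mode)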