[Question] Is A2C deterministic during training?
The documentation states that A2C is deterministic:
A synchronous, deterministic variant of Asynchronous Advantage Actor Critic (A3C). It uses multiple workers to avoid the use of a replay buffer.
However, the learn method of the OnPolicyAlgorithm class collects rollouts through ActorCriticPolicy.forward, which is called with its default deterministic=False, as follows.
class OnPolicyAlgorithm(BaseAlgorithm):
    def learn(...) -> "OnPolicyAlgorithm":
        (...)
        continue_training = self.collect_rollouts(self.env, callback, self.rollout_buffer, n_rollout_steps=self.n_steps)

    def collect_rollouts(self, env: VecEnv, callback: BaseCallback, rollout_buffer: RolloutBuffer, n_rollout_steps: int) -> bool:
        (...)
        # deterministic is not passed here, so forward() uses its default value (False)
        actions, values, log_probs = self.policy.forward(obs_tensor)

class ActorCriticPolicy(BasePolicy):
    (...)
    def forward(self, obs: th.Tensor, deterministic: bool = False) -> Tuple[th.Tensor, th.Tensor, th.Tensor]:
        (...)
        actions = distribution.get_actions(deterministic=deterministic)
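To see this in practice, here is a minimal sketch (not from the issue; it assumes an untrained A2C model on CartPole-v1): repeated calls to policy.forward on the same observation typically return different actions, because the default deterministic=False samples from the action distribution.

import torch as th
from stable_baselines3 import A2C
from stable_baselines3.common.utils import obs_as_tensor

model = A2C("MlpPolicy", "CartPole-v1")
obs = model.env.reset()
obs_tensor = obs_as_tensor(obs, model.policy.device)

with th.no_grad():
    # Same code path as collect_rollouts: deterministic defaults to False
    actions = [model.policy.forward(obs_tensor)[0].item() for _ in range(20)]

print(set(actions))  # usually contains more than one distinct action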
So it appears that A2C samples actions from a stochastic policy during the learning phase, as the original paper describes. It also seems that A2C is only optionally deterministic when calling the predict method.
Am I missing something? Is the A2C implementation really deterministic?
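For reference, the deterministic flag of predict can be checked in the same way; the snippet below (again only an illustrative sketch on CartPole-v1) compares stochastic and deterministic action selection on a fixed observation.

from stable_baselines3 import A2C

model = A2C("MlpPolicy", "CartPole-v1")
obs = model.env.reset()

# deterministic=False samples from the distribution; deterministic=True returns its mode
sampled = [model.predict(obs, deterministic=False)[0].item() for _ in range(20)]
greedy = [model.predict(obs, deterministic=True)[0].item() for _ in range(20)]

print(set(sampled))  # typically more than one action for an untrained policy
print(set(greedy))   # a single action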
Checklist
- I have read the documentation (required)
- I have checked that there is no similar issue in the repo (required)
Issue Analytics
- Created: 2 years ago
- Comments: 16 (15 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
Just to pitch in here: I think the whole “deterministic” part in the docs regarding A2C is confusing. I do not see how A2C is “deterministic” in any way (sure, the results match if you fix seed, but that applies to all algorithms once you fix the PRNG seed for everything). We should remove that mention from A2C and other algorithms, or at least clarify what it means.
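As an illustration of that reproducibility point, a minimal sketch (assuming CartPole-v1 and SB3's seed argument): two training runs with the same seed should end up with matching policy parameters, which is the only sense in which training is "deterministic" here.

import torch as th
from stable_baselines3 import A2C

# Two runs with the same seed; learn() returns the model itself
run_a = A2C("MlpPolicy", "CartPole-v1", seed=0).learn(total_timesteps=1_000)
run_b = A2C("MlpPolicy", "CartPole-v1", seed=0).learn(total_timesteps=1_000)

params_a = run_a.policy.state_dict()
params_b = run_b.policy.state_dict()

# On CPU the two runs should produce identical weights
print(all(th.equal(params_a[k], params_b[k]) for k in params_a))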
You should not need to set the seed manually (and should not set it to anything fixed! Otherwise you will be stuck with deterministic behaviour). Simply doing predict with deterministic=False should be enough. Sorry if this is not helpful though, writing this message at late hours ^^'

I understand that models initialized with the same parameters and trained equally will lead to models that behave exactly alike. But, as @Miffyli pointed out, my issue is that even with predict(deterministic=False), in practice the results will still be deterministic and not stochastic. So it seems there is no way to make the trained model genuinely stochastic.

In conclusion, what I get from this thread is:
- The deterministic argument changes how the trained model selects actions (deterministic=False -> sample(), deterministic=True -> mode()). However, in practice the model will behave deterministically either way, given that the trained model samples from a probability distribution learned from the same initialization.
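To make that mapping concrete, here is a small sketch (an illustration only; it assumes ActorCriticPolicy.get_distribution is available in the installed SB3 version): the same distribution object either samples or returns its mode depending on the flag.

import torch as th
from stable_baselines3 import A2C
from stable_baselines3.common.utils import obs_as_tensor

model = A2C("MlpPolicy", "CartPole-v1")
obs_tensor = obs_as_tensor(model.env.reset(), model.policy.device)

with th.no_grad():
    dist = model.policy.get_distribution(obs_tensor)
    sampled = dist.get_actions(deterministic=False)  # calls distribution.sample()
    mode = dist.get_actions(deterministic=True)      # calls distribution.mode()

print(sampled, mode)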