[Question] Non-shared features extractor in on-policy algorithm
Question
I’ve checked the docs (custom policy -> advanced example), but it is not clear to me how to create a custom policy without sharing the features extractor between the actor and the critic networks in on-policy algorithms.
If I pass a `features_extractor_class` in the `policy_kwargs`, this is shared by default, I think.
I can have a non-shared `mlp_extractor` by implementing my own `_build_mlp_extractor` method in my custom policy and creating a network with 2 distinct sub-networks (`self.policy_net` and `self.value_net`), but I didn't understand how to do the same with the features extractor.
The docs (custom policy -> custom features extractor) suggest this is supported, so since I'm using A2C, I think it should be possible to have a non-shared features extractor by implementing my own policy; I just didn't understand how to do it.
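For context, the non-shared `mlp_extractor` mentioned above can be sketched like this. These are plain-Python stand-ins for the torch sub-networks: the attribute names `policy_net` / `value_net` mirror SB3's `MlpExtractor`, but everything else here is illustrative, not the real implementation.

```python
# Stand-in sketch of an mlp_extractor with two distinct branches.
# In SB3 this object would be a torch.nn.Module created inside
# _build_mlp_extractor() on a custom ActorCriticPolicy; here plain
# functions stand in for the actor/critic MLPs.

def make_branch(scale):
    # Hypothetical "network": scales every feature by a constant.
    return lambda features: [scale * f for f in features]

class TwoBranchExtractor:
    """Mirrors MlpExtractor's attribute names: policy_net / value_net."""

    def __init__(self):
        self.policy_net = make_branch(2)   # actor branch (own parameters)
        self.value_net = make_branch(3)    # critic branch (own parameters)

    def forward(self, features):
        # Two distinct sub-networks, so the branches share no weights.
        return self.policy_net(features), self.value_net(features)

extractor = TwoBranchExtractor()
latent_pi, latent_vf = extractor.forward([1.0, 2.0])
print(latent_pi, latent_vf)  # [2.0, 4.0] [3.0, 6.0]
```

Because the two branches are separate objects, updating the actor branch never touches the critic branch; the open question in this issue is how to get the same separation one level earlier, at the features extractor.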
Thanks in advance for any clarification!
Checklist
- I have read the documentation (required)
- I have checked that there is no similar issue in the repo (required)
Issue Analytics
- Created: a year ago
- Comments: 8 (6 by maintainers)
Top GitHub Comments
@wlxer I think you could pass the dimensions as parameters to your policy network (not necessarily within `kwargs`, but explicitly). Then you "save" them in the net's attributes and only then call the superclass constructor. It is something I actually do in my code, but I didn't report it previously because it was just a personal need. You can do something a bit like this:
I managed to make it run without errors! 🎉
But since I haven’t found a guide/demo nor a similar issue here, I’ll briefly explain how I did it:
- I created a custom policy (subclassing `ActorCriticPolicy`).
- I overrode the methods that use the features extractor (`forward`, `extract_features`, `evaluate_actions` and `predict_values`).

Quick demo
Hope it can help someone!