[question] Custom output activation function for actor (A2C, SAC)
I really like stable_baselines and would like to use it for a custom environment with continuous actions. To match the specific needs of the environment, I need to apply a custom activation function to the output of the policy/actor network. In particular, I want to apply separate softmax activation functions to different parts of the output (e.g., softmax over the first n actions, then softmax over the next n, and so on).
I know how to define such an activation in general, but I don’t know what the best and cleanest way is to implement such a policy in stable_baselines. I’d like to reuse the MlpPolicy and just change the activation of the output layer. I’m interested in using this with A2C and SAC.
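For context, the activation I have in mind is roughly the following (a minimal TensorFlow 1.x sketch; `segmented_softmax` and `n_segments` are just illustrative names, not anything from stable_baselines, and the action vector is assumed to be a concatenation of equal-sized blocks):

```python
import tensorflow as tf  # stable_baselines uses TensorFlow 1.x

def segmented_softmax(logits, n_segments):
    """Split the last axis into `n_segments` equal-sized blocks, apply a
    separate softmax to each block, and concatenate the results again.

    Assumes the size of the last axis is known and divisible by `n_segments`.
    """
    blocks = tf.split(logits, n_segments, axis=-1)
    return tf.concat([tf.nn.softmax(block, axis=-1) for block in blocks], axis=-1)
```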
In A2C, it seems like this is handled here or here, but I don’t want to mess something up by making changes there without being certain.
In SAC, I guess I would only have to adjust this part: https://github.com/hill-a/stable-baselines/blob/master/stable_baselines/sac/policies.py#L217. Or do I need to change the log_std below as well?
This seems related to this issue. Unfortunately, it didn’t help me figure out my problem/question.
Top GitHub Comments
Can’t you do the normalization inside the environment? Or use a gym wrapper (cf. the tutorial)?
You also have to be aware that by doing so you change the probability distribution (this matters for SAC or A2C; in the case of DDPG/TD3 there is no probability distribution involved, so it is not a problem).
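Something along these lines would be one way to do it as a gym wrapper (a rough sketch, assuming the action vector is a concatenation of equal-sized blocks; the wrapper name and `segment_size` parameter are illustrative, not part of gym or stable_baselines):

```python
import gym
import numpy as np

class SegmentedSoftmaxAction(gym.ActionWrapper):
    """Illustrative wrapper: normalize each block of `segment_size` raw actions
    with a softmax before the environment sees them, instead of changing the
    policy network. Assumes the action dimension is divisible by `segment_size`."""

    def __init__(self, env, segment_size):
        super(SegmentedSoftmaxAction, self).__init__(env)
        self.segment_size = segment_size

    def action(self, action):
        blocks = np.asarray(action, dtype=np.float64).reshape(-1, self.segment_size)
        # numerically stable softmax per block
        exp = np.exp(blocks - blocks.max(axis=1, keepdims=True))
        return (exp / exp.sum(axis=1, keepdims=True)).reshape(np.shape(action))
```

Training then looks the same as usual, e.g. with SAC (`MyCustomEnv` and the segment size are placeholders for your own setup):

```python
from stable_baselines import SAC
from stable_baselines.sac.policies import MlpPolicy

env = SegmentedSoftmaxAction(MyCustomEnv(), segment_size=4)  # MyCustomEnv is your env
model = SAC(MlpPolicy, env, verbose=1)
model.learn(total_timesteps=10000)
```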
Ok, thanks! I’ll see how far I get with normalizing inside the environment.