[question] Custom output activation function for actor (A2C, SAC)
I really like stable_baselines and would like to use it for a custom environment with continuous actions. To match the specific needs of the environment, I need to apply a custom activation function to the output of the policy/actor network. In particular, I want to apply separate softmax activation functions to different parts of the output (e.g., softmax over the first n actions, then softmax over the next n, and so on).
I know how to define such an activation in general, but I don’t know what the best and cleanest way is to implement such a policy in stable_baselines. I’d like to reuse the MlpPolicy and just change the activation of the output layer. I’m interested in using this with A2C and SAC.
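For context, the activation I have in mind is roughly the following (a minimal TensorFlow 1.x sketch; `segmented_softmax` and `n_segments` are just illustrative names, not anything from stable_baselines, and the action vector is assumed to be a concatenation of equal-sized blocks):

```python
import tensorflow as tf  # stable_baselines uses TensorFlow 1.x

def segmented_softmax(logits, n_segments):
    """Split the last axis into `n_segments` equal-sized blocks, apply a
    separate softmax to each block, and concatenate the results again.

    Assumes the size of the last axis is known and divisible by `n_segments`.
    """
    blocks = tf.split(logits, n_segments, axis=-1)
    return tf.concat([tf.nn.softmax(block, axis=-1) for block in blocks], axis=-1)
```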
In A2C, it seems like this is handled here or here, but I don’t want to mess something up by making changes there without being certain.
In SAC, I guess I would only have to adjust this part: https://github.com/hill-a/stable-baselines/blob/master/stable_baselines/sac/policies.py#L217. Or do I need to change the log_std below as well?
This seems related to this issue. Unfortunately, it didn’t help me figure out my problem/question.
Top GitHub Comments
Can’t you do the normalization inside the environment? Or use a gym wrapper (cf. the tutorial)?
You also have to be aware that by doing so you change the probability distribution (this matters for SAC or A2C; in the case of DDPG/TD3 there is no probability distribution involved, so it is not a problem).
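Something along these lines would be one way to do it as a gym wrapper (a rough sketch, assuming the action vector is a concatenation of equal-sized blocks; the wrapper name and `segment_size` parameter are illustrative, not part of gym or stable_baselines):

```python
import gym
import numpy as np

class SegmentedSoftmaxAction(gym.ActionWrapper):
    """Illustrative wrapper: normalize each block of `segment_size` raw actions
    with a softmax before the environment sees them, instead of changing the
    policy network. Assumes the action dimension is divisible by `segment_size`."""

    def __init__(self, env, segment_size):
        super(SegmentedSoftmaxAction, self).__init__(env)
        self.segment_size = segment_size

    def action(self, action):
        blocks = np.asarray(action, dtype=np.float64).reshape(-1, self.segment_size)
        # numerically stable softmax per block
        exp = np.exp(blocks - blocks.max(axis=1, keepdims=True))
        return (exp / exp.sum(axis=1, keepdims=True)).reshape(np.shape(action))
```

Training then looks the same as usual, e.g. with SAC (`MyCustomEnv` and the segment size are placeholders for your own setup):

```python
from stable_baselines import SAC
from stable_baselines.sac.policies import MlpPolicy

env = SegmentedSoftmaxAction(MyCustomEnv(), segment_size=4)  # MyCustomEnv is your env
model = SAC(MlpPolicy, env, verbose=1)
model.learn(total_timesteps=10000)
```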
Ok, thanks! I’ll see how far I get with normalizing inside the environment.