Building policy with continuous action space throws error
Here's a test to demonstrate this:
# Imports as used in rlgraph's own test suite (paths assumed from the library layout):
from rlgraph.components.policies.policy import Policy
from rlgraph.spaces import FloatBox
from rlgraph.tests import ComponentTest
from rlgraph.tests.test_util import config_from_path

def test_policy_for_continuous_action_space(self):
    # state_space (NN is a simple single fc-layer ReLU network (2 units), random biases, random weights).
    state_space = FloatBox(shape=(4,), add_batch_rank=True)
    # action_space (continuous, scalar actions in [-1.0, 1.0]).
    action_space = FloatBox(low=-1.0, high=1.0, add_batch_rank=True)
    policy = Policy(network_spec=config_from_path("configs/test_simple_nn.json"), action_space=action_space)
    test = ComponentTest(
        component=policy,
        input_spaces=dict(
            nn_input=state_space,
            actions=action_space,
            logits=FloatBox(shape=(2,), add_batch_rank=True),
            probabilities=FloatBox(add_batch_rank=True)
        ),
        action_space=action_space
    )
    test.read_variable_values(policy.variables)
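For context, the referenced configs/test_simple_nn.json is not shown in the issue. A roughly equivalent inline network_spec for a single 2-unit ReLU dense layer might look like the sketch below (the exact file contents and spec keys are assumed here, not copied from the repo):

# Hypothetical inline equivalent of configs/test_simple_nn.json (assumed layer-spec format).
network_spec = [
    {"type": "dense", "units": 2, "activation": "relu"}
]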
This test fails with:
self = <rlgraph.components.policies.policy.Policy object at 0x12ebb08d0>
key = '_T0_'
probabilities = <tf.Tensor 'policy/action-adapter-0/Squeeze:0' shape=(?,) dtype=float32>
    @graph_fn(flatten_ops=True, split_ops=True, add_auto_key_as_first_param=True)
    def _graph_fn_get_distribution_entropies(self, key, probabilities):
        """
        Pushes the given `probabilities` through all our distributions' `entropy` API-methods and returns a
        DataOpDict with the keys corresponding to our `action_space`.

        Args:
            probabilities (DataOp): The parameters to define a distribution.

        Returns:
            FlattenedDataOp: A DataOpDict with the different distributions' `entropy` outputs. Keys always
                correspond to structure of `self.action_space`.
        """
>       return self.distributions[key].entropy(probabilities)
E       KeyError: '_T0_'
Issue Analytics
- Created: 5 years ago
- Comments: 6 (6 by maintainers)
Top Results From Across the Web
DEEP DETERMINISTIC POLICY GRADIENT FOR ...
Policy gradient is preferred over value-based methods in the continuous space domain, as they don't solely depend on the value function of the ......
Read more >Reinforcement Learning in Continuous Action Spaces: DDPG
The episode finishes if the lander crashes or comes to rest, receiving additional -100 or +100 points. Each leg ground contact is +10....
Read more >How to define the policy in the case of continuous action ...
The environment action space is defined as ratios that has to sum up to 1 at each timestep. Hence, using the gaussian policy...
Read more >What is the loss for policy gradients with continuous actions?
In PyTorch we can use a Normal distribution for continuous action space and Categorical for discrete action space. The answer from David Ireland ......
Read more >Reinforcement Learning in Continuous Action Spaces
Let's use deep deterministic policy gradients to deal with the bipedal walker environment. Featuring a continuous action space and 24 ...
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
Yes, the policy class is a bit messy and definitely needs an overhaul; it's a relic of ad hoc functions built under paper-deadline pressure that were never cleaned up afterwards.
Ok, we have fixed the continuous-action problems and added some API methods to the Policy class, mainly due to renaming "probabilities" to "parameters", which generalizes the interface to all kinds of distributions, not just categorical ones. You can still use the old API methods; you will just get a warning to change the names. We will deprecate the old ones in a few months or so. An example for continuous actions is the Pendulum-v0 test case on PPO here:
tests/agent_learning/short_tasks/test_ppo_agent_short_task_learning.py::test_ppo_on_continuous_action_environment
which still remains to be tuned for actual learning.

The parameterization of Normal and Beta distributions always happens within the last axis of the NN output tensor. For example, for the Normal distribution and an action space of FloatBox(shape=(2,)) (2 actions), a single item (of a batch) of NN output would be [1.0, 2.0, 0.5, 0.01], where the first two floats are the mean values of the 2 actions and the last two floats are the log-stddev values of the 2 actions.
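To make that layout concrete, here is a minimal NumPy sketch (illustrative only, not rlgraph code) that splits such a last-axis output into means and stddevs and samples one action per dimension:

import numpy as np

# One NN output row for a FloatBox(shape=(2,)) action space; the last axis holds
# [mean_0, mean_1, log_stddev_0, log_stddev_1].
nn_output = np.array([1.0, 2.0, 0.5, 0.01])
num_actions = 2

means = nn_output[:num_actions]            # [1.0, 2.0]
stddevs = np.exp(nn_output[num_actions:])  # exp() of the log-stddev values

# Sample one action per dimension from independent Normal distributions.
action = np.random.normal(loc=means, scale=stddevs)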
I’m closing this issue now.
Thanks