
Building policy with continuous action space throws error

See original GitHub issue

Here’s a test to demonstrate this:

    def test_policy_for_continuous_action_space(self):
        # State space. (The NN from "configs/test_simple_nn.json" is a simple single
        # fc-layer ReLU network with 2 units, random biases, random weights.)
        state_space = FloatBox(shape=(4,), add_batch_rank=True)

        # Continuous action space (a single float between -1.0 and 1.0).
        action_space = FloatBox(low=-1.0, high=1.0, add_batch_rank=True)

        policy = Policy(network_spec=config_from_path("configs/test_simple_nn.json"), action_space=action_space)
        test = ComponentTest(
            component=policy,
            input_spaces=dict(
                nn_input=state_space,
                actions=action_space,
                logits=FloatBox(shape=(2, ), add_batch_rank=True),
                probabilities=FloatBox(add_batch_rank=True)
            ),
            action_space=action_space
        )

        test.read_variable_values(policy.variables)

This test fails with:

    self = <rlgraph.components.policies.policy.Policy object at 0x12ebb08d0>
    key = '_T0_'
    probabilities = <tf.Tensor 'policy/action-adapter-0/Squeeze:0' shape=(?,) dtype=float32>

        @graph_fn(flatten_ops=True, split_ops=True, add_auto_key_as_first_param=True)
        def _graph_fn_get_distribution_entropies(self, key, probabilities):
            """
            Pushes the given `probabilities` through all our distributions' `entropy` API-methods and returns a
            DataOpDict with the keys corresponding to our `action_space`.

            Args:
                probabilities (DataOp): The parameters to define a distribution.

            Returns:
                FlattenedDataOp: A DataOpDict with the different distributions' `entropy` outputs. Keys always correspond to
                    structure of `self.action_space`.
            """
    >       return self.distributions[key].entropy(probabilities)
    E       KeyError: '_T0_'

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Comments: 6 (6 by maintainers)

Top GitHub Comments

2 reactions
michaelschaarschmidt commented, Jan 21, 2019

Yes, the policy class is a bit messy and definitely needs an overhaul; it is a relic of ad hoc functions built under paper-deadline pressure that were never cleaned up since.

0 reactions
sven1977 commented, Jan 25, 2019

Ok, we have fixed the continuous-action problems and added some API methods to the Policy class, mainly due to the renaming of “probabilities” into “parameters”, which generalizes the interface to all kinds of distributions, not just categorical ones. You can still use the old API methods; you will just get a warning to change the names, and we will deprecate the old ones in a few months or so. An example for continuous actions is the Pendulum-v0 test case on PPO here: tests/agent_learning/short_tasks/test_ppo_agent_short_task_learning.py::test_ppo_on_continuous_action_environment, which still remains to be tuned for actual learning.
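
For reference, here is a hypothetical sketch of how the `input_spaces` from the test above might look under the renamed API, assuming only the key name changes from “probabilities” to “parameters” (the exact keys and shapes RLgraph expects may differ):

    # Hypothetical adaptation of the earlier test's input_spaces after the rename;
    # not taken from RLgraph's test suite, exact keys/shapes may differ.
    input_spaces = dict(
        nn_input=state_space,
        actions=action_space,
        logits=FloatBox(shape=(2,), add_batch_rank=True),
        parameters=FloatBox(add_batch_rank=True),  # was: probabilities=FloatBox(...)
    )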

The parameterization of Normal and Beta distributions always happens within the last axis of the NN output tensor. For example, for the Normal distribution and an action space FloatBox(shape=(2,)) (2 actions), a single item (of a batch) of the NN output might look like [1.0, 2.0, 0.5, 0.01], where the first two floats are the mean values of the 2 actions and the last two floats are the log-stddev values of the 2 actions.
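
For illustration, here is a minimal NumPy sketch of that layout (not RLgraph's actual implementation), assuming the last axis is simply split in half into means and log-stddevs:

    # Minimal sketch of the last-axis layout described above; RLgraph's own
    # splitting code may differ in details.
    import numpy as np

    nn_output = np.array([1.0, 2.0, 0.5, 0.01])  # single (non-batched) NN output item
    # Action space FloatBox(shape=(2,)): first half = means, second half = log-stddevs.
    means, log_stddevs = np.split(nn_output, 2, axis=-1)
    stddevs = np.exp(log_stddevs)

    # Sample one continuous action per action dimension from the resulting Normal.
    actions = np.random.normal(loc=means, scale=stddevs)
    print(means, stddevs, actions)  # means=[1. 2.], stddevs approx. [1.649 1.010]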

I’m closing this issue now.

Thanks
