
Custom action_sampler_fn is not working for PPO.


Search before asking

  • I searched the issues and found no similar issues.

Ray Component

RLlib

What happened + What you expected to happen

PPO does not work when using a custom action_sampler_fn together with make_model; policy construction fails with the following error:

 File "/lib/python3.8/site-packages/ray/rllib/evaluation/rollout_worker.py", line 587, in __init__
    self._build_policy_map(
  File "/lib/python3.8/site-packages/ray/rllib/evaluation/rollout_worker.py", line 1551, in _build_policy_map
    self.policy_map.create_policy(name, orig_cls, obs_space, act_space,
  File "/lib/python3.8/site-packages/ray/rllib/policy/policy_map.py", line 133, in create_policy
    self[policy_id] = class_(
  File "/lib/python3.8/site-packages/ray/rllib/policy/tf_policy_template.py", line 238, in __init__
    DynamicTFPolicy.__init__(
  File "/lib/python3.8/site-packages/ray/rllib/policy/dynamic_tf_policy.py", line 376, in __init__
    self._initialize_loss_from_dummy_batch(
  File "/lib/python3.8/site-packages/ray/rllib/policy/dynamic_tf_policy.py", line 649, in _initialize_loss_from_dummy_batch
    losses = self._do_loss_init(train_batch)
  File "/lib/python3.8/site-packages/ray/rllib/policy/dynamic_tf_policy.py", line 731, in _do_loss_init
    losses = self._loss_fn(self, self.model, self.dist_class, train_batch)
  File "/python3.8/site-packages/ray/rllib/agents/ppo/ppo_tf_policy.py", line 56, in ppo_surrogate_loss
    curr_action_dist = dist_class(logits, model)
TypeError: 'NoneType' object is not callable

Versions / Dependencies

Python 3.8, ray==1.9.2

Reproduction script

from typing import Any, Optional, Tuple, Type

import numpy as np
import ray
from gym.spaces import Box, Space
from ray.rllib.agents.ppo import ppo, ppo_tf_policy
from ray.rllib.agents.trainer_template import build_trainer
from ray.rllib.models.catalog import ModelCatalog
from ray.rllib.models.modelv2 import ModelV2
from ray.rllib.policy.policy import Policy
from ray.rllib.policy.sample_batch import SampleBatch
from ray.rllib.policy.tf_policy_template import build_tf_policy
from ray.rllib.utils.framework import try_import_tf
from ray.rllib.utils.spaces.simplex import Simplex
from ray.rllib.utils.tf_utils import zero_logps_from_actions
from ray.rllib.utils.typing import TensorType, TrainerConfigDict

tf1, tf, tfv = try_import_tf()


def make_model(
    policy: Policy, obs_space: Space, action_space: Space, config: TrainerConfigDict
) -> ModelV2:
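    # Box (continuous) action space: the default DiagGaussian distribution
    # expects a mean and a log-std per action dimension, hence 2 * prod(shape) outputs.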
    if isinstance(action_space, Box):
        num_outputs = 2 * np.product(action_space.shape)
    model = ModelCatalog.get_model_v2(
        obs_space=obs_space,
        action_space=action_space,
        num_outputs=num_outputs,
        model_config=config["model"],
        framework="tf",
    )
    policy.dist_class, _ = ModelCatalog.get_action_dist(action_space, config["model"])
    return model


def action_sampler_fn(
    policy: Policy,
    model: ModelV2,
    obs_batch: TensorType,
    explore: bool = True,
    state_batches: Optional[TensorType] = None,
    seq_lens: Optional[TensorType] = None,
    prev_action_batch: Optional[TensorType] = None,
    prev_reward_batch: Optional[TensorType] = None,
    **kwargs: Any,
) -> Tuple[TensorType, TensorType]:
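    # Forward pass through the model to get the distribution inputs, then build
    # the action distribution and take its deterministic action.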
    distribution_inputs, policy._state_out = policy.model(
        {
            SampleBatch.OBS: obs_batch,
            "obs_flat": obs_batch,
            "is_training": policy._get_is_training_placeholder(),
            SampleBatch.PREV_ACTIONS: prev_action_batch,
            SampleBatch.PREV_REWARDS: prev_reward_batch,
        },
        state_batches,
        seq_lens,
    )
    action_dist_class, _ = ModelCatalog.get_action_dist(
        policy.action_space, policy.config["model"]
    )
    action_dist = action_dist_class(distribution_inputs, model)
    action = action_dist.deterministic_sample()

    logp = zero_logps_from_actions(action)
    return action, logp


PPOTFPolicy = build_tf_policy(
    name="PPOTFPolicy",
    loss_fn=ppo_tf_policy.ppo_surrogate_loss,
    make_model=make_model,
    action_sampler_fn=action_sampler_fn,
    get_default_config=lambda: ray.rllib.agents.ppo.ppo.DEFAULT_CONFIG,
    postprocess_fn=ppo_tf_policy.compute_gae_for_sample_batch,
    stats_fn=ppo_tf_policy.kl_and_loss_stats,
    compute_gradients_fn=ppo_tf_policy.compute_and_clip_gradients,
    extra_action_out_fn=ppo_tf_policy.vf_preds_fetches,
    before_init=ppo_tf_policy.setup_config,
    before_loss_init=ppo_tf_policy.setup_mixins,
    mixins=[
        ppo_tf_policy.LearningRateSchedule,
        ppo_tf_policy.EntropyCoeffSchedule,
        ppo_tf_policy.KLCoeffMixin,
        ppo_tf_policy.ValueNetworkMixin,
    ],
)

DEFAULT_CONFIG = ppo.DEFAULT_CONFIG


PPOTrainer = build_trainer(
    name="PPO",
    default_config=ppo.DEFAULT_CONFIG,
    validate_config=ppo.validate_config,
    default_policy=PPOTFPolicy,
    execution_plan=ppo.execution_plan,
)


config = DEFAULT_CONFIG.copy()
trainer = PPOTrainer(config, env="LunarLanderContinuous-v2")
for i in range(250):
    trainer.train()

Anything else

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 5 (5 by maintainers)

Top GitHub Comments

1 reaction
sven1977 commented, Jan 18, 2022

Hey @n30111, thanks for raising this. I think the answer here is that in case you do want to use an action_sampler_fn (in which you take charge of action computation entirely, without the help of the policy’s built-in action-dist/sampling utilities), you have to make sure that your loss function handles the absence of an action-distribution class.

From looking at your action_sampler_fn, it seems that all you are trying to do is to return a deterministic action (instead of a sampled one from the distribution). You can also achieve that by setting config.explore=False in PPO. However, if you are trying to do more complex things in your custom action_sampler_fn, you would need to also re-define your loss to handle the dist_class=None issue.
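
For that simple case, a config switch along these lines should be enough. This is a minimal sketch against the ray 1.9.x agents API, assuming the stock PPOTrainer and the same environment as the reproduction script:

from ray.rllib.agents.ppo import DEFAULT_CONFIG, PPOTrainer

config = DEFAULT_CONFIG.copy()
config["explore"] = False  # deterministic actions, no custom action_sampler_fn needed

trainer = PPOTrainer(config=config, env="LunarLanderContinuous-v2")
trainer.train()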

To summarize RLlib’s behavior:

  • action_sampler_fn defined: RLlib will NOT create an action-dist class for you; RLlib will NOT create an action_dist_inputs placeholder for you; you are responsible for coming up with actions from this custom function.
  • action_distribution_fn defined: Return an action-dist input tensor, an action-dist class, and state-outs (or []) from this custom function; RLlib will do the rest (sample from the given distribution class for action calculations). See the sketch after this list.
  • None of the above: RLlib will come up with a default action distribution class and a default way to compute inputs to this distribution to sample actions from.
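
For completeness, here is a minimal sketch of the action_distribution_fn route from the second bullet, reusing the imports and signature style of the reproduction script above. The exact keyword arguments build_tf_policy passes vary between Ray versions, so treat this as an illustration rather than the canonical signature:

def action_distribution_fn(
    policy, model, obs_batch, explore=True, state_batches=None,
    seq_lens=None, prev_action_batch=None, prev_reward_batch=None, **kwargs
):
    # Same forward pass as the default path: model -> distribution inputs.
    dist_inputs, state_out = model(
        {SampleBatch.OBS: obs_batch, "obs_flat": obs_batch},
        state_batches,
        seq_lens,
    )
    # Return the inputs, the dist class, and state-outs; RLlib then samples
    # (or picks the deterministic action when config["explore"] is False).
    dist_class, _ = ModelCatalog.get_action_dist(
        policy.action_space, policy.config["model"]
    )
    return dist_inputs, dist_class, state_out

# Passed as action_distribution_fn=action_distribution_fn (with
# action_sampler_fn removed) in the build_tf_policy(...) call, so that
# dist_class is no longer None when ppo_surrogate_loss runs.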

0 reactions
gjoliver commented, Apr 9, 2022

added some comments to your commit. let’s move the discussion there. thanks.
