Custom action_sampler_fn is not working for PPO.
Search before asking
- I searched the issues and found no similar issues.
Ray Component
RLlib
What happened + What you expected to happen
PPO does not work when using action_sampler_fn and make_model.
File "/lib/python3.8/site-packages/ray/rllib/evaluation/rollout_worker.py", line 587, in __init__
self._build_policy_map(
File "/lib/python3.8/site-packages/ray/rllib/evaluation/rollout_worker.py", line 1551, in _build_policy_map
self.policy_map.create_policy(name, orig_cls, obs_space, act_space,
File "/lib/python3.8/site-packages/ray/rllib/policy/policy_map.py", line 133, in create_policy
self[policy_id] = class_(
File "/lib/python3.8/site-packages/ray/rllib/policy/tf_policy_template.py", line 238, in __init__
DynamicTFPolicy.__init__(
File "/lib/python3.8/site-packages/ray/rllib/policy/dynamic_tf_policy.py", line 376, in __init__
self._initialize_loss_from_dummy_batch(
File "/lib/python3.8/site-packages/ray/rllib/policy/dynamic_tf_policy.py", line 649, in _initialize_loss_from_dummy_batch
losses = self._do_loss_init(train_batch)
File "/lib/python3.8/site-packages/ray/rllib/policy/dynamic_tf_policy.py", line 731, in _do_loss_init
losses = self._loss_fn(self, self.model, self.dist_class, train_batch)
File "/python3.8/site-packages/ray/rllib/agents/ppo/ppo_tf_policy.py", line 56, in ppo_surrogate_loss
curr_action_dist = dist_class(logits, model)
TypeError: 'NoneType' object is not callable
Versions / Dependencies
Python 3.8, ray==1.9.2
Reproduction script
from typing import Any, Optional, Tuple, Type
import numpy as np
import ray
from gym.spaces import Box, Space
from ray.rllib.agents.ppo import ppo, ppo_tf_policy
from ray.rllib.agents.trainer_template import build_trainer
from ray.rllib.models.catalog import ModelCatalog
from ray.rllib.models.modelv2 import ModelV2
from ray.rllib.policy.policy import Policy
from ray.rllib.policy.sample_batch import SampleBatch
from ray.rllib.policy.tf_policy_template import build_tf_policy
from ray.rllib.utils.framework import try_import_tf
from ray.rllib.utils.spaces.simplex import Simplex
from ray.rllib.utils.tf_utils import zero_logps_from_actions
from ray.rllib.utils.typing import TensorType, TrainerConfigDict
tf1, tf, tfv = try_import_tf()
def make_model(
    policy: Policy, obs_space: Space, action_space: Space, config: TrainerConfigDict
) -> ModelV2:
    if isinstance(action_space, Box):
        num_outputs = 2 * np.product(action_space.shape)
    model = ModelCatalog.get_model_v2(
        obs_space=obs_space,
        action_space=action_space,
        num_outputs=num_outputs,
        model_config=config["model"],
        framework="tf",
    )
    policy.dist_class, _ = ModelCatalog.get_action_dist(action_space, config["model"])
    return model
def action_sampler_fn(
    policy: Policy,
    model: ModelV2,
    obs_batch: TensorType,
    explore: bool = True,
    state_batches: Optional[TensorType] = None,
    seq_lens: Optional[TensorType] = None,
    prev_action_batch: Optional[TensorType] = None,
    prev_reward_batch: Optional[TensorType] = None,
    **kwargs: Any,
) -> Tuple[TensorType, TensorType]:
    distribution_inputs, policy._state_out = policy.model(
        {
            SampleBatch.OBS: obs_batch,
            "obs_flat": obs_batch,
            "is_training": policy._get_is_training_placeholder(),
            SampleBatch.PREV_ACTIONS: prev_action_batch,
            SampleBatch.PREV_REWARDS: prev_reward_batch,
        },
        state_batches,
        seq_lens,
    )
    action_dist_class, _ = ModelCatalog.get_action_dist(
        policy.action_space, policy.config["model"]
    )
    action_dist = action_dist_class(distribution_inputs, model)
    action = action_dist.deterministic_sample()
    logp = zero_logps_from_actions(action)
    return action, logp
PPOTFPolicy = build_tf_policy(
    name="PPOTFPolicy",
    loss_fn=ppo_tf_policy.ppo_surrogate_loss,
    make_model=make_model,
    action_sampler_fn=action_sampler_fn,
    get_default_config=lambda: ray.rllib.agents.ppo.ppo.DEFAULT_CONFIG,
    postprocess_fn=ppo_tf_policy.compute_gae_for_sample_batch,
    stats_fn=ppo_tf_policy.kl_and_loss_stats,
    compute_gradients_fn=ppo_tf_policy.compute_and_clip_gradients,
    extra_action_out_fn=ppo_tf_policy.vf_preds_fetches,
    before_init=ppo_tf_policy.setup_config,
    before_loss_init=ppo_tf_policy.setup_mixins,
    mixins=[
        ppo_tf_policy.LearningRateSchedule,
        ppo_tf_policy.EntropyCoeffSchedule,
        ppo_tf_policy.KLCoeffMixin,
        ppo_tf_policy.ValueNetworkMixin,
    ],
)

DEFAULT_CONFIG = ppo.DEFAULT_CONFIG

PPOTrainer = build_trainer(
    name="PPO",
    default_config=ppo.DEFAULT_CONFIG,
    validate_config=ppo.validate_config,
    default_policy=PPOTFPolicy,
    execution_plan=ppo.execution_plan,
)

config = DEFAULT_CONFIG.copy()
trainer = PPOTrainer(config, env="LunarLanderContinuous-v2")
for i in range(250):
    trainer.train()
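
As the maintainers point out in the comments below, if all the custom action_sampler_fn above is meant to do is pick deterministic actions, the stock PPO trainer can do that by turning off exploration. A minimal sketch of that alternative, using only standard Ray 1.9 APIs (PPOTrainer, DEFAULT_CONFIG, and the "explore" config key); the environment name is taken from the repro above:

import ray
from ray.rllib.agents.ppo import PPOTrainer, DEFAULT_CONFIG

ray.init()
config = DEFAULT_CONFIG.copy()
# With "explore": False, the built-in StochasticSampling exploration returns
# the distribution's deterministic_sample() instead of a stochastic sample.
config["explore"] = False
trainer = PPOTrainer(config, env="LunarLanderContinuous-v2")
for _ in range(250):
    trainer.train()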
Anything else
No response
Are you willing to submit a PR?
- Yes I am willing to submit a PR!
Top GitHub Comments
Hey @n30111, thanks for raising this. I think the answer here is that in case you do want to use an action_sampler_fn (in which you take charge of action computation entirely, without the help of the policy's built-in action-dist/sampling utilities), you have to make sure that your loss function handles this absence of an action-distribution class.

From looking at your action_sampler_fn, it seems that all you are trying to do is return a deterministic action (instead of a sampled one from the distribution). You can also achieve that by setting config.explore=False in PPO. However, if you are trying to do more complex things in your custom action_sampler_fn, you would need to also re-define your loss to handle the dist_class=None issue.

To summarize RLlib's behavior:
- action_sampler_fn defined: RLlib will NOT create an action-dist class for you; RLlib will NOT create an action_dist_inputs placeholder for you; you are responsible for coming up with actions from this custom function.
- action_distribution_fn defined: return an action-dist input tensor, an action-dist class, and state-outs (or []) from this custom function; RLlib will do the rest (sample from the given distribution class for action calculations).

Added some comments to your commit. Let's move the discussion there. Thanks.
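
For completeness, a minimal sketch of the action_distribution_fn route summarized above. The build_tf_policy keyword action_distribution_fn and the three-tuple return contract come from the maintainer's description; the function body and its exact signature are assumptions for illustration only (RLlib passes slightly different keyword arguments to this callback across 1.x versions, hence the **kwargs catch-all), not a verified drop-in fix.

from ray.rllib.models.catalog import ModelCatalog


def action_distribution_fn(policy, model, input_dict, *, explore=True, **kwargs):
    # Assumed sketch: forward pass through the model to get the
    # action-distribution inputs.
    dist_inputs, state_out = model(input_dict)
    # Look up the default action-dist class for this action space and hand the
    # inputs, the class, and the state-outs back to RLlib, which then does the
    # sampling itself (per the maintainer's summary above).
    dist_class, _ = ModelCatalog.get_action_dist(
        policy.action_space, policy.config["model"]
    )
    return dist_inputs, dist_class, state_out

Passing this via action_distribution_fn= in the build_tf_policy call above (instead of action_sampler_fn=), optionally together with "explore": False for deterministic actions, is the direction the summary above points to, since the loss then no longer has to deal with dist_class=None.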