
[rllib] Unable to add new policies in multi-agent setting.

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): 18.04.1 LTS (Bionic Beaver) WSL
  • Ray installed from (source or binary): Binary
  • Ray version: 0.7.4
  • Python version: 3.7.4
  • Exact command to reproduce: See below.

Describe the problem

I’m trying to create a multi-agent environment which supports the creation of new agents and removal of current agents during training. Each agent needs its own policy, and so I am attempting to modify the policies dictionary used to create the Trainer object. However, it appears that in setting up the training process, rllib makes deep copies of the trainer configuration variables somewhere. Long story short, I would like to be able to add new agents and initialize policies for them during training. It’s possible rllib simply doesn’t support this at the moment. I’d appreciate any and all suggestions.

The code below is an MWE adapted from a file in rllib/examples. I simply passed in an empty dictionary for policies and attempted to modify it after creating the ppo_trainer variable. As expected, Ray throws a KeyError: somewhere it makes its own copy of the dictionary passed as input instead of using the dict that I hold a reference to in the script below.

Source code / logs

"""Example of using two different training methods at once in multi-agent.

Here we create a number of CartPole agents, some of which are trained with
DQN, and some of which are trained with PPO. We periodically sync weights
between the two trainers (note that no such syncing is needed when using just
a single training method).

For a simpler example, see also: multiagent_cartpole.py
"""

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import argparse
import gym

import ray
from ray.rllib.agents.dqn.dqn import DQNTrainer
from ray.rllib.agents.dqn.dqn_policy import DQNTFPolicy
from ray.rllib.agents.ppo.ppo import PPOTrainer
from ray.rllib.agents.ppo.ppo_policy import PPOTFPolicy
from ray.rllib.tests.test_multi_agent_env import MultiCartpole
from ray.tune.logger import pretty_print
from ray.tune.registry import register_env

parser = argparse.ArgumentParser()
parser.add_argument("--num-iters", type=int, default=20)

if __name__ == "__main__":
    args = parser.parse_args()
    ray.init()

    # Simple environment with 4 independent cartpole entities
    register_env("multi_cartpole", lambda _: MultiCartpole(4))
    single_env = gym.make("CartPole-v0")
    obs_space = single_env.observation_space
    act_space = single_env.action_space

    # For this MWE the policies dict starts out empty; the PPO and DQN policy
    # entries are only added further down, after both trainers have been created.
    policies = {}

    def policy_mapping_fn(agent_id):
        if agent_id % 2 == 0:
            return "ppo_policy"
        else:
            return "dqn_policy"

    ppo_trainer = PPOTrainer(
        env="multi_cartpole",
        config={
            "multiagent": {
                "policies": policies,
                "policy_mapping_fn": policy_mapping_fn,
                "policies_to_train": ["ppo_policy"],
            },
            # disable filters, otherwise we would need to synchronize those
            # as well to the DQN agent
            "observation_filter": "NoFilter",
        })

    dqn_trainer = DQNTrainer(
        env="multi_cartpole",
        config={
            "multiagent": {
                "policies": policies,
                "policy_mapping_fn": policy_mapping_fn,
                "policies_to_train": ["dqn_policy"],
            },
            "gamma": 0.95,
            "n_step": 3,
        })


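    # NOTE: this update happens only after both trainers have been constructed.
    # Each trainer has already taken its own copy of the (then-empty) policies
    # dict, so the entries added here never reach the trainers and a KeyError
    # is raised as soon as one of these policy ids is looked up.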
    policies.update({
        "ppo_policy": (PPOTFPolicy, obs_space, act_space, {}),
        "dqn_policy": (DQNTFPolicy, obs_space, act_space, {}),
    })

    # disable DQN exploration when used by the PPO trainer
    ppo_trainer.workers.foreach_worker(
        lambda ev: ev.for_policy(
            lambda pi: pi.set_epsilon(0.0), policy_id="dqn_policy"))

    # You should see both the printed X and Y approach 200 as this trains:
    # info:
    #   policy_reward_mean:
    #     dqn_policy: X
    #     ppo_policy: Y
    for i in range(args.num_iters):
        print("== Iteration", i, "==")

        # improve the DQN policy
        print("-- DQN --")
        print(pretty_print(dqn_trainer.train()))

        # improve the PPO policy
        print("-- PPO --")
        print(pretty_print(ppo_trainer.train()))

        # swap weights to synchronize
        dqn_trainer.set_weights(ppo_trainer.get_weights(["ppo_policy"]))
        ppo_trainer.set_weights(dqn_trainer.get_weights(["dqn_policy"]))

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Comments: 28 (12 by maintainers)

Top GitHub Comments

1 reaction
ericl commented, Sep 22, 2019

> Gotcha. And just for a sanity check here, you need a distinct policy for each agent if you don’t want any weight sharing going on?

That’s right.

> Assuming you do need a distinct policy for each agent, suppose the max number of distinct agents is unbounded. Any recommendations? Are there any libraries out there which support this? Very roughly how difficult would it be to make the necessary modifications in a fork of rllib?

This is an interesting feature request. I’m not sure about other libraries, but maybe the policy mapping function can, instead of returning the id of an existing policy, also return a new (policy_id, policy config) pair that gets added to the policy map. I can see it being a little tricky to implement, since different workers could be creating different sets of policies, and you need to somehow reconcile these different sets of policies during training, and avoid the number of policies increasing indefinitely.
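
To make that proposal concrete, here is a purely hypothetical sketch of what such a mapping function could look like. Nothing in it is an existing RLlib API; EXISTING_POLICY_IDS and dynamic_policy_mapping_fn are illustrative names, and PPOTFPolicy, obs_space and act_space are the ones from the script above.

# Hypothetical only: today's policy_mapping_fn cannot return a policy spec.
EXISTING_POLICY_IDS = {"policy_0", "policy_1"}

def dynamic_policy_mapping_fn(agent_id):
    policy_id = "policy_{}".format(agent_id)
    if policy_id in EXISTING_POLICY_IDS:
        # Current behavior: map the agent to a policy that already exists.
        return policy_id
    # Proposed extension: also hand back a spec so the framework could create
    # and register a brand-new policy for this previously unseen agent.
    return policy_id, (PPOTFPolicy, obs_space, act_space, {})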

A possible workaround is to instead create “generic” policies. For example, if you have slightly different observation spaces, then the generic policy could have the union of all the spaces, and the env could add zero padding for unused components of the observation. That trades off some efficiency but would simplify the training process.
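
To make the padding idea concrete, here is a minimal sketch of a "generic" observation space, assuming every agent's native observation is a flat Box of varying length. MAX_OBS_DIM and pad_obs are illustrative names, not RLlib API.

import gym
import numpy as np

MAX_OBS_DIM = 16  # assumed upper bound over all agents' native observation sizes

# One generic observation space shared by all agents: a fixed-size Box that
# every smaller native observation gets zero-padded into.
GENERIC_OBS_SPACE = gym.spaces.Box(
    low=-np.inf, high=np.inf, shape=(MAX_OBS_DIM,), dtype=np.float32)

def pad_obs(native_obs):
    """Zero-pad an agent's native observation up to MAX_OBS_DIM."""
    padded = np.zeros(MAX_OBS_DIM, dtype=np.float32)
    padded[:len(native_obs)] = native_obs
    return padded

Inside the multi-agent env's reset() and step(), each agent's observation would be passed through pad_obs() before being returned, so a single policy (or a small fixed pool of policies) built on GENERIC_OBS_SPACE can serve agents with different native observation sizes.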

1 reaction
ericl commented, Sep 22, 2019

That’s right—I would recommend creating the max number of policies needed up front in the dict instead.
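
As a sketch of that recommendation, building on the MWE above: MAX_AGENTS is an assumed upper bound on the number of simultaneously live agents, the per-slot policy names are illustrative, and PPOTrainer, PPOTFPolicy, obs_space and act_space are the ones from the script.

MAX_AGENTS = 10  # assumed upper bound on the number of live agents

# Create every policy slot up front, before any trainer is constructed.
policies = {
    "policy_{}".format(i): (PPOTFPolicy, obs_space, act_space, {})
    for i in range(MAX_AGENTS)
}

def policy_mapping_fn(agent_id):
    # Route each agent (including ones spawned later) to a fixed slot.
    # Assumes integer agent ids, as in MultiCartpole.
    return "policy_{}".format(agent_id % MAX_AGENTS)

ppo_trainer = PPOTrainer(
    env="multi_cartpole",
    config={
        "multiagent": {
            "policies": policies,
            "policy_mapping_fn": policy_mapping_fn,
            "policies_to_train": list(policies.keys()),
        },
    })

Agents that are removed simply stop being mapped to their slots, and newly created agents can reuse those slots, so the set of policies never grows past MAX_AGENTS.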


