How to Implement Self Play with PPO? [rllib]
Python: 3.6.9, TensorFlow: tensorflow-gpu 2.0.0, Ray: 0.8.0.dev6, OS: Ubuntu 18.04.2
I’m trying to implement a self-play training strategy with PPO similar to OpenAI Five (Dota 2) and DeepMind’s FTW (Capture the Flag). My understanding is that these methods train a policy in a competitive manner: the agent plays the game against itself (same policy) as well as against a mixture of prior policies. In RLlib terms, each iteration would have the trainer sample the adversary’s policy from a distribution of policies. For example:
agent_0: policy_0 = 100%
agent_1: policy_0 = 85%, policy_1 = 5%, policy_2 = 5%, policy_3 = 5%
policy_0 is the main policy being trained; the other policies are older snapshots of it, updated with the weights of the newer policy network every 5 iterations or so. This training strategy could also be used for tasks/games that are not inherently competitive, by reframing them competitively to get a stronger learning signal: change the game into a multi-agent environment and augment the reward scheme with an extra reward for the winner and a penalty for the loser. I’ve followed this line of logic with my custom environment and implemented the training script below, which uses PPO for the policy optimization. However, I get an error that appears to be related to how TensorFlow is defining the graphs of each policy network. I’d appreciate any help in understanding how my script could be fixed to implement this type of training correctly.
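For reference, a minimal sketch of the multi-agent reward augmentation described above, written against RLlib’s MultiAgentEnv API (the toy dynamics, agent IDs, and the ±1 win/loss bonus are made up for illustration and are not the actual custom env from this issue):

from gym.spaces import Box, Discrete
from ray.rllib.env.multi_agent_env import MultiAgentEnv

class SelfPlayToyEnv(MultiAgentEnv):
    """Toy two-agent env: each agent collects its own task reward, and at the
    end of the episode the higher-scoring agent gets +1 and the other -1."""

    def __init__(self, env_config=None):
        self.agents = ["agent_01", "agent_02"]
        self.observation_space = Box(-1.0, 1.0, shape=(4,))
        self.action_space = Discrete(2)
        self.max_steps = 20

    def reset(self):
        self.t = 0
        self.score = {aid: 0.0 for aid in self.agents}
        return {aid: self.observation_space.sample() for aid in self.agents}

    def step(self, action_dict):
        self.t += 1
        obs, rew, done, info = {}, {}, {}, {}
        for aid, action in action_dict.items():
            obs[aid] = self.observation_space.sample()
            rew[aid] = float(action)  # stand-in for the real task reward
            self.score[aid] += rew[aid]
            info[aid] = {}
        episode_over = self.t >= self.max_steps
        if episode_over:
            # Competitive augmentation: bonus for the winner, penalty for the loser.
            winner = max(self.score, key=self.score.get)
            for aid in self.agents:
                rew[aid] += 1.0 if aid == winner else -1.0
        for aid in self.agents:
            done[aid] = episode_over
        done["__all__"] = episode_over
        return obs, rew, done, info

With agent IDs like these, a policy_mapping_fn such as the one in the script below can pit the main policy against sampled snapshot policies.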
Error
Traceback (most recent call last):
File "run_PPO_multi_selfplay.py", line 233, in <module>
"policy_02": ppo_trainer.get_weights(["policy_01"])["policy_01"],
File "/home/johnson/miniconda3/envs/rlenv/lib/python3.6/site-packages/ray/rllib/agents/trainer.py", line 705, in set_weights
File "/home/johnson/miniconda3/envs/rlenv/lib/python3.6/site-packages/ray/rllib/evaluation/rollout_worker.py", line 533, in set_weights
self.policy_map[pid].set_weights(w)
File "/home/johnson/miniconda3/envs/rlenv/lib/python3.6/site-packages/ray/rllib/policy/tf_policy.py", line 269, in set_weights
return self._variables.set_weights(weights)
File "/home/johnson/miniconda3/envs/rlenv/lib/python3.6/site-packages/ray/experimental/tf_utils.py", line 189, in set_weights
assert assign_list, ("No variables in the input matched those in the "
AssertionError: No variables in the input matched those in the network. Possible cause: Two networks were defined in the same TensorFlow graph. To fix this, place each network definition in its own tf.Graph.
Training Script
FYI: I disabled the policy weight updates at the bottom in order to troubleshoot, but I’d like to get that working as well.
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import contextlib
import gym
import os
import datetime
import sys
import argparse
import numpy as np
import ray
from ray import tune
from ray.tune import run_experiments, register_env
from ray.rllib.models import ModelCatalog
from ray.rllib.agents.ppo.ppo import PPOTrainer
from ray.rllib.agents.ppo.ppo_policy import PPOTFPolicy
from ray.tune.logger import pretty_print
#####################################################
#Custom Model
from gym.spaces import Box, Discrete, Dict
from ray.rllib.models.modelv2 import ModelV2
from ray.rllib.models.tf.tf_modelv2 import TFModelV2
from ray.rllib.models.tf.fcnet_v2 import FullyConnectedNetwork
from ray.rllib.models.tf.misc import normc_initializer
from ray.rllib.utils.annotations import override, DeveloperAPI
from ray.rllib.utils import try_import_tf
tf = try_import_tf()
class MaskedActions(TFModelV2):
    """Custom RLlib model that emits -inf logits for invalid actions.

    This is used to handle the variable-length action space.
    """

    def __init__(self, obs_space, action_space, num_outputs, model_config,
                 name, **kw):
        super(MaskedActions, self).__init__(obs_space, action_space,
                                            num_outputs, model_config, name,
                                            **kw)
        self.fc_model = FullyConnectedNetwork(
            Box(-1, 1, shape=(9, )),
            action_space,
            num_outputs,
            model_config,
            name + "_fc")
        self.register_variables(self.fc_model.variables())

    @override(ModelV2)
    def forward(self, input_dict, state, seq_lens):
        # The Dict observation is assumed to carry the mask under "action_mask".
        action_mask = input_dict["obs"]["action_mask"]
        # Forward pass through the fully connected network.
        action_logits, _ = self.fc_model({
            "obs": input_dict["obs"]["obs"]
        })
        # Mask out invalid actions (use tf.float32.min for stability).
        inf_mask = tf.maximum(tf.log(action_mask), tf.float32.min)
        return action_logits + inf_mask, state

    def value_function(self):
        return self.fc_model.value_function()
#####################################################################
def policy_mapping_fn(agent_id):
    if agent_id.startswith("agent_01"):
        return "policy_01"  # Choose 01 policy for agent_01
    else:
        return np.random.choice(
            ["policy_01", "policy_02", "policy_03", "policy_04"], 1,
            p=[.8, .2 / 3, .2 / 3, .2 / 3])[0]
parser = argparse.ArgumentParser()
parser.add_argument("--num-iters", type=int, default=300)
parser.add_argument("--num-workers", type=int, default=15)
parser.add_argument("--num-envs-per-worker", type=int, default=20)
parser.add_argument("--num-gpus", type=int, default=4)
args = parser.parse_args()
ray.init()

# NOTE: env_config and the on_episode_* callbacks referenced below are assumed
# to be defined elsewhere in the full script; they are not shown here.
register_env("custom_env",
             lambda custom_args: gym.make('gym_custom_env:env-v0',
                                          configDict=env_config))
ModelCatalog.register_custom_model("mask_model", MaskedActions)

# Make the gym env once so we can read off the obs/action spaces.
single_env = gym.make('gym_custom_env:env-v0', configDict=env_config)
obs_space = single_env.observation_space
act_space = single_env.action_space
ppo_trainer = PPOTrainer(
    env="custom_env",
    config={
        "num_workers": args.num_workers,
        "num_envs_per_worker": args.num_envs_per_worker,
        "num_gpus": args.num_gpus,
        "ignore_worker_failures": True,
        "train_batch_size": 100000,
        "sgd_minibatch_size": 10000,
        "lr": 3e-4,
        "lambda": .95,
        "gamma": .998,
        "entropy_coeff": 0.01,
        "kl_coeff": 1.0,
        "clip_param": 0.2,
        "num_sgd_iter": 10,
        "observation_filter": "NoFilter",  # a normalization filter would break the action mask
        # "vf_share_layers": True,
        # VF loss is error^2, so it can be really out of scale compared to the
        # policy loss. Ref: https://github.com/ray-project/ray/issues/5278
        "vf_loss_coeff": 1e-4,
        "vf_clip_param": 100.0,
        "model": {
            "custom_model": "mask_model",
            "fcnet_hiddens": [512],
        },
        "multiagent": {
            "policies": {
                "policy_01": (None, obs_space, act_space, {}),
                "policy_02": (None, obs_space, act_space, {}),
                "policy_03": (None, obs_space, act_space, {}),
                "policy_04": (None, obs_space, act_space, {})
            },
            "policy_mapping_fn": tune.function(policy_mapping_fn),
            # "policies_to_train": ["policy_01"]
        },
        "callbacks": {
            "on_episode_start": tune.function(on_episode_start),
            "on_episode_step": tune.function(on_episode_step),
            "on_episode_end": tune.function(on_episode_end)
        },
    })
for i in range(args.num_iters):
    print(pretty_print(ppo_trainer.train()))
    '''
    if i % 5 == 0:
        ppo_trainer.set_weights({
            "policy_04": ppo_trainer.get_weights(["policy_03"])["policy_03"],
            "policy_03": ppo_trainer.get_weights(["policy_02"])["policy_02"],
            "policy_02": ppo_trainer.get_weights(["policy_01"])["policy_01"],
        })
    '''
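For what it’s worth, the AssertionError above looks consistent with each policy’s TF variables being created under a per-policy name scope (e.g. "policy_01/..."), so weights exported from policy_01 never match policy_02’s variable names inside set_weights(). A minimal sketch of one possible workaround, assuming the variable names differ only by that policy-ID prefix (copy_policy_weights is an illustrative helper, not part of RLlib’s API):

def copy_policy_weights(trainer, src_id, dest_id):
    """Copy the src policy's weights into the dest policy by renaming the
    policy-ID prefix on each variable name (assumes the names differ only
    by that prefix; print src_weights.keys() to confirm)."""
    src_weights = trainer.get_weights([src_id])[src_id]
    renamed = {name.replace(src_id, dest_id, 1): value
               for name, value in src_weights.items()}
    trainer.set_weights({dest_id: renamed})

# Cascading the menagerie every 5 iterations would then look like:
# if i % 5 == 0:
#     copy_policy_weights(ppo_trainer, "policy_03", "policy_04")
#     copy_policy_weights(ppo_trainer, "policy_02", "policy_03")
#     copy_policy_weights(ppo_trainer, "policy_01", "policy_02")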
Top GitHub Comments
@ericl and @josjo80 I think I’m tracking what you’re saying, and it seems like the preferable way to handle this. @josjo80, I believe you are correct regarding the policies_to_train requirement.
To be clear, the approach entails:
Any additions or things I missed? Thanks!
@rhefron Yes, your method is correct! The only thing I might change is step 2 - the update metric. In this paper the authors call this update metric the gating function or the curator of the menagerie of policies. In the paper they state the following:
The gating function G used in δ-uniform-self-play is fully inclusive and deterministic. After every episode, it always inserts the training policy into the menagerie. G(πᵒ, π) = πᵒ ∪ {π}
So, it would seem that they are constantly populating the non-trainable policies with the latest versions. I don’t think that’s necessary to still get good results, and I only add policies every 5 or 10 training iterations. But I do like your idea of only adding a policy to the menagerie once it has achieved some X% win-rate. Definitely something to play around with. The only other thing I’ll note is that OpenAI commented in two separate posts about sampling from the menagerie. In their OpenAI Five blog post they stated,
OpenAI Five learns from self-play (starting from random weights), which provides a natural curriculum for exploring the environment. To avoid “strategy collapse”, the agent trains 80% of its games against itself and the other 20% against its past selves.
And in their Competitive Self-Play post they stated,
Our agents were overfitting by co-learning policies that were precisely tailored to counter specific opponents, but would fail when facing new ones with different characteristics. We dealt with this by pitting each agent against several different opponents rather than just one. These possible opponents come from an ensemble of policies that were trained in parallel as well as policies from earlier in the training process. Given this diversity of opponents, agents needed to learn general strategies and not just ones targeted to a specific opponent.
I’m not totally sure what their policy sampling function looked like, so I only estimated what they did by sampling 80% on the current policy and 20% divided equally among the others. The paper I mentioned at the top goes into more theory on a better sampling distribution, the δ-Limit Uniform policy sampling distribution.
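Putting the ideas from this thread together, a rough sketch of a win-rate-gated menagerie update in the training loop might look like the following (the metric name, the threshold, and the copy_policy_weights helper from the sketch above are all illustrative assumptions, not code from this issue; the win rate would have to be recorded by the custom on_episode_end callback as an RLlib custom metric):

WIN_RATE_THRESHOLD = 0.6  # the "X% win-rate" gate; value chosen arbitrarily

for i in range(args.num_iters):
    result = ppo_trainer.train()
    print(pretty_print(result))
    # Hypothetical metric: on_episode_end would record
    # episode.custom_metrics["policy_01_win_rate"], which RLlib aggregates
    # into "policy_01_win_rate_mean" in the training result.
    win_rate = result.get("custom_metrics", {}).get("policy_01_win_rate_mean", 0.0)
    if win_rate > WIN_RATE_THRESHOLD:
        # Shift the menagerie down one slot and insert the current policy.
        copy_policy_weights(ppo_trainer, "policy_03", "policy_04")
        copy_policy_weights(ppo_trainer, "policy_02", "policy_03")
        copy_policy_weights(ppo_trainer, "policy_01", "policy_02")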