
How to Implement Self Play with PPO? [rllib]


  • Python: 3.6.9
  • TensorFlow: tensorflow-gpu 2.0.0
  • Ray: ray 0.8.0.dev6
  • OS: Ubuntu 18.04.2

I’m trying to implement a self-play training strategy with PPO, similar to OpenAI Five (Dota 2) and DeepMind’s FTW (Capture the Flag). My understanding is that these methods train a policy in a competitive manner: the agent plays the game against itself (the same policy) as well as against a mixture of prior policies. In RLlib terms, each iteration the trainer would sample the adversary’s policy from a distribution over policies. For example:

  • agent_0: policy_0 = 100%
  • agent_1: policy_0 = 85%, policy_1 = 5%, policy_2 = 5%, policy_3 = 5%

Policy_0 is the main policy being trained; the other policies are older versions of it, perhaps refreshed with the weights of the newer policy network every 5 iterations. This training strategy could also be used for tasks/games that are not inherently competitive but could still benefit from the stronger policy-gradient signal that competition provides. Doing so would require turning the game into a multi-agent environment and augmenting the reward scheme with an extra reward for the winner and a penalty for the loser.

I’ve followed this line of reasoning with my custom environment and implemented the training script below, which uses PPO to perform the policy optimization. However, I get an error that appears to be related to how TensorFlow defines the graph of each policy network. I’d appreciate any help in understanding how my script could be fixed to implement this type of training correctly.
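To illustrate the reward-shaping idea, here is a minimal, hypothetical sketch of a wrapper that adds a win bonus and a loss penalty at the end of each episode. The class name, the _game_winner helper, and the bonus values are all placeholders rather than part of the original environment:

from ray.rllib.env import MultiAgentEnv

class CompetitiveRewardWrapper(MultiAgentEnv):
    """Hypothetical wrapper: adds a win/lose bonus when the episode ends."""

    def __init__(self, base_env, win_bonus=1.0, lose_penalty=-1.0):
        self.base_env = base_env
        self.win_bonus = win_bonus
        self.lose_penalty = lose_penalty

    def reset(self):
        return self.base_env.reset()

    def step(self, action_dict):
        obs, rewards, dones, infos = self.base_env.step(action_dict)
        if dones.get("__all__"):
            winner = self._game_winner(infos)  # e.g. "agent_01" or "agent_02"
            for agent_id in rewards:
                rewards[agent_id] += (self.win_bonus if agent_id == winner
                                      else self.lose_penalty)
        return obs, rewards, dones, infos

    def _game_winner(self, infos):
        # Placeholder: derive the winner from the wrapped game's info dicts.
        raise NotImplementedError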

Error

Traceback (most recent call last):                                                                                    
  File "run_PPO_multi_selfplay.py", line 233, in <module>                                   
    "policy_02": ppo_trainer.get_weights(["policy_01"])["policy_01"],                                                 
  File "/home/johnson/miniconda3/envs/rlenv/lib/python3.6/site-packages/ray/rllib/agents/trainer.py", line 705, in set_weights                                                                                                              
  File "/home/johnson/miniconda3/envs/rlenv/lib/python3.6/site-packages/ray/rllib/evaluation/rollout_worker.py", line 533, in set_weights                                                                                                   
    self.policy_map[pid].set_weights(w)                                                                              
  File "/home/johnson/miniconda3/envs/rlenv/lib/python3.6/site-packages/ray/rllib/policy/tf_policy.py", line 269, in set_weights                                                                                                            
    return self._variables.set_weights(weights)                                                                       
  File "/home/johnson/miniconda3/envs/rlenv/lib/python3.6/site-packages/ray/experimental/tf_utils.py", line 189, in set_weights                                                                                                             
    assert assign_list, ("No variables in the input matched those in the "
AssertionError: No variables in the input matched those in the network. Possible cause: Two networks were defined in the same TensorFlow graph. To fix this, place each network definition in its own tf.Graph. 

Training Script

FYI: I disabled the policy weight updates at the bottom in order to troubleshoot, but I’d like to get that working as well.

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import contextlib
import gym
import os
import datetime
import sys

import argparse
import numpy as np

import ray
from ray import tune
from ray.tune import run_experiments, register_env
from ray.rllib.models import ModelCatalog

from ray.rllib.agents.ppo.ppo import PPOTrainer
from ray.rllib.agents.ppo.ppo_policy import PPOTFPolicy
from ray.tune.logger import pretty_print

#####################################################
#Custom Model
from gym.spaces import Box, Discrete, Dict

from ray.rllib.models.modelv2 import ModelV2
from ray.rllib.models.tf.tf_modelv2 import TFModelV2
from ray.rllib.models.tf.fcnet_v2 import FullyConnectedNetwork
from ray.rllib.models.tf.misc import normc_initializer
from ray.rllib.utils.annotations import override, DeveloperAPI
from ray.rllib.utils import try_import_tf

tf = try_import_tf()

class MaskedActions(TFModelV2):
    """Custom RLlib model that emits -inf logits for invalid actions.

    This is used to handle the variable-length action space.
    """
    def __init__(self, obs_space, action_space, num_outputs, model_config,
                 name, **kw):
        super(MaskedActions, self).__init__(obs_space, action_space, num_outputs, model_config, name, **kw)
        
        self.fc_model = FullyConnectedNetwork(
            Box(-1, 1, shape=(9, )), 
            action_space, 
            num_outputs,
            model_config, name + "_fc")
        self.register_variables(self.fc_model.variables())
        
    @override(ModelV2)
    def forward(self, input_dict, state, seq_lens):
        # Assumed observation key, following the standard RLlib action-mask
        # pattern: the Dict obs holds the raw observation and a 0/1 mask of
        # valid actions.
        action_mask = input_dict["obs"]["action_mask"]

        # Forward pass through the fully connected network
        action_logits, _ = self.fc_model({
            "obs": input_dict["obs"]["obs"]
        })

        # Mask out invalid actions (use tf.float32.min for stability)
        inf_mask = tf.maximum(tf.log(action_mask), tf.float32.min)
        return action_logits + inf_mask, state

    def value_function(self):
        return self.fc_model.value_function()
#####################################################################

def policy_mapping_fn(agent_id):
    if agent_id.startswith("agent_01"):
        return "policy_01" # Choose 01 policy for agent_01
    else:
        return np.random.choice(["policy_01", "policy_02", "policy_03", "policy_04"],1,
                                p=[.8, .2/3, .2/3, .2/3])[0]

parser = argparse.ArgumentParser()
parser.add_argument("--num-iters", type=int, default=300)
parser.add_argument("--num-workers", type=int, default=15)
parser.add_argument("--num-envs-per-worker", type=int, default=20)
parser.add_argument("--num-gpus", type=int, default=4)
args = parser.parse_args()

ray.init()

# Placeholder (not shown in the original script): configuration dict passed to
# the custom environment.
env_config = {}

register_env("custom_env", lambda custom_args: gym.make('gym_custom_env:env-v0', configDict=env_config))
ModelCatalog.register_custom_model("mask_model", MaskedActions)

# Make the gym env once so we can read off its observation/action spaces
single_env = gym.make('gym_custom_env:env-v0', configDict=env_config)
obs_space = single_env.observation_space
act_space = single_env.action_space

ppo_trainer = PPOTrainer(
    env="custom_env",
    config={
        "num_workers": args.num_workers,
        "num_envs_per_worker": args.num_envs_per_worker,
        "num_gpus": args.num_gpus,
        "ignore_worker_failures": True,
        "train_batch_size": 100000,
        "sgd_minibatch_size": 10000,
        "lr": 3e-4,
        "lambda": .95,
        "gamma": .998,
        "entropy_coeff": 0.01,
        "kl_coeff": 1.0,
        "clip_param": 0.2,
        "num_sgd_iter": 10,
        "observation_filter": "NoFilter",  # breaks the action mask
        #"vf_share_layers": True,
        "vf_loss_coeff": 1e-4,    #VF loss is error^2, so it can be really out of scale compared to the policy loss. 
                                      #Ref: https://github.com/ray-project/ray/issues/5278
        "vf_clip_param": 100.0,
        "model": {
            "custom_model": "mask_model",
            "fcnet_hiddens": [512],
        },
        "multiagent": {
            "policies": {
                "policy_01": (None, obs_space, act_space, {}),
                "policy_02": (None, obs_space, act_space, {}),
                "policy_03": (None, obs_space, act_space, {}),
                "policy_04": (None, obs_space, act_space, {})
            },
            "policy_mapping_fn": tune.function(policy_mapping_fn),
            #"policies_to_train": ["policy_01"]
        },
        "callbacks": {
        "on_episode_start": tune.function(on_episode_start),
        "on_episode_step": tune.function(on_episode_step),
        "on_episode_end": tune.function(on_episode_end)
        },
    })
    
for i in range(args.num_iters):
    print(pretty_print(ppo_trainer.train()))
    '''
    if i % 5 == 0:
        ppo_trainer.set_weights({"policy_04": ppo_trainer.get_weights(["policy_03"])["policy_03"],
                                 "policy_03": ppo_trainer.get_weights(["policy_02"])["policy_02"],
                                 "policy_02": ppo_trainer.get_weights(["policy_01"])["policy_01"],
                                })
    '''
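A note on the commented-out weight sync and the AssertionError above: each policy's TF variables are created under a scope named after the policy id (policy_01/..., policy_02/..., and so on), so weights fetched from one policy cannot be assigned to another unless the variable names are remapped first. Assuming get_weights() returns a dict keyed by variable name with the policy id as the scope prefix (which the traceback suggests), a workaround could look like the following sketch; copy_weights is a made-up helper name, not an RLlib API:

def copy_weights(trainer, src_policy_id, dst_policy_id):
    # Assumption: weights come back as {variable_name: value}, with variable
    # names prefixed by the policy id (e.g. "policy_01/...").
    src_weights = trainer.get_weights([src_policy_id])[src_policy_id]
    renamed = {name.replace(src_policy_id, dst_policy_id, 1): value
               for name, value in src_weights.items()}
    trainer.set_weights({dst_policy_id: renamed})

# For example, every 5 iterations, shift the "prior selves" down the chain
# (oldest first, so nothing is overwritten before it is copied):
#   copy_weights(ppo_trainer, "policy_03", "policy_04")
#   copy_weights(ppo_trainer, "policy_02", "policy_03")
#   copy_weights(ppo_trainer, "policy_01", "policy_02")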

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Reactions: 7
  • Comments: 32 (13 by maintainers)

Top GitHub Comments

5 reactions
rhefron commented, Mar 22, 2020

@ericl and @josjo80 I think I’m tracking what you’re saying, and it seems like the preferable way to handle this. @josjo80, I believe you are correct regarding the policies_to_train requirement.

To be clear, the approach entails:

  1. Define a trainable policy and several other non-trainable policies up front. The non-trainable policies will be the “prior selves” and we will update them as we train. Also define the sampling distribution for the non-trainable policies in the policy mapping function like @josjo80 did above.
  2. Train until a certain metric is met (e.g. the trainable policy wins more than 60% of the time).
  3. Update a list of “prior selves” weights that can be sampled from to update each of the non-trainable policies.
  4. Update the weights of the non-trainable policies by sampling from the list of “prior selves” weights.
  5. Back to step 2. Continue process until agent is satisfactorily trained.

Any additions or things I missed? Thanks!
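To make these steps concrete, here is a compressed, untested sketch of the loop described above. It assumes policies_to_train is restricted to policy_01 (step 1), that win_rate() is a hypothetical helper that extracts the trainable policy's win rate from the training results (RLlib does not provide one out of the box), and that weights can be renamed across policy scopes as in the workaround sketched after the training script:

import random

menagerie = []  # snapshots of "prior selves" weights
frozen_policies = ["policy_02", "policy_03", "policy_04"]

for i in range(args.num_iters):
    result = ppo_trainer.train()

    # Step 2: gate on a performance metric of the trainable policy.
    if win_rate(result, "policy_01") > 0.6:
        # Step 3: snapshot the current trainable weights into the menagerie.
        menagerie.append(ppo_trainer.get_weights(["policy_01"])["policy_01"])

        # Step 4: refresh each frozen opponent from a sampled prior self.
        for pid in frozen_policies:
            snapshot = random.choice(menagerie)
            renamed = {name.replace("policy_01", pid, 1): value
                       for name, value in snapshot.items()}
            ppo_trainer.set_weights({pid: renamed})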

3 reactions
josjo80 commented, Mar 23, 2020

@rhefron Yes, your method is correct! The only thing I might change is step 2, the update metric. In this paper the authors call this update metric the gating function, or the curator of the menagerie of policies. In the paper they state the following: "The gating function G used in δ-uniform-self-play is fully inclusive and deterministic. After every episode, it always inserts the training policy into the menagerie: G(π_o, π) = π_o ∪ {π}." So it would seem that they constantly populate the non-trainable policies with the latest versions. I don't think that's necessary to still get good results, and I only add policies every 5 or 10 training iterations. But I do like your idea of only adding a policy to the menagerie once it has achieved some X% win rate. Definitely something to play around with.

The only other thing I'll note is that OpenAI commented in two separate posts about sampling from the menagerie. In their OpenAI Five (Dota 2) blog post they stated: "OpenAI Five learns from self-play (starting from random weights), which provides a natural curriculum for exploring the environment. To avoid 'strategy collapse', the agent trains 80% of its games against itself and the other 20% against its past selves."

And in their Competitive Self-Play post they stated: "Our agents were overfitting by co-learning policies that were precisely tailored to counter specific opponents, but would fail when facing new ones with different characteristics. We dealt with this by pitting each agent against several different opponents rather than just one. These possible opponents come from an ensemble of policies that were trained in parallel as well as policies from earlier in the training process. Given this diversity of opponents, agents needed to learn general strategies and not just ones targeted to a specific opponent."

I’m not totally sure what their policy sampling function looked like, so I only estimated what they did by sampling the current policy 80% of the time and dividing the remaining 20% equally among the others. The paper I mentioned at the top goes into more theory on a better sampling distribution, the δ-Limit Uniform policy sampling distribution.
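For what it's worth, that 80/20 estimate can be written as a policy mapping function over an arbitrary number of past policies. The sketch below follows the policy names from the script above; the split itself is only the estimate described in this comment, not OpenAI's published scheme:

import numpy as np

past_policies = ["policy_02", "policy_03", "policy_04"]

def policy_mapping_fn(agent_id):
    # The learning agent always plays the current policy.
    if agent_id.startswith("agent_01"):
        return "policy_01"
    # The opponent plays the current policy 80% of the time and a uniformly
    # sampled past self the remaining 20%.
    if np.random.random() < 0.8:
        return "policy_01"
    return np.random.choice(past_policies)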


