Using Saved Model as Enemy Policy in Custom Environment (while training in a SubprocVecEnv)
I am currently training in an environment that has multiple agents. In this case there are multiple snakes all on the same 11x11 grid moving around and eating food. There is one “player” snake and three “enemy” snakes. Every 1 million training steps I want to save the player model and update the enemies in such a way that they use that model to make their movements.
I can (sort of) do this in a SubprocVecEnv by importing tensorflow each time I call the update-policy method of the Snake game class (which inherits from the gym environment class):
```python
def set_enemy_policy(self, policy_path):
    import tensorflow as tf
    tf.reset_default_graph()
    self.ENEMY_POLICY = A2C.load(policy_path)
```
I consider this a hackish method because it imports tensorflow into each child process (of the SubprocVecEnv) every time the enemy policy is updated.
I use this hackish approach because I cannot simply pass `model=A2C.load(policy_path)` into some sort of callback, as these models can’t be pickled.
Is there a standard solution for this sort of problem?
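One common workaround for the pickling problem is to broadcast only the checkpoint *path* (a plain string) to every worker and let each one load the model locally. Below is a minimal, self-contained sketch of that pattern; `MiniVecEnv`, `SnakeEnv`, and `FakeModel` are stand-ins for `SubprocVecEnv`, the Snake gym env, and `A2C`, and the checkpoint path is illustrative. With the real library, the equivalent call would be `vec_env.env_method("set_enemy_policy", policy_path)`.

```python
class FakeModel:
    """Stand-in for A2C.load(path): records which checkpoint it came from."""
    def __init__(self, path):
        self.path = path

class SnakeEnv:
    """Stand-in for the Snake gym environment."""
    def __init__(self):
        self.ENEMY_POLICY = None

    def set_enemy_policy(self, policy_path):
        # Only the (picklable) path string crosses the process boundary;
        # the heavyweight model is reconstructed inside the worker.
        self.ENEMY_POLICY = FakeModel(policy_path)

class MiniVecEnv:
    """Stand-in for SubprocVecEnv.env_method, which forwards a method
    call (with picklable args) to every worker environment."""
    def __init__(self, env_fns):
        self.envs = [fn() for fn in env_fns]

    def env_method(self, name, *args, **kwargs):
        return [getattr(env, name)(*args, **kwargs) for env in self.envs]

vec_env = MiniVecEnv([SnakeEnv for _ in range(4)])
vec_env.env_method("set_enemy_policy", "checkpoints/player_1M.zip")
print(all(e.ENEMY_POLICY.path == "checkpoints/player_1M.zip" for e in vec_env.envs))
```

This keeps the per-process tensorflow import, but moves the update trigger to a single call from the training loop instead of a callback that would have to carry the model itself.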
Issue Analytics
- Created: 3 years ago
- Comments: 16
Top GitHub Comments
The way we have it set up is that there are multiple players per environment. The observation and action spaces are n-tuples, where n is the number of players. `CurryVecEnv` extracts the appropriate observation tuple and feeds it into the policy, then collects actions from all the players and reconstructs a full action tuple. So each environment is still independent; we just use multiple environments to speed up training.
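The currying idea described above can be sketched as a wrapper that holds a frozen policy for the other players, exposes only one player's slice of the observation tuple, and reassembles the full action tuple on `step()`. All names here (`CurriedEnv`, `FixedPolicy`, `MultiPlayerEnv`) are illustrative stand-ins, not the actual `CurryVecEnv` implementation.

```python
class FixedPolicy:
    """Frozen enemy policy; always picks action 0 in this toy example."""
    def predict(self, obs):
        return 0

class MultiPlayerEnv:
    """Toy 2-player env: observation and action are 2-tuples."""
    def reset(self):
        return (10, 20)                       # (player obs, enemy obs)
    def step(self, actions):
        player_action, enemy_action = actions
        return (10, 20), player_action, False, {}

class CurriedEnv:
    """Wraps a multi-player env so it looks single-player."""
    def __init__(self, env, enemy_policy, player_index=0):
        self.env, self.enemy_policy, self.i = env, enemy_policy, player_index
        self._last_obs = None

    def reset(self):
        self._last_obs = self.env.reset()
        return self._last_obs[self.i]          # expose only our player's obs

    def step(self, player_action):
        # Fill every slot from the frozen policy, then overwrite our own.
        actions = [self.enemy_policy.predict(o) for o in self._last_obs]
        actions[self.i] = player_action
        obs, reward, done, info = self.env.step(tuple(actions))
        self._last_obs = obs
        return obs[self.i], reward, done, info

env = CurriedEnv(MultiPlayerEnv(), FixedPolicy())
obs = env.reset()
print(obs, env.step(3))
```

The wrapped environment then plugs into any single-agent trainer, and swapping in a newly saved checkpoint only means replacing `enemy_policy`.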
@lukepolson
Yeah, my solution is not very optimized, as there are separate agents for each env, each running at their own pace. The `predict` is not slow compared to training, but it does slow down gathering samples when an agent is involved (I needed this for my own experiments, though). And indeed, it runs out of VRAM quickly if you use CNNs. I recommend going Adam’s way here and trying to create one agent that `predict`s for a bunch of environments at once (should have done this myself, too 😂).
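The batched-predict suggestion amounts to keeping a single enemy model in the main process and calling predict once on a stacked batch of observations, one row per environment, instead of one model per worker. Here is a minimal shape-bookkeeping sketch with a NumPy stand-in (`BatchedPolicy` and the fake action rule are assumptions, not library code); stable-baselines models likewise accept batched observations in `predict()`.

```python
import numpy as np

class BatchedPolicy:
    """Stand-in for a model whose predict() handles (n_envs, obs_dim) input."""
    def predict(self, obs_batch):
        # One forward pass for every environment at once; the arithmetic
        # below just fabricates a discrete action per row for illustration.
        return obs_batch.sum(axis=1).astype(int) % 4

n_envs, obs_dim = 8, 11 * 11          # eight envs, flattened 11x11 grid
obs = np.zeros((n_envs, obs_dim))     # stacked observations, one row per env
actions = BatchedPolicy().predict(obs)
print(actions.shape)                  # one action per environment: (8,)
```

Besides using far less VRAM than one model per process, this amortizes the forward pass across all environments, which is exactly where the per-env agents were losing time during sample gathering.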