ppo2: what's the point of creating 2 networks
In ppo2.py, the algorithm creates 2 networks:
class Model(object):
    def __init__(self, *, policy, ob_space, ac_space, nbatch_act, nbatch_train,
                 nsteps, ent_coef, vf_coef, max_grad_norm):
        sess = tf.get_default_session()
        global_step = tf.train.get_or_create_global_step()
        act_model = policy(sess, ob_space, ac_space, nbatch_act, 1, reuse=False)
        train_model = policy(sess, ob_space, ac_space, nbatch_train, nsteps, reuse=True)
act_model: for interacting with the environments. train_model: for training the model.
But these 2 networks share all their variables, which makes them effectively the same network.
update:
I found a difference between act_model and train_model: act_model only takes a single step, but train_model may take multiple steps. Is this the key point?
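For context, here is a minimal, hypothetical sketch (TF1-style, not the actual baselines code; make_policy is a made-up helper) showing how two policy instantiations with reuse=True end up pointing at the same weights while having differently-sized input placeholders, which is essentially the relationship between act_model and train_model:

import numpy as np
import tensorflow as tf  # TensorFlow 1.x API, as used by baselines

def make_policy(ob_dim, n_act, nbatch, reuse):
    # Each call gets its own placeholder with a fixed batch size ...
    obs_ph = tf.placeholder(tf.float32, [nbatch, ob_dim], name="obs_%d" % nbatch)
    # ... but with reuse=True the layers below resolve to the SAME variables.
    with tf.variable_scope("pi", reuse=reuse):
        h = tf.layers.dense(obs_ph, 64, activation=tf.nn.tanh, name="fc1")
        logits = tf.layers.dense(h, n_act, name="out")
    return obs_ph, logits

with tf.Session() as sess:
    # "act" network: one observation per environment instance (nbatch_act)
    act_obs, act_logits = make_policy(ob_dim=4, n_act=2, nbatch=8, reuse=False)
    # "train" network: a whole minibatch of collected observations (nbatch_train)
    train_obs, train_logits = make_policy(ob_dim=4, n_act=2, nbatch=256, reuse=True)

    # Only one set of weights exists: pi/fc1/* and pi/out/*
    print([v.name for v in tf.trainable_variables()])

    sess.run(tf.global_variables_initializer())
    # Both heads read (and any optimizer would update) the same parameters.
    sess.run(act_logits, {act_obs: np.zeros((8, 4), np.float32)})
    sess.run(train_logits, {train_obs: np.zeros((256, 4), np.float32)})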
Issue Analytics
- Created 5 years ago
- Comments: 7 (1 by maintainers)
Top Results From Across the Web
PPO2 — Stable Baselines 2.10.3a0 documentation
PPO2 is the implementation of OpenAI made for GPU. For multiprocessing, it uses vectorized environments compared to PPO1, which uses MPI.
PPO2 exploration of the action space · Issue #473 - GitHub
Long story short, the goal is to find the optimal position of an object in a 2D space. I set up a custom...
PPO Hyperparameters and Ranges - Medium
Proximal Policy Optimization (PPO) is one of the leading Reinforcement Learning (RL) algorithms. PPO is the algorithm powering OpenAI Five, ...
What is a high performing network architecture to use in a ...
I am playing around with creating custom architectures in stable-baselines. Specifically I am training an agent using a PPO2 model.
Proximal Policy Optimization Tutorial (Part 1/2: Actor-Critic ...)
Welcome to the first part of a math and code tutorial series. I'll be showing how to implement a Reinforcement Learning algorithm known...
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
I think this is simply an implementation decision. The network takes different input sizes when acting in the environment (i.e. a single observation per environment instance) versus when training (i.e. a minibatch of observations). It was probably easier to just use the same network twice with different input sizes rather than share a single dynamically-sized placeholder. But there aren't really two networks, because reuse=True in the second declaration.

@brett-daley correct, the act_model and train_model share all the variables, and differ only in the size of the input placeholders. We cannot use dynamic-sized placeholders because those won't work when unrolling a batch of data into a sequence for recurrent NN-based policies (see LstmPolicy, for instance).
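To illustrate the point about recurrent policies, here is a rough sketch (assumptions only, loosely modeled on baselines' batch_to_seq helper rather than copied from it) of why the batch size has to be known statically: the flat rollout of shape [nenv * nsteps, ...] must be split into exactly nsteps time slices before the LSTM can be stepped through time, and that split count is baked into the graph, so act_model (nsteps=1) and train_model (the full rollout) need separate fixed-size placeholders:

import tensorflow as tf  # TensorFlow 1.x API, as used by baselines

def batch_to_seq(flat, nenv, nsteps):
    # [nenv * nsteps, feat] -> list of `nsteps` tensors, each of shape [nenv, feat]
    feat = flat.get_shape()[-1].value
    seq = tf.reshape(flat, [nenv, nsteps, feat])
    return [tf.squeeze(t, axis=1) for t in tf.split(seq, nsteps, axis=1)]

nenv, nsteps, feat = 8, 128, 64
# act model: one step per environment, so its placeholder is [nenv * 1, feat]
act_in = tf.placeholder(tf.float32, [nenv * 1, feat])
# train model: a full rollout, so its placeholder is [nenv * nsteps, feat]
train_in = tf.placeholder(tf.float32, [nenv * nsteps, feat])

act_seq = batch_to_seq(act_in, nenv, 1)           # 1 time slice for acting
train_seq = batch_to_seq(train_in, nenv, nsteps)  # 128 time slices for training
print(len(act_seq), len(train_seq))

With a dynamically-sized placeholder, nsteps would not be a Python integer at graph-construction time, so the per-time-step unrolling above could not be built.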