
ppo2: what's the point of creating 2 networks?

See original GitHub issue

In ppo2.py, the algorithm creates two networks:

class Model(object):
    def __init__(self, *, policy, ob_space, ac_space, nbatch_act, nbatch_train,
                nsteps, ent_coef, vf_coef, max_grad_norm):
        sess = tf.get_default_session()
        global_step = tf.train.get_or_create_global_step()

        act_model = policy(sess, ob_space, ac_space, nbatch_act, 1, reuse=False)        # steps the environments
        train_model = policy(sess, ob_space, ac_space, nbatch_train, nsteps, reuse=True) # gradient updates; shares variables via reuse

act_model: for interacting with the environments. train_model: for training the model.

But these two networks share all their variables, which effectively makes them the same network.

Update:

I found a difference between act_model and train_model: act_model only takes a single step at a time, while train_model processes multiple steps (nsteps). Is this the key point?
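
To make the shapes concrete, here is a small sketch (plain numpy, not baselines code; the variable names and the exact flattening order are illustrative): act_model receives one observation per environment at each environment step, while train_model receives the flattened rollout collected over nsteps steps.

import numpy as np

nenvs, nsteps, obs_dim = 4, 128, 8

# act_model: one observation per environment at each env step
obs_act = np.zeros((nenvs, obs_dim))                       # nbatch_act = nenvs

# the runner stacks nsteps of those, then flattens them for training
rollout = np.zeros((nsteps, nenvs, obs_dim))
obs_train = rollout.swapaxes(0, 1).reshape(nenvs * nsteps, obs_dim)

print(obs_act.shape)    # (4, 8)
print(obs_train.shape)  # (512, 8); ppo2 further slices this into minibatches for the updates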

Issue Analytics

  • State: closed
  • Created: 5 years ago
  • Comments: 7 (1 by maintainers)

Top GitHub Comments

4 reactions
brett-daley commented, Jul 24, 2018

I think this is simply an implementation decision. The network takes different input sizes when acting in the environment (i.e. a single observation per environment instance) versus when training (i.e. a minibatch of observations). It was probably easier to just use the same network twice with different input sizes rather than share a single dynamically-sized placeholder. But there aren’t really two networks because reuse=True in the second declaration.
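
As a minimal sketch of that point (not baselines code; toy_policy, the scope name "model", and the variable names are made up, and it assumes the TF 1.x graph API that baselines uses): constructing the same variable scope a second time with reuse=True creates no new variables, so both "models" read and update one set of weights.

import tensorflow.compat.v1 as tf  # mirrors the TF 1.x graph API used by baselines
tf.disable_v2_behavior()

def toy_policy(batch_size, obs_dim, reuse):
    # A stand-in for the policy constructor: one linear layer producing logits.
    with tf.variable_scope("model", reuse=reuse):
        x = tf.placeholder(tf.float32, [batch_size, obs_dim])
        w = tf.get_variable("pi_w", [obs_dim, 2])
        b = tf.get_variable("pi_b", [2], initializer=tf.zeros_initializer())
        pi = tf.matmul(x, w) + b
    return x, pi

act_x, act_pi = toy_policy(batch_size=4, obs_dim=8, reuse=False)        # like act_model
train_x, train_pi = toy_policy(batch_size=512, obs_dim=8, reuse=True)   # like train_model

# Only one copy of the weights exists in the graph:
print([v.name for v in tf.trainable_variables()])  # ['model/pi_w:0', 'model/pi_b:0']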

2 reactions
pzhokhov commented, Jul 26, 2018

@brett-daley correct, the act_model and train_model share all the variables and differ only in the size of the input placeholders. We cannot use dynamically-sized placeholders because those won’t work when unrolling a batch of data into a sequence for recurrent NN-based policies (see LstmPolicy, for instance).
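
A rough illustration of that constraint (a numpy sketch, not the actual batch_to_seq helper from baselines): a recurrent policy has to unroll the flat training batch back into per-environment sequences and step the LSTM once per time step while the graph is being built, so nenv and nsteps must be concrete numbers rather than a single dynamic batch dimension.

import numpy as np

nenv, nsteps, obs_dim = 4, 128, 8
flat_batch = np.zeros((nenv * nsteps, obs_dim))   # what train_model receives

# Unroll the flat batch into nsteps time slices of shape (nenv, obs_dim),
# one slice per LSTM step; this needs nenv and nsteps known at graph-build time.
time_slices = flat_batch.reshape(nenv, nsteps, obs_dim).swapaxes(0, 1)
print(time_slices.shape)  # (128, 4, 8): nsteps slices of (nenv, obs_dim)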

Read more comments on GitHub >

Top Results From Across the Web

  • PPO2 — Stable Baselines 2.10.3a0 documentation
    PPO2 is the implementation of OpenAI made for GPU. For multiprocessing, it uses vectorized environments compared to PPO1 which uses MPI.
  • PPO2 exploration of the action space · Issue #473 - GitHub
    Long story short, the goal is to find the optimal position of an object in a 2D space. I set up a custom...
  • PPO Hyperparameters and Ranges - Medium
    Proximal Policy Optimization (PPO) is one of the leading Reinforcement Learning (RL) algorithms. PPO is the algorithm powering OpenAI Five, ...
  • What is a high performing network architecture to use in a ...
    I am playing around with creating custom architectures in stable-baselines. Specifically I am training an agent using a PPO2 model.
  • Proximal Policy Optimization Tutorial (Part 1/2: Actor-Critic ...
    Welcome to the first part of a math and code tutorial series. I'll be showing how to implement a Reinforcement Learning algorithm known...
