
ppo2: what's the point of creating 2 networks?

See original GitHub issue

In ppo2.py, the algorithm creates two networks:

class Model(object):
    def __init__(self, *, policy, ob_space, ac_space, nbatch_act, nbatch_train,
                nsteps, ent_coef, vf_coef, max_grad_norm):
        sess = tf.get_default_session()
        global_step = tf.train.get_or_create_global_step()

        act_model = policy(sess, ob_space, ac_space, nbatch_act, 1, reuse=False)        # steps the environments
        train_model = policy(sess, ob_space, ac_space, nbatch_train, nsteps, reuse=True) # gradient updates; shares variables via reuse

act_model: for interacting with the environments. train_model: for training the model.

But these two networks share all their variables, which effectively makes them the same network.

Update:

I found a difference between act_model and train_model: act_model only takes a single step at a time, while train_model processes multiple steps (nsteps). Is this the key point?
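
To make the shapes concrete, here is a small sketch (plain numpy, not baselines code; the variable names and the exact flattening order are illustrative): act_model receives one observation per environment at each environment step, while train_model receives the flattened rollout collected over nsteps steps.

import numpy as np

nenvs, nsteps, obs_dim = 4, 128, 8

# act_model: one observation per environment at each env step
obs_act = np.zeros((nenvs, obs_dim))                       # nbatch_act = nenvs

# the runner stacks nsteps of those, then flattens them for training
rollout = np.zeros((nsteps, nenvs, obs_dim))
obs_train = rollout.swapaxes(0, 1).reshape(nenvs * nsteps, obs_dim)

print(obs_act.shape)    # (4, 8)
print(obs_train.shape)  # (512, 8); ppo2 further slices this into minibatches for the updates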

Issue Analytics

  • State: closed
  • Created: 5 years ago
  • Comments: 7 (1 by maintainers)

Top GitHub Comments

4 reactions
brett-daley commented, Jul 24, 2018

I think this is simply an implementation decision. The network takes different input sizes when acting in the environment (i.e. a single observation per environment instance) versus when training (i.e. a minibatch of observations). It was probably easier to just use the same network twice with different input sizes rather than share a single dynamically-sized placeholder. But there aren’t really two networks because reuse=True in the second declaration.
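
As a minimal sketch of that point (not baselines code; toy_policy, the scope name "model", and the variable names are made up, and it assumes the TF 1.x graph API that baselines uses): constructing the same variable scope a second time with reuse=True creates no new variables, so both "models" read and update one set of weights.

import tensorflow.compat.v1 as tf  # mirrors the TF 1.x graph API used by baselines
tf.disable_v2_behavior()

def toy_policy(batch_size, obs_dim, reuse):
    # A stand-in for the policy constructor: one linear layer producing logits.
    with tf.variable_scope("model", reuse=reuse):
        x = tf.placeholder(tf.float32, [batch_size, obs_dim])
        w = tf.get_variable("pi_w", [obs_dim, 2])
        b = tf.get_variable("pi_b", [2], initializer=tf.zeros_initializer())
        pi = tf.matmul(x, w) + b
    return x, pi

act_x, act_pi = toy_policy(batch_size=4, obs_dim=8, reuse=False)        # like act_model
train_x, train_pi = toy_policy(batch_size=512, obs_dim=8, reuse=True)   # like train_model

# Only one copy of the weights exists in the graph:
print([v.name for v in tf.trainable_variables()])  # ['model/pi_w:0', 'model/pi_b:0']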

2 reactions
pzhokhov commented, Jul 26, 2018

@brett-daley correct, the act_model and train_model share all the variables and differ only in the size of the input placeholders. We cannot use dynamically-sized placeholders because those won’t work when unrolling a batch of data into a sequence for recurrent NN-based policies (see LstmPolicy, for instance).
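
A rough illustration of that constraint (a numpy sketch, not the actual batch_to_seq helper from baselines): a recurrent policy has to unroll the flat training batch back into per-environment sequences and step the LSTM once per time step while the graph is being built, so nenv and nsteps must be concrete numbers rather than a single dynamic batch dimension.

import numpy as np

nenv, nsteps, obs_dim = 4, 128, 8
flat_batch = np.zeros((nenv * nsteps, obs_dim))   # what train_model receives

# Unroll the flat batch into nsteps time slices of shape (nenv, obs_dim),
# one slice per LSTM step; this needs nenv and nsteps known at graph-build time.
time_slices = flat_batch.reshape(nenv, nsteps, obs_dim).swapaxes(0, 1)
print(time_slices.shape)  # (128, 4, 8): nsteps slices of (nenv, obs_dim)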

Read more comments on GitHub >

Top Results From Across the Web

  • PPO2 — Stable Baselines 2.10.3a0 documentation
    PPO2 is the implementation of OpenAI made for GPU. For multiprocessing, it uses vectorized environments compared to PPO1 which uses MPI.
  • PPO2 exploration of the action space · Issue #473 - GitHub
    Long story short, the goal is to find the optimal position of an object in a 2D space. I set up a custom...
  • PPO Hyperparameters and Ranges - Medium
    Proximal Policy Optimization (PPO) is one of the leading Reinforcement Learning (RL) algorithms. PPO is the algorithm powering OpenAI Five, ...
  • What is a high performing network architecture to use in a ...
    I am playing around with creating custom architectures in stable-baselines. Specifically I am training an agent using a PPO2 model.
  • Proximal Policy Optimization Tutorial (Part 1/2: Actor-Critic ...
    Welcome to the first part of a math and code tutorial series. I'll be showing how to implement a Reinforcement Learning algorithm known...
