V3.0 implementation design
Version 3 is now online: https://github.com/DLR-RM/stable-baselines3
Hello,
Before starting the migration to tf2 for stable baselines v3, I would like to discuss some design points we should agree on.
Which tf paradigm should we use?
I would go for a pytorch-like “eager mode”, wrapping the methods with tf.function to improve performance (as is done here). Define-by-run is usually easier to read and debug (and I can compare it to my internal pytorch version), and wrapping things in a tf.function should preserve performance.
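For concreteness, here is a minimal sketch of that pattern (an illustration, not the actual SB3 code; TinyPolicy and train_step are made-up names): the update is written in eager, define-by-run style and then traced with tf.function.

```python
import numpy as np
import tensorflow as tf

class TinyPolicy(tf.keras.Model):
    """Made-up two-layer policy, just to have something to differentiate."""
    def __init__(self, n_actions):
        super().__init__()
        self.hidden = tf.keras.layers.Dense(64, activation="tanh")
        self.logits = tf.keras.layers.Dense(n_actions)

    def call(self, obs):
        return self.logits(self.hidden(obs))

policy = TinyPolicy(n_actions=2)
optimizer = tf.keras.optimizers.Adam(3e-4)

@tf.function  # traced into a graph on the first call; remove the decorator to debug eagerly
def train_step(obs, actions, advantages):
    with tf.GradientTape() as tape:
        neglogp = tf.nn.sparse_softmax_cross_entropy_with_logits(
            labels=actions, logits=policy(obs))
        loss = tf.reduce_mean(neglogp * advantages)
    grads = tape.gradient(loss, policy.trainable_variables)
    optimizer.apply_gradients(zip(grads, policy.trainable_variables))
    return loss

# Dummy call with batch size 8 and observation dim 4
loss = train_step(np.zeros((8, 4), dtype=np.float32),
                  np.zeros(8, dtype=np.int64),
                  np.ones(8, dtype=np.float32))
```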
What is the roadmap?
My idea would be:
- Refactor common folder (as done by @Miffyli in #540 )
- Implement one on-policy algorithm and one off-policy algorithm: I would go for PPO/TD3 and I can be in charge of that. This would allow us to discuss concrete implementation details.
- Implement the rest, in order:
- SAC
- A2C
- DQN
- DDPG
- HER
- TRPO
- Implement the recurrent versions?
I’m afraid that the remaining ones (ACKTR, GAIL and ACER) are not the easiest to implement. For GAIL, we can refer to https://github.com/HumanCompatibleAI/imitation by @AdamGleave et al.
Are there other breaking changes we should make? Changes in the interface?
Some answers to these questions are linked here: https://github.com/hill-a/stable-baselines/issues/366
There are different things that I would like to change/add.
First, I would add evaluation to the training loop. That is to say, we allow the user to pass an eval_env on which the agent will be evaluated every eval_freq steps for n_eval_episodes episodes. This is a true measure of the agent's performance, compared to the training reward.
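As a rough sketch of that idea (a hypothetical helper, not the final API; it assumes a gym-style env and an SB-style model.predict):

```python
import numpy as np

def evaluate(model, eval_env, n_eval_episodes=5):
    """Run n_eval_episodes deterministic episodes and return mean/std return."""
    episode_rewards = []
    for _ in range(n_eval_episodes):
        obs, done, total_reward = eval_env.reset(), False, 0.0
        while not done:
            action, _ = model.predict(obs, deterministic=True)
            obs, reward, done, _ = eval_env.step(action)
            total_reward += reward
        episode_rewards.append(total_reward)
    return np.mean(episode_rewards), np.std(episode_rewards)

# Inside the training loop (pseudo-structure):
# if num_timesteps % eval_freq == 0:
#     mean_reward, std_reward = evaluate(model, eval_env, n_eval_episodes)
```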
I would like to manipulate only VecEnv in the algorithms (and wrap the gym.Env automatically if necessary); this simplifies things (so we don't have to think about the type of the env). Currently, we are using an UnVecEnvWrapper, which makes things complicated for DQN for instance.
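A hedged sketch of what "only manipulate VecEnv" could mean in practice, reusing the existing DummyVecEnv from the current codebase (the wrap_env helper name is made up):

```python
import gym
from stable_baselines.common.vec_env import DummyVecEnv, VecEnv

def wrap_env(env):
    """Always hand a VecEnv to the algorithm, wrapping a plain gym.Env on the fly."""
    if not isinstance(env, VecEnv):
        env = DummyVecEnv([lambda: env])  # vectorised env with a single worker
    return env

venv = wrap_env(gym.make("CartPole-v1"))  # same code path for gym.Env and VecEnv
```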
Should we maintain MPI support? I would favor switching to VecEnv too; this removes a dependency and unifies the rest (and would maybe allow an easy way to multiprocess SAC/DDPG or TD3, cf #324). This would mean removing PPO1 too.
The next thing I would like to make default is the Monitor wrapper. This allows retrieving statistics about the training and would remove the need for a buggy version of total_episode_reward_logger for computing the reward (cf #143).
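For illustration, this is roughly how the existing Monitor wrapper already exposes episode statistics (a usage sketch of the current API, not new SB3 code):

```python
import gym
from stable_baselines.bench import Monitor

env = Monitor(gym.make("CartPole-v1"), filename=None)  # filename=None: keep stats in memory only

obs, done = env.reset(), False
while not done:
    obs, reward, done, info = env.step(env.action_space.sample())
    if "episode" in info:  # Monitor injects this entry at the end of every episode
        print("return:", info["episode"]["r"], "length:", info["episode"]["l"])
```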
As discussed in another issue, I would like to unify the learning rate schedules too (this would not be too difficult).
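One possible way to unify schedules (an assumption on my part, not a settled design) would be to accept either a constant or a callable and normalise both to a callable internally:

```python
def get_schedule_fn(value_or_schedule):
    """Return a callable schedule: progress_remaining in [1, 0] -> learning rate."""
    if callable(value_or_schedule):
        return value_or_schedule
    return lambda progress_remaining: float(value_or_schedule)  # constant schedule

linear_schedule = lambda progress_remaining: 3e-4 * progress_remaining
print(get_schedule_fn(3e-4)(0.5))             # 0.0003 (constant)
print(get_schedule_fn(linear_schedule)(0.5))  # 0.00015 (half of the initial value)
```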
I would also like to unify the parameter names (e.g. ent_coef vs ent_coeff).
Anyway, I plan to open a PR and we can then discuss the details there.
Regarding the transition
As we will be switching to the Keras interface (at least for most of the layers), this will break previously saved models. I propose to create scripts that convert old models to the new SB version rather than trying to be backward-compatible.
Pinging @hill-a @erniejunior @AdamGleave @Miffyli
PS: I hope I did not forget any important point
EDIT: the draft repo is here: https://github.com/Stable-Baselines-Team/stable-baselines-tf2 (PPO and TD3 included for now)
Top GitHub Comments
Paradigm: I agree on using eager mode. This should make things much easier. However, I am uncertain about tf.function. I do not have too much experience with TF2, but wouldn't this require structuring the code in a certain way so that we can use tf.functions easily (similar to the code structure now)? I do not know how much of a performance boost we can expect from tf.function, as the main bottlenecks already are the environments and storing/passing data around.
MPI: I favor dropping support for this. I do not see the benefit of it at this point, and it has been a source of headaches (e.g. Windows support, importing MPI-dependent algorithms).
Monitor: I do not know about “on by default”, but I agree on having some unified structure for tracking episode stats which can then be read in callbacks (see e.g. #563). I would still keep the Monitor wrapper, which would just print these results to a .csv file as before.
Roadmap: I would go with the simplest algorithms, e.g. PPO and A2C, and see how things go from there (or would TD3 be easy after PPO?). It should go without saying, but the very first thing to do would be to gather some benchmark results with the current stable-baselines (already in rl-zoo), then run experiments against these and call it a day once similar performance is reached.
One thing I would add is support for Tuple/Dict observation/action spaces, as discussed in many issues (e.g. #502). Judging by all the questions, this is probably one of the biggest limitations of using stable-baselines on new kinds of tasks. This would include some non-backend-related modifications as well (e.g. how observations/actions are handled, as they cannot be stacked into numpy arrays).
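To illustrate why Dict/Tuple spaces need special handling (plain gym, nothing SB-specific): a dict observation is a dict of arrays, so a batch has to stay keyed per component instead of being stacked into a single array.

```python
import numpy as np
from gym import spaces

observation_space = spaces.Dict({
    "image": spaces.Box(low=0, high=255, shape=(84, 84, 3), dtype=np.uint8),
    "vector": spaces.Box(low=-1.0, high=1.0, shape=(5,), dtype=np.float32),
})

obs = observation_space.sample()  # a dict of arrays, not a single array
# A batch has to stay keyed per component, e.g. in a replay buffer:
batch = {key: np.stack([obs[key], obs[key]]) for key in observation_space.spaces}
print({key: arr.shape for key, arr in batch.items()})
```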
I can work on the model saving/loading and conversion of older models, as well as finish the refactoring of common.
Hi guys, thank you for your contributions to this project. I have been working with parts of it on and off for a couple of months, so I thought I would share a few thoughts with you.
On choosing the TF style:
I believe that the portability of TF graphs is a powerful concept which in TF2.0 is enabled through tf.function (and would be compromised by bare eager execution), so I would like to reinforce your suggestion for this additional reason. As a matter of fact, graph portability is how I got interested in the SB project, as I was executing graphs in C++ with this project as an example.
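As a small illustration of that portability argument (a generic TF2 sketch, unrelated to the SB codebase): a tf.function with an input signature can be exported as a SavedModel and later loaded outside Python, e.g. from C++.

```python
import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(2)])

@tf.function(input_signature=[tf.TensorSpec(shape=[None, 4], dtype=tf.float32)])
def serve(obs):
    # Traced once with the given signature, so the exported graph is fixed.
    return model(obs)

# The resulting SavedModel can be loaded from C++ (or via tf.saved_model.load in Python).
tf.saved_model.save(model, "exported_policy", signatures={"serving_default": serve})
```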
On MPI:
I am not fully aware of the history of baselines or which parts of PG methods are universally suitable for parallelization, but I would think that MPI is applicable when you cannot fit into one physical node, e.g. you require 100 logical cores or more and can tolerate the cost of communication. I would suspect that most people don't do that? So, yet again, I would think that dropping an immediate hurdle for a prospective gain is a good choice.
On the feasibility of TF2 algorithms implementation:
I actually was playing with porting SAC and DDPG (here), and managed to benchmark the former against 2 very different environments successfully (didn't know the zoo has hyperparameters available, lol). SAC_TF2 seemed to behave just like your implementation. It's definitely not library-quality, but perhaps it can still be helpful as a first draft of the idea.
On generic parts of the algorithms:
That's a hard one when looking at the details. Simple things like MLP creation inside the policies could be shared, of course, but writing generic code without obscuring ideas behind many layers of indirection is problematic, to say the least. What I like most about this library is its relative readability, which helped me a lot as a learner.
I have worked with just 3 of your implementations, which may not be enough to make a proper judgment, but what caught my eye was PPO2's Runner separation, which felt quite applicable to the other 2 implementations I touched (SAC and DDPG), where it wasn't used. I believe that one of the ideas behind the changes in the Python TF frontend was to encourage splitting things up a bit more, and Runner seems to fit nicely into that.
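For reference, the Runner idea boils down to something like the following simplified sketch (made up for illustration, assuming a gym-style env; not the actual PPO2 code): data collection lives in its own small object, separate from the gradient update code.

```python
import numpy as np

class Runner:
    """Only collects experience with the current policy; no gradient code here."""
    def __init__(self, env, policy_fn, n_steps):
        self.env, self.policy_fn, self.n_steps = env, policy_fn, n_steps
        self.obs = env.reset()

    def run(self):
        observations, actions, rewards, dones = [], [], [], []
        for _ in range(self.n_steps):
            action = self.policy_fn(self.obs)
            observations.append(self.obs)
            actions.append(action)
            self.obs, reward, done, _ = self.env.step(action)
            rewards.append(reward)
            dones.append(done)
            if done:
                self.obs = self.env.reset()
        return (np.asarray(observations), np.asarray(actions),
                np.asarray(rewards), np.asarray(dones))
```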
On naming:
Great idea. There were examples that troubled me even a bit more than this, where parameters are only seemingly different and I had to perform some mental translation to see that they are not. This happens, for instance, in learning loops that present many flavors of similar things. E.g. I believe that train_frequency is generally the same as rollouts_number, but it took me a minute to realize this when going through the codebase, especially when one is used in a nested loop and the other in one of 2 separate loops.
Hope something makes sense out of those 😃