
Reset PPO `policy.log_std` when loading previously saved model


When performing curriculum learning, it would be nice to be able to reset the PPO `policy.log_std` between training cycles. The following code will produce an error:

from stable_baselines3 import PPO
from stable_baselines3.ppo import MlpPolicy

# Define RL model: randomized init (env1 is the first-stage environment)
policy_kwargs = dict(log_std_init=0)
model = PPO(MlpPolicy, env=env1, policy_kwargs=policy_kwargs)

# Learn and save
model.learn(total_timesteps=500000, tb_log_name='ppo')
model.save("ppo_model")

# Define RL model: preload network parameters (env2 is the next-stage environment)
policy_kwargs = dict(log_std_init=-0.5)
model = PPO.load(load_path=r"log_dir\ppo_1\ppo_model", env=env2, policy_kwargs=policy_kwargs)

# Learn and save again
model.learn(total_timesteps=500000, tb_log_name='ppo')
model.save("ppo_model")

Describe the bug

ValueError: The specified policy kwargs do not equal the stored policy kwargs. Stored kwargs: …

This error is thrown because `log_std_init` differs between the two training cycles.
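
For context, loading without re-specifying `policy_kwargs` sidesteps the check entirely, since the stored kwargs are restored automatically. A minimal sketch in the style of the snippet above (whether that fits the curriculum use case is what the discussion below is about):

model = PPO.load(load_path=r"log_dir\ppo_1\ppo_model", env=env2)  # no policy_kwargs: stored kwargs are reused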

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 7 (3 by maintainers)

Top GitHub Comments

2 reactions
araffin commented, Sep 10, 2020

Hello,

> Probably just me being dumdum again, but why exactly does the code enforce that the provided policy_kwargs should match the saved ones?

We do that because you really need to know what is happening when you change those arguments between saving and loading; the check prevents most users from running into unexpected behavior. Users who want to change them anyway can do so after loading, as @kyrwilliams mentioned, but it requires a good understanding of each RL algorithm. You can find an example with SAC (when using gSDE) here: https://github.com/DLR-RM/rl-baselines3-zoo/blob/e12a3019b57e11c876b6f875c5ff8c79a168c187/train.py#L569

Also, changing log_std in the policy kwargs won't work, as the value will be overwritten when loading the saved state dict.

> (1) I attempted the PyTorch save/load methods, manually resetting the log_std values with:

You may need to register that parameter too, and also check whether it is present in the optimizer (which I assume is not the case, given the result).
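
A minimal sketch of that idea, assuming a loaded SB3 PPO `model` (the -0.5 reset value mirrors the snippets below; the attribute and optimizer layout are assumptions about SB3's internals at the time):

import torch as th

# Build a fresh log_std; assigning an nn.Parameter to a module attribute
# re-registers it under the same name.
old_log_std = model.policy.log_std
new_log_std = th.nn.Parameter(th.full_like(old_log_std, -0.5))
model.policy.log_std = new_log_std

# The optimizer still references the old tensor, so swap it out in every
# param group; otherwise the new value is never updated during training.
for group in model.policy.optimizer.param_groups:
    group["params"] = [new_log_std if p is old_log_std else p
                       for p in group["params"]]

This loses any optimizer state attached to the old parameter, which is usually acceptable (arguably desirable) for a deliberate reset.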

2 reactions
kyrwilliams commented, Sep 8, 2020

Thanks @Miffyli! So, a couple of things:

(1) I attempted the PyTorch save/load methods, manually resetting the log_std values with:

import torch as th  # assumed import

model.policy.log_std = th.nn.Parameter(th.tensor([-0.5, -0.5, -0.5], device='cuda:0', requires_grad=True))
model.learn(total_timesteps=500000, tb_log_name='ppo')

But unfortunately this just locked `model.policy.log_std` at -0.5 throughout the entire training, presumably because the freshly created parameter was never registered with the optimizer.

(2) I found that the following crude method DID work, since it uses the PPO class's `.load` method, which apparently updates the model in a specific way:

import torch as th  # assumed import

model = PPO.load(load_path=r"log_dir\ppo_1\ppo_model", env=env2)  # load saved model
model.policy.log_std = th.nn.Parameter(th.tensor([-0.5, -0.5, -0.5], device='cuda:0', requires_grad=True))  # reset log_std
model.save("ppo_model_temp")  # save this adjusted model
model = PPO.load(load_path="ppo_model_temp", env=env2)  # load the adjusted model
model.learn(total_timesteps=500000, tb_log_name='ppo')  # learn

This approach successfully reset the log_std to -0.5 and allowed the optimizer to adjust it during training, presumably because `.load` rebuilds the policy and its optimizer from the saved file, so the new `log_std` ends up properly registered.
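
For anyone reproducing this, a quick way to confirm the reset took effect is to inspect `model.policy.log_std` before and after a short training run (a hypothetical check; the timestep count is arbitrary):

print(model.policy.log_std.detach().cpu())  # expect tensor([-0.5000, -0.5000, -0.5000])
model.learn(total_timesteps=10000, tb_log_name='ppo')
print(model.policy.log_std.detach().cpu())  # values should have drifted away from -0.5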
