Learn with number of episodes rather than total_timesteps
Hi,
I would like a way for the .learn method in PPO1 (and I guess other agents) to stop after a given number of episodes, e.g. .learn(nr_episodes), rather than after an explicitly defined number of steps. This could be useful in situations where different episodes have different lengths that cannot be determined exactly beforehand.
As a quick hack, I made some changes in pposgd_simple.py. I added a new default argument:

```python
.learn(..., total_episodes=None)
```
and then replaced

```python
if total_timesteps and timesteps_so_far >= total_timesteps:
    break
```

with

```python
if total_episodes and episodes_so_far >= total_episodes:
    break
```
Finally, I was planning to just call

```python
.learn(None, total_episodes=nr_episodes)
```

but then noticed this line:

```python
elif self.schedule == 'linear':
    cur_lrmult = max(1.0 - float(timesteps_so_far) / total_timesteps, 0)
```
So I'll probably make a rough estimate of the timesteps and set total_timesteps accordingly, or alternatively change the schedule line to:

```python
elif self.schedule == 'linear':
    cur_lrmult = max(1.0 - float(episodes_so_far) / total_episodes, 0)
```
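To avoid dividing by None when calling .learn(None, total_episodes=...), the schedule logic could also branch on whichever budget was given. A minimal self-contained sketch (the function name and structure are mine, not the actual pposgd_simple.py code):

```python
def lr_multiplier(schedule, episodes_so_far, total_episodes,
                  timesteps_so_far, total_timesteps):
    """Anneal on whichever budget (episodes or timesteps) was given,
    so .learn(None, total_episodes=...) does not divide by None."""
    if schedule == 'constant':
        return 1.0
    elif schedule == 'linear':
        if total_episodes is not None:
            return max(1.0 - float(episodes_so_far) / total_episodes, 0)
        return max(1.0 - float(timesteps_so_far) / total_timesteps, 0)
    raise NotImplementedError(schedule)

# e.g. halfway through a 100-episode budget:
assert lr_multiplier('linear', 50, 100, None, None) == 0.5
```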
But I was wondering, do any of these modifications have consequences I’m not considering?
Top GitHub Comments
Hi @Miffyli,
What I understand from the mentioned answer is quite the opposite of wasted computation: I think it will miss scanning some data points an equal number of times.
About the async part, OK, that makes sense.
Still, for this callback approach, I would have to pass a total_timesteps value that is high enough to reach the desired number of episodes. This callback approach seems like an out-of-the-way workaround.
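For concreteness, here is roughly what I mean (a minimal sketch, assuming the functional callback interface where returning False stops training, plus a hypothetical episode-counting wrapper):

```python
import gym
from stable_baselines import PPO1

class EpisodeCounter(gym.Wrapper):
    """Hypothetical wrapper that counts finished episodes."""

    def __init__(self, env):
        super().__init__(env)
        self.episode_count = 0

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        if done:
            self.episode_count += 1
        return obs, reward, done, info

env = EpisodeCounter(gym.make("CartPole-v1"))
model = PPO1("MlpPolicy", env)

MAX_EPISODES = 100  # the episode budget I actually care about

def stop_after_episodes(locals_, globals_):
    # A callback returning False stops training.
    return env.episode_count < MAX_EPISODES

# total_timesteps has to be "high enough", which is the awkward part.
model.learn(total_timesteps=int(1e9), callback=stop_after_episodes)
```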
As I see that you are also a contributor to V3, can I expect that passing a number of episodes instead of total_timesteps to model.learn() will be implemented at some point, instead of having to rely on a callback? Should I open an issue in that repo if that is a feature I would like to see?
Thank you
Hi @araffin,
Even though this is closed, and maybe there is something I am not getting, I would like to make my case for this issue.
The particular case where the number of timesteps per episode is known and fixed is quite common for stock trading envs. Also, for stock trading scenarios, it can be quite valuable to scan all data points thoroughly, an equal number of times.
I do not think this is that similar to issue #62, and I am also not sure about the impact of using callbacks to identify the end of episodes.
Also, we do not necessarily want to monitor anything. In particular, it is simply more convenient and less error-prone to use an episode count instead of a timestep count.
For now, I am counting the number of data points in my price time series and multiplying it by the number of episodes I want my model to experience during learning.
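In code, this bookkeeping is trivial (price_series and model are hypothetical names for my data and agent here):

```python
n_timesteps_per_episode = len(price_series)  # one episode sweeps the series once
n_episodes = 50                              # desired number of full passes
model.learn(total_timesteps=n_timesteps_per_episode * n_episodes)
```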
Alternatively, I am also considering the SubprocVecEnv approach, where the num_envs variable could correspond to my number of episodes, with total_timesteps set according to the number of time points in my training sample.
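Something roughly like this (a sketch with a hypothetical MyTradingEnv; I use PPO2 since, as far as I know, PPO1 does not support multiple parallel envs, and I am assuming total_timesteps counts steps summed over all envs):

```python
from stable_baselines import PPO2
from stable_baselines.common.vec_env import SubprocVecEnv

n_episodes = 8          # one parallel env per desired episode
episode_len = 10000     # hypothetical length of the price series

def make_env():
    return MyTradingEnv(price_series)  # hypothetical trading env

env = SubprocVecEnv([make_env for _ in range(n_episodes)])
model = PPO2("MlpPolicy", env)

# If total_timesteps is counted over all envs together, each env
# sees roughly episode_len steps, i.e. one full episode.
model.learn(total_timesteps=n_episodes * episode_len)
```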
All things considered, I think it would be quite useful to have an option to set a specific number of episodes when calling the learn() function. If there is something wrong with my reasoning, or if you have any suggestions, please feel welcome to point it out.
Thanks in advance for your time. =)